Unsupervised Domain Adaptation using Feature-Whitening and Consensus Loss

03/07/2019 ∙ by Subhankar Roy, et al. ∙ Mapillary ∙ Università di Trento

A classifier trained on a dataset seldom works on other datasets obtained under different conditions due to domain shift. This problem is commonly addressed by domain adaptation methods. In this work we introduce a novel deep learning framework which unifies different paradigms in unsupervised domain adaptation. Specifically, we propose domain alignment layers which implement feature whitening for the purpose of matching source and target feature distributions. Additionally, we leverage the unlabeled target data by proposing the Min-Entropy Consensus loss, which regularizes training while avoiding the adoption of many user-defined hyper-parameters. We report results on publicly available datasets, considering both digit classification and object recognition tasks. We show that, in most of our experiments, our approach improves upon previous methods, setting new state-of-the-art performances.

1 Introduction

Deep learning methods have been successfully applied to different visual recognition tasks, demonstrating an excellent generalization ability. However, analogously to other statistical machine learning techniques, deep neural networks also suffer from the problem of domain shift [49], which is observed when predictors trained on a dataset do not perform well when applied to novel domains.

Since collecting annotated training data from every possible domain is expensive and sometimes even impossible, several Domain Adaptation (DA) methods [34, 5] have been proposed over the years. DA approaches leverage labeled data in a source domain in order to learn an accurate prediction model for a target domain. In the special case of Unsupervised Domain Adaptation (UDA), no annotated target data are available at training time. Note that, even if target-sample labels are not available, unlabeled target data can be, and usually are, exploited at training time.

Figure 1: Overview of the proposed deep architecture embedding our DWT layers and trained with the proposed MEC loss. (a) Due to domain shift, the source and the target data have different marginal feature distributions. Our DWT estimates these distributions using dedicated sample batches and then “whitens” them, projecting them into a common, spherical distribution. (b) The proposed MEC loss univocally selects the pseudo-label that maximizes the agreement between two perturbed versions of the same target sample.

Most UDA methods attempt to reduce the domain shift by directly aligning the source and target marginal distributions. Notably, approaches based on the Correlation Alignment paradigm model domain data distributions in terms of their second-order statistics. Specifically, they match distributions by minimizing a loss function which corresponds to the difference between the source and the target covariance matrices obtained using the network’s last-layer activations [45, 46, 32]. Another recent and successful UDA paradigm exploits domain-specific alignment layers, derived from Batch Normalization (BN) [19], which are directly embedded within the deep network [3, 25, 31]. Other prominent research directions in UDA correspond to those methods which also exploit the target data posterior distribution. For instance, the entropy minimization paradigm adopted in [3, 37, 13] enforces the network’s prediction probability distribution on each target sample to be peaked with respect to some (unknown) class, thus penalizing high-entropy target predictions. On the other hand, the consistency-enforcing paradigm [38, 7, 48] is based on specific loss functions which penalize inconsistent predictions over perturbed copies of the same target samples.

In this paper we propose to unify the above paradigms by introducing two main novelties. First, we align the source and the target data distributions using covariance matrices similarly to [45, 46, 32]. However, instead of using a loss function computed on the last-layer activations, we use domain-specific alignment layers which compute domain-specific covariance matrices of intermediate features. These layers “whiten” the source and the target features and project them into a common spherical distribution (see Fig. 1 (a), blue box). We call this alignment strategy Domain-specific Whitening Transform (DWT). Notably, our approach generalizes previous BN-based DA methods [3, 25, 30] which do not consider inter-feature correlations and rely only on feature standardization.

The second novelty we introduce is a novel loss function, the Min-Entropy Consensus (MEC) loss, which merges both the entropy [3, 37, 13] and the consistency [7] loss functions. The motivation behind our proposal is to avoid tuning the many hyper-parameters which are typically required when several loss terms are considered and, specifically, the confidence-threshold hyper-parameters [7]. Indeed, due to the mismatch between the source and the target domain, and because of the unlabeled target-data assumption, hyper-parameters are hard to tune in UDA [32]. The proposed MEC loss simultaneously encourages coherent predictions between two perturbed versions of the same target sample and exploits these predictions as pseudo-labels for training (Fig. 1 (b), purple box).

We plug our proposed DWT and the MEC loss into different network architectures and we empirically show a significant boost in performance. In particular, we achieve state-of-the-art results in different UDA benchmarks: MNIST [23], USPS [8], SVHN [33], CIFAR-10, STL10 [4] and Office-Home [53]. Our code will be made publicly available soon.

2 Related Work

Unsupervised Domain Adaptation. Several previous works have addressed the problem of DA, considering both shallow models and deep architectures. In this section we focus only on deep learning methods for UDA, as these are the closest to our proposal.

UDA methods mostly differ in the strategy used to reduce the discrepancy between the source and the target feature distributions and can be grouped into different categories. The first category includes methods modeling the domain distributions in terms of their first- and second-order statistics. For instance, some works aim at reducing the domain shift by minimizing the Maximum Mean Discrepancy [28, 29, 53], describing distributions in terms of their first-order statistics. Other works also consider second-order statistics, using the correlation alignment paradigm (Sec. 1) [46, 32]. Instead of introducing additional loss functions, more recent works deal with the domain-shift problem by directly embedding into a deep network domain alignment layers which exploit BN [25, 3, 31].

A second category of methods includes approaches which learn domain-invariant deep representations. For instance, in [9] a gradient reversal layer is used to learn discriminative domain-agnostic representations. Similarly, in [51] a domain-confusion loss is introduced, encouraging the network to learn features robust to the domain shift. Haeusser et al. [14] present Associative Domain Adaptation, an approach which also learns domain-invariant embeddings.

A third category includes methods based on Generative Adversarial Networks (GANs) [35, 1, 47, 41, 39]. The main idea behind these approaches is to directly transform images from the target domain to the source domain. While GAN-based methods are especially successful in adaptation from synthetic to real images and in the case of non-complex datasets, they have limited capabilities on complex images.

Entropy minimization, first introduced in [12], is a common strategy in semi-supervised learning [54]. In a nutshell, it consists in exploiting the high-confidence predictions of unlabeled samples as pseudo-labels. Due to its effectiveness, several popular UDA methods [35, 3, 37, 29] have adopted the entropy loss for training deep networks.

Another popular paradigm in UDA, which we refer to as the consistency-enforcing paradigm, is realized by perturbing the target samples and then imposing some consistency between the predictions of two perturbed versions of the same target input. Consistency is imposed by defining appropriate loss functions, as shown in [37, 7, 38]. The consistency-loss paradigm is effective, but it becomes uninformative if the network produces uniform probability distributions for corresponding target samples. Thus, previous methods also integrate a Confidence Thresholding (CT) technique [7] in order to discard unreliable predictions. Unfortunately, CT introduces additional user-defined and dataset-specific hyper-parameters which are difficult to tune in a UDA scenario [32]. Differently, as demonstrated in our experiments, our MEC loss eliminates the need for CT and the corresponding hyper-parameters.

Feature Decorrelation. Recently, Huang et al. [17] and Siarohin et al. [43] proposed to replace BN with feature whitening in a discriminative and a generative setting, respectively. However, neither of these works considers a DA problem. We show in this paper that feature whitening can be used to align the source and the target marginal distributions using layer-specific covariance matrices, without the need of a dedicated loss function as in previous correlation alignment methods.

3 Method

In this section we present the proposed UDA approach. Specifically, after introducing some preliminaries, we describe our Domain-Specific Whitening Transform and, finally, the proposed Min-Entropy Consensus loss.

3.1 Preliminaries

Let $\mathcal{S} = \{(x_i^s, y_i^s)\}_{i=1}^{n_s}$ be the labeled source dataset, where $x_i^s$ is an image and $y_i^s$ its associated label, and let $\mathcal{T} = \{x_j^t\}_{j=1}^{n_t}$ be the unlabeled target dataset. The goal of UDA is to learn a predictor for the target domain by using samples from both $\mathcal{S}$ and $\mathcal{T}$. Learning a predictor for the target domain is not trivial because of the issues discussed in Sec. 1.

A common technique to reduce domain shift is to use BN-based layers inside a network, so as to project the source and target feature distributions onto a reference distribution through feature standardization. As mentioned in Sec. 1, in this work we propose to replace feature standardization with whitening, where the whitening operation is domain-specific. Before introducing the proposed whitening-based distribution alignment, we recap below BN. Let $B = \{x_1, \ldots, x_m\}$ be a mini-batch of input samples to a given network layer, where each element $x_i \in B$ is a $d$-dimensional feature vector, i.e. $x_i \in \mathbb{R}^d$. Given $B$, in BN each $x_i$ is transformed as follows:

$$BN(x_{i,k}) = \gamma_k \frac{x_{i,k} - \mu_{B,k}}{\sqrt{\sigma_{B,k}^2 + \epsilon}} + \beta_k \qquad (1)$$

where $k$ ($1 \le k \le d$) indicates the $k$-th dimension of the data, $\mu_{B,k}$ and $\sigma_{B,k}$ are, respectively, the mean and the standard deviation computed with respect to the $k$-th dimension of the samples in $B$, and $\epsilon$ is a constant used to prevent numerical instability. Finally, $\gamma_k$ and $\beta_k$ are scaling and shifting learnable parameters.
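
For concreteness, Eq. (1) can be written as a few lines of PyTorch operating on a mini-batch of shape (m, d). This is only an illustrative sketch of standard BN at training time; running statistics and the inference path are omitted.

```python
import torch

def batch_norm(x, gamma, beta, eps=1e-5):
    """Eq. (1): per-dimension standardization of a mini-batch x of shape (m, d)."""
    mu = x.mean(dim=0)                         # mu_B: per-dimension mean
    var = x.var(dim=0, unbiased=False)         # sigma_B^2: per-dimension variance
    x_hat = (x - mu) / torch.sqrt(var + eps)   # standardize each dimension independently
    return gamma * x_hat + beta                # learnable scale and shift

# toy usage: m = 64 samples, d = 32 features
x = torch.randn(64, 32)
y = batch_norm(x, gamma=torch.ones(32), beta=torch.zeros(32))
```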

In the next section we present our DWT, while in Sec. 3.3 we present the proposed MEC loss. It is worth noting that each of the proposed components can be plugged into a network independently of the other.

3.2 Domain-specific Whitening Transform

As stated above, BN is based on a per-dimension standardization of each sample $x_i$. Hence, once normalized, the batch samples may still have correlated feature values. Since our goal is to use feature normalization in order to alleviate the domain-shift problem (see below), we argue that plain standardization is not enough to align the source and the target marginal distributions. For this reason we propose to use Batch Whitening (BW) instead of BN, which is defined as:

$$BW(x_{i,k}; \Omega_B) = \gamma_k \hat{x}_{i,k} + \beta_k \qquad (2)$$
$$\hat{x}_i = W_B (x_i - \mu_B) \qquad (3)$$

In Eq. (3), the vector $\mu_B$ is the mean of the elements in $B$ (being $\mu_{B,k}$ its $k$-th component), while the matrix $W_B$ is such that $W_B^\top W_B = \Sigma_B^{-1}$, where $\Sigma_B$ is the covariance matrix computed using $B$. $\Omega_B = (\mu_B, \Sigma_B)$ are the batch-dependent first- and second-order statistics. Eq. (3) performs the whitening of $x_i$, and the resulting set of vectors $\hat{x}_1, \ldots, \hat{x}_m$ lies in a spherical distribution (i.e., with a covariance matrix equal to the identity matrix).
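
The whitening of Eq. (2)-(3) can be sketched as follows (PyTorch, mini-batch of shape (m, d)). Here $W_B$ is obtained through the Cholesky decomposition, as described in the Supplementary Material; feature grouping and the shrinkage blending are omitted for brevity, so this is an illustrative sketch rather than our actual implementation.

```python
import torch

def batch_whitening(x, gamma, beta, eps=1e-5):
    """Eq. (2)-(3): whiten a mini-batch x of shape (m, d), then scale and shift."""
    m, d = x.shape
    mu = x.mean(dim=0)                              # mu_B
    xc = x - mu                                     # center the batch
    sigma = xc.t() @ xc / m + eps * torch.eye(d)    # Sigma_B (with a small ridge for stability)
    L = torch.linalg.cholesky(sigma)                # Sigma_B = L L^T, L lower triangular
    W = torch.linalg.inv(L)                         # W_B = L^{-1}, hence W_B^T W_B = Sigma_B^{-1}
    x_hat = xc @ W.t()                              # Eq. (3): whitened features
    return gamma * x_hat + beta                     # Eq. (2): scale and shift

y = batch_whitening(torch.randn(64, 8), torch.ones(8), torch.zeros(8))
```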

Our network takes as input two different batches of data, randomly extracted from $\mathcal{S}$ and $\mathcal{T}$, respectively. Specifically, given any arbitrary layer $l$ in the network, let $B^s$ and $B^t$ denote the batches of intermediate output activations from layer $l$ for the source and the target domain, respectively. Using Eq. (2)-(3) we can now define our Domain-specific Whitening Transform (DWT). Let $x^s$ and $x^t$ denote the inputs to the DWT layer from the source and the target domain, respectively. Our DWT is defined as follows (we drop the sample index and the dimension index for the sake of clarity):

$$DWT(x^s; \Omega_{B^s}) = \gamma \, W_{B^s}(x^s - \mu_{B^s}) + \beta \qquad (4)$$
$$DWT(x^t; \Omega_{B^t}) = \gamma \, W_{B^t}(x^t - \mu_{B^t}) + \beta \qquad (5)$$

We estimate separate statistics ($\Omega_{B^s}$ and $\Omega_{B^t}$) for $B^s$ and $B^t$ and use them for whitening the corresponding activations, projecting the two batches into a common spherical distribution (Fig. 1 (a)).

$W_{B^s}$ and $W_{B^t}$ are computed following the approach described in [43], which is based on the Cholesky decomposition [6]. The latter is faster [43] than the ZCA-based whitening [20] adopted in [17]. In the Supplementary Material we provide more details on how $W_{B^s}$ and $W_{B^t}$ are computed. Differently from [43], we replace the “coloring” step after whitening with simple scale and shift operations, thereby preventing the introduction of extra parameters in the network. Also, differently from [43], we use feature grouping [17] (Sec. 3.2.1) in order to make the batch-statistics estimate more robust when the batch size $m$ is small and the feature dimension $d$ is large. During training, the DWT layers accumulate the statistics for the target domain using a moving average of the batch statistics ($\Omega_{B^t}$).

In summary, the proposed DWT layers replace the correlation alignment of the last-layer feature activations with the intermediate-layer feature whitening, performed at different levels of abstraction. In Sec. 3.2.1 we show that BN-based domain alignment layers [25, 3] can be seen as a special case of DWT layers.

3.2.1 Implementation Details

Given a typical block (Conv → BN → ReLU) of a CNN, we replace the BN layer with our proposed DWT layer (see Fig. 1), obtaining: (Conv → DWT → ReLU). Ideally, in order to project the source and target feature distributions onto a reference one, the DWT layers should perform full-feature whitening using a $d \times d$ whitening matrix, where $d$ is the number of features. However, computing the covariance matrix can be ill-conditioned if $d$ is large and the batch size $m$ is small. For this reason, unlike [43] and similarly to [17], we use feature grouping, where the features are grouped into subsets of a fixed size (the group size). This results in better-conditioned covariance matrices but in only partially whitened features. In this way we reach a compromise between full-feature whitening and numerical stability. Interestingly, when the group size is equal to 1, the whitening matrices reduce to diagonal matrices, thus realizing feature standardization as in [3, 25].
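
A possible implementation of a DWT layer for fully connected activations is sketched below. The class, its argument names and the exact moving-average scheme are purely illustrative and do not necessarily correspond to our actual implementation; convolutional activations would additionally require reshaping the (N, C, H, W) tensors so that each spatial location is treated as a sample.

```python
import torch
import torch.nn as nn

class DWT(nn.Module):
    """Illustrative Domain-specific Whitening Transform for (m, d)-shaped activations."""
    def __init__(self, d, group_size=4, momentum=0.1, eps=1e-3):
        super().__init__()
        assert d % group_size == 0
        self.g, self.eps, self.momentum = group_size, eps, momentum
        self.gamma = nn.Parameter(torch.ones(d))   # scale (replaces the "coloring" step)
        self.beta = nn.Parameter(torch.zeros(d))   # shift
        # running target statistics, used at test time
        self.register_buffer('running_mean', torch.zeros(d // group_size, group_size))
        self.register_buffer('running_cov', torch.eye(group_size).repeat(d // group_size, 1, 1))

    def _whiten(self, xg, mu, cov):
        # xg: (m, n_groups, g), mu: (n_groups, g), cov: (n_groups, g, g)
        L = torch.linalg.cholesky(cov)          # cov = L L^T, L lower triangular
        W = torch.linalg.inv(L)                 # W^T W = cov^{-1}
        return torch.einsum('ngh,mnh->mng', W, xg - mu)

    def forward(self, x, domain='source'):
        m, d = x.shape
        xg = x.view(m, -1, self.g)                               # feature grouping
        if self.training:
            mu = xg.mean(dim=0)                                  # per-group mean
            xc = xg - mu
            cov = torch.einsum('mng,mnh->ngh', xc, xc) / m       # per-group covariance
            cov = cov + self.eps * torch.eye(self.g, device=x.device)  # small ridge for stability
            if domain == 'target':                               # moving average of target statistics
                self.running_mean.mul_(1 - self.momentum).add_(self.momentum * mu.detach())
                self.running_cov.mul_(1 - self.momentum).add_(self.momentum * cov.detach())
        else:
            mu, cov = self.running_mean, self.running_cov
        out = self._whiten(xg, mu, cov).reshape(m, d)
        return self.gamma * out + self.beta

# usage: separate batches per domain share the same learnable gamma/beta
layer = DWT(d=64, group_size=4)
out_s = layer(torch.randn(32, 64), domain='source')
out_t = layer(torch.randn(32, 64), domain='target')
```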

3.3 Min-Entropy Consensus Loss

The impossibility of using the cross-entropy loss on the unlabeled target samples is commonly circumvented by using some unsupervised loss, such as the entropy loss [3, 37] or the consistency loss [7, 38]. While minimizing the entropy loss ensures that the predictor maximally separates the target data, minimizing the consistency loss forces the predictor to deliver consistent predictions for target samples coming from the same (yet unknown) category. Therefore, given the importance of better exploiting the unlabeled target data and the limitations of the above two losses (see Sec. 1), we propose a novel Min-Entropy Consensus (MEC) loss within the framework of UDA. We explain below how the MEC loss merges both the entropy and the consistency loss into a single unified function.

Similarly to the consistency loss, the proposed MEC loss requires input-data perturbations. Unless otherwise explicitly specified, we apply common data-perturbation techniques to both the source and the target samples, using affine transformations and Gaussian blurring operations. When we use the MEC loss, the network is fed with three batches instead of two. Specifically, apart from $B^s$, we use two different target batches ($B^{t_1}$ and $B^{t_2}$), which contain duplicate pairs of images differing only with respect to the adopted image perturbation.

Conceptually, we can think of this pipeline as three different networks with three separate domain-specific statistics $\Omega_{B^s}$, $\Omega_{B^{t_1}}$ and $\Omega_{B^{t_2}}$, but with shared network weights. However, since both $B^{t_1}$ and $B^{t_2}$ are drawn from the same distribution, we estimate a single $\Omega_{B^t}$ using both target batches ($B^{t_1} \cup B^{t_2}$). As an additional advantage, this makes it possible to use $2m$ samples for computing $\Omega_{B^t}$.
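
As an illustration of this statistic sharing (with generic tensor names), the two perturbed target batches can simply be concatenated before estimating the target whitening statistics, so that $\Omega_{B^t}$ is computed from $2m$ samples:

```python
import torch

# b_s: source batch; b_t1, b_t2: the two perturbed target batches (all of shape (m, d))
b_s, b_t1, b_t2 = torch.randn(64, 32), torch.randn(64, 32), torch.randn(64, 32)

b_t = torch.cat([b_t1, b_t2], dim=0)                      # the 2m target samples share a single Omega_t
mu_t = b_t.mean(dim=0)                                    # target mean estimated on both batches
cov_t = (b_t - mu_t).t() @ (b_t - mu_t) / b_t.shape[0]    # target covariance estimated on 2m samples
# mu_t / cov_t are used to whiten both b_t1 and b_t2, while b_s is whitened
# with statistics computed on the source batch only.
```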

Let $B^s$, $B^{t_1}$ and $B^{t_2}$ be the three batches of the last-layer activations. Since the source samples are labeled, the cross-entropy loss ($L^s$) can be used in the case of $B^s$:

$$L^s(B^s) = -\frac{1}{m} \sum_{i=1}^{m} \log p(y_i^s \,|\, x_i^s) \qquad (6)$$

where $p(y_i^s \,|\, x_i^s)$ is the (soft-max-based) probability prediction assigned by the network to the sample $x_i^s$ with respect to its ground-truth label $y_i^s$. However, ground-truth labels are not available for target samples. For this reason, we propose the following MEC loss ($L^t$):

$$L^t(B^{t_1}, B^{t_2}) = \frac{1}{m} \sum_{i=1}^{m} \ell(x_i^{t_1}, x_i^{t_2}) \qquad (7)$$
$$\ell(x_i^{t_1}, x_i^{t_2}) = -\frac{1}{2} \max_{y \in \mathcal{Y}} \left( \log p(y \,|\, x_i^{t_1}) + \log p(y \,|\, x_i^{t_2}) \right) \qquad (8)$$

In Eq. (8), $x_i^{t_1}$ and $x_i^{t_2}$ are the activations of two corresponding perturbed target samples.

The intuitive idea behind our proposal is that, similarly to consistency-based losses [7, 38], since $x_i^{t_1}$ and $x_i^{t_2}$ correspond to the same image, the network should provide similar predictions. However, unlike the aforementioned methods, which compute the L2-norm or the binary cross-entropy between these predictions, the proposed MEC loss finds the class $y^* = \arg\max_{y \in \mathcal{Y}} \left( \log p(y \,|\, x_i^{t_1}) + \log p(y \,|\, x_i^{t_2}) \right)$, i.e. the class in which the posteriors corresponding to $x_i^{t_1}$ and $x_i^{t_2}$ maximally agree. We then use $y^*$ as the pseudo-label, which can be selected without ad-hoc confidence thresholds. In other words, instead of using high-confidence thresholds to discard unreliable target samples [7], we use all the samples but we backpropagate the error with respect to $y^*$ only.

The dynamics of the MEC loss are as follows. First, similarly to the consistency losses, it forces the network to provide coherent predictions. Second, differently from consistency losses, which are prone to attain a near-zero value with uniform posterior distributions, it enforces peaked predictions. See the Supplementary Material for a more formal relation between the MEC loss and both the entropy and the consistency loss.

The final loss is a weighted sum of $L^s$ and $L^t$: $L = L^s + \lambda L^t$.
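
The losses of Eq. (6)-(8) and their weighted combination translate directly into a few lines of PyTorch operating on last-layer logits. The sketch below is an illustrative reading of the formulas, with the pseudo-label $y^*$ selected implicitly by the max over classes.

```python
import torch
import torch.nn.functional as F

def mec_loss(logits_t1, logits_t2):
    """Eq. (7)-(8): Min-Entropy Consensus loss on two batches of target logits (m, num_classes)."""
    log_p1 = F.log_softmax(logits_t1, dim=1)
    log_p2 = F.log_softmax(logits_t2, dim=1)
    # Eq. (8): for each pair, maximize log p(y|x^t1) + log p(y|x^t2) over the classes;
    # the argmax is the pseudo-label y*, selected without any confidence threshold.
    agreement, pseudo_labels = (log_p1 + log_p2).max(dim=1)
    return -0.5 * agreement.mean()                        # Eq. (7): average over the batch

def total_loss(logits_s, labels_s, logits_t1, logits_t2, lam=0.1):
    l_s = F.cross_entropy(logits_s, labels_s)             # Eq. (6): supervised source loss
    l_t = mec_loss(logits_t1, logits_t2)
    return l_s + lam * l_t                                # L = L^s + lambda * L^t (lambda = 0.1, Sec. 4.2)
```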

3.4 Discussion

The proposed DWT generalizes the BN-based DA approaches by decorrelating the batch features. Besides the analogy with the correlation-alignment methods mentioned in Sec. 1, in which covariance matrices are used to estimate and align the source and the target distributions, a second reason why we believe that full whitening is important is the relation between feature normalization and the smoothness of the loss [42, 22, 17, 24, 36]. For instance, previous works [24, 36] showed that better conditioning of the input-feature covariance matrix leads to better conditioning of the Hessian of the loss function, making the gradient descent weight updates closer to Newton updates. However, BN only performs standardization, which barely improves the conditioning of the covariance matrix when the features are correlated [17]. Conversely, feature whitening completely decorrelates the batch samples, thus potentially improving the smoothness of the landscape of the loss function.

The importance of a smooth loss function is even higher when entropy-like losses on unlabeled data are used. For instance, Shu et al. [42] showed that minimizing the entropy forces the classifier to be confident on the unlabeled target data, thus potentially driving the classifier’s decision boundaries away from the target data. However, without a locally-Lipschitz constraint on the loss function (i.e., with a non-smooth loss landscape), the decision boundaries can be placed close to the training samples even when the entropy is minimized [42]. Since our MEC loss is related to both the entropy and the consistency loss, we also employ DWT to improve the smoothness of our loss function, in order to alleviate the overfitting phenomena related to the use of unlabeled data.

4 Experiments

In this section we provide details about our implementation and training protocols and we report our experimental evaluation. We conduct experiments on both small and large-scale datasets and we compare our method with state-of-the-art approaches. We also present an ablation study to analyze the impact of each of our contributions on the classification accuracy.

4.1 Datasets

We conduct experiments on the following datasets:

MNIST ↔ USPS. The MNIST dataset [23] contains grayscale images (28×28 pixels) depicting handwritten digits ranging from 0 to 9. The USPS [8] dataset is similar to MNIST, but its images have a smaller resolution (16×16 pixels). The domain shift between the USPS and MNIST datasets can be visually observed in Fig. 2(a).

MNIST ↔ SVHN. Street View House Numbers (SVHN) [33] images are 32×32-pixel RGB images. Similarly to MNIST, the digits range from 0 to 9. However, SVHN images have variable colour intensities and depict non-centered digits. Thus, there is a significant domain shift with respect to MNIST (Fig. 2(b)).

CIFAR-10 ↔ STL: CIFAR-10 is a 10-class dataset of RGB images depicting generic objects, with a resolution of 32×32 pixels. STL [4] is similar to CIFAR-10, except that it has fewer labelled training images per class and its images have a resolution of 96×96 pixels. The non-overlapping classes, “frog” and “monkey”, are removed from CIFAR-10 and STL, respectively. Samples are shown in Fig. 2(c).

(a) MNIST ↔ USPS
(b) SVHN ↔ MNIST
(c) CIFAR-10 ↔ STL
Figure 2: Small image datasets used in our experiments.

Office-Home: The Office-Home [53] dataset comprises 4 distinct domains, each containing the same 65 object categories (Fig. 3). There are 15,500 images in the dataset, which makes it a large-scale benchmark for testing domain adaptation methods. The domains are: Art (Ar), Clipart (Cl), Product (Pr) and Real World (Rw).

Figure 3: Sample images from the Office-Home dataset.

4.2 Experimental Setup

To fairly compare our method with other UDA approaches, in the digits experiments we adopt the same base networks proposed in [10]. For the CIFAR-10 ↔ STL experiments we use the network described in [7]. We train the networks using the Adam optimizer [21] with mini-batches of $m = 64$ samples, an initial learning rate of 0.001 and a weight decay of $5 \times 10^{-4}$. The networks are trained for a total of 120 epochs, with the learning rate being decreased by a factor of 10 after 50 and 90 epochs. We use the SVHN → MNIST setting to fix the value of the weighting hyper-parameter $\lambda$ to 0.1 and to set the group size equal to 4. These hyper-parameter values are used for all the datasets. The accuracy values reported in Tab. 1, 3 and 4 are averaged over five runs.
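
For reference, the training schedule described above corresponds to a standard Adam setup with step decays, sketched below; the `model` variable is a placeholder and the weight-decay value shown is indicative.

```python
import torch

model = torch.nn.Linear(10, 10)  # placeholder for the actual base network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=5e-4)
# decrease the learning rate by a factor of 10 after 50 and 90 of the 120 epochs
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[50, 90], gamma=0.1)

for epoch in range(120):
    # ... one training epoch over source and target mini-batches ...
    scheduler.step()
```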

In the Office-Home experiments we use a ResNet-50 [15] architecture, following [27]. We modify ResNet-50 by replacing the first BN layer and the BN layers in the first residual block (those with 64 features) with DWT layers. The network is initialized with the weights of a model pre-trained on the ILSVRC-2012 dataset. We discard the final fully-connected layer and replace it with a randomly initialized fully-connected layer with 65 output logits. During training, each domain-specific batch is limited to $m = 20$ samples (due to GPU memory constraints). The Adam optimizer is used, with different initial learning rates for the randomly initialized final layer and for the rest of the trainable parameters of the network. The network is trained for a total of 60 epochs, where one “epoch” is defined as a pass through the dataset with the smaller number of training samples. The learning rates are then decayed by a factor of 10 after 54 epochs. Differently from the small-scale dataset experiments, where target samples have predefined train and test splits, in the Office-Home experiments all the target samples (without labels) are used during both training and evaluation.
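
The ResNet-50 modifications can be sketched with torchvision as follows; `DWT` refers to the module sketched in Sec. 3.2.1 (assumed here to handle 4D activations), and interpreting “the first residual block” as the whole first ResNet stage (`model.layer1`) is an assumption made for illustration.

```python
import torch.nn as nn
from torchvision.models import resnet50

model = resnet50(pretrained=True)            # ILSVRC-2012 (ImageNet) initialization

# replace the first BN layer and the 64-feature BN layers of the first residual stage
model.bn1 = DWT(64, group_size=4)            # DWT: the (illustrative) module from Sec. 3.2.1
for block in model.layer1:
    block.bn1 = DWT(64, group_size=4)
    block.bn2 = DWT(64, group_size=4)

# replace the final classifier with a randomly initialized 65-way layer
model.fc = nn.Linear(model.fc.in_features, 65)
```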

To demonstrate the effect of our contributions, we consider three different variants of the proposed method. In the first variant (denoted as DWT, Sec. 3.2), we only use the DWT layers, without the proposed MEC loss. In practice, in the considered network architectures we replace the BN layers which follow the convolutional layers with DWT layers. The supervised cross-entropy loss is used for the labeled source samples and the entropy loss, as in [3], is used for the unlabeled target samples. No data augmentation is used here. In the second variant, denoted as DWT-MEC, we also exploit the proposed MEC loss (this corresponds to our full method). In this case we need perturbations of the input data, which are obtained by following basic data-perturbation schemes such as image translation by a factor of [0.05, 0.05], Gaussian blur and random affine transformations, as proposed in [7]. Finally, in the third variant (DWT-MEC (MT)) we plug our proposed DWT layers and the MEC loss into the Mean-Teacher (MT) training paradigm [48].
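
The target perturbations can be reproduced with standard torchvision transforms, as sketched below; the specific rotation range, blur kernel size and sigma values are indicative choices, not the exact parameters used in our experiments.

```python
from torchvision import transforms

# two independently sampled perturbations of the same target image
perturb = transforms.Compose([
    transforms.RandomAffine(degrees=10, translate=(0.05, 0.05)),  # small random affine / translation
    transforms.GaussianBlur(kernel_size=3, sigma=(0.1, 1.0)),     # random Gaussian blur
    transforms.ToTensor(),
])
# x_t1, x_t2 = perturb(img), perturb(img)   # two perturbed copies of the same target sample
```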

4.3 Results

In this section we present an extensive experimental analysis of our approach, showing both the results of an ablation study and a comparison with state-of-the-art methods.

Method | MNIST → USPS | USPS → MNIST | SVHN → MNIST | MNIST → SVHN
Source Only | 78.9 | 57.1±1.7 | 60.1±1.1 | 20.23±1.8
w/o augmentation
CORAL [45] | 81.7 | - | 63.1 | -
MMD [51] | 81.1 | - | 71.1 | -
DANN [10] | 85.1 | 73.0±2.0 | 73.9 | 35.7
DSN [2] | 91.3 | - | 82.7 | -
CoGAN [26] | 91.2 | 89.1±0.8 | - | -
ADDA [52] | 89.4±0.2 | 90.1±0.8 | 76.0±1.8 | -
DRCN [11] | 91.8±0.1 | 73.7±0.1 | 82.0±0.2 | 40.1±0.1
ATT [37] | - | - | 86.20 | 52.8
ADA [13] | - | - | 97.6 | -
AutoDIAL [3] | 97.96 | 97.51 | 89.12 | 10.78
SBADA-GAN [35] | 97.6 | 95.0 | 76.1 | 61.1
GAM [16] | 95.7±0.5 | 98.0±0.5 | 74.6±1.1 | -
MECA [32] | - | - | 95.2 | -
DWT | 99.09±0.09 | 98.79±0.05 | 97.75±0.10 | 28.92±1.9
Target Only | 96.5 | 99.2 | 99.5 | 96.7
w/ augmentation
SE a [7] | 88.14±0.34 | 92.35±8.61 | 93.33±5.88 | 33.87±4.02
SE b [7] | 98.23±0.13 | 99.54±0.04 | 99.26±0.05 | 37.49±2.44
SE b† [7] | 99.29±0.16 | 99.26±0.04 | 97.88±0.03 | 24.09±0.33
DWT-MEC b | 99.01±0.06 | 99.02±0.05 | 97.80±0.07 | 30.20±0.92
DWT-MEC (MT) b | 99.30±0.19 | 99.15±0.05 | 99.14±0.02 | 31.58±2.34
Table 1: Accuracy (%) on the digits datasets: comparison with the state of the art. a indicates minimal usage of data augmentation and b considers augmented source and target data. † indicates our implementation of SE [7].

4.3.1 Ablation Study

We first conduct a thorough analysis of our method assessing, in isolation, the impact of our two main contributions: (i) aligning source and target distributions by embedded DWT layers; and (ii) leveraging target data through our threshold-free MEC loss.

First, we consider the SVHN → MNIST setting and show the benefit of feature whitening over BN. We vary the number of whitening layers from 1 to 3 and simultaneously change the group size from 1 to 8 (see Sec. 3.2.1). With a group size equal to 1, the DWT layers reduce to the DA layers proposed in [3, 25]. Our results are shown in Fig. 4, and from the figure it is clear that with a group size of 1 the accuracy stays consistently below 90%. This behaviour can be ascribed to the sub-optimal alignment of the source and target data distributions achieved by previous BN-based DA layers. When the group size increases, the feature decorrelation performed by the DWT layers comes into play and results in a significant improvement in terms of performance. The accuracy increases monotonically as the group size grows up to 4, then it starts to decrease. This final drop in accuracy is probably due to an inaccurate estimation of the covariance matrices. Indeed, a covariance matrix of size 8×8 is perhaps poorly estimated due to the lack of samples in a batch (Sec. 3.2.1). Importantly, Fig. 4 also shows that increasing the number of DWT layers has a positive impact on the accuracy. This is in contrast with [17], where feature decorrelation is used only in the first layer of the network.

Figure 4: SVHN → MNIST experiment: accuracy for varying numbers of DWT layers and group sizes. Different colors are used to improve readability.

In Tab. 2 we evaluate the effectiveness of the proposed MEC loss and compare our approach with the consistency-based loss recently adopted by French et al. [7]. We use Self-Ensembling (SE) [7] with and without confidence thresholding (CT) on the predictions of the teacher network. To fairly compare our approach with SE, we also consider a mean-teacher scheme in our framework. We observe that SE has excellent performance when the CT is set to a very high value (0.936, as reported in [7]), but its performance drops when CT is set to 0, especially in the SVHN → MNIST setting. This shows that the consistency loss in [7] may be harmful when the network is not confident on the target-domain samples. On the contrary, the proposed MEC loss leads to results which are on par with SE in the MNIST ↔ USPS settings and to higher accuracy in the SVHN → MNIST setting. This clearly demonstrates that our proposed loss avoids the need to introduce the CT hyper-parameter and, at the same time, yields better performance. It is important to remark that, in the case of UDA, tuning hyper-parameters is hard, as target samples are unlabeled and cross-validation on source data is unreliable because of the domain-shift problem [32].

Method | MNIST → USPS | USPS → MNIST | SVHN → MNIST
SE (w/ CT) [7] | 99.29 | 99.26 | 97.88
SE (w/o CT) [7] | 98.71 | 97.63 | 26.80
DWT-MEC (MT) | 99.30 | 99.15 | 99.14
Table 2: Accuracy (%) on the digits datasets. Comparison between the consistency loss of SE [7] (with and without CT) and our threshold-free MEC loss.
Method Ar→Cl Ar→Pr Ar→Rw Cl→Ar Cl→Pr Cl→Rw Pr→Ar Pr→Cl Pr→Rw Rw→Ar Rw→Cl Rw→Pr Avg
ResNet-50 [15] 34.9 50.0 58.0 37.4 41.9 46.2 38.5 31.2 60.4 53.9 41.2 59.9 46.1
DAN [28] 43.6 57.0 67.9 45.8 56.5 60.4 44.0 43.6 67.7 63.1 51.5 74.3 56.3
DANN [10] 45.6 59.3 70.1 47.0 58.5 60.9 46.1 43.7 68.5 63.2 51.8 76.8 57.6
JAN [29] 45.9 61.2 68.9 50.4 59.7 61.0 45.8 43.4 70.3 63.9 52.4 76.8 58.3
CDAN-RM [27] 49.2 64.8 72.9 53.8 63.9 62.9 49.8 48.8 71.5 65.8 56.4 79.2 61.6
CDAN-M [27] 50.6 65.9 73.4 55.7 62.7 64.2 51.8 49.1 74.5 68.2 56.9 80.7 62.8
DWT 50.8 72.0 75.8 58.9 65.6 60.2 57.2 49.5 78.3 70.1 55.3 78.2 64.3
SE [7] 48.8 61.8 72.8 54.1 63.2 65.1 50.6 49.2 72.3 66.1 55.9 78.7 61.5
DWT-MEC 54.7 72.3 77.2 56.9 68.5 69.8 54.8 47.9 78.1 68.6 54.9 81.2 65.4
Table 3: Accuracy (%) on the Office-Home dataset with ResNet-50 as the base network, and comparison with state-of-the-art methods.
Method | CIFAR-10 → STL | STL → CIFAR-10
Source Only | 60.35 | 51.88
w/o augmentation
DANN [10] | 66.12 | 56.91
DRCN [11] | 66.37 | 58.65
AutoDIAL [3] | 79.10 | 70.15
DWT | 79.75±0.25 | 71.18±0.56
Target Only | 67.75 | 88.86
w/ augmentation
SE a [7] | 77.53±0.11 | 71.65±0.67
SE b [7] | 80.09±0.31 | 69.86±1.97
DWT-MEC b | 80.39±0.31 | 72.52±0.94
DWT-MEC (MT) b | 81.83±0.14 | 71.31±0.22
Table 4: Accuracy (%) on CIFAR-10 ↔ STL: comparison with the state of the art. a indicates minimal data augmentation and b considers augmented source and target data.

4.3.2 Comparison with State-of-the-Art Methods

In this section we present the results of our comparison with previous UDA methods. Tab. 1 reports the results obtained on the digits datasets. We compare with several baselines: Correlation Alignment (CORAL) [45], Simultaneous Deep Transfer (MMD) [51], Domain-Adversarial Training of Neural Networks (DANN) [10], Domain Separation Networks (DSN) [2], Coupled Generative Adversarial Networks (CoGAN) [26], Adversarial Discriminative Domain Adaptation (ADDA) [52], Deep Reconstruction-Classification Networks (DRCN) [11], Asymmetric Tri-training (ATT) [37], Associative Domain Adaptation (ADA) [13], AutoDIAL [3], SBADA-GAN [35], Domain transfer through deep activation matching (GAM) [16], Minimal-Entropy Correlation Alignment (MECA) [32] and SE [7]. Note that Virtual Adversarial Domain Adaptation (VADA) [42] uses a network of different capacity and thus cannot be directly compared with the other methods (including ours); for this reason, [42] is not reported in Tab. 1. The results associated with each method are taken from the corresponding papers. We re-implemented SE, as the numbers reported in the original paper [7] refer to a different deep architecture. We also report results where the network is trained only on the labeled source data or only on the target data.

Tab. 1 is split into two sections, separating the methods that exploit data augmentation from those which use only the original training data. Among the methods without data augmentation, our DWT performs better than previous UDA methods in three of the four settings. Our method is less effective in the MNIST → SVHN setting due to the strong domain shift between the two domains; in this setting, GAN-based methods [35] are more effective. Looking at the methods which consider data augmentation, we compare our approach with SE [7]. To be consistent with the other methods, we plug the architectures described in [9] into SE. Comparing the proposed approach with our re-implementation of SE (SE b†), we observe that DWT-MEC (MT) is almost on par with SE in the MNIST ↔ USPS settings and better than SE in the SVHN → MNIST setting. For the sake of completeness, we also report the performance of SE taken from the original paper [7], considering SE with minimal augmentation (only Gaussian blur) and SE with full augmentation (translation, horizontal flip, affine transformations).

With the rapid progress of deep DA methods, the results on the digits datasets have saturated, which makes it difficult to gauge the merit of the proposed contributions. Therefore, we also consider the CIFAR-10 ↔ STL settings. Our results are reported in Tab. 4. Similarly to the experiments in Tab. 1, we separate the methods exploiting data augmentation from those not using target-sample perturbations. Tab. 4 shows that our method (DWT) outperforms all previous baselines which also do not consider augmentation. Furthermore, by exploiting data perturbations and the proposed MEC loss, our approach (with and without Mean-Teacher) reaches higher accuracy than SE (in this case the accuracy values reported for SE are taken directly from the original paper, as the underlying network architecture is the same).

Finally, we also perform experiments on the large-scale Office-Home dataset and compare with the baseline methods reported in the recent work of Long et al. [27]. The results in Tab. 3 show that our approach outperforms all the other methods. On average, the proposed approach improves over Conditional Domain Adversarial Networks (CDAN) by 2.4% and it is also more accurate than SE.

5 Conclusions

In this work we address UDA by proposing domain-specific feature whitening with DWT layers and the MEC loss. On the one hand, whitening of intermediate features enables the alignment of the source and the target distributions at intermediate feature levels and increases the smoothness of the loss landscape. On the other hand, our MEC loss better exploits the target data. Both these components can be easily integrated in any standard CNN. Our experiments on standard benchmarks show state-of-the-art performance on digits categorization and object recognition tasks. As future work, we plan to extend our method to handle multiple source and target domains.

References

  • [1] K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Krishnan. Unsupervised pixel-level domain adaptation with gans. In CVPR, 2017.
  • [2] K. Bousmalis, G. Trigeorgis, N. Silberman, D. Krishnan, and D. Erhan. Domain separation networks. In NIPS, 2016.
  • [3] F. M. Carlucci, L. Porzi, B. Caputo, E. Ricci, and S. R. Bulò. Autodial: Automatic domain alignment layers. In ICCV, pages 5077–5085, 2017.
  • [4] A. Coates, A. Ng, and H. Lee. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 215–223, 2011.
  • [5] G. Csurka, editor. Domain Adaptation in Computer Vision Applications. Advances in Computer Vision and Pattern Recognition. Springer, 2017.
  • [6] D. Dereniowski and K. Marek. Cholesky factorization of matrices in parallel and ranking of graphs. In 5th Int. Conference on Parallel Processing and Applied Mathematics, 2004.
  • [7] G. French, M. Mackiewicz, and M. Fisher. Self-ensembling for visual domain adaptation. ICLR, 2018.
  • [8] J. Friedman, T. Hastie, and R. Tibshirani. The elements of statistical learning, volume 1. 2001.
  • [9] Y. Ganin and V. Lempitsky. Unsupervised domain adaptation by backpropagation. ICML, 2015.
  • [10] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky. Domain-adversarial training of neural networks. Journal of Machine Learning Research, 17(59):1–35, 2016.
  • [11] M. Ghifary, W. B. Kleijn, M. Zhang, D. Balduzzi, and W. Li. Deep reconstruction-classification networks for unsupervised domain adaptation. In ECCV, 2016.
  • [12] Y. Grandvalet and Y. Bengio. Semi-supervised learning by entropy minimization. In NIPS, 2004.
  • [13] P. Haeusser, T. Frerix, A. Mordvintsev, and D. Cremers. Associative domain adaptation. In ICCV, volume 2, page 6, 2017.
  • [14] P. Haeusser, T. Frerix, A. Mordvintsev, and D. Cremers. Associative domain adaptation. In ICCV, 2017.
  • [15] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
  • [16] H. Huang, Q. Huang, and P. Krahenbuhl. Domain transfer through deep activation matching. In ECCV, pages 590–605, 2018.
  • [17] L. Huang, D. Yang, B. Lang, and J. Deng. Decorrelated batch normalization. In CVPR, 2018.
  • [18] L. Huang, D. Yang, B. Lang, and J. Deng. Decorrelated batch normalization. In CVPR, 2018.
  • [19] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
  • [20] A. Kessy, A. Lewin, and K. Strimmer. Optimal whitening and decorrelation. The American Statistician, 2017.
  • [21] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv:1412.6980, 2014.
  • [22] J. Kohler, H. Daneshmand, A. Lucchi, M. Zhou, K. Neymeyr, and T. Hofmann. Towards a Theoretical Understanding of Batch Normalization. arXiv:1805.10694, 2018.
  • [23] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • [24] Y. LeCun, L. Bottou, G. B. Orr, and K. Müller. Efficient backprop. In Neural Networks: Tricks of the Trade - Second Edition, pages 9–48. 2012.
  • [25] Y. Li, N. Wang, J. Shi, J. Liu, and X. Hou. Revisiting batch normalization for practical domain adaptation. arXiv:1603.04779, 2016.
  • [26] M.-Y. Liu and O. Tuzel. Coupled generative adversarial networks. In NIPS, pages 469–477, 2016.
  • [27] M. Long, Z. Cao, J. Wang, and M. I. Jordan. Conditional adversarial domain adaptation. NIPS, 2018.
  • [28] M. Long and J. Wang. Learning transferable features with deep adaptation networks. In ICML, 2015.
  • [29] M. Long, H. Zhu, J. Wang, and M. I. Jordan. Deep transfer learning with joint adaptation networks. ICML, 2017.
  • [30] M. Mancini, H. Karaoguz, E. Ricci, P. Jensfelt, and B. Caputo. Kitting in the wild through online domain adaptation. IROS, 2018.
  • [31] M. Mancini, L. Porzi, S. R. Bulò, B. Caputo, and E. Ricci. Boosting domain adaptation by discovering latent domains. CVPR, 2018.
  • [32] P. Morerio, J. Cavazza, and V. Murino. Minimal-entropy correlation alignment for unsupervised deep domain adaptation. ICLR, 2018.
  • [33] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning, volume 2011, page 5, 2011.
  • [34] S. J. Pan, Q. Yang, et al. A survey on transfer learning. IEEE Transactions on knowledge and data engineering, 22(10):1345–1359, 2010.
  • [35] P. Russo, F. M. Carlucci, T. Tommasi, and B. Caputo. From source to target and back: symmetric bi-directional adaptive gan. In CVPR, 2018.
  • [36] S. Wiesler and H. Ney. A convergence analysis of log-linear training. In NIPS, 2011.
  • [37] K. Saito, Y. Ushiku, and T. Harada. Asymmetric tri-training for unsupervised domain adaptation. arXiv:1702.08400, 2017.
  • [38] M. Sajjadi, M. Javanmardi, and T. Tasdizen. Regularization with stochastic transformations and perturbations for deep semi-supervised learning. In NIPS, pages 1163–1171, 2016.
  • [39] S. Sankaranarayanan, Y. Balaji, C. D. Castillo, and R. Chellappa. Generate to adapt: Aligning domains using generative adversarial networks. In CVPR, 2018.
  • [40] J. Schäfer and K. Strimmer. A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Statistical Applications in Genetics and Molecular Biology, 4(1), 2005.
  • [41] A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb. Learning from simulated and unsupervised images through adversarial training. arXiv:1612.07828, 2016.
  • [42] R. Shu, H. H. Bui, H. Narui, and S. Ermon. A dirt-t approach to unsupervised domain adaptation. arXiv preprint arXiv:1802.08735, 2018.
  • [43] A. Siarohin, E. Sangineto, and N. Sebe. Whitening and Coloring transform for GANs. arXiv:1806.00420, 2018.
  • [44] J. Stallkamp, M. Schlipsing, J. Salmen, and C. Igel. Man vs. computer: Benchmarking machine learning algorithms for traffic sign recognition. Neural networks, 32:323–332, 2012.
  • [45] B. Sun, J. Feng, and K. Saenko. Return of frustratingly easy domain adaptation. In AAAI, 2016.
  • [46] B. Sun and K. Saenko. Deep coral: Correlation alignment for deep domain adaptation. ECCV, 2016.
  • [47] Y. Taigman, A. Polyak, and L. Wolf. Unsupervised cross-domain image generation. ICLR, 2017.
  • [48] A. Tarvainen and H. Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In NIPS, pages 1195–1204, 2017.
  • [49] A. Torralba and A. A. Efros. Unbiased look at dataset bias. In CVPR, pages 1521–1528. IEEE, 2011.
  • [50] L. N. Trefethen and D. Bau. Numerical Linear Algebra. SIAM, 1997.
  • [51] E. Tzeng, J. Hoffman, T. Darrell, and K. Saenko. Simultaneous deep transfer across domains and tasks. In ICCV, 2015.
  • [52] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell. Adversarial discriminative domain adaptation. In CVPR, volume 1, page 4, 2017.
  • [53] H. Venkateswara, J. Eusebio, S. Chakraborty, and S. Panchanathan. Deep hashing network for unsupervised domain adaptation. In CVPR, 2017.
  • [54] X. Zhu. Semi-supervised learning literature survey. 2005.

6 Computing the whitening matrix

The whitening matrix $W_B$ in Eq. (3) of the main paper can be computed in different ways. For instance, Huang et al. [18] use ZCA whitening [20], while Siarohin et al. [43] use the Cholesky decomposition [6]. Both techniques are unique (given a covariance matrix) and differentiable; however, we adopted the method proposed in [43] because it is faster [50] and more stable [43] than ZCA-based whitening. Moreover, many modern platforms for deep-network development include tools for computing the Cholesky decomposition, thus this solution makes our approach easier to reproduce.

We describe below the main steps used to compute $W_B$. Since $W_{B^s}$ and $W_{B^t}$, respectively used in Eq. (4) and Eq. (5) of the main paper and depending on $B^s$ and $B^t$, are computed in exactly the same way, in the following we refer to the generic matrix $W_B$ in Eq. (3), which depends on the batch statistics $\Omega_B$.

The first step consists in computing the covariance matrix $\Sigma_B$. To avoid instability issues, we blend the empirical covariance matrix with the identity matrix $I$ [40]:

$$\Sigma_B = (1 - \lambda)\hat{\Sigma}_B + \lambda I \qquad (9)$$

where $\lambda$ is a small blending (shrinkage) coefficient and:

$$\hat{\Sigma}_B = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)(x_i - \mu_B)^\top \qquad (10)$$

is the empirical covariance matrix computed using $B$.

Once $\Sigma_B$ is computed, we use the approach proposed in [43] to compute $W_B$ such that $W_B^\top W_B = \Sigma_B^{-1}$:

  1. Let $\Sigma_B = L L^\top$, where $L$ is a lower triangular matrix.

  2. Using the Cholesky decomposition we compute $L$ (and $L^\top$) from $\Sigma_B$.

  3. We invert $L$ and obtain: $W_B = L^{-1}$.

For more details, we refer to [43].
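
The three steps above can be checked numerically with a short script; the blending coefficient used here is arbitrary and serves only for illustration.

```python
import torch

m, d = 64, 8
x = torch.randn(m, d)
xc = x - x.mean(dim=0)
sigma_emp = xc.t() @ xc / m                      # empirical covariance
sigma = 0.95 * sigma_emp + 0.05 * torch.eye(d)   # Eq. (9): blend with the identity (arbitrary coefficient)

L = torch.linalg.cholesky(sigma)                 # steps 1-2: Sigma_B = L L^T, L lower triangular
W = torch.linalg.inv(L)                          # step 3:   W_B = L^{-1}

# sanity check: W_B^T W_B = Sigma_B^{-1}
assert torch.allclose(W.t() @ W, torch.linalg.inv(sigma), atol=1e-3)
# the whitened batch has (approximately) identity covariance, up to the blending
x_hat = xc @ W.t()
print(x_hat.t() @ x_hat / m)
```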

7 Relation between the MEC loss and the Entropy and the Consistency losses

We show below a formal relation between our MEC loss and the Entropy and the Consistency losses.

Proposition 1.

Let $\mathcal{H}$ be a hypothesis space of predictors of infinite capacity. Then the minimization of the consensus loss yields a predictor $h \in \mathcal{H}$ that is consistent, i.e. $h(x^{t_1}) = h(x^{t_2})$ for any pair of perturbed datapoints $(x^{t_1}, x^{t_2})$, and confident, i.e. $p(y^* \,|\, x) = 1$ for all $x$ and some $y^*$ depending on $x$.

Proof.

The pointwise loss $\ell$ in Eq. (8) is lower bounded by $0$, and it attains $0$ if and only if the conditions on $h$ listed in the proposition are satisfied. The result follows by noting that predictors of infinite capacity can always attain $0$ loss. ∎

8 Additional experiments using synthetic-to-real adaptation settings

In this section we report results of additional UDA experiments using synthetic source images and real target images and we compare our method with the state-of-the-art approaches in these settings.

8.1 Datasets and experimental setup

Synthetic Numbers → SVHN. It is common practice in UDA to train a predictor on annotated synthetic images and then test it on real images. In this setting we use SYN NUMBERS [10] as the source dataset and SVHN [33] as the target dataset. The former (SYN NUMBERS) is composed of software-generated images (e.g., using different orientations, stroke colors, etc.) designed to simulate the latter (SVHN). Despite some geometric similarities between the two datasets, there exists a significant domain shift between them due to, for instance, the cluttered background in SVHN, which is absent in SYN NUMBERS images (see Fig. 5 (a)). There are approximately 500,000 annotated images in the SYN NUMBERS dataset.

(a) SYN NUMBERS → SVHN
(b) SYN SIGNS → GTSRB
Figure 5: Samples from the synthetic image datasets (source) and the real image datasets (target).

Synthetic Signs → GTSRB. In this setting, which is analogous to the SYN NUMBERS → SVHN experiment, the source dataset (SYN SIGNS [10]) is composed of synthetic traffic signs, while the target dataset is the German Traffic Sign Recognition Benchmark (GTSRB [44]). The SYN SIGNS dataset is composed of 100,000 synthetic images belonging to 43 different traffic-sign categories, while the GTSRB dataset is composed of 39,209 real images, partitioned using the same 43 categories. As shown in Fig. 5 (b), the real target domain exhibits a domain shift because of different illumination conditions, background clutter, etc.

In both settings we adopt the standard evaluation protocols and the corresponding training/testing splits [10], using the same experimental setup reported in Sec. 4.2 of the main paper.

Method | Syn Numbers → SVHN | Syn Signs → GTSRB
Source Only | 86.7±0.8 | 80.6±0.6
w/o augmentation
DANN [10] | 91.0 | 88.6
ATT [37] | 92.9 | 96.2
ADA [13] | 91.8 | 97.6
AutoDIAL [3] | 87.9 | 97.8
DWT | 93.70±0.21 | 98.11±0.13
Target Only | 92.2 | 99.8
w/ augmentation
SE a [7] | 96.01±0.08* | 98.53±0.15*
SE b [7] | 97.11±0.04* | 99.37±0.09*
SE a† [7] | 91.92±0.09 | 97.73±0.10
SE b† [7] | 95.62±0.12 | 99.01±0.04
DWT-MEC | 94.62±0.13 | 99.30±0.07
DWT-MEC (MT) | 94.10±0.21 | 99.22±0.16
Table 5: Accuracy (%) on the Synthetic image → Real image settings. * denotes values taken from [7]; a means minimal augmentation; b means full augmentation of both the source and the target data; and † denotes methods using base networks identical to those of our proposed method.

8.2 Comparison with state-of-the-art methods

In Tab. 5 we report the results of our method compared with other UDA methods. We compare with the following baselines: Domain-Adversarial Training of Neural Networks (DANN) [10], Asymmetric Tri-training (ATT) [37], Associative Domain Adaptation (ADA) [13], AutoDIAL [3] and Self-Ensembling (SE) [7]. The results of most of the methods reported in Tab. 5 are taken from the original papers. In the same table we also show SE and AutoDIAL results obtained using base-network architectures comparable to those used by our method. Moreover, similarly to the main paper, and for a fair comparison, we split Tab. 5 into two sections in order to differentiate the methods which use data augmentation from those which do not.

When DWT is compared with the methods using no data augmentation, it outperforms all the baselines in both the SYN NUMBERS → SVHN and the SYN SIGNS → GTSRB settings. When data augmentation is considered, DWT-MEC outperforms all the other approaches in the second setting, but performs about 1% worse than SE [7] in the first setting. The superior performance of SE in SYN NUMBERS → SVHN can be attributed to the use of a very conservative threshold on the target predictions, which helps to filter out noisy predictions during training. However, as demonstrated in Sec. 4.3.1 of the main paper (Tab. 2), the absence of a confidence threshold tuned on the specific setting might lead SE to a drastic performance degradation.

9 CNN Architectures

In this section we report the network architectures used in all the small-image experiments shown both in the main paper and in this Supplementary Material (Tab. 6, 7, 8, 9).

Description
Input: 28×28
Conv 5×5, 32 filters, pad 2
Max-pool 2×2, stride 2
Conv 5×5, 48 filters, pad 2
Max-pool 2×2, stride 2
Fully connected, 100 units
Fully connected, 100 units
Fully connected, 10 units, softmax
Table 6: MNIST ↔ USPS base architecture as used in [10].
Description
Input: 32×32×3
Conv 5×5, 64 filters, pad 2
Max-pool 3×3, stride 2
Conv 5×5, 64 filters, pad 2
Max-pool 3×3, stride 2
Conv 5×5, 128 filters, pad 2
Fully connected, 3072 units
Dropout, 50%
Fully connected, 2048 units
Dropout, 50%
Fully connected, 10 units, softmax
Table 7: SVHN ↔ MNIST and SYN NUMBERS → SVHN base architecture as used in [10].
Description
Input: 40×40×3
Conv 5×5, 96 filters, pad 2
Max-pool 2×2, stride 2
Conv 3×3, 144 filters, pad 1
Max-pool 2×2, stride 2
Conv 5×5, 256 filters, pad 2
Max-pool 2×2, stride 2
Fully connected, 512 units
Dropout, 50%
Fully connected, 43 units, softmax
Table 8: SYN SIGNS → GTSRB base architecture as used in [10].
Description
Input: 32×32×3
Conv 3×3, 128 filters, pad 1
Conv 3×3, 128 filters, pad 1
Conv 3×3, 128 filters, pad 1
Max-pool 2×2, stride 2
Dropout, 50%
Conv 3×3, 256 filters, pad 1
Conv 3×3, 256 filters, pad 1
Conv 3×3, 256 filters, pad 1
Max-pool 2×2, stride 2
Dropout, 50%
Conv 3×3, 512 filters, pad 0
Conv 1×1, 256 filters, pad 0
Conv 1×1, 128 filters, pad 0
Global Average Pooling
Fully connected, 9 units, softmax
Table 9: CIFAR-10 ↔ STL base architecture as used in [7].