Deep Discriminative Learning for Unsupervised Domain Adaptation

11/17/2018 ∙ by Rohith AP, et al. ∙ 0

The primary objective of domain adaptation methods is to transfer knowledge from a source domain to a target domain that has a similar but different data distribution. Thus, in order to correctly classify the unlabeled target domain samples, the standard approach is to learn a common representation for both source and target domain, thereby indirectly addressing the problem of learning a classifier in the target domain. However, such an approach does not address the task of classification in the target domain directly. In contrast, we propose an approach that directly addresses the problem of learning a classifier in the unlabeled target domain. In particular, we train a classifier to correctly classify the training samples while simultaneously classifying the samples in the target domain in an unsupervised manner. The corresponding model is referred to as Discriminative Encoding for Domain Adaptation (DEDA). We show that this simple approach for performing unsupervised domain adaptation is indeed quite powerful. Our method achieves state-of-the-art results in unsupervised adaptation tasks on various image classification benchmarks. We also obtain state-of-the-art performance on domain adaptation for the Amazon reviews sentiment classification dataset. We perform additional experiments in which the source data has fewer labeled examples, and on a zero-shot domain adaptation task where no target domain samples are used for training.

1 Introduction

Deep learning methods have become very successful in supervised learning tasks in areas such as vision, speech and language. However, their performance can be attributed to access to large amounts of labeled data, or data with ground truth, which can be very expensive or even impossible to obtain. Unsupervised domain adaptation aims to learn representations for an unlabeled target dataset by transferring knowledge from a labeled source dataset. Domain adaptation thus helps to reduce the distribution shift between two datasets by mapping them into a common feature space.

Domain adaptation is a well-researched topic with a large body of literature that tackles the problem in various ways. Recently, deep learning based approaches have dominated the domain adaptation literature, and they are our focus in this paper. When there is no labeled data in the target domain, one of the main approaches is to reduce the difference between the source and target distributions. The most widely used statistical criterion for aligning the distributions is the Maximum Mean Discrepancy (MMD) (Gretton et al., 2009) loss, which reduces the norm of the difference between the expectations of the two domains. Higher order moments are also used to align feature representations of source and target domains. Correlation alignment, or CORAL (Sun & Saenko, 2016), proposed mean and covariance matching between distributions. Different kernels can be added to MMD, such as the Gaussian kernel (Gretton et al., 2012), which has been utilized in (Louizos et al., 2015) for domain adaptation. Modified versions of the Kullback-Leibler (KL) divergence are also used as a metric to minimize the difference between domains, as in (Zhuang et al., 2015). The Central Moment Discrepancy (CMD) (Zellinger et al., 2017) metric can be used to match higher order central moments, and performs well in various domain adaptation tasks including text datasets.

Adversarial domain adaptation techniques employ a discriminator network that classifies the input representation as belonging to the source or the target domain. An adversarial objective is used to confuse the model about the domain, thereby capturing the most discriminative information about classification. The Domain Adversarial Neural Network (DANN) (Ganin & Lempitsky, 2015) introduces a gradient reversal layer to make the source and target distributions similar.

Adversarial Discriminative Domain Adaptation (ADDA) (Tzeng et al., 2017) uses independent source and target mappings with untied weights. Pretrained source weights are used to initialize the target model. A domain discriminator is trained simultaneously with the inverted GAN loss (Goodfellow et al., 2014). Methods such as the Deep Reconstruction Classification Network (DRCN) (Ghifary et al., 2016) and Domain Separation Networks (DSNs) (Bousmalis et al., 2016) use a decoder to reconstruct the target data in an unsupervised manner. CoGAN (Liu & Tuzel, 2016) consists of a pair of GANs that generate source and target data. The weights are shared for the first few layers of the generators and the last few layers of the discriminators to obtain a domain invariant feature representation.

The aforementioned approaches for domain adaptation have a common unifying theme: they attempt to morph the target and source distributions so as to make them indistinguishable from each other. Once a perfect alignment between the two distributions is obtained, one can use the classifier for the labeled source domain to discriminate between the classes of the unlabeled target domain. Hence, the performance of the classifier on the target domain depends crucially on the alignment learnt in the previous step. As a result, the actual task of discriminating among the target classes is solved indirectly.

In this paper, we propose a method that directly addresses the problem of learning a classifier in the unlabeled target space. Hence, instead of focusing on aligning the source and target distributions, we learn a classifier jointly on the two distributions. On the labeled source data, the classifier is trained in a supervised manner. We propose a discriminative model for training the classifier on the unlabeled target data. The resultant model is referred to as Discriminative Encoder for Domain Adaptation (DEDA). In our experiments, we show that by jointly training the classifier on the source and target domains, we can achieve a reasonable improvement in performance on the unlabeled target data. Surprisingly, this simple scheme of joint training improves over the state of the art on several challenging datasets.

The main contribution of the paper is to directly learn a classifier in the target domain to solve the problem of unsupervised domain adaptation. We adopt a training procedure similar to ADDA (Tzeng et al., 2017), where we first perform pre-training using samples from the source domain alone for some epochs, after which the unlabeled target samples are fed to DEDA for training. We obtain state-of-the-art performance on domain adaptation across various image classification tasks for datasets like MNIST (Lecun et al., 1998), USPS, SVHN (Netzer et al., 2011), CIFAR-10 and STL-10. Furthermore, we also achieve an improvement in performance on the Amazon product review dataset.

Figure 5: (a) and (b) show the steps involved in training DEDA for image datasets. (c) and (d) correspond to domain adaptation in text datasets. In (a) and (c) the pre-training is done using source inputs alone. The loss function (i) is the supervised loss, or cross entropy loss, for the labeled source domain. The unsupervised loss (ii) is given by (2.4) after doing label selection in (2). A loss to confuse the model about fake samples (iii) is given by (2.5), and the KMMD loss (iv) for training the generator is given by (2.5). In the case of text, negative sampling is used to generate fake samples. (Best viewed in color)

2 Domain Adaptation with Discriminative Features

2.1 Problem Formulation

Formally, a domain is specified by its feature space $\mathcal{X}$, the label space $\mathcal{Y}$ and the distribution $p(x, y)$, where $x \in \mathcal{X}$ and $y \in \mathcal{Y}$. Domain adaptation involves two domains $\mathcal{D}_s$ and $\mathcal{D}_t$ that are referred to as the source and target domains respectively. A common assumption in domain adaptation is that the feature space as well as the label space remains unchanged across the source and the target domain, that is $\mathcal{X}_s = \mathcal{X}_t$ and $\mathcal{Y}_s = \mathcal{Y}_t$. Hence, the only difference between the source and target domain is the distribution over the input space, that is $p_s(x) \neq p_t(x)$. This is referred to as the domain shift.

Given labeled data in the source domain, it is straightforward to learn a classifier by maximizing the probability $p(y \mid x)$ over the labeled samples. In this paper, we consider the problem of unsupervised domain adaptation, where the labels of the target domain are unavailable. Hence, the problem is to transfer the knowledge from the labeled source domain to the unlabeled target domain. Here, we propose a discriminative model to address this problem.

Specifically, the training data consists of images from the source domain $\{x_s^{(i)}\}_{i=1}^{m}$ and their corresponding labels $\{y_s^{(i)}\}_{i=1}^{m}$. Furthermore, a set of unlabeled target images $\{x_t^{(j)}\}_{j=1}^{n}$ is also provided during training. The aim is to learn a mapping from the images of the target domain to their corresponding labels without the presence of labeled images in the target domain. In particular, the labeled images in the source domain should help in learning a classifier for the unlabeled images in the target domain.

2.2 Overview

As mentioned above, we need to utilize the labeled images in the source domain for learning a classifier for the unlabeled images in the target domain. We achieve this by learning a mapping from the source images to the corresponding labels. Simultaneously, we train the same mapping on the unlabeled images in an end-to-end manner by using a discriminative loss function for unlabeled data proposed in (Pandey & Dukkipati, 2017a). The resultant model is referred to as Discriminative Encoder for Domain Adaptation (DEDA). The DEDA architecture primarily consists of an encoder and a classifier. We observed that using a common encoder for the source and target domains does not result in any loss in performance. Hence, in the rest of the paper, we assume the source and target encoders to be exactly the same.

To prevent DEDA from overfitting, we use adversarial regularization, which is discussed in detail in Section 2.5. A detailed diagram of the steps involved in training the model is given in Figure 5. The various steps involved in training DEDA are discussed below. First, we discuss the training of the encoder and the classifier for the source and target domains. Subsequently, we discuss the adversarial regularization strategy used in this paper.

2.3 Supervised Source Classification

For the image-target pairs in the source domain $\{(x_s^{(i)}, y_s^{(i)})\}_{i=1}^{m}$, we train the encoder and the classifier by maximizing the log-probability of observing the targets given the images. Let $\theta$ denote the parameters of the encoding and classification networks. Then, the conditional log-likelihood of the source data can be written as:

$$L_s(\theta) = \sum_{i=1}^{m} \log p_\theta\big(y_s^{(i)} \mid x_s^{(i)}\big)$$
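As a minimal sketch of the supervised objective above, the snippet below evaluates the summed log-probability of the true labels from raw classifier scores. The function names `log_softmax` and `source_log_likelihood`, and the toy scores, are illustrative assumptions and not part of the original work.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of class scores."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return [v - log_norm for v in logits]

def source_log_likelihood(batch_logits, labels):
    """Sum of log p_theta(y | x) over the labeled source samples."""
    return sum(log_softmax(logits)[y] for logits, y in zip(batch_logits, labels))

# Toy batch (hypothetical values): two samples, three classes.
logits = [[2.0, 0.5, -1.0], [0.0, 0.0, 0.0]]
labels = [0, 2]
print(source_log_likelihood(logits, labels))
```

Maximizing this quantity is equivalent to minimizing the usual cross entropy loss on the source batch.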

2.4 Unsupervised Target Classification

For the images in the target domain $\{x_t^{(j)}\}_{j=1}^{n}$, the corresponding labels are unknown. Hence, for a given target image $x_t$, one possibility is to train the encoding and classification networks so as to optimize the conditional log-likelihood $\log p_\theta(y_t \mid x_t)$ with respect to the parameters $\theta$ as well as the unknown labels $y_t$. However, such a model may collapse to a single label that is assigned probability one for every target image.

To prevent this from happening, we enforce a prior on the target space. To obtain the prior, we assume that the distribution of the items in the source domain over the labels is approximately the same as the distribution of the items in the target domain. Hence, to obtain the prior over the labels in the target domain, we compute the frequency of each label in the source domain and normalize it as follows:

$$p(y) = \frac{1}{m} \sum_{i=1}^{m} \mathbb{1}\big[y_s^{(i)} = y\big] \qquad (1)$$
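The prior described above is a normalized label histogram over the source domain. A small sketch, with the hypothetical helper name `label_prior` and toy labels chosen for illustration:

```python
from collections import Counter

def label_prior(source_labels, num_classes):
    """Normalized frequency of each label in the source domain."""
    counts = Counter(source_labels)
    total = len(source_labels)
    return [counts.get(k, 0) / total for k in range(num_classes)]

# Toy source labels (hypothetical): class 2 is three times as frequent as class 0.
prior = label_prior([0, 1, 1, 2, 2, 2], num_classes=3)
print(prior)
```

A class that never occurs in the source labels receives prior zero here, so any later use of the logarithm of the prior would need smoothing; that handling is left out of this sketch.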

The resultant distribution is given by

$$q_\theta(x_t, y_t) = \frac{p_\theta(y_t \mid x_t)\, p(y_t)}{\sum_{j=1}^{n} p_\theta\big(y_t \mid x_t^{(j)}\big)}$$

The denominator ensures that if we marginalize the distribution over all the training inputs $x_t^{(1)}, \ldots, x_t^{(n)}$, the resultant distribution equals the prior $p(y_t)$. Note that unlike $p_\theta(y_t \mid x_t)$, $q_\theta$ defines a non-trivial joint distribution on $(x_t, y_t)$ pairs. We maximize the joint log-likelihood of the above distribution with respect to the parameters $\theta$ as well as the unknown labels $y_t^{(1)}, \ldots, y_t^{(n)}$:

$$L_t\big(\theta, y_t^{(1)}, \ldots, y_t^{(n)}\big) = \sum_{i=1}^{n} \log q_\theta\big(x_t^{(i)}, y_t^{(i)}\big)$$

The maximization of the above objective happens in two steps. In the first step, we maximize the above objective with respect to the label $y_t^{(i)}$ for each target image $x_t^{(i)}$. This corresponds to an assignment of labels to each of the points in the target domain. This is achieved by computing the objective in (2.4) for each label and selecting the label that maximizes the objective. The label selection is performed in parallel for each point in the batch as follows:

$$\hat{y}_t^{(i)} = \arg\max_{y} \; q_\theta\big(x_t^{(i)}, y\big) \qquad (2)$$

Since it is expensive to compute the denominator for each iteration of training, we evaluate the denominator over a minibatch. This step is referred to as the label selection step in the sequel.

Once the label has been obtained for each point, we fix the labels, and train the encoding and classification networks so as to maximize (2.4) with respect to the parameters $\theta$. We refer to this step as the maximization step in the sequel. The label selection step as well as the maximization step are performed during each iteration of training.
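The label selection step can be sketched on a single minibatch as follows. This is an illustration under stated assumptions rather than the authors' implementation: the function name `select_labels` is hypothetical, the class probabilities are assumed to be strictly positive, and the denominator is estimated on the minibatch as described above.

```python
import math

def select_labels(probs, prior):
    """Pick, for each target sample, the label maximizing log q_theta(x, y).

    probs[i][k] is p_theta(y = k | x_t^(i)) for the minibatch,
    prior[k] is the label prior p(y = k).
    """
    num_classes = len(prior)
    # Minibatch estimate of the denominator: sum_j p_theta(y = k | x_t^(j)).
    denom = [sum(row[k] for row in probs) for k in range(num_classes)]
    labels, objective = [], 0.0
    for row in probs:
        scores = [
            math.log(row[k]) + math.log(prior[k]) - math.log(denom[k])
            for k in range(num_classes)
        ]
        best = max(range(num_classes), key=scores.__getitem__)
        labels.append(best)
        objective += scores[best]  # value to be maximized w.r.t. theta
    return labels, objective

# Toy minibatch (hypothetical): the classifier prefers class 0 for every sample.
probs = [[0.6, 0.4], [0.55, 0.45], [0.9, 0.1]]
labels, objective = select_labels(probs, prior=[0.5, 0.5])
print(labels, objective)
```

In this toy batch a plain argmax over the classifier output would assign every sample to class 0, whereas dividing by the per-class batch total moves the two least confident samples to class 1. This is the mechanism that discourages collapse to a single label. In the maximization step, the selected labels would be held fixed while the returned objective is increased with respect to the network parameters.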

2.5 Adversarial Regularization

In general, the labels chosen for the target images at the beginning of training can be very noisy. In order to prevent the model from overfitting to the chosen target labels, we use adversarial regularization. In particular, we train the model to be confused about a set of fake samples by maximizing the prior-weighted log-probability of the target labels for the fake samples. Hence, we add the following term to the objective of the encoding network:

$$\sum_{i=1}^{n} \sum_{y} p(y) \log p_\theta\big(y \mid \hat{x}_{fake}^{(i)}\big)$$
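A small sketch of the regularization term above, using the hypothetical function name `fake_sample_regularizer` and made-up probabilities. For a single fake sample the term is largest when the predicted distribution equals the prior, which is the sense in which the model is pushed to be confused.

```python
import math

def fake_sample_regularizer(fake_probs, prior):
    """sum_i sum_y p(y) * log p_theta(y | x_fake^(i)), to be maximized."""
    return sum(
        p_y * math.log(p_pred)
        for row in fake_probs
        for p_y, p_pred in zip(prior, row)
        if p_y > 0.0  # classes with zero prior contribute nothing
    )

prior = [0.5, 0.5]
confused = fake_sample_regularizer([[0.5, 0.5]], prior)     # prediction matches the prior
confident = fake_sample_regularizer([[0.99, 0.01]], prior)  # prediction is overconfident
print(confused, confident)
```

The confident prediction on a fake sample receives a much lower value than the confused one, so maximizing the term penalizes confident outputs on fake inputs.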

In the case of images, the fake samples are generated using a generator network that takes Gaussian noise as input. The generator is trained by minimizing a modified version of the MMD loss between the features of the penultimate layer of the encoding network for the fake batch and the real batch. The original MMD loss is given by:

$$\mathrm{MMD}^2 = \Big\| \frac{1}{n} \sum_{i=1}^{n} \rho(\hat{x}_i) - \frac{1}{m} \sum_{j=1}^{m} \rho(x_j) \Big\|^2$$

where $\rho(\cdot)$ denotes the penultimate-layer features, $\hat{x}_i$ the fake samples and $x_j$ the real samples. In our experiments, we use MMD with a Gaussian kernel as the metric:

$$\mathrm{KMMD}^2 = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} k\big(\rho(\hat{x}_i), \rho(\hat{x}_j)\big) + \frac{1}{m^2} \sum_{i=1}^{m} \sum_{j=1}^{m} k\big(\rho(x_i), \rho(x_j)\big) - \frac{2}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} k\big(\rho(x_i), \rho(\hat{x}_j)\big)$$

where $k(\cdot, \cdot)$ is the Gaussian kernel. For text documents, fake samples are generated using a negative sampling technique in which words are randomly sampled according to their frequencies of occurrence in the corpus.
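The kernel MMD between fake and real feature batches can be sketched directly from its three terms. The kernel parameterization `exp(-||u - v||^2 / (2 * sigma^2))` and the bandwidth `sigma=1.0` are assumptions made for illustration, since the source does not preserve the kernel definition; the function names are hypothetical as well.

```python
import math

def gaussian_kernel(u, v, sigma=1.0):
    """Gaussian kernel with an assumed bandwidth parameterization."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def kmmd_squared(fake_feats, real_feats, sigma=1.0):
    """Squared kernel MMD between fake and real penultimate-layer features."""
    n, m = len(fake_feats), len(real_feats)
    fake_term = sum(gaussian_kernel(a, b, sigma)
                    for a in fake_feats for b in fake_feats) / n ** 2
    real_term = sum(gaussian_kernel(a, b, sigma)
                    for a in real_feats for b in real_feats) / m ** 2
    cross_term = sum(gaussian_kernel(a, b, sigma)
                     for a in real_feats for b in fake_feats) / (m * n)
    return fake_term + real_term - 2.0 * cross_term

# Toy 2-D features (hypothetical values).
real = [[0.0, 0.0], [1.0, 0.0]]
near_fake = [[0.0, 0.0], [1.0, 0.0]]
far_fake = [[5.0, 5.0], [6.0, 5.0]]
print(kmmd_squared(near_fake, real), kmmd_squared(far_fake, real))
```

The value is zero when the two batches coincide and grows as the fake features move away from the real ones, which is why minimizing it pulls the generated samples toward the real feature distribution.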

Figure 9: Digits dataset (a) MNIST, (b) USPS, (c) SVHN

The entire architecture of DEDA, comprising the encoder, the classifier and the generator, is shown in Figure 5. Initially, the labeled source domain samples are fed as input to the encoder and classifier networks, which optimize both the supervised and the unsupervised loss terms in the objective function described in equation 2.4. This step performs the pre-training of DEDA with the source domain. After a number of epochs of this step, we feed the unlabeled target domain images to the encoder and classifier networks along with the labeled source domain images. The target domain samples optimize the unsupervised objective while the source samples simultaneously optimize the supervised objective. Note that for image inputs, the generator networks are trained along with the encoder and classifier. We can choose to share the weights between the source and target generator networks. This design choice depends on the dataset used for adaptation.
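The two-phase schedule described above can be summarized as a small helper that lists which loss terms are applied at a given epoch. This is only a sketch: the function name, the loss names and the 0-based epoch convention are hypothetical, and keeping the adversarial terms active during both phases for image inputs is an assumption, since the text does not say when generator training starts.

```python
def active_losses(epoch, pretrain_epochs, image_inputs=True):
    """List the loss terms applied at a 0-based epoch (hypothetical names)."""
    if epoch < pretrain_epochs:
        # Pre-training: labeled source samples only, with both loss terms.
        losses = ["source_supervised", "source_unsupervised"]
    else:
        # Joint phase: source keeps the supervised loss, target samples
        # contribute the unsupervised loss.
        losses = ["source_supervised", "target_unsupervised"]
    if image_inputs:
        # Assumption: the fake-sample regularizer and the generator's KMMD
        # loss are active in both phases.
        losses += ["fake_sample_regularizer", "generator_kmmd"]
    return losses

for epoch in (0, 4, 5):
    print(epoch, active_losses(epoch, pretrain_epochs=5))
```

The switch epoch `pretrain_epochs` corresponds to the point at which the target data is introduced; its value here is a placeholder.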

3 Experiments on Image Datasets

The first set of experiments is performed on the digit classification datasets MNIST, USPS and SVHN, each of which has 10 classes. MNIST consists of about 60000 training images and 10000 test images, whereas USPS is a smaller dataset comprising 7438 training images and 1860 test images; both are used at the same image size. The SVHN dataset consists of RGB digit images, with 73257 training and 26032 test images, which are converted to grayscale. We consider three adaptation directions: MNIST → USPS, USPS → MNIST and SVHN → MNIST. The encoder and classifier network for the digits datasets is the same as the one used in ADDA (Tzeng et al., 2017). Weights are not shared between the source and target generators for digits adaptation. We train the model with all the labeled source images from the train set. The results obtained are summarized in Table 1. The proposed model achieves superior performance in comparison with recent unsupervised domain adaptation methods.

| Method | MNIST → USPS | USPS → MNIST | SVHN → MNIST |
|---|---|---|---|
| RevGrad (Ganin & Lempitsky, 2015) | 77.10 | 73.00 | 73.90 |
| CoGAN (Liu & Tuzel, 2016) | 91.20 | 89.10 | - |
| ADDA (Tzeng et al., 2017) | 89.40 | 90.10 | 76.00 |
| CyCADA (Hoffman et al., 2017) | 95.60 | 96.50 | 90.40 |
| DEDA (ours) | 98.79 | 97.26 | 91.61 |

Table 1: Domain Adaptation on Digits Datasets
Figure 12: Nine classes of (a) CIFAR and (b) STL datasets

Another set of experiments was performed on domain adaptation between the CIFAR-10 and STL datasets, which are also image datasets with 10 classes each. STL images are downscaled to match the image size of the CIFAR-10 dataset. The 'frog' class in CIFAR-10 and the 'monkey' class in STL were removed as they have no equivalent in the other dataset, which reduces the problem to a 9-class classification task. The total number of train and test images is therefore 45000 and 9000 respectively for CIFAR, and 4500 and 7200 respectively for STL. The network architecture used here is the same as the one used in (Pandey & Dukkipati, 2017b). Here we use a single generator for adaptation, i.e. the source and target generators share weights. Note that no additional data augmentation is performed on these datasets. The results obtained when training on the full source data are summarized in Table 2.
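The class alignment described above can be sketched as a pair of index maps onto a shared 9-class label space. The class lists follow the standard CIFAR-10 and STL-10 orderings; the helper name `shared_label_maps`, the alphabetical ordering of the shared classes and the treatment of 'automobile' and 'car' as the same class are assumptions made for illustration. The sample counts in the text follow from removing one of ten balanced classes: 50000 × 9/10 = 45000 and 10000 × 9/10 = 9000 for CIFAR-10, and 5000 × 9/10 = 4500 and 8000 × 9/10 = 7200 for STL-10.

```python
CIFAR10 = ["airplane", "automobile", "bird", "cat", "deer",
           "dog", "frog", "horse", "ship", "truck"]
STL10 = ["airplane", "bird", "car", "cat", "deer",
         "dog", "horse", "monkey", "ship", "truck"]

def canonical(name):
    """Use one name for the class that the two datasets spell differently."""
    return "car" if name == "automobile" else name

def shared_label_maps(cifar_names, stl_names):
    """Map original class indices of each dataset to a shared label space."""
    shared = sorted({canonical(n) for n in cifar_names} & set(stl_names))
    index = {name: i for i, name in enumerate(shared)}
    cifar_map = {i: index[canonical(n)]
                 for i, n in enumerate(cifar_names) if canonical(n) in index}
    stl_map = {i: index[n] for i, n in enumerate(stl_names) if n in index}
    return shared, cifar_map, stl_map

shared, cifar_map, stl_map = shared_label_maps(CIFAR10, STL10)
print(shared)
```

Samples whose original index is missing from a map ('frog' in CIFAR-10, 'monkey' in STL-10) would be dropped before training.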

| Method | CIFAR → STL | STL → CIFAR |
|---|---|---|
| RevGrad (Ganin & Lempitsky, 2015) | 66.12 | 56.91 |
| DRCN (Ghifary et al., 2016) | 66.37 | 58.65 |
| DEDA (ours) | 76.35 | 73.56 |

Table 2: Domain Adaptation on CIFAR → STL

We compute the baseline of the transfer learning task as the target domain test accuracy when the model is trained only on the source domain using the supervised loss. This is reported as 'Source Only' in Tables 3 and 4. In the 'Source Only' task, the target domain images are used only for testing, i.e. neither the source nor the target domain images are trained with the unsupervised loss or adversarial regularization. Our next task is to test the accuracy on the target domain when the proposed model is trained only on the source domain, but using both the supervised and the unsupervised loss. We call this 'Zero-shot' domain adaptation, as reported in Tables 3 and 4. The target accuracy improves over the baseline when the unsupervised loss is introduced in the source domain.

We also evaluated the performance of domain adaptation when only 10% of the source images have labels. The results are reported in Tables 3 and 4. Note that for the digits datasets, the proposed model gives state-of-the-art results in domain adaptation even with very few labels in the source domain. However, for CIFAR and STL, the performance degrades when only a few labels are provided in the source domain. This may be because of the large variance among samples in the CIFAR and STL datasets, whereas there is less domain shift in the digits datasets. When the source domain has a large number of samples, we obtain relatively high accuracy for adaptation, as observed in the case of MNIST → USPS and CIFAR → STL. The effect of adversarial regularization is explored by removing the generator network and its associated loss functions while training DEDA. The results given in Tables 3 and 4 indicate an improvement in performance when adversarial regularization is used.

Figure 22 shows t-SNE embeddings of the final layer of the classifier of the proposed model before the softmax function is applied. We plot the embeddings for 500 source and target domain test images. The classes are represented with different colors, while the domains have differently shaped markers in the plot. Initially, all the embeddings are random for both domains, as shown in Figure 22 (a, d, g). We show the embeddings of the baseline experiment 'Source Only' in Figure 22 (b, e, h). After training our model, the t-SNE embeddings of the source and target domains overlap, with clear separation between classes, as shown in Figure 22 (c, f, i). This shows that the proposed model forms tight clusters in the latent space, where samples of the same class are placed together regardless of the domain.

| Method: DEDA | MNIST → USPS | USPS → MNIST | SVHN → MNIST |
|---|---|---|---|
| Source Only | 93.98 | 76.60 | 70.73 |
| Zero-shot | 96.24 | 88.67 | 74.10 |
| 10% labeled source | 98.49 | 93.06 | 92.66 |
| w/o Adversarial Reg. | 96.02 | 95.38 | 87.91 |

Table 3: Domain Adaptation Experiments on Digits Datasets using the proposed model

| Method: DEDA | CIFAR → STL | STL → CIFAR |
|---|---|---|
| Source Only | 70.79 | 48.37 |
| Zero-shot | 74.08 | 52.19 |
| 10% labeled source | 58.74 | 62.86 |
| w/o Adversarial Reg. | 73.01 | 54.46 |

Table 4: Domain Adaptation Experiments on CIFAR → STL using the proposed model
Figure 22: t-SNE embeddings of Domain Adaptation of image datasets. The plots (a), (d) and (g) show the initial plot before training. The plots (b), (e) and (h) show the embeddings when trained in 'source only' mode with supervised loss. The plots (c), (f) and (i) show the representations after performing full training with DEDA. (Best viewed in color)

The network architectures used for the two domain adaptation experiments are listed in Tables 5 and 6. The network is trained with the Adam optimizer at a fixed learning rate, using a batch size of 128. Training is performed for a fixed maximum number of epochs with early stopping. Another hyperparameter used for training is the epoch at which the target data is introduced, which is set separately for the digits adaptation and for the CIFAR → STL adaptation.

Generator
noise input
ConvTranspose2d, stride=1, pad=0, BN
ConvTranspose2d, stride=2, pad=1, BN
ConvTranspose2d, stride=2, pad=1, BN
ConvTranspose2d, stride=2, pad=1, Tanh
Encoder + Classifier
input image
Conv2d, MaxPool, ReLU
Conv2d, Dropout=0.5, MaxPool, ReLU
Linear 500, ReLU, Dropout=0.5
Linear 10

*BN = BatchNorm

Table 5: Network architecture used for domain adaptation on digits dataset
Generator
noise input
ConvTranspose2d, stride=1, pad=0, BN
ConvTranspose2d, stride=2, pad=1, BN
ConvTranspose2d, stride=2, pad=1, BN
ConvTranspose2d, stride=2, pad=1, Tanh
Encoder + Classifier
input image
Dropout = 0.2
Conv2d, stride=1, WN, LeakyReLU
Conv2d, stride=1, WN, LeakyReLU
Conv2d, stride=2, WN, LeakyReLU
Dropout = 0.5
Conv2d, stride=1, WN, LeakyReLU
Conv2d, stride=1, WN, LeakyReLU
Conv2d, stride=2, WN, LeakyReLU
Dropout = 0.5
Conv2d, stride=1, WN, LeakyReLU
Conv2d, stride=1, WN, LeakyReLU
Conv2d, stride=1, WN, LeakyReLU
Global Avg Pool
Dropout = 0.5
Linear 9, WN

*BN = BatchNorm, WN = WeightNorm

Table 6: Network architecture used for domain adaptation on CIFAR → STL dataset
| Source → Target | DANN (Ganin & Lempitsky, 2015) | CMD (Zellinger et al., 2017) | Source Only: DEDA (ours) | Zero-shot: DEDA (ours) | DEDA (ours): without neg. sampling | DEDA (ours): with neg. sampling |
|---|---|---|---|---|---|---|
| Books → Dvd | .784 | - | .810 | .816 | .819 | .836 |
| Books → Electronics | .733 | - | .746 | .742 | .795 | .832 |
| Books → Kitchen | .779 | - | .776 | .778 | .817 | .851 |
| Dvd → Books | .723 | - | .764 | .782 | .795 | .818 |
| Dvd → Electronics | .754 | - | .751 | .755 | .807 | .837 |
| Dvd → Kitchen | .783 | - | .798 | .801 | .833 | .855 |
| Electronics → Books | .713 | - | .732 | .741 | .749 | .753 |
| Electronics → Dvd | .738 | - | .737 | .733 | .744 | .774 |
| Electronics → Kitchen | .854 | - | .863 | .856 | .875 | .880 |
| Kitchen → Books | .709 | - | .721 | .723 | .740 | .772 |
| Kitchen → Dvd | .740 | - | .754 | .753 | .780 | .799 |
| Kitchen → Electronics | .843 | - | .855 | .865 | .859 | .874 |

Table 7: Domain Adaptation Experiments on Amazon Reviews Dataset using DEDA
| Source → Target | DANN (mSDA) | DEDA (mSDA) |
|---|---|---|
| Books → Dvd | .829 | .839 |
| Books → Electronics | .804 | .829 |
| Books → Kitchen | .843 | .860 |
| Dvd → Books | .825 | .829 |
| Dvd → Electronics | .809 | .839 |
| Dvd → Kitchen | .849 | .860 |
| Electronics → Books | .774 | .775 |
| Electronics → Dvd | .781 | .799 |
| Electronics → Kitchen | .881 | .883 |
| Kitchen → Books | .718 | .796 |
| Kitchen → Dvd | .789 | .807 |
| Kitchen → Electronics | .856 | .861 |

Table 8: Domain Adaptation Experiments on Amazon Reviews Dataset using DEDA (ours) with mSDA representation
Figure 25: Final layer representation for Domain Adaptation of Books → Electronics domains of Amazon review dataset. (a) shows the initial plot before adaptation, and (b) shows the representations after adaptation with DEDA.

4 Experiments on Text Dataset

We performed experiments on the Amazon reviews dataset, which includes product reviews in four different domains (books, DVD, electronics and kitchen appliances). After preprocessing according to (Chen et al., 2012), the product reviews are encoded as 5000-dimensional bag-of-words feature vectors of unigrams and bigrams. The labels are binary: '0' when the product is rated 1 to 3 stars and '1' when it is rated 4 or 5 stars. We perform twelve domain adaptation tasks, with each of the four domains as the source and each of the other domains as the target. We have 2000 labeled source examples and 2000 unlabeled target examples. The target test set has between 3000 and 6000 samples for each domain. The encoder and classifier network is the same as the one used in DANN (Ganin & Lempitsky, 2015), with a fully connected hidden layer of 50 nodes, sigmoid activation functions and a softmax output function.

We performed the following domain adaptation tasks on the Amazon reviews dataset; the results of these experiments are compared in Table 7. The 'Source Only' experiment is the baseline of transfer learning, where DEDA is trained to optimize the supervised loss only, using labeled source domain inputs, and is tested on the target inputs. Next, DEDA is trained on the labeled source domain using the supervised as well as the unsupervised objective, combined with negative sampling of the source domain inputs. We use the target domain samples only while testing. This is equivalent to zero-shot domain adaptation and is reported as 'Zero-shot' in Table 7. Unlike in the image experiments, not much improvement was observed in 'Zero-shot' compared to the baseline. Note that 'Source Only' training performs better than the domain adapted version of DANN. In the next experiment, we introduce unlabeled target domain samples and train on them using the unsupervised loss. This is performed without negative sampling, and obtains better target test accuracy than CMD (Zellinger et al., 2017) in most of the domain adaptation tasks. Finally, we combine negative sampling with the supervised and unsupervised loss. This yields a significant improvement in accuracy over the previous methods, as reported in Table 7.

Figure 25 shows the scatter plot of the two-dimensional final layer of the DEDA classifier before the softmax function is applied. Figure 25 (a) and (b) correspond to the initial representation before training and the 'Source Only' baseline respectively. Figure 25 (c) shows that after training with DEDA, the source (Books) and target (Electronics) domains are aligned and the classes are well separated.
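The negative sampling used to produce fake text samples, described in Section 2.5 as drawing words according to their corpus frequencies, can be sketched as below. The function name `negative_sample`, the toy corpus and the fixed document length are illustrative assumptions; in the experiments the sampled words would still need to be encoded with the same bag-of-words features as the real reviews.

```python
import random
from collections import Counter

def negative_sample(corpus_docs, doc_length, rng):
    """Draw a fake document by sampling words in proportion to corpus frequency."""
    counts = Counter(word for doc in corpus_docs for word in doc)
    vocab = list(counts)
    weights = [counts[word] for word in vocab]
    return rng.choices(vocab, weights=weights, k=doc_length)

# Toy tokenized corpus (hypothetical).
docs = [["great", "book", "great", "plot"], ["poor", "battery", "great", "screen"]]
fake_doc = negative_sample(docs, doc_length=5, rng=random.Random(0))
print(fake_doc)
```

Because the words are drawn independently, a fake document keeps the corpus-level word statistics but carries no coherent sentiment, which makes it a natural input on which the classifier should stay confused.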

We also tested the domain adaptation performance on the Marginalized Stacked Denoising Autoencoder (mSDA) (Chen et al., 2012) representation, similar to the experiment performed in DANN. The samples are now encoded as 30,000-dimensional real-valued vectors in the mSDA representation. Here we train DEDA with the supervised loss for the source domain and the unsupervised loss for both the source and target domains. Negative sampling cannot be used in this task because the representation is real-valued. It is also infeasible to use adversarial regularization for domain adaptation in the high dimensional mSDA representation. The results are reported in Table 8.

The hyperparameters chosen for these experiments are the learning rate, the batch size, the total number of epochs (with early stopping) and the number of epochs used for the pre-training step.

5 Conclusion

In this paper, we proposed an approach for unsupervised domain adaptation that directly addresses the problem of learning a classifier in the unlabeled target domain. This is in contrast with previously employed models that force the source and target domains to have a common representation. Despite being relatively straightforward, the proposed model achieves state-of-the-art performance on popular domain adaptation datasets covering both text and images. However, the robustness of DEDA on datasets with a larger number of classes is currently not known.

References