The recent success of deep neural networks on supervised learning tasks in areas such as computer vision, speech, and natural language processing can be attributed to models trained on large amounts of labeled data. However, acquiring large amounts of labeled data in some domains can be very expensive or outright impossible. Additionally, the time required to label enough data for existing deep learning techniques to work can be prohibitive for a new domain. This is referred to as cold-start. On the contrary, cost-effective unlabeled data can be obtained easily and in large amounts for most new domains. One can therefore aim to transfer knowledge from a labeled source domain to perform tasks on an unlabeled target domain.
To study this, several approaches have recently been explored under the purview of transductive transfer learning, such as domain adaptation, sample selection bias, and covariate shift. In this work, we study unsupervised domain adaptation by learning contrastive features in the unlabeled target domain in a fully unsupervised manner, utilizing pre-existing informative knowledge from the labeled source domain. Existing domain adaptation approaches mostly rely on domain alignment, i.e., aligning both domains so that they are superimposed and indistinguishable. This domain alignment can be achieved in three main ways: (a) discrepancy-based methods [DBLP:conf/iccv/HausserFMC17, 8578490, french2018selfensembling, DBLP:journals/corr/LouizosSLWZ15, 2017arXiv170208811Z], (b) reconstruction-based methods [10.1007/978-3-319-46493-0_36, Bousmalis:2016:DSN:3157096.3157135], and (c) adversarial adaptation methods [pmlr-v37-ganin15, NIPS2016_6544, 8099799, DBLP:conf/cvpr/Sankaranarayanan18a, DBLP:conf/cvpr/LiuYFWCW18, Russo_2018_CVPR, pmlr-v80-hoffman18a, xie2018learning, NIPS2018_7436, DBLP:conf/aaai/ChenCJJ19, shu2018a, hosseini-asl2018augmented].
Unlike the above methods, our main motivation comes from the human ability to 'contradistinguish' and from the fundamental idea of statistical learning described by V. Vapnik [vapnik1999overview]: any desired problem should be solved in the most direct way possible, rather than by solving a more general intermediate task. In the context of domain adaptation, the desired problem is classification on the unlabeled target domain, and the domain alignment performed by most standard methods is the general intermediate task. This motivates us to propose an approach that does not require domain alignment.
Our main contributions in this paper are as follows:
We propose a simple method that directly addresses the problem of domain adaptation by learning a single classifier, which we refer to as the Contradistinguisher (CTDR), jointly in an unsupervised manner over the unlabeled target space and in a supervised manner over the labeled source space, thereby overcoming the drawbacks of distribution-alignment-based techniques.
We formulate a 'contradistinguish loss' to directly utilize the unlabeled target domain and address the classification task using unsupervised feature learning. A similar approach, called DisCoder [Pandey2017UnsupervisedFL], was used for the much simpler task of semi-supervised feature learning on a single domain with no domain distribution shift.
From our experiments, we show that by jointly training the CTDR on the source and target domain distributions, we can achieve results above or on par with several methods. Surprisingly, this simple method improves over the state of the art on eight challenging benchmark datasets in visual domains (USPS [lecun1989backpropagation], MNIST [lecun1998gradient], SVHN, SYNNUMBERS [pmlr-v37-ganin15], CIFAR-10 [krizhevsky2009learning], STL-10 [coates2011analysis], SYNSIGNS [pmlr-v37-ganin15] and GTSRB [Stallkamp-IJCNN-2011]) and four benchmark language domains (Books, DVDs, Electronics, and Kitchen Appliances) of the Amazon customer reviews sentiment analysis dataset [blitzer2006domain].
In Section III, we discuss the problem formulation, architecture, loss function definitions, algorithms, and complexity analysis of our proposed method CUDA. Section IV deals with the experimental setup, results, and analysis on vision and language domains. Finally, in Section V, we conclude by highlighting the key contributions of CUDA.
II Related Work
As mentioned earlier, almost all domain adaptation approaches rely on domain alignment techniques. Here, we briefly discuss the three main techniques of domain alignment.
(a) Discrepancy-based methods: Self-Ensembling (SE) [french2018selfensembling] uses a student-teacher network pair with heavy reliance on data augmentation to minimize the discrepancy between student and teacher network predictions. Variational Fair Autoencoder (VFAE) [DBLP:journals/corr/LouizosSLWZ15] uses a Variational Autoencoder (VAE) [DBLP:journals/corr/KingmaW13]
with Maximum Mean Discrepancy (MMD) to obtain domain-invariant features. Central Moment Discrepancy (CMD) [2017arXiv170208811Z] proposes to match higher-order moments of the source and target domain distributions. (b) Reconstruction-based methods: Deep Reconstruction-Classification Networks (DRCN) [10.1007/978-3-319-46493-0_36] and Domain Separation Networks (DSN) [Bousmalis:2016:DSN:3157096.3157135] learn shared encodings of the source and target domains using reconstruction networks. (c) Adversarial adaptation methods: Reverse Gradient (RevGrad/DANN) [pmlr-v37-ganin15, ganin2016domain] uses a domain discriminator to learn domain-invariant representations of both domains. Coupled Generative Adversarial Network (CoGAN) [NIPS2016_6544] uses a Generative Adversarial Network (GAN) [Goodfellow:2014:GAN:2969033.2969125] to obtain domain-invariant features for classification. Adversarial Discriminative Domain Adaptation (ADDA) uses GANs along with weight sharing to learn domain-invariant features. Generate to Adapt (G2A) [DBLP:conf/cvpr/Sankaranarayanan18a] learns to generate the equivalent image in the other domain for a given image, thereby learning common domain-invariant embeddings. Cross-Domain Representation Disentangler (CDRD) [DBLP:conf/cvpr/LiuYFWCW18] learns cross-domain disentangled features for domain adaptation. Symmetric Bi-Directional Adaptive GAN (SBADA-GAN) [Russo_2018_CVPR] aims to learn symmetric bidirectional mappings between the domains by trying to mimic a target image given a source image. Cycle-Consistent Adversarial Domain Adaptation (CyCADA) [pmlr-v80-hoffman18a] adapts representations at both the pixel level and the feature level across the domains. Moving Semantic Transfer Network (MSTN) [xie2018learning] learns semantic representations for the unlabeled target samples by aligning labeled source centroids with pseudo-labeled target centroids.
Conditional Domain Adversarial Network (CDAN) [NIPS2018_7436] conditions the adversarial adaptation models on discriminative information conveyed in the classifier predictions. Joint Discriminative Domain Adaptation (JDDA) [DBLP:conf/aaai/ChenCJJ19] proposes joint domain alignment along with discriminative feature learning. Decision-boundary Iterative Refinement Training with a Teacher (DIRT-T) [shu2018a] and Augmented Cyclic Adversarial Learning (ACAL) [hosseini-asl2018augmented] learn by using a domain discriminator along with data augmentation for domain adaptation.
Apart from these standard approaches, a slightly different line of work is tri-training. Tri-training algorithms use three classifiers trained on the labeled source domain and refine them for the unlabeled target domain. To be precise, in each round of tri-training, a target sample is pseudo-labeled if two of the classifiers agree on its label, under certain conditions such as confidence thresholding. Asymmetric Tri-Training (ATT) [pmlr-v70-saito17a] uses three classifiers to bootstrap high-confidence target domain samples by confidence thresholding. This way of bootstrapping works only if the source classifier has very high accuracy. In the case of low source classifier accuracy, target samples are never obtained for bootstrapping, resulting in a bad model. Multi-Task Tri-training (MT-Tri) [DBLP:conf/acl/PlankR18] explores the tri-training technique on language domain adaptation tasks.
All the domain adaptation approaches mentioned earlier share a common unifying theme: they attempt to morph the target and source distributions so as to make them indistinguishable. Once the two distributions are perfectly aligned, a classifier trained on the labeled source domain is used to classify the unlabeled target domain. Hence, the performance of the classifier on the target domain depends crucially on the domain alignment. As a result, the actual task of target domain classification is solved indirectly via domain alignment, rather than by using the unlabeled target data in an unsupervised manner, which would be a more logical and direct way.
In this paper, we propose a completely different approach: instead of focusing on aligning the source and target distributions, we learn a single classifier, referred to as the Contradistinguisher (CTDR), jointly on both domain distributions using the contradistinguish loss for the unlabeled target data and a supervised loss for the labeled source data.
III Proposed Method: CUDA
A domain $\mathcal{D}$ is specified by its input feature space $\mathcal{X}$, its label space $\mathcal{Y}$, and the joint probability distribution $p(\mathbf{x}, y)$, where $\mathbf{x} \in \mathcal{X}$ and $y \in \mathcal{Y}$. Let $K = |\mathcal{Y}|$ be the number of class labels, so that $y \in \{1, \ldots, K\}$ for any instance $\mathbf{x}$. Domain adaptation consists of two domains, $\mathcal{D}_s$ and $\mathcal{D}_t$, referred to as the source and target domains respectively. A common assumption in domain adaptation is that the input feature space as well as the label space remains unchanged across the source and the target domain, i.e., $\mathcal{X}_s = \mathcal{X}_t = \mathcal{X}$ and $\mathcal{Y}_s = \mathcal{Y}_t = \mathcal{Y}$. Hence, the only difference between the source and target domains lies in their input-label space distributions, i.e., $p_s(\mathbf{x}, y) \neq p_t(\mathbf{x}, y)$. This is referred to as domain shift in the standard domain adaptation literature.
In unsupervised domain adaptation in particular, the training data consists of labeled source domain instances $\{(\mathbf{x}_s^i, y_s^i)\}_{i=1}^{n_s}$ and unlabeled target domain instances $\{\mathbf{x}_t^j\}_{j=1}^{n_t}$. Given labeled data in the source domain, it is straightforward to learn a classifier by maximizing the conditional probability $p(y \mid \mathbf{x})$ over the labeled samples. However, the task at hand is to learn a classifier on the unlabeled target domain by transferring knowledge from the labeled source domain.
Figure 1 shows the model architecture of our proposed method CUDA, i.e., the Contradistinguisher (CTDR), and the respective losses involved in CUDA training.
The objective of the CTDR is to find a clustering scheme using the most contrastive features on the unlabeled target domain in such a way that it also satisfies the target domain prior, i.e., prior enforcing. We achieve this by jointly training on labeled source samples in a supervised manner and on unlabeled target samples in an unsupervised, end-to-end manner, using the same contradistinguish loss as [Pandey2017UnsupervisedFL]. This fine-tunes the classifier learnt on the source domain to the target domain. The most important feature of our approach is the contradistinguish loss (5), which is discussed in detail in Section III-C.
Note that the objective of the CTDR is not the same as that of a classifier, i.e., distinguishing is not the same as classifying. Suppose there are two contrastive entities $e_1$ belonging to class $c_1$ and $e_2$ belonging to class $c_2$. The aim of a classifier is to classify $e_1$ as $c_1$ and $e_2$ as $c_2$, and training a classifier requires labeled data. On the contrary, the job of a CTDR is just to identify that $e_1$ and $e_2$ are different, i.e., a CTDR may assign $e_1$ to $c_1$ (or $c_2$) and $e_2$ to $c_2$ (or $c_1$) indifferently. To train a CTDR, we do not need any class information, only the unlabeled entities $e_1$ and $e_2$. Using unlabeled target data, the CTDR learns to distinguish samples in an unsupervised way. However, since the final task is classification, one requires a selective incorporation of the pre-existing informative knowledge needed for classification. This knowledge is obtained through joint training, which results in classifying $e_1$ as $c_1$ and $e_2$ as $c_2$.
III-B Supervised Source Classification
For the labeled source domain instances $(\mathbf{x}_s, y_s)$, we define the conditional likelihood of observing $y_s$ given $\mathbf{x}_s$ as $p_\theta(y_s \mid \mathbf{x}_s)$, where $\theta$ denotes the parameters of the CTDR.
We estimate $\theta$ by maximizing the conditional log-likelihood of observing the labels given the labeled source domain samples, i.e., the source domain supervised objective is to maximize $\sum_{i=1}^{n_s} \log p_\theta(y_s^i \mid \mathbf{x}_s^i)$.
Alternatively, one can minimize the equivalent cross-entropy loss (2),
where $p_\theta(y \mid \mathbf{x}_s)$ is the softmax output of the CTDR that represents the probability of class $y$ for the given sample $\mathbf{x}_s$.
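As an illustrative sketch (the toy logits and function names below are our own stand-ins, not the authors' implementation), the source supervised objective reduces to the standard cross-entropy loss over labeled source mini-batches in PyTorch:

```python
import torch
import torch.nn.functional as F

def source_supervised_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Cross-entropy over a labeled source mini-batch; minimizing this is
    equivalent to maximizing the source conditional log-likelihood, since
    cross-entropy is the negative log-likelihood of the softmax output."""
    return F.cross_entropy(logits, labels)

# Toy check: logits strongly favoring the true class yield a small loss.
logits = torch.tensor([[5.0, 0.0], [0.0, 5.0]])
labels = torch.tensor([0, 1])
loss = source_supervised_loss(logits, labels)
```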
III-C Unsupervised Target Classification
For the unlabeled target domain instances $\mathbf{x}_t$, the corresponding labels are unknown, so a naive way of predicting the target labels is to directly use the classifier trained only with the supervised loss (2). Though this gives some good results, it fails to achieve high accuracy for two reasons: (i) $p_\theta(y \mid \mathbf{x})$ is learnt over the source domain $\mathcal{D}_s$ and not the target domain $\mathcal{D}_t$, and (ii) it does not by itself define a valid joint probability distribution over the target domain.
Enforcing these two conditions, we model a non-trivial joint distribution $\hat{p}_\theta(\mathbf{x}_t, y_t)$, parameterized by $\theta$, over the target domain as in (3).
However, (3) is not exactly a joint distribution yet, because marginalizing it over all $\mathbf{x}_t$ should yield the target prior distribution $p(y_t)$. We therefore modify (3) to include this marginalization condition. We refer to this as target domain prior enforcing.
Note that the resulting distribution defines a non-trivial approximation of the joint distribution over the target domain as a function of the classifier learnt over the source domain. The unsupervised maximization objective for the target domain is then obtained by maximizing the log-probability of this joint distribution, given in (5).
Next, we discuss how the objective (5) is solved and why (5) is referred to as the contradistinguish loss. Since the target labels are unknown, one needs to maximize (5) over the parameters $\theta$ as well as the unknown target labels $\hat{y}_t$. As there are two sets of unknowns, we follow a two-step approach to maximize (5). The two optimization steps are as follows.
Ideally, one would like to compute the third term in (7) over the complete target training data for each input sample. Since this is expensive to do for each individual sample during training, one evaluates the third term in (7) over a mini-batch. In our experiments, we observed that the mini-batch strategy does not cause any problems during training as long as the mini-batch includes at least one sample from each class, which is guaranteed for a reasonably large mini-batch size. For numerical stability, we use the log-sum-exp trick to optimize the third term in (7).
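The two optimization steps and the mini-batch evaluation of the third term in (7) can be sketched as follows in PyTorch. This is an illustrative simplification under our own naming (the exact form of (5)-(7), including the prior-enforcing term, is given in the paper), not the authors' implementation:

```python
import torch
import torch.nn.functional as F

def contradistinguish_step(logits: torch.Tensor) -> torch.Tensor:
    """Sketch of the two-step target objective over one mini-batch.
    Step 1: with the parameters fixed, assign each unlabeled target
    sample its most likely pseudo-label. Step 2: maximize each sample's
    log-probability under its pseudo-label while contrasting it against
    the same label's scores over the whole mini-batch (the third term
    in (7)), evaluated with log-sum-exp for numerical stability."""
    log_probs = F.log_softmax(logits, dim=1)            # log p(y | x_t)
    pseudo = log_probs.argmax(dim=1)                    # step 1: pseudo-labels
    chosen = log_probs.gather(1, pseudo.unsqueeze(1)).squeeze(1)
    # third-term analogue: log sum over the mini-batch of p(pseudo_i | x_j)
    batch_term = torch.logsumexp(log_probs[:, pseudo], dim=0)
    return -(chosen - batch_term).mean()                # step 2: minimize

logits = torch.randn(8, 3)
loss = contradistinguish_step(logits)
```

In a real training loop this would be evaluated per mini-batch alongside the supervised source loss; the target prior-enforcing term of (5) is omitted here for brevity.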
III-D Adversarial Regularization
In order to prevent the CTDR from over-fitting to the chosen pseudo-labels during training, we use adversarial regularization. In particular, we train the CTDR to be confused about a set of fake samples by maximizing the conditional log-probability of a given fake sample belonging to all classes simultaneously. The objective of the adversarial regularization is to multi-label each fake sample (e.g., a noisy image that looks like both a cat and a dog) equally across all classes, as labeling it with any unique class would introduce more noise into the pseudo-labels. This strategy is similar to entropy regularization [grandvalet2005semi] in the sense that, instead of minimizing the entropy for the real target samples, we maximize the conditional log-probability over the fake samples. We therefore add the following maximization objective to the total CTDR objective as a regularizer.
for all fake samples $\hat{\mathbf{x}}_t$ and classes $y \in \{1, \ldots, K\}$. As maximizing (8) is analogous to minimizing the binary cross-entropy loss (9) of a multi-class multi-label classification task, in our practical implementation we minimize (9), assigning the labels of all classes to every fake sample,
where $p_\theta(y \mid \hat{\mathbf{x}}_t)$ is the softmax output of the CTDR that represents the probability of class $y$ for the given fake sample $\hat{\mathbf{x}}_t$.
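A hedged sketch of the regularizer (9), assuming an independent per-class sigmoid reading of the classifier logits (the paper's exact parameterization of the multi-label loss may differ):

```python
import torch
import torch.nn.functional as F

def adversarial_regularizer(fake_logits: torch.Tensor) -> torch.Tensor:
    """Multi-label every fake sample with all classes simultaneously:
    binary cross-entropy against an all-ones target, so the classifier
    stays equally confident in every class for fake inputs."""
    targets = torch.ones_like(fake_logits)   # every class is "on"
    return F.binary_cross_entropy_with_logits(fake_logits, targets)

# Zero logits mean probability 0.5 for each class, so the loss per
# entry is -log(0.5) = ln 2.
fake_logits = torch.zeros(4, 10)
reg = adversarial_regularizer(fake_logits)
```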
The fake samples $\hat{\mathbf{x}}_t$ can be directly sampled from, say, a Gaussian distribution in the input feature space with the mean and standard deviation of the target samples. For the language domain, fake samples are generated randomly in this way. In the case of image datasets, as the feature space is high-dimensional, the fake images are generated using a generator network $G_\phi$ with parameters $\phi$ that takes a Gaussian noise vector $\mathbf{z}$ as input and produces a fake sample $\hat{\mathbf{x}}_t = G_\phi(\mathbf{z})$. The generator is trained by minimizing the kernel MMD loss [DBLP:conf/nips/LiCCYP17], i.e., a modified version of the MMD loss between the encoder outputs of the fake images and of the real target domain images,
where the kernel $k(\cdot, \cdot)$ is the Gaussian kernel.
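The generator's MMD objective can be sketched as below. Note that [DBLP:conf/nips/LiCCYP17] uses a mixture of kernels and additional terms; this single-Gaussian-kernel version with illustrative names is a simplification:

```python
import torch

def gaussian_kernel(a: torch.Tensor, b: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """k(a, b) = exp(-||a - b||^2 / (2 sigma^2)) for all row pairs."""
    sq_dist = torch.cdist(a, b).pow(2)
    return torch.exp(-sq_dist / (2.0 * sigma ** 2))

def kernel_mmd(fake_feats: torch.Tensor, real_feats: torch.Tensor) -> torch.Tensor:
    """Squared MMD between encoder features of fake and real target
    samples: E[k(f, f')] + E[k(r, r')] - 2 E[k(f, r)]."""
    k_ff = gaussian_kernel(fake_feats, fake_feats).mean()
    k_rr = gaussian_kernel(real_feats, real_feats).mean()
    k_fr = gaussian_kernel(fake_feats, real_feats).mean()
    return k_ff + k_rr - 2.0 * k_fr

feats = torch.randn(8, 16)
zero_gap = kernel_mmd(feats, feats)   # identical feature sets -> ~0
```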
Note that the objective of the generator is not to generate realistic images but to generate fake, noisy images with mixed image attributes from the target domain. This reduces the effort of training powerful generators, which is the focus of adversarial domain adaptation approaches [DBLP:conf/cvpr/Sankaranarayanan18a, DBLP:conf/cvpr/LiuYFWCW18, Russo_2018_CVPR, pmlr-v80-hoffman18a, xie2018learning] used for domain alignment.
III-E Algorithms and Complexity Analysis
The time complexity of CUDA mainly depends on the model complexity, which involves many factors such as the input feature dimension, the number of neural network layers, the type of normalization, and the type of activation functions. The CTDR is a simple network with a single encoder and classifier, unlike MCD-DA, which uses a single encoder with two classifiers, roughly doubling its time complexity relative to CUDA. Similarly, SE uses two copies of the encoder-classifier network, one for the student and one for the teacher, again roughly doubling its time complexity. In general, as domain alignment approaches use additional circuitry, either in terms of multiple classifiers or GANs, their model complexity increases at least by a factor of two. This increased model complexity requires more data augmentation to prevent under-fitting, further increasing the time complexity at the expense of only a slight improvement, if any, compared to CUDA, as indicated by our state-of-the-art results without any data augmentation in both visual and language domain adaptation tasks. We observed empirically that most of the computational cost lies in the forward and backward propagation needed to obtain the classifier softmax output and the gradients; hence the use of GPUs to accelerate these computations. We believe the trade-off achieved by the simplicity of CUDA, as evident from our results, is very desirable compared to most domain alignment approaches that use data augmentation and complex neural networks for a slight improvement, if any.
| Dataset | # Train | # Test | # Classes | Target | Resolution | Channels |
|---|---|---|---|---|---|---|
| USPS | 7,291 | 2,007 | 10 | Digits | 16×16 | Mono |
| MNIST | 60,000 | 10,000 | 10 | Digits | 28×28 | Mono |
| SVHN | 73,257 | 26,032 | 10 | Digits | 32×32 | RGB |
| SYNNUMBERS | 479,400 | 9,553 | 10 | Digits | 32×32 | RGB |
| CIFAR-9 | 45,000 | 9,000 | 9 | Object ID | 32×32 | RGB |
| STL-9 | 4,500 | 7,200 | 9 | Object ID | 96×96 | RGB |
| SYNSIGNS | 100,000 | - | 43 | Traffic Signs | 40×40 | RGB |
| GTSRB | 39,209 | 12,630 | 43 | Traffic Signs | varies | RGB |
| Domain | # Train | # Test |
|---|---|---|
| Kitchen Appliances | 2,000 | 5,945 |
IV-A Experimental Setup
IV-A1 Visual Domain Adaptation
We consider eight benchmark visual datasets, spanning three different types of images, for our visual domain experiments. (a) Digits: USPS [lecun1989backpropagation] and MNIST [lecun1998gradient] are a pair of gray-scale digits datasets; SVHN and SYNNUMBERS [pmlr-v37-ganin15] are a pair of RGB digits datasets. (b) Objects: CIFAR [krizhevsky2009learning] and STL [coates2011analysis] form a dataset pair of object/animal RGB images, considering only the 9 classes that overlap between the original datasets. (c) Traffic Signs: SYNSIGNS [pmlr-v37-ganin15] and GTSRB [Stallkamp-IJCNN-2011] are a dataset pair of traffic signs. Table II provides the dataset details and Figure 2 shows random samples from all eight datasets.
On these datasets, we consider the eight main domain adaptation tasks studied in [pmlr-v37-ganin15, french2018selfensembling]. The eight visual tasks and the data pre-processing applied are as follows: (i) USPS ↔ MNIST: USPS images are up-scaled using bi-linear interpolation from 16×16×1 to 28×28×1 to match the size of MNIST. (ii) MNIST ↔ SVHN: MNIST images are up-scaled using bi-linear interpolation to 32×32×1, and the RGB channels of SVHN are converted to Mono, resulting in 32×32×1 images; several other combinations were tried, and this one was chosen since it gave the best results. (iii) SYNNUMBERS → SVHN: no pre-processing is required, as these domains have the same image size. (iv) CIFAR ↔ STL: only the 9 overlapping classes of the two datasets are used, as the label space should be the same for both domains; STL images are down-scaled from 96×96×3 to 32×32×3 to match the size of CIFAR. (v) SYNSIGNS → GTSRB: the images of both datasets are cropped to 40×40×3 based on the region of interest.
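The pre-processing steps above can be sketched as follows. The exact interpolation settings and RGB-to-Mono conversion used are not specified in this excerpt, so the `align_corners` choice and luminance weighting below are our own assumptions:

```python
import torch
import torch.nn.functional as F

def upscale_bilinear(img: torch.Tensor, size: int) -> torch.Tensor:
    """Bi-linear resizing of a CxHxW image, e.g. 1x16x16 (USPS) up to
    1x28x28 (MNIST size), or 3x96x96 (STL) down to 3x32x32 (CIFAR size)."""
    return F.interpolate(img.unsqueeze(0), size=(size, size),
                         mode="bilinear", align_corners=False).squeeze(0)

def rgb_to_mono(img: torch.Tensor) -> torch.Tensor:
    """Convert a 3xHxW RGB image to 1xHxW Mono via luminance weighting
    (one common choice; the paper does not state its exact conversion)."""
    weights = torch.tensor([0.299, 0.587, 0.114]).view(3, 1, 1)
    return (img * weights).sum(dim=0, keepdim=True)

usps = torch.rand(1, 16, 16)   # toy USPS-sized image
svhn = torch.rand(3, 32, 32)   # toy SVHN-sized image
```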
Note that, unlike [french2018selfensembling], we do not perform any image data augmentation in our experiments. Our aim in this paper is to demonstrate that the proposed method performs above or on par without data augmentation, as data augmentation is expensive and not always possible, as seen in language tasks.
IV-A2 Language Domain Adaptation
We consider four benchmark language domains, (i) Books, (ii) DVDs, (iii) Electronics, and (iv) Kitchen Appliances, from the Amazon customer reviews dataset [blitzer2006domain]. The dataset includes product reviews in four different domains for sentiment analysis, as indicated in Table II.
On these domains, we consider all twelve tasks studied in [ganin2016domain, DBLP:journals/corr/LouizosSLWZ15, 2017arXiv170208811Z, pmlr-v70-saito17a, DBLP:conf/acl/PlankR18]. We use the same neural networks and text pre-processing as [Chen:2012:MDA:3042573.3042781, ganin2016domain, DBLP:conf/acl/PlankR18] to obtain a 5000-dimensional feature vector. We assign the binary label '0' to products with low star ratings and '1' to products with high star ratings.
We select the best existing neural networks without major modifications to their hyper-parameters, so as to demonstrate the effectiveness of CUDA. All experiments are done using PyTorch [paszke2017automatic] with a mini-batch size of 64 per GPU distributed over four GPUs, and the Adam optimizer whose initial learning rate is decayed every 30 epochs.
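This training setup can be sketched as follows. The stand-in model, initial learning rate, and decay factor are illustrative assumptions, since only the decay interval (30 epochs) is stated here:

```python
import torch

# Stand-in linear model over the 5000-dimensional text features; the
# actual CTDR architecture and exact learning-rate values are assumptions.
model = torch.nn.Linear(5000, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

for epoch in range(60):
    # ... one training epoch over source + target mini-batches ...
    optimizer.step()     # dummy update so the schedule advances
    scheduler.step()     # decay the learning rate every 30 epochs

final_lr = optimizer.param_groups[0]["lr"]   # decayed twice over 60 epochs
```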
IV-B Experimental Results
We use the same evaluation metric as [pmlr-v37-ganin15, NIPS2016_6544, 10.1007/978-3-319-46493-0_36, pmlr-v70-saito17a, DBLP:conf/iccv/HausserFMC17, 8099799, DBLP:conf/cvpr/Sankaranarayanan18a, DBLP:conf/cvpr/LiuYFWCW18, 8578490, Russo_2018_CVPR, pmlr-v80-hoffman18a, xie2018learning, NIPS2018_7436, DBLP:conf/aaai/ChenCJJ19, french2018selfensembling, shu2018a, hosseini-asl2018augmented, ganin2016domain, DBLP:journals/corr/LouizosSLWZ15, 2017arXiv170208811Z, DBLP:conf/acl/PlankR18, Bousmalis:2016:DSN:3157096.3157135], i.e., the accuracy on the target domain test set. Table III reports the target domain test accuracy across all eight main domain adaptation tasks compared with several state-of-the-art domain alignment methods [pmlr-v37-ganin15, NIPS2016_6544, 10.1007/978-3-319-46493-0_36, pmlr-v70-saito17a, DBLP:conf/iccv/HausserFMC17, 8099799, DBLP:conf/cvpr/Sankaranarayanan18a, DBLP:conf/cvpr/LiuYFWCW18, 8578490, Russo_2018_CVPR, pmlr-v80-hoffman18a, xie2018learning, NIPS2018_7436, DBLP:conf/aaai/ChenCJJ19, french2018selfensembling, shu2018a, hosseini-asl2018augmented, Bousmalis:2016:DSN:3157096.3157135]. Table IV reports the target domain test accuracy across all twelve language domain adaptation tasks compared with different state-of-the-art methods [ganin2016domain, DBLP:journals/corr/LouizosSLWZ15, 2017arXiv170208811Z, pmlr-v70-saito17a, DBLP:conf/acl/PlankR18].
Apart from the standard domain alignment methods used for comparison, we report two baselines of our own in Tables III and IV, obtained by fixing the CTDR neural network architecture and varying only the training losses, to demonstrate the effectiveness of CUDA. The first baseline trains the CTDR using only the target domain in a fully supervised way; the second trains the CTDR using only the source domain in a fully supervised way. These respectively indicate the maximum and minimum target domain test accuracy that can be attained with the chosen CTDR network.
As our method mainly depends on the contradistinguish loss (5), we also experimented with better neural networks in combination with it, and observed improved results in both visual and language domain adaptation tasks over the networks used in [8099799, 8578490] for the visual experiments and MAN [DBLP:conf/naacl/ChenC18] for the language experiments.
IV-C Analysis of Experimental Results
IV-C1 Visual Domain Adaptation
In tasks with small target training sets, the target-only supervised baseline in Table III is poor, because the limited data causes under-fitting when training with only the target domain supervised loss. The improved results of CUDA indicate that the CTDR is able to contradistinguish on the target domain while transferring the informative knowledge required for classification from the larger source domain. This shows that the CTDR is indeed successful in contradistinguishing on a relatively small set of unlabeled target samples using the larger source domain's information. Another interesting observation is that, in one task, the source-only baseline is slightly better than CUDA. This is due to slight over-fitting on target domain training examples that are non-informative for classification, leading to a small decrease in target domain test accuracy; here the source domain carries more information than the target domain, owing to the large source and small target training sets. Figure 7(a-d) shows t-SNE plots as CUDA training progresses on the hardest of all the visual experiments, which we highlight because of its strongly contrasting domains. Figure 16(a-h) shows t-SNE plots of the CTDR test-sample outputs for all eight visual experiments; they show clear class-wise clustering on both source and target domains, indicating the efficacy of CUDA.
IV-C2 Language Domain Adaptation
In one task, CUDA falls slightly short because of mild over-fitting on the source domain. Figure 21(a-d) shows the t-SNE plots of the top four language tasks, indicating that the classes are oriented on either half of a line-like clustering.
V Conclusion
In this paper, we have proposed a simple and direct approach to unsupervised domain adaptation that differs from the standard distribution alignment approaches. In our approach, we jointly learn a Contradistinguisher (CTDR) on the source and target domain distributions in the same input feature space, using the contradistinguish loss on the unsupervised target domain to identify contrastive features. We have shown that this contrastive learning overcomes the need for, and the drawbacks of, domain alignment, especially in tasks where the domain shift is very high (e.g., language domains) and data augmentation techniques cannot be applied. Due to the prior enforcing included in the contradistinguish loss, the proposed unsupervised domain adaptation method CUDA can incorporate any known target domain prior to overcome skewness in the target domain, thereby resulting in a skew-robust model. We demonstrated the effectiveness of our model by achieving state-of-the-art results on all the visual domain adaptation tasks over eight different benchmark visual datasets, and on nine of the twelve language domain adaptation tasks, along with the best mean test accuracy over all twelve tasks on the benchmark Amazon customer reviews sentiment analysis dataset. In particular, the results on the language domains reinforce the efficacy of CUDA in being robust to high sparsity and high domain shift, which pose challenges to standard domain alignment approaches.
The authors would like to thank the Ministry of Human Resource Development (MHRD), Government of India, for generously funding this work through UAY Project: IISc 001 and IISc 010.