Contradistinguisher: Applying Vapnik's Philosophy to Unsupervised Domain Adaptation

05/25/2020 · by Sourabh Balgi, et al. · Indian Institute of Science

A complex combination of simultaneous supervised and unsupervised learning is believed to be the key to humans performing tasks seamlessly across multiple domains or tasks. This phenomenon of cross-domain learning has been very well studied in the domain adaptation literature. Recent domain adaptation works rely on an indirect way of first aligning the source and target domain distributions and then training a classifier on the labeled source domain to classify the target domain. However, this approach has the main drawback that obtaining a near-perfect alignment of the domains might in itself be difficult or impossible (e.g., language domains). To address this, we follow Vapnik's idea of statistical learning that states that any desired problem should be solved in the most direct way rather than by solving a more general intermediate task, and propose a direct approach to domain adaptation that does not require domain alignment. We propose a model, referred to as Contradistinguisher, that learns contrastive features and whose objective is to jointly learn to contradistinguish the unlabeled target domain in an unsupervised way and classify in a supervised way on the source domain. We demonstrate the superiority of our approach by achieving state-of-the-art results on eleven visual and four language benchmark datasets in both single-source and multi-source domain adaptation settings.


1 Introduction

The recent success of deep neural networks in supervised learning tasks across several areas such as computer vision, speech, and natural language processing can be attributed to models trained on large amounts of labeled data. However, acquiring massive amounts of labeled data in some domains can be very expensive or outright impossible. Additionally, the time required to label the data for a new domain before existing deep learning techniques can be applied can be very high; this is known as cold-start. On the contrary, cost-effective unlabeled data can be easily obtained in large amounts for most new domains. So, one can aim to transfer knowledge from a labeled source domain to perform tasks on an unlabeled target domain. To study this, under the purview of transductive transfer learning, several approaches such as domain adaptation, sample selection bias, and covariate shift have been explored in recent times.

Existing domain adaptation approaches mostly rely on domain alignment, i.e., align both domains so that they are superimposed and indistinguishable in the latent space. This domain alignment can be achieved in three main ways: (a) discrepancy-based methods [DBLP:conf/icml/LongC0J15, DBLP:conf/nips/LongZ0J16, DBLP:conf/icml/LongZ0J17, DBLP:conf/iccv/HausserFMC17, 8578490, french2018selfensembling, DBLP:journals/corr/LouizosSLWZ15, 2017arXiv170208811Z, rozantsev2018beyond, 8792192, mancini2018boosting, cariucci2017autodial, carlucci2017just], (b) reconstruction-based methods [10.1007/978-3-319-46493-0_36, Bousmalis:2016:DSN:3157096.3157135], and (c) adversarial adaptation methods [pmlr-v37-ganin15, NIPS2016_6544, 8099799, DBLP:conf/cvpr/Sankaranarayanan18a, DBLP:conf/cvpr/LiuYFWCW18, Russo_2018_CVPR, pmlr-v80-hoffman18a, xie2018learning, NIPS2018_7436, DBLP:conf/aaai/ChenCJJ19, shu2018a, hosseini-asl2018augmented, liang2018aggregating, xu2018deep].

These domain alignment strategies of indirectly addressing the task of unlabeled target domain classification have three main drawbacks. (i) The sub-task of obtaining a perfect alignment of the domains might in itself be impossible or very difficult due to large domain shift (e.g., language domains). (ii) The use of multiple classifiers and/or GANs to align the distributions unnecessarily increases the complexity of the neural networks, leading to over-fitting in many cases. (iii) Due to the distribution alignment, domain-specific information is lost as the domains get morphed.

A particular case where domain alignment and a classifier trained on the source domain might fail is when the target domain is better suited to the classification task than the source domain, i.e., the source domain yields lower classification performance. In this case, it is preferable to classify directly on the unlabeled target domain in an unsupervised manner, as aligning onto a less suited source domain only leads to loss of information. It is reasonable to assume that, for the main objective of unlabeled target domain classification, one can use all the information in the target domain and optionally incorporate any useful information from the labeled source domain, and not the other way around. These drawbacks motivate us to solve the domain adaptation problem without solving the more general problem of domain alignment.

In this work, we study unsupervised domain adaptation by learning contrastive features in the unlabeled target domain in a fully unsupervised manner, with the help of a classifier simultaneously trained on the labeled source domain. We derive our motivation from the philosophy of Vapnik [DBLP:books/sp/95/V1995, vapnik1999overview, DBLP:books/daglib/0026015, gong2007machine], which states that any desired problem should be solved in the most direct way possible rather than by solving a more general intermediate task. Considering the various drawbacks of the domain alignment approach, and based on Vapnik's philosophy, in this paper we propose a method for domain adaptation that does not require domain alignment and instead approaches the problem directly.

This work extends our earlier conference paper [DBLP:conf/icdm/BalgiD19, DBLP:journals/corr/abs-1909-03442] in the following ways. (i) We provide additional experimental results on the more complex domain adaptation dataset Office-31 [DBLP:conf/eccv/SaenkoKFD10], which includes images from three different sources, AMAZON, DSLR, and WEBCAM, categorized into three domains, each with only a few labeled high resolution images. (ii) We provide several ablation studies and demonstrations that offer insights into the workings of our proposed method CUDA [DBLP:conf/icdm/BalgiD19, DBLP:journals/corr/abs-1909-03442]. (iii) We extend our algorithm to the case of multi-source domain adaptation and establish benchmark results.

A summary of our contributions in this paper is as follows.

  1. We propose a simple method, Contradistinguisher for Unsupervised Domain Adaptation (CUDA), that directly addresses the problem of domain adaptation by learning a single classifier, which we refer to as the contradistinguisher, jointly in an unsupervised manner over the unlabeled target domain and in a supervised manner over the labeled source domain, hence overcoming the drawbacks of distribution-alignment-based techniques.

  2. We formulate a ‘contradistinguish loss’ to directly utilize the unlabeled target domain and address the classification task using unsupervised feature learning. Note that a similar approach, called DisCoder [Pandey2017UnsupervisedFL], was used for the much simpler task of semi-supervised feature learning on a single domain with no domain distribution shift.

  3. We extend our experiments to the more complex domain adaptation dataset Office-31 [DBLP:conf/eccv/SaenkoKFD10], which includes images from three different sources, AMAZON, DSLR, and WEBCAM, categorized into three domains. Unlike the simpler datasets (USPS [lecun1989backpropagation], MNIST [lecun1998gradient], SVHN [37648], SYNNUMBERS [pmlr-v37-ganin15], CIFAR-10 [krizhevsky2009learning], STL-10 [coates2011analysis], SYNSIGNS [pmlr-v37-ganin15], and GTSRB [Stallkamp-IJCNN-2011]) explored in [DBLP:conf/icdm/BalgiD19, DBLP:journals/corr/abs-1909-03442], the Office-31 dataset includes only a few hundred images per domain, with high resolution and varying backgrounds. From our experiments, we show that by jointly training the contradistinguisher on the source and target domain distributions, we can achieve results above or on par with several domain adaptation methods.

  4. We further demonstrate the simplicity and effectiveness of our proposed method by easily extending single-source domain adaptation to the more general multi-source domain adaptation, and validate this extension by performing experiments on the Office-31 [DBLP:conf/eccv/SaenkoKFD10] dataset in a multi-source setting.

  5. Apart from these real-world benchmark datasets, we also validate the proposed method using synthetically created toy-datasets. We use scikit-learn [pedregosa2011scikit] to generate blobs (point clouds) with different source and target domain distribution shapes and orientations, and simulate our proposed method on them.

Fig. 5: Demonstration of the difference between domain alignment and the proposed method CUDA on a 2-dimensional blobs synthetic toy-dataset, with domain distributions generated using the popular scikit-learn [pedregosa2011scikit]. Panels (a)-(b) use the standard domain alignment method; panels (c)-(d) use the proposed method CUDA. The top row corresponds to the domain alignment approach with two different domains on both domain adaptation tasks; the yellow dotted lines indicate the domain alignment that morphs the two domains. The bottom row corresponds to the proposed method CUDA in comparison with the respective domain alignment in the top row. The two columns indicate experiments with swapped source and target domains. Unlike the domain alignment approach, where the classifier is learnt only on the source domain, CUDA demonstrates the contradistinguisher jointly learnt to classify on both domains. As seen above, swapping domains affects the classifier learnt via domain alignment because the classifier depends on the source domain. However, because of joint learning on both domains simultaneously, the contradistinguisher shows almost the same decision boundary irrespective of which domain is used as the source. (Best viewed in color.)

In Fig. 5, we demonstrate the difference between domain alignment and the proposed method CUDA by swapping the domains. One can see that while domain alignment approaches learn a classifier only on the source domain, the contradistinguisher jointly learns to classify both domains. Due to this joint learning, we observe the added desirable behavior of obtaining similar classifiers irrespective of which domain is used as the source domain.

The rest of this paper is structured as follows. Section 2 discusses related work in domain adaptation. In Section 3, we elaborate on the problem formulation, the neural network architecture used, the loss functions, the model training and inference algorithms, and the complexity analysis of our proposed method. Section 4 discusses the experimental setup, results, and analysis on vision and language domains. Finally, in Section 5, we conclude by highlighting the key contributions of CUDA.

2 Related Work

As mentioned earlier, almost all domain adaptation approaches rely on domain alignment techniques. Here, we briefly discuss the three main techniques of domain alignment. (a) Discrepancy-based methods: Deep Adaptation Network (DAN) [DBLP:conf/icml/LongC0J15] proposes mean-embedding matching of multi-layer representations across domains by minimizing Maximum Mean Discrepancy (MMD) [Gretton:2009:FCK:2984093.2984169, gretton2012kernel, sejdinovic2013equivalence] in a reproducing kernel Hilbert space (RKHS). Residual Transfer Network (RTN) [DBLP:conf/nips/LongZ0J16] introduces separate source and target domain classifiers differing by a small residual function, along with fusing the features of multiple layers in an RKHS to match the domain distributions. Joint Adaptation Network (JAN) [DBLP:conf/icml/LongZ0J17] proposes to optimize Joint Maximum Mean Discrepancy (JMMD), which measures the Hilbert-Schmidt norm between the kernel mean embeddings of the empirical joint distributions of the source and target domains. Associative Domain Adaptation (ADA) [DBLP:conf/iccv/HausserFMC17] learns statistically domain invariant embeddings by associating the embeddings of the final fully-connected layer before applying softmax, as an alternative to the MMD loss. Maximum Classifier Discrepancy (MCD) [8578490] aligns source and target distributions by maximizing the discrepancy between two separate classifiers. Self Ensembling (SE) [french2018selfensembling] uses the mean teacher variant [DBLP:conf/nips/TarvainenV17] of temporal ensembling [DBLP:conf/iclr/LaineA17], with heavy reliance on data augmentation, to minimize the discrepancy between student and teacher network predictions. Variational Fair Autoencoder (VFAE) [DBLP:journals/corr/LouizosSLWZ15] uses a Variational Autoencoder (VAE) [DBLP:journals/corr/KingmaW13] with MMD to obtain domain invariant features. Central Moment Discrepancy (CMD) [2017arXiv170208811Z] proposes to match higher order moments of the source and target domain distributions. Rozantsev et al. [rozantsev2018beyond] propose to explicitly model the domain shift using a two-stream architecture, one stream for each domain, along with MMD to align the source and target representations. A more recent approach, multi-domain Domain Adaptation layer (mDA-layer) [8792192, mancini2018boosting], proposes the novel idea of replacing standard Batch-Norm layers [ioffe2015batch] with specialized Domain Alignment layers [cariucci2017autodial, carlucci2017just], thereby reducing the domain shift by discovering and handling multiple latent domains. Geodesic Flow Subspaces (GFS/SGF) [gopalan2011domain] performs domain adaptation by first generating two subspaces of the source and target domains using PCA, followed by learning a finite number of interpolated subspaces between the source and target subspaces based on the geometric properties of the Grassmann manifold; in the presence of multiple source domains, this method is very effective as it identifies the optimal subspace for domain adaptation. sFRAME (sparse Filters, Random fields, And Maximum Entropy) [xie2015learning] models are Markov random field models that fit a maximum entropy distribution to the observed data by identifying the patterns in the data. (b) Reconstruction-based methods: Deep Reconstruction-Classification Networks (DRCN) [10.1007/978-3-319-46493-0_36] and Domain Separation Networks (DSN) [Bousmalis:2016:DSN:3157096.3157135] learn shared encodings of the source and target domains using reconstruction networks. (c) Adversarial adaptation methods: Reverse Gradient (RevGrad) [pmlr-v37-ganin15], or Domain Adversarial Neural Network (DANN) [ganin2016domain], uses a domain discriminator to learn domain invariant representations of both domains. Coupled Generative Adversarial Network (CoGAN) [NIPS2016_6544] uses a Generative Adversarial Network (GAN) [Goodfellow:2014:GAN:2969033.2969125] to obtain domain invariant features used for classification. Adversarial Discriminative Domain Adaptation (ADDA) [8099799] uses GANs along with weight sharing to learn domain invariant features. Generate to Adapt (G2A) [DBLP:conf/cvpr/Sankaranarayanan18a] learns to generate an equivalent image in the other domain for a given image, thereby learning common domain invariant embeddings. Cross-Domain Representation Disentangler (CDRD) [DBLP:conf/cvpr/LiuYFWCW18] learns cross-domain disentangled features for domain adaptation. Symmetric Bi-Directional Adaptive GAN (SBADA-GAN) [Russo_2018_CVPR] aims to learn symmetric bidirectional mappings between the domains by trying to mimic a target image given a source image. Cycle-Consistent Adversarial Domain Adaptation (CyCADA) [pmlr-v80-hoffman18a] adapts representations at both the pixel level and the feature level across the domains. Moving Semantic Transfer Network (MSTN) [xie2018learning] learns semantic representations for the unlabeled target samples by aligning labeled source centroids and pseudo-labeled target centroids. Conditional Domain Adversarial Network (CDAN) [NIPS2018_7436] conditions the adversarial adaptation models on discriminative information conveyed in the classifier predictions. Joint Discriminative Domain Adaptation (JDDA) [DBLP:conf/aaai/ChenCJJ19] proposes joint domain alignment along with discriminative feature learning. Decision-boundary Iterative Refinement Training with a Teacher (DIRT-T) [shu2018a] and Augmented Cyclic Adversarial Learning (ACAL) [hosseini-asl2018augmented] learn by using a domain discriminator along with data augmentation for domain adaptation. Deep Cocktail Network (DCTN) [xu2018deep] proposes a k-way domain discriminator and category classifier for digit classification and real-world object recognition in a multi-source domain adaptation setting.

Apart from these approaches, a slightly different method that has been proposed recently is tri-training. Tri-training algorithms use three classifiers trained on the labeled source domain and refine them for the unlabeled target domain. To be precise, in each round of tri-training, a target sample is pseudo-labeled if the other two classifiers agree on the labeling, under certain conditions such as confidence thresholding. Asymmetric Tri-Training (ATT) [pmlr-v70-saito17a] uses three classifiers to bootstrap high confidence target domain samples by confidence thresholding. This way of bootstrapping works only if the source classifier has very high accuracy. In the case of low source classifier accuracy, target samples are never obtained for bootstrapping, resulting in a bad model. Multi-Task Tri-training (MT-Tri) [DBLP:conf/acl/PlankR18] explores the tri-training technique on language domain adaptation tasks in a multi-task setting.

All the domain adaptation approaches mentioned earlier have a common unifying theme: they attempt to morph the target and source distributions so as to make them indistinguishable. In this paper, we propose a completely different approach: instead of focusing on aligning the source and target distributions, we learn a single classifier, referred to as the contradistinguisher, jointly on both domain distributions, using the contradistinguish loss for the unlabeled target domain data and a supervised loss for the labeled source domain data.

3 Proposed Method: CUDA

A domain $\mathcal{D}$ is specified by its input feature space $\mathcal{X}$, the label space $\mathcal{Y}$, and the joint probability distribution $p(\mathbf{x}, y)$, where $\mathbf{x} \in \mathcal{X}$ and $y \in \mathcal{Y}$. Let $K = |\mathcal{Y}|$ be the number of class labels, so that $y \in \{1, \dots, K\}$ for any instance $\mathbf{x}$. Domain adaptation, in particular, consists of two domains $\mathcal{D}_s$ and $\mathcal{D}_t$ that are referred to as the source and target domains respectively. A common assumption in domain adaptation is that the input feature space as well as the label space remain unchanged across the source and the target domain, i.e., $\mathcal{X}_s = \mathcal{X}_t = \mathcal{X}$ and $\mathcal{Y}_s = \mathcal{Y}_t = \mathcal{Y}$. Hence, the only difference between the source and target domains is in the input-label space distributions, i.e., $p_s(\mathbf{x}, y) \neq p_t(\mathbf{x}, y)$. This is referred to as domain shift in the domain adaptation literature.

In particular, in unsupervised domain adaptation, the training data consists of labeled source domain instances $\{(\mathbf{x}_s^i, y_s^i)\}_{i=1}^{n_s}$ and unlabeled target domain instances $\{\mathbf{x}_t^i\}_{i=1}^{n_t}$. Given labeled data in the source domain, it is straightforward to learn a classifier by maximizing the conditional probability $p(y \mid \mathbf{x})$ over the labeled samples. However, the task at hand is to learn a classifier on the unlabeled target domain by transferring knowledge from the labeled source domain.

3.1 Overview

Fig. 6: Architecture of the proposed method CUDA with the contradistinguisher (encoder and classifier). Three optimization objectives, with their respective inputs, are involved in training CUDA: (i) source supervised loss (2), (ii) target unsupervised contradistinguish loss (5), and (iii) adversarial regularization (9).

The outline of the proposed method CUDA, with the contradistinguisher and the respective losses involved in training, is depicted in Fig. 6. The objective of the contradistinguisher is to find a clustering scheme, using the most contrastive features on the unlabeled target domain, in such a way that it also satisfies the target domain prior, i.e., prior enforcing. We achieve this by jointly training on labeled source samples in a supervised manner and on unlabeled target samples in an unsupervised manner, end-to-end, using a contradistinguish loss similar to [Pandey2017UnsupervisedFL].

This fine-tunes the classifier learnt on the source domain to the target domain as well, as demonstrated in Fig. 5 and Fig. 11. The crux of our approach is the contradistinguish loss (5), which is discussed in detail in Section 3.3; hence the apt name contradistinguisher for our neural network architecture.

Note that the objective of the contradistinguisher is not the same as that of a classifier, i.e., distinguishing is not the same as classifying. Suppose there are two contrastive entities $a \in C_1$ and $b \in C_2$, where $C_1$ and $C_2$ are two classes. The aim of a classifier is to classify $a$ as $C_1$ and $b$ as $C_2$, and to train a classifier one requires labeled data. On the contrary, the job of the contradistinguisher is to just identify that $a \neq b$, i.e., the contradistinguisher can label $a$ as $C_1$ (or $C_2$) and $b$ as $C_2$ (respectively $C_1$) indifferently. To train the contradistinguisher, we do not need any class information, but only the unlabeled entities $a$ and $b$. Using unlabeled target data, the contradistinguisher is able to find a clustering scheme by distinguishing the unlabeled target domain samples in an unsupervised way. However, since the final task is classification, one requires a selective incorporation of the pre-existing informative knowledge required for the task of classification. This knowledge of assigning labels to the clusters is obtained by jointly training on the labeled source domain, thus classifying $a$ as $C_1$ and $b$ as $C_2$.

3.2 Supervised Source Classification

For the labeled source domain instances $(\mathbf{x}_s^i, y_s^i)$, we define the conditional likelihood of observing $y_s^i$ given $\mathbf{x}_s^i$ as $p_\theta(y_s^i \mid \mathbf{x}_s^i)$, where $\theta$ denotes the parameters of the contradistinguisher.

We estimate $\theta$ by maximizing the conditional log-likelihood of observing the labels given the labeled source domain samples. Therefore, the source domain supervised objective to maximize is given as

$\max_\theta \sum_{i=1}^{n_s} \log p_\theta(y_s^i \mid \mathbf{x}_s^i)$   (1)

Alternatively, one can minimize the cross-entropy loss, as used in our practical implementation, instead of maximizing (1), i.e.,

$\mathcal{L}_s(\theta) = -\sum_{i=1}^{n_s} \sum_{k=1}^{K} \mathbb{1}_{[k = y_s^i]} \log \hat{y}_k(\mathbf{x}_s^i)$   (2)

where $\hat{y}_k(\mathbf{x}_s^i)$ is the softmax output of the contradistinguisher that represents the probability of class $k$ for the given sample $\mathbf{x}_s^i$.
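For concreteness, the source supervised objective (2) reduces to the standard cross-entropy loss on source mini-batches. A minimal PyTorch sketch, assuming the contradistinguisher is any module producing class logits (the function and variable names here are illustrative, not from the original implementation):

```python
import torch.nn.functional as F

def source_supervised_loss(contradistinguisher, x_s, y_s):
    """Cross-entropy loss (2) on a labeled source mini-batch."""
    logits = contradistinguisher(x_s)     # (b, K) unnormalized class scores
    return F.cross_entropy(logits, y_s)   # equals -mean log p_theta(y_s | x_s)
```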

3.3 Unsupervised Target Classification

For the unlabeled target domain instances $\mathbf{x}_t^i$, as the corresponding labels are unknown, a naive way of predicting the target labels is to directly use the classifier trained only with the supervised loss given in (2). While this approach may perform reasonably well in certain cases, it fails to deliver state-of-the-art performance. This may be attributed to the following reason: the support for the distribution $p_\theta(y \mid \mathbf{x})$ is defined only over the source domain instances $\mathbf{x}_s$ and not the target domain instances $\mathbf{x}_t$. Hence, we model a non-trivial joint distribution, parameterized by the same $\theta$, over the target domain with only the target domain instances as the support:

$\hat{p}_\theta(\mathbf{x}_t^i, y_t^i) = \frac{p_\theta(y_t^i \mid \mathbf{x}_t^i)}{\sum_{j=1}^{n_t} p_\theta(y_t^i \mid \mathbf{x}_t^j)}$   (3)

However, (3) is not a joint distribution yet because $\sum_{i=1}^{n_t} \hat{p}_\theta(\mathbf{x}_t^i, y) \neq p_t(y)$, i.e., marginalizing over all $\mathbf{x}_t$ does not yield the target prior distribution $p_t(y)$. We modify (3) so as to include the marginalization condition; we refer to this as target domain prior enforcing:

$\hat{p}_\theta(\mathbf{x}_t^i, y_t^i) = p_t(y_t^i) \, \frac{p_\theta(y_t^i \mid \mathbf{x}_t^i)}{\sum_{j=1}^{n_t} p_\theta(y_t^i \mid \mathbf{x}_t^j)}$   (4)

Note that (4) defines a non-trivial approximate joint distribution over the target domain as a function of the conditional $p_\theta(\cdot)$ learnt over the source domain. The resultant unsupervised maximization objective for the target domain is given by maximizing the log-probability of the joint distribution, i.e.,

$\max_{\theta, \{y_t^i\}} \sum_{i=1}^{n_t} \log \hat{p}_\theta(\mathbf{x}_t^i, y_t^i)$   (5)

Next, we discuss how the objective given in (5) is solved and the reason why (5) is referred to as the contradistinguish loss. Since the target labels are unknown, one needs to maximize (5) over the parameters $\theta$ as well as the unknown target labels $\{y_t^i\}$. As there are two unknown variables for maximization, we follow a two-step approach to maximize (5), analogous to the Expectation Maximization (EM) algorithm [dempster1977maximum]. The two optimization steps are as follows.

(i) Pseudo-label selection: We maximize (5) only with respect to the label $y_t^i$ for every $\mathbf{x}_t^i$ by fixing $\theta$, i.e.,

$\hat{y}_t^i = \operatorname*{arg\,max}_{y \in \mathcal{Y}} \hat{p}_\theta(\mathbf{x}_t^i, y)$   (6)

The pseudo-labeling approach under the semi-supervised representation learning setting has been well studied in [pseudo-label] and shown to be equivalent to entropy regularization [grandvalet2005semi]. As previously mentioned, pseudo-label selection is analogous to the E-step in the EM algorithm. Moreover, we derive motivation from [Pandey2017UnsupervisedFL], which also uses pseudo-labeling in the context of semi-supervised representation learning. However, the proposed method addresses the more complex problem of domain adaptation in the presence of domain shift.
(ii) Maximization: By fixing the pseudo-labels $\hat{y}_t^i$ obtained from (6), we train the contradistinguisher to maximize (5) with respect to the parameters $\theta$:

$\max_\theta \sum_{i=1}^{n_t} \Big( \log p_\theta(\hat{y}_t^i \mid \mathbf{x}_t^i) + \log p_t(\hat{y}_t^i) - \log \sum_{j=1}^{n_t} p_\theta(\hat{y}_t^i \mid \mathbf{x}_t^j) \Big)$   (7)

The first term, i.e., the log-probability of the label $\hat{y}_t^i$ given $\mathbf{x}_t^i$, forces the contradistinguisher to choose features so as to classify $\mathbf{x}_t^i$ as $\hat{y}_t^i$. The second term is a constant with respect to $\theta$, and hence has no effect on the optimization. The third term is the negative log-probability of the label $\hat{y}_t^i$ given all the samples in the entire domain. Maximizing this term forces the contradistinguisher to choose features so as to not classify the other samples $\mathbf{x}_t^j, j \neq i$, to the selected pseudo-label $\hat{y}_t^i$. In other words, the contradistinguisher is forced to extract the most unique, contrastive features of a given sample relative to all the other samples, so as to distinguish the given sample from all others. The first and third terms in (7) together enforce that the contradistinguisher learns the most contradistinguishing features among the target samples, thus performing unlabeled target domain classification in a fully unsupervised way. Because of this contradistinguishing feature learning, we refer to the unsupervised target domain objective (5) as the contradistinguish loss.

Ideally, one would like to compute the third term in (7) using the complete target training data for each input sample. Since it is expensive to compute the third term over the entire target set for each individual sample during training, one evaluates the third term in (7) over a mini-batch. In our experiments, we have observed that the mini-batch strategy does not cause any problem during training as long as the mini-batch includes at least one sample from each class, which is a fair assumption for a reasonably large mini-batch size. For numerical stability, we use the log-sum-exp trick to optimize the third term in (7).
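The following sketch illustrates how the mini-batch versions of (6) and (7) could be implemented in PyTorch. The function and tensor names are ours, and the uniform target prior is an assumption (the paper enforces the known or estimated prior); this is a sketch, not the published implementation:

```python
import math
import torch
import torch.nn.functional as F

def contradistinguish_loss(contradistinguisher, x_t, log_prior=None):
    """Mini-batch contradistinguish loss: pseudo-label selection (6) + maximization (7)."""
    log_probs = F.log_softmax(contradistinguisher(x_t), dim=1)  # (b, K): log p_theta(y | x_j)
    b, K = log_probs.shape
    if log_prior is None:
        log_prior = torch.full((K,), -math.log(K))              # assume a uniform target prior
    # (6) pseudo-label selection with theta fixed (no gradients through this step)
    with torch.no_grad():
        # log p_t(y) + log p_theta(y | x_i) - log sum_j p_theta(y | x_j), batch approximation
        log_joint = log_prior + log_probs - torch.logsumexp(log_probs, dim=0)
        pseudo = log_joint.argmax(dim=1)                         # (b,) pseudo-labels y_hat
    # (7) maximization: first term minus third term (second term is constant in theta)
    first = log_probs[torch.arange(b), pseudo]                   # log p_theta(y_hat_i | x_i)
    third = torch.logsumexp(log_probs[:, pseudo], dim=0)         # log sum_j p_theta(y_hat_i | x_j)
    return -(first - third).mean()                               # negated for gradient descent
```

In the full training step, this loss would simply be added to the source cross-entropy loss of (2) before back-propagation, as in Algorithm 1.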

3.4 Adversarial Regularization

In order to prevent the contradistinguisher from over-fitting to the chosen pseudo-labels during training, we use adversarial regularization. In particular, we train the contradistinguisher to be confused about a set of fake negative samples by maximizing the conditional log-probability over each given fake sample such that the sample belongs to all classes simultaneously. The objective of the adversarial regularization is to multi-label each fake sample (e.g., a noisy image that looks like both a cat and a dog) equally across all classes, since labeling it to any unique class would introduce more noise in the pseudo-labels. This strategy is similar to entropy regularization [grandvalet2005semi] in the sense that, instead of minimizing the entropy for the real target samples, we maximize the conditional log-probability over the fake negative samples. Therefore, we add the following maximization objective to the total contradistinguisher objective as a regularizer:

$\max_\theta \sum_{k=1}^{K} \log p_\theta(y = k \mid \mathbf{x}_f)$   (8)

for every fake sample $\mathbf{x}_f$. As maximizing (8) is analogous to minimizing the binary cross-entropy loss of a multi-class multi-label classification task, in our practical implementation we instead minimize

$\mathcal{L}_{adv}(\theta) = -\sum_{k=1}^{K} \log \hat{y}_k(\mathbf{x}_f)$   (9)

i.e., we assign the labels of all the classes to every fake sample, where $\hat{y}_k(\mathbf{x}_f)$ is the softmax output of the contradistinguisher which represents the probability of class $k$ for the given sample $\mathbf{x}_f$.
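A minimal sketch of (9), assuming, as described above, that every fake sample is assigned every class label; it is implemented here with the usual sigmoid-based multi-label BCE, and all names are illustrative:

```python
import torch
import torch.nn.functional as F

def adversarial_regularizer(contradistinguisher, x_fake):
    """Multi-label loss (9): every fake sample is labeled with ALL classes equally."""
    logits = contradistinguisher(x_fake)   # (b, K)
    targets = torch.ones_like(logits)      # all K classes marked positive
    return F.binary_cross_entropy_with_logits(logits, targets)
```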

The fake negative samples $\mathbf{x}_f$ can be directly sampled from, say, a Gaussian distribution in the input feature space, with the mean and standard deviation of the real samples. For the language domains, fake samples are generated randomly in this manner because the input features are embeddings extracted from a denoising autoencoder with bag-of-words as the autoencoder's input. In the case of visual datasets, as the feature space is high dimensional, the fake images are generated using a generator network $G_\phi$ with parameters $\phi$ that takes a Gaussian noise vector $\mathbf{z}$ as input to produce a fake sample $\mathbf{x}_f = G_\phi(\mathbf{z})$. The generator is trained by minimizing the kernel MMD loss [DBLP:conf/nips/LiCCYP17], i.e., a modified version of the MMD loss between the encoder outputs $\mathbf{e}_f$ and $\mathbf{e}_t$ of the fake images and the real target domain images respectively:

$\mathcal{L}_{gen}(\phi) = \frac{1}{n_f^2} \sum_{i,j} k(\mathbf{e}_f^i, \mathbf{e}_f^j) - \frac{2}{n_f n_t} \sum_{i,j} k(\mathbf{e}_f^i, \mathbf{e}_t^j) + \frac{1}{n_t^2} \sum_{i,j} k(\mathbf{e}_t^i, \mathbf{e}_t^j)$   (10)

where $k(\mathbf{e}, \mathbf{e}') = \exp\big(-\|\mathbf{e} - \mathbf{e}'\|^2 / 2\sigma^2\big)$ is the Gaussian kernel.

Note that the objective of the generator is not to generate realistic images, but to generate fake noisy images with mixed image attributes from the target domain. This reduces the effort of training powerful generators, which is the focus of adversarial domain adaptation approaches [DBLP:conf/cvpr/Sankaranarayanan18a, DBLP:conf/cvpr/LiuYFWCW18, Russo_2018_CVPR, pmlr-v80-hoffman18a, xie2018learning] used for domain alignment.
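A sketch of the biased kernel-MMD estimator in (10) with a single Gaussian kernel bandwidth (the kernel MMD of [DBLP:conf/nips/LiCCYP17] mixes several bandwidths; this simplification and all names are our assumptions):

```python
import torch

def gaussian_kernel(a, b, sigma=1.0):
    """k(e, e') = exp(-||e - e'||^2 / (2 sigma^2)) for all pairs of rows of a and b."""
    d2 = torch.cdist(a, b).pow(2)
    return torch.exp(-d2 / (2.0 * sigma ** 2))

def generator_mmd_loss(e_fake, e_target, sigma=1.0):
    """Biased estimate of the MMD^2 in (10) between fake and target encoder outputs."""
    return (gaussian_kernel(e_fake, e_fake, sigma).mean()
            - 2.0 * gaussian_kernel(e_fake, e_target, sigma).mean()
            + gaussian_kernel(e_target, e_target, sigma).mean())
```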

3.5 Algorithms and Complexity Analysis

Algorithms 1 and 2 list the steps involved in CUDA training and inference respectively.

Input: labeled source data $\{(\mathbf{x}_s^i, y_s^i)\}_{i=1}^{n_s}$, unlabeled target data $\{\mathbf{x}_t^i\}_{i=1}^{n_t}$, number of epochs $E$
Output: $\theta$   // parameters of contradistinguisher
1  if target domain prior $p_t(y)$ is known then
2        use $p_t(y)$ for the contradistinguish loss (5)   // target domain prior enforcing
3  else
4        compute $p_t(y)$ assuming a uniform prior over classes   // fair assumption as most datasets are well balanced
5  for epoch = 1 to $E$ do
6        for each mini-batch do
7              sample a source mini-batch and compute the source supervised loss (1)   // source supervised loss
8              select pseudo-labels using (6) with $\theta$ fixed   // pseudo-label selection step
9              compute the maximization objective (7) with the pseudo-labels fixed   // maximization step
               /* steps 8 and 9 together optimize the unsupervised contradistinguish loss (5) */
10             if adversarial regularization is enabled then
11                   if the generator is used then
12                         generate fake samples from Gaussian noise vectors using the generator and compute the generator loss (10)   // generator training
13                   else
14                         generate fake samples by random sampling in the input feature space
15                   compute the adversarial regularization loss (9) on the fake samples   // fake samples are assigned to all classes equally
16             combine the losses from steps 7, 9, 12, and 15, compute gradients using back-propagation, and update $\theta$ using gradient descent   // and $\phi$ if the generator is used
Algorithm 1 CUDA Training
Input: test samples $\{\mathbf{x}^i\}_{i=1}^{n}$
Output: predicted labels $\{\hat{y}^i\}_{i=1}^{n}$
1  for $i = 1$ to $n$ do
2        predict the label as $\hat{y}^i = \operatorname*{arg\,max}_{y \in \mathcal{Y}} p_\theta(y \mid \mathbf{x}^i)$
Algorithm 2 CUDA Inference

Further, we briefly discuss the time complexity of Algorithms 1 and 2. We also compare the model complexity of CUDA against that of domain alignment approaches.

(a) Time complexity: We consider a batch of $b$ instances for forward and backward propagation during training. For computing the source supervised loss given in (2), the time complexity is $O(b\,T_{cd})$, where $T_{cd}$ is the time complexity involved in obtaining the classifier output; this mainly depends on the model complexity, which is discussed next. For computing the target unsupervised loss given in (5), the time complexity is $O(b\,T_{cd})$ for pseudo-label selection and $O(b^2)$ additionally for the first and third terms in the maximization step, i.e., effectively $O(b\,T_{cd} + b^2)$ for the target unsupervised loss (5). The adversarial regularization loss in (9) has time complexity $O(b\,T_{cd})$. The time complexity of generator training is $O(b\,T_{enc} + b^2 d)$, where $d$ is the dimension of the encoder output and $T_{enc}$ is the time complexity of the encoder neural network, which also depends on the model complexity discussed next. As $T_{cd}$ dominates $b$ and $d$, the total training time complexity can be simplified to $O(b\,T_{cd})$ per mini-batch, with patience-based early stopping on the loss over a held-out validation set. During the inference phase, the time complexity is $O(n\,T_{cd})$, where $n$ is the number of inference samples. (b) Model complexity: As discussed above, $T_{cd}$ mainly depends on the model complexity, involving many factors such as the input feature dimension, the number of neural network layers, the type of normalization, and the type of activation functions. The contradistinguisher is a simple network with a single encoder and classifier, unlike MCD [8578490], which uses a single encoder with two classifiers; this makes MCD's time complexity roughly twice that of a single encoder-classifier network. Similarly, SE [french2018selfensembling] uses two copies of the encoder-classifier network, one for the student and the other for the teacher network, again roughly doubling the time complexity. In general, as domain alignment approaches use additional circuitry, either in terms of multiple classifiers or GANs, the model complexity increases at least by a factor of 2. This increased model complexity requires more data augmentation to prevent over-fitting, leading to further increases in time complexity, at the expense of only a slight improvement, if any, compared to CUDA, as indicated by our state-of-the-art results without any data augmentation in both visual and language domain adaptation tasks. We believe the trade-off achieved by the simplicity of CUDA, as evident from our results, is very desirable compared to most domain alignment approaches that use data augmentation and complex neural networks for a slight improvement, if any.

3.6 Extending to Multi-Source Domain Adaptation

We can easily extend our proposed method to perform multi-source domain adaptation. Suppose we are given $m$ source domains $\mathcal{D}_{s_1}, \dots, \mathcal{D}_{s_m}$, consisting of labeled training data $\{(\mathbf{x}_{s_\ell}^i, y_{s_\ell}^i)\}_{i=1}^{n_{s_\ell}}$ for $\ell = 1, \dots, m$, and unlabeled target domain instances $\{\mathbf{x}_t^i\}_{i=1}^{n_t}$. We compute the source supervised loss $\mathcal{L}_{s_\ell}(\theta)$ for the $\ell$-th source domain using (2), i.e., the objective (1) with that domain's training data. We then compute the total multi-source supervised loss as

$\mathcal{L}_{ms}(\theta) = \sum_{\ell=1}^{m} \mathcal{L}_{s_\ell}(\theta)$   (11)

We replace the source supervised loss (1) in the total optimization objective with the multi-source loss (11) in the source supervised loss step (step 7) of Algorithm 1.
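Under the same assumptions as the earlier sketches, and reusing the hypothetical source_supervised_loss from Section 3.2, (11) is simply a sum of per-domain cross-entropy losses:

```python
def multi_source_supervised_loss(contradistinguisher, source_batches):
    """Total multi-source supervised loss (11): sum of per-domain losses (2).

    source_batches: list of (x_s, y_s) mini-batches, one per source domain.
    """
    return sum(source_supervised_loss(contradistinguisher, x_s, y_s)
               for x_s, y_s in source_batches)
```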

4 Experiments

For our domain adaptation experiments, we consider both synthetic and real-world datasets. Under real-world datasets, we consider both visual and language datasets to demonstrate the input data format independence of the proposed method. Visual datasets can be further divided into two categories: low resolution and high resolution visual datasets. Table I provides details on the visual datasets and Table II on the language datasets used in our experiments. We have published our Python code for all the experiments at https://github.com/sobalgi/cuda, originally derived from https://github.com/gauravpandeyamu/DisCoder for DisCoder [Pandey2017UnsupervisedFL].

4.1 Experiments on synthetic toy-dataset using blobs

Fig. 11: Additional demonstration of the difference between domain alignment and the proposed method CUDA on synthetic toy-datasets using blobs, similar to Fig. 5. The top row corresponds to the domain alignment approach with two different domains on both domain adaptation tasks. The bottom row corresponds to the proposed method CUDA in comparison with the respective domain alignment method in the top row. As seen above, swapping domains affects the classifier learnt with domain alignment methods because the classifier depends on the source domain. However, because of joint learning on both domains simultaneously in the proposed method CUDA, the contradistinguisher shows almost the same decision boundary irrespective of the source domain when the domains are swapped. (Best viewed in color.)
Fig. 32: Contour plots showing the probability contours along with clear decision boundaries on different toy-dataset settings trained using CUDA (class 0: blue, class 1: red). (Best viewed in color.)

We validate our proposed method by performing experiments on synthetically created simple datasets that model different source and target domain distributions in a 2-dimensional input feature space, using blobs with different source-target domain orientations and offsets (i.e., domain shift). We create blobs for the source and target domains with 4000 samples each using standard scikit-learn [pedregosa2011scikit], as indicated in Fig. 5 and Fig. 11. We further split these 4000 data-points evenly into equal train and test sets. Each of the splits contains the same number of samples from both class labels.
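A sketch of how such a shifted two-domain blob dataset could be generated with scikit-learn's make_blobs; the specific centers, spreads, and seeds below are illustrative choices, not the paper's exact settings:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

def make_domain(centers, seed):
    """One 2-class, 2D blob domain with 4000 samples, split evenly into train/test."""
    X, y = make_blobs(n_samples=4000, centers=centers,
                      cluster_std=0.5, random_state=seed)
    return train_test_split(X, y, test_size=0.5, stratify=y, random_state=seed)

# Source and target share the class layout but are offset (domain shift).
src_centers = np.array([[0.0, 0.0], [3.0, 0.0]])
tgt_centers = src_centers + np.array([1.5, 2.0])   # shifted target domain
Xs_tr, Xs_te, ys_tr, ys_te = make_domain(src_centers, seed=0)
Xt_tr, Xt_te, yt_tr, yt_te = make_domain(tgt_centers, seed=1)
```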

The main motivation of the experiments on the toy-dataset is to understand and visualize the behavior of the proposed method under some typical domain distribution scenarios and to analyze the performance of CUDA. The toy-dataset plots in Fig. 11 show a clear comparison of the classifier decision boundaries learnt using CUDA against domain alignment approaches. The top row in Fig. 11 corresponds to a domain alignment classifier trained only on the labeled source domain. The bottom row in Fig. 11 corresponds to the contradistinguisher trained using the proposed method CUDA with the labeled source and unlabeled target domains.

Fig. 32 demonstrates the classifier learnt using CUDA on the synthetic datasets with different complex shapes and orientations of the source and target domain distributions. Figs. 32(a)-(d) indicate the simplest form of the domain adaptation task, where the source and target domain distributions have similar orientations. It is important to note that the prior enforcing used in pseudo-label selection is the reason such fine classifier boundaries are observed, especially in Figs. 32(e)-(m). Figs. 32(n)-(p) represent more complex configurations of the source and target domain distributions, where hyperbolic decision boundaries are jointly learnt on both domains simultaneously using a single classifier, without explicit domain alignment. Similarly, Fig. 32(q) represents a complex configuration of the source and target domain distributions that results in an elliptical decision boundary.

4.2 Experimental Setup and Datasets

Dataset # Train # Test # Classes Target Resolution Channels
USPS 7,291 2,007 10 Digits 16×16 Mono
MNIST 60,000 10,000 10 Digits 28×28 Mono
SVHN 73,257 26,032 10 Digits 32×32 RGB
SYNNUMBERS 479,400 9,553 10 Digits 32×32 RGB
CIFAR-9 45,000 9,000 9 Object ID 32×32 RGB
STL-9 4,500 7,200 9 Object ID 96×96 RGB
SYNSIGNS 100,000 - 43 Traffic Signs 40×40 RGB
GTSRB 39,209 12,630 43 Traffic Signs varies RGB
AMAZON 2,817 - 31 Office Objects 224×224 RGB
DSLR 498 - 31 Office Objects 224×224 RGB
WEBCAM 795 - 31 Office Objects 224×224 RGB
TABLE I: Details of visual datasets.

Domain # Train # Test
Books 2,000 4,465
DVDs 2,000 3,586
Electronics 2,000 5,681
Kitchen Appliances 2,000 5,945
TABLE II: Details of language dataset (Amazon customer reviews for sentiment analysis).
(a) All 10 classes of the Digits datasets. (b) All 9 overlapping classes of the Objects datasets. (c) All 43 classes of the Traffic Signs datasets.
Fig. 36: Illustrations of samples from all the low resolution visual datasets, with exactly one instance per class from every domain. (Best viewed in color.)
Fig. 37: Illustrations of samples from all three domains of the high resolution Office-31 [DBLP:conf/eccv/SaenkoKFD10] dataset, with one instance per class from every domain. (Best viewed in color.)
Fig. 42: t-SNE [vandermaaten2008visualizing] plots of the embeddings from the output of the contradistinguisher, before applying softmax, for the test samples of the visual task trained with CUDA. We consider this task as it is the most difficult among all the visual experiments due to contrasting domains with high domain shift. (a) Randomly initialized neural network before training: all test samples, indicating domain shift as there are two separate clusters for each domain. (b) A subset of test samples after 1 training epoch. (c) A subset of test samples after 6 training epochs. (d) A subset of test samples after full contradistinguisher training, showing class-wise clustering. (Best viewed in color.)
Fig. 51: t-SNE [vandermaaten2008visualizing] plots of the embeddings from the output of the contradistinguisher, before applying softmax, for the test samples in the low resolution visual experiments, showing class-wise clustering. (Best viewed in color.)
Method
ADA [DBLP:conf/iccv/HausserFMC17] - - 97.16 - - - 91.86 97.66
MCD [8578490] 94.10 94.20 96.20 - - - - 94.40
DRCN [10.1007/978-3-319-46493-0_36] 73.67 91.80 81.97 40.05 66.37 58.65 - -
DSN [Bousmalis:2016:DSN:3157096.3157135] - - 82.70 - - - 91.20 93.10
RevGrad [pmlr-v37-ganin15] 74.01 91.11 73.91 35.67 66.12 56.91 91.09 88.65
CoGAN [NIPS2016_6544] 89.10 91.20 - - - - - -
ADDA [8099799] 90.10 89.40 76.00 - - - - -
G2A [DBLP:conf/cvpr/Sankaranarayanan18a] 90.80 92.50 84.70 36.40 - - - -
CDRD [DBLP:conf/cvpr/LiuYFWCW18] 94.35 95.05 - - - - - -
SBADA-GAN [Russo_2018_CVPR] 95.00 97.60 76.10 61.10 - - - 96.70
CyCADA [pmlr-v80-hoffman18a] 96.50 95.60 90.40 - - - - -
MSTN [xie2018learning] - 92.90 91.70 - - - - -
CDAN [NIPS2018_7436] 97.10 96.50 90.50 - - - - -
JDDA [DBLP:conf/aaai/ChenCJJ19] 96.70 - 94.20 - - - - -
ATT [pmlr-v70-saito17a] - - 86.20 52.80 - - 93.10 96.20
CUDA (Ours) 99.20 97.86 99.07 71.30 77.22 65.93 94.30 99.40
(Ours) 81.18 82.00 77.54 24.86 77.64 62.10 91.45 95.13
(Ours) 99.64 97.98 99.64 96.02 73.78 91.46 96.85 98.23
(Ours) 98.83 97.71 98.81 50.83 77.22 62.50 93.65 98.15
(Ours) 98.77 97.86 98.62 54.38 76.93 61.09 93.52 97.86
(Ours) 99.20 97.31 98.85 54.32 76.18 59.37 93.59 99.40
(Ours) 89.97 93.87 97.15 41.71 75.00 56.99 90.79 99.35
(Ours) 98.75 96.26 95.73 55.25 70.93 61.37 92.97 99.11
SE [french2018selfensembling] (requires data augmentation) 99.54 98.26 99.26 97.00 80.09 74.24 97.11 99.37
DIRT-T [shu2018a] (requires data augmentation) - - 99.40 54.50 - 73.30 96.20 99.60
ACAL [hosseini-asl2018augmented] (requires data augmentation) 97.16 98.31 96.51 60.85 - - 97.98 -
TABLE III: Target domain test accuracy (%) on low resolution visual datasets. ‘-’ indicates that the particular domain adaptation task was not experimented with in the respective method. CUDA corresponds to our best results obtained with the best hyper-parameter settings. The remaining (Ours) rows represent different training configurations, combining source supervised, target supervised, target unsupervised, source unsupervised, and adversarial regularization losses. We exclude [french2018selfensembling, shu2018a, hosseini-asl2018augmented] from the comparison as they use heavy data augmentation.
Fig. 64: t-SNE [vandermaaten2008visualizing] plots of the embeddings from the output of the contradistinguisher for the samples from the Office-31 [DBLP:conf/eccv/SaenkoKFD10] dataset in the high resolution visual tasks trained with CUDA. The first two rows correspond to ResNet-50 as the encoder; the last two rows correspond to ResNet-152 as the encoder. We can observe clear class-wise clustering among all the 31 classes of the Office-31 dataset. We achieve high accuracies in spite of having only a few hundred training samples in each domain. (Best viewed in color.)
Fig. 71: Confusion matrices corresponding to the target domain accuracy of CUDA in Table IV. The confusion matrices for the real-world domains (DSLR and WEBCAM) indicate the lower complexity involved in real-world target domains, while the confusion matrix for the synthetic domain (AMAZON) indicates the higher complexity involved in the synthetic target domain. (Best viewed in color.)
Method
DAN [DBLP:conf/icml/LongC0J15] 78.6 80.5 63.6 97.1 62.8 99.6 80.3
RTN [DBLP:conf/nips/LongZ0J16] 77.5 84.5 66.2 96.8 64.8 99.4 81.5
JAN [DBLP:conf/icml/LongZ0J17] 84.7 85.4 68.6 97.4 70.0 99.8 84.3
Rozantsev et al. [rozantsev2018beyond] 75.5 75.8 55.7 96.7 57.6 99.6 76.8
RevGrad [pmlr-v37-ganin15] 79.7 82.0 68.2 96.9 67.4 99.1 82.2
ADDA [8099799] 77.8 86.2 69.5 96.2 68.9 98.4 82.8
G2A [DBLP:conf/cvpr/Sankaranarayanan18a] 87.7 89.5 72.8 97.9 71.4 99.8 86.5
CDAN [NIPS2018_7436] 92.9 94.1 71.0 98.6 69.3 100.0 87.6
DICE [liang2018aggregating] 68.5 72.5 58.1 97.2 60.3 100.0 76.1
CUDA (Ours) 97.0 98.5 76.0 99.1 76.0 100.0 91.1
(Ours) (fine-tune ResNet-50) 41.0 38.7 23.2 80.6 25.6 94.2 50.6
(Ours) (fixed ResNet-50) 82.0 77.9 68.4 97.2 67.1 100.0 82.1
(Ours) (fixed ResNet-50) 95.0 93.8 71.5 98.9 73.3 99.4 88.7
(Ours) (fixed ResNet-50) 96.0 95.6 69.5 99.1 70.7 100.0 88.5
(Ours) (fixed ResNet-50) 92.8 91.6 72.5 98.4 72.8 99.8 88.0
(Ours) (fixed ResNet-50) 91.8 95.6 73.2 98.0 74.7 100.0 88.9
(Ours) (fixed ResNet-152) 84.9 82.8 70.3 98.2 71.1 100.0 84.6
(Ours) (fixed ResNet-152) 97.0 94.3 73.9 99.0 75.5 100.0 90.0
(Ours) (fixed ResNet-152) 95.6 95.6 73.8 98.7 74.3 100.0 89.7
(Ours) (fixed ResNet-152) 97.0 97.4 76.0 98.6 75.1 99.8 90.7
(Ours) (fixed ResNet-152) 95.4 98.5 75.0 98.9 76.0 100.0 90.6
TABLE IV: Target domain accuracy (%) on the high resolution Office-31 [DBLP:conf/eccv/SaenkoKFD10] dataset containing three domains. CUDA corresponds to our best results obtained with the best hyper-parameter settings. The remaining (Ours) rows represent different training configurations, combining source supervised, target unsupervised, source unsupervised, source adversarial regularization, and target adversarial regularization losses.
Fig. 78: Top row: t-SNE [vandermaaten2008visualizing] plots of the embeddings from the output of the contradistinguisher for the samples from the Office-31 [DBLP:conf/eccv/SaenkoKFD10] dataset in the high resolution visual tasks trained with CUDA, with ResNet-50 as the encoder, in a multi-source domain adaptation setting. Bottom row: confusion matrices of the target domain corresponding to the respective t-SNE plots in the top row, as indicated in Table V. (Best viewed in color.)
Setting Method
Best single source DAN [DBLP:conf/icml/LongC0J15] 97.1 63.6 99.6 86.8
RTN [DBLP:conf/nips/LongZ0J16] 96.8 66.2 99.4 87.5
JAN [DBLP:conf/icml/LongZ0J17] 97.4 70.0 99.8 89.1
Rozantsev et al. [rozantsev2018beyond] 96.7 57.6 99.6 84.6
mDA-layer [8792192] 94.5 64.9 94.9 84.8
RevGrad [pmlr-v37-ganin15] 96.9 68.2 99.1 88.1
ADDA [8099799] 96.2 69.5 98.4 88.0
G2A [DBLP:conf/cvpr/Sankaranarayanan18a] 97.9 72.8 99.8 90.2
CDAN [NIPS2018_7436] 98.6 71.0 100.0 89.8
DICE [liang2018aggregating] 97.2 60.3 100.0 85.8
CUDA (Ours) 99.1 74.7 100.0 91.3
Multi-source DAN [DBLP:conf/icml/LongC0J15] 95.2 53.4 98.8 82.5
mDA-layer [8792192] 94.6 62.6 93.7 83.6
DIAL [carlucci2017just] 94.3 62.5 93.8 83.5
RevGrad [pmlr-v37-ganin15] 96.2 54.6 98.8 83.2
DCTN [xu2018deep] 96.9 54.9 99.6 83.8
SGF [gopalan2011domain] 52.0 28.0 39.0 39.7
sFRAME [xie2015learning] 52.2 32.1 54.5 46.3
CUDA (Ours) 99.5 73.6 99.8 91.0
(Ours) 95.6 68.1 99.2 87.6
(Ours) 99.4 72.1 99.8 90.4
(Ours) 98.9 70.3 99.4 89.5
(Ours) 99.5 73.3 99.6 90.8
(Ours) 99.4 73.6 99.2 90.7
TABLE V: Target domain accuracy (%) on the high resolution Office-31 [DBLP:conf/eccv/SaenkoKFD10] dataset under the multi-source domain adaptation setting, obtained by combining two domains into a single source domain and using the remaining domain as the target domain, with ResNet-50 as the encoder. CUDA corresponds to our best results obtained with the best hyper-parameter settings. The remaining (Ours) rows represent different training configurations, combining source supervised, target unsupervised, source unsupervised, source adversarial regularization, and target adversarial regularization losses.
Fig. 91: t-SNE [vandermaaten2008visualizing] plots of the embeddings from the output of the contradistinguisher, after applying softmax, for the test samples in the language domain adaptation experiments. (Best viewed in color.)
Method
VFAE [DBLP:journals/corr/LouizosSLWZ15] 79.90 79.20 81.60 75.50 78.60 82.20 72.70 76.50 85.00 72.00 73.30 83.80 78.35
CMD [2017arXiv170208811Z] 80.50 78.70 81.30 79.50 79.70 83.00 74.40 76.30 86.00 75.60 77.50 85.40 79.82
DANN [ganin2016domain] 78.40 73.30 77.90 72.30 75.40 78.30 71.30 73.80 85.40 70.90 74.00 84.30 76.27
ATT [pmlr-v70-saito17a] 80.70 79.80 82.50 73.20 77.00 82.50 73.20 72.90 86.90 72.50 74.90 84.60 78.39
MT-Tri [DBLP:conf/acl/PlankR18] 78.14 81.45 82.14 74.86 81.45 82.14 74.86 78.14 82.14 74.86 78.14 81.45 79.14
CUDA (Ours) 82.77 83.07 85.58 80.02 82.06 85.70 75.88 76.05 87.30 73.08 73.06 86.66 80.93
(Ours) 81.07 75.11 77.53 77.67 75.99 79.78 73.12 74.48 86.19 72.59 76.24 85.92 77.97
(Ours) 83.83 87.19 89.05 84.08 87.19 89.05 84.08 83.83 89.05 84.08 83.83 87.19 86.03
(Ours) 81.99 81.45 84.36 77.18 81.48 84.37 67.26 67.71 87.30 70.68 71.97 84.79 78.37
(Ours) 82.63 81.73 83.75 75.88 77.45 80.96 69.70 70.69 87.37 72.99 67.76 84.51 77.91
(Ours) 82.77 83.07 85.58 80.02 82.06 85.70 75.88 76.05 87.30 73.08 73.06 86.66 80.93
(Ours) 80.37 80.20 84.58 78.45 81.36 85.03 75.05 75.01 87.47 72.63 71.97 86.31 79.86
TABLE VI: Target domain test accuracy (%) on the Amazon customer reviews dataset for sentiment analysis. CUDA corresponds to our best results obtained with the best hyper-parameter settings. The remaining (Ours) rows represent different training configurations, combining source supervised, target supervised, target unsupervised, source unsupervised, source adversarial regularization, and target adversarial regularization losses.

For our domain adaptation experiments, we consider both synthetic and real-world datasets. Under synthetic datasets, we experiment using 2D blobs with different source and target domain probability distributions to demonstrate the effectiveness of the proposed method under different domain shifts. Under real-world datasets, we consider both visual and language datasets, as detailed in Table I and Table II respectively, to further demonstrate the input data format independence of the proposed method.

4.2.1 Low Resolution Visual Datasets

In the low resolution visual experiments, we consider eight benchmark visual datasets covering three different kinds of images: digits, objects, and traffic signs. These experiments are grouped as one set because all these datasets have low resolution images and generally a large number of training samples. For these two reasons, there is no need to use any pre-trained network, and the entire setup can be trained from scratch using the large number of training samples from the source and target domains combined.

(a) Digits: USPS [lecun1989backpropagation] and MNIST [lecun1998gradient] form the gray-scale digits datasets. SVHN [37648] and SYNNUMBERS [pmlr-v37-ganin15] form the RGB digits datasets. Fig. 36(a) shows illustrations of images from the above mentioned digits datasets. The data processing of the digits datasets is done as follows: (i) USPS↔MNIST: USPS images are up-scaled using bi-linear interpolation from 16×16×1 to 28×28×1 to match the size of MNIST; (ii) SVHN↔MNIST: MNIST images are up-scaled using bi-linear interpolation to 32×32×1, and the RGB channels of SVHN are converted to mono images, resulting in a 32×32×1 size; several other combinations were tried, but this combination produced the best results; (iii) SYNNUMBERS→SVHN: no pre-processing is required, as these domains have the same image size. (b) Objects: CIFAR [krizhevsky2009learning] and STL [coates2011analysis] are datasets of RGB images of objects/animals. We consider only the 9 overlapping classes from the original datasets, excluding the ‘frog’ class from CIFAR and the ‘monkey’ class from STL. Fig. 36(b) shows illustrations of images from the above mentioned objects datasets. The data processing of the objects datasets is done as follows: STL images are down-scaled from 96×96×3 to 32×32×3 to match the size of CIFAR images. (c) Traffic Signs: SYNSIGNS [pmlr-v37-ganin15] and GTSRB [Stallkamp-IJCNN-2011] are datasets depicting traffic signs. Fig. 36(c) shows illustrations of images from the above mentioned traffic signs datasets. In both datasets, images are cropped to 40×40×3 based on the region of interest in the images.
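As a concrete illustration, the resizing described above could be expressed with torchvision transforms; the exact pipeline (interpolation flags, ordering) is our assumption, not the paper's published configuration:

```python
from torchvision import transforms
from torchvision.transforms import InterpolationMode

# USPS 16x16x1 -> 28x28x1 (bi-linear), to match MNIST.
usps_to_mnist = transforms.Compose([
    transforms.Resize((28, 28), interpolation=InterpolationMode.BILINEAR),
    transforms.ToTensor(),
])

# SVHN 32x32x3 -> 32x32x1 mono; MNIST is separately up-scaled to 32x32x1.
svhn_to_mono = transforms.Compose([
    transforms.Grayscale(num_output_channels=1),
    transforms.ToTensor(),
])
mnist_to_32 = transforms.Compose([
    transforms.Resize((32, 32), interpolation=InterpolationMode.BILINEAR),
    transforms.ToTensor(),
])
```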

We use the same neural network architecture as used in SE [french2018selfensembling], without any data augmentation, for the low resolution visual datasets. The networks are trained from scratch, as the number of training samples is high relative to the high resolution visual datasets, where we use pre-trained networks to extract features. We use the same hyper-parameters as in SE [french2018selfensembling], with minor modifications where necessary, in order to demonstrate the effectiveness of the proposed approach. Note that, unlike [french2018selfensembling], we do not perform any image data augmentation in our experiments. Our aim in this paper is to demonstrate that the proposed method performs above or on par with the standard domain alignment methods without data augmentation, as data augmentation is expensive and not always possible, as seen in language tasks. We show that even without any domain specific centering or data augmentation, we still achieve the best results, as the contradistinguish loss is able to classify directly on the target domain by learning the most contrastive features in that domain.

4.2.2 High Resolution Visual Datasets

In the high resolution visual experiments, we consider the Office-31 [DBLP:conf/eccv/SaenkoKFD10] dataset. Unlike the low resolution visual datasets, here we have only a few hundred training samples per domain, which makes this an even more challenging task.

Office objects: The Office-31 [DBLP:conf/eccv/SaenkoKFD10] dataset consists of high resolution images of objects belonging to 31 classes, obtained from three different domains: AMAZON, DSLR, and WEBCAM. Fig. 37 shows illustrations of images from all three above mentioned domains of the Office-31 dataset. The AMAZON domain consists of synthetic images with a clear white background. The DSLR and WEBCAM domains consist of real images with noisy backgrounds and surroundings. We consider all six possible combinatorial domain adaptation tasks involving the three domains, i.e., AMAZON→DSLR, AMAZON→WEBCAM, DSLR→AMAZON, DSLR→WEBCAM, WEBCAM→AMAZON, and WEBCAM→DSLR. Compared to the low resolution visual datasets, the Office-31 domain adaptation tasks have increased complexity due to the small number of training images.

To alleviate the lack of a large number of training samples, pre-trained networks such as ResNet-50 [he2016deep] and ResNet-152 [he2016deep] are used to extract 2048-dimensional features from the high resolution images, similar to CDAN [NIPS2018_7436]. Since the images are not well centered and have a high resolution, we use the standard ten-crop of each image to extract features during training and testing, also similar to CDAN [NIPS2018_7436].

The use of pre-trained models leads to two choices of training. (i) Fine-tune the pre-trained model used as the feature extractor along with the final classifier layer: this requires careful selection of several hyper-parameters, such as the learning rate, learning rate decay, and batch size, to fine-tune the network to the current dataset while preserving the ability of the pre-trained network. We observed that fine-tuning also depends on the loss function used for training [DBLP:conf/iclr/JacobsenBZB19]; in our case, the use of the contradistinguish loss greatly affected the pre-trained model, as it was originally trained only using the cross-entropy loss. Fine-tuning is also computationally expensive and time-consuming, as each iteration requires computing the gradients of all the parameters of the pre-trained model. (ii) Fix the pre-trained model and only train the final classifier layer: the alternative to fine-tuning is to fix the pre-trained model and use it only as a feature extractor. This approach has multiple benefits: (a) the computational time and cost of fine-tuning the parameters of the pre-trained model are alleviated; (b) since the extractor is fixed, the features need to be extracted and stored locally only once, instead of being recomputed every iteration, thus reducing the training time as only the classifier needs to be trained.
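A sketch of the second option, fixing a pre-trained ResNet-50 as the feature extractor and caching features once; the ten-crop averaging and all names here are our illustrative choices, not the paper's exact pipeline:

```python
import torch
from torchvision import models

# Pre-trained ResNet-50 with the final fc layer removed: 2048-d features.
backbone = models.resnet50(pretrained=True)
backbone.fc = torch.nn.Identity()
backbone.eval()                          # fixed: used only as a feature extractor

@torch.no_grad()
def extract_features(ten_crop_batch):
    """ten_crop_batch: (b, 10, 3, 224, 224) ten-crop images -> (b, 2048) features."""
    b, n, c, h, w = ten_crop_batch.shape
    feats = backbone(ten_crop_batch.view(b * n, c, h, w))   # (b*10, 2048)
    return feats.view(b, n, -1).mean(dim=1)                 # average over the ten crops
```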

4.2.3 Language Datasets

We consider four benchmark language domains from the Amazon customer reviews [blitzer2006domain] dataset: (i) Books, (ii) DVDs, (iii) Electronics, and (iv) Kitchen Appliances. The dataset includes product reviews from the four domains, labeled for the sentiment analysis task, as indicated in Table II.

On these domains, we consider all twelve combinations of domain adaptation tasks studied in [ganin2016domain, DBLP:journals/corr/LouizosSLWZ15, 2017arXiv170208811Z, pmlr-v70-saito17a, DBLP:conf/acl/PlankR18]. We use the same neural networks and text pre-processing as [Chen:2012:MDA:3042573.3042781, ganin2016domain, DBLP:conf/acl/PlankR18] to obtain a 5000 dimensional feature vector using marginalizing Stacked Linear Denoising Autoencoders (mSLDA) [chen2015marginalizing], an improvement over the vanilla Stacked Denoising Autoencoder (SDA) [glorot2011domain]. We assign the binary label '0' to reviews with low star ratings and '1' to reviews with high star ratings.
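As a small illustration of this data preparation, the sketch below maps star ratings to binary labels and packs the mSLDA features into a dataset. The rating threshold used here is an assumption for illustration only:

    import torch
    from torch.utils.data import TensorDataset

    # Illustrative labeling sketch: star ratings mapped to binary sentiment
    # labels. The threshold of 3 stars is an assumption, not taken from the text.
    def star_to_label(stars: float, threshold: float = 3.0) -> int:
        return 1 if stars > threshold else 0  # assumed: higher ratings -> positive

    def make_review_dataset(features: torch.Tensor, stars: torch.Tensor) -> TensorDataset:
        # features: (N, 5000) mSLDA vectors; stars: (N,) raw ratings
        labels = torch.tensor([star_to_label(s.item()) for s in stars])
        return TensorDataset(features, labels)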

We select the best existing neural networks without major modifications to hyper-parameters, so as to demonstrate the effectiveness of CUDA. All experiments were done using PyTorch [paszke2017automatic] with a mini-batch size of 64 per GPU distributed over four GPUs; the Adam optimizer was used, with an initial learning rate that is decayed every 30 epochs.
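This optimization setup could look roughly as follows in PyTorch. The learning rate and decay factor are placeholders, since their exact values are not restated here, and classifier, dataset and num_epochs are assumed to be defined:

    import torch
    from torch.utils.data import DataLoader

    # Sketch of the optimization setup: Adam with a step decay every 30 epochs
    # and a mini-batch of 64 per GPU over four GPUs. lr and gamma are placeholders.
    model = torch.nn.DataParallel(classifier, device_ids=[0, 1, 2, 3]).cuda()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # placeholder value
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

    loader = DataLoader(dataset, batch_size=64 * 4, shuffle=True)  # 64 per GPU
    for epoch in range(num_epochs):
        for x, y in loader:
            optimizer.zero_grad()
            loss = torch.nn.functional.cross_entropy(model(x.cuda()), y.cuda())
            loss.backward()
            optimizer.step()
        scheduler.step()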

4.3 Experimental Results

We use the same evaluation metric as in [pmlr-v37-ganin15, NIPS2016_6544, 10.1007/978-3-319-46493-0_36, pmlr-v70-saito17a, DBLP:conf/iccv/HausserFMC17, 8099799, DBLP:conf/cvpr/Sankaranarayanan18a, DBLP:conf/cvpr/LiuYFWCW18, 8578490, Russo_2018_CVPR, pmlr-v80-hoffman18a, xie2018learning, NIPS2018_7436, DBLP:conf/aaai/ChenCJJ19, french2018selfensembling, shu2018a, hosseini-asl2018augmented, ganin2016domain, DBLP:journals/corr/LouizosSLWZ15, 2017arXiv170208811Z, DBLP:conf/acl/PlankR18, Bousmalis:2016:DSN:3157096.3157135], i.e., the accuracy on the target domain test set, for the low resolution visual dataset experiments. Table III reports the target domain test accuracy across all eight low resolution visual domain adaptation tasks described earlier, compared with several state-of-the-art domain alignment methods [pmlr-v37-ganin15, NIPS2016_6544, 10.1007/978-3-319-46493-0_36, pmlr-v70-saito17a, DBLP:conf/iccv/HausserFMC17, 8099799, DBLP:conf/cvpr/Sankaranarayanan18a, DBLP:conf/cvpr/LiuYFWCW18, 8578490, Russo_2018_CVPR, pmlr-v80-hoffman18a, xie2018learning, NIPS2018_7436, DBLP:conf/aaai/ChenCJJ19, french2018selfensembling, shu2018a, hosseini-asl2018augmented, Bousmalis:2016:DSN:3157096.3157135]. In contrast to the low resolution visual datasets, the high resolution Office-31 [DBLP:conf/eccv/SaenkoKFD10] dataset has no separate pre-defined train and test splits. Since we do not use any labels from the target domain during training, we report in Table IV the ten-crop test accuracy on the target domain, obtained by summing the softmax values of all ten crops of an image and assigning the label with the maximum aggregate softmax value, as in CDAN [NIPS2018_7436]. In Table V, we report the target domain accuracy, similar to Table IV, in a multi-source domain adaptation setting, obtained by combining two domains into a single labeled source domain with the remaining domain as the unlabeled target domain. Table VI reports the target domain test accuracy across all twelve language domain adaptation tasks, compared with different state-of-the-art methods [ganin2016domain, DBLP:journals/corr/LouizosSLWZ15, 2017arXiv170208811Z, pmlr-v70-saito17a, DBLP:conf/acl/PlankR18].
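The ten-crop evaluation described above can be sketched as follows; model and the tensors are placeholders, and only the aggregation rule (summing softmax scores over crops) is taken from the text:

    import torch

    # Sketch of ten-crop evaluation: sum the softmax outputs of all ten crops
    # of an image and predict the class with the largest aggregate score.
    @torch.no_grad()
    def ten_crop_accuracy(model, crop_features, labels):
        # crop_features: (N, 10, D) per-crop features; labels: (N,) ground truth
        n, c, d = crop_features.shape
        logits = model(crop_features.view(n * c, d)).view(n, c, -1)
        scores = torch.softmax(logits, dim=-1).sum(dim=1)  # aggregate over crops
        return (scores.argmax(dim=-1) == labels).float().mean().item()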

Apart from the standard domain alignment methods used for comparison, we report the performance of two baselines of our own in Tables III-VI, obtained by fixing the contradistinguisher neural network architecture and varying only the training losses. The source-only baseline trains the contradistinguisher using only the source domain in a fully supervised way, while the target-only baseline trains it using only the target domain in a fully supervised way; these indicate, respectively, the minimum and maximum target domain test accuracy attainable with the chosen contradistinguisher network. Comparing CUDA with the source-only baseline in Tables III-VI, we see large improvements in target domain test accuracy due to the use of the contradistinguish loss (5), demonstrating the effectiveness of the contradistinguisher. As our method depends mainly on the contradistinguish loss (5), experimenting with better neural networks alongside it yielded improved results over the networks of [8099799, 8578490] on the low resolution visual experiments. We used a ResNet with four fully connected layers instead of AlexNet for the high resolution visual experiments and the Multinomial Adversarial Network (MAN) [DBLP:conf/naacl/ChenC18] for the language experiments.

4.4 Analysis of Experimental Results

4.4.1 Low Resolution Visual Experimental Results

In some tasks in Table III, the performance of the target-only baseline falls below that of CUDA. The target-only baseline is poor in these cases because it over-fits when only the target domain supervised loss is used. The improved results of CUDA indicate that the contradistinguisher is able to contradistinguish on the target domain while also transferring the informative knowledge required for classification from the larger source domain. This shows that the contradistinguisher is indeed successful in contradistinguishing on a relatively small unlabeled target domain using information from a larger source domain.

Another interesting observation is that in one task, the target-only baseline is slightly better than CUDA. This is due to slight over-fitting on target domain training examples that are non-informative for classification, leading to a small decrease in target domain test accuracy. The source-only baseline outperforms the target-only baseline in certain tasks, indicating that the source domain carries more information than the target domain, owing to the large source and small target training sets.

Figs. (a)-(d) show t-SNE [vandermaaten2008visualizing] plots of the embeddings from the output of the contradistinguisher, before applying softmax, for the test samples as contradistinguisher training progresses using CUDA. We single out these plots because the corresponding task is the most difficult among all the visual experiments, due to its contrasting domains. Figs. (a)-(h) show t-SNE [vandermaaten2008visualizing] plots of the same embeddings for the test samples across the low resolution visual experiments; they show clear class-wise clustering on both source and target domains, indicating the efficacy of CUDA.

As an ablation study, keeping the neural network and all hyper-parameters the same, we investigated the effect of each of the loss functions, i.e., the source supervised loss (1), the target unsupervised loss (5), and the target adversarial loss (9), and report the results in Table III. Since the contradistinguish loss (5) only requires unlabeled inputs, we observed that additionally applying it to the source domain without labels only complements the contradistinguisher's performance. An important observation is that when the source adversarial loss (9) is used alone, without the target adversarial loss (9), it always decreases the target domain test accuracy. An explanation for this behavior is that an adversarial input in the source domain might be a real input in the target domain, so indifferently assigning such an input to all classes would sometimes add noise to the pseudo-labels. It should be noted that the combination of the source domain supervised loss and the target domain contradistinguish loss always improves on the source domain supervised loss alone, indicating the efficacy of the target domain unsupervised contradistinguish loss (5) in the proposed approach CUDA.
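The ablation can be pictured as switching individual terms of the combined objective on and off. In the hedged sketch below, contradistinguish_loss and adversarial_loss are stubs standing in for Eqs. (5) and (9), which are defined earlier in the paper and not reproduced here:

    import torch
    import torch.nn.functional as F

    # Hedged sketch of the ablation in Table III: each optional loss term can
    # be toggled independently. The two stub functions stand in for Eqs. (5)
    # and (9) of the paper; they are not implemented here.
    def total_loss(model, src_x, src_y, tgt_x,
                   use_src_ce=True, use_contradist=True, use_adversarial=False):
        loss = torch.zeros(())
        if use_src_ce:        # Eq. (1): source supervised cross-entropy
            loss = loss + F.cross_entropy(model(src_x), src_y)
        if use_contradist:    # Eq. (5): unsupervised contradistinguish loss (stub)
            loss = loss + contradistinguish_loss(model, tgt_x)
        if use_adversarial:   # Eq. (9): adversarial loss (stub)
            loss = loss + adversarial_loss(model, src_x, tgt_x)
        return loss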

4.4.2 High Resolution Visual Experimental Results

We report the standard ten-crop accuracy on the target domain images, as reported by several state-of-the-art domain adaptation methods [NIPS2018_7436, DBLP:conf/cvpr/Sankaranarayanan18a, DBLP:conf/icml/LongZ0J17]. Since there is no explicit test split specified in the dataset and no labels from the target domain are used during training, it is common to report the ten-crop accuracy over the whole target domain.

In Table IV, we report accuracies obtained by fine-tuning ResNet-50 using the learning rate schedule followed in CDAN [NIPS2018_7436], and also without fine-tuning ResNet-50. Figs. (a)-(f) show the t-SNE plots of the softmax output after aggregating the ten crops of each image, corresponding to the training configurations reported in Table IV. Apart from the fixed ResNet-50, we also report accuracies with a fixed ResNet-152 in Table IV for comparison; Figs. (g)-(l) show the corresponding t-SNE plots. Fig. 64 reports the t-SNE plots of the training settings using the ResNet-50 and ResNet-152 encoders with the highest mean accuracy over all six domain adaptation tasks. We clearly observe that CUDA outperforms several state-of-the-art methods that also use ResNet-50, and improves further with the ResNet-152 encoder.
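The t-SNE plots themselves can be produced along the following lines; this is a generic scikit-learn sketch with placeholder arrays, not the exact plotting code used for the figures:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.manifold import TSNE

    # Generic sketch of the t-SNE visualization: aggregated ten-crop softmax
    # outputs are embedded into 2-D and colored by class label.
    def plot_tsne(softmax_outputs: np.ndarray, labels: np.ndarray, path: str):
        # softmax_outputs: (N, num_classes); labels: (N,) integer class ids
        emb = TSNE(n_components=2, init="pca", random_state=0).fit_transform(softmax_outputs)
        plt.figure(figsize=(6, 6))
        plt.scatter(emb[:, 0], emb[:, 1], c=labels, cmap="tab10", s=4)
        plt.savefig(path, bbox_inches="tight")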

Among the three domains in the Office-31 [DBLP:conf/eccv/SaenkoKFD10] dataset, AMAZON can be considered a well curated synthetic domain with clear backgrounds, while DSLR and WEBCAM are uncurated real-world domains with noisy backgrounds and surroundings. We order the six domain adaptation tasks by complexity, from low to high: (i) Figs. (c), (f), (i) and (l) show the highest accuracies, as these are real-world to real-world domain adaptation tasks; (ii) Figs. (a), (b), (g) and (h) show moderately high accuracies, as these are synthetic to real-world tasks; and (iii) Figs. (d), (e), (j) and (k) show the lowest accuracies among the six tasks, as these are real-world to synthetic tasks. Fig. 71 reiterates the above observations involving the synthetic and real-world domains. mDA-layer [mancini2018boosting, 8792192] reports the target domain accuracy after unifying the remaining domains into a single source domain. This is an easier task than ours, because having at least one real-world domain as the source boosts performance heavily, as indicated in Figs. (c), (f), (i) and (l). Even in this multi-source domain setting, CUDA outperforms [mancini2018boosting, 8792192]; a sketch of how such a multi-source dataset can be assembled follows.
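In the sketch below, the three dataset objects are hypothetical placeholders for the Office-31 domains:

    from torch.utils.data import ConcatDataset, DataLoader

    # Sketch of the multi-source setting: two labeled domains are unified into
    # one source dataset; the remaining domain serves as the unlabeled target.
    # The dataset objects are hypothetical placeholders for the Office-31 domains.
    source = ConcatDataset([amazon_dataset, dslr_dataset])  # combined labeled source
    target = webcam_dataset                                 # unlabeled target domain

    source_loader = DataLoader(source, batch_size=64, shuffle=True)
    target_loader = DataLoader(target, batch_size=64, shuffle=True)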

We also extend the experiments to multi-source domain adaptation on the Office-31 [DBLP:conf/eccv/SaenkoKFD10] dataset. In Table V, we can clearly observe that in some tasks, multi-source domain adaptation provides better results than the respective best single-source domain adaptation experiments. However, in case of