1 Introduction
Visual recognition has seen vast improvements based mainly on the success of deep learning based models [Krizhevsky et al.(2012)Krizhevsky, Sutskever, and Hinton]. These models are trained on very large annotated datasets such as ImageNet [Russakovsky et al.(2015)Russakovsky, Deng, Su, Krause, Satheesh, Ma, Huang, Karpathy, Khosla, Bernstein, Berg, and Fei-Fei]. The deployment of these generically trained models requires them to adapt to work in specific settings (for instance, with catalog images on e-commerce websites). This problem is recognized as one of dataset bias and was demonstrated through the work of [Torralba and Efros(2011)]. However, the requirement of a large annotated dataset becomes a bottleneck for training networks in deep learning frameworks. In this paper, we tackle the problem of adapting classifiers to work on datasets that do not have any labeled information. This is a problem of unsupervised domain adaptation in a more general setting. Ganin and Lempitsky [Ganin and Lempitsky(2015)] proposed a method to solve unsupervised domain adaptation through backpropagation. In this method, the domain adaptation problem is solved by using a discriminator that ensures domain invariance of the learned representations used for classification. Several methods [Tzeng et al.(2017)Tzeng, Hoffman, Saenko, and Darrell, Hoffman et al.(2017)Hoffman, Tzeng, Park, Zhu, Isola, Saenko, Efros, and Darrell, Shen et al.(2017)Shen, Qu, Zhang, and Yu] have been proposed for improving the discriminator. However, most of these involve an increase in the number of parameters. For instance, a recent work by Pei et al. (MADA) [Pei et al.(2018)Pei, Cao, Long, and Wang] addresses this issue through class-specific discriminators. This leads to a linear increase in the number of parameters with the number of classes in the dataset. In contrast, we propose the use of a curriculum-based dropout discriminator to obtain improved performance on the domain adaptation task without increasing the number of parameters. This makes our model's applicability comprehensive, as it can also adapt to datasets with a large number of classes.
Specifically, in this paper, we propose Curriculum based Dropout Discriminator for Domain Adaptation (CD³A) and compare it with a variant, Dropout Discriminator for Domain Adaptation (D³A). CD³A is a novel approach that solves the above problem through an adversarial dynamic dropout based ensemble of discriminators, where we consider dropout as a source of an ensemble of domain classifiers [Hara et al.(2016)Hara, Saitoh, and Shouno]. The proposed model also enables the discriminators to reduce the prediction variance, remove overfitting, and average out the bias. The idea for this discriminator is illustrated in Figure 1. The initial discriminator by Ganin and Lempitsky [Ganin and Lempitsky(2015)] uses a single binary discriminator, and MADA [Pei et al.(2018)Pei, Cao, Long, and Wang] extends it to class-specific cues. In contrast, CD³A obtains a discriminator distribution that provides much-improved feedback for improving the feature extractor. The performance of any adversarial learning method largely depends upon the capability of the discriminator network. The ensemble method [Hara et al.(2016)Hara, Saitoh, and Shouno] improves the discriminator's performance and makes it robust. We show that this indeed helps in improved domain adaptation (around 5.3% improvement in Amazon→DSLR adaptation) with far fewer parameters (59M) than MADA (98M). More importantly, our method does not increase the number of parameters as the number of classes increases, making it scalable to datasets with a large number of classes. Through this paper we make the following main contributions:

We propose a method to obtain a dropout-based discriminator that provides a distribution-based discrimination for every sample, ensuring more robust feature adaptation.

We adopt a curriculum-based dropout model, CD³A, that gradually increases the number of sampled discriminators as the adaptation progresses to ensure better adaptation, in contrast to a dropout distribution based on a fixed number of samples (D³A).

We provide a thorough empirical analysis of the method (including statistical significance and discrepancy distance) and evaluate our approach against state-of-the-art approaches.
2 Related Work
Domain Adaptation: A large number of methods have been proposed to tackle the domain adaptation problem. The basic common structure that has been followed is the Siamese architecture [Bromley et al.(1994)Bromley, Guyon, LeCun, Säckinger, and Shah] with two streams representing the source and target models. It is trained with a classification loss combined with either a discrepancy loss or an adversarial loss. The classification loss depends on the source data labels, while the discrepancy loss reduces the shift between the two domains. A discrepancy-based deep learning method is that of deep domain confusion (DDC) [Tzeng et al.(2014)Tzeng, Hoffman, Zhang, Saenko, and Darrell]. The loss between a single FC (fully connected) layer of the source and target feature extractor networks is used to minimize the maximum mean discrepancy (MMD) between the source and the target. This approach is further extended by the deep adaptation network (DAN) [Long et al.(2015)Long, Cao, Wang, and Jordan]. Recently, a number of other methods have been proposed which use domain discrepancy [Saito et al.(2018b)Saito, Watanabe, Ushiku, and Harada, Zhang et al.(2018b)Zhang, Wang, Huang, and Nehorai, Sun and Saenko(2016), Sun et al.(2017)Sun, Feng, and Saenko, Sun et al.(2016)Sun, Feng, and Saenko, Shen et al.(2018)Shen, Qu, Zhang, and Yu, Long et al.(2017a)Long, Zhu, Wang, and Jordan, Rozantsev et al.(2018)Rozantsev, Salzmann, and Fua]. Similar ideas have also been applied in vision-and-language work [Patro and Namboodiri(2018), Patro et al.(2018a)Patro, Kumar, Kurmi, and Namboodiri].
Adversarial Learning: In the domain adaptation setting, an adversarial network provides domain invariant representations by making the source and target domains indistinguishable to the discriminator. Adversarial Discriminative Domain Adaptation [Tzeng et al.(2017)Tzeng, Hoffman, Saenko, and Darrell] uses an inverted-label GAN loss to split the optimization into two independent objectives. One such method is the domain confusion based model proposed in [Tzeng et al.(2015)Tzeng, Hoffman, Darrell, and Saenko] that considers a domain confusion objective. Domain-Adversarial Neural Networks (DANN) [Ganin and Lempitsky(2015)] integrates a gradient reversal layer into the standard architecture to promote the emergence of learned representations that are discriminative for the main learning task on the source domain and non-discriminative with respect to the shift between the domains. Recently, some works have been proposed which use an adversarial discriminative approach to solve the domain adaptation problem [Saito et al.(2018a)Saito, Ushiku, Harada, and Saenko, Hoffman et al.(2018)Hoffman, Tzeng, Park, Zhu, Isola, Saenko, Efros, and Darrell, Bousmalis et al.(2017a)Bousmalis, Silberman, Dohan, Erhan, and Krishnan, Zhang et al.(2018a)Zhang, Ding, Li, and Ogunbona, Chen et al.(2018)Chen, Liu, Wang, Wassell, and Chetty, Li et al.(2018)Li, Pan, Wang, and Kot, Kurmi et al.(2019)Kurmi, Kumar, and Namboodiri, Patro et al.(2018b)Patro, Kurmi, Kumar, and Namboodiri]. Similarly, the models proposed in [Bousmalis et al.(2017b)Bousmalis, Silberman, Dohan, Erhan, and Krishnan, Choi et al.(2017)Choi, Choi, and Kim] exploit GANs with the aim of generating source-domain images such that they appear as if drawn from the target domain distribution. The closest related work to our approach is that of [Pei et al.(2018)Pei, Cao, Long, and Wang], which extends the gradient reversal method with a class-specific discriminator.
Ensemble and Curriculum Learning: Ensemble methods [Lakshminarayanan et al.(2017)Lakshminarayanan, Pritzel, and Blundell] can capture the uncertainty of a neural network (NN). Gal et al. [Gal and Ghahramani(2016)] use dropout to obtain the predictive uncertainty and apply Markov chain Monte Carlo (MCMC) [Neal(2012)] at test time to deal with the intractable posterior. In discriminator based approaches, ensembles can be realized as multi-discriminator or multi-generator architectures. A multi-discriminator approach has also been proposed by [Nguyen et al.(2017)Nguyen, Le, Vu, and Phung, Ghosh et al.(2017)Ghosh, Kulharia, Namboodiri, Torr, and Dokania, Durugkar et al.(2016)Durugkar, Gemp, and Mahadevan] to learn the data distribution more effectively. In Bayesian GAN [Saatci and Wilson(2017)], dropout in the discriminator is used, which can be interpreted as an ensemble model [Gal and Ghahramani(2016)]. Curriculum learning [Bengio et al.(2009)Bengio, Louradour, Collobert, and Weston] enhances a model's performance and its generalization capability. The performance of a GAN is also improved through curriculum learning of the discriminator [Sharma et al.(2018)Sharma, Barratt, Ermon, and Pande]. It has been shown that dropout can also work with curriculum learning [Morerio et al.(2017)Morerio, Cavazza, Volpi, Vidal, and Murino]. In domain adaptation, a curriculum-style learning approach has been applied in [Zhang et al.(2017)Zhang, David, and Gong] to minimize the domain gap in semantic segmentation. The curriculum domain adaptation first solves easy tasks, such as estimating label distributions, then infers the necessary properties about the target domain. A theoretical framework for curriculum learning in transfer learning is proposed in [Weinshall et al.(2018)Weinshall, Cohen, and Amir]. Recently, other curriculum learning based domain adaptation methods have been proposed, such as Transferable Curriculum Learning [Shu et al.(2019)Shu, Cao, Long, and Wang]. In contrast to the previous works, the main contribution of the present work is to propose a curriculum based dropout discriminator. We show that through the proposed method, we are able to outperform state-of-the-art domain adaptation techniques in a scalable way, using fewer parameters than techniques such as MADA [Pei et al.(2018)Pei, Cao, Long, and Wang] and a similar number of parameters as GRL [Ganin and Lempitsky(2015)].
3 Motivation
In the adversarial domain adaptation problem, previous methods have used classical statistical inference in the discriminator: a single discriminator learns the source/target domain classification. Our hypothesis is that this may lead to overconfident inference and decisions, which in turn may pose challenges in learning invariant features. In the domain adaptation problem, data is generally structured in a multimodal distribution. Thus, a multiple-discriminator approach is compelling [Pei et al.(2018)Pei, Cao, Long, and Wang], due to its capacity to capture multiple modes of the dataset. It also mitigates the perennial problem of mode collapse (which GANs are infamous for), as multiple discriminators learn to distinguish classes with different modes. The diversity of an ensemble of such discriminators reduces random errors in prediction. The performance of an ensemble model rests on the number of entities in the ensemble. However, as the number of entities increases, the model parameters and complexity also increase. This is one of the primary bottlenecks of ensemble based methods, since the number of parameters in an algorithm is a significant factor in determining model efficiency.
To tackle the above problems, we propose a novel and efficient discriminator architecture using Monte Carlo (MC) sampling [Srivastava et al.(2014)Srivastava, Hinton, Krizhevsky, Sutskever, and Salakhutdinov]. We incorporate Bernoulli dropout in a multi-adversarial network by dropping out a certain number of neurons from our discriminator with some probability d. This gives rise to a set of dynamic discriminators for every data sample. The main idea behind our method is to construct a training regime for the feature extractor in domain adaptation that consists of increasingly challenging tasks to generate domain invariant features. This allows the sophistication of the feature extractor to gradually increase throughout training, rather than aiming for full sophistication at the outset. This method is similar to curriculum in supervised learning, where one orders the training examples presented to a learning algorithm according to some measure of difficulty [Bengio et al.(2009)Bengio, Louradour, Collobert, and Weston]. Despite the conceptual similarity, the methods are quite different: under our approach, it is not the difficulty of the training examples presented to either network, but rather the capacity, and hence strength, of the discriminator network that is increased as training progresses. The idea behind the use of a curriculum based dropout discriminator is to exploit the characteristics of several independent discriminators by consolidating them in order to achieve higher performance. We perform curriculum based learning on these dropout discriminators: as training proceeds, the number of discriminators sampled increases, thereby reducing the variance of our model's predictions. The proposed approach enforces the feature extractor network not to constrain the learned representations to satisfy a single discriminator, but instead to satisfy an ensemble of dynamic discriminators (the composition differs across discriminators). Instead of learning a point estimate (as in MADA [Pei et al.(2018)Pei, Cao, Long, and Wang]), the feature extractor network of our proposed model learns a distribution, due to the ensemble effect of feedback from a set of dynamic discriminators. This approach leads to a more generalized feature extractor, promoting resemblance between the learned representations of a class from different domains. The intuition behind incorporating dropout in our model is to ensure that neurons are not exclusively reliant on a precise set of other neurons to determine their outputs. Instead, each neuron relies on the aggregate behavior of several other neurons, promoting generalization. By applying dropout on the discriminator, we obtain a set of entirely dynamic discriminators, and hence the feature extractor cannot rely on a specific discriminator, or a specific ensemble of discriminators, to learn representations that deceive the discriminator. Instead, it must genuinely learn domain invariant representations. Thus, the feature extractor network is guided by diverse feedback from an ensemble of dynamic discriminators. All of this gain in performance is obtained without compromising on scalability or complexity in our proposed model.
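To make the Bernoulli-dropout ensemble idea concrete, the sketch below shows how a single set of discriminator weights yields many distinct sub-network "discriminators" via sampled dropout masks. This is a hypothetical, toy illustration (the paper's actual discriminator is a learned neural network; the layer width, weights, and helper names here are invented for the sketch):

```python
import random
import math

def sample_dropout_mask(width, p_drop, rng):
    """Sample a Bernoulli dropout mask: each hidden unit is kept with prob 1 - p_drop."""
    return [0.0 if rng.random() < p_drop else 1.0 for _ in range(width)]

def discriminator_forward(features, weights_in, weights_out, mask):
    """One MC-sampled discriminator: a tiny one-hidden-layer domain classifier.
    The dropout mask zeroes hidden units, yielding a distinct sub-network per sample."""
    hidden = []
    for j, w_row in enumerate(weights_in):
        pre = sum(w * x for w, x in zip(w_row, features))
        hidden.append(mask[j] * max(0.0, pre))          # ReLU then dropout
    logit = sum(w * h for w, h in zip(weights_out, hidden))
    return 1.0 / (1.0 + math.exp(-logit))               # predicted P(target domain)

def mc_dropout_ensemble(features, weights_in, weights_out, p_drop, n_samples, seed=0):
    """Draw n_samples dropout masks -> a distribution of domain predictions
    from a single set of discriminator parameters (no parameter growth)."""
    rng = random.Random(seed)
    width = len(weights_in)
    return [discriminator_forward(features, weights_in, weights_out,
                                  sample_dropout_mask(width, p_drop, rng))
            for _ in range(n_samples)]
```

Each call with a fresh mask behaves like a different member of the ensemble, which is why the feature extractor receives a distribution of feedback rather than a point estimate.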
4 Proposed Adaptation Model
In the unsupervised domain adaptation problem, we consider that the source dataset has access to all its labels, while there are no labels for the target dataset at training time. We assume that a source sample x^s comes from a source distribution p_s and a target sample x^t comes from a target distribution p_t, with n_s source data points and n_t unlabeled target data points; thus the source domain D_s = {(x_i^s, y_i^s)}_{i=1}^{n_s} has labeled examples and the target domain D_t = {x_j^t}_{j=1}^{n_t} has unlabeled examples. Our underlying assumption is that both distributions are complex and unknown. Our model provides a deep neural network that enables learning of transferable feature representations and an adaptive classifier, reducing the shift in the joint distributions across domains such that the target risk ε_t = Pr_{(x,y)∼q}[C(F(x)) ≠ y] is minimized by jointly minimizing the source risk and the distribution discrepancy via adversarial domain adaptation, where q is assumed to be the joint distribution of target samples. In this work, we employ a variant of GRL [Ganin and Lempitsky(2015)], where the discriminator is modeled as an MC-dropout based ensemble. The feature extractor network F consists of convolution layers that produce image embeddings; both source and target feature extractors share the same parameters. The classifier network C consists of fully connected layers. Only source embeddings are forwarded to the classifier network to predict the class label, so the classifier network parameters θ_C are updated only by the loss from source data samples. The discriminator receives both source and target embeddings, and the parameters θ_D of the MC-dropout discriminator are updated with the domain classification loss. The feature extractor parameters θ_F are updated by the gradients from the classifier network as well as by the reversed gradients of both source and target data samples from the dynamic ensemble of discriminators. The detailed architecture is presented in Figure 2.
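The gradient reversal mechanism referenced here can be sketched minimally as follows; this is a framework-free, hypothetical illustration (real implementations hook into an autograd engine, e.g., a custom autograd function), with the class name and trade-off weight name chosen by us:

```python
class GradientReversal:
    """Sketch of a gradient reversal layer (GRL): identity in the forward
    pass, gradients scaled by -lam in the backward pass. The reversed
    gradient pushes the feature extractor to *confuse* the domain
    discriminator rather than help it."""
    def __init__(self, lam):
        self.lam = lam  # trade-off weight between task and domain losses

    def forward(self, x):
        return x  # features pass through unchanged

    def backward(self, grad_from_discriminator):
        # Negate (and scale) the incoming gradient before it reaches
        # the feature extractor.
        return [-self.lam * g for g in grad_from_discriminator]
```

In a real framework the forward/backward pair would be registered with the autograd engine so the reversal happens transparently during backpropagation.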
For the adaptation task, the feature extractor learns domain-invariant features with the help of the MC-dropout based discriminator. For each data sample that goes to the discriminator, we obtain the domain classification loss. These losses are backpropagated through the respective Monte Carlo sampled dropout discriminators, followed by the gradient reversal layer. Hence, for every input, we obtain a distribution of gradients, and the feature extractor is updated by a gradient from this distribution to generate domain invariant features. In a binary discriminator [Ganin and Lempitsky(2015)], we obtain a point estimate of the gradient for a specific input. In the case of a multi-discriminator [Pei et al.(2018)Pei, Cao, Long, and Wang], we obtain an ensemble of point estimates of gradients. The advantage of obtaining a distribution of gradients is that we get generalized learned representations, robustly leading to domain invariant features. We propose the Curriculum based Dropout Discriminator (CD³A), where we increase the number of MC samples as training proceeds, in a paradigm similar to curriculum learning. In the other variant (D³A), we maintain a fixed number of MC-sampled discriminators throughout the training.
4.1 Curriculum based Dropout Discriminator for Domain Adaptation (CD³A)
In CD³A, the distribution of gradients is obtained through curriculum-fashioned training, i.e., we increase the number of MC samples as training proceeds. The motivation behind increasing the number of MC samples is that, in the initial phase of adaptation, the feature extractor learns domain invariant features without considering the multi-mode structure of the data; for this purpose, only a small number of discriminators is required. As training advances, we expect the network to learn the domain invariant features along with their multi-modal structure. Thus, in the proposed model, we increase the number of MC samples of the discriminator as training progresses, to obtain domain invariant features without losing the multi-mode structure. Given an input sample x, we obtain a feature embedding f = F(x) by passing it through the feature extractor F. These embeddings are further used to obtain the classification score C(f) and the domain classification scores D_k(f) for the N sampled discriminators, where k ∈ {1, …, N}. The curriculum learning of the discriminator does not rely on the difficulty of the training examples presented to either network, but rather on the capacity, and hence strength, of the discriminator, which is increased throughout training. We construct an ordered set of sets of sampled discriminators, increasing in number: more formally, the set of discriminators is S = {S_1, S_2, …, S_T}, where each S_t is a set of MC-sampled discriminators. S is ordered in terms of capacity, where capacity of S_1 ≤ capacity of S_2 ≤ … ≤ capacity of S_T.
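The growth of the number of sampled discriminators can be sketched as a simple schedule. The paper does not commit to a specific formula in this section, so the linear ramp and the function/parameter names below are our assumption, purely for illustration:

```python
def curriculum_num_discriminators(epoch, total_epochs, n_min=1, n_max=31):
    """Hypothetical curriculum schedule: the number of MC-sampled dropout
    discriminators grows from n_min at the start of training to n_max at
    the end, so the capacity of the discriminator ensemble increases as
    training progresses (here: a linear ramp, one possible choice)."""
    frac = min(max(epoch / float(total_epochs), 0.0), 1.0)
    return n_min + round(frac * (n_max - n_min))
```

Each epoch's value selects how many dropout masks (and hence sub-discriminators) are drawn, realizing the ordered sets S_1, S_2, … of increasing capacity.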
4.2 Fixed Sampling based Dropout Discriminator for Domain Adaptation (D³A)
In this variant, we fix the number of MC-sampled discriminators during training, obtaining an ensemble of discriminators. We call this variant Dropout Discriminator for Domain Adaptation (D³A). This modification can be considered a more efficient version of the multi-discriminator model. We experimented with different sampling values (details are reported in the supplementary material) and obtained the best results when the number of samples is chosen close to the number of classes in the target dataset.
4.3 Loss Function
Our loss function is composed of a classification loss and a domain classification loss. Our classifier takes the learned representations as input and predicts the label; the classification loss function L_y is a cross-entropy loss. The dropout discriminator is expected to label (output) source domain images as 0 and target domain images as 1. The domain classification loss L_d is a binary cross-entropy loss between the output of the discriminator and the expected output, summed over the number N of MC-sampled discriminators. N is increased as training proceeds in the case of the CD³A model, whereas it is fixed for the D³A model.

L_cls(θ_F, θ_C) = (1/n_s) Σ_{x_i ∈ D_s} L_y(C(F(x_i)), y_i)    (1)

L(θ_F, θ_C, θ_D) = L_cls − (λ/N) Σ_{k=1}^{N} Σ_{x_i ∈ D_s ∪ D_t} L_d(D_k(F(x_i)), d_i)    (2)

where d_i = 0 if x_i ∈ D_s and d_i = 1 if x_i ∈ D_t. The function F is the feature extractor network with shared weights for source and target data (F_s and F_t are denoted by the common shared network F). λ is the trade-off parameter between the two objectives, C is the classifier network, and D_k is the k-th MC-sampled dropout discriminator. D_s and D_t represent the source and target domains respectively.
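A small numeric sketch of how the two losses of Eqs. (1) and (2) combine; the helper names and toy probabilities are ours, and the domain loss is averaged over the N sampled discriminators as in the text:

```python
import math

def cross_entropy(probs, label):
    """Classification loss L_y for one source sample (probs over classes)."""
    return -math.log(probs[label])

def binary_cross_entropy(p_target, d):
    """Domain loss L_d: d = 0 for source, 1 for target; p_target = D_k(F(x))."""
    return -(d * math.log(p_target) + (1 - d) * math.log(1 - p_target))

def total_loss(class_probs, labels, domain_preds, domain_labels, lam):
    """Source classification loss minus the trade-off-weighted domain loss,
    averaged over the N MC-sampled discriminators (minus sign: the feature
    extractor receives the domain gradient through gradient reversal).
    domain_preds[k][i] is the k-th sampled discriminator's output on sample i."""
    l_cls = sum(cross_entropy(p, y) for p, y in zip(class_probs, labels)) / len(labels)
    n = len(domain_preds)
    l_dom = sum(binary_cross_entropy(domain_preds[k][i], d_i)
                for k in range(n)
                for i, d_i in enumerate(domain_labels)) / n
    return l_cls - lam * l_dom
```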
We generate the entities of the ensemble via dropout. In contrast, previous works [Pei et al.(2018)Pei, Cao, Long, and Wang] use multiple discriminators, their number being equal to the number of classes in the dataset. This leads to an increase in the number of parameters employed in the discriminator, which makes such approaches unsuitable for datasets with a large number of classes. Also, since our model has significantly fewer parameters, its data requirements are quite low; this is shown in the supplementary material, where we remove half of the source data and still obtain good accuracy. Furthermore, MADA uses the predicted label probabilities to weigh the discriminator's response. This is a drawback, as it can lead to misleading corrections of the feature extractor network in case of wrong predictions by the label predictor (classifier). Our model does not have such constraints, making our discriminator even more powerful and leading to better learning of domain invariant features by the feature extractor network. The implementation details are provided in the supplementary material, and other details are provided on the project page (https://delta-lab-iitk.github.io/CD3A/).
5 Results and Analysis
5.1 Datasets
Office-31 Dataset:
Office-31 [Saenko et al.(2010)Saenko, Kulis, Fritz, and Darrell] is a benchmark dataset for domain adaptation, comprising 4,110 images in 31 classes collected from three distinct domains: Amazon (A), Webcam (W) and DSLR (D). To enable unbiased evaluation, we evaluate all the methods on all 6 transfer tasks: A→W, D→A, W→A, A→D, D→W and W→D.
ImageCLEF Dataset:
The ImageCLEF-2014 dataset consists of 3 domains: Caltech-256 (C), ILSVRC 2012 (I), and Pascal VOC 2012 (P). There are 12 common classes, and each class has 50 samples, for a total of 600 images in each domain. We evaluate models on all 6 transfer tasks: I→P, P→I, I→C, C→I, C→P, and P→C.
Method  A→W  D→W  W→D  A→D  D→A  W→A  Average 

AlexNet [Krizhevsky et al.(2012)Krizhevsky, Sutskever, and Hinton]  60.6  95.0  99.5  64.2  45.5  48.3  68.8 
MMD[Tzeng et al.(2014)Tzeng, Hoffman, Zhang, Saenko, and Darrell]  61.0  95.0  98.5  64.9  47.2  49.4  69.3 
RTN[Long et al.(2016)Long, Zhu, Wang, and Jordan]  73.3  96.8  99.6  71.0  50.5  51.0  74.1 
DAN[Long et al.(2015)Long, Cao, Wang, and Jordan]  68.5  96.0  99.0  66.8  50.0  49.8  71.7 
GRL [Ganin and Lempitsky(2015)]  73.0  96.4  99.2  72.3  52.4  50.4  74.1 
JAN [Long et al.(2017b)Long, Zhu, Wang, and Jordan]  75.2  96.6  99.6  72.8  57.5  56.3  76.3 
CDAN[Long et al.(2018)Long, Cao, Wang, and Jordan]  77.9  96.9  100.0  74.6  55.1  57.5  77.0 
MADA[Pei et al.(2018)Pei, Cao, Long, and Wang]  78.5  99.8  100.0  74.1  56.0  54.5  77.1 
IDDA[Kurmi and Namboodiri(2019)]  82.2  99.8  100.0  82.4  54.1  52.5  78.5 
D³A (31)  79.0  97.7  100.0  79.4  58.2  55.3  78.3 
CD³A  82.3  99.8  100.0  81.1  58.2  55.6  79.5 
Method  I→P  P→I  I→C  C→I  C→P  P→C  Avg 

AlexNet [Krizhevsky et al.(2012)Krizhevsky, Sutskever, and Hinton]  66.2  70.0  84.3  71.3  59.3  84.5  73.9 
DAN[Long et al.(2015)Long, Cao, Wang, and Jordan]  67.3  80.5  87.7  76.0  61.6  88.4  76.9 
GRL [Ganin and Lempitsky(2015)]  66.5  81.8  89.0  79.8  63.5  88.7  78.2 
RTN[Long et al.(2016)Long, Zhu, Wang, and Jordan]  67.4  82.3  89.5  78.0  63.0  90.1  78.4 
MADA [Pei et al.(2018)Pei, Cao, Long, and Wang]  68.3  83.0  91.0  80.7  63.8  92.2  79.8 
D³A (12)  69.1  80.9  91.0  81.5  66.2  90.0  79.8 
CD³A  69.3  81.5  91.3  81.6  65.9  90.2  80.0 
Method  I→P  P→I  I→C  C→I  C→P  P→C  Average 

ResNet [He et al.(2016)He, Zhang, Ren, and Sun]  74.8  83.9  91.5  78.0  65.5  91.2  80.7 
DAN [Long et al.(2015)Long, Cao, Wang, and Jordan]  75.0  86.2  93.3  84.1  69.8  91.3  83.3 
RTN [Long et al.(2016)Long, Zhu, Wang, and Jordan]  75.6  86.8  95.3  86.9  72.7  92.2  84.9 
GRL [Ganin and Lempitsky(2015)]  75.0  86.0  96.2  87.0  74.3  91.5  85.0 
JAN [Long et al.(2017a)Long, Zhu, Wang, and Jordan]  76.8  88.0  94.7  89.5  74.2  91.7  85.8 
MADA [Pei et al.(2018)Pei, Cao, Long, and Wang]  75.0  87.9  96.0  88.8  75.2  92.2  85.8 
CDAN [Long et al.(2018)Long, Cao, Wang, and Jordan]  77.2  88.3  98.3  90.7  76.7  94.0  87.5 
CD³A  77.5  88.7  96.8  93.2  78.3  94.7  88.2 
5.2 Results
We use the pretrained AlexNet [Krizhevsky et al.(2012)Krizhevsky, Sutskever, and Hinton] architecture for our base model, following the typical setting in unsupervised domain adaptation. Table 1 summarizes results on the Office-31 dataset, and Table 2 and Table 3 report results on the ImageCLEF dataset for the AlexNet and ResNet networks respectively. The results on Office-Home [Venkateswara et al.(2017)Venkateswara, Eusebio, Chakraborty, and Panchanathan], along with the implementation details, are provided in the supplementary material. We obtain state-of-the-art results on all the datasets. It is noteworthy that the proposed model boosts the classification accuracies substantially on hard transfer tasks, e.g., A→D and A→W, where the source and target domains are substantially different. On average, we obtain considerably improved accuracies and statistically significant results, as shown in the analysis below.
Figure 3: (a) t-SNE plot of RevGrad features; (b) t-SNE plot of CD³A features; (c) proxy A-distance for A→D; (d) proxy A-distance for A→W.
5.3 Analysis
Curriculum vs. Fixed Sampling: We plot accuracy as a function of the number of MC samples for both models, curriculum-based sampling (CD³A) and fixed sampling (D³A), in Figure 4(a). We clearly observe that in the case of D³A, performance increases as we increase the number of MC-sampled discriminators, but beyond a certain number of samples the performance starts to deteriorate, while in the case of CD³A the performance saturates after certain epochs. We can also see that CD³A outperforms D³A.
Model complexity comparison with MADA: The proposed CD³A model uses one discriminator (an ensemble via dropout), whereas MADA uses as many discriminators as there are classes. Therefore, CD³A has far fewer parameters than MADA, even for datasets with a small number of classes. For instance, for the Office-31 dataset, MADA has 31 discriminators while CD³A has only one; MADA has 98M parameters, while CD³A has 59M. If we further increase the number of classes, the number of parameters in MADA increases (by 1.3M for every class label), but CD³A retains a constant number of parameters (59M).
Feature visualization: The adaptability of target to source features can be visualized using t-SNE embeddings of the image features. We follow a setting similar to [Ganin and Lempitsky(2015)] to plot t-SNE embeddings for the A→W adaptation task in Figure 3 (a) and (b). From the plot, we observe that the adapted features (CD³A) are more domain invariant than the features adapted with GRL.
Figure 4: (a) CD³A vs. D³A model on A→W; (b) SSA plot for A→W.

Statistical significance analysis: We analyzed statistical significance [Demšar(2006)] for our CD³A model against GRL [Ganin and Lempitsky(2015)] and the source-only method for the domain adaptation task. The Critical Difference (CD) for the Nemenyi test depends upon the given confidence level (0.05 in our case) for the average ranks and the number of tested datasets. If the difference in the ranks of two methods lies within CD (in our case CD = 0.6051), then they are not significantly different. Figure 4(b) visualizes the post-hoc analysis using the CD diagram for A→W. From the figure, it is clear that our CD³A model is better than, and significantly different from, the other methods.
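The critical difference follows the standard Nemenyi formula from Demšar (2006); the small helper below (function and parameter names ours) computes it from the critical value of the Studentized range statistic:

```python
import math

def nemenyi_critical_difference(q_alpha, k, n_datasets):
    """Critical difference for the Nemenyi post-hoc test (Demsar, 2006):
    two methods differ significantly if their average ranks differ by more
    than CD = q_alpha * sqrt(k(k+1) / (6N)), where k is the number of
    compared methods, N the number of datasets, and q_alpha the critical
    value at the chosen confidence level."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n_datasets))
```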
Proxy A-distance: We use the proxy A-distance d_A as a measure of cross-domain discrepancy [Ben-David et al.(2010)Ben-David, Blitzer, Crammer, Kulesza, Pereira, and Vaughan], which, together with the source risk, bounds the target risk. The proxy distance is defined as d_A = 2(1 − 2ε), where ε is the generalization error of a classifier (e.g., a kernel SVM) trained on the binary task of discriminating source and target. Figure 3(c) and (d) show d_A on tasks A→D and A→W, with features of the source-only model [Krizhevsky et al.(2012)Krizhevsky, Sutskever, and Hinton], GRL [Ganin and Lempitsky(2015)], MADA [Pei et al.(2018)Pei, Cao, Long, and Wang] and the proposed CD³A model. We observe that d_A calculated using CD³A features is much smaller than d_A calculated using source-only, GRL and MADA features, which suggests that representations learned via CD³A reduce the cross-domain gap more effectively.
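The proxy A-distance itself is a one-line computation from the domain classifier's error; a minimal sketch (function name ours):

```python
def proxy_a_distance(domain_classifier_error):
    """Proxy A-distance d_A = 2(1 - 2*eps), where eps is the generalization
    error of a binary classifier trained to separate source from target
    features. eps = 0.5 (chance level) gives d_A = 0, i.e. the domains are
    indistinguishable; eps = 0 gives the maximum d_A = 2."""
    return 2.0 * (1.0 - 2.0 * domain_classifier_error)
```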
6 Conclusion
In this paper, we provide a simple approach to obtain an improved discriminator for adversarial domain adaptation. We specifically show that the use of a sampling-based ensemble results in an improved discriminator without increasing the number of parameters. The main reason for this improvement is that the features are made domain invariant based on a distribution of observations, as against a single point estimate. Our approach based on curriculum dropout shows that we are able to obtain an improved discriminator that is stable and improves the feature invariance learnt. We compare our method with standard baselines and provide a thorough empirical analysis of the method. We further observe through visualization that the domain adapted features do result in domain invariant feature representations. Using a discriminator obtained through curriculum based dropout to solve domain adaptation is a promising direction, which we have initiated through this work.
Acknowledgment: We acknowledge the resource support from Delta Lab, IIT Kanpur. Vinod Kurmi acknowledges support from TCS Research Scholarship Program.
References
 [Ben-David et al.(2010)Ben-David, Blitzer, Crammer, Kulesza, Pereira, and Vaughan] Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Machine Learning, 79(1):151–175, 2010.
 [Bengio et al.(2009)Bengio, Louradour, Collobert, and Weston] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pages 41–48. ACM, 2009.

 [Bousmalis et al.(2017a)Bousmalis, Silberman, Dohan, Erhan, and Krishnan] Konstantinos Bousmalis, Nathan Silberman, David Dohan, Dumitru Erhan, and Dilip Krishnan. Unsupervised pixel-level domain adaptation with generative adversarial networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, page 7, 2017a.
 [Bousmalis et al.(2017b)Bousmalis, Silberman, Dohan, Erhan, and Krishnan] Konstantinos Bousmalis, Nathan Silberman, David Dohan, Dumitru Erhan, and Dilip Krishnan. Unsupervised pixel-level domain adaptation with generative adversarial networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, page 7, 2017b.
 [Bromley et al.(1994)Bromley, Guyon, LeCun, Säckinger, and Shah] Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Säckinger, and Roopak Shah. Signature verification using a "Siamese" time delay neural network. In Advances in Neural Information Processing Systems, pages 737–744, 1994.
 [Chen et al.(2018)Chen, Liu, Wang, Wassell, and Chetty] Qingchao Chen, Yang Liu, Zhaowen Wang, Ian Wassell, and Kevin Chetty. Re-weighted adversarial adaptation network for unsupervised domain adaptation. 2018.
 [Choi et al.(2017)Choi, Choi, and Kim] Yunjey Choi, Minje Choi, and Munyoung Kim. StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation. 2017.
 [Demšar(2006)] Janez Demšar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7(Jan):1–30, 2006.
 [Durugkar et al.(2016)Durugkar, Gemp, and Mahadevan] Ishan Durugkar, Ian Gemp, and Sridhar Mahadevan. Generative multi-adversarial networks. arXiv preprint arXiv:1611.01673, 2016.
 [Gal and Ghahramani(2016)] Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In international conference on machine learning, pages 1050–1059, 2016.
 [Ganin and Lempitsky(2015)] Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation. In International Conference on Machine Learning, pages 1180–1189, 2015.
 [Ghosh et al.(2017)Ghosh, Kulharia, Namboodiri, Torr, and Dokania] Arnab Ghosh, Viveka Kulharia, Vinay Namboodiri, Philip HS Torr, and Puneet K Dokania. Multiagent diverse generative adversarial networks. CoRR, abs/1704.02906, 6:7, 2017.
 [Hara et al.(2016)Hara, Saitoh, and Shouno] Kazuyuki Hara, Daisuke Saitoh, and Hayaru Shouno. Analysis of dropout learning regarded as ensemble learning. In International Conference on Artificial Neural Networks, pages 72–79. Springer, 2016.
 [He et al.(2016)He, Zhang, Ren, and Sun] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
 [Hoffman et al.(2017)Hoffman, Tzeng, Park, Zhu, Isola, Saenko, Efros, and Darrell] Judy Hoffman, Eric Tzeng, Taesung Park, Jun-Yan Zhu, Phillip Isola, Kate Saenko, Alexei A Efros, and Trevor Darrell. CyCADA: Cycle-consistent adversarial domain adaptation. arXiv preprint arXiv:1711.03213, 2017.
 [Hoffman et al.(2018)Hoffman, Tzeng, Park, Zhu, Isola, Saenko, Efros, and Darrell] Judy Hoffman, Eric Tzeng, Taesung Park, Jun-Yan Zhu, Phillip Isola, Kate Saenko, Alyosha Efros, and Trevor Darrell. CyCADA: Cycle-consistent adversarial domain adaptation. In International Conference on Machine Learning, 2018.
 [Krizhevsky et al.(2012)Krizhevsky, Sutskever, and Hinton] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012.
 [Kurmi and Namboodiri(2019)] Vinod Kumar Kurmi and Vinay P Namboodiri. Looking back at labels: A class based domain adaptation technique. arXiv preprint arXiv:1904.01341, 2019.
 [Kurmi et al.(2019)Kurmi, Kumar, and Namboodiri] Vinod Kumar Kurmi, Shanu Kumar, and Vinay P Namboodiri. Attending to discriminative certainty for domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 491–500, 2019.
 [Lakshminarayanan et al.(2017)Lakshminarayanan, Pritzel, and Blundell] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pages 6402–6413, 2017.
 [Li et al.(2018)Li, Pan, Wang, and Kot] Haoliang Li, Sinno Jialin Pan, Shiqi Wang, and Alex C Kot. Domain generalization with adversarial feature learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
 [Long et al.(2015)Long, Cao, Wang, and Jordan] Mingsheng Long, Yue Cao, Jianmin Wang, and Michael Jordan. Learning transferable features with deep adaptation networks. In International Conference on Machine Learning, pages 97–105, 2015.
 [Long et al.(2016)Long, Zhu, Wang, and Jordan] Mingsheng Long, Han Zhu, Jianmin Wang, and Michael I Jordan. Unsupervised domain adaptation with residual transfer networks. In Advances in Neural Information Processing Systems, pages 136–144, 2016.
 [Long et al.(2017a)Long, Zhu, Wang, and Jordan] Mingsheng Long, Han Zhu, Jianmin Wang, and Michael I Jordan. Deep transfer learning with joint adaptation networks. In International Conference on Machine Learning, pages 2208–2217, 2017a.
 [Long et al.(2017b)Long, Zhu, Wang, and Jordan] Mingsheng Long, Han Zhu, Jianmin Wang, and Michael I Jordan. Deep transfer learning with joint adaptation networks. In International Conference on Machine Learning, pages 2208–2217, 2017b.
 [Long et al.(2018)Long, Cao, Wang, and Jordan] Mingsheng Long, Zhangjie Cao, Jianmin Wang, and Michael I Jordan. Conditional adversarial domain adaptation. In Advances in Neural Information Processing Systems, pages 1640–1650, 2018.
 [Morerio et al.(2017)Morerio, Cavazza, Volpi, Vidal, and Murino] Pietro Morerio, Jacopo Cavazza, Riccardo Volpi, René Vidal, and Vittorio Murino. Curriculum dropout. In Proceedings of the IEEE International Conference on Computer Vision, pages 3544–3552, 2017.
 [Neal(2012)] Radford M Neal. Bayesian learning for neural networks, volume 118. Springer Science & Business Media, 2012.
 [Nguyen et al.(2017)Nguyen, Le, Vu, and Phung] Tu Nguyen, Trung Le, Hung Vu, and Dinh Phung. Dual discriminator generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2670–2680, 2017.
 [Patro and Namboodiri(2018)] Badri Patro and Vinay P. Namboodiri. Differential attention for visual question answering. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
 [Patro et al.(2018a)Patro, Kumar, Kurmi, and Namboodiri] Badri Narayana Patro, Sandeep Kumar, Vinod Kumar Kurmi, and Vinay Namboodiri. Multimodal differential network for visual question generation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4002–4012, 2018a.
 [Patro et al.(2018b)Patro, Kurmi, Kumar, and Namboodiri] Badri Narayana Patro, Vinod Kumar Kurmi, Sandeep Kumar, and Vinay Namboodiri. Learning semantic sentence embeddings using sequential pairwise discriminator. In Proceedings of the 27th International Conference on Computational Linguistics, pages 2715–2729, 2018b.
 [Pei et al.(2018)Pei, Cao, Long, and Wang] Zhongyi Pei, Zhangjie Cao, Mingsheng Long, and Jianmin Wang. Multi-adversarial domain adaptation. In AAAI, 2018.
 [Rozantsev et al.(2018)Rozantsev, Salzmann, and Fua] Artem Rozantsev, Mathieu Salzmann, and Pascal Fua. Beyond sharing weights for deep domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
 [Russakovsky et al.(2015)Russakovsky, Deng, Su, Krause, Satheesh, Ma, Huang, Karpathy, Khosla, Bernstein, Berg, and Fei-Fei] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015. doi: 10.1007/s11263-015-0816-y.
 [Saatci and Wilson(2017)] Yunus Saatci and Andrew G Wilson. Bayesian GAN. In Advances in Neural Information Processing Systems, pages 3622–3631, 2017.
 [Saenko et al.(2010)Saenko, Kulis, Fritz, and Darrell] Kate Saenko, Brian Kulis, Mario Fritz, and Trevor Darrell. Adapting visual category models to new domains. In European conference on computer vision, pages 213–226. Springer, 2010.
 [Saito et al.(2018a)Saito, Ushiku, Harada, and Saenko] Kuniaki Saito, Yoshitaka Ushiku, Tatsuya Harada, and Kate Saenko. Adversarial dropout regularization. 2018a.
 [Saito et al.(2018b)Saito, Watanabe, Ushiku, and Harada] Kuniaki Saito, Kohei Watanabe, Yoshitaka Ushiku, and Tatsuya Harada. Maximum classifier discrepancy for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3723–3732, 2018b.
 [Sharma et al.(2018)Sharma, Barratt, Ermon, and Pande] Rishi Sharma, Shane Barratt, Stefano Ermon, and Vijay Pande. Improved training with curriculum gans. arXiv preprint arXiv:1807.09295, 2018.
 [Shen et al.(2017)Shen, Qu, Zhang, and Yu] Jian Shen, Yanru Qu, Weinan Zhang, and Yong Yu. Adversarial representation learning for domain adaptation. arXiv preprint arXiv:1707.01217, 2017.
 [Shen et al.(2018)Shen, Qu, Zhang, and Yu] Jian Shen, Yanru Qu, Weinan Zhang, and Yong Yu. Wasserstein distance guided representation learning for domain adaptation. In AAAI, 2018.
 [Shu et al.(2019)Shu, Cao, Long, and Wang] Yang Shu, Zhangjie Cao, Mingsheng Long, and Jianmin Wang. Transferable curriculum for weakly-supervised domain adaptation. 2019.
 [Srivastava et al.(2014)Srivastava, Hinton, Krizhevsky, Sutskever, and Salakhutdinov] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
 [Sun and Saenko(2016)] Baochen Sun and Kate Saenko. Deep coral: Correlation alignment for deep domain adaptation. In European Conference on Computer Vision, pages 443–450. Springer, 2016.
 [Sun et al.(2016)Sun, Feng, and Saenko] Baochen Sun, Jiashi Feng, and Kate Saenko. Return of frustratingly easy domain adaptation. In AAAI, volume 6, page 8, 2016.
 [Sun et al.(2017)Sun, Feng, and Saenko] Baochen Sun, Jiashi Feng, and Kate Saenko. Correlation alignment for unsupervised domain adaptation. Domain Adaptation in Computer Vision Applications, page 153, 2017.
 [Torralba and Efros(2011)] Antonio Torralba and Alexei A Efros. Unbiased look at dataset bias. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1521–1528. IEEE, 2011.
 [Tzeng et al.(2014)Tzeng, Hoffman, Zhang, Saenko, and Darrell] Eric Tzeng, Judy Hoffman, Ning Zhang, Kate Saenko, and Trevor Darrell. Deep domain confusion: Maximizing for domain invariance. arXiv preprint arXiv:1412.3474, 2014.
 [Tzeng et al.(2015)Tzeng, Hoffman, Darrell, and Saenko] Eric Tzeng, Judy Hoffman, Trevor Darrell, and Kate Saenko. Simultaneous deep transfer across domains and tasks. In Computer Vision (ICCV), 2015 IEEE International Conference on, pages 4068–4076. IEEE, 2015.
 [Tzeng et al.(2017)Tzeng, Hoffman, Saenko, and Darrell] Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. Adversarial discriminative domain adaptation. In Computer Vision and Pattern Recognition (CVPR), volume 1, page 4, 2017.
 [Venkateswara et al.(2017)Venkateswara, Eusebio, Chakraborty, and Panchanathan] Hemanth Venkateswara, Jose Eusebio, Shayok Chakraborty, and Sethuraman Panchanathan. Deep hashing network for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5018–5027, 2017.
 [Weinshall et al.(2018)Weinshall, Cohen, and Amir] Daphna Weinshall, Gad Cohen, and Dan Amir. Curriculum learning by transfer learning: Theory and experiments with deep networks. In International Conference on Machine Learning, pages 5235–5243, 2018.
 [Zhang et al.(2018a)Zhang, Ding, Li, and Ogunbona] Jing Zhang, Zewei Ding, Wanqing Li, and Philip Ogunbona. Importance weighted adversarial nets for partial domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8156–8164, 2018a.
 [Zhang et al.(2017)Zhang, David, and Gong] Yang Zhang, Philip David, and Boqing Gong. Curriculum domain adaptation for semantic segmentation of urban scenes. In Proceedings of the IEEE International Conference on Computer Vision, pages 2020–2030, 2017.
 [Zhang et al.(2018b)Zhang, Wang, Huang, and Nehorai] Zhen Zhang, Mianzhi Wang, Yan Huang, and Arye Nehorai. Aligning infinitedimensional covariance matrices in reproducing kernel hilbert spaces for domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3437–3445, 2018b.