Contrastive Domain Adaptation

03/26/2021 ∙ by Mamatha Thota, et al. ∙ University of Aberdeen ∙ University of Lincoln

Recently, contrastive self-supervised learning has become a key component for learning visual representations across many computer vision tasks and benchmarks. However, contrastive learning in the context of domain adaptation remains largely underexplored. In this paper, we extend contrastive learning to a new domain adaptation setting, a particular situation occurring when the similarity is learned and deployed on samples following different probability distributions, without access to labels. Contrastive learning compares and contrasts positive and negative pairs of samples in an unsupervised setting, without access to source or target labels. We develop a variation of a recently proposed contrastive learning framework to tackle the domain adaptation problem, further identifying and removing possible negatives that are similar to the anchor in order to mitigate the effects of false negatives. Extensive experiments demonstrate that the proposed method adapts well and improves performance on the downstream domain adaptation task.


1 Introduction

Over the last few years, Deep Learning (DL) [22] has been successfully applied across numerous applications and domains, such as computer vision and image processing [34, 42, 37, 8], signal processing [2, 33, 15], autonomous driving [26, 41, 11], agri-food technologies [1, 20], and medical imaging [19, 25], largely due to the availability of large amounts of labeled data. Most applications of DL techniques, such as the aforementioned ones, rely on supervised learning, which requires manually labeling a dataset, a very time-consuming, cumbersome and expensive process that has led to the widespread use of certain datasets, e.g. ImageNet, for model pre-training. On the other hand, unlabeled data is being generated in abundance through sensor networks, vision systems, satellites, etc. One way to make use of this huge amount of unlabeled data is to derive supervision from the data itself. Since unlabeled data are largely available and less prone to labeling bias, they tend to provide visual information independent of specific domain styles.

Nowadays, self-supervised visual representation learning has largely closed the gap with, and in some cases even surpassed, supervised learning methods. One of the most prominent self-supervised visual representation learning techniques that has been gaining popularity is contrastive learning, which aims to learn an embedding space by contrasting semantically positive and negative pairs of samples [4, 5, 13].

However, whether these self-supervised visual representation learning techniques can be efficiently applied to domain adaptation has not yet been satisfactorily explored. When one applies a well-performing model learned from a source training set to a different but related target test set, the general assumption is that both sets of data are drawn from the same distribution. When this assumption is violated, a DL model trained on the source domain will not generalize well to the target domain, due to the distribution difference between the source and target domains known as domain shift. Learning a discriminative model in the presence of domain shift between source and target datasets is known as Domain Adaptation.

Existing domain adaptation methods rely on rich prior knowledge about the source data labels, which greatly limits their application, as explained above. This paper introduces a contrastive learning based domain adaptation approach that requires no prior knowledge of the label sets. The assumption is that the source and target datasets share the same label set, and only their marginal probability distributions differ.

One of the fundamental problems with contrastive self-supervised learning is the presence of potential false negatives that need to be identified and eliminated; without labels, however, this problem is rather difficult to solve. Notable related work has been proposed in [17] and [35], both of which focus on mining hard negatives; [16] developed a method for false negative elimination and false negative attraction, and [7] proposed a method to correct the sampling bias of negative samples.

Over the past few years, ImageNet pre-training has become a standard practice, but contrastive learning has demonstrated competitive performance without access to labeled data, by training the encoder using the input data itself. In this paper, we extend contrastive learning, also referred to as unsupervised representation learning, to the domain adaptation setting, a particular situation occurring when the similarity is learned and deployed on samples following different probability distributions. Without access to labeled data or pretrained ImageNet weights, we leverage the vast amount of unlabeled source and target data to train an encoder from randomly initialized parameters. We also present an approach to address one of the fundamental problems of contrastive representation learning, i.e. identifying and removing potential false negatives. We performed various experiments and tested our proposed model and its variants on several benchmarks that focus on the downstream domain adaptation task, demonstrating competitive performance against baseline methods, albeit not using any source or target labeled data.

The rest of the paper is laid out as follows: Section 2 presents the related work in self-supervised contrastive representation learning and domain adaptation methods. Section 3 describes our proposed approach, Section 4 presents the datasets and experimental results on domain adaptation after applying our model, and finally, Section 5 summarizes our work and future directions.

1.1 Contributions

The main contributions of this work can be summarised as follows:

  • We explore contrastive learning in the context of Domain Adaptation, attempting to maximize generalization between source and target domains with different distributions.

  • We propose a Domain Adaptation approach that does not make use of any labeled data or involve ImageNet pretraining.

  • We incorporate false negative elimination into the domain adaptation setting, resulting in improved accuracy without incurring any additional computational overhead.

  • We extend our domain adaptation framework and perform various experiments to learn from more than two views.

Figure 1: Overview of our proposed Contrastive Domain Adaptation model. The image on the left shows the pipeline of our model and the image on the right shows the loss function.

2 Related Work

Domain Adaptation: Domain adaptation is a special case of transfer learning where the goal is to learn a discriminative model in the presence of domain shift between source and target datasets. Various methods have been introduced to minimize the domain discrepancy in order to learn domain-invariant features. Some involve adversarial methods, like DANN [10] and ADDA [39], that help align source and target distributions. Other methods align distributions by minimizing a divergence, using popular measures such as maximum mean discrepancy [12, 28, 29], correlation alignment [36, 3], and the Wasserstein metric [6, 24]. MMD was first introduced for two-sample tests of the hypothesis that two distributions are equal, based on observed samples from the two distributions [12], and it is currently the most widely used metric for measuring the distance between two feature distributions. The Deep Domain Confusion network proposed by Tzeng et al. [40] learns representations that are both semantically meaningful and domain invariant, while Long et al. proposed DAN [28] and JAN [29], which perform domain matching via multi-kernel MMD (MK-MMD) or joint MMD (J-MMD) criteria in multiple domain-specific layers across domains.

Contrastive Learning: Recently, contrastive learning has achieved state-of-the-art performance in representation learning, leading to state-of-the-art results in computer vision. The aim is to learn an embedding space where positive pairs are pulled together whilst negative pairs are pushed away from each other. Positive pairs are drawn by pairing the augmentations of the same image, whereas negative pairs are drawn from different images. Existing contrastive learning methods use different strategies to generate positive and negative samples. Wu et al. [43] maintain the representations of all images in a memory bank, MoCo [13] maintains an on-the-fly momentum encoder along with a limited queue of previous samples, Tian et al. [38] use all the generated multi-view samples within the mini-batch, and SimClr v1 [4] and SimClr v2 [5] likewise utilize all the generated sample representations within the mini-batch. The above methods can provide a pretrained network for a downstream task, but do not account for domain shift if applied directly. In contrast, our approach aims to learn representations that generalize across domains without any need for labeled data. Recently, contrastive learning has been applied in the Unsupervised Domain Adaptation setting [18, 32, 21], where models have access to the source labels and/or use models pretrained on ImageNet as their backbone network. In comparison, our work is based on contrastive learning, also referred to as unsupervised representation learning, without access to labeled data or pretrained ImageNet parameters, instead leveraging the vast amount of unlabeled source and target data to train an encoder from randomly initialized parameters.

Removal of false negatives: As the name suggests, contrastive learning methods learn by contrasting semantically similar and dissimilar pairs of samples. They rely on a large number of negative samples to generate good quality representations and therefore favor large batch sizes. As we do not have access to labels, when an anchor image is paired with negative samples to form negative pairs, there is a probability that some of these images share the anchor's class, in which case their contribution towards the contrastive loss becomes minimal, limiting the ability of the model to converge quickly. These false negatives remain a fundamental problem in contrastive learning, but relatively limited work has been done in this area thus far.

Most existing methods focus on mining hard negatives: [17] developed hard negative mixing to synthesize hard negatives on the fly in the embedding space, [35] developed new sampling methods for selecting hard negative samples where the user can control the hardness, [16] proposed an approach for false negative elimination and false negative attraction, and [7] developed a debiased contrastive objective that corrects the sampling bias of negative samples. [16] use additional support views and aggregation as part of their elimination and attraction strategy. Inspired by [16], our proposed approach simplifies this and applies only the false negative elimination part to the domain adaptation framework. Instead of using additional support views, we compute the similarity between the anchor and the negatives in the mini-batch, sort the corresponding negative-pair similarities for each anchor, and remove the negative pairs most similar to the anchor. For each anchor in the mini-batch, we remove the exact same number of negative pairs; for example, removing one potential false negative per anchor out of its 1023 negative samples with a batch size of 512 totals 512 potential false negatives across all the anchor images in the mini-batch.

Input: Source data $S = \{x^{s}\}$,
Target data $T = \{x^{t}\}$
Output: Encoder network $f(\cdot)$, Projection-head network $g(\cdot)$
for each sampled mini-batch $\{x^{s}_n\} \subset S$, $\{x^{t}_n\} \subset T$ do
       Make two augmentations per source image:
       $\tilde{x}^{s}_1 = \mathrm{aug}(x^{s})$, $z^{s}_1 = g(f(\tilde{x}^{s}_1))$  # source augmentation-1
       $\tilde{x}^{s}_2 = \mathrm{aug}(x^{s})$, $z^{s}_2 = g(f(\tilde{x}^{s}_2))$  # source augmentation-2
       Calculate $\mathcal{L}^{src}_{FNR}$ for $(z^{s}_1, z^{s}_2)$ using Eq. 3
       Make two augmentations per target image:
       $\tilde{x}^{t}_1 = \mathrm{aug}(x^{t})$, $z^{t}_1 = g(f(\tilde{x}^{t}_1))$  # target augmentation-1
       $\tilde{x}^{t}_2 = \mathrm{aug}(x^{t})$, $z^{t}_2 = g(f(\tilde{x}^{t}_2))$  # target augmentation-2
       Calculate $\mathcal{L}^{tgt}_{FNR}$ for $(z^{t}_1, z^{t}_2)$ using Eq. 3
       Calculate $\mathcal{L}_{FNR}$ using Eq. 4
       Calculate $\mathcal{L}_{MMD}$ using Eq. 5
       Update $f$ and $g$ by backpropagating $\mathcal{L}_{FNR}$ and $\mathcal{L}_{MMD}$
end for
Algorithm 1: Contrastive Domain Adaptation with False Negative Removal and Maximum Mean Discrepancy, summarizing our proposed method.

3 Method

3.1 Model Overview

Contrastive Domain Adaptation (CDA): We explore a new domain adaptation setting in a fully self-supervised fashion without any labelled data, that is, where both the source and the target domains contain only unlabelled data. In the normal UDA setting, we have access to the source labels; here, instead, our goal is to train a model using these unlabelled data sources in order to learn visual features that generalize across both domains. The aim is to obtain pre-trained weights that are robust to domain shift and generalizable to the downstream domain adaptation task. Our model uses unlabelled source and target datasets in an attempt to learn and solve the adaptation between domains.

Inspired by the recent successes of learning from unlabeled data, the proposed learning framework is based on SimClr [4], adapted to the domain adaptation setting, where data from the unlabelled source and target domains is used in a task-agnostic way. SimClr learns visual similarity: the model pulls visually similar-looking images together while pushing dissimilar-looking images apart. However, in domain adaptation, images of the same class may look very different due to the domain gap, so learning visual similarity alone ensures neither semantic similarity nor domain invariance between domains. With CDA, we therefore aim to learn general, class-discriminative, and domain-invariant visual features from both domains via unsupervised pretraining. Each component is introduced in detail below and illustrated in Figure 1 and, for four views, in Figure 2.

From a randomly sampled mini-batch of $N$ images, we augment each image twice, creating two views $\tilde{x}_i$ and $\tilde{x}_j$ of the same anchor image. We use a base encoder (a ResNet-50 architecture [14]) trained from scratch to encode the augmented images and generate the representations $h_i$ and $h_j$. These representations are then fed into a non-linear MLP with two hidden layers to obtain the projected vector representations $z_i$ and $z_j$. We find that this MLP projection benefits our model by compressing the images into a latent-space representation, enabling the model to learn their high-level features. We apply a contrastive loss on the vector representations using the NT-Xent loss [4], modified to identify and eliminate false negatives and thereby improve accuracy; details are discussed in Section 4.2. We also introduce MMD to measure domain discrepancy in feature space in order to minimize domain shift, as discussed later in this section. The aim is to obtain pretrained weights that are robust to domain shift and generalize efficiently. Afterwards, we perform linear evaluation using the encoder, entirely discarding the MLP projection head after pretraining.
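To make the pipeline concrete, the following is a minimal PyTorch sketch of the encoder and projection head described above. The projection dimensions (2048 → 2048 → 128) and the exact MLP structure are our assumptions, following common SimCLR practice; the paper specifies only a ResNet-50 trained from scratch and a non-linear MLP on top.

```python
# Minimal sketch of the CDA backbone: a ResNet-50 encoder trained from
# scratch plus a non-linear MLP projection head. The dimensions
# (2048 -> 2048 -> 128) follow common SimCLR practice and are assumptions,
# not values reported in the paper.
import torch
import torch.nn as nn
from torchvision.models import resnet50

class CDAModel(nn.Module):
    def __init__(self, proj_hidden=2048, proj_out=128):
        super().__init__()
        backbone = resnet50(weights=None)      # random init, no ImageNet weights
        feat_dim = backbone.fc.in_features     # 2048 for ResNet-50
        backbone.fc = nn.Identity()            # keep the encoder trunk only
        self.encoder = backbone
        self.projector = nn.Sequential(        # non-linear MLP projection head
            nn.Linear(feat_dim, proj_hidden),
            nn.ReLU(inplace=True),
            nn.Linear(proj_hidden, proj_out),
        )

    def forward(self, x):
        h = self.encoder(x)                    # representation h
        z = self.projector(h)                  # projection z used by the loss
        return h, z
```

After pretraining, only `encoder` is retained for linear evaluation; `projector` is discarded, as described above.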

3.2 Contrastive Loss for DA setting:

The goal of contrastive learning is to maximize the similarities between positive pairs and minimize the similarities of negative ones. We randomly sample a mini-batch of $N$ images; each anchor image $x$ is augmented twice, creating two views $\tilde{x}_i$ and $\tilde{x}_j$ of the same sample and resulting in $2N$ images. We do not explicitly sample negative pairs; instead, following [4], we treat the other $2(N-1)$ augmented samples as negative pairs. The contrastive loss is defined as follows:

$$\ell_{i,j} = -\log \frac{\exp\left(\mathrm{sim}(z_i, z_j)/\tau\right)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp\left(\mathrm{sim}(z_i, z_k)/\tau\right)} \qquad (1)$$

where $\mathrm{sim}(u, v) = u^{\top}v / (\lVert u\rVert \lVert v\rVert)$ is the cosine similarity and $\tau$ is a temperature parameter.
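As a minimal sketch, the NT-Xent loss of Eq. 1 can be implemented as below, following the standard SimCLR formulation; the temperature default of 0.5 is an assumption, as the paper does not report its value.

```python
# Hedged sketch of the NT-Xent loss (Eq. 1) for two views z1, z2 of the
# same N images, following the standard SimCLR formulation.
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, tau=0.5):
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # 2N x d, unit norm
    sim = z @ z.t() / tau                               # cosine similarity / temperature
    sim.fill_diagonal_(float('-inf'))                   # exclude self-comparisons
    # the positive for view i is the other view of the same image
    targets = torch.cat([torch.arange(n, 2 * n),
                         torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)
```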

However, if the contrastive loss of Eq. 1 is used directly in a domain adaptation scenario, the mini-batch contains image samples from both domains, and all other samples may be treated as negatives against the anchor even when they belong to the same class, without distinguishing domains. This can further widen the distance between the domains due to differences in domain-specific visual characteristics, preventing the model from learning domain invariance. To overcome this, we perform contrastive learning in the source and target domains independently, randomly sampling instances from each domain separately. Finally, our contrastive loss for DA is defined as follows:

$$\mathcal{L}_{CDA} = \mathcal{L}_{src} + \mathcal{L}_{tgt} \qquad (2)$$

where $\mathcal{L}_{src}$ and $\mathcal{L}_{tgt}$ are the source and target contrastive losses, respectively.
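A sketch of Eq. 2, reusing the nt_xent_loss function from the previous sketch: source and target batches are kept separate, so samples from one domain never act as negatives for anchors of the other.

```python
# Sketch of Eq. 2: contrastive loss computed independently per domain.
# zs1, zs2 are projections of two source views; zt1, zt2 of two target views.
def cda_loss(zs1, zs2, zt1, zt2, tau=0.5):
    l_src = nt_xent_loss(zs1, zs2, tau)   # source contrastive loss
    l_tgt = nt_xent_loss(zt1, zt2, tau)   # target contrastive loss
    return l_src + l_tgt
```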

Figure 2: Overview of CDA with four views

3.3 Removal of False Negatives:

Unsupervised contrastive representation learning methods learn by contrasting semantically positive and negative pairs of samples. As we do not have access to true labels in this setting, positive pairs are drawn by pairing the augmentations of the same image, whereas negative pairs are drawn from different images within the same batch. So, for a batch of $N$ images, the augmentations form $N$ positive pairs among a total of $2N$ images, and each anchor has $2(N-1)$ negatives. Among those negatives, there may be images that are similar to the anchor, which are hence false negatives.

During training, an augmented anchor image is compared against the negative samples to contribute towards the contrastive loss. Some of these pairs may share the same semantic information (label) as the anchor and should therefore be treated as false negatives. When the anchor and a negative sample share the same class, the pair's contribution towards the contrastive loss becomes minimal, limiting the ability of the model to converge quickly; moreover, the presence of these false negatives discards semantic information, leading to a significant performance drop. We therefore identify and remove the negatives that are similar to the anchor in order to improve the performance of contrastive learning.

Inspired by [16], we have simplified their approach and apply only the false negative elimination part to the domain adaptation framework in order to improve the performance of contrastive learning. Instead of using additional support views, we compute the similarity between the anchor and the negatives within the mini-batch, sort the corresponding negative-pair similarities for each anchor, and remove the negative pairs most similar to the anchor. For each anchor we remove the same number $k$ of negative pairs; for example, $k=1$ removes one negative pair per anchor in the mini-batch.

After removing the false negatives, the contrastive loss can be defined as follows:

$$\ell^{FNR}_{i,j} = -\log \frac{\exp\left(\mathrm{sim}(z_i, z_j)/\tau\right)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i,\, k \notin S_i]} \exp\left(\mathrm{sim}(z_i, z_k)/\tau\right)} \qquad (3)$$

where $S_i$ is the set of negatives that are similar to the anchor $i$.
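The sketch below illustrates Eq. 3 under our reading of the removal procedure: for each anchor, the k most similar negatives (the set S_i) are masked out of the denominator before the loss is computed. The function name and the ranking details are our assumptions; k = 1 corresponds to the FNR1 setting in Section 4.

```python
# Hedged sketch of the false-negative-removal loss (Eq. 3): per anchor, the
# k most similar negatives (the set S_i) are dropped from the denominator.
import torch
import torch.nn.functional as F

def nt_xent_fnr_loss(z1, z2, tau=0.5, k=1):
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)
    sim = z @ z.t() / tau
    sim.fill_diagonal_(float('-inf'))                       # remove self-pairs
    targets = torch.cat([torch.arange(n, 2 * n),
                         torch.arange(0, n)]).to(z.device)
    # mask the positives so they cannot be picked as "false negatives"
    neg_sim = sim.clone()
    neg_sim[torch.arange(2 * n), targets] = float('-inf')
    # S_i: indices of the k most similar negatives per anchor; drop them
    drop = neg_sim.topk(k, dim=1).indices
    sim = sim.scatter(1, drop, float('-inf'))
    return F.cross_entropy(sim, targets)
```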

However, if the loss of Eq. 3 is used directly in a domain adaptation scenario then, as with the contrastive loss, the mini-batch contains image samples from both domains, and all other samples may be treated as negatives against the anchor even when they belong to the same class, without distinguishing domains. This further widens the distance between the domains due to differences in domain-specific visual characteristics and prevents the model from learning domain invariance. To overcome this, we apply the FNR loss in the source and target domains independently, randomly sampling instances from each domain separately. Finally, our joint FNR loss for DA is defined as follows:

$$\mathcal{L}_{FNR} = \mathcal{L}^{src}_{FNR} + \mathcal{L}^{tgt}_{FNR} \qquad (4)$$

where $\mathcal{L}^{src}_{FNR}$ and $\mathcal{L}^{tgt}_{FNR}$ are the source and target FNR contrastive losses, respectively.

3.4 Revisiting Maximum Mean Discrepancy (MMD):

MMD defines the distance between two distributions as the distance between their mean embeddings in a Reproducing Kernel Hilbert Space (RKHS). MMD is a two-sample kernel test used to accept or reject the null hypothesis $H_0: P = Q$ [12], where $P$ and $Q$ are the source and target domain probability distributions. MMD is motivated by the fact that if two distributions are identical, all of their statistics should be the same. The empirical estimate of the squared MMD between two datasets is computed by the following equation:

$$\mathrm{MMD}^2(X_S, X_T) = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} k\!\left(x^{s}_{i}, x^{s}_{j}\right) - \frac{2}{NM}\sum_{i=1}^{N}\sum_{j=1}^{M} k\!\left(x^{s}_{i}, x^{t}_{j}\right) + \frac{1}{M^2}\sum_{i=1}^{M}\sum_{j=1}^{M} k\!\left(x^{t}_{i}, x^{t}_{j}\right) \qquad (5)$$

where $\phi(\cdot)$ is the mapping to the RKHS $\mathcal{H}$, $k(u, v) = \langle \phi(u), \phi(v) \rangle_{\mathcal{H}}$ is the universal kernel associated with this mapping, and $N$ and $M$ are the total numbers of samples in the source and target domains, respectively. In short, the MMD between the distributions of two datasets is equivalent to the distance between the sample means in a high-dimensional feature space.
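A minimal sketch of the squared-MMD estimate in Eq. 5, assuming a Gaussian RBF kernel; the kernel choice and bandwidth are our assumptions, and for brevity this biased estimator keeps the diagonal terms of the within-domain kernel matrices.

```python
# Hedged sketch of the squared-MMD estimate (Eq. 5) with an RBF kernel,
# computed between batches of source (xs) and target (xt) features.
import torch

def rbf_kernel(x, y, sigma=1.0):
    d2 = torch.cdist(x, y).pow(2)              # pairwise squared distances
    return torch.exp(-d2 / (2 * sigma ** 2))

def mmd_loss(xs, xt, sigma=1.0):
    k_ss = rbf_kernel(xs, xs, sigma).mean()    # E[k(s, s')], diagonal included
    k_tt = rbf_kernel(xt, xt, sigma).mean()    # E[k(t, t')]
    k_st = rbf_kernel(xs, xt, sigma).mean()    # E[k(s, t)]
    return k_ss - 2 * k_st + k_tt
```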

Figure 3: Average accuracy comparison of the proposed CDA frameworks with CDA-Base.

4 Experiments

4.1 Datasets

Since we propose a new task, there is no benchmark specifically designed for it. We illustrate the performance of our method on the contrastive domain adaptation task by comparing against SimClr-Base and CDA-Base, explained later in Section 4.2, applying it to the standard digits benchmarks with accuracy as the evaluation metric.

MNIST→USPS (MU): MNIST [23], which stands for "modified NIST", is treated as the source domain; it consists of black-and-white handwritten digits. USPS [9] is treated as the target domain; it consists of handwritten digits scanned and segmented by the U.S. Postal Service. As both datasets contain grayscale images, the domain shift between them is relatively small. Figure 4 below shows sample images from MU.

Figure 4: Sample images from datasets: MNIST-USPS

SVHN→MNIST (SM): In this setting, SVHN [30] is treated as the source domain and MNIST [23] as the target domain. MNIST consists of black-and-white handwritten digits, while SVHN consists of coloured crops of street-view house numbers, single digits extracted from images of urban house numbers in Google Street View. SVHN and MNIST are two digit-classification datasets with a drastic distributional shift between them; adaptation between MNIST and SVHN is quite challenging because MNIST has a significantly lower intrinsic dimensionality than SVHN. Figure 5 below shows sample images from SM.

Figure 5: Sample images from datasets: SVHN-MNIST

MNIST→MNISTM (MMM): MNIST [23], which consists of black-and-white handwritten digits, is treated as the source domain and MNISTM as the target domain. MNISTM is a modification of the MNIST dataset in which the digits are blended with random patches from BSDS500 colour photos. Figure 6 below shows sample images from MMM.

Figure 6: Sample images from datasets: MNIST-MNISTM

4.2 Implementation Details

CDA uses a ResNet-50 base encoder [14] trained from scratch, followed by a two-layer non-linear MLP. During pretraining, we train CDA on 2 Titan Xp GPUs using the LARS optimizer [44] with a batch size of 512 and a weight decay of 1e-6 for a total of 300 epochs. Similar to SimClr [4], we report performance by training a linear classifier on top of a fixed representation, using only source labels to evaluate the representations; this is a standard benchmark adopted by many papers in the literature [4, 5, 31].
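For reference, a minimal sketch of this linear-evaluation protocol: the pretrained encoder is frozen and a linear classifier is trained on top using source labels only. The optimizer, learning rate, and epoch count here are assumptions, not reported values.

```python
# Hedged sketch of linear evaluation: freeze the pretrained encoder and
# train a linear classifier on top using labeled *source* data only.
# encoder: e.g. model.encoder from the Section 3.1 sketch (maps images
# to 2048-d ResNet-50 features).
import torch
import torch.nn as nn

def linear_eval(encoder, train_loader, num_classes=10, epochs=90, device='cuda'):
    encoder.eval()                                  # fixed representation
    for p in encoder.parameters():
        p.requires_grad_(False)
    clf = nn.Linear(2048, num_classes).to(device)   # 2048 = ResNet-50 feature dim
    opt = torch.optim.SGD(clf.parameters(), lr=0.1, momentum=0.9)
    ce = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in train_loader:
            x, y = x.to(device), y.to(device)
            with torch.no_grad():
                h = encoder(x)                      # frozen features
            loss = ce(clf(h), y)
            opt.zero_grad(); loss.backward(); opt.step()
    return clf
```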

4.3 Evaluation

We conducted various experiments using unlabeled source and target digit datasets. The goal of our experiments is to introduce contrastive learning to the domain adaptation problem in order to maximize generalization between source and target datasets, by learning class-discriminative and domain-invariant features and by improving the performance of the contrastive loss through the elimination of false negatives, one of the main drawbacks of contrastive learning without access to labels. We performed multiple experiments using two views and four views [38]. Figure 3 compares the average accuracy of our proposed two-view CDA frameworks with CDA-Base. The following are the various experimental scenarios we performed on the digit datasets.

SimClr-Base: We start our experimental analysis by evaluating SimClr itself. We train on the source dataset using the same setup as SimClr, whilst testing on the target dataset. We treat this as a strong baseline, which we call SimClr-Base, and use it as a reference for comparison against the other methods.

CDA-Base: We followed the methodology described in Section 3.2, trained the model using Equation 2, and evaluated it on the target domain. Looking at Table 1, we can clearly observe that the model performs better than SimClr-Base. The model has clearly learnt both visual similarity and domain invariance, minimizing the distance between the domains and maximizing the classification accuracy. Overall, the average accuracy across all the datasets increased by around 19 percentage points compared to the SimClr-Base model. We treat this result as a second strong baseline and call it CDA-Base.

CDA_FNR: We followed the methodology described in Section 3.3 and trained the model using Equation 4, then evaluated it on the target domain. Looking at Table 1, in addition to learning visual similarity and domain invariance, the model also successfully identified and eliminated potential false negatives, which carry the same semantic information as the anchor, resulting in faster convergence and increased accuracy. We experimented with two scenarios: first we removed one false negative per anchor, which we call FNR1, and second we removed two false negatives per anchor, which we call FNR2. The results in Table 1 show that the removal of false negatives improves accuracy and speeds up convergence. The average accuracy increased by 2.3 percentage points after removing one false negative. By removing two false negatives, the average accuracy increased by 3.8 points over CDA-Base and 1.5 points over FNR1, and by around 21 points over SimClr-Base.

CDA-MMD: We used the same setup as CDA-Base. Additionally, we introduced MMD as described in Section 3.4, computed between the vector representations extracted from each domain as per Equation 5, in order to reduce the distance between the source and target distributions. We backpropagate the NT-Xent loss of Equation 2 along with the MMD loss of Equation 5. From Table 2, we observe that by minimizing both losses together, our model achieves much better alignment of the source and target domains, showing the usefulness of combining the contrastive loss with MMD alignment. In comparison to CDA-Base, the performance gain is large: the average accuracy increased by 4.5 percentage points.

CDA_FNR-MMD: We used the same setup as CDA_FNR and additionally introduced MMD, computed between the vector representations extracted from each domain as per Equation 5, in order to reduce the distance between the source and target distributions. We calculate the FNR loss for both the source and target domains using Equation 4 and backpropagate it along with the MMD loss of Equation 5. From Table 2, we observe that by removing potential false negatives and minimizing the discrepancy together, our model retains semantic information, converges faster, and learns both visual similarity and domain invariance by aligning the source and target datasets efficiently, showing the effectiveness of this method. In comparison to CDA-Base, the average performance gain is larger still, at 5.1 percentage points.
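Putting the pieces together, the following is a hedged sketch of one CDA_FNR-MMD training step, reusing the model and loss sketches from Section 3; the weighting lambda_mmd between the two losses is our assumption, as the paper does not state how they are balanced.

```python
# Hedged sketch of one CDA_FNR-MMD training step, combining the FNR loss
# (Eq. 4) and the MMD loss (Eq. 5) sketches from Section 3.
# model: a CDAModel; opt: its optimizer; xs*/xt*: augmented source/target views.
def train_step(model, opt, xs1, xs2, xt1, xt2, tau=0.5, k=1, lambda_mmd=1.0):
    _, zs1 = model(xs1); _, zs2 = model(xs2)     # two source views
    _, zt1 = model(xt1); _, zt2 = model(xt2)     # two target views
    l_fnr = nt_xent_fnr_loss(zs1, zs2, tau, k) \
          + nt_xent_fnr_loss(zt1, zt2, tau, k)   # joint FNR loss (Eq. 4)
    l_mmd = mmd_loss(zs1, zt1)                   # domain discrepancy (Eq. 5)
    loss = l_fnr + lambda_mmd * l_mmd            # weighting is an assumption
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```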

Comparison with state-of-the-art methods: As the proposed framework is new, there are unfortunately no benchmarks specifically designed for our task, so a like-for-like comparison is difficult. We demonstrate that our model performs well in the domain adaptation setting without access to labelled data or ImageNet parameters, training on the unlabeled data itself, whereas all the unsupervised domain adaptation methods have access to the source labels. We compared our results with state-of-the-art models and conclude that our model performs favorably. From Table 3, our model outperforms state-of-the-art models such as DANN, DAN, ADDA, DDC and SimClr-Base on the MNIST-USPS and SVHN-MNIST tasks [10, 28, 39, 27, 40, 4].

Method           | MU   | SM   | MMM  | Avg
SimClr-Base [4]  | 92.0 | 31.7 | 34.9 | 53.1
CDA-Base         | 92.5 | 64.8 | 57.9 | 71.7
CDA_FNR1         | 93.2 | 69.4 | 59.5 | 74.0
CDA_FNR2         | 94.1 | 71.7 | 60.6 | 75.5
Table 1: Accuracy values on the digits datasets for SimClr-Base and the proposed CDA framework, with the introduction of false negative removal. The best average is indicated in bold. M: MNIST, U: USPS, S: SVHN, MM: MNISTM.

Inspired by [38], we also performed similar experiments using four views. The following are the experimental scenarios we performed on the digit datasets, comparing CDA-Base with Contrastive Domain Adaptation with Four Augmentations (CDAx4aug).

CDAx4aug: We tested our method using four augmentations per anchor for each domain, following the methodology described in Section 3.2 and training the model using Equation 2. The only change is that we now backpropagate four contrastive losses, two from the source and two from the target. From Table 4, we observe that the additional augmentations significantly improve the average accuracy compared to the two-view CDA-Base, due to the availability of additional positive and negative samples. Overall, adding two extra views to CDA-Base gained 5.1 percentage points of average accuracy.

Method           | MU   | SM   | MMM  | Avg
SimClr-Base [4]  | 92.0 | 31.7 | 34.9 | 53.1
CDA-Base         | 92.5 | 64.8 | 57.9 | 71.7
CDA-MMD          | 93.4 | 74.8 | 60.6 | 76.2
CDA_FNR-MMD      | 94.2 | 76.2 | 60.2 | 76.8
Table 2: Accuracy values on the digits datasets evaluated using the CDA framework with the introduction of MMD, compared against the base models. The best average is indicated in bold. M: MNIST, U: USPS, S: SVHN, MM: MNISTM.
Method                   | MU   | SM   | MMM
SimClr-Base [4]          | 92.0 | 31.7 | 34.9
DDC                      | 79.1 | 68.1 | -
ADDA                     | 89.4 | 76.0 | -
DANN                     | -    | 73.8 | 76.6
DAN                      | 81.1 | 71.1 | 76.9
CDA_FNR-MMD (our method) | 94.2 | 76.2 | 60.2
Table 3: Comparison of the proposed CDA method with state-of-the-art methods, using accuracy as the performance metric. The best numbers are indicated in bold. M: MNIST, U: USPS, S: SVHN, MM: MNISTM.
Method           | MU   | SM   | MMM  | Avg
SimClr-Base [4]  | 92.0 | 31.7 | 34.9 | 53.1
CDA-Base         | 92.5 | 64.8 | 57.9 | 71.7
CDAx4aug         | 92.9 | 74.1 | 63.5 | 76.8
CDAx4aug_FNR     | 93.6 | 75.0 | 64.0 | 77.5
Table 4: Accuracy values on the digits datasets, compared with the base models and evaluated using the CDA framework with four views, with the introduction of false negative removal. The best average is indicated in bold. M: MNIST, U: USPS, S: SVHN, MM: MNISTM.
Method           | MU   | SM   | MMM  | Avg
SimClr-Base [4]  | 92.0 | 31.7 | 34.9 | 53.1
CDA-Base         | 92.5 | 64.8 | 57.9 | 71.7
CDAx4aug         | 92.9 | 74.1 | 63.5 | 76.8
CDAx4aug-MMD     | 92.7 | 69.3 | 58.6 | 73.5
CDAx4aug_FNR-MMD | 92.5 | 70.6 | 61.5 | 74.9
Table 5: Accuracy values on the digits datasets evaluated using the CDA framework with four views, with the introduction of MMD, compared with the base models. The best average is indicated in bold. M: MNIST, U: USPS, S: SVHN, MM: MNISTM.

CDAx4aug_FNR: We followed the methodology described in Section 3.3 and trained the model using Equation 4, with four augmentations per domain instead of two, then evaluated the model on the target domain. Looking at Table 4, the additional views clearly helped the model learn visual similarity and domain invariance, minimizing the distance between the domains, and also helped it successfully identify and eliminate potential false negatives, resulting in faster convergence with an average accuracy increase of 5.8 percentage points over CDA-Base and 0.7 points over CDAx4aug.

CDAx4aug-MMD: We used the same setup as CDAx4aug and additionally introduced MMD, computed between the vector representations extracted from each domain as per Equation 5. We backpropagate the NT-Xent loss of Equation 2 for two pairs of source views and two pairs of target views, along with the MMD loss of Equation 5. From Table 5, we observe that the performance gain from MMD was not significant, which we attribute to noise from the additional augmentations slowing the convergence between the source and target distributions.

CDAx4aug_FNR-MMD: We used the same setup as CDAx4aug_FNR and additionally introduced MMD, computed between the vector representations extracted from each domain as per Equation 5. We backpropagate the FNR loss of Equation 4 along with the MMD loss of Equation 5. From Table 5, we see that the average performance improves compared to CDAx4aug-MMD due to the false negative removal, but the addition of MMD comparatively slows convergence.

5 Conclusion

Over the past few years, ImageNet pre-training has become a standard practice. Employing our proposed contrastive domain adaptation approach and its variants, we demonstrate that our model can perform competitively in a domain adaptation setting without access to labelled data or ImageNet parameters, training on the unlabeled data itself. CDA also introduces the identification and removal of potential false negatives in the DA setting, resulting in improved accuracy. We further extend our framework to learn from more than two views in the DA setting. We tested our model in various experimental scenarios, demonstrating that it can be used effectively for the downstream domain adaptation task. We hope that our work encourages future researchers to apply contrastive learning to domain adaptation.

References