1 Introduction
Over the last few years, Deep Learning (DL) [22] has been successfully applied across numerous applications and domains, such as computer vision and image processing [34, 42, 37, 8], signal processing [2, 33, 15], autonomous driving [26, 41, 11], agrifood technologies [1, 20], medical imaging [19, 25], etc., largely due to the availability of large amounts of labeled data. Most applications of DL techniques, including the aforementioned ones, rely on supervised learning, which requires manually labeling a dataset. Labeling is a time-consuming, cumbersome and expensive process, which has led to the widespread use of certain datasets, e.g. ImageNet, for model pretraining. On the other hand, unlabeled data is being generated in abundance through sensor networks, vision systems, satellites, etc. One way to make use of this huge amount of unlabeled data is to get supervision from the data itself. Since unlabeled data are widely available and less prone to labeling bias, they tend to provide visual information independent of specific domain styles.
Nowadays, self-supervised visual representation learning has been largely closing the gap with, and in some cases even surpassing, supervised learning methods. One of the most prominent self-supervised visual representation learning techniques is contrastive learning, which aims to learn an embedding space by contrasting semantically positive and negative pairs of samples [4, 5, 13].
However, whether these self-supervised visual representation learning techniques can be efficiently applied to domain adaptation has not yet been satisfactorily explored. When one applies a well-performing model learned from a source training set to a different but related target test set, the usual assumption is that both sets are drawn from the same distribution. When this assumption is violated, the DL model trained on the source domain data will not generalize well to the target domain, because of the distribution difference between the source and target domains known as domain shift. Learning a discriminative model in the presence of domain shift between source and target datasets is known as Domain Adaptation.
Existing domain adaptation methods rely on rich prior knowledge about the source data labels, which greatly limits their application, as explained above. This paper introduces a contrastive learning based domain adaptation approach that requires no prior knowledge of the label sets. The assumption is that both the source and target datasets share the same labels, but only the marginal probability distributions differ.
One of the fundamental problems with contrastive self-supervised learning is the presence of potential false negatives that need to be identified and eliminated; without labels, this problem is rather difficult to solve. Notable work in this area has been proposed in [17] and [35], both of which focus on mining hard negatives; [16] developed a method for false negative elimination and false negative attraction, and [7] proposed a method to correct the sampling bias of negative samples.
Over the past few years, ImageNet pretraining has become standard practice, but contrastive learning has demonstrated competitive performance without access to labeled data by training the encoder on the input data itself. In this paper, we extend contrastive learning, also referred to as unsupervised representation learning, to the domain adaptation setting, a particular situation in which the similarity is learned and deployed on samples following different probability distributions. We do so without access to labeled data or pretrained ImageNet weights, leveraging the vast amount of unlabeled source and target data to train an encoder from randomly initialized parameters. We also present an approach to address one of the fundamental problems of contrastive representation learning, i.e. identifying and removing the potential false negatives. We performed various experiments and tested our proposed model and its variants on several benchmarks that focus on the downstream domain adaptation task, demonstrating competitive performance against baseline methods despite not using any labeled source or target data.
The rest of the paper is laid out as follows: Section 2 presents the related work in self-supervised contrastive representation learning and domain adaptation methods. Section 3 describes our proposed approach, Section 4 presents the datasets and experimental results on domain adaptation after applying our model, and finally, Section 5 summarizes our work and future directions.
1.1 Contributions
The main contributions of this work can be summarised as follows:

We explore contrastive learning in the context of Domain Adaptation, attempting to maximize generalization between source and target domains with different distributions.

We propose a Domain Adaptation approach that neither makes use of any labeled data nor involves ImageNet pretraining.

We incorporate false negative elimination into the domain adaptation setting, resulting in improved accuracy without incurring any additional computational overhead.

We extend our domain adaptation framework and perform various experiments to learn from more than two views.
2 Related Work
Domain Adaptation: Domain adaptation is a special case of transfer learning where the goal is to learn a discriminative model in the presence of domain shift between source and target datasets. Various methods have been introduced to minimize the domain discrepancy in order to learn domain-invariant features. Some involve adversarial methods such as DANN [10] and ADDA [39], which help align the source and target distributions. Other methods align distributions by minimizing a divergence, using popular measures such as maximum mean discrepancy (MMD) [12, 28, 29], correlation alignment [36, 3], and the Wasserstein metric [6, 24]. MMD was first introduced for the two-sample test of the hypothesis that two distributions are equal based on observed samples from the two distributions [12], and it is currently the most widely used metric for measuring the distance between two feature distributions. The Deep Domain Confusion Network proposed by Tzeng et al. [40] learns both semantically meaningful and domain-invariant representations, while Long et al. proposed DAN [28] and JAN [29], which perform domain matching via multi-kernel MMD (MK-MMD) or joint MMD (JMMD) criteria in multiple domain-specific layers across domains.
Contrastive Learning: Recently, contrastive learning has achieved state-of-the-art performance in representation learning, leading to state-of-the-art results in computer vision. The aim is to learn an embedding space where positive pairs are pulled together, whilst negative pairs are pushed away from each other. Positive pairs are drawn by pairing the augmentations of the same image, whereas negative pairs are drawn from different images. Existing contrastive learning methods use different strategies to generate positive and negative samples. Wu et al. [43] maintain all the sample representations of the images in a memory bank, MoCo [13] maintains an on-the-fly momentum encoder along with a limited queue of previous samples, Tian et al. [38] use all the generated multi-view samples with the minibatch approach, whereas both SimClr V1 [4] and SimClr V2 [5] utilize all the generated sample representations within the minibatch. The above methods can provide a pretrained network for a downstream task, but they do not account for domain shift if they are applied directly.
However, our approach aims to learn representations that are generalizable without any need for labeled data. Recently, contrastive learning was applied in the Unsupervised Domain Adaptation setting [18, 32, 21], where models have access to the source labels and/or use models pretrained on ImageNet as their backbone network. In comparison, our work is based on contrastive learning, which is also referred to as unsupervised representation learning, without access to labeled data or pretrained ImageNet parameters, instead leveraging the vast amount of unlabeled source and target data to train an encoder from randomly initialized parameters.
Removal of false negatives: As the name suggests, contrastive learning methods learn by contrasting semantically similar and dissimilar pairs of samples. They rely on a large number of negative samples to generate good-quality representations and therefore favor large batch sizes. As we do not have access to labels, when an anchor image is paired with negative samples to form negative pairs, there is a probability that these images share the same class, in which case the contribution towards the contrastive loss becomes minimal, limiting the ability of the model to converge quickly. These false negatives remain a fundamental problem in contrastive learning methodology, but relatively limited work has been done in this area thus far.
Most existing methods focus on mining hard negatives: [17] developed hard negative mixing to synthesize hard negatives on the fly in the embedding space, and [35] developed new sampling methods for selecting hard negative samples where the user can control the hardness. In addition, [16] proposed an approach for false negative elimination and false negative attraction, and [7] developed a debiased contrastive objective that corrects for the sampling bias of negative samples. [16] use additional support views and aggregation as part of their elimination and attraction strategy. Our proposed approach is inspired by [16], but we have simplified it and applied only the false negative elimination part to the domain adaptation framework. Instead of using additional support views, we compute the similarity loss between the anchor and the negatives in the minibatch, sort the corresponding negative pair similarity losses for each anchor, and remove the negative pairs most similar to the anchor. For each anchor in the minibatch, we remove exactly the same number of negative pairs. For example, with a batch size of 512 each anchor has 2(512-1) = 1022 negative samples; removing one potential false negative per anchor removes 512 potential false negatives in total for the anchor images in the minibatch of 512.
3 Method
3.1 Model Overview
Contrastive Domain Adaptation (CDA): We explore a new domain adaptation setting in a fully self-supervised fashion without any labelled data, that is, where both the source and the target domains contain only unlabelled data. In the normal UDA setting, we have access to the source domain labels, whereas our goal is to train a model using only these unlabelled data sources in order to learn visual features that generalize across both domains. The aim is to obtain pretrained weights that are robust to domain shift and generalizable to the downstream domain adaptation task. Our model uses unlabelled source and target datasets in an attempt to learn and solve the adaptation between domains.
Inspired by the recent successes of learning from unlabeled data, the proposed learning framework is based on SimClr [4], adapted to the domain adaptation setting, where data from unlabelled source and target domains is used in a task-agnostic way. The SimClr [4] method learns visual similarity: the model pulls visually similar-looking images together while pushing dissimilar-looking images apart. However, in domain adaptation, images of the same class may look very different due to the domain gap, so learning visual similarity alone does not ensure semantic similarity and domain invariance between domains. With CDA, we therefore aim to learn general visual class-discriminative and domain-invariant features from both domains via unsupervised pretraining. We introduce each component in detail below; the framework is illustrated in Figure 1, and in Figure 2 for four views.
From a randomly sampled minibatch of N images, we augment each image $x$ twice, creating two views $\tilde{x}_i$ and $\tilde{x}_j$ of the same anchor image. We use a base encoder (ResNet50 architecture [14]) that is trained from scratch to encode the augmented images and generate the representations $h_i$ and $h_j$. These representations are then fed into a nonlinear MLP with two hidden layers to obtain the projected vector representations $z_i$ and $z_j$. We find that this MLP projection benefits our model by compressing our images into a latent space representation, enabling the model to learn the high-level features of the images. We apply a contrastive loss on the vector representations using the NT-Xent loss [4], modified to identify and eliminate false negatives and thus improve accuracy, as discussed in Section 3.3. We also introduce MMD to measure the domain discrepancy in feature space in order to minimize domain shift, as discussed later in this section. The aim is to obtain pretrained weights that are robust to domain shift and efficiently generalizable. In the later stage, we perform linear evaluation using the encoder, entirely discarding the MLP projection head after pretraining.
3.2 Contrastive Loss for DA setting:
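The goal of contrastive learning is to maximize the similarities between positive pairs and minimize the similarities between negative ones. We randomly sample a minibatch of N images, and each anchor image $x$ is augmented twice, creating two views $\tilde{x}_i$ and $\tilde{x}_j$ of the same sample and resulting in 2N images. We do not explicitly sample negative pairs; instead we follow [4] and treat the other $2(N-1)$ augmented image samples as negatives. The contrastive loss is defined as follows:
$$\mathcal{L}_{i,j} = -\log \frac{\exp(\mathrm{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp(\mathrm{sim}(z_i, z_k)/\tau)} \qquad (1)$$
where $z_i$ and $z_j$ are the projected representations of a positive pair, $\mathrm{sim}(\cdot,\cdot)$ is the cosine similarity, $\tau$ is the temperature, and $\mathbb{1}_{[k \neq i]}$ is 1 when $k \neq i$ and 0 otherwise, as in the NT-Xent loss of [4].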
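As an illustration of the NT-Xent loss of [4], the following is a minimal NumPy sketch rather than the implementation used in our experiments. The function name `nt_xent`, the default temperature of 0.5 and the toy inputs are assumptions made for this example only.

```python
import numpy as np

def nt_xent(z_a, z_b, temperature=0.5):
    """NT-Xent loss averaged over all 2N anchors (illustrative sketch).

    z_a, z_b: (N, d) projections of the two augmented views; row i of z_a
    and row i of z_b come from the same image and form a positive pair.
    The temperature default is an assumption, not a value from the paper.
    """
    z = np.concatenate([z_a, z_b], axis=0)             # (2N, d)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit norm: dot product = cosine
    sim = z @ z.T / temperature                        # (2N, 2N) scaled similarities
    np.fill_diagonal(sim, -np.inf)                     # exclude k == i from the denominator
    n = z_a.shape[0]
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])  # index of each positive
    # Stable log-sum-exp over the remaining 2N - 1 entries of every row.
    row_max = sim.max(axis=1, keepdims=True)
    log_denom = row_max[:, 0] + np.log(np.exp(sim - row_max).sum(axis=1))
    return float(np.mean(log_denom - sim[np.arange(2 * n), pos]))

# Toy usage: identical views give a lower loss than unrelated views.
rng = np.random.default_rng(0)
views = rng.normal(size=(8, 16))
loss_aligned = nt_xent(views, views)
loss_random = nt_xent(views, rng.normal(size=(8, 16)))
```

The denominator keeps the positive term and all negatives, so the loss only reaches zero when an anchor has no negatives at all.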
However, if the above contrastive loss is used directly in a domain adaptation scenario, the minibatch contains image samples from both domains. The loss then treats all other samples as negatives for the anchor image without distinguishing domains, even though some of them may belong to the same class. Because of the differences in domain-specific visual characteristics, this can further widen the distance between the domains, so the model is unable to learn domain invariance. To overcome these problems, we propose to perform contrastive learning in the source and target domains independently, by randomly sampling instances from the source and target domains separately. Finally, our contrastive loss for DA is defined as follows:
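$$\mathcal{L}_{CDA} = \mathcal{L}^{s} + \mathcal{L}^{t} \qquad (2)$$
where $\mathcal{L}^{s}$ and $\mathcal{L}^{t}$ are the source contrastive loss and the target contrastive loss, each computed with equation 1 on samples from a single domain.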
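The effect of computing the loss per domain can be sketched as follows. This is a hedged illustration, not our training code: the names `nt_xent` and `cda_loss`, the unweighted sum and the temperature of 0.5 are assumptions for this example. The toy data uses one image per domain, so the per-domain loss has no negatives, whereas a mixed batch would contrast the source image against the target image.

```python
import numpy as np

def nt_xent(z_a, z_b, temperature=0.5):
    """NT-Xent loss over one batch of positive pairs (rows of z_a and z_b)."""
    z = np.concatenate([z_a, z_b], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = z @ z.T / temperature
    np.fill_diagonal(sim, -np.inf)                     # an anchor is never its own negative
    n = z_a.shape[0]
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    row_max = sim.max(axis=1, keepdims=True)
    log_denom = row_max[:, 0] + np.log(np.exp(sim - row_max).sum(axis=1))
    return float(np.mean(log_denom - sim[np.arange(2 * n), pos]))

def cda_loss(src_a, src_b, tgt_a, tgt_b, temperature=0.5):
    """Source-only loss plus target-only loss (unweighted sum is an assumption)."""
    return nt_xent(src_a, src_b, temperature) + nt_xent(tgt_a, tgt_b, temperature)

# Hypothetical projections: one source image and one target image, two views each.
src = np.array([[1.0, 0.0]])
tgt = np.array([[0.0, 1.0]])
per_domain = cda_loss(src, src, tgt, tgt)
# Mixing both domains in one batch makes the target image a negative of the source image.
mixed = nt_xent(np.vstack([src, tgt]), np.vstack([src, tgt]))
```

In this toy case `per_domain` is zero because neither domain contributes cross-domain negatives, while `mixed` is strictly positive.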
3.3 Removal of False Negatives:
Unsupervised contrastive representation learning methods aim to learn by contrasting semantically positive and negative pairs of samples. As we do not have access to the true labels in this type of setting, positive pairs are drawn by pairing the augmentations of the same image, whereas negative pairs are drawn from different images within the same batch. So, for a batch of N images, the augmented images form N positive pairs for a total of 2N images, and each anchor is contrasted against the remaining $2(N-1)$ images as negatives. Among those $2(N-1)$ images, some may be similar to the anchor and are therefore false negatives.
During training, an augmented anchor image is compared against the negative samples to contribute towards the contrastive loss. Some of these samples may carry the same semantic information (label) as the anchor and are therefore false negatives. When the anchor and a negative sample share the same class, the contribution towards the contrastive loss becomes minimal, limiting the ability of the model to converge quickly. Because false negatives discard semantic information and lead to a significant performance drop, we identify and remove the negatives that are similar to the anchor in order to improve the performance of contrastive learning.
Inspired by [16], we have simplified their approach and applied only the false negative elimination part to the domain adaptation framework. Instead of using additional support views, we compute the similarity loss between the anchor and the negatives within the minibatch, sort the corresponding negative pair similarity losses for each anchor, and remove the negative pairs most similar to the anchor. For each anchor we remove the same number of negative pairs, for example one negative pair per anchor in the minibatch.
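After removing the false negatives, the contrastive loss can be defined as follows:
$$\mathcal{L}^{FNR}_{i,j} = -\log \frac{\exp(\mathrm{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i,\; k \notin S_i]} \exp(\mathrm{sim}(z_i, z_k)/\tau)} \qquad (3)$$
where $S_i$ is the set of negatives that are similar to the anchor $i$; $z$, $\mathrm{sim}(\cdot,\cdot)$ and $\tau$ denote the projected representations, the cosine similarity and the temperature of the NT-Xent loss [4].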
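A minimal NumPy sketch of false negative removal is given below, under stated assumptions: negatives are ranked by cosine similarity to the anchor, the `num_remove` most similar ones form the removed set, and the temperature is 0.5. The function name `fnr_nt_xent` and these choices are illustrative and not taken from our experimental code.

```python
import numpy as np

def fnr_nt_xent(z_a, z_b, num_remove=1, temperature=0.5):
    """NT-Xent loss with the most similar negatives of each anchor removed.

    Ranking by cosine similarity and the default temperature are assumptions
    made for this sketch.
    """
    z = np.concatenate([z_a, z_b], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = z @ z.T / temperature
    two_n = sim.shape[0]
    n = two_n // 2
    rows = np.arange(two_n)
    pos = np.concatenate([np.arange(n, two_n), np.arange(n)])
    pos_sim = sim[rows, pos].copy()

    # Candidate negatives: mask out the anchor itself and its positive.
    neg = sim.copy()
    neg[rows, rows] = -np.inf
    neg[rows, pos] = -np.inf
    if num_remove > 0:
        # Removed set per anchor: indices of its num_remove most similar negatives.
        removed = np.argsort(-neg, axis=1)[:, :num_remove]
        neg[rows[:, None], removed] = -np.inf

    # Denominator: the positive term plus the surviving negatives.
    logits = np.concatenate([pos_sim[:, None], neg], axis=1)
    row_max = logits.max(axis=1, keepdims=True)
    log_denom = row_max[:, 0] + np.log(np.exp(logits - row_max).sum(axis=1))
    return float(np.mean(log_denom - pos_sim))

# Toy usage: removing more negatives can only shrink the denominator.
rng = np.random.default_rng(0)
view_a, view_b = rng.normal(size=(8, 16)), rng.normal(size=(8, 16))
loss_no_removal = fnr_nt_xent(view_a, view_b, num_remove=0)
loss_remove_one = fnr_nt_xent(view_a, view_b, num_remove=1)
```

Because every anchor drops the same number of negatives, the sketch adds only a sort per row on top of the similarity matrix that the contrastive loss already computes.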
However, if the above loss is used directly in a domain adaptation scenario, it suffers from the same issue as the contrastive loss: the minibatch contains image samples from both domains, and the loss treats all other samples as negatives for the anchor image without distinguishing domains, even though some of them may belong to the same class. Because of the differences in domain-specific visual characteristics, this can further widen the distance between the domains, so the model is unable to learn domain invariance. To overcome these problems, we propose to use the FNR loss in the source and target domains independently, by randomly sampling instances from the source and target domains separately. Finally, our joint FNR loss for DA is defined as follows:
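$$\mathcal{L}_{CDA\_FNR} = \mathcal{L}^{s}_{FNR} + \mathcal{L}^{t}_{FNR} \qquad (4)$$
where $\mathcal{L}^{s}_{FNR}$ and $\mathcal{L}^{t}_{FNR}$ are the source FNR loss and the target FNR loss, each computed with equation 3 on samples from a single domain.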
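3.4 Revisiting Maximum Mean Discrepancy (MMD):
MMD defines the distance between two distributions through their mean embeddings in a Reproducing Kernel Hilbert Space (RKHS). MMD is a two-sample kernel test to determine whether to accept or reject the null hypothesis $H_0: p = q$ [12], where $p$ and $q$ are the source and target domain probability distributions. MMD is motivated by the fact that if two distributions are identical, all of their statistics should be the same. The empirical estimate of the squared MMD using two datasets is computed by the following equation:
$$\mathrm{MMD}^2(X_s, X_t) = \left\| \frac{1}{N}\sum_{i=1}^{N}\phi(x^s_i) - \frac{1}{M}\sum_{j=1}^{M}\phi(x^t_j) \right\|^2_{\mathcal{H}} = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{i'=1}^{N} k(x^s_i, x^s_{i'}) - \frac{2}{NM}\sum_{i=1}^{N}\sum_{j=1}^{M} k(x^s_i, x^t_j) + \frac{1}{M^2}\sum_{j=1}^{M}\sum_{j'=1}^{M} k(x^t_j, x^t_{j'}) \qquad (5)$$
where $\phi(\cdot)$ is the mapping to the RKHS $\mathcal{H}$, $k(x, y) = \langle \phi(x), \phi(y) \rangle_{\mathcal{H}}$ is the universal kernel associated with this mapping, and N, M are the total number of items in the source and target respectively. In short, the MMD between the distributions of two datasets is equivalent to the distance between the sample means in a high-dimensional feature space.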
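The empirical squared MMD can be sketched in NumPy as follows. This is an illustration under assumptions: a single Gaussian kernel with bandwidth 1.0 is used as the universal kernel, and the function names and toy data are hypothetical; the kernel and bandwidth of our experiments are not specified here.

```python
import numpy as np

def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel matrix between rows of x and rows of y."""
    sq_dists = (
        np.sum(x ** 2, axis=1)[:, None]
        + np.sum(y ** 2, axis=1)[None, :]
        - 2.0 * x @ y.T
    )
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth ** 2))

def mmd_squared(source, target, bandwidth=1.0):
    """Biased empirical estimate of the squared MMD between two samples."""
    k_ss = gaussian_kernel(source, source, bandwidth).mean()  # (1/N^2) sum k(s, s')
    k_tt = gaussian_kernel(target, target, bandwidth).mean()  # (1/M^2) sum k(t, t')
    k_st = gaussian_kernel(source, target, bandwidth).mean()  # (1/NM) sum k(s, t)
    return float(k_ss + k_tt - 2.0 * k_st)

# Toy usage: a shifted target sample is further from the source than an unshifted one.
rng = np.random.default_rng(0)
source_feats = rng.normal(size=(200, 4))
target_same = rng.normal(size=(200, 4))
target_shifted = target_same + 3.0
mmd_same = mmd_squared(source_feats, target_same)
mmd_shifted = mmd_squared(source_feats, target_shifted)
```

Taking the mean of each kernel matrix implements the three double sums of the kernel form of the estimate, including their normalizing factors.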
4 Experiments
4.1 Datasets
Since we propose a new task, there is no benchmark that is specifically designed for it. We illustrate the performance of our method on the contrastive domain adaptation task by comparing it with SimClrBase and CDABase, explained later in Section 4.3, on the standard digit dataset benchmarks, using accuracy as the evaluation metric.
MNIST → USPS (MU): MNIST [23], which stands for “modified NIST”, is treated as the source domain and consists of black-and-white handwritten digits. USPS [9] is treated as the target domain and consists of handwritten digits scanned and segmented by the U.S. Postal Service. As both datasets contain grayscale images, the domain shift between them is relatively small. Figure 4 shows sample images from MU.
SVHN → MNIST (MS): In this setting, SVHN [30] is treated as the source domain and MNIST [23] as the target domain. MNIST consists of black-and-white handwritten digits, whereas SVHN consists of crops of coloured street-view house numbers, each containing a single digit extracted from images of urban house numbers from Google Street View. SVHN and MNIST are two digit classification datasets with a drastic distributional shift between them. The adaptation from MNIST to SVHN is quite challenging because MNIST has a significantly lower intrinsic dimensionality than SVHN. Figure 5 shows sample images from MS.
MNIST → MNIST-M (MMM): MNIST [23], which consists of black-and-white handwritten digits, is treated as the source domain and MNIST-M as the target domain. MNIST-M is a modification of the MNIST dataset in which the digits are blended with random patches from colour photos of the BSDS500 dataset. Figure 6 shows sample images from MMM.
4.2 Implementation Details
CDA uses a ResNet50 [14] base encoder trained from scratch, followed by a two-layer nonlinear MLP. During pretraining, we train CDA on 2 Titan Xp GPUs, using the LARS optimizer [44] with a batch size of 512 and a weight decay of 1e-6 for a total of 300 epochs. Similar to SimClr [4], we report performance by training a linear classifier on top of a fixed representation, using only source labels. This is a standard benchmark for evaluating representations that has been adopted by many papers in the literature [4, 5, 31].
4.3 Evaluation
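The linear evaluation protocol described in Section 4.2 can be sketched as follows: a linear (softmax) classifier is fitted on frozen features using source labels only and then scored on target features. Everything in this sketch is an assumption for illustration, including the function names, the learning rate, the number of epochs and the synthetic features that stand in for encoder outputs.

```python
import numpy as np

def train_linear_probe(features, labels, num_classes, lr=0.1, epochs=200):
    """Fit softmax regression on frozen features with full-batch gradient descent."""
    n, d = features.shape
    weights = np.zeros((d, num_classes))
    bias = np.zeros(num_classes)
    one_hot = np.eye(num_classes)[labels]
    for _ in range(epochs):
        logits = features @ weights + bias
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - one_hot) / n                  # gradient of mean cross-entropy w.r.t. logits
        weights -= lr * features.T @ grad
        bias -= lr * grad.sum(axis=0)
    return weights, bias

def accuracy(features, labels, weights, bias):
    preds = np.argmax(features @ weights + bias, axis=1)
    return float(np.mean(preds == labels))

# Synthetic stand-ins for frozen encoder outputs (hypothetical, not the paper's data).
rng = np.random.default_rng(0)
centers = 4.0 * np.eye(3, 8)                          # one cluster centre per class
src_labels = rng.integers(0, 3, size=300)
tgt_labels = rng.integers(0, 3, size=300)
src_feats = centers[src_labels] + rng.normal(size=(300, 8))
tgt_feats = centers[tgt_labels] + 0.5 + rng.normal(size=(300, 8))  # mild domain shift

w, b = train_linear_probe(src_feats, src_labels, num_classes=3)    # source labels only
source_acc = accuracy(src_feats, src_labels, w, b)
target_acc = accuracy(tgt_feats, tgt_labels, w, b)                 # evaluated on the target
```

The encoder is never updated in this protocol, so the target accuracy reflects how well the pretrained representation transfers across domains.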
We conducted various experiments using unlabeled source and target digit datasets. The goal of our experiments is to introduce contrastive learning to the domain adaptation problem in order to maximize generalization between source and target datasets by learning class-discriminative and domain-invariant features. We also aim to improve the performance of the contrastive loss by eliminating false negatives, which are one of the main drawbacks of contrastive learning without access to labels. We performed multiple experiments using two views and four views [38]. Figure 3 compares the average accuracy of our proposed two-view CDA frameworks with CDABase. The experimental scenarios we performed on the digit datasets are as follows.
SimClrBase: We start our experimental analysis by evaluating SimClr. We trained on the source dataset using the same setup as SimClr, whilst testing on the target dataset. We treat this as a strong baseline, which we call SimClrBase, and use it as a reference for comparison against the other methods.
CDABase: We followed the methodology described in Section 3.2, trained the model using equation 2 and evaluated it on the target domain. As Table 1 shows, the model achieves higher performance than SimClrBase. The model has learnt both visual similarity and domain invariance, minimizing the distance between the domains and maximizing the classification accuracy. Overall, the average accuracy over all the datasets increased by around 19 percentage points compared to the SimClrBase model. We treat this result as a second strong baseline and call it CDABase.
CDA_FNR: We followed the methodology described in Section 3.3 and trained the model using equation 4. We then evaluated the model trained using this method on the target domain. As Table 1 shows, in addition to learning visual similarity and domain invariance, the model also identified and eliminated potential false negatives, which contain the same semantic information as the anchor, resulting in faster convergence and increased accuracy. We experimented with two scenarios: in the first, which we call FNR1, we removed one false negative per anchor, and in the second, which we call FNR2, we removed two. The results in Table 1 indicate that the removal of false negatives improves accuracy and speeds up convergence. The average accuracy increased by 2.3 percentage points after removing one false negative. By removing two false negatives, the average accuracy increased by 3.8 percentage points in comparison to CDABase and by 1.5 percentage points in comparison to FNR1. Compared to SimClrBase, the average accuracy increased by around 22 percentage points (from 53.1% to 75.5%).
CDAMMD: We used the same setup as for CDABase. Additionally, we introduced MMD as described in Section 3.4, computed between the vector representations extracted from each domain as per equation 5, in order to reduce the distance between the source and target distributions. We backpropagate the NT-Xent loss from equation 2 along with the MMD loss from equation 5. From Table 2, we observe that by minimizing both losses together, our model achieves much better alignment of the source and target domains, showing the usefulness of combining the contrastive loss with MMD alignment. Compared to CDABase, the average accuracy increased by 4.5 percentage points.
CDA_FNRMMD: We used the same setup as for CDA_FNR. Additionally, we introduced MMD, computed between the vector representations extracted from each domain as per equation 5, in order to reduce the distance between the source and target distributions. We calculate the FNR loss for both the source and target domains using equation 4 and backpropagate it along with the MMD loss from equation 5. From Table 2, we observe that by removing the potential false negatives and minimizing the discrepancy together, our model retains semantic information, converges faster and learns both visual similarity and domain invariance by aligning the source and target datasets efficiently, showing the effectiveness of this method. Compared to CDABase, the average accuracy increased by 5.1 percentage points.
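The two combined objectives described above can be written compactly as follows. This is a sketch under the assumption that the contrastive and MMD terms are summed with unit weights, which the text does not state explicitly. Here $\mathcal{L}_{CDA}$ denotes the loss of equation 2, $\mathcal{L}_{CDA\_FNR}$ the loss of equation 4, and $\mathrm{MMD}^2(X_s, X_t)$ the estimate of equation 5 computed on the source and target representations.

```latex
% Assumed unweighted sums; any weighting factor is not specified in the text.
\begin{align}
\mathcal{L}_{\mathrm{CDAMMD}} &= \mathcal{L}_{CDA} + \mathrm{MMD}^2(X_s, X_t) \\
\mathcal{L}_{\mathrm{CDA\_FNRMMD}} &= \mathcal{L}_{CDA\_FNR} + \mathrm{MMD}^2(X_s, X_t)
\end{align}
```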
Comparison with state-of-the-art methods: As the proposed framework is new, there are no benchmarks specifically designed for our task, so a like-for-like comparison is very difficult. Using our method, we demonstrate that our model can perform well in the domain adaptation setting without access to labelled data or ImageNet parameters, just by training on the unlabeled data itself, whereas all the unsupervised domain adaptation methods have access to the source labels. We compared our results with state-of-the-art models and conclude that our model performs favorably. From Table 3, our model outperforms other state-of-the-art models such as DANN, DAN, ADDA, DDC and SimClrBase [10, 28, 39, 27, 40, 4] on the MNIST → USPS and SVHN → MNIST tasks.
Table 1: Accuracy (%) on the target domain.
Method          MNIST→USPS  SVHN→MNIST  MNIST→MNIST-M  Avg
SimClrBase [4]  92.0        31.7        34.9           53.1
CDABase         92.5        64.8        57.9           71.7
CDA_FNR1        93.2        69.4        59.5           74.0
CDA_FNR2        94.1        71.7        60.6           75.5
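The averages and gains quoted for the two-view methods can be recomputed from the per-task accuracies with a few lines of Python. The sketch assumes that the three accuracy columns are MNIST to USPS, SVHN to MNIST and MNIST to MNIST-M, in that order; the variable names are illustrative. Recomputed means can differ from the reported Avg column by a few tenths, presumably because the reported averages were computed from unrounded accuracies (an assumption).

```python
# Per-task accuracies (%) copied from the table above; the column order
# (MNIST to USPS, SVHN to MNIST, MNIST to MNIST-M) is an assumption.
table1 = {
    "SimClrBase": [92.0, 31.7, 34.9],
    "CDABase": [92.5, 64.8, 57.9],
    "CDA_FNR1": [93.2, 69.4, 59.5],
    "CDA_FNR2": [94.1, 71.7, 60.6],
}

# Mean accuracy per method and gain in percentage points over CDABase.
averages = {name: sum(vals) / len(vals) for name, vals in table1.items()}
gain_over_cdabase = {
    name: avg - averages["CDABase"] for name, avg in averages.items()
}

for name in table1:
    print(
        f"{name:11s} avg={averages[name]:.1f} "
        f"gain vs CDABase={gain_over_cdabase[name]:+.1f}"
    )
```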
Inspired by [38], we also performed similar experiments using four views. The experimental scenarios on the digit datasets are described below and are compared with CDABase and with Contrastive Domain Adaptation with Four Augmentations (CDAx4aug).
CDAx4aug: We tested our method using four augmentations per anchor in each domain and followed the methodology described in Section 3.2, training the model using equation 2. The only change is that we now backpropagate four contrastive losses, two from the source and two from the target. From Table 4, we observe that the additional augmentations significantly improve the average accuracy compared to the two-view CDABase, due to the availability of additional positive and negative samples. Overall, adding two views to the CDABase method increases the average accuracy by 5.1 percentage points.
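One way to write the four-view objective is sketched below. It assumes that the four views $v_1, \dots, v_4$ of each image are grouped into two pairs, $(v_1, v_2)$ and $(v_3, v_4)$, and that the four resulting losses are summed with equal weight; the text states only that two contrastive losses per domain are backpropagated, so the pairing is an assumption. $\mathcal{L}^{d}(v_a, v_b)$ denotes the contrastive loss of equation 1 computed within domain $d$ on views $v_a$ and $v_b$.

```latex
% Assumed pairing of views and unweighted sum over the source (s) and target (t) domains.
\begin{equation}
\mathcal{L}_{\mathrm{CDAx4aug}} = \sum_{d \in \{s, t\}}
  \left( \mathcal{L}^{d}(v_1, v_2) + \mathcal{L}^{d}(v_3, v_4) \right)
\end{equation}
```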
Table 2: Accuracy (%) on the target domain.
Method          MNIST→USPS  SVHN→MNIST  MNIST→MNIST-M  Avg
SimClrBase [4]  92.0        31.7        34.9           53.1
CDABase         92.5        64.8        57.9           71.7
CDAMMD          93.4        74.8        60.6           76.2
CDA_FNRMMD      94.2        76.2        60.2           76.8
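Table 3: Accuracy (%) on the target domain. A dash marks a result that is not reported.
Method                   MNIST→USPS  SVHN→MNIST  MNIST→MNIST-M
SimClrBase [4]           92.0        31.7        34.9
DDC                      79.1        68.1        -
ADDA                     89.4        76.0        -
DANN                     -           73.8        76.6
DAN                      81.1        71.1        76.9
CDA_FNRMMD (our method)  94.2        76.2        60.2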
Table 4: Accuracy (%) on the target domain.
Method          MNIST→USPS  SVHN→MNIST  MNIST→MNIST-M  Avg
SimClrBase [4]  92.0        31.7        34.9           53.1
CDABase         92.5        64.8        57.9           71.7
CDAx4aug        92.9        74.1        63.5           76.8
CDAx4aug_FNR    93.6        75.0        64.0           77.5
Table 5: Accuracy (%) on the target domain.
Method           MNIST→USPS  SVHN→MNIST  MNIST→MNIST-M  Avg
SimClrBase [4]   92.0        31.7        34.9           53.1
CDABase          92.5        64.8        57.9           71.7
CDAx4aug         92.9        74.1        63.5           76.8
CDAx4augMMD      92.7        69.3        58.6           73.5
CDAx4aug_FNRMMD  92.5        70.6        61.5           74.9
CDAx4aug_FNR: We followed the methodology described in Section 3.3 and trained the model using equation 4 on four augmentations per domain as opposed to two. We then evaluated the model trained using this method on the target domain. From Table 4, the additional views helped the model learn visual similarity and domain invariance, minimizing the distance between the domains. They also helped the model identify and eliminate potential false negatives, resulting in faster convergence and an average accuracy increase of 5.8 percentage points compared to CDABase and 0.7 percentage points compared to CDAx4aug.
CDAx4augMMD: We used the same setup as for CDAx4aug and additionally introduced MMD, computed between the vector representations extracted from each domain as per equation 5. We backpropagate the NT-Xent loss of equation 2 for two pairs of source views and two pairs of target views, along with the MMD loss of equation 5. From Table 5, the gain from adding MMD is not significant: the average accuracy (73.5%) is above CDABase (71.7%) but below CDAx4aug without MMD (76.8%). We attribute this to noise from the additional augmentations, which slows the convergence between the source and target distributions.
CDAx4aug_FNRMMD: We used the same setup as for CDAx4aug_FNR and additionally introduced MMD, computed between the vector representations extracted from each domain as per equation 5. We backpropagate the FNR loss of equation 4 along with the MMD loss of equation 5. From Table 5, the average accuracy increases compared to CDAx4augMMD (74.9% versus 73.5%) due to the false negative removal, but the addition of MMD has comparatively slowed the convergence.
5 Conclusion
Over the past few years, ImageNet pretraining has become standard practice. Employing our proposed contrastive domain adaptation approach and its variants, we demonstrate that our model can perform competitively in a domain adaptation setting without access to labelled data or ImageNet parameters, just by training on the unlabeled data itself. CDA also introduces the identification and removal of potential false negatives in the DA setting, resulting in improved accuracy. We also extend our framework to learn from more than two views in the DA setting. We tested our model in various experimental scenarios, demonstrating that it can be effectively used for downstream domain adaptation tasks. We hope that our work encourages future researchers to apply contrastive learning to domain adaptation.
References

[1]
B. Alhnaity, S. Kollias, G. Leontidis, S. Jiang, B. Schamp, and S. Pearson.
An autoencoder wavelet based deep neural network with attention mechanism for multistep prediction of plant growth.
Information Sciences, 560, 2021. 
[2]
F. Caliva, F. S. De Ribeiro, A. Mylonakis, C. Demazi’ere, P. Vinai,
G. Leontidis, and S. Kollias.
A deep learning approach to anomaly detection in nuclear reactors.
In 2018 International joint conference on neural networks (IJCNN), pages 1–8. IEEE, 2018. 
[3]
C. Chen, Z. Chen, B. Jiang, and X. Jin.
Joint domain alignment and discriminative feature learning for
unsupervised deep domain adaptation.
In
Proceedings of the AAAI conference on artificial intelligence
, volume 33, pages 3296–3303, 2019.  [4] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton. A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709, 2020.
 [5] T. Chen, S. Kornblith, K. Swersky, M. Norouzi, and G. Hinton. Big selfsupervised models are strong semisupervised learners. arXiv preprint arXiv:2006.10029, 2020.
 [6] Z. Chen, C. Chen, X. Jin, Y. Liu, and Z. Cheng. Deep joint twostream wasserstein autoencoder and selective attention alignment for unsupervised domain adaptation. Neural computing and applications, pages 1–14, 2019.
 [7] C.Y. Chuang, J. Robinson, L. YenChen, A. Torralba, and S. Jegelka. Debiased contrastive learning. arXiv preprint arXiv:2007.00224, 2020.
 [8] F. De Sousa Ribeiro, G. Leontidis, and S. Kollias. Introducing routing uncertainty in capsule networks. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 6490–6502. Curran Associates, Inc., 2020.

[9]
J. S. Denker, W. Gardner, H. P. Graf, D. Henderson, R. E. Howard, W. Hubbard,
L. D. Jackel, H. S. Baird, and I. Guyon.
Neural network recognizer for handwritten zip code digits.
In Advances in neural information processing systems
, pages 323–331. Citeseer, 1989.

[10] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky. Domain-adversarial training of neural networks. The journal of machine learning research, 17(1):2096–2030, 2016.
[11] S. Ghosh, A. Pal, S. Jaiswal, K. Santosh, N. Das, and M. Nasipuri. Segfastv2: Semantic image segmentation with less parameters in deep learning for autonomous driving. International Journal of Machine Learning and Cybernetics, 10(11):3145–3154, 2019.
[12] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723–773, 2012.

[13] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9729–9738, 2020.
[14] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
 [15] M. He and D. He. A new hybrid deep signal processing approach for bearing fault diagnosis using vibration signals. Neurocomputing, 396:542–555, 2020.
[16] T. Huynh, S. Kornblith, M. R. Walter, M. Maire, and M. Khademi. Boosting contrastive self-supervised learning with false negative cancellation. arXiv preprint arXiv:2011.11765, 2020.
 [17] Y. Kalantidis, M. B. Sariyildiz, N. Pion, P. Weinzaepfel, and D. Larlus. Hard negative mixing for contrastive learning. arXiv preprint arXiv:2010.01028, 2020.
 [18] G. Kang, L. Jiang, Y. Yang, and A. G. Hauptmann. Contrastive adaptation network for unsupervised domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4893–4902, 2019.
[19] S. Karakanis and G. Leontidis. Lightweight deep learning models for detecting covid-19 from chest x-ray images. Computers in Biology and Medicine, 130:104181, 2021.
[20] P. W. Khan, Y.-C. Byun, and N. Park. IoT-blockchain enabled optimized provenance system for food industry 4.0 using advanced deep learning. Sensors, 20(10):2990, 2020.
[21] D. Kim, K. Saito, T.-H. Oh, B. A. Plummer, S. Sclaroff, and K. Saenko. Cross-domain self-supervised learning for domain adaptation with few source labels. arXiv preprint arXiv:2003.08264, 2020.
[22] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
[23] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[24] C.-Y. Lee, T. Batra, M. H. Baig, and D. Ulbricht. Sliced wasserstein discrepancy for unsupervised domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10285–10295, 2019.
 [25] Y. Li, H. Zhang, C. Bermudez, Y. Chen, B. A. Landman, and Y. Vorobeychik. Anatomical context protects deep learning from adversarial perturbations in medical imaging. Neurocomputing, 379:370–378, 2020.
[26] D. Liu, Y. Cui, Y. Chen, J. Zhang, and B. Fan. Video object detection for autonomous driving: Motion-aid feature calibration. Neurocomputing, 409:1–11, 2020.
[27] M.-Y. Liu and O. Tuzel. Coupled generative adversarial networks. arXiv preprint arXiv:1606.07536, 2016.
 [28] M. Long, Y. Cao, J. Wang, and M. Jordan. Learning transferable features with deep adaptation networks. In International conference on machine learning, pages 97–105. PMLR, 2015.
 [29] M. Long, H. Zhu, J. Wang, and M. I. Jordan. Deep transfer learning with joint adaptation networks. In International conference on machine learning, pages 2208–2217. PMLR, 2017.
 [30] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. 2011.
 [31] A. v. d. Oord, Y. Li, and O. Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
 [32] C. Park, J. Lee, J. Yoo, M. Hur, and S. Yoon. Joint contrastive learning for unsupervised domain adaptation. arXiv preprint arXiv:2006.10297, 2020.
[33] H. Purwins, B. Li, T. Virtanen, J. Schlüter, S.-Y. Chang, and T. Sainath. Deep learning for audio signal processing. IEEE Journal of Selected Topics in Signal Processing, 13(2):206–219, 2019.
[34] F. D. S. Ribeiro, F. Calivá, M. Swainson, K. Gudmundsson, G. Leontidis, and S. Kollias. Deep bayesian self-training. Neural Computing and Applications, pages 1–17, 2019.
[35] J. Robinson, C.-Y. Chuang, S. Sra, and S. Jegelka. Contrastive learning with hard negative samples. arXiv preprint arXiv:2010.04592, 2020.
 [36] B. Sun and K. Saenko. Deep coral: Correlation alignment for deep domain adaptation. In European conference on computer vision, pages 443–450. Springer, 2016.
[37] M. Thota, S. Kollias, M. Swainson, and G. Leontidis. Multi-source domain adaptation for quality control in retail food packaging. Computers in Industry, 123:103293, 2020.
 [38] Y. Tian, D. Krishnan, and P. Isola. Contrastive multiview coding. arXiv preprint arXiv:1906.05849, 2019.
 [39] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell. Adversarial discriminative domain adaptation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7167–7176, 2017.
 [40] E. Tzeng, J. Hoffman, N. Zhang, K. Saenko, and T. Darrell. Deep domain confusion: Maximizing for domain invariance. arXiv preprint arXiv:1412.3474, 2014.
[41] D. Wang, C. Devin, Q.-Z. Cai, F. Yu, and T. Darrell. Deep object-centric policies for autonomous driving. In 2019 International Conference on Robotics and Automation (ICRA), pages 8853–8859. IEEE, 2019.
 [42] X. Wu, D. Sahoo, and S. C. Hoi. Recent advances in deep learning for object detection. Neurocomputing, 396:39–64, 2020.
[43] Z. Wu, Y. Xiong, S. X. Yu, and D. Lin. Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3733–3742, 2018.
 [44] Y. You, I. Gitman, and B. Ginsburg. Large batch training of convolutional networks. arXiv preprint arXiv:1708.03888, 2017.