1 Introduction

Deep neural networks have shown impressive performance on learning tasks in domains where a large number of labeled data are available for training [21, 34, 15]. However, they often fail to generalize to a new domain where the distribution of input data significantly deviates from the original domain, i.e., when a domain gap arises. The goal of domain adaptation is to adapt a learner to the new domain (target) using the labeled data available from the original domain (source). Unsupervised domain adaptation (UDA) attempts to tackle this inter-domain discrepancy problem without any supervision on the target domain, assuming that no labels for target samples are available in training [12, 33, 26, 17]. In contrast, semi-supervised domain adaptation (SSDA) relaxes this strict constraint, using a small number of additional labels on the target data, e.g., a few labels per class [32]. As such additional labels are easy to obtain on the target data, SSDA renders the adaptation problem more practical and better situated in learning.

Empirical results [32] show that a naïve extension of UDA to SSDA, i.e., treating the labeled samples on the target domain as a part of those on the source domain, suffers from target intra-domain discrepancy, i.e., the distribution of labeled samples on the target domain becomes separated from that of unlabeled samples during training. We consider this intra-domain discrepancy and the aforementioned inter-domain discrepancy as the major challenges of SSDA, and we illustrate them in Figure 2. Previous methods for SSDA [32, 20] aim to address the issue using a prototype-based approach; they create a prototype representation for each class and reduce the distance between each prototype and its nearby unlabeled samples.
In this paper, we propose a new SSDA approach, dubbed sample-to-sample self-distillation (SD), that leverages rich sample-to-sample relations rather than prototype-to-data relations. Our method takes a labeled sample from either the source or the target domain as a teacher and an unlabeled sample as a student. When the teacher comes from the source domain, it minimizes the inter-domain discrepancy between the source and the target. When the teacher is a labeled sample on the target domain, it effectively suppresses the intra-domain discrepancy within the target. Because naïvely reducing the domain gap is demanding, we generate assistant features that help bridge the gap. It is known that the domain and the style of an image are closely related and that training with mixed-style features helps to bridge the domain discrepancy [46, 13]. Inspired by this, we create the assistant features by transferring intermediate styles between the teacher and the student. A model is then trained by minimizing the output discrepancy between the assistant and the student. The assistant features smoothly bridge the discrepancy between the two domains, making it easier for the student to learn from the teacher.
To generate reliable pairs of the teacher and the student, we employ pseudo-labeling [22] and present a new form of reliability evaluation on the pseudo-label motivated by [45]. Compared to the previous prototype-based approach, our pair-based approach fully exploits rich and diverse supervisory signals via data-to-data distillation and effectively adapts to the target domain by minimizing both the intra-domain and inter-domain discrepancy. The contributions of this paper are summarized as follows:
- We propose sample-to-sample self-distillation (SD), which exploits rich sample-to-sample relations using self-distillation.
- We generate assistant features, whose style is an intermediate of the source and the target styles, to fill the domain gap and thus facilitate adaptation.
- We show that SD effectively adapts a network to a target domain by alleviating both the inter- and intra-domain discrepancies, and that SD sets a new state of the art.


Figure 3: Overview of SD. We use a feature extractor and a classifier trained in the pre-training stage (Section 3.1). SD consists of the reliable student-set generation (RSS) step and the training step of sample pairing, assistant feature generation, and self-distillation; the RSS step is omitted in the figure. TF, AF, and SF denote the teacher, assistant, and student features, respectively. In the sample pairing step, the model pairs a student and a teacher of the same (pseudo-)label. In the assistant generation step, the teacher is forwarded and its channel-wise mean and standard deviation are extracted; these statistics are used to impose an intermediate style on the student feature, generating the assistant feature. The pair loss is then calculated to minimize the difference between the outputs of the assistant and the student. Feature normalization, temperature scaling, and the softmax operation are omitted for simplicity.

2 Related work
Semi-supervised domain adaptation. The goal of semi-supervised domain adaptation (SSDA) is to adapt a model to the target domain with a few labels of target data [32]. Although SSDA has been considered in [1, 10, 41], most recent research has explored unsupervised domain adaptation (UDA). The main issue of domain adaptation is the gap between the source and the target domain distributions, and previous UDA methods focus on aligning the two distributions. Adversarial learning between a domain classifier and a feature extractor is one representative UDA approach [12, 33, 26, 24, 40]. Learning with pseudo-labels [22] is another approach in UDA [39, 4, 9, 45]: to compensate for the absence of target domain labels, the network assigns labels to the target data according to a certain criterion and then utilizes the obtained pseudo-labels as supervision for training on the target domain data. SSDA is re-examined in Minimax Entropy (MME) [32] to take advantage of the extra supervision; with minor effort, the model benefits from just a few target labels. MME reveals the ineffectiveness of previous UDA methods in SSDA and proposes a new approach: it minimizes the distance between class prototypes and nearby unlabeled target samples via minimax entropy. Several new SSDA methods have followed MME. [19] generates bidirectional adversarial samples, from source to target domain and from target to source domain, to fill the domain gap. Attract, Perturb, and Explore (APE) [20] analyzes the target intra-domain discrepancy issue and suggests minimizing the gap using Maximum Mean Discrepancy, a perturbation loss, and class prototypes. Among the previous work, MME and APE use class prototypes to adapt to the target domain for SSDA. Unlike this prototype-based line of work, we tackle the issues of SSDA in a simple pair-based way by applying self-distillation, which enables unlabeled target samples to be trained with more abundant supervision.
Style manipulation. The style of images has been manipulated to increase the recognition ability of neural networks [17, 13, 35, 18, 46]. Previous work on style transfer [35, 18] finds that the mean and standard deviation of an intermediate feature of a neural network are closely related to the style of an image. Further, [46] reveals that the domain is related to the image style and suggests mixing the styles of given source domains to generalize the model in domain generalization. Our method is motivated by the observation that intermediate domain styles help to minimize the domain discrepancy [13, 46]. Unlike [46], which trains the model by exposing it to various styles from source domains, our method generates intermediate-style features (assistants) from labeled samples (teachers) and unlabeled samples (students) to guide the students. Moreover, we use the assistant only for matching its soft output with the student's. Our method thus forces the model to produce the same results for two features of the same content with different styles.
Knowledge distillation. The idea of knowledge distillation (KD) is to train a model (student) by transferring knowledge extracted from another model (teacher) that is more powerful than the student [2, 3]. A series of studies on KD has shown its attractive characteristics, such as regularizing the student [42], stabilizing training [6], and preventing models from becoming overconfident [43]. One line of work on KD assumes two independent teacher and student networks sharing an input sample, and maps the output of the student to that of the teacher [16, 31, 44, 28]. This branch of work motivates GDSDA [1], which uses multiple pre-trained source models to give predictions to a target model for domain adaptation. Another interesting line of work investigates self-knowledge distillation, where a single network is trained with knowledge from itself [11, 38, 43]. Our design resorts to this second line of work. We propose to minimize the Kullback–Leibler divergence between the predictions of an intermediate-style feature and its corresponding unlabeled target sample in the form of self-distillation. This learning objective naturally conforms to the goal of domain adaptation: adapting a learner to a target domain by aligning two samples that share semantics yet are visually diverse.
3 Method
The task of semi-supervised domain adaptation is to classify unlabeled samples on a target domain using labeled samples on a source domain together with a limited number of labeled samples on the target domain [32, 20, 25]. Let us consider three datasets given in this context: a source dataset $\mathcal{D}_s = \{(x_i, y_i)\}_{i=1}^{N_s}$, a labeled target dataset $\mathcal{D}_t = \{(x_i, y_i)\}_{i=1}^{N_t}$, and an unlabeled target dataset $\mathcal{D}_u = \{x_i\}_{i=1}^{N_u}$, where $x$, $y$, and $N$ denote a sample, its corresponding label, and the number of samples, respectively. Here, we are given only a limited number of labeled samples per class on the target domain, i.e., $N_t \ll N_u$. The two domains share the same label space but have different input distributions. In this setup, we train a model on $\mathcal{D}_s$, $\mathcal{D}_t$, and $\mathcal{D}_u$, and then evaluate it on $\mathcal{D}_u$ with its ground-truth labels. In training, we validate models on an additional labeled target set; we select the best model and search hyper-parameters on this validation set.
3.1 Classifier model and its pre-training
Our model consists of two parts: a feature extractor $f(\cdot\,;\theta)$ and a classifier $c(\cdot\,;W)$, where $\theta$ and $W$ denote trainable parameters. We use a convolutional neural network for $f$ and a distance-based classifier for $c$ [37, 5]. The distance-based classifier computes its output as the cosine similarity between the input feature $f(x)$ and each column of $W$:

$$p(x) = \mathrm{softmax}\!\left(\frac{1}{T}\,\frac{W^{\top} f(x)}{\lVert W \rVert\, \lVert f(x) \rVert}\right), \tag{1}$$

where the final prediction is obtained via the softmax operation with a temperature $T$. In the following subsections, we often omit the function parameters $\theta$ and $W$ for notational simplicity.
We pre-train the model with labeled samples in $\mathcal{D}_l = \mathcal{D}_s \cup \mathcal{D}_t$ by minimizing the cross-entropy loss:

$$\mathcal{L}_{\mathrm{ce}} = \mathbb{E}_{(x,y) \in \mathcal{D}_l}\bigl[-\log p(y \mid x)\bigr], \tag{2}$$

where $p(y \mid x)$ denotes the prediction of Eq. (1) for class $y$.
This pre-training improves the performance of SD and also speeds up its convergence.
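To make the classifier concrete, here is a minimal PyTorch sketch of a distance-based (cosine-similarity) classifier consistent with Eq. (1); the class and argument names and the temperature default are ours for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosineClassifier(nn.Module):
    """Distance-based classifier: cosine similarity between the input
    feature and each column of W, scaled by a temperature T (cf. Eq. 1)."""

    def __init__(self, feat_dim: int, num_classes: int, temperature: float = 0.05):
        super().__init__()
        # each row is a class weight vector (a column of W in the paper)
        self.weight = nn.Parameter(0.01 * torch.randn(num_classes, feat_dim))
        self.temperature = temperature

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # L2-normalize both sides so the dot product equals cosine similarity
        feat = F.normalize(feat, dim=1)
        weight = F.normalize(self.weight, dim=1)
        logits = feat @ weight.t() / self.temperature
        return logits  # softmax / cross-entropy is applied downstream
```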
3.2 Sample-to-sample self-distillation (SD)
The sample-to-sample self-distillation (SD) is designed to perform SSDA by simultaneously minimizing both the inter-domain discrepancy (between the source and the target) and the intra-domain discrepancy (within the target).
It achieves the goal by alternating two steps: the student-set generation step, and the training step.
The training step consists of sample pairing, assistant generation and self-distillation.
At the student-set generation step, we pseudo-label samples from the unlabeled target dataset $\mathcal{D}_u$ and select reliable ones via reliability evaluation. The resultant set provides the student samples for self-distillation.
At the training step, we randomly produce teacher-student pairs with the same class label, generate the corresponding assistant features, and perform self-distillation by minimizing the distance between the assistant and the student predictions. In pairing, we take one sample from the labeled set $\mathcal{D}_l$ (as a teacher) and the other from the student set $\mathcal{T}_{st}$ (as a student).
The overall procedure is summarized in Algorithm 1 and also illustrated in Figure 3.
In the following, we explain the details of each step and describe the overall training objective.
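As a reading aid, the sketch below outlines the alternating procedure in Python-style pseudocode (cf. Algorithm 1); every helper name here is hypothetical, and the details of each step follow in this section.

```python
def train_sd(model, labeled_data, unlabeled_data, optimizer,
             num_iters, rss_interval=100):
    """Alternate reliable student-set generation (RSS) with the
    pairing / assistant-generation / self-distillation training step."""
    student_set = build_reliable_student_set(model, unlabeled_data)  # RSS
    for it in range(num_iters):
        if it > 0 and it % rss_interval == 0:
            # periodically refresh pseudo-labels and their reliability
            student_set = build_reliable_student_set(model, unlabeled_data)
        # pair teachers (labeled) with students (pseudo-labeled) of the same class
        teachers, students = sample_same_class_pairs(labeled_data, student_set)
        loss = compute_total_loss(model, teachers, students)  # Eq. (10)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```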
Reliable student-set generation. This step consists of pseudo-labeling and reliability evaluation. We assign a class label $\hat{y}$ to each unlabeled sample $x \in \mathcal{D}_u$ and construct a pseudo-labeled set of student samples $\mathcal{T}_{st}$; we simply take the pseudo-label of $x$ as the class index of the maximum prediction value:

$$\hat{y} = \arg\max_{k}\; p(k \mid x). \tag{3}$$
Although pseudo-labeling enables supervised training on unlabeled samples, pseudo-labels are often incorrect, in particular in an early stage of training. We thus drop unreliable samples when composing $\mathcal{T}_{st}$ for pairing. Let $\mathrm{top}_k(\cdot)$ be a selection operator that selects the $k$-th largest value of the logit vector $z(x)$. We construct the student set by reliability evaluation:

$$\mathcal{T}_{st} = \bigl\{ (x, \hat{y}) : x \in \mathcal{D}_u,\;\; \mathrm{top}_1(z(x)) - \mathrm{top}_2(z(x)) > \bar{m} \;\;\text{or}\;\; \max_{k}\, p(k \mid x) > \tau \bigr\}, \tag{4}$$
where $\bar{m}$ is the average margin between the first and the second highest logits over all unlabeled target samples on the pre-trained model, and $\tau$ is a hyper-parameter. We include a detailed analysis of the margin in the supplementary material. The first condition is met when the margin between the largest and the second largest logit is high enough [45]; the second condition is met when the largest class probability is high enough. In this way, the model assigns pseudo-labels to confident samples only and generates a reliable student set (RSS) from which reliable pairs can be taken.
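A minimal PyTorch sketch of the RSS step under the notation above; the function signature is ours, and `m_bar` and `tau` correspond to $\bar{m}$ and $\tau$ in Eq. (4).

```python
import torch

@torch.no_grad()
def build_reliable_student_set(feat_extractor, classifier, x_unlabeled,
                               m_bar, tau):
    """Pseudo-label unlabeled target samples (Eq. 3) and keep only the
    reliable ones (Eq. 4)."""
    logits = classifier(feat_extractor(x_unlabeled))       # (N, K)
    probs = logits.softmax(dim=1)
    conf, pseudo_labels = probs.max(dim=1)                 # Eq. (3)
    top2 = logits.topk(2, dim=1).values                    # two largest logits
    margin = top2[:, 0] - top2[:, 1]                       # top-1 minus top-2
    # a sample is kept if its logit margin OR its confidence is high enough
    keep = (margin > m_bar) | (conf > tau)                 # Eq. (4)
    return x_unlabeled[keep], pseudo_labels[keep]
```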
Table 1: Classification accuracy (%) on the DomainNet dataset under the one-shot and three-shot settings.

Net | Method | R to C | R to P | P to C | C to S | S to P | R to S | P to R | MEAN
 | | 1-shot | 3-shot | 1-shot | 3-shot | 1-shot | 3-shot | 1-shot | 3-shot | 1-shot | 3-shot | 1-shot | 3-shot | 1-shot | 3-shot | 1-shot | 3-shot
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
AlexNet |
S+T | 43.3 | 47.1 | 42.4 | 45.0 | 40.1 | 44.9 | 33.6 | 36.4 | 35.7 | 38.4 | 29.1 | 33.3 | 55.8 | 58.7 | 40.0 | 43.4 |
DANN | 43.3 | 46.1 | 41.6 | 43.8 | 39.1 | 41.0 | 35.9 | 36.5 | 36.9 | 38.9 | 32.5 | 33.4 | 53.6 | 57.3 | 40.4 | 42.4 | |
ADR | 43.1 | 46.2 | 41.4 | 44.4 | 39.3 | 43.6 | 32.8 | 36.4 | 33.1 | 38.9 | 29.1 | 32.4 | 55.9 | 57.3 | 39.2 | 42.7 | |
CDAN | 46.3 | 46.8 | 45.7 | 45.0 | 38.3 | 42.3 | 27.5 | 29.5 | 30.2 | 33.7 | 28.8 | 31.3 | 56.7 | 58.7 | 39.1 | 41.0 | |
ENT | 37.0 | 45.5 | 35.6 | 42.6 | 26.8 | 40.4 | 18.9 | 31.1 | 15.1 | 29.6 | 18.0 | 29.6 | 52.2 | 60.0 | 29.1 | 39.8 | |
MME | 48.9 | 55.6 | 48.0 | 49.0 | 46.7 | 51.7 | 36.3 | 39.4 | 39.4 | 43.0 | 33.3 | 37.9 | 56.8 | 60.7 | 44.2 | 48.2 | |
APE | 47.7 | 54.6 | 49.0 | 50.5 | 46.9 | 52.1 | 38.5 | 42.6 | 38.5 | 42.2 | 33.8 | 38.7 | 57.5 | 61.4 | 44.6 | 48.9 | |
SD w/o AF | 52.3 | 56.2 | 48.7 | 51.2 | 48.0 | 51.3 | 39.2 | 43.5 | 40.6 | 46.5 | 37.4 | 39.8 | 59.5 | 65.1 | 46.5 | 50.5 | |
SD (ours) | 53.5 | 56.5 | 51.8 | 52.2 | 49.1 | 53.9 | 40.1 | 44.4 | 44.9 | 48.7 | 39.9 | 39.2 | 61.7 | 65.4 | 48.7 | 51.5 | |
ResNet34 |
S+T | 55.6 | 60.0 | 60.6 | 62.2 | 56.8 | 59.4 | 50.8 | 55.0 | 56.0 | 59.5 | 46.3 | 50.1 | 71.8 | 73.9 | 56.9 | 60.0 |
DANN | 58.2 | 59.8 | 61.4 | 62.8 | 56.3 | 59.6 | 52.8 | 55.4 | 57.4 | 59.9 | 52.2 | 54.9 | 70.3 | 72.2 | 58.4 | 60.7 | |
ADR | 57.1 | 60.7 | 61.3 | 61.9 | 57.0 | 60.7 | 51.0 | 54.4 | 56.0 | 59.9 | 49.0 | 51.1 | 72.0 | 74.2 | 57.6 | 60.4 | |
CDAN | 65.0 | 69.0 | 64.9 | 67.3 | 63.7 | 68.4 | 53.1 | 57.8 | 63.4 | 65.3 | 54.5 | 59.0 | 73.2 | 78.5 | 62.5 | 66.5 | |
ENT | 65.2 | 71.0 | 65.9 | 69.2 | 65.4 | 71.1 | 54.6 | 60.0 | 59.7 | 62.1 | 52.1 | 61.1 | 75.0 | 78.6 | 62.6 | 67.6 | |
MME | 70.0 | 72.2 | 67.7 | 69.7 | 69.0 | 71.7 | 56.3 | 61.8 | 64.8 | 66.8 | 61.0 | 61.9 | 76.1 | 78.5 | 66.4 | 68.9 | |
APE | 70.4 | 76.6 | 70.8 | 72.1 | 72.9 | 76.7 | 56.7 | 63.1 | 64.5 | 66.1 | 63.0 | 67.8 | 76.6 | 79.4 | 67.6 | 71.7 | |
SD w/o AF | 73.4 | 75.3 | 69.2 | 70.8 | 73.4 | 74.4 | 60.2 | 63.1 | 66.1 | 69.1 | 62.8 | 64.7 | 79.3 | 79.7 | 69.2 | 71.0 | |
SD (ours) | 73.3 | 75.9 | 68.9 | 72.1 | 73.4 | 75.1 | 60.8 | 64.4 | 68.2 | 70.0 | 65.1 | 66.7 | 79.5 | 80.3 | 69.9 | 72.1 |
Sample pairing and assistant generation. After obtaining $\mathcal{T}_{st}$, we construct pairs of a labeled sample from $\mathcal{D}_l$ and a pseudo-labeled sample from $\mathcal{T}_{st}$. We set the labeled sample as a teacher and the unlabeled sample as a student. An assistant feature is generated by an assistant generation (AG) module, which transfers an intermediate style between the teacher and the student; the content of the assistant follows that of the student. To blend the styles of a pair, we construct new style statistics (i.e., mean and standard deviation) by interpolating between the style statistics of the teacher and the student [18, 46]. Let $f_{1:l}$ denote the first $l$ layers of $f$, and let $h = f_{1:l}(x) \in \mathbb{R}^{C \times H \times W}$ be the intermediate feature of a sample $x$. The style-fused feature of a student feature $h_s$ and a teacher feature $h_t$ is calculated as

$$\mathrm{AG}(h_s, h_t) = \sigma_{\mathrm{mix}}\,\frac{h_s - \mu(h_s)}{\sigma(h_s)} + \mu_{\mathrm{mix}}, \quad \text{where} \quad \begin{aligned} \mu_{\mathrm{mix}} &= \lambda\,\mu(h_t) + (1-\lambda)\,\mu(h_s), \\ \sigma_{\mathrm{mix}} &= \lambda\,\sigma(h_t) + (1-\lambda)\,\sigma(h_s), \end{aligned} \tag{5}$$

where the mixing weight $\lambda$ is drawn from the Beta distribution $\mathrm{Beta}(\alpha, \alpha)$ with a hyper-parameter $\alpha$, chosen following [46], and $h_s$ is extracted from the student sample. $\mu(\cdot)$ and $\sigma(\cdot)$ denote the mean and standard deviation of a feature across the spatial dimensions, respectively:

$$\mu(h) = \frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} h_{\cdot ij}, \tag{6}$$

$$\sigma(h) = \sqrt{\frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} \bigl(h_{\cdot ij} - \mu(h)\bigr)^{2}}. \tag{7}$$
The AG operation is applied at several intermediate positions of the neural network; see the supplementary material for more details.
Note that, unlike [46], which directly trains the model with stylized features (i.e., the gradients are back-propagated through the features), the gradients are blocked when we apply the AG module to the intermediate features and generate assistant features.
Let $f_{\mathrm{AG}}$ denote the feature extractor equipped with the AG operation; the assistant feature for a student sample $x^{u}$ paired with a teacher sample $x^{l}$ is then $f_{\mathrm{AG}}(x^{u}; x^{l})$.
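The following is a minimal PyTorch sketch of the AG operation of Eq. (5) on one intermediate feature map; `alpha=0.1` follows the MixStyle default and is an assumption, and the style statistics are detached here (the assistant's prediction is further treated as a constant target in the pair loss, so no gradient flows through the assistant path).

```python
import torch

def assistant_feature(h_student, h_teacher, alpha=0.1, eps=1e-6):
    """Impose an interpolated teacher/student style on the student content
    (cf. Eq. 5). h_*: feature maps of shape (N, C, H, W)."""
    beta = torch.distributions.Beta(alpha, alpha)
    lam = beta.sample((h_student.size(0), 1, 1, 1)).to(h_student.device)

    mu_s = h_student.mean(dim=(2, 3), keepdim=True)                  # Eq. (6)
    sig_s = (h_student.var(dim=(2, 3), keepdim=True) + eps).sqrt()   # Eq. (7)
    mu_t = h_teacher.mean(dim=(2, 3), keepdim=True)
    sig_t = (h_teacher.var(dim=(2, 3), keepdim=True) + eps).sqrt()

    # interpolate the style statistics; detached, unlike MixStyle [46],
    # which back-propagates through the stylized features
    mu_mix = (lam * mu_t + (1 - lam) * mu_s).detach()
    sig_mix = (lam * sig_t + (1 - lam) * sig_s).detach()

    # re-normalize the student content, then apply the mixed style
    return sig_mix * (h_student - mu_s) / sig_s + mu_mix
```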
Self-distillation and training objectives. When implementing a pair loss, we cannot enumerate all pairs from the two sets, since the number of possible pairs is prohibitively large. We thus uniformly sample pairs from the set of all possible pairs; the work of [23] shows that this sampling technique is an unbiased estimator of the true expectation. The pair loss is calculated as

$$\mathcal{L}_{\mathrm{pair}} = \mathbb{E}_{(x^{l}, y^{l}) \in \mathcal{D}_{l},\, (x^{u}, \hat{y}^{u}) \in \mathcal{T}_{st}} \Bigl[ [\![\, y^{l} = \hat{y}^{u} \,]\!]\; \mathrm{KL}\bigl( p_{a}(x^{u}; x^{l}) \,\|\, p(x^{u}) \bigr) \Bigr], \tag{8}$$

where $[\![\cdot]\!]$ denotes the Iverson bracket and $p_{a}(x^{u}; x^{l}) = c(f_{\mathrm{AG}}(x^{u}; x^{l}))$ is the assistant prediction; the teacher's style statistics are calculated when the teacher sample is forwarded through the feature extractor $f$. The pair loss effectively reduces the inter-domain discrepancy using pairs between $\mathcal{D}_s$ and $\mathcal{T}_{st}$, while suppressing the intra-domain discrepancy using pairs between $\mathcal{D}_t$ and $\mathcal{T}_{st}$. This effect is validated in Figure 5.
To improve training, we introduce an additional loss on student samples with pseudo-labels. We utilize the latest prediction confidence $w(x)$ of each student sample as the reliability of its pseudo-label and multiply it to the cross-entropy loss that assumes the pseudo-label as the true label, i.e., we use a weighted cross-entropy loss (WCE) for training student samples:

$$\mathcal{L}_{\mathrm{wce}} = \mathbb{E}_{(x, \hat{y}) \in \mathcal{T}_{st}} \bigl[ -\,w(x) \log p(\hat{y} \mid x) \bigr]. \tag{9}$$

By doing so, $w(x)$ weights the prediction so that less confident samples from $\mathcal{T}_{st}$ have less effect on updating the model.
Our total loss in training thus consists of three terms:

$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{ce}} + \mathcal{L}_{\mathrm{wce}} + \beta\,\mathcal{L}_{\mathrm{pair}}, \tag{10}$$

where $\beta$ is a weighting hyper-parameter for the pair loss. The model is updated by minimizing $\mathcal{L}_{\mathrm{total}}$ for a number of iterations. We alternate the student-set generation step and the training step (sample pairing, assistant generation, and self-distillation) until the model converges on the validation set.
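Below is a minimal PyTorch sketch of the training objectives in Eqs. (8)-(10) for one sampled mini-batch of same-class pairs; the names are ours, the assistant branch is treated as a constant target, and `beta` is the ramp-up weight discussed in the supplementary material.

```python
import torch
import torch.nn.functional as F

def pair_loss(assistant_logits, student_logits, T=1.0):
    """KL divergence between assistant and student predictions (cf. Eq. 8);
    pairs are assumed to share the same (pseudo-)label."""
    p_assist = F.softmax(assistant_logits.detach() / T, dim=1)  # no grad: target
    log_p_stud = F.log_softmax(student_logits / T, dim=1)
    return F.kl_div(log_p_stud, p_assist, reduction="batchmean")

def weighted_ce(student_logits, pseudo_labels):
    """Confidence-weighted cross-entropy on pseudo-labeled students (cf. Eq. 9)."""
    w = student_logits.softmax(dim=1).max(dim=1).values.detach()  # latest confidence
    ce = F.cross_entropy(student_logits, pseudo_labels, reduction="none")
    return (w * ce).mean()

def total_loss(ce_labeled, l_wce, l_pair, beta):
    """Eq. (10): supervised CE + weighted CE + ramped pair loss."""
    return ce_labeled + l_wce + beta * l_pair
```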
Table 2: Classification accuracy (%) on the Office-Home dataset.

Net | Method | R to C | R to P | R to A | P to R | P to C | P to A | A to P | A to C | A to R | C to R | C to A | C to P | MEAN
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
AlexNet |
S+T | 37.5 | 63.1 | 44.8 | 54.3 | 31.7 | 31.5 | 48.8 | 31.1 | 53.3 | 48.5 | 33.9 | 50.8 | 44.1 |
DANN | 42.5 | 64.2 | 45.1 | 56.4 | 36.6 | 32.7 | 43.5 | 34.4 | 51.9 | 51.0 | 33.8 | 49.4 | 45.1 | |
ADR | 37.8 | 63.5 | 45.4 | 53.5 | 32.5 | 32.2 | 49.5 | 31.8 | 53.4 | 49.7 | 34.2 | 50.4 | 44.5 | |
CDAN | 36.1 | 62.3 | 42.2 | 52.7 | 28.0 | 27.8 | 48.7 | 28.0 | 51.3 | 41.0 | 26.8 | 49.9 | 41.2 | |
ENT | 26.8 | 25.8 | 45.8 | 56.3 | 23.5 | 21.9 | 47.4 | 22.1 | 53.4 | 30.8 | 18.1 | 53.6 | 38.8 | |
MME | 42.0 | 69.6 | 48.3 | 58.7 | 37.8 | 34.9 | 52.5 | 36.4 | 57.0 | 54.1 | 39.5 | 59.1 | 49.2 | |
APE | 42.1 | 69.6 | 49.8 | 57.7 | 35.5 | 35.9 | 49.2 | 32.1 | 55.0 | 52.7 | 37.8 | 57.6 | 47.9 | |
SD w/o AF | 45.3 | 69.5 | 48.0 | 58.5 | 34.8 | 34.5 | 55.9 | 34.6 | 57.2 | 56.7 | 37.0 | 60.3 | 49.4 | |
SD (ours) | 43.0 | 70.1 | 48.4 | 60.3 | 35.6 | 35.3 | 56.9 | 35.5 | 56.8 | 55.9 | 37.5 | 59.1 | 49.5 | |
ResNet34 |
S+T | 50.9 | 78.7 | 65.9 | 73.6 | 46.5 | 54.4 | 68.6 | 48.7 | 73.2 | 67.1 | 55.2 | 64.9 | 62.3 |
MME | 60.3 | 82.6 | 71.0 | 79.1 | 57.9 | 63.6 | 74.6 | 59.2 | 77.3 | 73.5 | 64.1 | 75.1 | 69.9 | |
APE | 60.1 | 82.4 | 73.0 | 78.5 | 53.3 | 64.6 | 74.7 | 53.4 | 75.7 | 70.6 | 61.6 | 69.1 | 68.1 | |
SD w/o AF | 61.8 | 82.5 | 70.3 | 78.7 | 54.8 | 62.5 | 75.6 | 58.2 | 78.3 | 74.9 | 65.3 | 77.5 | 70.0 | |
SD (ours) | 63.2 | 82.3 | 71.0 | 79.0 | 56.8 | 64.7 | 75.3 | 59.3 | 77.4 | 73.6 | 64.6 | 76.1 | 70.3 |
4 Experiments
We compare SD with current state-of-the-art methods on two standard SSDA benchmarks and demonstrate that SD remains effective whether zero or many target-domain labels are available. Through extensive ablation studies, we verify the effectiveness of each proposed component in detail. For more experimental setups and results, refer to the supplementary material. Our code is available at https://github.com/userb2020/s3d.
4.1 Setup
Datasets. We evaluate our method using two benchmark datasets: DomainNet [30] and Office-Home [36].
DomainNet contains 6 domains each of which has 345 classes.
Among them, we use 4 domains (Real, Clipart, Painting, and Sketch) and 126 classes.
We use 145,145 images from all 4 domains for our experiment.
We choose seven source-to-target domain scenarios following the work of [32].
Office-Home consists of 4 domains (Real, Clipart, Product, and Art) of 65 classes, and 15,588 images in total.
We conduct Office-Home experiments on all possible source-to-target domain scenarios.
Implementation details. We follow most of the implementation details of [32] for a fair comparison. We select AlexNet [21] and ResNet-34 [15], both of which are pre-trained on ImageNet [8]. A mini-batch consists of teacher and student samples at a ratio of 1 to 1, and we choose the same number of source and labeled target data to construct the teacher samples in the batch. Specifically, we use 128 and 96 samples for AlexNet and ResNet-34, respectively, as done in MME. We adopt the SGD optimizer with an initial learning rate of 0.01, a momentum of 0.9, and a weight decay of 0.0005. In the reliable student-set generation step, we set the confidence threshold $\tau$ as analyzed in the supplementary material, and we set the student-set generation interval to 100 iterations. We validate the model every 1,000 iterations during training and early-stop training when the model shows no more improvement within 5 validation steps. The details of searching the hyper-parameters are described in the supplementary material. We use PyTorch [29] for our experiments.

Baselines. We compare our method to competitive SSDA baselines: MME [32] and APE [20]. Additionally, we include S+T, which simply minimizes the cross-entropy loss on the labeled datasets. SD w/o AF is our method without the assistant; it directly distills the teacher prediction to the student. DANN [12], ADR [33], and CDAN [26], well-known UDA methods, are also included for comparison. Further, we include the accuracy of ENT [14].
4.2 Main results
We conduct experiments on both one-shot and three-shot settings with AlexNet and ResNet-34.
For a fair comparison, we select the best model on the validation set for all experiments.
Comparison on DomainNet.
Table 1 compares the classification accuracy of our method and other baselines on the DomainNet dataset.
SD achieves higher accuracy than the previous methods in most domain adaptation scenarios.
Notably, SD is effective where the domain gap between the source and target domain is substantial.
In comparison to S+T, for example, SD increases accuracy by 17.7%p on Real to Clipart with ResNet in the one-shot setting.
The Real and Clipart domains appear considerably distinct from each other because samples in the Real domain are photos, whereas samples in the Clipart domain are artificial illustrations.
Comparing the accuracy of SD and SD w/o AF suggests that the mixed-style assistant is effective.
Table 3: Classification accuracy (%) for unsupervised domain adaptation on DomainNet.

Method | R-C | R-P | P-C | C-S | S-P | R-S | P-R | MEAN
---|---|---|---|---|---|---|---|---
AlexNet | ||||||||
Source | 41.1 | 42.6 | 37.4 | 30.6 | 30.0 | 26.3 | 52.3 | 37.2 |
DANN | 44.7 | 36.1 | 35.8 | 33.8 | 35.9 | 27.6 | 49.3 | 37.6 |
ADR | 40.2 | 40.1 | 36.7 | 29.9 | 30.6 | 25.9 | 51.5 | 36.4 |
CDAN | 44.2 | 39.1 | 37.8 | 26.2 | 24.8 | 24.3 | 54.6 | 35.9 |
ENT | 33.8 | 43.0 | 23.0 | 22.9 | 13.9 | 12.0 | 51.2 | 28.5 |
MME | 47.6 | 44.7 | 39.9 | 34.0 | 33.0 | 29.0 | 53.5 | 40.2 |
APE | 45.9 | 47.0 | 42.0 | 36.5 | 37.0 | 30.3 | 54.1 | 41.8 |
SD w/o AF | 49.3 | 49.2 | 42.7 | 38.1 | 41.7 | 38.0 | 54.1 | 44.7 |
SD (ours) | 53.4 | 51.9 | 46.3 | 38.7 | 44.0 | 36.4 | 57.6 | 46.9 |
ResNet34 | ||||||||
Source | 54.5 | 60.2 | 55.9 | 49.7 | 50.1 | 44.1 | 72.1 | 55.2 |
MME | 67.6 | 66.9 | 67.1 | 56.4 | 62.9 | 58.2 | 74.5 | 64.8 |
APE | 65.4 | 68.6 | 63.8 | 56.4 | 65.1 | 60.4 | 75.3 | 65.0 |
SD w/o AF | 70.1 | 69.5 | 66.8 | 56.3 | 61.8 | 59.0 | 73.4 | 65.3 |
SD (ours) | 72.7 | 70.2 | 66.5 | 57.2 | 63.8 | 62.6 | 71.2 | 66.3 |
Comparison on Office-Home. Table 2 compares the results of our method and others on the Office-Home dataset. Our method shows results comparable to the other baselines. As on DomainNet, SD is powerful in adapting to a quite different domain, for example, in the Real to Clipart and Art to Product scenarios. However, the accuracy improvement of SD on Office-Home is not as high as that on DomainNet. We suppose that the small size of Office-Home yields less performance improvement than on DomainNet; note that DomainNet is ten times larger than Office-Home. Considering that student samples acquire diverse guidance from the source and labeled target datasets under our method, a larger dataset is more advantageous for SD to give rich guidance to unlabeled targets.
4.3 Varying the number of target labels
Many-shot semi-supervised domain adaptation. We examine our method with increasing target labels per class.
Figure 4 reports the many-shot experiment results.
SD consistently outperforms the current state of the art as the number of target-domain labels increases.
Note that the S+T baseline corresponds to the pre-training stage in our context.
This experiment emphasizes the strength of the sample-to-sample training even when abundant target labels are given in the pre-training stage.
In the 20-shot case, for example, the pre-training is supervised by 2,520 target domain examples, and yet the following sample-to-sample training stage gains additional accuracy improvement of 8.5%p from the pre-trained model.
Unsupervised domain adaptation. Table 3 shows that SD is also effective when no target ground truth is available. For the UDA experiments, we compose training batches from the source dataset and the unlabeled target dataset, and set validation batches identically to the semi-supervised setup. SD outperforms its counterparts in most scenarios. Remarkably, SD surpasses methods proposed specifically for UDA [12, 33, 26], each of which involves a domain-adversarial learning objective.
4.4 Ablation study
Ablation study on proposed components. We conduct extensive ablation studies on our major contributions, the pair loss $\mathcal{L}_{\mathrm{pair}}$, the weighted cross-entropy loss $\mathcal{L}_{\mathrm{wce}}$, and the reliable student-set generation (RSS), in Table 4. The top row of Table 4 represents the pre-training stage, which is equivalent to the S+T baseline; it trains the network only with labeled samples, without any proposed components. A check mark in the RSS column indicates that we filter out unreliable pseudo-labeled samples using Eq. (4); otherwise, we use all pseudo-labeled samples in training.
As can be seen in Table 4, the pair loss $\mathcal{L}_{\mathrm{pair}}$ contributes to the performance improvement.
Using the pair loss and the weighted cross-entropy together increases the accuracy by a large margin over the top-row baseline in all scenarios.
Additionally, RSS further improves performance by preventing harmful alignment from mismatched pairs.
The bottom row completes our method, and it validates that all components are complementary.
Ablation study on evaluating reliability of pseudo-labels. We find that discarding unreliable pseudo-labels is crucial when training with pseudo-labels.
We compare our RSS with other pseudo-label reliability evaluation schemes.
In Figure 5(a), without RSS indicates that we put all pseudo-labeled samples into the student set regardless of their reliability, and CAG [45] indicates that we measure the reliability using only the first condition of Eq. (4).
While CAG [45] searches for an optimal margin, our model sets the margin to the average value $\bar{m}$ over all unlabeled target samples, thus eliminating a hyper-parameter.
Our RSS effectively constructs reliable pairs and thus improves performance over the two baselines.
Compared to the performance gain from the weighted cross-entropy, which is reported in the following paragraph with Figure 5(b), the gain from reliable pairing is clearer.
This comparison shows again that pair-based learning brings the major performance improvement in our pipeline.
Ablation study on weighted cross-entropy. In Figure 5(b), we verify the effect of the confidence score $w$ in the weighted cross-entropy of Eq. (9).
Here, without $\mathcal{L}_{\mathrm{wce}}$ indicates that we exclude $\mathcal{L}_{\mathrm{wce}}$ from the overall loss, and without $w$ indicates that we eliminate the weight $w$ in $\mathcal{L}_{\mathrm{wce}}$.
The results show that multiplying $w$ to the cross-entropy term achieves additional performance gains.
Table 4: Ablation study of the proposed components on DomainNet with ResNet-34 in the one-shot setting.

$\mathcal{L}_{\mathrm{pair}}$ | $\mathcal{L}_{\mathrm{wce}}$ | RSS | R-C | R-P | P-C | C-S | S-P | R-S | P-R | MEAN
---|---|---|---|---|---|---|---|---|---|---
✗ | ✗ | ✗ | 56.8 | 60.5 | 55.4 | 51.7 | 55.5 | 47.5 | 72.0 | 57.1 |
✓ | ✗ | ✗ | 68.7 | 65.6 | 68.8 | 59.2 | 64.1 | 61.6 | 78.4 | 66.6 |
✗ | ✓ | ✗ | 67.4 | 65.0 | 67.1 | 61.2 | 64.9 | 62.7 | 77.5 | 66.5 |
✓ | ✓ | ✗ | 69.4 | 65.7 | 69.7 | 61.3 | 65.5 | 61.7 | 78.6 | 67.4 |
✓ | ✓ | ✓ | 73.3 | 68.9 | 73.4 | 60.8 | 68.2 | 65.1 | 79.5 | 69.9 |
4.5 Analysis
Inter-domain and intra-domain discrepancies.
Figure 5 visualizes that SD progressively clusters instances of the same classes by overcoming inter- and intra-domain discrepancies.
Figure 4(a) plots a histogram of cosine similarity between a source and a target embedding from the same class for all classes.
Figure 4(c) plots the one between labeled and unlabeled target embeddings.
Figures 4(b) and 4(d) visualize cosine similarity histograms from the final model of APE [20] and SD.
The similarity population gradually moves toward 1.0 over iterations, which shows that our method maps two semantically similar samples to nearby points in the embedding space.
While the majority of same-class embeddings move close to each other, we observe that a small portion of embeddings is pushed apart.
This is considered one limitation of leveraging pseudo-labels; wrong pairs misguide the learning process.
Embedding space visualization. Figure 7 visualizes how SD clusters instances from the two domains over iterations using t-SNE [27]. We observe that SD clearly enhances the embedding quality over the pre-training stage. We include more qualitative results in the supplementary material.
5 Conclusion
We propose a novel sample-to-sample self-distillation (SD) method that exploits rich and diverse sample relations for semi-supervised domain adaptation.
By exploiting an assistant feature with a mixed style, SD efficiently reduces the domain shift.
The experiments demonstrate that SD effectively adapts to a target domain using a single architecture given extremely few labeled target-domain samples.
Acknowledgements.
This work was supported by Samsung Electronics Co., Ltd. (IO201208-07822-01) and the IITP grants (No.2019-0-01906, AI Graduate School Program - POSTECH) (No.2021-0-00537, Visual common sense through self-supervised learning for restoration of invisible parts in images) funded by Ministry of Science and ICT, Korea.
References
- [1] (2017) Fast generalized distillation for semi-supervised domain adaptation. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, pp. 1719–1725.
- [2] (1996) Born again trees. University of California, Berkeley, Technical Report 1, pp. 2.
- [3] (2006) Model compression. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '06), pp. 535–541.
- [4] (2019) Domain-specific batch normalization for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7354–7362.
- [5] (2019) A closer look at few-shot classification. In International Conference on Learning Representations.
- [6] (2020) Explaining knowledge distillation by quantifying the knowledge. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12925–12935.
- [7] (2016) The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3213–3223.
- [8] (2009) ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255.
- [9] (2019) Cluster alignment with a teacher for unsupervised domain adaptation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 9944–9953.
- [10] (2013) Semi-supervised domain adaptation with instance constraints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 668–675.
- [11] (2018) Born again neural networks. arXiv preprint arXiv:1805.04770.
- [12] (2016) Domain-adversarial training of neural networks. The Journal of Machine Learning Research 17(1), pp. 2096–2030.
- [13] (2019) DLOW: domain flow for adaptation and generalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2477–2486.
- [14] (2005) Semi-supervised learning by entropy minimization. In Advances in Neural Information Processing Systems, pp. 529–536.
- [15] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
- [16] (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
- [17] (2018) CyCADA: cycle-consistent adversarial domain adaptation. In International Conference on Machine Learning, pp. 1989–1998.
- [18] (2017) Arbitrary style transfer in real-time with adaptive instance normalization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1501–1510.
- [19] (2020) Bidirectional adversarial training for semi-supervised domain adaptation. In Twenty-Ninth International Joint Conference on Artificial Intelligence.
- [20] (2020) Attract, perturb, and explore: learning a feature alignment network for semi-supervised domain adaptation. In European Conference on Computer Vision.
- [21] (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105.
- [22] (2013) Pseudo-label: the simple and efficient semi-supervised learning method for deep neural networks. In Workshop on Challenges in Representation Learning, ICML, Vol. 3.
- [23] (1990) U-statistics: theory and practice. Marcel Dekker, Inc., New York.
- [24] (2019) Drop to adapt: learning discriminative features for unsupervised domain adaptation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 91–100.
- [25] (2020) Online meta-learning for multi-source and semi-supervised domain adaptation. arXiv preprint arXiv:2004.04398.
- [26] (2018) Conditional adversarial domain adaptation. In Advances in Neural Information Processing Systems, pp. 1640–1650.
- [27] (2008) Visualizing data using t-SNE. Journal of Machine Learning Research 9(Nov), pp. 2579–2605.
- [28] (2019) Relational knowledge distillation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3967–3976.
- [29] (2017) Automatic differentiation in PyTorch.
- [30] (2019) Moment matching for multi-source domain adaptation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1406–1415.
- [31] (2014) FitNets: hints for thin deep nets. arXiv preprint arXiv:1412.6550.
- [32] (2019) Semi-supervised domain adaptation via minimax entropy. In Proceedings of the IEEE International Conference on Computer Vision, pp. 8050–8058.
- [33] (2017) Adversarial dropout regularization. arXiv preprint arXiv:1711.01575.
- [34] (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
- [35] (2016) Instance normalization: the missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022.
- [36] (2017) Deep hashing network for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5018–5027.
- [37] (2018) CosFace: large margin cosine loss for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5265–5274.
- [38] (2020) Self-training with noisy student improves ImageNet classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10687–10698.
- [39] (2018) Learning semantic representations for unsupervised domain adaptation. In International Conference on Machine Learning, pp. 5423–5432.
- [40] (2019) Adversarial domain adaptation with domain mixup. arXiv preprint arXiv:1912.01805.
- [41] (2015) Semi-supervised domain adaptation with subspace learning for visual recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2142–2150.
- [42] (2020) Revisiting knowledge distillation via label smoothing regularization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3903–3911.
- [43] (2020) Regularizing class-wise predictions via self-knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
- [44] (2017) Paying more attention to attention: improving the performance of convolutional neural networks via attention transfer. In ICLR.
- [45] (2019) Category anchor-guided unsupervised domain adaptation for semantic segmentation. In Advances in Neural Information Processing Systems, pp. 435–445.
- [46] (2021) Domain generalization with MixStyle. In International Conference on Learning Representations.
A Supplemental material
In this supplementary material, we provide our method’s additional details, analyses, and experimental results.
a.1 Implementation details
The effectiveness of SD is shown in different experimental settings in the main paper. In this section, we provide our experimental details.
Some results in Tables 1, 2, and 3 of the main paper are borrowed from the MME paper [32]: the results of S+T, DANN, ADR, CDAN, ENT, and MME in Table 1, their results with the AlexNet base network in Table 2, and their accuracies for unsupervised domain adaptation in Table 3.
Datasets.
In Figure 9, we visualize examples from the DomainNet and Office-Home datasets. In both datasets, all four domains are distinct from each other, while the Real and Product domains in Office-Home are quite similar.
Baselines.
For a fair comparison, we reproduce S+T, MME [32], and APE [20] whenever the accuracy is not stated in their papers. To reproduce MME, we follow the official implementation (https://github.com/VisionLearningGroup/SSDA_MME) and set the trade-off hyper-parameter of MME to 0.1. For APE, we follow the official implementation (https://github.com/TKKim93/APE) and set its three loss weights to 10, 1, and 10, respectively.
Many-shot semi-supervised domain adaptation experiments.
Unsupervised domain adaptation experiments.
Ablation studies.
Inter-domain and intra-domain discrepancy histograms.
We use ResNet for Office-Home [36] in the one-shot setting to plot the inter-domain and intra-domain discrepancy histograms shown in Figure 5. We choose the Clipart domain as the source and the Product domain as the target. For Figures 4(a) and 4(c), we plot histograms of cosine similarity after the pre-training stage, specifically from 10,000 iterations. We plot the histogram every 3,000 iterations until the model converges. For APE, we plot the converged model for comparison.
The balancing hyper-parameter $\beta$.
We use $\beta$ to balance $\mathcal{L}_{\mathrm{pair}}$ in the overall loss and to make up for the incompleteness of pseudo-labels. We set the hyper-parameter using a ramp-up function as in [12]:

$$\beta = \frac{2}{1 + \exp(-\gamma\, p)} - 1, \tag{11}$$

where the training progress $p \in [0, 1]$ increases over iterations. The increasing $p$ makes $\beta$ increase so that $\mathcal{L}_{\mathrm{pair}}$ influences the learning process more. This incremental weighting technique is adequate since pseudo-labels are likely to be incorrect at the beginning of the sample-to-sample training stage.
To find a proper ramp-up function, we vary $\gamma$ to examine the effects of weighting $\mathcal{L}_{\mathrm{pair}}$; changing $\gamma$ controls the slope of the ramp-up function. We choose the Real to Sketch one-shot scenario and select $\gamma$ at the best validation accuracy. In the DomainNet and Office-Home experiments, we set $\gamma$ to 8 for both AlexNet and ResNet. Figure 8 shows the accuracy of our model when varying $\gamma$.
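A one-line sketch of this schedule (the closed form above follows the ramp-up of [12]; the function name is ours):

```python
import math

def ramp_up_beta(iteration: int, max_iteration: int, gamma: float = 8.0) -> float:
    """DANN-style ramp-up: beta grows from 0 toward 1 as training
    progresses; gamma controls the slope (we use gamma = 8)."""
    p = iteration / max_iteration        # training progress in [0, 1]
    return 2.0 / (1.0 + math.exp(-gamma * p)) - 1.0
```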
The class logit margin $m$.
In Eq. (4), the margin $m$ is used to filter out unreliable target samples. Here, $m$ determines a trade-off between the number of pseudo-labels used and how reliable those pseudo-labels are. A small $m$ makes the pseudo-labels of the student set inaccurate; on the contrary, a large $m$ makes RSS filter out many unlabeled target samples so that few student samples are used for training. Therefore, a proper $m$ is critical for the student set to contain precise and varied student samples.
We investigate whether the average margin $\bar{m}$ over all unlabeled target samples is appropriate for $m$.
Figure 10 shows this experiment, conducted on the DomainNet Real to Clipart scenario using ResNet-34.
We compare the result of the average margin to those of fixed margins from 1 to 4.
$\bar{m}$ is initially calculated from the pre-trained model and is fixed afterward.
In Figure 10, the model calculates $\bar{m}$ as 2.88 and 3.46 in the one-shot and three-shot settings, respectively.
The model shows the best accuracy when the student set is generated using $m = \bar{m}$.
This result shows that $\bar{m}$ is an appropriate margin for RSS to make the student set abundant and precise.
Table 5: Accuracy when varying both the margin $m$ and the threshold $\tau$ (the three SD rows use different settings of $\tau$).

Method | $m$ = 1 | $m$ = 2 | $m$ = 3 | $m$ = 4 | $m = \bar{m}$
---|---|---|---|---|---
SD ($\tau$) | 72.6 | 72.9 | 73.1 | 73.5 | 73.4
SD ($\tau$) | 72.5 | 74.4 | 75.0 | 75.9 | 75.7
SD ($\tau$) | 72.4 | 74.1 | 75.7 | 75.9 | 75.8
SD w/o $\tau$ | 72.2 | 72.9 | 75.2 | 75.6 | 75.5
Performance varying both $m$ and $\tau$.
Unlike [45], we preset the margin by averaging the margins of all unlabeled target samples so that we obtain a target-adaptive $m$. This is because SD deals with various target domains, different from [45], which considers only Cityscapes [7] as a target dataset. Also, the threshold $\tau$ is designed to avoid the situation where CAG [45] might exclude a sample with high confidence because of its low margin. To search for the best values of $m$ and $\tau$, we jointly vary both; the details are the same as the setting of Figure 9(b). The results are shown in Table 5 and indicate that SD is not sensitive to the margin parameter $m$ or the threshold $\tau$.
The locations to apply AG.
To generate assistant features, the AG operation can be applied at different layers. In AlexNet, the operation can be placed at any of the yellow blocks shown in Figure 11(a). In ResNet-34, the operation can be located at the end of each of the four residual stages shown in Figure 11(b). We conduct experiments on various combinations of locations in Table 6. We conjecture that the optimal combination differs between ResNet and AlexNet: as the overall accuracy of ResNet is higher than that of AlexNet, the pairs from ResNet are more reliable, and thus the semantic guidance from teachers is more effective for students in ResNet than in AlexNet. We select all locations for ResNet in both DomainNet and Office-Home. For AlexNet, we select AG 1 and AG 1,2 in DomainNet and Office-Home, respectively.
Table 6: Accuracy with different combinations of AG locations.

Network | AG 1 | AG 1,2 | AG 1,2,3 | AG 1,2,3,4
---|---|---|---|---
AlexNet | 39.9 | 38.5 | 31.8 | 32.6 |
ResNet34 | 60.5 | 61.2 | 62.5 | 65.1 |
Table 7: Mean classification accuracy of MixStyle and SD variants on DomainNet and Office-Home.

Net | Method | DomainNet | Office-Home
---|---|---|---
AlexNet | S+T | 40.0 | 44.1 |
MixStyle | 39.3 | 43.5 | |
SD ($\lambda$ = 1) | 46.7 | 46.8 |
SD (ours) | 48.7 | 49.5 | |
ResNet34 |
S+T | 56.9 | 62.3 |
MixStyle | 66.6 | 60.2 | |
SD ($\lambda$ = 1) | 69.1 | 69.8 |
SD (ours) | 69.9 | 70.3 |
Table 8: Average accuracy over three independent runs.

Method | 1-shot | 3-shot
---|---|---
CDAN | 62.9 | 65.3 |
ENT | 59.5 | 63.6 |
MME | 64.3 | 66.8 |
APE | 65.2 | 67.3 |
SD | 67.7 | 69.7 |
Table 9: Comprehensive ablation study on DomainNet with ResNet-34 in the one-shot setting, compared with previous methods.

Method | $\mathcal{L}_{\mathrm{pair}}$ | $\mathcal{L}_{\mathrm{wce}}$ | RSS | R to C | R to P | P to C | C to S | S to P | R to S | P to R | MEAN
---|---|---|---|---|---|---|---|---|---|---|---
DANN | | | | 58.2 | 61.4 | 56.3 | 52.8 | 57.4 | 52.2 | 70.3 | 58.4
MME | | | | 70.0 | 67.7 | 69.0 | 56.3 | 64.8 | 61.0 | 76.1 | 66.4
APE | | | | 70.4 | 70.8 | 72.9 | 56.7 | 64.5 | 63.0 | 76.6 | 67.6
SD | ✗ | ✗ | ✗ | 56.8 | 60.5 | 55.4 | 51.7 | 55.5 | 47.5 | 72.0 | 57.1
SD | ✓ | ✗ | ✗ | 68.7 | 65.6 | 68.8 | 59.2 | 64.1 | 61.6 | 78.4 | 66.6
SD | ✓ | ✗ | ✓ | 71.6 | 69.1 | 70.7 | 58.7 | 65.4 | 62.0 | 79.6 | 68.2
SD | ✗ | ✓ | ✗ | 67.4 | 65.0 | 67.1 | 61.2 | 64.9 | 62.7 | 77.5 | 66.5
SD | ✗ | ✓ | ✓ | 73.1 | 67.1 | 70.6 | 57.7 | 65.8 | 62.4 | 73.6 | 67.2
SD | ✓ | ✓ | ✗ | 69.4 | 65.7 | 69.7 | 61.3 | 65.5 | 61.7 | 78.6 | 67.4
SD | ✓ | ✓ | ✓ | 73.3 | 68.9 | 73.4 | 60.8 | 68.2 | 65.1 | 79.5 | 69.9
a.2 Additional experimental results
Comparison with MixStyle [46].
The main difference between SD and MixStyle is that we introduce the assistant (intermediate style feature) as a guidance for the student. The assistant is designed to transfer its knowledge to the student using knowledge distillation; for this reason, we do not back-propagate gradients through the path of assistant features (see the second dotted branch in Figure 3). This strategy has not been explored before. MixStyle, which is introduced for domain generalization, directly trains the model with stylized features; the features’ predictions and given labels are used for calculating the cross-entropy loss, and the gradients are back-propagated through the features. This scheme is not adequate for SSDA for the reason that the goal of SSDA is to adapt the learner to the target domain. For comparison, we conduct experiments where we directly apply the scheme of [46] to SSDA; we only change the loss to the cross-entropy loss between assistants’ predictions and pseudo-labels. We also searched the best hyper-parameters for this model as we did for SD. We set to 9 for all experiments. For ResNet, we select AG 1,2,3 for DomainNet and AG 1,2,3,4 for Office-Home. For AlexNet, we select AG 1 for DomainNet and AG 1,2 for Office-Home. In Table 7, it is obvious that MixStyle is not effective except for the experiment of ResNet in DomainNet. The model even shows low accuracy than the simple baseline S+T in several experiments.
The effect of intermediate styles.
In Table 7, we examine the effectiveness of transferring an intermediate style rather than the teacher's own style. In Eq. (5), by controlling the value of $\lambda$, we can manipulate the style of the assistant feature: as $\lambda$ approaches 1, the style of the assistant approaches the teacher's. SD ($\lambda$ = 1) is the experiment in which the assistant feature follows only the style of the teacher. Comparing SD ($\lambda$ = 1) and SD, the results show that intermediate styles are more effective than the teacher's style for reducing the domain discrepancy.
Multiple runs.
For a fair comparison, we report the average accuracy and its standard deviation over three independent runs in Table 8. Our method deviates less than most previous methods do, showing that it is adequately stable and effective.
Many-shot experiments.
Extra t-SNE visualization.
Figure 12 visualizes how SD embeds instances from the two domains over iterations. The embeddings are obtained using ResNet-34 on examples of the Office-Home dataset in the one-shot setting, and we visualize the first 30 classes for simplicity. We adopt t-SNE [27] with a perplexity of 30.0 and 1,000 iterations. We observe that the sample-to-sample self-distillation stage clearly enhances the embedding quality over the pre-training stage. Two main observations are: (1) target samples gradually align with source samples over iterations, and (2) samples of the same class pull toward each other over iterations.
Comprehensive ablation study.
We evaluate different combinations of the proposed components in Table 9 and compare them with the previous work of [32, 12, 20]. The performance consistently increases as more components are used, indicating that each proposed component is effective for SSDA. Note that our method with all components sets a new state of the art, outperforming APE [20].