1 Introduction
Supervised learning is the most widely used approach for dealing with classification tasks. This paradigm relies on a sufficiently representative training set to learn a classification model. This level of representativeness is usually defined by two criteria: on the one hand, the training samples must be varied, which allows the algorithm to generalize instead of memorizing; on the other hand, the trained model is assumed to be applied to samples that come from the same distribution as those of the training set Duda et al. (2001).
Building a training set fulfilling these conditions is not always straightforward. Although obtaining samples might be easy, assigning their correct labels is costly. This is why there are efforts to alleviate the aforementioned requirements. However, while the conflict between memorization and generalization has been well studied, and there exist established mechanisms to deal with it such as regularization or data augmentation Goodfellow et al. (2016), learning a model that is able to correctly classify samples from a different target distribution remains open to further research. This problem is generally called transfer learning (TL) Shao et al. (2014), and when the classification labels do not vary in the target distribution it is usually referred to as domain adaptation (DA) Wang & Deng (2018).
Within the context of supervised learning, deep learning represents an important breakthrough LeCun et al. (2015). This term refers to the latest generation of artificial neural networks, for which novel mechanisms have been developed that allow training deeper networks, i.e., with many layers. These deep neural networks represent the state of the art in many classification tasks, and have managed to break the existing glass ceiling in many traditionally complex tasks. In turn, deep learning often requires a large amount of data, which makes the study of DA even more interesting.
As we will review in the next section, there are several alternatives to attempt DA, both general strategies and strategies based on deep neural networks. In this work we take a different avenue and study an incremental approach. We propose to use an existing DA algorithm for deep learning to classify those samples of the target domain for which the model is confident. Taking the assigned labels as ground truth, the model is retrained. This added knowledge allows the network to refine its behavior to correctly classify other samples of the target set. This incremental process is repeated until the entire target set is completely annotated. We will show that this incremental approach achieves noticeable improvements with respect to the underlying DA algorithm. In addition, it is competitive on different benchmarks compared to other state-of-the-art DA algorithms.
The rest of the paper is structured as follows: we outline in Section 2 the existing literature about DA, with special emphasis on work based on deep neural networks; we present in Section 3 the proposed incremental methodology, as well as the underlying DA model that we consider in this work; we describe our experimental setting in Section 4, while the results are reported in Section 5; finally, the work is concluded in Section 6.
2 Background
Since the beginning of machine learning research, there exists the idea of exploiting a model beyond its use over unknown samples of the source distribution. In the literature we can find two main topics that pursue this objective: the aforementioned TL and DA strategies.
In TL, some knowledge of the model is used to solve a different classification task. For example, a pretrained deep neural network (DNN) model can be used as initialization Simonyan & Zisserman (2015); He et al. (2016) or its feature extraction process can be considered as the basis of another classification model Yosinski et al. (2014). As a special case of TL, the DA challenge typically assumes that the classification task of the target distribution is the same (i.e., the set of labels is equal). In this work we focus on the latter case.
In a DA scenario, we can also distinguish between semi-supervised and unsupervised approaches. While semi-supervised DA considers that some labeled samples of the target distribution are available Cheng & Pan (2014); Yao et al. (2015); Saito et al. (2019), unsupervised DA works with just unlabeled samples Kouw & Loog (2019). In this section we review unsupervised DA techniques, since the proposed approach belongs to this category.
Performing unsupervised DA is still considered an open problem from both theoretical and practical perspectives Bousmalis et al. (2017). Most approaches consider that the key is to build a good feature representation that becomes invariant to the domain Ben-David et al. (2007). A good example is the Domain Adaptation Neural Network (DANN) proposed by Ganin et al. (2016), which simultaneously learns domain-invariant features from both source and target data and discriminative features from the source domain. Following this line of research, many approaches have been proposed more recently. Virtual Adversarial Domain Adaptation (VADA), proposed by Shu et al. (2018), added a penalty term to the loss function to penalize class boundaries that cross high-density feature regions. The Deep Reconstruction-Classification Network (DRCN) Ghifary et al. (2016) consists of a neural network that forces a common representation of both the source and target domains by sample reconstruction, while learning the classification task from the source samples. The Domain Separation Networks (DSN) proposed by Bousmalis et al. (2016) are trained to map input representations onto both a domain-specific subspace and a domain-independent subspace, in order to improve the way that the domain-invariant features are learned. Haeusser et al. (2017) proposed Associative Domain Adaptation (ADA), which is another domain-invariant feature learning approach that reinforces associations between source and target representations in an embedding space with neural networks. The Adversarial Discriminative Domain Adaptation (ADDA) strategy Tzeng et al. (2017) follows the idea of Generative Adversarial Networks, along with discriminative modeling and untied weight sharing, to learn domain-invariant features while keeping a useful representation for the discriminative task. Drop to Adapt (DTA) Lee et al. (2019) makes use of adversarial dropout to enforce discriminative domain-invariant features. Damodaran et al. (2018) proposed the Deep Joint Distribution Optimal Transport (DeepJDOT) approach, which learns both the classifier and aligned data representations between the source and target domains within a single neural framework, with a loss function based on Optimal Transport theory Villani (2009).
A different strategy for DA consists in learning how to transform features from one domain to another. Following this idea, the Subspace Alignment (SA) method Fernando et al. (2013) seeks to represent the source and target domains using subspaces modelled by eigenvectors. Then, it solves an optimization problem to align the source subspace with the target one. Also, Sun and Saenko proposed the Deep Correlation Alignment (D-CORAL) approach Sun & Saenko (2016), which consists of a neural network that learns a nonlinear transformation to align correlations of layer activations from the source and target distributions.
While the methods outlined above seek new ways to achieve the desired characteristics of a proper DA method, our proposed approach takes a different avenue. Specifically, we build upon the existing DANN approach, and we propose novel ways to improve its ability to adapt to the target domain by performing the adaptation incrementally.
3 Methodology
3.1 Preliminaries
Let $\mathcal{X}$ be the input space and $\mathcal{Y}$ be the output or label space. A classification task assumes that there exists a function $f: \mathcal{X} \rightarrow \mathcal{Y}$ that assigns a label to each possible sample of the input space. For supervised learning, the goal is to learn a hypothesis function $h$ that models the unknown function $f$ with the least possible error. We refer to $h$ as the label classifier. Quite often, the approach is to estimate a posterior probability $P(y \mid x)$ so that the label classifier follows a maximum a posteriori decision such that $h(x) = \arg\max_{y \in \mathcal{Y}} P(y \mid x)$. This is the case with neural networks.
In the DA scenario, there exist two distributions over $\mathcal{X} \times \mathcal{Y}$: $\mathcal{D}_S$ and $\mathcal{D}_T$, which are referred to as source domain and target domain, respectively. We focus on the case of unsupervised domain adaptation, for which DA is only provided with a labeled source set $S$ and a completely unlabeled target set $T$.
The goal of a DA algorithm is to build a label classifier for $\mathcal{D}_T$ by using the information provided in both $S$ and $T$.
3.2 Domain Adaptation Neural Network
Given its importance in the context of our work, we further describe here the operation of DANN, which will be considered as the backbone for our incremental approach.
DANN is based on the theory of learning from different domains discussed by Ben-David et al. (2006, 2010). This theory suggests that transferring the knowledge gained from one domain to another must be based on learning features that do not allow discriminating between the two domains (source and target) of the samples to be classified. To this end, DANN learns a classification model from features that do not encode information about the domain of the sample to be classified, thus generalizing the knowledge from a labeled source domain to an unlabeled target domain.
More specifically, the proposed neural architecture includes a feature extractor module ($G_f$) and a label classifier ($G_y$), which together build a standard feed-forward neural network that can be trained to classify an input sample $x$ into one of the possible categories of the output space $\mathcal{Y}$. The last layer of the label classifier $G_y$ uses a “softmax” activation, which models the posterior probability of a given input $x$.
DANN adds a new domain classifier module ($G_d$) to the neural network, which classifies the domain to which the input sample belongs. This classifier is built as a binary logistic regressor that models the probability that an input sample comes from the source distribution ($d_i = 0$ if $x_i \sim \mathcal{D}_S$) or the target distribution ($d_i = 1$ if $x_i \sim \mathcal{D}_T$), where $d_i$ denotes a binary variable that indicates the domain of the sample.
The unsupervised adaptation to a target domain is achieved as follows: the domain classifier $G_d$ is connected to the feature extractor $G_f$ (which is shared with the label classifier $G_y$) through the so-called gradient reversal layer (GRL). This layer does nothing at prediction. However, while learning through backpropagation, it multiplies the gradient by a certain negative constant ($-\lambda$). In other words, the GRL receives the gradient from the subsequent layer and multiplies it by $-\lambda$, therefore changing its sign before passing it to the preceding layer. The idea of this operation is to force the feature extractor $G_f$ to learn generic features that do not allow discriminating the domain. In addition, since this training is carried out simultaneously with the training of the label classifier $G_y$, the features must be adequate for discriminating the categories to classify, yet unbiased with respect to the input domain. According to the DA theory, this should cause $G_y$ to be able to correctly classify input samples regardless of their domain, given that the features from $G_f$ are forced to be invariant.
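The behavior of the gradient reversal layer described above can be illustrated with a minimal sketch in plain Python. The class name `GradientReversal`, its `forward`/`backward` methods, and the parameter `lam` (standing for the negative-constant magnitude) are hypothetical names chosen for illustration; a real implementation would hook into the automatic differentiation of a deep learning framework instead.

```python
class GradientReversal:
    """Illustrative GRL: identity when predicting, sign-flipped scaling when learning."""

    def __init__(self, lam):
        # lam is the (assumed) positive constant; the gradient is scaled by -lam.
        self.lam = lam

    def forward(self, features):
        # At prediction time the layer does nothing: features pass through unchanged.
        return list(features)

    def backward(self, grad_from_domain_classifier):
        # The gradient received from the subsequent layer (the domain classifier)
        # is multiplied by -lam before being passed to the feature extractor.
        return [-self.lam * g for g in grad_from_domain_classifier]


grl = GradientReversal(lam=0.5)
passed_features = grl.forward([1.0, -2.0])
reversed_gradient = grl.backward([0.2, -0.4])
```

Because the sign of the domain gradient is flipped, a gradient step that would help the domain classifier tell the domains apart pushes the feature extractor in the opposite direction.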
The DANN training simultaneously updates all modules, providing samples for both $G_y$ and $G_d$. This can be done by using conventional mechanisms such as Stochastic Gradient Descent, from batches that include half of the examples from each domain. During the training process, the learning of $G_f$ pursues a trade-off between appropriate features for the classification ($G_y$) and inappropriate features for discriminating the domain of the input sample ($G_d$). The hyperparameter $\lambda$ allows tuning this trade-off. The training is performed until the result converges to a saddle point, which can be found as a stationary point in the gradient update defined by the following equation:
$$\theta_f \leftarrow \theta_f - \mu \left( \frac{\partial L_y}{\partial \theta_f} - \lambda \frac{\partial L_d}{\partial \theta_f} \right) \qquad (1)$$
where $\theta_f$ denotes the weights of $G_f$, $\mu$ denotes the learning rate, and $L_y$ and $L_d$ represent the loss functions for the label classifier and the domain classifier, respectively.
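As a small worked example of this update, the sketch below assumes the form used in the original DANN formulation, in which the feature extractor weights move against the label-loss gradient and, scaled by the trade-off hyperparameter, along the domain-loss gradient. The function name `update_feature_weights` and the argument names are hypothetical, and the gradients are given as plain lists rather than computed by backpropagation.

```python
def update_feature_weights(theta_f, grad_label, grad_domain, mu, lam):
    """One assumed update step: theta_f <- theta_f - mu * (dL_y - lam * dL_d).

    theta_f     : current feature extractor weights
    grad_label  : gradient of the label loss with respect to theta_f
    grad_domain : gradient of the domain loss with respect to theta_f
    mu          : learning rate
    lam         : trade-off hyperparameter applied by the gradient reversal
    """
    return [
        t - mu * (gy - lam * gd)
        for t, gy, gd in zip(theta_f, grad_label, grad_domain)
    ]


# Toy values chosen only to make the arithmetic easy to follow.
new_theta = update_feature_weights(
    theta_f=[1.0, 2.0], grad_label=[0.5, -1.0], grad_domain=[1.0, 1.0],
    mu=0.1, lam=0.5,
)
```

In the first coordinate the two gradient terms cancel (0.5 - 0.5 * 1.0 = 0), so the weight stays at 1.0; in the second, the weight moves from 2.0 to 2.0 - 0.1 * (-1.5) = 2.15.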
A graphical overview of the DANN architecture is depicted in Fig. 1.
3.3 Incremental DANN
Our main contribution within the context of DA is to propose an incremental approach to DANN (iDANN). This strategy is explained below.
Once the DANN model is trained as explained in the previous section, we can use both the feature extractor $G_f$ and the label classifier $G_y$ to predict the category of samples from both the target domain and the source domain. The “softmax” activation used at the output of this classifier returns the posterior probability that the network assigns to an input sample $x$ for each of the classes of the output space.
Our main assumption is that we can use the subset of samples from the target domain for which the label classifier is most confident, and then add them to the labeled source domain, taking the prediction as ground truth. These samples are thereafter treated entirely as samples of the source domain. Afterwards, we can retrain the DANN network to fine-tune its weights using the new training set. This process is repeated iteratively, moving the labeled samples with greater confidence from the target domain to the source domain after each iteration. We stop when there are no more samples to move from the target domain.
The intuitive idea behind our approach is that by adding target domain information to the source (labeled) domain, the DANN learns new domain-invariant features that better fit the eventual classification task, thereby becoming more accurate for other target domain samples. In each iteration, however, the task becomes more complex, because the simplest samples to classify (those for which the DANN is more confident) are handled first, leaving those that have more dissimilar features in the unlabeled target set. When the DANN is retrained with labeled samples that include target domain information, the domain classifier needs to be more specific. This forces the feature extraction module to forget the features that differentiate more complex samples from the target domain.
We formalize the process in Algorithm 1. Its parameters are the number of epochs and the batch size considered, the number of epochs for the incremental stage of the algorithm, the size of the subset of target domain samples to select in each iteration, and a constant that allows us to modify this size after each iteration.
In this algorithm, the samples of the target domain are classified using the label classifier $G_y$, and a subset of the configured size is then selected to be moved from the target domain to the source domain. For this purpose, two selection criteria are proposed, which are described in the next section.
Once the iterative stage of the algorithm ends, the label classifier is used to classify the entire original target domain (see line 9 of Algorithm 1). This labeled target set is then used to train a neural network from scratch, which is therefore specialized in classifying target domain samples (more details in Section 3.5).
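The iterative stage described above can be outlined as the following sketch. It is not a transcription of Algorithm 1: the function `idann`, the callables `train_dann`, `predict_proba` and `select`, and the parameters `subset_ratio` and `growth` are hypothetical placeholders, and the multiplicative growth of the subset size is an assumption made only for illustration.

```python
def idann(source, target, train_dann, predict_proba, select, subset_ratio, growth):
    """Iteratively move confidently labeled target samples into the source set.

    source        : list of (sample, label) pairs
    target        : list of unlabeled samples
    train_dann    : callable(source, target) -> model        (hypothetical)
    predict_proba : callable(model, sample) -> probabilities (hypothetical)
    select        : callable(scored, size) -> indices of selected samples
    subset_ratio  : fraction of the remaining target samples moved per iteration
    growth        : factor applied to subset_ratio after each iteration (assumed)
    """
    source, remaining = list(source), list(target)
    model = train_dann(source, remaining)  # initial DANN training
    while remaining:
        scored = [(x, predict_proba(model, x)) for x in remaining]
        size = max(1, int(subset_ratio * len(remaining)))
        chosen = set(select(scored, size))
        if not chosen:  # the policy selected nothing: stop iterating
            break
        for i in chosen:
            x, probs = scored[i]
            # The predicted label is taken as ground truth.
            source.append((x, probs.index(max(probs))))
        remaining = [x for i, x in enumerate(remaining) if i not in chosen]
        subset_ratio = min(1.0, subset_ratio * growth)
        model = train_dann(source, remaining)  # retrain with the enlarged source set
    return model, source


# Toy stand-ins so the loop can be exercised without a neural network.
def toy_train(src, tgt):
    return len(src)

def toy_proba(model, x):
    return [0.9, 0.1] if x < 0 else [0.2, 0.8]

def toy_select(scored, size):
    order = sorted(range(len(scored)), key=lambda i: max(scored[i][1]), reverse=True)
    return order[:size]

final_model, labeled = idann([(-5, 0), (5, 1)], [-1, 2, -3, 4],
                             toy_train, toy_proba, toy_select,
                             subset_ratio=0.5, growth=2.0)
```

The final annotation of the whole original target set and the training of a new network on it are left out of this sketch.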
3.4 Selection policies
Below we describe in detail the two proposed policies to select samples during the iterative stage of Algorithm 1 (selection_policy). One policy is directly based on the confidence level that the network provides to the prediction, while the other is based on geometric properties of the learned feature space.
3.4.1 Confidence policy
As mentioned above, the output of the label classifier $G_y$ uses a softmax activation. Let $K$ denote the number of labels. Then, the standard softmax function $\sigma$ is defined by Equation 2.
$$\sigma(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \quad i = 1, \dots, K \qquad (2)$$
This function normalizes a $K$-dimensional vector $\mathbf{z}$ of unbounded real values into another $K$-dimensional vector $\sigma(\mathbf{z})$, for which values range between $0$ and $1$ and add up to $1$. This can be interpreted as a posterior probability over the different possible labels Bridle (1990). In order to turn these probabilities into the predicted class label, we simply take the argmax index position of this output vector, following a Maximum a Posteriori probability criterion.
Taking advantage of this interpretation, the first policy for selecting samples to move from the target domain to the source is based on the probability provided by the label classifier $G_y$, which can be seen as a measure of confidence in such classification.
With this criterion, we keep the maximum predicted probability value for each sample of the target set among the possible labels. Then, we order all samples based on this value, from highest to lowest, and select the first samples, up to the configured subset size, to build the selected subset.
Algorithm 2 presents the algorithmic description of this process. It operates on the probabilistic output of the label classifier after the softmax activation, before applying argmax to select a label. The function sortr is used to sort the set in decreasing order.
Figure 2 shows an example of a set of probabilities obtained after predicting the target samples with DANN. The figure on the left shows the maximum probability values obtained for the classification of each sample, without sorting, while the figure on the right shows the sorted set, where the threshold has been highlighted.
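The confidence policy can be sketched in a few lines of plain Python. The helper names `softmax` and `confidence_policy`, the `size` parameter, and the toy logits are illustrative assumptions rather than the paper's Algorithm 2.

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of unbounded real values."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def confidence_policy(probabilities, size):
    """Return the indices of the `size` samples with the highest maximum probability.

    probabilities : one softmax output vector per target sample
    size          : number of samples to select
    """
    order = sorted(range(len(probabilities)),
                   key=lambda i: max(probabilities[i]), reverse=True)
    return order[:size]


# Three toy target samples with three possible labels each.
logits = [[2.0, 0.0, 0.0], [0.1, 0.0, 0.2], [0.0, 5.0, 0.0]]
probs = [softmax(z) for z in logits]
selected = confidence_policy(probs, size=2)
```

Here the third sample has the most peaked output, followed by the first, so those two are selected, while the nearly uniform second sample stays in the target set.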
3.4.2 kNN policy
As in the previous case, once the network has been trained, we use the label classifier to predict the labels of the whole target domain and then we sort them based on the confidence given by the network. However, in this case, instead of directly selecting a subset of samples according to this confidence, we also evaluate the geometric properties of the feature space. This is performed following the $k$-nearest neighbor rule.
We first obtain the feature set from the source set (using the feature extractor $G_f$). We then proceed to iterate over the target set samples sorted by their level of confidence. Given a target sample, if the label of the $k$ nearest samples of the source domain matches the label assigned by the label classifier $G_y$, then we select the sample. Otherwise, we discard it. Therefore, samples are selected based on both the confidence provided by the DANN in their label and the extent to which they match the distribution of the source domain.
Algorithm 3 describes this process algorithmically. The nearest-neighbor function used there receives as parameters the query sample, the set of source features, and the value of $k$ to be used, and yields the predicted label and the number of samples within its $k$ nearest neighbors from that set that have the same label.
The idea of this policy is to select the samples of the target domain whose features are within the cluster of the source domain for the same class. An illustrative example of this condition is shown in Fig. 3 with $k = 5$. The example shows two labels of the source domain as green circles and blue squares. The red stars denote the target domain examples that are being evaluated to determine if they are selected. For instance, the star on the left would be selected if, and only if, the network classified it as a green circle, since its 5 nearest neighbors are green circles. Similarly, the star on the right would be selected if, and only if, the network classified it as a blue square. However, the central star would always be discarded because its 5 nearest neighbors belong to two different classes.
If we increase $k$, the red star on the left would still be selected (if labeled as green circle) because it is located in the middle of the cluster. However, the red star on the right is closer to label boundaries, and so it would eventually be discarded.
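The selection rule illustrated by this example can be sketched as follows. The functions `knn_vote` and `knn_policy`, the use of Euclidean distance, and the requirement that all k neighbors agree with the predicted label are assumptions made for this sketch; it is not a transcription of Algorithm 3.

```python
import math
from collections import Counter


def knn_vote(query, source_features, source_labels, k):
    """Return (majority label, number of votes) among the k nearest source features."""
    order = sorted(range(len(source_features)),
                   key=lambda i: math.dist(query, source_features[i]))
    votes = Counter(source_labels[i] for i in order[:k])
    return votes.most_common(1)[0]


def knn_policy(target_features, target_labels, confidences,
               source_features, source_labels, k, size):
    """Visit target samples by decreasing confidence and keep up to `size` of them
    whose k nearest source neighbors all carry the label predicted by the network."""
    selected = []
    by_confidence = sorted(range(len(target_features)),
                           key=lambda i: confidences[i], reverse=True)
    for i in by_confidence:
        label, count = knn_vote(target_features[i], source_features, source_labels, k)
        if label == target_labels[i] and count == k:
            selected.append(i)
        if len(selected) == size:
            break
    return selected


# Two well-separated source clusters and three target samples.
src_x = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
src_y = ["circle"] * 4 + ["square"] * 4
tgt_x = [(0.5, 0.5), (10.5, 10.5), (5.5, 5.5)]
tgt_pred = ["circle", "circle", "square"]  # labels assigned by the network
tgt_conf = [0.9, 0.95, 0.8]
chosen = knn_policy(tgt_x, tgt_pred, tgt_conf, src_x, src_y, k=3, size=2)
```

Only the first target sample is kept: the second lies inside the square cluster but was predicted as a circle, and the third sits between the clusters, so its neighbors belong to two different classes.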
3.5 Training a CNN with the new labeled target set
As described in Algorithm 1, once the iterative stage of the iDANN algorithm is completed, we use the label classifier $G_y$ to annotate the entire original target set from scratch. Then, a new CNN is trained by conventional means, using the same neural architecture as the network formed by the feature extractor and the label classifier. This allows us to eventually get a neural network that is directly specialized in the classification of the target domain.
However, we assume that some part of the iterative annotation of the target set will contain noise at the label level. To mitigate the possible effects of this noise, we consider label smoothing Szegedy et al. (2016). This is an efficient and theoretically grounded strategy for dealing with label noise, which also makes the model less prone to overfitting.
Compared to the classical one-hot output representation, label smoothing changes the construction of the true probability to
$$q'(k \mid x) = (1 - \epsilon)\,\delta_{k,y} + \frac{\epsilon}{K} \qquad (3)$$
where $\delta_{k,y}$ equals 1 if $k$ is the label $y$ assigned to $x$ and 0 otherwise, $\epsilon$ is a small constant (or smoothing parameter), and $K$ is the total number of classes. Hence, instead of minimizing cross-entropy with hard targets (0 or 1), it considers soft targets.
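A short sketch of how such soft targets can be built is given below, assuming the usual label smoothing form in which the one-hot vector is scaled by one minus the smoothing parameter and the remaining mass is spread uniformly over all classes. The function name `smooth_labels` and the values used are illustrative only.

```python
def smooth_labels(true_label, num_classes, epsilon):
    """Build soft targets: (1 - epsilon) * one_hot + epsilon / num_classes."""
    return [
        (1.0 - epsilon) * (1.0 if k == true_label else 0.0) + epsilon / num_classes
        for k in range(num_classes)
    ]


# Four classes, true label at index 2, smoothing parameter 0.1 (arbitrary values).
targets = smooth_labels(true_label=2, num_classes=4, epsilon=0.1)
```

With these values the true class receives 0.9 + 0.025 = 0.925 and every other class receives 0.025, so the targets still add up to 1.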
4 Experimental setup
4.1 Datasets
The proposed approach is evaluated on two different classification tasks that are common in the DA literature. The first one is digit classification, for which we consider the following datasets:

MNIST LeCun et al. (1998): this collection contains images representing isolated handwritten digits.

Street View House Numbers (SVHN) Netzer et al. (2011): it consists of images obtained from house numbers from Google Street View. It represents a real-world challenge of digit recognition in natural scenes, for which several digits might appear in the same image and only the central one must be classified.

Synthetic Numbers Ganin et al. (2016): images of digits generated using Windows™ fonts, with varying position, orientation, color and resolution.
In addition, we also evaluate our approach for traffic sign classification with the following datasets:

German Traffic Sign Recognition Benchmark (GTSRB) Stallkamp et al. (2012): this dataset contains images of traffic signs obtained from the real world in different sizes, positions, and lighting conditions, as well as including occlusions.

Synthetic Signs Moiseev et al. (2013): this dataset was synthetically generated by taking common street signs from Wikipedia and applying several transformations. It tries to simulate images from GTSRB although there are significant differences between them.
Table 1 summarizes the information of our evaluation corpora, including the domain to which they belong, the number of labels, the image resolution, the number of samples, and the type of image indicating whether they are in color or grayscale format. Figure 10 shows some random examples from each of these datasets.
The images of each classification task were rescaled to the same size: the digits to pixels, and the traffic signs to pixels. Concerning the preprocessing of the input data, the images were normalized within the range . The train and test partitions were those proposed by the authors of each dataset, in order to ensure a fair comparison with the results obtained in the literature.
4.2 CNN architectures
To evaluate the proposed methodology, the same three CNN architectures used in the original DANN paper have been tested. Table 2 reports a summary of these architectures.
As the authors pointed out, these topologies are not necessarily optimal and better adaptation performance might be attained if they were tweaked. However, we chose to keep the same configuration to make a fairer comparison.
As the activation function, a Rectified Linear Unit (ReLU) was used for each convolution layer and fully-connected layer, except for the output layers. One neuron per label, with softmax activation, was used as output of the label classifier. For the output of the domain classifier, a single neuron with a logistic (sigmoid) activation function was used to discriminate between two possible categories (source domain or target domain).
Model 1 was used for all the experiments with digit datasets, except those using SVHN. This topology is inspired by the classical LeNet-5 architecture LeCun et al. (1998). Model 2 was used to evaluate the experiments with digits that include SVHN. This architecture is inspired by Srivastava et al. (2014). Finally, Model 3 was used for the experiments with traffic signs. In this case, the single-CNN baseline obtained from Cireşan et al. (2012) was used.
4.3 Training stage
To ensure a fair comparison with the original DANN algorithm, we set the same training configuration: Stochastic Gradient Descent with a learning rate of , decay of , and momentum of , as well as the same number of epochs ().
For the iterative stage of iDANN, we set to . This value was determined empirically. We observed that it allowed the network weights to be tuned with the new knowledge without taking too long to perform a new iteration. In each training iteration, the greater improvement occurs in the first epochs, after which the accuracy of the label classifier is stabilized.
Concerning the size of the subset to select from the target set (), we decided to consider a percentage of the remaining samples rather than a fixed value. Initially, we set it to , and it was increased after each iteration by () until all target domain samples are selected. This value was also obtained empirically, by observing better results and more stable training if few samples are added in the first iterations.
Different values for both the batch size and $\lambda$ are evaluated, as will be reported in the experimentation section.
5 Results
In this section we evaluate the proposed method using the datasets, topologies, and settings described in Section 4. We first study the different hyperparameterization, as well as the two prototype selection policies proposed. Next we show the performance results obtained over the datasets and, finally, we compare with other state-of-the-art methods.
5.1 Hyperparameters evaluation
In this section, we start by analyzing the influence of the batch size and the value of $\lambda$ on the performance of the method, as these hyperparameters are those that affect the training stage the most. For this, we consider six batch sizes (16, 32, 64, 128, 256, and 512) and four values of $\lambda$. This means that each result comes from a total of 336 experiments (14 combinations of dataset pairs × 6 batch values × 4 values of $\lambda$). The rest of the hyperparameters are set as indicated in Section 4.3: the number of epochs follows the original DANN paper, while the remaining values were empirically determined to favor stable training and obtain good results. In addition, we evaluate the results using only the prototype selection policy based on the network’s confidence, as the next section will be devoted to comparing the two proposed policies with the best hyperparameters found.
As we are dealing with an unsupervised method, we mainly focus on analyzing the trend when modifying these parameters. Table 3 shows the results of this experiment, where each figure represents the average of the 14 possible combinations of source and target domain of the datasets considered and all the iterations performed by the iDANN algorithm.
The first thing to remark is that some of the hyperparameter combinations evaluated in these experiments do not converge (, for traffic signs). This could be detected automatically, since the accuracy is abruptly reduced to a value approximately equal to a random guess, for both the training set and evaluation set and for both the source and the target domain. However, these results have been kept in order to observe the general trend of the method and how these parameters affect it.
It can also be observed that the best performance is achieved with in the two types of corpora, while a batch size of and are better for the digits and traffic signs, respectively. On average, better results are reported with low values and batch sizes between and . When is greater (e.g., ), the training becomes highly unstable, especially if combined with small batch sizes.
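Columns λ1 to λ4 correspond to the four evaluated values of $\lambda$, in column order.
Batch | Numbers: λ1 | λ2 | λ3 | λ4 | Traffic signs: λ1 | λ2 | λ3 | λ4
16 | 58.74 | 56.21 | 58.65 | 47.54 | 89.67 | 88.65 | 90.08 | 48.16
32 | 66.13 | 65.82 | 61.67 | 49.78 | 93.58 | 93.63 | 94.50 | 24.02
64 | 65.26 | 66.41 | 66.82 | 62.54 | 91.27 | 91.16 | 91.67 | 31.41
128 | 64.23 | 66.04 | 66.79 | 52.89 | 88.56 | 89.36 | 89.60 | 66.78
256 | 64.55 | 63.94 | 64.24 | 59.36 | 87.34 | 87.73 | 88.67 | 91.39
512 | 62.55 | 62.61 | 62.75 | 50.67 | 84.34 | 84.45 | 84.09 | 84.60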
Next, we analyze the influence of these parameters with respect to the iteration of the iDANN algorithm. Table 4 shows the average result obtained by grouping all combinations of datasets (numbers and traffic signs) and hyperparameters considered. As in the previous analysis, better results are also observed for low values and batch sizes between and (see column ‘Avg.’). In this case, it can also be seen that low values are more appropriate in the first iterations, whereas greater values are more appropriate in the last iterations. It might happen that a more stable way of proceeding (low ) is preferred in the first iterations, even at the cost of being less aggressive in the domain adaptation. Therefore, we propose to start with a low and increase its value gradually ( after each epoch).
Additionally, it is observed that each iteration of the algorithm leads to a better result than the previous one (except for ), yielding the higher leap in the first iterations and reducing this difference towards the last iterations. Including all cases, the results improve by between the first and the last iteration, on average. If we ignore those settings that do not converge, the average improvement obtained increases to .
Iterations  
Batch  1  2  3  4  5  6  7  8  9  Avg.  
16  58.46  59.71  61.46  62.81  63.91  64.86  65.39  65.78  66.04  63.16  
32  65.20  67.85  69.37  69.91  70.73  71.14  71.89  72.11  72.24  70.05  
64  63.88  67.11  68.33  68.75  69.35  70.30  70.71  71.13  71.22  68.97  
128  62.67  65.52  67.09  67.68  68.19  68.84  69.46  69.88  70.06  67.71  
256  62.82  65.20  66.92  67.83  68.65  68.84  69.57  70.09  70.34  67.81  
512  61.51  63.28  64.32  65.45  65.97  66.93  67.60  67.87  68.03  65.66  
16  56.65  57.70  59.65  60.72  61.43  62.25  62.75  63.15  63.31  60.85  
32  63.95  67.42  68.68  69.93  70.59  71.26  71.79  72.19  72.33  69.79  
64  64.66  67.39  68.78  69.89  70.41  71.32  72.06  72.44  72.57  69.95  
128  63.75  66.59  68.61  69.07  69.93  70.82  71.44  72.00  72.14  69.37  
256  62.38  65.08  66.67  67.44  67.69  68.52  69.10  69.49  69.71  67.34  
512  61.36  63.46  64.72  65.48  66.25  66.96  67.42  67.84  68.08  65.73  
16  56.73  61.57  63.37  65.13  65.81  66.31  62.94  63.13  63.26  63.14  
32  62.07  64.75  66.01  66.68  67.46  67.78  68.23  68.47  68.80  66.69  
64  64.78  67.77  69.44  70.20  71.21  71.92  72.35  72.74  72.95  70.37  
128  64.49  67.27  69.02  69.75  70.71  71.58  72.00  72.72  72.87  70.05  
256  63.16  65.55  66.75  67.49  68.32  68.76  69.51  69.81  70.21  67.73  
512  61.61  63.49  64.82  65.57  66.10  66.81  67.52  68.01  68.28  65.80  
16  46.48  50.71  52.12  53.35  53.78  42.41  42.98  43.35  43.52  47.63  
32  50.09  53.07  47.66  42.61  43.12  44.06  44.39  44.91  44.96  46.10  
64  61.64  64.02  64.24  54.01  54.48  55.44  55.96  56.47  56.59  58.09  
128  55.72  57.10  57.16  57.01  52.95  53.15  53.63  53.50  53.69  54.88  
256  60.13  62.26  62.60  63.53  64.25  64.93  65.57  66.04  66.14  63.94  
512  53.54  54.39  54.63  55.01  55.69  56.04  56.57  56.84  56.95  55.52  
Average  60.32  62.84  63.85  63.97  64.46  64.63  65.03  65.42  65.59  – 
5.2 Model analysis
We now evaluate the effect of the incremental training process on the domain adaptation approach. Figure 11 shows the evolution of the accuracy obtained over the target test set during the training process for the Syn Numbers → MNIST-M combination of datasets, with a batch size of 64 and a fixed value of $\lambda$. The training epochs are represented on the horizontal axis, while the iterations (i.e., when new training samples are added) are highlighted with blue lines and marked above. It can be observed that in the first iteration (spanning 300 epochs), the accuracy slowly improves until around 150 epochs, after which it becomes stable. In the subsequent iterations, the accuracy further improves, especially during iterations 2, 3 and 4. Then, the performance increase is gradually reduced until it is hardly noticeable.
To provide further analysis, we also examine the representation space learned by the network in each of these iterations, using the same combination of datasets and training parameters. We use the t-Distributed Stochastic Neighbor Embedding (t-SNE) van der Maaten & Hinton (2008) projection to visualize the samples according to their representation by the last hidden layer of the label predictor. Figure 12 shows a visualization of the features learned after each of the iterations, where the red color represents the target domain, the blue color represents the source domain, and the green color represents the set of samples selected using the confidence policy. This representation reveals well-defined clusters (the 10 possible classes of the datasets considered for this analysis) around an additional central cluster. This central cluster groups the samples of the target domain (red color) whose representation has not yet been correctly mapped onto any of the source domain classes. At each iteration, the method selects samples (green points) of the target domain and moves them to the source domain. In the first iterations (until the 6th one, approximately) the method selects only samples that are well located in one of the source domain clusters, that is, those samples for which the network is more confident. As a result of this process, the size of the central cluster is reduced. It is important to emphasize that this cluster becomes smaller although no samples from it are selected, which indicates that the network is learning to better map those samples because of the samples selected in previous iterations. Towards the last iterations, the method begins to select the most complex samples, which are still in this additional cluster. In Fig. 12(*) (which is the same as Fig. 12(9) but highlighting each class with a different color), the additional cluster of target samples still appears without being mapped, yet with a very small size. This cluster contains almost all the classification errors; only a few isolated prototypes are incorrectly mapped to the actual class clusters.
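The incremental behavior analyzed above can be illustrated with a minimal sketch. It is not the authors' implementation: a 1-D nearest-centroid classifier stands in for the DANN network, the confidence is a softmax over negative distances, and the names `fit_centroids`, `predict_with_confidence`, `incremental_labeling` and `per_iteration` are hypothetical. It only shows the loop structure: train, label the most confident target samples, move them to the training set, and repeat until the target set is exhausted.

```python
import math


def fit_centroids(samples, labels):
    """Stand-in for network training (assumption): one mean per class."""
    sums, counts = {}, {}
    for x, y in zip(samples, labels):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict_with_confidence(centroids, x):
    """Return (label, confidence) using a softmax over negative distances."""
    classes = sorted(centroids)
    scores = [math.exp(-abs(x - centroids[c])) for c in classes]
    total = sum(scores)
    best = max(range(len(classes)), key=lambda i: scores[i])
    return classes[best], scores[best] / total


def incremental_labeling(src_x, src_y, tgt_x, per_iteration):
    """Label the target set incrementally, most confident samples first.

    Returns the assigned labels (target index -> label) and the number of
    unlabeled target samples left after each iteration.
    """
    train_x, train_y = list(src_x), list(src_y)
    remaining = list(range(len(tgt_x)))
    assigned, history = {}, []
    while remaining:
        # Retrain on the source samples plus the target samples labeled so far.
        centroids = fit_centroids(train_x, train_y)
        scored = [(predict_with_confidence(centroids, tgt_x[i]), i)
                  for i in remaining]
        # Confidence policy: keep the samples with the highest confidence.
        scored.sort(key=lambda item: item[0][1], reverse=True)
        chosen = scored[:per_iteration]
        for (label, _confidence), i in chosen:
            assigned[i] = label          # predicted label taken as ground truth
            train_x.append(tgt_x[i])
            train_y.append(label)
        chosen_ids = {i for _, i in chosen}
        remaining = [i for i in remaining if i not in chosen_ids]
        history.append(len(remaining))
    return assigned, history
```

In this toy setting, target samples close to a source cluster are labeled first, and the retrained centroids then drift towards the target distribution, which makes the remaining samples easier to label in later iterations.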
5.3 kNN policy
In this section, we compare the two policies proposed for selecting the set of target prototypes to be added to the source domain. To this end, we evaluate whether the label assigned to each of these prototypes is correct. In this case, we make use of the ground truth of the target domain just for the sake of analysis.
We show in Fig. 13 a dotted line with the performance of the confidence policy, which may serve as a baseline here, and eight results for the kNN policy with varying values of k. As in the previous experiments, the reported figures are obtained for all combinations of datasets and hyperparameters considered.
It is observed that, as the number of iterations of the algorithm increases, the accuracy of the labels assigned to the selected prototypes decreases. However, the kNN policy generally obtains better results from the first iteration, achieving on average (over all iterations) an improvement of 6.36 % with respect to the confidence policy. This improvement is significantly greater in the last iterations, with an increase of up to 24.85 % between the result of the confidence policy and the best result obtained with the kNN policy.
The role of the parameter k is also illustrated in Fig. 13, where better results are attained as k is increased. The impact of this parameter is more noticeable in the last iterations, where a difference of up to 8.91 % is obtained between and .
Because the kNN selection policy worked better, this policy was used in all following experiments.
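The evaluation used in this section, i.e., checking at each iteration whether the labels assigned to the selected prototypes match the ground truth, can be sketched as follows. This is an illustrative helper rather than the authors' code; the function name and the data layout (a list of selected target indices per iteration) are assumptions.

```python
def pseudo_label_accuracy_per_iteration(selections, assigned, ground_truth):
    """Accuracy (in %) of the labels assigned to the prototypes selected
    at each iteration.

    selections   -- list with one list of selected target indices per iteration
    assigned     -- label assigned by the model to each target sample
    ground_truth -- true label of each target sample (used only for analysis)
    """
    accuracies = []
    for selected_ids in selections:
        if not selected_ids:
            accuracies.append(None)  # nothing was selected in this iteration
            continue
        correct = sum(1 for i in selected_ids if assigned[i] == ground_truth[i])
        accuracies.append(100.0 * correct / len(selected_ids))
    return accuracies
```

Computing this curve once per selection policy gives the kind of per-iteration comparison described for Fig. 13.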
5.4 Accuracy on target
In this section, we evaluate the final result obtained through the proposed iDANN method with the best combination of hyperparameters previously obtained for each of the dataset pairs. We will compare this result with that obtained by the original DANN method in order to check the goodness of the incremental approach.
Table 5 reports the results of the experiment, where rows indicate the dataset pairs (source and target) and columns represent the DA method. Concerning iDANN, we report two results: the accuracy of the labels assigned during the iterative process itself (1), as well as the accuracy using the CNN trained from scratch using only the target samples (once all the target samples have been assigned a label). In addition to the DANN and iDANN methods, we have also added the results obtained with the neural networks trained just with the source set (‘CNN Src.’), as well as the results obtained with the neural networks directly trained with the target set (‘CNN Tgt.’). The former serves as a baseline, to better assess the impact of the domain adaptation mechanisms, while the latter represents the upper bound of accuracy.
The first thing to remark is that the worst results obtained by the baseline (‘CNN Src.’) come from the combinations of single-digit datasets (MNIST, MNIST-M) as source and complex digit datasets (SVHN, Syn Numbers) as target. Furthermore, the best results from the baseline are reported for combinations where the source and target are similar (MNIST-M → MNIST, SVHN → Syn Numbers).
The original DANN method outperforms the results obtained by using the baseline network (‘CNN Src.’) by 10.7 %, on average, obtaining the most significant improvement for the Syn Numbers → MNIST combination (improving by 29.31 %). The impact of DANN is also noticeable when the dataset pair consists of similar tasks with the most complex one as target, such as MNIST → MNIST-M (improvement of 23 %) or Syn Signs → GTSRB (improvement of 15.49 %). These results for DANN have been obtained using our own implementation, following the details given in the original paper. We observed that the accuracy approximately matches that reported by the authors (for the 4 combinations they considered), and so we assume that our implementation is correct. We can therefore faithfully report the performance in all source-target combinations of our experiments.
Concerning the labels assigned during the proposed incremental approach iDANN (), the first thing to note is its improvement with respect to the underlying DANN method, which is around 16 %, on average. In the best cases, this improvement reaches values around 33 %, 35 % and 36 % for the Syn Numbers → MNIST-M, MNIST-M → Syn Numbers, and MNIST → Syn Numbers pairs, respectively. This confirms the goodness of our strategy, which uses the same domain adaptation method in a novel way.
Finally, if the CNN is trained from scratch with the target labels that have been automatically assigned by the iDANN (), the results improve further by 1.64 %, on average, and by up to 5.5 % in the best case (MNIST-M → Syn Numbers). It should be noted that in some specific combinations, this approach slightly outperforms the CNN trained with the correct target labels (for example, MNIST-M → MNIST or Syn Numbers → SVHN). It might happen that the incorrectly assigned labels of the iDANN process act as a regularizer that alleviates some overfitting.
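The conclusions state that this final stage trains the new network with label smoothing. As a minimal sketch of the standard uniform formulation, the function below turns an assigned label into a smoothed target distribution. The smoothing factor `epsilon` and its default value are hypothetical; the exact variant and value used in the experiments are not given in this section.

```python
def smooth_label(label, num_classes, epsilon=0.1):
    """Uniform label smoothing: (1 - epsilon) * one_hot + epsilon / num_classes.

    epsilon is an illustrative value (assumption), not the paper's setting.
    """
    if not 0 <= label < num_classes:
        raise ValueError("label must be in [0, num_classes)")
    off_value = epsilon / num_classes
    target = [off_value] * num_classes
    target[label] += 1.0 - epsilon  # remaining mass goes to the assigned class
    return target
```

Because part of the probability mass is spread over all classes, a wrongly assigned label penalizes the network less than a hard one-hot target would, which fits the setting where some of the automatically assigned labels are incorrect.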
5.5 Comparison with the state of the art
To conclude the results section, we present below a comparison with other domain adaptation strategies from the state of the art. In these works, only a few of the possible source-target pairs are considered. We show in Table 6 the results reported in the literature, along with the results obtained by our proposal (iDANN). Unlike the results of the previous section, the DANN values of Table 6 are those reported in the original paper Ganin et al. (2016). A brief description of the competing methods was provided in Section 2. Readers are referred to the corresponding references for details.
These results reveal that our method yields the best performance in 5 out of 7 source-target pairs. The performance of iDANN is especially remarkable in the case of MNIST → Syn Num, where the improvement reaches around 30 % compared to the literature. For the cases in which our proposal does not attain the best result, we observe a dissimilar performance: it is still very competitive for the MNIST → SVHN pair, whereas it is outperformed for the SVHN → MNIST pair. When all the methods already perform well, the improvement is small, but when there is enough margin, the improvement is quite remarkable (as in the case of MNIST → Syn Num).
Furthermore, it should be noted that many of the compared methods propose a specific CNN architecture for each combination of datasets and/or focus on optimizing the result for a particular combination, such as DTA or DeepJDOT. In our case, we utilized the topologies proposed in the original DANN paper, so our results would probably improve if we pursued a specific architecture adapted to each of the source-target pairs.
6 Conclusions and Future Work
This paper proposes an incremental strategy for the problem of domain adaptation with artificial neural networks. Our approach is built upon an existing domain adaptation approach, combined with a heuristic that, in each iteration, decides which prototypes of the target set can be added to the training set by considering the label provided by the neural network. To this end, two selection policies have been proposed: one directly based on the confidence given by the network to the prediction and another based on geometric properties of the learned feature space. We observed that the latter performed better, especially in the last iterations of the algorithm. In addition, we consider a final stage in which the labeled target set is used to train a new neural network with label smoothing.
Our experiments were performed on various corpora and using several configurations of the neural network. From the results, we conclude that the incremental approach outperforms the underlying DANN model, as well as other state-of-the-art methods. It is interesting to note that, in some cases, the iDANN approach improves the result obtained with the CNN trained directly with the ground-truth data of the target set, which could indicate that the incremental process also serves as a regularizer that leads to greater robustness. Furthermore, unlike the classic DANN, our approach improves results when domains are similar and helps keep the accuracy for the source domain. We also observed greater training stability and less dependence on the hyperparameter setting.
As future work, a primary objective would be to establish a well-principled stop criterion that allows us to detect when the prediction over the target samples is not reliable. In addition, we want to extend the experiments to other types of input (such as sequences), as well as to study the behavior of the incremental strategy when the underlying DA method is different, given that there currently exist several architectures for this challenge. Note that our incremental approach is independent of the underlying DA model considered, and so it could be adopted as a generic strategy that might improve to the same extent as the underlying DA algorithm improves. Other avenues to further explore this proposal are evaluating more neural network architectures and adding data augmentation to the learning process.
References
 Arbelaez et al. (2011) Arbelaez, P., Maire, M., Fowlkes, C., & Malik, J. (2011). Contour detection and hierarchical image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33, 898–916. doi:10.1109/TPAMI.2010.161.
 Ben-David et al. (2010) Ben-David, S., Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., & Vaughan, J. W. (2010). A theory of learning from different domains. Machine Learning, 79, 151–175. doi:10.1007/s10994-009-5152-4.
 Ben-David et al. (2006) Ben-David, S., Blitzer, J., Crammer, K., & Pereira, F. (2006). Analysis of representations for domain adaptation. In NIPS (pp. 137–144).
 Ben-David et al. (2007) Ben-David, S., Blitzer, J., Crammer, K., & Pereira, F. (2007). Analysis of representations for domain adaptation. In B. Schölkopf, J. C. Platt, & T. Hoffman (Eds.), Advances in Neural Information Processing Systems 19 (NIPS) (pp. 137–144). MIT Press.

 Bousmalis et al. (2017) Bousmalis, K., Silberman, N., Dohan, D., Erhan, D., & Krishnan, D. (2017). Unsupervised pixel-level domain adaptation with generative adversarial networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21–26, 2017 (pp. 95–104).
 Bousmalis et al. (2016) Bousmalis, K., Trigeorgis, G., Silberman, N., Krishnan, D., & Erhan, D. (2016). Domain separation networks. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, & R. Garnett (Eds.), Advances in Neural Information Processing Systems 29 (pp. 343–351). Curran Associates, Inc.
 Bridle (1990) Bridle, J. S. (1990). Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In F. F. Soulié, & J. Hérault (Eds.), Neurocomputing (pp. 227–236). Berlin, Heidelberg: Springer Berlin Heidelberg.
 Cheng & Pan (2014) Cheng, L., & Pan, S. J. (2014). Semi-supervised domain adaptation on manifolds. IEEE Transactions on Neural Networks and Learning Systems, 25, 2240–2249. doi:10.1109/TNNLS.2014.2308325.
 Cireşan et al. (2012) Cireşan, D., Meier, U., Masci, J., & Schmidhuber, J. (2012). Multi-column deep neural network for traffic sign classification. Neural Networks, 32, 333–338. Selected Papers from IJCNN 2011.
 Damodaran et al. (2018) Damodaran, B. B., Kellenberger, B., Flamary, R., Tuia, D., & Courty, N. (2018). Deepjdot: Deep joint distribution optimal transport for unsupervised domain adaptation. In V. Ferrari, M. Hebert, C. Sminchisescu, & Y. Weiss (Eds.), Computer Vision – ECCV 2018 (pp. 467–483). Cham: Springer International Publishing.
 Duda et al. (2001) Duda, R. O., Hart, P. E., & Stork, D. G. (2001). Pattern classification, 2nd Edition. Wiley.
 Fernando et al. (2013) Fernando, B., Habrard, A., Sebban, M., & Tuytelaars, T. (2013). Unsupervised visual domain adaptation using subspace alignment. In 2013 IEEE International Conference on Computer Vision (ICCV) (pp. 2960–2967).
 Ganin et al. (2016) Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., Marchand, M., & Lempitsky, V. (2016). Domain-adversarial training of neural networks. Journal of Machine Learning Research, 17, 1–35.
 Ghifary et al. (2016) Ghifary, M., Kleijn, W. B., Zhang, M., Balduzzi, D., & Li, W. (2016). Deep reconstruction-classification networks for unsupervised domain adaptation. In B. Leibe, J. Matas, N. Sebe, & M. Welling (Eds.), Computer Vision – ECCV 2016 (pp. 597–613). Cham: Springer International Publishing.
 Goodfellow et al. (2016) Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. http://www.deeplearningbook.org.
 Haeusser et al. (2017) Haeusser, P., Frerix, T., Mordvintsev, A., & Cremers, D. (2017). Associative domain adaptation. In The IEEE International Conference on Computer Vision (ICCV).
 He et al. (2016) He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27–30, 2016 (pp. 770–778).
 Kouw & Loog (2019) Kouw, W. M., & Loog, M. (2019). A review of domain adaptation without target labels. IEEE Transactions on Pattern Analysis and Machine Intelligence, (pp. 1–1).
 LeCun et al. (2015) LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521, 436.
 LeCun et al. (1998) LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86, 2278–2324.
 Lee et al. (2019) Lee, S., Kim, D., Kim, N., & Jeong, S.-G. (2019). Drop to adapt: Learning discriminative features for unsupervised domain adaptation. In The IEEE International Conference on Computer Vision (ICCV).
 van der Maaten & Hinton (2008) van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9, 2579–2605.
 Moiseev et al. (2013) Moiseev, B., Konev, A., Chigorin, A., & Konushin, A. (2013). Evaluation of traffic sign recognition methods trained on synthetically generated data. In J. Blanc-Talon, A. Kasinski, W. Philips, D. Popescu, & P. Scheunders (Eds.), Advanced Concepts for Intelligent Vision Systems (pp. 576–583). Cham: Springer International Publishing.
 Netzer et al. (2011) Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., & Ng, A. Y. (2011). Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011.
 Saito et al. (2019) Saito, K., Kim, D., Sclaroff, S., Darrell, T., & Saenko, K. (2019). Semi-supervised domain adaptation via minimax entropy. In The IEEE International Conference on Computer Vision (ICCV).
 Shao et al. (2014) Shao, L., Zhu, F., & Li, X. (2014). Transfer learning for visual categorization: A survey. IEEE Transactions on Neural Networks and Learning Systems, 26, 1019–1034.
 Shu et al. (2018) Shu, R., Bui, H., Narui, H., & Ermon, S. (2018). A DIRT-T approach to unsupervised domain adaptation. In International Conference on Learning Representations.
 Simonyan & Zisserman (2015) Simonyan, K., & Zisserman, A. (2015). Very deep convolutional networks for large-scale image recognition. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7–9, 2015, Conference Track Proceedings.
 Srivastava et al. (2014) Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15, 1929–1958.
 Stallkamp et al. (2012) Stallkamp, J., Schlipsing, M., Salmen, J., & Igel, C. (2012). Man vs. computer: Benchmarking machine learning algorithms for traffic sign recognition. Neural Networks, 32, 323–332. doi:10.1016/j.neunet.2012.02.016.
 Sun & Saenko (2016) Sun, B., & Saenko, K. (2016). Deep coral: Correlation alignment for deep domain adaptation. In G. Hua, & H. Jégou (Eds.), Computer Vision – ECCV 2016 Workshops (pp. 443–450). Cham: Springer International Publishing.
 Szegedy et al. (2016) Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking the inception architecture for computer vision. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 2818–2826). doi:10.1109/CVPR.2016.308.
 Tzeng et al. (2017) Tzeng, E., Hoffman, J., Saenko, K., & Darrell, T. (2017). Adversarial discriminative domain adaptation. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (pp. 2962–2971).
 Villani (2009) Villani, C. (2009). Optimal Transport: Old and New. Springer.
 Wang & Deng (2018) Wang, M., & Deng, W. (2018). Deep visual domain adaptation: A survey. Neurocomputing, 312, 135–153.
 Yao et al. (2015) Yao, T., Pan, Y., Ngo, C.-W., Li, H., & Mei, T. (2015). Semi-supervised domain adaptation with subspace learning for visual recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) (pp. 2142–2150).
 Yosinski et al. (2014) Yosinski, J., Clune, J., Bengio, Y., & Lipson, H. (2014). How transferable are features in deep neural networks? In Advances in neural information processing systems (pp. 3320–3328).