Incremental Unsupervised Domain-Adversarial Training of Neural Networks

01/13/2020 · Antonio-Javier Gallego, et al. · University of Alicante

In the context of supervised statistical learning, it is typically assumed that the training set comes from the same distribution from which the test samples are drawn. When this is not the case, the behavior of the learned model is unpredictable and becomes dependent upon the degree of similarity between the distribution of the training set and the distribution of the test set. One of the research topics that investigates this scenario is referred to as domain adaptation. Deep neural networks brought dramatic advances in pattern recognition, and that is why there have been many attempts to provide good domain adaptation algorithms for these models. Here we take a different avenue and approach the problem from an incremental point of view, where the model is adapted to the new domain iteratively. We make use of an existing unsupervised domain-adaptation algorithm to identify the target samples on which there is greater confidence about their true label. The output of the model is analyzed in different ways to determine the candidate samples. The selected set is then added to the source training set by considering the labels provided by the network as ground truth, and the process is repeated until all target samples are labeled. Our results report a clear improvement with respect to the non-incremental case in several datasets, also outperforming other state-of-the-art domain adaptation algorithms.


1 Introduction

Supervised learning is the most common approach for dealing with classification tasks. This paradigm relies on a sufficiently representative training set from which to learn a classification model. This level of representativeness is usually defined by two criteria: on the one hand, the training samples must be varied, which allows the algorithm to generalize instead of memorizing; on the other hand, the trained model is assumed to be applied to samples that come from the same distribution as those of the training set Duda et al. (2001).

Building a training set fulfilling these conditions is not always straightforward. Although obtaining samples might be easy, assigning their correct labels is costly. This is why there are efforts to alleviate the aforementioned requirements. However, while the conflict between memorization and generalization has been well studied, and there exist established mechanisms to deal with it such as regularization or data augmentation Goodfellow et al. (2016), learning a model that is able to correctly classify samples from a different target distribution remains open to further research. This problem is generally called transfer learning (TL) Shao et al. (2014), and when the classification labels do not vary in the target distribution it is usually referred to as domain adaptation (DA) Wang & Deng (2018).

Within the context of supervised learning, deep learning represents an important breakthrough LeCun et al. (2015). This term refers to the latest generation of artificial neural networks, for which novel mechanisms have been developed that allow training deeper networks, i.e., with many layers. These deep neural networks represent the state of the art in many classification tasks, and have managed to break the existing glass ceiling in many traditionally complex tasks. In turn, deep learning often requires a large amount of data, which makes the study of DA even more interesting.

As we will review in the next section, there are several alternatives to attempt DA, both general strategies and using deep neural networks. In this work we take a different avenue and study an incremental approach. We propose to use an existing DA algorithm for deep learning to classify those samples of the target domain for which the model is confident. Assuming the assigned labels as ground truth, the model is retrained. This added knowledge allows the network to refine its behavior to correctly classify other samples of the target set. This incremental process is repeated until the entire target set is completely annotated. We will show that this incremental approach achieves noticeable improvements with respect to the underlying DA algorithm. In addition, it is competitive on different benchmarks compared to other state-of-the-art DA algorithms.

The rest of the paper is structured as follows: we outline in Section 2 the existing literature about DA, with special emphasis to that based on deep neural networks; we present in Section 3 the proposed incremental methodology, as well as the underlying DA model that we consider in this work; we describe our experimental setting in Section 4, while the results are reported in Section 5; finally, the work is concluded in Section 6.

2 Background

Since the beginning of machine learning research, there has been interest in exploiting a model beyond its use on unseen samples of the source distribution. In the literature we can find two main topics that pursue this objective: the aforementioned TL and DA strategies.

In TL, some knowledge of the model is used to solve a different classification task. For example, a pre-trained DNN model can be used as initialization Simonyan & Zisserman (2015); He et al. (2016), or its feature extraction process can be considered as the basis of another classification model Yosinski et al. (2014). As a special case of TL, the DA challenge typically assumes that the classification task of the target distribution is the same (i.e., the set of labels is equal). In this work we focus on the latter case.

In a DA scenario, we can also distinguish between semi-supervised and unsupervised approaches. While semi-supervised DA considers that some labeled samples of the target distribution are available Cheng & Pan (2014); Yao et al. (2015); Saito et al. (2019), unsupervised DA works with just unlabeled samples Kouw & Loog (2019). In this section we revisit unsupervised DA techniques, since this is the case of the proposed approach.

Performing unsupervised DA is still considered an open problem from both theoretical and practical perspectives Bousmalis et al. (2017). Most approaches consider that the key is to build a good feature representation that becomes invariant to the domain Ben-David et al. (2007). A good example is the Domain Adaptation Neural Network (DANN) proposed by Ganin et al. Ganin et al. (2016), which simultaneously learns domain-invariant features from both source and target data and discriminative features from the source domain. Following this line of research, many approaches have been proposed more recently: Virtual Adversarial Domain Adaptation (VADA), proposed by Shu et al. Shu et al. (2018), added a penalty term to the loss function to penalize class boundaries that cross high-density feature regions. The Deep Reconstruction-Classification Networks (DRCN) Ghifary et al. (2016) consist of a neural network that forces a common representation of both the source and target domains by sample reconstruction, while learning the classification task from the source samples. The Domain Separation Networks (DSN) proposed by Bousmalis et al. Bousmalis et al. (2016) are trained to map input representations onto both a domain-specific subspace and a domain-independent subspace, in order to improve the way that the domain-invariant features are learned. Haeusser et al. Haeusser et al. (2017) proposed Associative Domain Adaptation (ADA), another domain-invariant feature learning approach that reinforces associations between source and target representations in an embedding space with neural networks. The Adversarial Discriminative Domain Adaptation (ADDA) strategy Tzeng et al. (2017) follows the idea of Generative Adversarial Networks, along with discriminative modeling and untied weight sharing, to learn domain-invariant features while keeping a useful representation for the discriminative task. Drop to Adapt (DTA) Lee et al. (2019) makes use of adversarial dropout to enforce discriminative domain-invariant features. Damodaran et al. Damodaran et al. (2018) proposed the Deep Joint Distribution Optimal Transport (DeepJDOT) approach, which learns both the classifier and aligned data representations between the source and target domain within a single neural framework, with a loss function based on Optimal Transport theory Villani (2009).

A different strategy for DA consists in learning how to transform features from one domain to another. Following this idea, the Subspace Alignment (SA) method Fernando et al. (2013) seeks to represent the source and target domains using subspaces modelled by eigenvectors. Then, it solves an optimization problem to align the source subspace with the target one. Also, Sun and Saenko proposed the Deep Correlation Alignment (D-CORAL) approach Sun & Saenko (2016), which consists of a neural network that learns a nonlinear transformation to align correlations of layer activations from the source and target distributions.

While the methods outlined above seek new ways to achieve the desired characteristics of a proper DA method, our proposed approach takes a different avenue. Specifically, we build upon the existing DANN approach, and we propose novel ways to improve its ability to adapt to the target domain by performing the adaptation incrementally.

3 Methodology

3.1 Preliminaries

Let $\mathcal{X}$ be the input space and $\mathcal{Y}$ be the output or label space. A classification task assumes that there exists a function $f: \mathcal{X} \rightarrow \mathcal{Y}$ that assigns a label to each possible sample of the input space. For supervised learning, the goal is to learn a hypothesis function $h: \mathcal{X} \rightarrow \mathcal{Y}$ that models the unknown function $f$ with the least possible error. We refer to $h$ as the label classifier. Quite often, the approach is to estimate a posterior probability $P(y \mid x)$, so that the label classifier follows a maximum a posteriori decision such that $h(x) = \arg\max_{y \in \mathcal{Y}} P(y \mid x)$. This is the case with neural networks.

In the DA scenario, there exist two distributions over $\mathcal{X} \times \mathcal{Y}$: $\mathcal{D}_S$ and $\mathcal{D}_T$, which are referred to as the source domain and the target domain, respectively. We focus on the case of unsupervised domain adaptation, for which DA is only provided with a labeled source set $S \sim \mathcal{D}_S$ and a completely unlabeled target set $T \sim \mathcal{D}_T$.

The goal of a DA algorithm is to build a label classifier for $\mathcal{D}_T$ by using the information provided in both $S$ and $T$.

3.2 Domain Adaptation Neural Network

Given its importance in the context of our work, we further describe here the operation of DANN, which will be considered as the backbone for our incremental approach.

DANN is based on the theory of learning from different domains discussed by Ben-David et al. (2006, 2010). This theory suggests that the transfer of the knowledge gained from one domain to another must be based on learning features that do not allow discriminating between the two domains (source and target) of the samples to be classified. For this, DANN learns a classification model from features that do not encode information about the domain of the sample to be classified, thus generalizing the knowledge from a labeled source domain to an unlabeled target domain.

More specifically, the DANN architecture includes a feature extractor module ($G_f$) and a label classifier ($G_y$), which together build a standard feed-forward neural network that can be trained to classify an input sample $x$ into one of the possible categories of the output space $\mathcal{Y}$. The last layer of the label classifier uses a “softmax” activation, which models the posterior probability $P(y \mid x)$ for a given input $x$.

DANN adds a new domain classifier module ($G_d$) to the neural network, which classifies the domain to which the input sample belongs. This classifier is built as a binary logistic regressor that models the probability that an input sample comes from the source distribution ($d = 0$ if $x \sim \mathcal{D}_S$) or the target distribution ($d = 1$ if $x \sim \mathcal{D}_T$), where $d$ denotes a binary variable that indicates the domain of the sample.

The unsupervised adaptation to a target domain is achieved as follows: the domain classifier $G_d$ is connected to the feature extractor $G_f$ (which is shared with the label classifier $G_y$) through the so-called gradient reversal layer (GRL). This layer does nothing at prediction time. However, while learning through back-propagation, it multiplies the gradient by a certain negative constant ($-\lambda$). In other words, the GRL receives the gradient from the subsequent layer and multiplies it by $-\lambda$, therefore changing its sign before passing it to the preceding layer. The idea of this operation is to force $G_f$ to learn generic features that do not allow discriminating the domain. In addition, since this training is carried out simultaneously with the training of $G_y$ (label classifier), the features must be adequate for discriminating the categories to classify, yet unbiased with respect to the input domain. According to the DA theory, this should cause $G_y$ to be able to correctly classify input samples regardless of their domain, given that the features from $G_f$ are forced to be invariant.
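To make the GRL operation more tangible, the following is a minimal sketch of how such a layer could be written; it assumes PyTorch, which is not necessarily the framework used by the authors, and the function names are illustrative.

```python
import torch


class GradientReversal(torch.autograd.Function):
    """Identity in the forward pass; multiplies the incoming gradient by -lambda in the backward pass."""

    @staticmethod
    def forward(ctx, x, lambda_):
        ctx.lambda_ = lambda_
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) the gradient flowing back towards the feature extractor G_f.
        return -ctx.lambda_ * grad_output, None


def grl(x, lambda_=1.0):
    """Insert this between G_f and G_d so that G_d's gradient is reversed for G_f."""
    return GradientReversal.apply(x, lambda_)
```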

The DANN training simultaneously updates all modules, providing samples from both $\mathcal{D}_S$ and $\mathcal{D}_T$. This can be done by using conventional mechanisms such as Stochastic Gradient Descent, from batches that include half of the examples from each domain. During the training process, the learning of $G_f$ pursues a trade-off between appropriate features for the classification ($G_y$) and inappropriate features for discriminating the domain of the input sample ($G_d$). The hyper-parameter $\lambda$ allows tuning this trade-off. The training is performed until the result converges to a saddle point, which can be found as a stationary point in the gradient update defined by the following equation:

$$\theta_f \leftarrow \theta_f - \mu \left( \frac{\partial \mathcal{L}_y}{\partial \theta_f} - \lambda \frac{\partial \mathcal{L}_d}{\partial \theta_f} \right) \qquad (1)$$

where $\theta_f$ denotes the weights of $G_f$, $\mu$ denotes the learning rate, and $\mathcal{L}_y$ and $\mathcal{L}_d$ represent the loss functions for the label classifier and the domain classifier, respectively.
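As an illustration of how this simultaneous update can be realized in practice, the sketch below builds a mini-batch with a source half and a target half and relies on the GRL sketch above to reverse the domain gradient, so that a single optimizer step approximates the update of Eq. (1). The module names (Gf, Gy, Gd) and the assumption that Gd outputs a single logit are ours, not taken from the authors' code.

```python
import torch
import torch.nn.functional as F


def dann_step(Gf, Gy, Gd, optimizer, xs, ys, xt, lambda_):
    """One DANN update: (xs, ys) is a labeled source half-batch, xt an unlabeled target half-batch."""
    optimizer.zero_grad()

    feat_s, feat_t = Gf(xs), Gf(xt)

    # Label loss L_y, computed on source samples only.
    loss_y = F.cross_entropy(Gy(feat_s), ys)

    # Domain loss L_d, computed on both halves; the GRL reverses its gradient w.r.t. G_f.
    feats = torch.cat([feat_s, feat_t])
    domains = torch.cat([torch.zeros(len(xs)), torch.ones(len(xt))])  # 0 = source, 1 = target
    loss_d = F.binary_cross_entropy_with_logits(Gd(grl(feats, lambda_)).squeeze(1), domains)

    (loss_y + loss_d).backward()   # G_f receives dL_y/dtheta_f - lambda * dL_d/dtheta_f, as in Eq. (1)
    optimizer.step()
```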

A graphical overview of the DANN architecture is depicted in Fig. 1.

Figure 1: Graphical overview of the DANN architecture, consisting of three blocks: feature extractor ($G_f$), label classifier ($G_y$), and domain classifier ($G_d$). The GRL circle denotes the gradient reversal layer, which multiplies the gradient by the negative factor $-\lambda$ during back-propagation.

3.3 Incremental DANN

Our main contribution within the context of DA is to propose an incremental approach to DANN (iDANN). This strategy is explained below.

Once the DANN model is trained as explained in the previous section, we can use both the feature extractor $G_f$ and the label classifier $G_y$ to predict the category of samples from both the target domain ($T$) and the source domain ($S$). The “softmax” activation used at the output of this classifier returns the posterior probability that the network assigns to each of the classes of the output space $\mathcal{Y}$.

Our main assumption is that we can take the subset of samples from the target domain for which $G_y$ is more confident, and add them to the labeled source set assuming the predictions as ground truth. These samples are thereafter considered as samples of the source domain completely. Afterwards, we can retrain the DANN network to fine-tune its weights using the new training set. This process is repeated iteratively, moving the labeled samples with greater confidence from the target domain to the source domain after each iteration. We stop when there are no more samples to move from the target domain.

The intuitive idea behind our approach is that by adding target domain information to the source (labeled) domain, the DANN learns new domain-invariant features that better fit the eventual classification task, thereby becoming more accurate for other target domain samples. In each iteration, however, the task increases its complexity because it deals first with the simplest samples to classify (for which the DANN is more confident), leaving those that have more dissimilar features in the unlabeled target set. When the DANN is retrained with labeled samples that include target domain information, the domain classifier needs to be more specific. This forces the feature extraction module to forget the features that differentiate more complex samples from the target domain.

We formalize the process in Algorithm 1, where $E$ and $B$ represent the number of epochs and the batch size considered, respectively, $E_i$ denotes the number of epochs for the incremental stage of the algorithm, $N$ indicates the size of the subset of target domain samples to select in each iteration, and $\delta$ is a constant that allows us to modify this size after each iteration.

Input: labeled source set $S$, unlabeled target set $T$, $E$, $B$, $E_i$, $N$, $\delta$
Output: labeled target set, neural network specialized in the target domain
1  Train the DANN ($G_f$, $G_y$, $G_d$) on $S$ and $T$ for $E$ epochs with batch size $B$
2  while $T \neq \emptyset$ do
3      Predict the labels of the samples in $T$ with $G_y(G_f(\cdot))$
4      $\hat{T} \leftarrow$ selection_policy($S$, $T$, $N$)
5      $S \leftarrow S \cup \hat{T}$ (taking the predicted labels as ground truth); $T \leftarrow T \setminus \hat{T}$
6      $N \leftarrow N \cdot \delta$
7      Retrain the DANN on $S$ and $T$ for $E_i$ epochs with batch size $B$
8  end while
9  Label the entire original target set with $G_y$
10 Train a new CNN from scratch with the labeled target set (Section 3.5)
Algorithm 1 Incremental DANN (iDANN)

In this algorithm, the samples of the target domain ($T$) are classified using the label classifier $G_y$, and then a subset $\hat{T}$ of size $N$ is selected to be moved from the target domain to the source domain. For this purpose, two selection criteria are proposed, which are described in the next section.

Once the iterative stage of the algorithm ends, the label classifier is used to classify the entire original target domain (see line 9 of Algorithm 1). This labeled target set is used to then train a neural network from scratch, which is therefore specialized in classifying target domain samples (more details in Section 3.5).
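The overall loop can also be summarized in Python-like form; this is only a sketch under the assumption that train_dann, predict, and selection_policy are hypothetical wrappers around the operations described in Sections 3.2 and 3.4.

```python
def idann(source, target, epochs, inc_epochs, n, delta, selection_policy):
    """source: list of (x, y) pairs; target: list of unlabeled samples x.
    train_dann, predict and selection_policy are assumed wrappers, not part of the paper."""
    model = train_dann(source, target, epochs)                 # initial DANN training
    remaining = list(range(len(target)))                       # indices of still-unlabeled target samples

    while remaining:
        probs = predict(model, [target[i] for i in remaining])          # softmax outputs, one row per sample
        chosen = selection_policy(model, source, remaining, probs, n)   # list of (index, predicted label)
        source = source + [(target[i], y) for i, y in chosen]           # predictions taken as ground truth
        remaining = [i for i in remaining if i not in dict(chosen)]
        n = int(n * delta)                                              # grow the selection size (Algorithm 1, line 6)
        model = train_dann(source, [target[i] for i in remaining],
                           inc_epochs, init=model)                      # fine-tune with the updated sets

    return predict(model, target)        # labels for the whole original target set (Algorithm 1, line 9)
```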

3.4 Selection policies

Below we describe in detail the two proposed policies to select samples during the iterative stage of Algorithm 1 (selection_policy). One policy is directly based on the confidence level that the network provides to the prediction, while the other is based on geometric properties of the learned feature space.

3.4.1 Confidence policy

As mentioned above, the output of the label classifier uses a softmax activation. Let $L$ denote the number of labels. Then, the standard softmax function is defined by Equation 2.

$$\sigma(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^{L} e^{z_j}}, \quad i = 1, \ldots, L \qquad (2)$$

This function normalizes an $L$-dimensional vector $\mathbf{z}$ of unbounded real values into another $L$-dimensional vector $\sigma(\mathbf{z})$, whose values range between 0 and 1 and add up to 1. This can be interpreted as a posterior probability over the different possible labels Bridle (1990). In order to turn these probabilities into the predicted class label, we simply take the argmax-index position of this output vector, following a maximum a posteriori probability criterion.
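For illustration, the softmax normalization and the maximum a posteriori decision can be written in a few lines of NumPy (a small generic snippet, not tied to the authors' implementation):

```python
import numpy as np


def softmax(z):
    """Normalize an L-dimensional score vector into a probability distribution."""
    e = np.exp(z - z.max())        # subtract the maximum for numerical stability
    return e / e.sum()


scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)                            # approx. [0.659, 0.242, 0.099]
label, confidence = probs.argmax(), probs.max()    # MAP decision and its confidence
```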

Taking advantage of this interpretation, the first policy for selecting samples to move from the target domain to the source domain is based on the probability provided by the label classifier $G_y$, which can be seen as a measure of confidence in such classification.

With this criterion, we keep the maximum predicted probability value for each sample of the target set among the $L$ possible labels. Then, we order all samples based on this value—from highest to lowest—in order to select the first $N$ samples to build the subset $\hat{T}$.

Algorithm 2 presents the algorithmic description of this process, where $G_y(G_f(x))$ refers to the probabilistic output of the label classifier after the softmax activation, before applying argmax to select a label. The function sortr is used to sort the set in decreasing order.

Input: unlabeled target set $T$, subset size $N$
Output: selected subset $\hat{T}$
1  $P \leftarrow \{(x, \max G_y(G_f(x))) : x \in T\}$   // maximum softmax probability of each target sample
2  $P \leftarrow$ sortr($P$)   // sort by decreasing confidence
3  $\hat{T} \leftarrow \emptyset$
4  foreach $(x, p) \in P$ do
5      $\hat{T} \leftarrow \hat{T} \cup \{(x, \arg\max G_y(G_f(x)))\}$
6      if $|\hat{T}| \geq N$ then
7          break
8      end if
9  end foreach
Algorithm 2 Confidence policy

Figure 2 shows an example of the set of probabilities obtained after predicting the target samples with DANN. The figure on the left shows the maximum probability value obtained for the classification of each sample—without sorting—while the figure on the right shows the sorted set, where the selection threshold (given by $N$) has been highlighted.

Figure 2: Example of probabilities obtained with DANN. Left: maximum probability of each sample. Right: ordered set of maximum probabilities, where the selection threshold has been highlighted.
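A compact sketch of this policy, assuming the softmax outputs for the remaining target samples are available as a NumPy array (names and interface are illustrative):

```python
import numpy as np


def confidence_policy(probs, n):
    """probs: (num_target_samples, L) softmax outputs.
    Returns the indices and predicted labels of the n most confident target samples."""
    confidence = probs.max(axis=1)        # maximum posterior probability per sample
    order = np.argsort(-confidence)       # decreasing order of confidence (sortr)
    selected = order[:n]
    return selected, probs[selected].argmax(axis=1)
```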

3.4.2 kNN policy

As in the previous case, once the network has been trained, we use the label classifier to predict the labels of the whole target domain and then sort the samples based on the confidence given by the network. However, in this case, instead of directly selecting a subset of samples according to this confidence, we also evaluate the geometric properties of the feature space. This is performed following the $k$-nearest neighbor (kNN) rule.

We first obtain the feature set of the source set (using $G_f$). We then iterate over the target set samples sorted by their level of confidence. Given a target sample, if the labels of its $k$ nearest samples from the source domain match the label assigned by the label classifier $G_y$, then we select the prototype. Otherwise, we discard it. Therefore, samples are selected based on both the confidence provided by the DANN in their label and the extent to which they match the distribution of the source domain.

Algorithm 3 describes this process algorithmically. The function kNN receives as parameters the query sample $x$, the source feature set, and the value of $k$ to be used, and yields the predicted label and the number of samples within the $k$ nearest neighbors of $x$ that have that same label.

Input: labeled source set $S$, unlabeled target set $T$, subset size $N$, number of neighbors $k$
Output: selected subset $\hat{T}$
1  $F \leftarrow \{G_f(x) : (x, y) \in S\}$   // feature representation of the source set
2  $P \leftarrow$ target samples sorted by decreasing confidence of $G_y$
3  $\hat{T} \leftarrow \emptyset$
4  foreach $x \in P$ do
5      $\hat{y} \leftarrow \arg\max G_y(G_f(x))$
6      $(y_{kNN}, m) \leftarrow$ kNN($G_f(x)$, $F$, $k$)   // neighbor label and number of agreeing neighbors
7      if $y_{kNN} = \hat{y}$ and $m = k$ then
8          $\hat{T} \leftarrow \hat{T} \cup \{(x, \hat{y})\}$
9          if $|\hat{T}| \geq N$ then
10             break
11         end if
12     end if
13 end foreach
Algorithm 3 kNN policy

The idea of this policy is to select the samples of the target domain whose features lie within the cluster of the source domain for the same class. An illustrative example of this condition is shown in Fig. 3 with $k = 5$. The example shows two labels of the source domain as green circles and blue squares. The red stars denote the target domain examples that are being evaluated to determine whether they are selected. For instance, the star on the left would be selected if, and only if, the network classified it as a green circle, since its 5 nearest neighbors are green circles. Similarly, the star on the right would be selected if, and only if, the network classified it as a blue square. However, the central star would always be discarded because its 5 neighbors belong to two different classes.

If we increase $k$, the red star on the left would still be selected (if labeled as a green circle) because it is located in the middle of the cluster. However, the red star on the right is closer to the label boundaries, and so it would eventually be discarded.

Figure 3: Example of sample selection using the kNN policy with $k = 5$. Green circles and blue squares represent samples of two different classes from the source domain. Red stars represent the samples of the target domain that are evaluated to determine whether they are chosen.
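The kNN policy can be sketched as follows, using scikit-learn's NearestNeighbors over the source features produced by G_f; the function name, the NumPy-array inputs, and the exact interface are our assumptions rather than the authors' implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def knn_policy(source_feats, source_labels, target_feats, target_probs, n, k):
    """Select up to n target samples whose k nearest source neighbours all share the predicted label."""
    nn = NearestNeighbors(n_neighbors=k).fit(source_feats)
    order = np.argsort(-target_probs.max(axis=1))        # most confident target samples first
    predicted = target_probs.argmax(axis=1)

    selected = []
    for i in order:
        _, idx = nn.kneighbors(target_feats[i:i + 1])    # indices of the k nearest source samples
        if np.all(source_labels[idx[0]] == predicted[i]):  # unanimous agreement with the predicted label
            selected.append(i)
            if len(selected) >= n:
                break
    selected = np.array(selected, dtype=int)
    return selected, predicted[selected]
```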

3.5 Training a CNN with the new labeled target set

As described in Algorithm 1, once the iterative stage of the iDANN algorithm is completed, we use the label classifier $G_y$ to annotate the entire original target set. Then, a new CNN is trained from scratch by conventional means, considering the same neural architecture as $G_f$ and $G_y$. This allows us to eventually obtain a neural network that is directly specialized in the classification of the target domain.

However, we assume that some part of the iterative annotation of the target set will contain noise at the label level. To mitigate the possible effects of this noise, we consider label smoothing Szegedy et al. (2016). This is an efficient and theoretically-grounded strategy for dealing with label noise, which also makes the model less prone to overfitting.

Compared to the classical one-hot output representation, label smoothing changes the construction of the true probability to

$$q'(k \mid x) = (1 - \epsilon)\,\delta_{k,y} + \frac{\epsilon}{L} \qquad (3)$$

where $\delta_{k,y}$ equals 1 when $k$ is the ground-truth label $y$ and 0 otherwise, $\epsilon$ is a small constant (or smoothing parameter), and $L$ is the total number of classes. Hence, instead of minimizing cross-entropy with hard targets (0 or 1), it considers soft targets.
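For example, the smoothed targets of Eq. (3) can be built as follows (a small sketch; the smoothing value eps = 0.1 is merely illustrative):

```python
import numpy as np


def smooth_targets(labels, num_classes, eps=0.1):
    """Turn hard integer labels into label-smoothed soft targets, as in Eq. (3)."""
    q = np.full((len(labels), num_classes), eps / num_classes)
    q[np.arange(len(labels)), labels] += 1.0 - eps
    return q


# Cross-entropy is then computed against these soft targets: loss = -sum_k q'_k * log(p_k)
soft = smooth_targets(np.array([2]), num_classes=5)   # [[0.02, 0.02, 0.92, 0.02, 0.02]]
```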

4 Experimental setup

4.1 Datasets

The proposed approach will be evaluated with two different classification tasks that are common in the DA literature. The first one is digit classification, for which we consider the following datasets:

  • MNIST LeCun et al. (1998): this collection contains images representing isolated handwritten digits.

  • MNIST-M Ganin et al. (2016): this dataset was synthetically generated by merging MNIST samples with random color patches from BSDS500 Arbelaez et al. (2011).

  • Street View House Numbers (SVHN) Netzer et al. (2011): it consists of images obtained from house numbers from Google Street View. It represents a real-world challenge of digit recognition in natural scenes, for which several digits might appear in the same image and only the central one must be classified.

  • Synthetic Numbers Ganin et al. (2016): images of digits generated using Windows™ fonts, with varying position, orientation, color and resolution.

In addition, we also evaluate our approach for traffic sign classification with the following datasets:

  • German Traffic Sign Recognition Benchmark (GTSRB) Stallkamp et al. (2012): this dataset contains images of traffic signs obtained from the real world in different sizes, positions, and lighting conditions, as well as including occlusions.

  • Synthetic Signs Moiseev et al. (2013): this dataset was synthetically generated by taking common street signs from Wikipedia and applying several transformations. It tries to simulate images from GTSRB although there are significant differences between them.

Table 1 summarizes the information of our evaluation corpora, including the set to which they belong, the number of labels, the number of samples, and the type of image, indicating whether they are in color or grayscale format. Figure 10 shows some random examples from each of these datasets.

Set             # labels   Domain         Gray/Color   # samples
Numbers         10         MNIST          Gray         65,000
                           MNIST-M        Color        65,000
                           SVHN           Color        99,289
                           Syn. Numbers   Color        488,953
Traffic signs   43         GTSRB          Color        51,839
                           Syn. Signs     Color        100,000

Table 1: Description of the datasets used in the experimentation.
(a) MNIST
(b) MNIST-M
(c) SVHN
(d) Syn. Numbers
(e) GTSRB
(f) Syn. Signs
Figure 10: Random examples from the datasets used in experimentation.

The images of each classification task were rescaled to the same size (one common resolution for the digit datasets and another for the traffic sign datasets). Concerning the pre-processing of the input data, the pixel values of the images were normalized. The train and test partitions were those proposed by the authors of each dataset, in order to ensure a fair comparison with the results obtained in the literature.

4.2 CNN architectures

To evaluate the proposed methodology, the same three CNN architectures used in the original DANN paper have been tested. Table 2 reports a summary of these architectures.

As the authors pointed out, these topologies are not necessarily optimal and better adaptation performance might be attained if they were tweaked. However, we chose to keep the same configuration to make a fairer comparison.

As the activation function, a Rectified Linear Unit (ReLU) was used for each convolution layer and fully-connected layer, except for the output layers. A layer of $L$ neurons with softmax activation was used as the output of the label classifier. For the output of the domain classifier, a single neuron with a logistic (sigmoid) activation function was used to discriminate between the two possible categories (source domain or target domain).

Model 1 was used for all the experiments with digit datasets, except those using SVHN. This topology is inspired by the classical LeNet-5 architecture LeCun et al. (1998). Model 2 was used to evaluate the experiments with digits that include SVHN. This architecture is inspired by Srivastava et al. (2014). Finally, Model 3 was used for the experiments with traffic signs. In this case, the single-CNN baseline from Cireşan et al. (2012) was used.

Model   Feature extractor                                                                                Label classifier              Domain classifier
1       Conv(32, 5, 5) + MaxPool(2, 2); Conv(48, 5, 5) + MaxPool(2, 2)                                   FC(100), FC(100), FC(L)       FC(100), FC(1)
2       Conv(64, 5, 5) + MaxPool(3, 3); Conv(64, 5, 5) + MaxPool(3, 3); Conv(128, 5, 5)                  FC(3072), FC(2048), FC(L)     FC(1024), FC(1024), FC(1)
3       Conv(96, 5, 5) + MaxPool(2, 2); Conv(144, 3, 3) + MaxPool(2, 2); Conv(256, 5, 5) + MaxPool(2, 2) FC(512), FC(L)                FC(1024), FC(1024), FC(1)

Table 2: CNN network configurations considered. Notation: Conv(n, w, h) stands for a layer with n convolution operators with a kernel of size w × h pixels, MaxPool(w, h) stands for the max-pooling operator of dimensions w × h pixels—with the same stride in all cases—and FC(n) represents a fully-connected layer of n neurons. In the output layer of the label classifier, a fully-connected layer of L neurons with softmax activation is added, where L denotes the number of categories of the dataset at issue.
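As an illustration, Model 1 (the LeNet-inspired topology) could be written in PyTorch roughly as follows; the number of input channels, the use of lazy linear layers, and other details not given in Table 2 are assumptions.

```python
import torch.nn as nn

L = 10  # number of categories (digit datasets)

feature_extractor = nn.Sequential(            # G_f
    nn.Conv2d(3, 32, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 48, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
)

label_classifier = nn.Sequential(             # G_y
    nn.LazyLinear(100), nn.ReLU(),
    nn.Linear(100, 100), nn.ReLU(),
    nn.Linear(100, L),                         # softmax is applied by the loss function
)

domain_classifier = nn.Sequential(             # G_d, fed through the GRL
    nn.LazyLinear(100), nn.ReLU(),
    nn.Linear(100, 1),                         # single logit: source vs. target
)
```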

4.3 Training stage

To ensure a fair comparison with the original DANN algorithm, we set the same training configuration: Stochastic Gradient Descent with the same learning rate, decay, and momentum values, as well as the same number of epochs, as in the original work.

For the iterative stage of iDANN, the number of epochs $E_i$ was determined empirically. We observed that the chosen value allowed the network weights to be tuned with the new knowledge without taking too long to perform a new iteration. In each training iteration, the greatest improvement occurs in the first epochs, after which the accuracy of the label classifier stabilizes.

Concerning the size of the subset to select from the target set ($N$), we decided to consider a percentage of the remaining samples rather than a fixed value. Initially, a small percentage was used, and it was increased after each iteration (through the constant $\delta$) until all target domain samples were selected. These values were also obtained empirically, since we observed better results and more stable training when few samples are added in the first iterations.

Different values for both the batch size and $\lambda$ are evaluated, as will be reported in the experimentation section.

5 Results

In this section we evaluate the proposed method using the datasets, topologies, and settings described in Section 4. We first study the different hyper-parameter configurations, as well as the two prototype selection policies proposed. Next, we show the performance results obtained over the datasets and, finally, we compare them with other state-of-the-art methods.

5.1 Hyper-parameters evaluation

In this section, we start by analyzing the influence of the batch size and the value of $\lambda$ on the performance of the method, as these hyper-parameters are those that affect the training stage the most. For this, we consider six batch sizes (from 16 to 512) and four values of $\lambda$. This means that each result comes from a total of 336 experiments (14 combinations of dataset pairs × 6 batch sizes × 4 values of $\lambda$). The rest of the hyper-parameters are set as indicated in Section 4.3, with values that were empirically determined to favor stable training and obtain good results. In addition, we evaluate the results using only the prototype selection policy based on the network’s confidence, as the next section will be devoted to comparing the two proposed policies with the best hyper-parameters found.

As we are dealing with an unsupervised method, we mainly focus on analyzing the trend when modifying these parameters. Table 3 shows the results of this experiment, where each figure represents the average of the 14 possible combinations of source and target domain of the datasets considered and all the iterations performed by the iDANN algorithm.

The first thing to remark is that some of the hyper-parameter combinations evaluated in these experiments do not converge (most notably, the largest $\lambda$ combined with small batch sizes on the traffic sign datasets). This could be detected automatically, since the accuracy is abruptly reduced to a value approximately equal to a random guess, for both the training and evaluation sets and for both the source and the target domain. However, these results have been kept in order to observe the general trend of the method and how these parameters affect it.

It can also be observed that the best performance is achieved with low $\lambda$ values in the two types of corpora, while batch sizes of 64 and 32 are better for the digits and the traffic signs, respectively. On average, better results are reported with low $\lambda$ values and moderate batch sizes. When $\lambda$ is greater, the training becomes highly unstable, especially if combined with small batch sizes.

                    Numbers                              Traffic signs
Batch     λ1      λ2      λ3      λ4           λ1      λ2      λ3      λ4
16       58.74   56.21   58.65   47.54        89.67   88.65   90.08   48.16
32       66.13   65.82   61.67   49.78        93.58   93.63   94.50   24.02
64       65.26   66.41   66.82   62.54        91.27   91.16   91.67   31.41
128      64.23   66.04   66.79   52.89        88.56   89.36   89.60   66.78
256      64.55   63.94   64.24   59.36        87.34   87.73   88.67   91.39
512      62.55   62.61   62.75   50.67        84.34   84.45   84.09   84.60
Table 3: Influence of the hyper-parameter setting on the performance (accuracy, in %) of the iDANN algorithm. Columns λ1 to λ4 denote the four values of λ considered, from lowest to highest. Figures report classification accuracy over the target set, averaged over the respective datasets and the iterations of the algorithm.

Next, we analyze the influence of these parameters with respect to the iterations of the iDANN algorithm. Table 4 shows the average result obtained by grouping all combinations of datasets (numbers and traffic signs) and hyper-parameters considered. As in the previous analysis, better results are also observed for low $\lambda$ values and intermediate batch sizes (see column ‘Avg.’). In this case, it can also be seen that low $\lambda$ values are more appropriate in the first iterations, whereas greater values are more appropriate in the last iterations. It might happen that a more stable way of proceeding (low $\lambda$) is preferred in the first iterations, even at the cost of being less aggressive in the domain adaptation. Therefore, we propose to start with a low $\lambda$ and increase its value gradually after each epoch.

Additionally, it is observed that each iteration of the algorithm leads to a better result than the previous one (except for the highest $\lambda$), yielding the largest leap in the first iterations and reducing this difference towards the last iterations. Including all cases, the results improve by around 5 % between the first and the last iteration, on average. If we ignore those settings that do not converge, the average improvement obtained is even higher.

                                        Iterations
λ     Batch     1       2       3       4       5       6       7       8       9      Avg.
λ1    16      58.46   59.71   61.46   62.81   63.91   64.86   65.39   65.78   66.04   63.16
      32      65.20   67.85   69.37   69.91   70.73   71.14   71.89   72.11   72.24   70.05
      64      63.88   67.11   68.33   68.75   69.35   70.30   70.71   71.13   71.22   68.97
      128     62.67   65.52   67.09   67.68   68.19   68.84   69.46   69.88   70.06   67.71
      256     62.82   65.20   66.92   67.83   68.65   68.84   69.57   70.09   70.34   67.81
      512     61.51   63.28   64.32   65.45   65.97   66.93   67.60   67.87   68.03   65.66
λ2    16      56.65   57.70   59.65   60.72   61.43   62.25   62.75   63.15   63.31   60.85
      32      63.95   67.42   68.68   69.93   70.59   71.26   71.79   72.19   72.33   69.79
      64      64.66   67.39   68.78   69.89   70.41   71.32   72.06   72.44   72.57   69.95
      128     63.75   66.59   68.61   69.07   69.93   70.82   71.44   72.00   72.14   69.37
      256     62.38   65.08   66.67   67.44   67.69   68.52   69.10   69.49   69.71   67.34
      512     61.36   63.46   64.72   65.48   66.25   66.96   67.42   67.84   68.08   65.73
λ3    16      56.73   61.57   63.37   65.13   65.81   66.31   62.94   63.13   63.26   63.14
      32      62.07   64.75   66.01   66.68   67.46   67.78   68.23   68.47   68.80   66.69
      64      64.78   67.77   69.44   70.20   71.21   71.92   72.35   72.74   72.95   70.37
      128     64.49   67.27   69.02   69.75   70.71   71.58   72.00   72.72   72.87   70.05
      256     63.16   65.55   66.75   67.49   68.32   68.76   69.51   69.81   70.21   67.73
      512     61.61   63.49   64.82   65.57   66.10   66.81   67.52   68.01   68.28   65.80
λ4    16      46.48   50.71   52.12   53.35   53.78   42.41   42.98   43.35   43.52   47.63
      32      50.09   53.07   47.66   42.61   43.12   44.06   44.39   44.91   44.96   46.10
      64      61.64   64.02   64.24   54.01   54.48   55.44   55.96   56.47   56.59   58.09
      128     55.72   57.10   57.16   57.01   52.95   53.15   53.63   53.50   53.69   54.88
      256     60.13   62.26   62.60   63.53   64.25   64.93   65.57   66.04   66.14   63.94
      512     53.54   54.39   54.63   55.01   55.69   56.04   56.57   56.84   56.95   55.52
Average       60.32   62.84   63.85   63.97   64.46   64.63   65.03   65.42   65.59
Table 4: Influence of the hyper-parameter setting on the performance of the iDANN algorithm with respect to the number of iterations. Each group of rows corresponds to one of the four values of λ considered (λ1 to λ4, from lowest to highest). Figures report classification accuracy over the target set, averaged over the respective datasets.

5.2 Model analysis

We now evaluate the effect of the incremental training process on the domain adaptation approach. Figure 11 shows the evolution of the accuracy obtained over the target test set during the training process for the Syn Numbers → MNIST-M combination of datasets, with a batch size of 64. The training epochs are represented on the horizontal axis, while the iterations (i.e., when new training samples are added) are highlighted with blue lines and marked above. It can be observed that in the first iteration (spanning 300 epochs), the accuracy slowly improves until around 150 epochs, after which it becomes stable. In the subsequent iterations, the accuracy further improves, especially during iterations 2, 3 and 4. Then, the performance increase is gradually reduced until it is hardly noticeable.

Figure 11: Accuracy curve with respect to training epochs and iterations of the incremental approach.

To provide further analysis, we also examine the representation space learned by the network in each of these iterations, using the same combination of datasets and training parameters. We use the t-Distributed Stochastic Neighbor Embedding (t-SNE) van der Maaten & Hinton (2008) projection to visualize the samples according to their representation by the last hidden layer of the label predictor. Figure 12 shows a visualization of the features learned after each of the iterations, where the red color represents the target domain, the blue color represents the source domain, and the green color represents the selected subset $\hat{T}$ obtained with the confidence policy. This representation reveals well-defined clusters—the 10 possible classes of the datasets considered for this analysis—around an additional central cluster. This central cluster groups the samples of the target domain (red color) whose representation does not correspond to any of the existing classes yet. It therefore corresponds to target samples whose representation has not been correctly mapped onto any of the source domain classes. Iteratively, the method selects samples (green points) of the target domain and moves them to the source domain. In the first iterations—until the 6th one, approximately—the method selects only samples that are well located in one of the source domain clusters (that is, those samples for which the network is more confident). Due to this process, the size of the central cluster is reduced. It is important to emphasize that this cluster becomes smaller although no samples out of it are selected, which indicates that the network is learning to better map those samples thanks to the samples selected in previous iterations. Towards the last iterations, the method begins to select the most complex samples that are still in this additional cluster. In Fig. 12(*) (which is the same as Fig. 12(9) but highlighting each class with a different color), the additional cluster of target samples still appears without being mapped, yet with a very small size. This cluster contains almost all the classification errors, with only a few isolated prototypes incorrectly mapped onto the actual class clusters.

Figure 12: t-SNE representation of the feature space of the neural network (from the last hidden layer) with respect to the iteration of the approach. The red color represents the target domain, the blue color represents the source domain, and the green color represents the selected subset $\hat{T}$. The last representation (*) depicts each sample according to its actual category by using different colors.
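A sketch of how such a projection can be obtained with scikit-learn, assuming the last-hidden-layer activations are available as a NumPy array (the placeholder data and plotting settings are illustrative, not the original configuration):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# feats: (num_samples, feature_dim) activations from the last hidden layer of the label predictor;
# group: 0 for source, 1 for target, 2 for the selected subset (placeholder data used here).
feats = np.random.rand(300, 100)
group = np.random.randint(0, 3, size=300)

embedding = TSNE(n_components=2, perplexity=30).fit_transform(feats)
plt.scatter(embedding[:, 0], embedding[:, 1], c=group, cmap='brg', s=5)
plt.show()
```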

5.3 kNN policy

We compare in this section the two policies proposed for selecting the set of target prototypes to be added to the source domain. To this end, we evaluate whether the label assigned to each of these prototypes is correct. In this case, we make use of the ground-truth of the target domain just for the sake of analysis.

We show in Fig. 13 a dotted line with the performance of the confidence policy, which serves as a baseline here, and eight results for the kNN policy with varying values of $k$. As in the previous experiments, the reported figures are averaged over all combinations of datasets and hyper-parameters considered.

It is observed that, as the number of iterations of the algorithm increases, the accuracy of the labels assigned to the selected prototypes decreases. However, the kNN policy generally obtains better results from the first iteration, attaining on average (over all iterations) an improvement of 6.36 % with respect to the confidence policy. This improvement is significantly greater in the last iterations, with an increase of up to 24.85 % between the result of the confidence policy and the best result obtained with the kNN policy.

The role of the parameter $k$ is also illustrated in Fig. 13, where better results are attained as $k$ is increased. The impact of this parameter is more noticeable in the last iterations, where a difference of up to 8.91 % is obtained between the smallest and the largest values of $k$ considered.

Figure 13: Accuracy of the labels assigned to the selected prototype set ($\hat{T}$) from the target domain, according to the selection policy.

Because the kNN selection policy worked better, this policy was used in all following experiments.

5.4 Accuracy on target

In this section, we evaluate the final result obtained through the proposed iDANN method with the best combination of hyper-parameters previously obtained for each of the dataset pairs. We compare this result with that obtained by the original DANN method in order to assess the benefit of the incremental approach.

Table 5 reports the results of the experiment, where rows indicate the dataset pairs (source and target) and columns represent the DA method. Concerning iDANN, we report two results: the accuracy of the labels assigned during the iterative process itself (‘iDANN labels’), as well as the accuracy using the CNN trained from scratch using only the target samples once all of them have been assigned a label (‘iDANN CNN’). In addition to the DANN and iDANN methods, we have also added the results obtained with the neural network trained just with the source set (‘CNN Src.’), as well as the results obtained with the neural network directly trained with the target set (‘CNN Tgt.’). The former serves as a baseline, to better assess the impact of the domain-adaptation mechanisms, while the latter represents the upper bound of accuracy.

The first thing to remark is that the worst results obtained by the baseline (‘CNN Src.’) come from the combinations with single-digit datasets (MNIST, MNIST-M) as source and complex digit datasets (SVHN, Syn Numbers) as target. Furthermore, the best results from the baseline are reported for combinations where the source and target are similar (MNIST-M → MNIST, SVHN → Syn Numbers).

The original DANN method outperforms the baseline network (‘CNN Src.’) by 10.7 % on average, obtaining the most significant improvement for the Syn Numbers → MNIST combination (improving by 29.31 %). The impact of DANN is also noticeable when the dataset pair consists of similar tasks with the most complex one as target, such as MNIST → MNIST-M (improvement of 23 %) or Syn Signs → GTSRB (improvement of 15.49 %). These results for DANN have been obtained using our own implementation, following the details given in the original paper. We observed that the accuracy approximately matches that reported by the authors (for the 4 combinations they considered), and so we assume that our implementation is correct. We can therefore faithfully report the performance in all source-target combinations of our experiments.

Concerning the labels assigned during the proposed incremental approach (column ‘iDANN labels’), the first thing to note is the improvement with respect to the underlying DANN method, which is around 16 % on average. In the best cases, this improvement reaches values around 33 %, 35 % and 36 % for the Syn Numbers → MNIST-M, MNIST-M → Syn Numbers, and MNIST → Syn Numbers pairs, respectively. This confirms the goodness of our strategy, which uses the same domain adaptation method in a novel way.

Finally, if the CNN is trained from scratch with the target labels that have been automatically assigned by iDANN (column ‘iDANN CNN’), the results further improve by 1.64 % on average, and by up to 5.5 % in the best case (MNIST-M → Syn Numbers). It should be noted that in some specific combinations this approach slightly outperforms the CNN trained with the correct target labels (for example, MNIST-M → MNIST or Syn Numbers → SVHN). It might happen that the incorrectly assigned labels of the iDANN process act as a regularizer that alleviates some overfitting.

Source         Target         CNN Src.   DANN    iDANN labels   iDANN CNN   CNN Tgt.
MNIST          MNIST-M        55.71      78.70   96.09          96.67       97.34
               SVHN           16.26      31.32   35.83          36.49       90.93
               Syn Numbers    32.14      44.66   80.79          84.82       99.34
MNIST-M        MNIST          97.95      98.65   99.04          99.59       98.94
               SVHN           32.91      41.41   61.87          61.89       90.93
               Syn Numbers    46.34      54.02   89.49          94.99       99.34
SVHN           MNIST          59.04      67.08   82.72          84.50       98.94
               MNIST-M        43.49      47.42   66.40          67.62       97.34
               Syn Numbers    88.42      89.56   96.43          98.10       99.34
Syn Numbers    MNIST          60.04      89.35   98.13          99.35       98.94
               MNIST-M        41.84      54.38   87.10          90.26       97.34
               SVHN           85.16      87.24   91.42          91.95       90.93
GTSRB          Syn Signs      76.39      86.22   98.28          98.57       99.74
Syn Signs      GTSRB          69.79      85.28   96.31          98.00       97.89
Average                       57.53      68.23   84.28          85.91       96.95

Table 5: Accuracy (%) over the target dataset for different strategies: ‘CNN Src.’ indicates a neural network trained only with source samples; ‘DANN’ denotes the original DANN strategy; ‘iDANN’ yields two results from the incremental strategy: ‘iDANN labels’ refers to the accuracy of the labels assigned to the target samples during the iterative process, while ‘iDANN CNN’ refers to the classification after training a new CNN from scratch using the labels assigned to the target samples; and ‘CNN Tgt.’ denotes a CNN trained using the ground truth of the target samples.

5.5 Comparison with the state of the art

To conclude the results section, we present below a comparison with other domain adaptation strategies from the state of the art. These works do not consider all possible source-target pairs, but only a few combinations of them. We show in Table 6 the results reported in the literature (unlike the results of the previous section, the DANN values of Table 6 are those reported in the original paper Ganin et al. (2016)), along with the results obtained by our proposal (iDANN). A brief description of the competing methods was provided in Section 2. Readers are referred to the corresponding references for details.

These results reveal that our method yields the best performance in 5 out of the 7 source-target pairs. The performance of iDANN is especially remarkable in the case of MNIST → Syn Num, where the improvement reaches around 30 % compared to the literature. For the cases in which our proposal does not attain the best result, we observe a dissimilar performance: it is still very competitive for the MNIST → SVHN pair, whereas it is clearly outperformed for the SVHN → MNIST pair. When all the results are already good, the margin for improvement is small, but when there is enough margin, the improvement is quite remarkable (as in the case of MNIST → Syn Num).

Source:                             MNIST     MNIST    MNIST     SVHN     SVHN      Syn Num   Syn Num
Target:                             MNIST-M   SVHN     Syn Num   MNIST    Syn Num   MNIST     SVHN
SA Fernando et al. (2013)           56.9      –        –         59.32    –         –         86.44
DRCN Ghifary et al. (2016)          –         40.05*   –         81.97    –         –         –
DSN Bousmalis et al. (2016)         83.2      –        –         82.7     –         –         91.2
DANN Ganin et al. (2016)            76.66     12.4     22.9      73.85    96.9      87.6      91.09
D-CORAL Sun & Saenko (2016)         –         35       55.8      76.3     95.5      89.9      78.8
ADDA Tzeng et al. (2017)            –         –        –         76       –         –         –
ADA Haeusser et al. (2017)          –         12.9     34.8      96.3     95.5      97.1      88.1
VADA Shu et al. (2018)              –         18.6     45.9      92.9     96.8      96.2      85.3
DeepJDOT Damodaran et al. (2018)    92.4      –        –         96.7     –         –         –
DTA Lee et al. (2019)               –         –        –         99.4*    –         –         –
iDANN                               96.67*    36.49    84.82*    84.50    98.10*    99.35*    91.95*

Table 6: Comparison of accuracy (%) between state-of-the-art DA approaches and iDANN. The first two rows denote the source and target dataset, respectively. The best result for each combination (column) is marked with an asterisk. Dashes indicate that the result is not reported in the literature.

Furthermore, it should be noted that many of the compared methods propose a specific CNN architecture for each combination of datasets and/or focus on optimizing the result for a particular combination, such as DTA or DeepJDOT. In our case, we utilized the topologies proposed in the original DANN paper, so it could be expected that our results would further improve if a specific architecture were tuned for each of the source-target pairs.

6 Conclusions and Future Work

This paper proposes an incremental strategy for the problem of domain adaptation with artificial neural networks. Our approach is built upon an existing domain adaptation approach, combined with a heuristic that, in each iteration, decides which prototypes of the target set can be added to the training set by considering the label provided by the neural network. To this end, two selection policies have been proposed: one directly based on the confidence given by the network to the prediction, and another based on geometric properties of the learned feature space. We observed that the latter reported a better performance, especially in the last iterations of the algorithm. In addition, we consider a final stage in which the labeled target set is used to train a new neural network with label smoothing.

Our experiments were performed on various corpora and using several configurations of the neural network. From the results, we conclude that the incremental approach outperforms the underlying DANN model, as well as other state-of-the-art methods. It is interesting to note that, in some cases, the iDANN approach improves upon the result obtained with the CNN trained directly with the ground-truth data of the target set, which could indicate that the incremental process also serves as a regularizer that leads to greater robustness. Furthermore, unlike the classic DANN, our approach improves results when domains are similar and helps keep the accuracy on the source domain. We also observed greater training stability and less dependence on the hyper-parameter setting.

As future work, a primary objective would be to establish a well-principled stopping criterion that allows us to detect when the prediction over the target samples is not reliable. In addition, we want to extend the experiments to other types of input (such as sequences), as well as to study the behavior of the incremental strategy when the underlying DA method is different, given that there currently exist several architectures for this challenge. Note that our incremental approach is independent of the underlying DA model considered, and so it could be adopted as a generic strategy that might improve to the same extent as the underlying DA algorithm improves. Other avenues to further explore this proposal are to evaluate more neural network architectures, as well as adding data augmentation to the learning process.

References

  • Arbelaez et al. (2011) Arbelaez, P., Maire, M., Fowlkes, C., & Malik, J. (2011). Contour detection and hierarchical image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33, 898–916. doi:10.1109/TPAMI.2010.161.
  • Ben-David et al. (2010) Ben-David, S., Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., & Vaughan, J. W. (2010). A theory of learning from different domains. Machine Learning, 79, 151–175. doi:10.1007/s10994-009-5152-4.
  • Ben-David et al. (2006) Ben-David, S., Blitzer, J., Crammer, K., & Pereira, F. (2006). Analysis of representations for domain adaptation. In NIPS (pp. 137–144).
  • Ben-David et al. (2007) Ben-David, S., Blitzer, J., Crammer, K., & Pereira, F. (2007). Analysis of representations for domain adaptation. In B. Schölkopf, J. C. Platt, & T. Hoffman (Eds.), Advances in Neural Information Processing Systems 19 (NIPS) (pp. 137–144). MIT Press.
  • Bousmalis et al. (2017) Bousmalis, K., Silberman, N., Dohan, D., Erhan, D., & Krishnan, D. (2017). Unsupervised pixel-level domain adaptation with generative adversarial networks. In

    2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017

    (pp. 95–104).
  • Bousmalis et al. (2016) Bousmalis, K., Trigeorgis, G., Silberman, N., Krishnan, D., & Erhan, D. (2016). Domain separation networks. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, & R. Garnett (Eds.), Advances in Neural Information Processing Systems 29 (pp. 343–351). Curran Associates, Inc.
  • Bridle (1990) Bridle, J. S. (1990). Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In F. F. Soulié, & J. Hérault (Eds.), Neurocomputing (pp. 227–236). Berlin, Heidelberg: Springer Berlin Heidelberg.
  • Cheng & Pan (2014) Cheng, L., & Pan, S. J. (2014). Semi-supervised domain adaptation on manifolds. IEEE Transactions on Neural Networks and Learning Systems, 25, 2240–2249. doi:10.1109/TNNLS.2014.2308325.
  • Cireşan et al. (2012) Cireşan, D., Meier, U., Masci, J., & Schmidhuber, J. (2012). Multi-column deep neural network for traffic sign classification. Neural Networks, 32, 333 – 338. Selected Papers from IJCNN 2011.
  • Damodaran et al. (2018) Damodaran, B. B., Kellenberger, B., Flamary, R., Tuia, D., & Courty, N. (2018). Deepjdot: Deep joint distribution optimal transport for unsupervised domain adaptation. In V. Ferrari, M. Hebert, C. Sminchisescu, & Y. Weiss (Eds.), Computer Vision – ECCV 2018 (pp. 467–483). Cham: Springer International Publishing.
  • Duda et al. (2001) Duda, R. O., Hart, P. E., & Stork, D. G. (2001). Pattern classification, 2nd Edition. Wiley.
  • Fernando et al. (2013) Fernando, B., Habrard, A., Sebban, M., & Tuytelaars, T. (2013). Unsupervised visual domain adaptation using subspace alignment. In 2013 IEEE International Conference on Computer Vision (ICCV) (pp. 2960–2967).
  • Ganin et al. (2016) Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., March, M., & Lempitsky, V. (2016). Domain-adversarial training of neural networks. Journal of Machine Learning Research, 17, 1–35.
  • Ghifary et al. (2016) Ghifary, M., Kleijn, W. B., Zhang, M., Balduzzi, D., & Li, W. (2016). Deep reconstruction-classification networks for unsupervised domain adaptation. In B. Leibe, J. Matas, N. Sebe, & M. Welling (Eds.), Computer Vision – ECCV 2016 (pp. 597–613). Cham: Springer International Publishing.
  • Goodfellow et al. (2016) Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. http://www.deeplearningbook.org.
  • Haeusser et al. (2017) Haeusser, P., Frerix, T., Mordvintsev, A., & Cremers, D. (2017). Associative domain adaptation. In The IEEE International Conference on Computer Vision (ICCV).
  • He et al. (2016) He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016 (pp. 770–778).
  • Kouw & Loog (2019) Kouw, W. M., & Loog, M. (2019). A review of domain adaptation without target labels. IEEE Transactions on Pattern Analysis and Machine Intelligence, (pp. 1–1).
  • LeCun et al. (2015) LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521, 436.
  • LeCun et al. (1998) LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86, 2278–2324.
  • Lee et al. (2019) Lee, S., Kim, D., Kim, N., & Jeong, S.-G. (2019). Drop to adapt: Learning discriminative features for unsupervised domain adaptation. In The IEEE International Conference on Computer Vision (ICCV).
  • van der Maaten & Hinton (2008) van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9, 2579–2605.
  • Moiseev et al. (2013) Moiseev, B., Konev, A., Chigorin, A., & Konushin, A. (2013). Evaluation of traffic sign recognition methods trained on synthetically generated data. In J. Blanc-Talon, A. Kasinski, W. Philips, D. Popescu, & P. Scheunders (Eds.), Advanced Concepts for Intelligent Vision Systems (pp. 576–583). Cham: Springer International Publishing.
  • Netzer et al. (2011) Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., & Ng, A. Y. (2011). Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011.
  • Saito et al. (2019) Saito, K., Kim, D., Sclaroff, S., Darrell, T., & Saenko, K. (2019). Semi-supervised domain adaptation via minimax entropy. In The IEEE International Conference on Computer Vision (ICCV).
  • Shao et al. (2014) Shao, L., Zhu, F., & Li, X. (2014). Transfer learning for visual categorization: A survey. IEEE Transactions on Neural Networks and Learning Systems, 26, 1019–1034.
  • Shu et al. (2018) Shu, R., Bui, H., Narui, H., & Ermon, S. (2018). A DIRT-t approach to unsupervised domain adaptation. In International Conference on Learning Representations.
  • Simonyan & Zisserman (2015) Simonyan, K., & Zisserman, A. (2015). Very deep convolutional networks for large-scale image recognition. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.
  • Srivastava et al. (2014) Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15, 1929–1958.
  • Stallkamp et al. (2012) Stallkamp, J., Schlipsing, M., Salmen, J., & Igel, C. (2012). Man vs. computer: Benchmarking machine learning algorithms for traffic sign recognition. Neural Networks, 32, 323–332. doi:10.1016/j.neunet.2012.02.016.
  • Sun & Saenko (2016) Sun, B., & Saenko, K. (2016). Deep coral: Correlation alignment for deep domain adaptation. In G. Hua, & H. Jégou (Eds.), Computer Vision – ECCV 2016 Workshops (pp. 443–450). Cham: Springer International Publishing.
  • Szegedy et al. (2016) Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking the inception architecture for computer vision. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 2818–2826). doi:10.1109/CVPR.2016.308.
  • Tzeng et al. (2017) Tzeng, E., Hoffman, J., Saenko, K., & Darrell, T. (2017). Adversarial discriminative domain adaptation. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (pp. 2962–2971).
  • Villani (2009) Villani, C. (2009). Optimal Transport Old and New. Springer.
  • Wang & Deng (2018) Wang, M., & Deng, W. (2018). Deep visual domain adaptation: A survey. Neurocomputing, 312, 135 – 153.
  • Yao et al. (2015) Yao, T., Pan, Y., Ngo, C.-W., Li, H., & Mei, T. (2015). Semi-supervised domain adaptation with subspace learning for visual recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) (pp. 2142–2150).
  • Yosinski et al. (2014) Yosinski, J., Clune, J., Bengio, Y., & Lipson, H. (2014). How transferable are features in deep neural networks? In Advances in neural information processing systems (pp. 3320–3328).