The cost of generating labeled data for a new learning task is often an obstacle to applying machine learning methods. There is thus great incentive to develop ways of exploiting data from one problem in order to generalize to another. Domain adaptation focuses on the situation where we have data generated from two different, but somehow similar, distributions. One example arises in sentiment analysis of written reviews, where we might want to distinguish the positive reviews from the negative ones. While we might have labeled data for reviews of one type of product (e.g., movies), we might want to generalize to reviews of other products (e.g., books). Domain adaptation tries to achieve such a transfer by exploiting an extra set of unlabeled training data for the new problem to which we wish to generalize (e.g., unlabeled reviews of books).
One of the main approaches to achieving such a transfer is to learn a classifier and a representation that favor the transfer. A large body of work exists on training both a classifier and a representation that are linear (BruzzoneM10S; pbda; CortesM14). However, recent research has shown that non-linear neural networks can also be successful (Glorot+al-ICML-2011). Specifically, a variant of the denoising autoencoder (VincentP2008), known as marginalized stacked denoising autoencoders (mSDA) (Chen12), has demonstrated state-of-the-art performance on this problem. By learning a representation that is robust to input corruption noise, mSDA obtains a representation that is also more stable across changes of domain and can thus allow cross-domain transfer.
In this paper, we propose to control the stability of the representation between domains explicitly within a neural network learning algorithm. This approach is motivated by theory on domain adaptation (BenDavid-NIPS06; BenDavid-MLJ2010) that suggests that a good representation for cross-domain transfer is one for which an algorithm cannot learn to identify the domain of origin of the input observation. We show that this principle can be implemented in a neural network learning objective that includes a term where the network’s hidden layer works adversarially against output connections predicting domain membership. The neural network is then simply trained by gradient descent on this objective. The success of this domain-adversarial neural network (DANN) is confirmed by extensive experiments on both toy and real world datasets. In particular, we show that DANN achieves better performance than a regular neural network and an SVM on a sentiment analysis classification benchmark. Moreover, we show that DANN can reach state-of-the-art performance by taking as input the representation learned by mSDA, confirming that explicitly minimizing domain discriminability improves over relying only on a representation that is robust to noise.
2 Domain Adaptation
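We consider binary classification tasks where $X$ is the input space and $Y = \{0, 1\}$ is the label set. Moreover, we have two different distributions over $X \times Y$, called the source domain $\mathcal{D}_S$ and the target domain $\mathcal{D}_T$. A domain adaptation learning algorithm is then provided with a labeled source sample $S$ drawn i.i.d. from $\mathcal{D}_S$, and an unlabeled target sample $T$ drawn i.i.d. from $\mathcal{D}_T^X$, where $\mathcal{D}_T^X$ is the marginal distribution of $\mathcal{D}_T$ over $X$:
$$S = \{(x_i^s, y_i^s)\}_{i=1}^{m} \sim (\mathcal{D}_S)^m, \qquad T = \{x_i^t\}_{i=1}^{m'} \sim (\mathcal{D}_T^X)^{m'}.$$
The goal of the learning algorithm is to build a classifier $\eta : X \to Y$ with a low target risk
$$R_{\mathcal{D}_T}(\eta) = \Pr_{(x, y) \sim \mathcal{D}_T}\big(\eta(x) \neq y\big),$$
while having no information about the labels of $\mathcal{D}_T$.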
2.1 Domain Divergence
To tackle the challenging domain adaptation task, many approaches bound the target error by the sum of the source error and a notion of distance between the source and the target distributions. These methods are intuitively justified by a simple assumption: the source risk is expected to be a good indicator of the target risk when both distributions are similar. Several notions of distance have been proposed for domain adaptation (BenDavid-NIPS06; BenDavid-MLJ2010; Mansour-COLT09; MansourMR09; pbda). In this paper, we focus on the $\mathcal{H}$-divergence used by BenDavid-NIPS06; BenDavid-MLJ2010, and based on the earlier work of kifer-2004.
Definition 1 (BenDavid-NIPS06; BenDavid-MLJ2010; kifer-2004).
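Given two domain distributions $\mathcal{D}_S^X$ and $\mathcal{D}_T^X$ over $X$, and a hypothesis class $\mathcal{H}$, the $\mathcal{H}$-divergence between $\mathcal{D}_S^X$ and $\mathcal{D}_T^X$ is
$$d_{\mathcal{H}}(\mathcal{D}_S^X, \mathcal{D}_T^X) = 2 \sup_{\eta \in \mathcal{H}} \Big| \Pr_{x^s \sim \mathcal{D}_S^X}\big[\eta(x^s) = 1\big] - \Pr_{x^t \sim \mathcal{D}_T^X}\big[\eta(x^t) = 1\big] \Big|.$$
That is, the $\mathcal{H}$-divergence relies on the capacity of the hypothesis class $\mathcal{H}$ to distinguish examples generated by $\mathcal{D}_S^X$ from examples generated by $\mathcal{D}_T^X$. BenDavid-NIPS06; BenDavid-MLJ2010 proved that, for a symmetric hypothesis class $\mathcal{H}$, one can compute the empirical $\mathcal{H}$-divergence between two samples $S \sim (\mathcal{D}_S^X)^m$ and $T \sim (\mathcal{D}_T^X)^{m'}$ by computing
$$\hat{d}_{\mathcal{H}}(S, T) = 2 \Big( 1 - \min_{\eta \in \mathcal{H}} \Big[ \frac{1}{m} \sum_{i=1}^{m} I[\eta(x_i^s) = 1] + \frac{1}{m'} \sum_{i=1}^{m'} I[\eta(x_i^t) = 0] \Big] \Big), \tag{1}$$
where $I[a]$ is the indicator function, which is $1$ if predicate $a$ is true, and $0$ otherwise.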
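As a small illustration of the empirical divergence, the sketch below evaluates $2(1 - \min_\eta[\cdot])$ by brute force over a finite hypothesis class. Everything named here is an assumption made for illustration: the 1-D threshold classifiers, the helper `threshold_class`, and the convention that the bracket counts source points mapped to 1 and target points mapped to 0. Because the class is closed under complement (symmetric), swapping that convention would give the same value.

```python
def empirical_h_divergence(source, target, hypotheses):
    """Brute-force 2 * (1 - min_h [rate of h(x)=1 on source + rate of h(x)=0 on target])."""
    best = min(
        sum(h(x) == 1 for x in source) / len(source)
        + sum(h(x) == 0 for x in target) / len(target)
        for h in hypotheses
    )
    return 2.0 * (1.0 - best)


def threshold_class(thresholds):
    """Hypothetical symmetric class: every threshold rule together with its complement."""
    hypotheses = []
    for t in thresholds:
        hypotheses.append(lambda x, t=t: 1 if x > t else 0)
        hypotheses.append(lambda x, t=t: 0 if x > t else 1)
    return hypotheses
```

With two perfectly separable 1-D samples the value reaches its maximum of 2, and with two identical samples it drops to 0, which matches the reading of the divergence as a measure of how well the class can tell the samples apart.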
2.2 Proxy Distance
BenDavid-NIPS06; BenDavid-MLJ2010 suggested that, even if it is generally hard to compute $\hat{d}_{\mathcal{H}}(S, T)$ exactly (e.g., when $\mathcal{H}$ is the space of linear classifiers on $X$), we can easily approximate it by running a learning algorithm on the problem of discriminating between source and target examples. To do so, we construct a new dataset
$$U = \{(x_i^s, 1)\}_{i=1}^{m} \cup \{(x_i^t, 0)\}_{i=1}^{m'}, \tag{2}$$
where the examples of the source sample are labeled $1$ and the examples of the target sample are labeled $0$. Then, the risk of the classifier trained on the new dataset $U$ approximates the “min” part of Equation (1). Thus, given a test error $\epsilon$ on the problem of discriminating between source and target examples, the Proxy A-distance (PAD) is given by
$$\hat{d}_{\mathcal{A}} = 2(1 - 2\epsilon). \tag{3}$$
In the experiments section of this paper, we compute the PAD value following the approach of Glorot+al-ICML-2011; Chen12, i.e., we train a linear SVM on a subset of dataset $U$ (Equation (2)), and we use the obtained classifier error on the other subset as the value of $\epsilon$ in Equation (3).
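The mapping from a domain-classification error to a PAD value is a one-line computation. The sketch below assumes the usual form of the Proxy A-distance, $2(1 - 2\epsilon)$, with $\epsilon$ the held-out error of a source-versus-target classifier; the function name is hypothetical.

```python
def proxy_a_distance(error):
    """PAD value 2 * (1 - 2 * error) from a held-out domain-classification error."""
    if not 0.0 <= error <= 1.0:
        raise ValueError("error must lie in [0, 1]")
    return 2.0 * (1.0 - 2.0 * error)
```

A classifier at chance level (error 0.5) yields a PAD of 0, meaning the two samples look indistinguishable, while a perfect domain classifier (error 0) yields the maximal value of 2.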
2.3 Generalization Bound on the Target Risk
The work of BenDavid-NIPS06; BenDavid-MLJ2010 also showed that the $\mathcal{H}$-divergence $d_{\mathcal{H}}(\mathcal{D}_S^X, \mathcal{D}_T^X)$ is upper bounded by its empirical estimate $\hat{d}_{\mathcal{H}}(S, T)$ plus a constant complexity term that depends on the VC dimension of $\mathcal{H}$ and the size of samples $S$ and $T$. By combining this result with a similar bound on the source risk, the following theorem is obtained.
Theorem 2 (BenDavid-NIPS06).
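Let $\mathcal{H}$ be a hypothesis class of VC dimension $d$. With probability $1 - \delta$ over the choice of samples $S \sim (\mathcal{D}_S)^m$ and $T \sim (\mathcal{D}_T^X)^m$, for every $\eta \in \mathcal{H}$:
$$R_{\mathcal{D}_T}(\eta) \le \hat{R}_S(\eta) + \sqrt{\frac{4}{m}\Big(d \log\frac{2e\,m}{d} + \log\frac{4}{\delta}\Big)} + \hat{d}_{\mathcal{H}}(S, T) + 4\sqrt{\frac{1}{m}\Big(d \log\frac{2m}{d} + \log\frac{4}{\delta}\Big)} + \beta,$$
with $\beta \ge \inf_{\eta^* \in \mathcal{H}} \big[ R_{\mathcal{D}_S}(\eta^*) + R_{\mathcal{D}_T}(\eta^*) \big]$, and
$$\hat{R}_S(\eta) = \frac{1}{m} \sum_{i=1}^{m} I[\eta(x_i^s) \neq y_i^s]$$
is the empirical source risk.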
The previous result tells us that $R_{\mathcal{D}_T}(\eta)$ can be low only when the $\beta$ term of the bound is low, i.e., only when there exists a classifier that can achieve a low risk on both distributions. It also tells us that, to find a classifier with a small $R_{\mathcal{D}_T}(\eta)$ in a given class of fixed VC dimension, the learning algorithm should minimize (in that class) a trade-off between the source risk $\hat{R}_S(\eta)$ and the empirical $\mathcal{H}$-divergence $\hat{d}_{\mathcal{H}}(S, T)$. As pointed out by BenDavid-NIPS06, a strategy to control the $\mathcal{H}$-divergence is to find a representation of the examples where the source and the target domains are as indistinguishable as possible. Under such a representation, a hypothesis with a low source risk will, according to Theorem 2, perform well on the target data. In this paper, we present an algorithm that directly exploits this idea.
3 A Domain-Adversarial Neural Network
The originality of our approach is to explicitly implement the idea exhibited by Theorem 2 into a neural network classifier. That is, to learn a model that can generalize well from one domain to another, we ensure that the internal representation of the neural network contains no discriminative information about the origin of the input (source or target), while preserving a low risk on the source (labeled) examples.
3.1 Source Risk Minimization (Standard NN)
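Let us consider the following standard neural network (NN) architecture with one hidden layer:
$$h(x) = \mathrm{sigm}(b + W x), \qquad f(x) = \mathrm{softmax}(c + V h(x)), \tag{4}$$
with $\mathrm{sigm}(a) = \big[\tfrac{1}{1 + \exp(-a_i)}\big]_{i=1}^{|a|}$ and $\mathrm{softmax}(a) = \big[\tfrac{\exp(a_i)}{\sum_{j=1}^{|a|} \exp(a_j)}\big]_{i=1}^{|a|}$.
Note that each component $f_y(x)$ of $f(x)$ denotes the conditional probability that the neural network assigns $x$ to class $y$. Given a training source sample $S = \{(x_i^s, y_i^s)\}_{i=1}^{m}$, the natural classification loss to use is the negative log-probability of the correct label:
$$\mathcal{L}(f(x), y) = \log \frac{1}{f_y(x)}.$$
This leads to the following learning problem on the source domain:
$$\min_{W, V, b, c} \Big[ \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}(f(x_i^s), y_i^s) \Big]. \tag{5}$$
We view the output of the hidden layer $h(\cdot)$ (Equation (4)) as the internal representation of the neural network. Thus, we denote the source sample representations as
$$h(S) = \{h(x_i^s)\}_{i=1}^{m}.$$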
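To make the architecture concrete, here is a minimal pure-Python sketch of the forward pass of a one-hidden-layer network and of the negative log-probability loss. It assumes a sigmoid hidden layer and a softmax output layer; the parameter names `W`, `b`, `V`, `c` and all function names are illustrative choices, and weight matrices are stored as lists of rows.

```python
import math


def sigm(a):
    """Element-wise logistic sigmoid of a list of pre-activations."""
    return [1.0 / (1.0 + math.exp(-ai)) for ai in a]


def softmax(a):
    """Softmax of a list of scores, shifted by the maximum for numerical stability."""
    top = max(a)
    e = [math.exp(ai - top) for ai in a]
    total = sum(e)
    return [ei / total for ei in e]


def affine(bias, M, v):
    """Compute bias + M v, with M given as a list of rows."""
    return [bk + sum(mkj * vj for mkj, vj in zip(row, v)) for bk, row in zip(bias, M)]


def hidden(x, W, b):
    """Internal representation h(x) produced by the hidden layer."""
    return sigm(affine(b, W, x))


def predict(x, W, b, V, c):
    """Class probabilities f(x) produced by the output layer."""
    return softmax(affine(c, V, hidden(x, W, b)))


def nll(f, y):
    """Negative log-probability that f assigns to the correct label y."""
    return -math.log(f[y])
```

With all parameters set to zero, every hidden unit outputs 0.5 and the two classes receive equal probability, so the loss of any example equals $\log 2$.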
3.2 A Domain Adaptation Regularizer
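Now, consider an unlabeled sample from the target domain $T = \{x_i^t\}_{i=1}^{m'}$ and the corresponding representations $h(T) = \{h(x_i^t)\}_{i=1}^{m'}$. Based on Equation (1), the empirical $\mathcal{H}$-divergence of a symmetric hypothesis class $\mathcal{H}$ between samples $h(S)$ and $h(T)$ is given by
$$\hat{d}_{\mathcal{H}}(h(S), h(T)) = 2 \Big( 1 - \min_{\eta \in \mathcal{H}} \Big[ \frac{1}{m} \sum_{i=1}^{m} I[\eta(h(x_i^s)) = 1] + \frac{1}{m'} \sum_{i=1}^{m'} I[\eta(h(x_i^t)) = 0] \Big] \Big). \tag{6}$$
Let us consider $\mathcal{H}$ as the class of hyperplanes in the representation space. Inspired by the Proxy A-distance (see Section 2.2), we suggest estimating the “min” part of Equation (6) by a logistic regressor that models the probability that a given input (either $x^s$ or $x^t$) is from the source domain $\mathcal{D}_S^X$ (denoted $z = 1$) or the target domain $\mathcal{D}_T^X$ (denoted $z = 0$):
$$p(z = 1 \mid \phi) = o(\phi) = \mathrm{sigm}(d + u^\top \phi), \tag{7}$$
where $\phi$ is either $h(x^s)$ or $h(x^t)$. Hence, the function $o(\cdot)$ is a domain regressor.
This enables us to add a domain adaptation term to the objective of Equation (5), giving the following problem to solve:
$$\min_{W, V, b, c} \Big[ \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}(f(x_i^s), y_i^s) + \lambda \max_{u, d} \Big( -\frac{1}{m} \sum_{i=1}^{m} \mathcal{L}^d(o(h(x_i^s)), 1) - \frac{1}{m'} \sum_{i=1}^{m'} \mathcal{L}^d(o(h(x_i^t)), 0) \Big) \Big], \tag{8}$$
where the hyper-parameter $\lambda > 0$ weights the domain adaptation regularization term and
$$\mathcal{L}^d(o, z) = -z \log(o) - (1 - z) \log(1 - o).$$
In line with Theorem 2, this optimization problem implements a trade-off between the minimization of the source risk $\hat{R}_S(\cdot)$ and the divergence $\hat{d}_{\mathcal{H}}(\cdot, \cdot)$. The hyper-parameter $\lambda$ is then used to tune the trade-off between these two quantities during the learning process.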
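The quantity that the domain regressor tries to maximize can be evaluated directly once the hidden representations are available. The sketch below is illustrative only: it assumes a logistic domain regressor with weights `u` and bias `d`, a cross-entropy domain loss, and the convention that source examples carry domain label 1 and target examples label 0. All function names are hypothetical.

```python
import math


def domain_regressor(phi, u, d):
    """Estimated probability that representation phi comes from the source domain."""
    score = d + sum(uj * pj for uj, pj in zip(u, phi))
    return 1.0 / (1.0 + math.exp(-score))


def domain_loss(o, z):
    """Cross-entropy between the regressor output o and the domain label z."""
    return -z * math.log(o) - (1 - z) * math.log(1.0 - o)


def adaptation_term(h_source, h_target, u, d):
    """Negated average domain loss on both samples (assumed labels: source 1, target 0)."""
    src = sum(domain_loss(domain_regressor(p, u, d), 1) for p in h_source) / len(h_source)
    tgt = sum(domain_loss(domain_regressor(p, u, d), 0) for p in h_target) / len(h_target)
    return -src - tgt
```

An uninformative regressor (`u` and `d` at zero) gives the value $-2\log 2$, while a regressor that separates the two sets of representations pushes the value toward 0. The hidden layer is trained to keep this value low, that is, to keep the best achievable domain loss high.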
3.3 Learning Algorithm (DANN)
We see that Equation (8) involves a maximization operation. Hence, the neural network (parametrized by $\{W, V, b, c\}$) and the domain regressor (parametrized by $\{u, d\}$) are competing against each other, in an adversarial way, for that term. The obtained domain adversarial neural network (DANN) is illustrated by Figure 1. In DANN, the hidden layer $h(\cdot)$ maps an example (either source or target) into a representation in which the output layer $f(\cdot)$ accurately classifies the source sample, while the domain regressor $o(\cdot)$ is unable to detect if an example belongs to the source sample or the target sample.
To optimize Equation (8), one option would be to follow a hard-EM approach, where we would alternate between optimizing until convergence the adversarial parameters $\{u, d\}$ and the other regular neural network parameters $\{W, V, b, c\}$. However, we have found that a simpler stochastic gradient descent (SGD) approach is sufficient and works well in practice. Here, an SGD approach consists in sampling a pair of source and target examples $x_i^s, x_j^t$ and performing a gradient step update of all parameters of DANN. Crucially, while the update of the regular parameters follows as usual the opposite direction of the gradient, for the adversarial parameters the step must follow the gradient’s direction (since we maximize with respect to them, instead of minimizing). The algorithm is detailed in Algorithm 1. In the pseudocode, we use $e(y)$ to represent a “one-hot” vector, consisting of all $0$s except for a $1$ at position $y$. Also, $\odot$ is the element-wise product.
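The update described above can be sketched in pure Python as follows. This is a hand-derived illustration under stated assumptions, not a transcription of Algorithm 1: it assumes a sigmoid hidden layer, a softmax output, a logistic domain regressor with source label 1 and target label 0, and a parameter dictionary `p` with hypothetical keys `W`, `b`, `V`, `c`, `u`, `d`. The regular parameters take a descent step on the classification loss plus `lam` times the regressor's log-likelihood, while `u` and `d` take an ascent step on that same log-likelihood.

```python
import math


def sigm(a):
    return 1.0 / (1.0 + math.exp(-a))


def dot(p, q):
    return sum(pi * qi for pi, qi in zip(p, q))


def hidden(x, W, b):
    return [sigm(bj + dot(Wj, x)) for Wj, bj in zip(W, b)]


def softmax(a):
    top = max(a)
    e = [math.exp(ai - top) for ai in a]
    total = sum(e)
    return [ei / total for ei in e]


def dann_sgd_step(p, xs, ys, xt, lam, lr):
    """In-place update of p from one labeled source example (xs, ys) and one target example xt."""
    W, b, V, c, u = p["W"], p["b"], p["V"], p["c"], p["u"]
    hs, ht = hidden(xs, W, b), hidden(xt, W, b)
    f = softmax([ck + dot(Vk, hs) for Vk, ck in zip(V, c)])
    o_s, o_t = sigm(p["d"] + dot(u, hs)), sigm(p["d"] + dot(u, ht))
    n_h = len(hs)

    # Gradient of -log f_ys(xs) with respect to the output pre-activations.
    dc = [fk - (1.0 if k == ys else 0.0) for k, fk in enumerate(f)]
    back = [sum(dc[k] * V[k][j] for k in range(len(V))) for j in range(n_h)]
    # Hidden pre-activation gradients: classification part plus adversarial part
    # lam * d/dh [log o(h(xs)) + log(1 - o(h(xt)))].
    gs = [(back[j] + lam * (1.0 - o_s) * u[j]) * hs[j] * (1.0 - hs[j]) for j in range(n_h)]
    gt = [-lam * o_t * u[j] * ht[j] * (1.0 - ht[j]) for j in range(n_h)]
    # Gradients of the same log-likelihood with respect to the domain regressor.
    du = [lam * ((1.0 - o_s) * hs[j] - o_t * ht[j]) for j in range(n_h)]
    dd = lam * ((1.0 - o_s) - o_t)

    # Regular parameters: step against the gradient.
    for k in range(len(V)):
        c[k] -= lr * dc[k]
        for j in range(n_h):
            V[k][j] -= lr * dc[k] * hs[j]
    for j in range(n_h):
        b[j] -= lr * (gs[j] + gt[j])
        for i in range(len(xs)):
            W[j][i] -= lr * (gs[j] * xs[i] + gt[j] * xt[i])
        # Adversarial parameters: step along the gradient.
        u[j] += lr * du[j]
    p["d"] += lr * dd
```

Setting `lam` to 0 leaves `u` and `d` untouched and reduces the step to ordinary SGD on the source classification loss.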
For each experiment described in this paper, we used early stopping as the stopping criterion: we hold out part of the labeled source sample as a validation set and use the rest as the training set. We stop the learning process when the risk on the validation set is minimal.
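The early-stopping protocol can be sketched with two small helpers. The split proportion is exposed as a hypothetical `validation_fraction` argument (no specific value is assumed here), and `best_epoch` simply returns the epoch at which a recorded sequence of validation risks is lowest.

```python
import random


def split_source(sample, validation_fraction, seed=0):
    """Shuffle the labeled source sample and split it into (training set, validation set)."""
    shuffled = list(sample)
    random.Random(seed).shuffle(shuffled)
    n_val = int(round(validation_fraction * len(shuffled)))
    return shuffled[n_val:], shuffled[:n_val]


def best_epoch(validation_risks):
    """Index of the epoch with the lowest validation risk (the first one on ties)."""
    return min(range(len(validation_risks)), key=validation_risks.__getitem__)
```

In practice one would keep a copy of the parameters at each epoch (or at each improvement) and restore the copy selected by `best_epoch`.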
4 Related Work
As mentioned previously, the general approach of achieving domain adaptation by learning a new data representation has been explored under many facets. A large part of the literature, however, has focused mainly on linear hypotheses (see for instance BlitzerMP06; BruzzoneM10S; pbda; BaktashmotlaghM2013; CortesM14). More recently, non-linear representations have become increasingly studied, including neural network representations (Glorot+al-ICML-2011; LiY2014) and most notably the state-of-the-art mSDA (Chen12). That literature has mostly focused on exploiting the principle of robust representations, based on the denoising autoencoder paradigm (VincentP2008). One of the contributions of this work is to show that minimizing domain discriminability is another principle that is complementary to robustness and can improve cross-domain adaptation.
What distinguishes this work from most of the domain adaptation literature is DANN’s inspiration from the theoretical work of BenDavid-NIPS06; BenDavid-MLJ2010. Indeed, DANN directly optimizes the notion of $\mathcal{H}$-divergence. We do note the work of HuangY12, in which HMM representations are learned for word tagging using a posterior regularizer that is also inspired by BenDavid-MLJ2010’s work. In addition to the tasks being different (word tagging versus sentiment classification), we would argue that DANN’s learning objective more closely optimizes the $\mathcal{H}$-divergence, with HuangY12 relying on cruder approximations for efficiency reasons.
The idea of learning representations that are indiscriminate of some auxiliary label has also been explored in other contexts. For instance, ZemelR2013 propose the notion of fair representations, which are indiscriminate to whether an example belongs to some identified set of groups. The resulting algorithm is different from DANN and not directly derived from the $\mathcal{H}$-divergence.
Finally, we mention some related research on using an adversarial (minimax) formulation to learn a model, such as a classifier or a neural network, from data. There has been work on learning linear classifiers that are robust to changes in the input distribution, based on a minimax formulation (BagnellD2005; LiuA2014). This work however assumes that a good feature representation of the input for a linear classifier is available and doesn’t address the problem of learning it. We also note the work of GoodfellowI2014, who propose generative adversarial networks to learn a good generative model of the true data distribution. This work shares with DANN the use of an adversarial objective, applying it instead to the unsupervised problem of generative modeling.
5.1 Toy Problem
As a first experiment, we study the behavior of the proposed DANN algorithm on a variant of the inter-twinning moons 2D problem, where the target distribution is a rotation of the source distribution. For the source sample $S$, we generate a lower moon and an upper moon, one for each of the two class labels, each containing the same number of examples. The target sample $T$ is obtained by generating a sample in the same way as $S$ (without keeping the labels) and then rotating each example by a fixed angle. Thus, $T$ contains as many examples as $S$, all unlabeled. In Figure 2, the examples from $S$ are drawn with a distinct marker for each class, and the examples from $T$ are drawn as black dots.
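A dataset of this kind can be generated with a few lines of standard-library Python. The sketch below is an assumption-laden illustration: it uses a common two-moons parameterization, assigns label 1 to the upper moon and 0 to the lower moon, and leaves the sample size, noise level, and rotation angle as arguments rather than fixing them.

```python
import math
import random


def make_moons(n_per_moon, noise=0.05, seed=0):
    """Return (points, labels) for two interleaved half-circles with Gaussian noise."""
    rng = random.Random(seed)
    points, labels = [], []
    for i in range(n_per_moon):
        t = math.pi * i / max(n_per_moon - 1, 1)
        # Upper moon (assumed label 1).
        points.append((math.cos(t) + rng.gauss(0.0, noise),
                       math.sin(t) + rng.gauss(0.0, noise)))
        labels.append(1)
        # Lower moon (assumed label 0), shifted so that the two moons interleave.
        points.append((1.0 - math.cos(t) + rng.gauss(0.0, noise),
                       0.5 - math.sin(t) + rng.gauss(0.0, noise)))
        labels.append(0)
    return points, labels


def rotate(points, degrees):
    """Rotate 2D points counter-clockwise around the origin."""
    a = math.radians(degrees)
    ca, sa = math.cos(a), math.sin(a)
    return [(ca * x - sa * y, sa * x + ca * y) for x, y in points]
```

A target sample is then obtained by drawing a second set of moons with a different seed, discarding its labels, and passing the points through `rotate`.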
We study the adaptation capability of DANN by comparing it to the standard NN. In our experiments, both algorithms share the same network architecture, with a hidden layer of 15 neurons. We even train NN using the same procedure as DANN. That is, we keep updating the domain regressor component using the target sample $T$ (with the same value of the hyper-parameter $\lambda$ as for DANN), but we disable the adversarial back-propagation into the hidden layer. To do so, we execute Algorithm 1 while omitting the lines numbered 21 and 28. In this way, we obtain a NN learning algorithm (based on the source risk minimization of Equation (5)) that simultaneously trains the domain regressor of Equation (7) to discriminate between source and target domains. Using this toy experiment, we first illustrate how DANN adapts its decision boundary compared to NN. We also illustrate how the representation given by the hidden layer is less suited to the domain classification task with DANN than it is with NN. The results are illustrated in Figure 2, where the graphs in part (a) relate to the standard NN, and the graphs in part (b) relate to DANN. By looking at the corresponding (a) and (b) graphs in each column, we compare NN and DANN from four different perspectives, described in detail below.
Label classification. The first column of Figure 2 shows the decision boundaries of DANN and NN on the problem of predicting the labels of both the source and the target examples. As expected, NN accurately classifies the two classes of the source sample $S$, but is not fully adapted to the target sample $T$. On the contrary, the decision boundary of DANN perfectly classifies examples from both source and target samples. DANN clearly adapts here to the target distribution.
Representation PCA. To analyze how the domain adaptation regularizer affects the representation $h(\cdot)$ provided by the hidden layer, the second column of Figure 2 presents a principal component analysis (PCA) on the set of all representations of source and target data points, i.e., $h(S) \cup h(T)$. Thus, given the trained network (NN or DANN), every point from $S$ and $T$ is mapped into a 15-dimensional feature space through the hidden layer, and projected back into a two-dimensional plane defined by the first two principal components. In the DANN-PCA representation, we observe that target points are homogeneously spread out among the source points. In the NN-PCA representation, clusters of target points containing very few source points are clearly visible. Hence, the task of labeling the target points seems easier to perform on the DANN-PCA representation.
To push the analysis further, four crucial data points identified by A, B, C and D in the graphs of the first column (which correspond to the moon extremities in the original space) are represented again on the graphs of the second column. We observe that points A and B are very close to each other in the NN-PCA representation, while they clearly belong to different classes. The same happens to points C and D. Conversely, these four points are located at opposite corners in the DANN-PCA representation. Note also that the target point A (resp. D) – which is difficult to classify in the original space – is located in the “”cluster (resp. “”cluster) in the DANN-PCA representation. Therefore, the representation promoted by DANN is more suited for the domain adaptation task.
Domain classification. The third column of Figure 2 shows the decision boundary on the domain classification problem, which is given by the domain regressor $o(\cdot)$ of Equation (7). More precisely, $x$ is classified as a source example when $o(h(x)) > 0.5$, and is classified as a target example otherwise. Remember that, during the learning process of DANN, the regressor $o(\cdot)$ struggles to discriminate between source and target domains, while the hidden representation $h(\cdot)$ is adversarially updated to prevent it from succeeding. As explained above, we trained the domain regressor during the learning process of NN, but without allowing it to influence the learned representation $h(\cdot)$.
On one hand, the DANN domain regressor utterly fails to discriminate the source and target distributions. On the other hand, the NN domain regressor shows a better (although imperfect) discriminant. This again corroborates that the DANN representation doesn’t allow discriminating between domains.
Hidden neurons. In the plot of the last column of Figure 2, the lines show the decision surfaces of the hidden layer neurons (defined by Equation (4)). In other words, each of the fifteen plotted lines corresponds to the points $x$ for which the $i$-th component of $h(x)$ equals $1/2$, for $i \in \{1, \ldots, 15\}$.
We observe that the neurons of NN are grouped in three clusters, each one generating a straight-line part of the curved decision boundary for the label classification problem. However, most of these neurons are also able to (roughly) capture the rotation angle of the domain classification problem. Hence, we observe that the adaptation regularizer of DANN prevents these kinds of neurons from being produced. It is indeed striking to see that the two predominant patterns in the NN neurons (i.e., the two parallel lines crossing the plane from the lower left corner to the upper right corner) are absent among DANN neurons.
5.2 Sentiment Analysis Dataset
In this section, we compare the performance of our proposed DANN algorithm to a standard neural network with one hidden layer (NN) described by Equation (5), and a Support Vector Machine (SVM) with a linear kernel. To select the hyper-parameters of each of these algorithms, we use grid search and a very small validation set, which consists of labeled examples from the target domain. Finally, we select the classifiers having the lowest target validation risk.
We compare the algorithms on the Amazon reviews dataset, as pre-processed by Chen12. This dataset includes four domains, each one composed of reviews of a specific kind of product (books, dvd disks, electronics, and kitchen appliances). Reviews are encoded in dimensional feature vectors of unigrams and bigrams, and labels are binary: “” if the product is ranked up to stars, and “” if the product is ranked or stars.
We perform twelve domain adaptation tasks. For example, “books → dvd” corresponds to the task for which books is the source domain and dvd disks the target one. All learning algorithms are given labeled source examples and unlabeled target examples. Then, we evaluate them on separate target test sets, whose sizes vary across target domains. Note that NN and SVM don’t use the unlabeled target sample for learning. Here are more details about the procedure used for each learning algorithm.
DANN. The adaptation parameter is chosen among 9 values between and on a logarithmic scale. The hidden layer size is either or . Finally, the learning rate is fixed at .
NN. We use exactly the same hyper-parameters and training procedure as DANN above, except that we don’t need an adaptation parameter. Note that one can train NN by using the DANN implementation (Algorithm 1) with $\lambda = 0$.
SVM. The hyper-parameter of the SVM is chosen among 10 values between and on a logarithmic scale. This range of values is the same used by Chen12 in their experiments.
The “Original data” part of Table 2(a) shows the target test risk of all algorithms, and Table 2(b) reports the probability that one algorithm is significantly better than another according to the Poisson binomial test (lacoste-2012). We note that DANN has a significantly better performance than NN and SVM, with respective probabilities 0.90 and 0.97. As the only difference between DANN and NN is the domain adaptation regularizer, we conclude that our approach successfully helps to find a representation suitable for the target domain.
5.3 Combining DANN with Autoencoders
We now wonder whether our DANN algorithm can improve on the representation learned by the state-of-the-art Marginalized Stacked Denoising Autoencoders (mSDA) proposed by Chen12. In brief, mSDA is an unsupervised algorithm that learns a new robust feature representation of the training samples. It takes the unlabeled parts of both source and target samples to learn a feature map from the input space to a new representation space. As a denoising autoencoder, it finds a feature representation from which one can (approximately) reconstruct the original features of an example from its noisy counterpart. Chen12 showed that using mSDA with a linear SVM classifier gives state-of-the-art performance on the Amazon reviews datasets. As an alternative to the SVM, we propose to apply our DANN algorithm on the same representations generated by mSDA (using representations of both source and target samples). Note that, even if mSDA and DANN are two representation learning approaches, they optimize different objectives, which can be complementary.
We perform this experiment on the same amazon reviews dataset described in the previous subsection. For each pair source-target, we generate the mSDA representations using a corruption probability of and a number of layers of . We then execute the three learning algorithms (DANN, NN, and SVM) on these representations. More precisely, following the experimental procedure of Chen12, we use the concatenation of the output of the layers and the original input as the new representation. Thus, each example is now encoded in a vector of dimensions. Note that we use the same grid search as in Subsection 5.2, but with a learning rate of for both DANN and NN. The results of “mSDA representation” columns in Table 2(a) confirm that combining mSDA and DANN is a sound approach. Indeed, the Poisson binomial test shows that DANN has a better performance than NN and SVM with probabilities 0.82 and 0.88 respectively, as reported in Table 2(b).
5.4 Proxy A-distance
The theoretical foundation of DANN is the domain adaptation theory of BenDavid-NIPS06; BenDavid-MLJ2010. We claimed that DANN finds a representation in which the source and the target examples are hardly distinguishable. Our toy experiment of Section 5.1 already provides some evidence of this, but we want to confirm it on real data. To do so, we compare the Proxy A-distance (PAD) on various representations of the Amazon Reviews dataset. These representations are obtained by running either NN, DANN, mSDA, or mSDA and DANN combined. Recall that PAD, as described in Section 2.2, is a metric estimating the similarity of the source and the target representations. More precisely, to obtain a PAD value, we use the following procedure: (1) we construct the dataset $U$ of Equation (2) using both source and target representations of the training samples; (2) we randomly split $U$ in two subsets of equal size; (3) we train linear SVMs on the first subset of $U$ using a large range of $C$ values; (4) we compute the error of all obtained classifiers on the second subset of $U$; and (5) we use the lowest error to compute the PAD value of Equation (3).
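The five steps above can be sketched as follows. To keep the example free of external dependencies, a nearest-centroid rule stands in for the linear SVM; this is a swapped-in technique, not the classifier used in the experiments. The `trainers` argument is a hypothetical list of fitting functions, which in a faithful reproduction would hold one linear SVM per candidate hyper-parameter value, and the domain labels (source 1, target 0) are an assumed convention.

```python
import random


def nearest_centroid_trainer(X, z):
    """Stand-in linear classifier (not a linear SVM): predict the label of the closer class mean."""
    def centroid(label):
        rows = [x for x, zi in zip(X, z) if zi == label]
        return [sum(col) / len(rows) for col in zip(*rows)]
    c0, c1 = centroid(0), centroid(1)

    def predict(x):
        d0 = sum((a - b) ** 2 for a, b in zip(x, c0))
        d1 = sum((a - b) ** 2 for a, b in zip(x, c1))
        return 1 if d1 < d0 else 0
    return predict


def pad_from_representations(source_reps, target_reps, trainers, seed=0):
    """PAD estimate following steps (1) to (5) of the procedure."""
    # (1) Build the domain-labeled dataset (assumed labels: source 1, target 0).
    U = [(x, 1) for x in source_reps] + [(x, 0) for x in target_reps]
    # (2) Random split into two halves.
    random.Random(seed).shuffle(U)
    half = len(U) // 2
    train, test = U[:half], U[half:]
    X_train, z_train = [x for x, _ in train], [z for _, z in train]
    # (3) and (4) Fit every candidate on the first half, measure its error on the second.
    errors = []
    for fit in trainers:
        predict = fit(X_train, z_train)
        errors.append(sum(predict(x) != z for x, z in test) / len(test))
    # (5) Plug the lowest error into the PAD formula.
    return 2.0 * (1.0 - 2.0 * min(errors))
```

On two well-separated clouds of points the stand-in classifier makes no error on the held-out half, so the estimate reaches the maximal value of 2.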
Firstly, Figure 3(a) compares the PAD of DANN representations obtained in the experiments of Section 5.2 (using the hyper-parameters values leading to the results of Table 1) to the PAD computed on raw data. As expected, the PAD values are driven down by the DANN representations.
Secondly, Figure 3(b) compares the PAD of DANN representations to the PAD of standard NN representations. As the PAD is influenced by the hidden layer size (the discriminating power tends to increase with the dimension of the representation), we fix here the size to neurons for both algorithms. We also fix the adaptation parameter of DANN to as it was the value that has been selected most of the time during our preceding experiments on the Amazon Reviews dataset. Again, DANN is clearly leading to the lowest PAD values.
Lastly, Figure 3(c) presents two sets of results related to the experiments of Section 5.3. On one hand, we reproduce the results of Chen12, who noticed that the mSDA representations gave greater PAD values than those obtained with the original (raw) data. Although the mSDA approach clearly helps to adapt to the target task, this observation seems to contradict the theory of BenDavid-NIPS06. On the other hand, we observe that, when running DANN on top of mSDA (using the hyper-parameter values leading to the results of Table 1), the obtained representations have much lower PAD values. These observations might explain the improvements provided by DANN when combined with the mSDA procedure.
6 Conclusion and Future Work
In this paper, we have proposed a neural network algorithm, named DANN, that is strongly inspired by the domain adaptation theory of BenDavid-NIPS06; BenDavid-MLJ2010. The main idea behind DANN is to encourage the network’s hidden layer to learn a representation which is predictive of the source example labels, but uninformative about the domain of the input (source or target). Extensive experiments on the inter-twinning moons toy problem and the Amazon reviews sentiment analysis dataset have shown the effectiveness of this strategy. Notably, we achieved state-of-the-art performance when combining DANN with the mSDA autoencoders of Chen12, which turned out to be two complementary representation learning approaches.
We believe that the domain adaptation regularizer that we develop for the DANN algorithm can be incorporated into many other learning algorithms. Natural extensions of our work would be deeper network architectures, multi-source adaptation problems and other learning tasks beyond the basic binary classification setting. We also intend to meld the DANN approach with denoising autoencoders, to potentially improve on the two-step procedure of Section 5.3.