Domain adaptation performance of a learning algorithm on a target domain is a function of its source domain error and a divergence measure between the data distributions of the two domains. We present a study of various distance-based measures, in the context of NLP tasks, that characterize the dissimilarity between domains based on sample estimates. We first conduct analysis experiments to show which of these distance measures can best differentiate samples from same versus different domains, and are correlated with empirical results. Next, we develop a DistanceNet model which uses these distance measures, or a mixture of these distance measures, as an additional loss function to be minimized jointly with the task's loss function, so as to achieve better unsupervised domain adaptation. Finally, we extend this model to a novel DistanceNet-Bandit model, which employs a multi-armed bandit controller to dynamically switch between multiple source domains and allow the model to learn an optimal trajectory and mixture of domains for transfer to the low-resource target domain. We conduct experiments on popular sentiment analysis datasets with several diverse domains and show that our DistanceNet model, as well as its dynamic bandit variant, can outperform competitive baselines in the context of unsupervised domain adaptation.
In situations where large-scale annotated datasets are available, supervised learning algorithms have achieved remarkable progress in various NLP challenges [lecun2015deep]. Most supervised learning algorithms rely on the assumption that the data distribution during training is the same as that at test time. However, in many real-life scenarios, the data distribution of interest at test time might be different from that during training. The process of collecting new datasets that reflect the new distribution is usually not scalable due to monetary as well as time constraints. Hence, the goal of domain adaptation is to construct a learning algorithm which, given samples of observations from a source domain, is able to adapt its performance to a target domain where the data distribution could be different.
Two major research areas in domain adaptation are supervised domain adaptation and unsupervised domain adaptation. In the former setup, limited training data from the target domain is available to provide supervision signals [daume2009frustratingly], whereas in the latter case, only unlabeled data from the target domain is available [ganin2016domain, long2017deep, bousmalis2016domain, sun2016return, sun2016deep, tzeng2017adversarial]. In this work, we focus on unsupervised domain adaptation. It has been shown that domain adaptation performance is influenced by three major (and orthogonal) factors [ben2010theory]. The first factor is the model performance on the source task, which benefits from recent advancements in neural models and is orthogonal to our focus. The second factor is the difference in the labeling functions across domains, which is inherent to the nature of the dataset and expected to be small in practice [ben2010theory]. The third factor represents a measure of divergence of data distributions: if the data distributions of the source and target domains are similar, we can reasonably expect a model trained on the source domain to perform well on the target domain. Our work primarily focuses on the last factor and aims to study the following two questions in the context of NLP: how to accurately estimate the dissimilarity between a pair of domains (Sec. 3 and Sec. 6), and how to leverage these domain dissimilarity measures to improve domain adaptation learning (Sec. 4 and Sec. 7).
To this end, we first provide a detailed study (comparison, models, and analyses) of several domain distance measures from the literature, chosen with the goals of scalability (easy to calculate), differentiability (can be minimized), and interpretability (a simple analytical form with well-studied properties), namely $L_2$, Maximum Mean Discrepancy (MMD) [gretton2012kernel], Fisher Linear Discriminant (FLD) [friedman2001elements], Cosine, and Correlation Alignment (CORAL) [sun2016return]. We start by defining these distance measures in Sec. 3, and provide a set of analyses to assess them in Sec. 6: (1) the ability of these distance measures to separate domains, and (2) the correlation between these distance measures and empirical results. From these analyses, we note that there does not exist a single best distance measure that fits all, and each measure provides an estimate of domain distance that could be complementary (e.g., based on discrepancy versus class separation). Thus, we also propose to use a mixture of distance measures, where we additionally introduce an unsupervised criterion to select the best distance measures so as to reduce the number of extra weight hyperparameters when mixing them.
Motivated by the aforementioned analysis, we next present a simple ‘DistanceNet’ model (in Sec. 4) that integrates these measures into the training optimization. In particular, we augment the classification task loss function with an additional distance measure. By minimizing the representational distances between features from source and target domains, the model learns better domain-agnostic features. Finally, when data from multiple source domains are present, we learn a dynamic scheduling of these domains that maximizes the learning performance on the no-training target task by framing the problem of dynamic domain selection as a multi-armed bandit problem, where each arm represents a candidate source domain.
We conduct our analyses and experiments on a popular sentiment analysis dataset with several diverse domains from [liu2017adversarial], and present the domain adaptation results in Sec. 7. We first show that a subset of the domain discrepancy measures is able to separate samples from source and target domains. Then we show that our DistanceNet model, which uses one or a mixture of multiple domain discrepancies as an extra loss term, can outperform multiple competitive baselines. Finally, we show that our dynamic, bandit variant of the DistanceNet can also outperform a fairly comparable multi-source baseline that has access to the same amount of data.
Building an algorithm for domain adaptation is an open theoretical as well as practical problem [blitzer2006domain, pan2010survey, glorot2011domain, blitzer2011domain, kulis2011you, saito2018maximum, kuroki2019unsupervised, lee2019domain]. (Footnote 1: Due to the AAAI page limit, we discuss the primary related work here, but we will add an extended version in the arXiv version.) When labeled data from the target domain is available, supervised domain adaptation can achieve state-of-the-art results via fine-tuning, especially when the source domain has orders of magnitude more data than the target domain [devlin2018bert, radford2018improving]. For unsupervised domain adaptation (no labels for target domains), there exist multiple approaches that have achieved remarkable progress, such as instance selection/reweighting [huang2007correcting, gong2013connecting, remus2012domain] and feature space transformation [pan2011domain, baktashmotlagh2013unsupervised]. In this work we mainly focus on measuring domain discrepancy.
The works of [kifer2004detecting], [ben2007analysis], and [ben2010theory] provide an upper bound on the performance of a classifier under domain shift. They introduce the idea of training a binary classifier to distinguish samples from the source and target domains, where the classifier's error provides an estimate of the discrepancy between domains. A tractable approximation, the proxy $\mathcal{A}$-distance, trains a linear classifier to minimize a modified Huber loss [ben2007analysis].
Recent works further aim to provide more efficient estimates of the domain discrepancy. One popular choice is matching the distribution means in a reproducing kernel Hilbert space (RKHS) [huang2007correcting, gong2013connecting, tzeng2014deep, long2015learning, bousmalis2016domain, long2016unsupervised, long2017deep, zellinger2017central, rozantsev2019beyond] using Maximum Mean Discrepancy (MMD) [gretton2012kernel]. These methods have also been used in generative models [li2015generative, dziugaite2015training]. Other methods explored in the literature include central moment discrepancy (CMD) [zellinger2017central], correlation alignment (CORAL) [sun2016return, sun2016deep], canonical correlation analysis (CCA) [blitzer2011domain] [benaim2017one]. In addition to these directly-computable metrics, another successful approach is to encourage learned representations to fool a classifier whose goal is to distinguish samples from the source domain and target domain [ganin2016domain, shen2017wasserstein].
When multiple domain adaptation criteria are available, [ruder2017learning] use Bayesian optimization to decide the choice of metric, and [ying2018transfer] use a meta-learning formulation. In our work, we provide a study of multiple domain distance measures (introduced in statistical learning/vision communities) in the context of NLP classification tasks such as sentiment analysis, where we analyze the domain-separability skills of these metrics and explore multiple ways of integrating them into the training dynamics (e.g., in the loss and as a multi-armed bandit).
Many problems can be cast as a multi-armed bandit problem. For example, [graves2017automated] use a multi-armed bandit (MAB) [bubeck2012regret] to learn a curriculum of tasks to maximize learning efficiency, [sharma2017online] use MAB to choose which domain of data to feed as input to a single model (in the context of Atari games), and [guo2019autosem] use MAB for task selection during multi-task learning of text classification. In our work, we instead use a MAB controller with upper confidence bound (UCB) [auer2002finite] for the task of multi-source domain selection for domain adaptation.
In Sec. 1, we described that domain adaptation performance is related to domain distance/dissimilarity. Here, we will first describe our individual distance measures. Then we will describe our mixture of distances. Later in Sec. 6, we will provide detailed analysis of these distance measures. Given source domain samples $X_s = \{x_s^{(i)}\}_{i=1}^{n_s}$ as well as target domain samples $X_t = \{x_t^{(j)}\}_{j=1}^{n_t}$, where we assume $x_s^{(i)}, x_t^{(j)} \in \mathbb{R}^d$ are the embedding representations of the input data (e.g., sentences) produced from some feature extractors (e.g., LSTM-RNN), the goal of the distance measure is to estimate how different these two domains are. We will introduce five such methods: $L_2$ distance, Cosine distance, Maximum Mean Discrepancy (MMD), Fisher Linear Discriminant (FLD), as well as CORAL. (Footnote 2: We also experimented with the proxy $\mathcal{A}$-distance from [ben2007analysis], which scored favorably on most of our evaluations. However, due to its non-differentiable nature as well as high computation cost, we do not include it here.)
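The $L_2$ distance measures the Euclidean distance between source domain and target domain samples. Define the source and target sample means $\mu_s = \frac{1}{n_s} \sum_{i=1}^{n_s} x_s^{(i)}$ and $\mu_t = \frac{1}{n_t} \sum_{j=1}^{n_t} x_t^{(j)}$; the $L_2$ distance is: $D_{L_2}(X_s, X_t) = \| \mu_s - \mu_t \|_2$.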
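Cosine similarity is a measure of similarity between two vectors of an inner product space that measures the cosine of the angle between these vectors, here the source and target sample means $\mu_s$ and $\mu_t$: $\cos(\mu_s, \mu_t) = \frac{\mu_s^\top \mu_t}{\| \mu_s \|_2 \, \| \mu_t \|_2}$, and the cosine distance is $D_{\cos}(X_s, X_t) = 1 - \cos(\mu_s, \mu_t)$.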
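As a minimal sketch of these two measures, the snippet below assumes that both are computed between the mean embeddings of the source and target samples. The helper names (`mean_vector`, `l2_distance`, `cosine_distance`) and the toy vectors are illustrative and not taken from the original work.

```python
import math

def mean_vector(samples):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(samples)
    return [sum(column) / n for column in zip(*samples)]

def l2_distance(xs, xt):
    """Euclidean distance between the source and target mean embeddings."""
    mu_s, mu_t = mean_vector(xs), mean_vector(xt)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(mu_s, mu_t)))

def cosine_distance(xs, xt):
    """One minus the cosine of the angle between the two mean embeddings."""
    mu_s, mu_t = mean_vector(xs), mean_vector(xt)
    dot = sum(a * b for a, b in zip(mu_s, mu_t))
    norm_s = math.sqrt(sum(a * a for a in mu_s))
    norm_t = math.sqrt(sum(b * b for b in mu_t))
    return 1.0 - dot / (norm_s * norm_t)

# Toy 2-d "embeddings" (hypothetical): the two means are orthogonal.
source = [[1.0, 0.0], [3.0, 0.0]]  # mean (2, 0)
target = [[0.0, 1.0], [0.0, 3.0]]  # mean (0, 2)
print(l2_distance(source, target), cosine_distance(source, target))
```

Both functions only look at first-order statistics, so two domains with identical means but different shapes would be assigned zero distance.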
Given two sets of source domain and target domain samples drawn independently and identically distributed (i.i.d.) from distributions $P$ and $Q$, respectively, a two-sample test distinguishes between the null hypothesis $H_0: P = Q$ and the alternative hypothesis $H_1: P \neq Q$ by comparing a test statistic, which is described next. Maximum Mean Discrepancy, or MMD [gretton2012kernel], also known as the kernel two-sample test, is a frequentist estimator for answering the above question. MMD works by comparing statistics between the two samples; if they are similar, then the samples are likely to come from the same distribution. This is known as an integral probability metric (IPM) [muller1997integral] in the statistics literature. Formally, let $\mathcal{F}$ be a class of functions $f: \mathcal{X} \to \mathbb{R}$; the maximum mean discrepancy is: $\mathrm{MMD}[\mathcal{F}, P, Q] = \sup_{f \in \mathcal{F}} \left( \mathbb{E}_{x \sim P}[f(x)] - \mathbb{E}_{y \sim Q}[f(y)] \right)$.
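Note that this equation involves a maximization over a family of functions. However, [gretton2012kernel] show that when the function class $\mathcal{F}$ is the unit ball in a reproducing kernel Hilbert space (RKHS) endowed with a characteristic kernel $k$, this can be solved in closed form. A corresponding unbiased finite sample estimate, for source samples $\{x_s^{(i)}\}_{i=1}^{n_s}$ and target samples $\{x_t^{(j)}\}_{j=1}^{n_t}$, is: $\mathrm{MMD}_u^2 = \frac{1}{n_s (n_s - 1)} \sum_{i \neq i'} k(x_s^{(i)}, x_s^{(i')}) + \frac{1}{n_t (n_t - 1)} \sum_{j \neq j'} k(x_t^{(j)}, x_t^{(j')}) - \frac{2}{n_s n_t} \sum_{i=1}^{n_s} \sum_{j=1}^{n_t} k(x_s^{(i)}, x_t^{(j)})$.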
For universal kernels like the Gaussian kernel with bandwidth $\sigma$, minimizing MMD is analogous to minimizing a distance between all moments of the two distributions [li2015generative]. Here we will use .
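The unbiased estimate can be sketched in a few lines. The snippet below is only an illustration under stated assumptions: it uses a Gaussian kernel of the form exp(-||x - y||^2 / (2 sigma^2)), a default bandwidth of 1.0 (the value used in the original experiments is not recoverable from the text), and hypothetical one-dimensional samples.

```python
import math

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel; sigma is an assumed bandwidth."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def mmd2_unbiased(xs, xt, sigma=1.0):
    """Unbiased estimate of squared MMD (within-sample sums skip i == j)."""
    m, n = len(xs), len(xt)
    k_ss = sum(gaussian_kernel(xs[i], xs[j], sigma)
               for i in range(m) for j in range(m) if i != j)
    k_tt = sum(gaussian_kernel(xt[i], xt[j], sigma)
               for i in range(n) for j in range(n) if i != j)
    k_st = sum(gaussian_kernel(x, y, sigma) for x in xs for y in xt)
    return k_ss / (m * (m - 1)) + k_tt / (n * (n - 1)) - 2.0 * k_st / (m * n)

# Hypothetical samples: one target set close to the source, one far away.
source = [[0.0], [0.1], [0.2]]
target_near = [[0.05], [0.15], [0.25]]
target_far = [[5.0], [5.1], [5.2]]
mmd_near = mmd2_unbiased(source, target_near)
mmd_far = mmd2_unbiased(source, target_far)
print(mmd_near, mmd_far)
```

Because the estimator is unbiased rather than non-negative by construction, it can return small negative values when the two samples are nearly indistinguishable, as happens for the nearby target set here.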
Fisher linear discriminant analysis (FLD) [friedman2001elements] finds a projection (parameterized by $w$) where class separation is maximized. In particular, the goal of FLD is to give a large separation of class means while simultaneously keeping in-class variance small. This is formulated as $\max_w J(w) = \frac{w^\top S_B w}{w^\top S_W w}$, where $S_B$ is the between-class covariance matrix which is defined as $S_B = (\mu_s - \mu_t)(\mu_s - \mu_t)^\top$, $S_W$ is the within-class covariance matrix which is defined as $S_W = \sum_{c \in \{s, t\}} \sum_{i=1}^{n_c} (x_c^{(i)} - \mu_c)(x_c^{(i)} - \mu_c)^\top$, $\mu_c$ is the class mean and $c$ here refers to source/target domain. The optimal $w$ can be solved analytically as: $w^* \propto S_W^{-1} (\mu_s - \mu_t)$. Though the optimal $w^*$ is usually desired, here we use the optimal $J(w^*)$ as a proxy of domain distance, and thus define our Fisher distance as $D_{\mathrm{FLD}}(X_s, X_t) = J(w^*) = (\mu_s - \mu_t)^\top S_W^{-1} (\mu_s - \mu_t)$, which is a measure of difference between source/target representation means normalized by a measure of within-class scatter matrix. Note that computing $J(w^*)$ is analogous to approximating the divergence between two domains by training an FLD to discriminate between unlabeled instances from source and target domains.
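A rough sketch of this Fisher distance follows, assuming the closed form in which the difference of domain means is weighted by the inverse of the within-class scatter matrix. The small `ridge` term added to the diagonal is our own assumption for numerical stability, and all function names and data are illustrative.

```python
def mean_vector(samples):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(samples)
    return [sum(column) / n for column in zip(*samples)]

def scatter(samples, mu):
    """Sum of outer products of the deviations from the mean."""
    d = len(mu)
    s = [[0.0] * d for _ in range(d)]
    for x in samples:
        diff = [a - b for a, b in zip(x, mu)]
        for i in range(d):
            for j in range(d):
                s[i][j] += diff[i] * diff[j]
    return s

def solve(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (m[i][n] - tail) / m[i][i]
    return z

def fisher_distance(xs, xt, ridge=1e-6):
    """(mu_s - mu_t)^T S_W^{-1} (mu_s - mu_t); ridge is an assumed stabilizer."""
    mu_s, mu_t = mean_vector(xs), mean_vector(xt)
    d = len(mu_s)
    s_s, s_t = scatter(xs, mu_s), scatter(xt, mu_t)
    s_w = [[s_s[i][j] + s_t[i][j] + (ridge if i == j else 0.0)
            for j in range(d)] for i in range(d)]
    diff = [a - b for a, b in zip(mu_s, mu_t)]
    return sum(a * b for a, b in zip(diff, solve(s_w, diff)))

# Hypothetical data: a unit square and two shifted copies of it.
source = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
target_far = [[x + 3.0, y] for x, y in source]
target_near = [[x + 0.5, y] for x, y in source]
fisher_far = fisher_distance(source, target_far)
fisher_near = fisher_distance(source, target_near)
print(fisher_near, fisher_far)
```

Solving the linear system instead of explicitly inverting the scatter matrix keeps the sketch short and avoids forming the inverse.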
The CORAL (correlation alignment) [sun2016deep, sun2016return] loss is defined as the distance between the second-order statistics of the source and target samples: $D_{\mathrm{CORAL}}(X_s, X_t) = \frac{1}{4 d^2} \| C_s - C_t \|_F^2$, where $\| \cdot \|_F^2$ denotes the squared matrix Frobenius norm, $d$ represents the feature dimension, and $C_s$ and $C_t$ are the covariance matrices of the source and target samples.
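For illustration, the CORAL distance can be sketched as follows, assuming the common form in which the squared Frobenius norm of the covariance difference is scaled by 1 / (4 d^2). The use of the sample covariance with an (n - 1) denominator and the toy data are our assumptions.

```python
def covariance(samples):
    """Sample covariance matrix (n - 1 denominator) of a list of vectors."""
    n = len(samples)
    mu = [sum(column) / n for column in zip(*samples)]
    d = len(mu)
    cov = [[0.0] * d for _ in range(d)]
    for x in samples:
        diff = [a - b for a, b in zip(x, mu)]
        for i in range(d):
            for j in range(d):
                cov[i][j] += diff[i] * diff[j]
    return [[value / (n - 1) for value in row] for row in cov]

def coral_distance(xs, xt):
    """Squared Frobenius norm of the covariance gap, scaled by 1 / (4 d^2)."""
    c_s, c_t = covariance(xs), covariance(xt)
    d = len(c_s)
    frob_sq = sum((c_s[i][j] - c_t[i][j]) ** 2
                  for i in range(d) for j in range(d))
    return frob_sq / (4.0 * d * d)

# Hypothetical data: the same square, once stretched and once only shifted.
source = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
target_scaled = [[2.0 * x, 2.0 * y] for x, y in source]
target_shifted = [[x + 10.0, y + 10.0] for x, y in source]
coral_scaled = coral_distance(source, target_scaled)
coral_shifted = coral_distance(source, target_shifted)
print(coral_scaled, coral_shifted)
```

The shifted copy yields a distance of zero because CORAL compares only second-order statistics, which illustrates why it can complement the mean-based measures.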
As we will demonstrate in Sec. 6, no single distance measure outperforms all the others in our analyses. Also, note that while different distance measures provide different estimates of domain distances, each distance measure has its pathological cases. For example, samples from a Gaussian distribution and a Laplace distribution with the same mean and variance might have small $L_2$ distances even though they are different, whereas MMD can differentiate between them [gretton2012kernel]. It is thus useful to consider a mixture of distances: $D_m(X_s, X_t) = \sum_k \alpha_k D_k(X_s, X_t)$, where $\alpha_k$ is the coefficient for the $k$-th distance $D_k$. While appealing at first, naively adding all the distance measures to the mixture introduces unnecessary hyper-parameters. In Sec. 6.3, we will introduce simple unsupervised criteria to only include a subset of these distance measures.
We will first describe the baseline and our DistanceNet model (based on a single source domain) which actively minimizes the distance between the source and target domain during the model training for domain adaptation. Then we introduce the multi-source variant of DistanceNet that additionally utilizes a multi-armed bandit controller to learn a dynamic curriculum of multiple source domains for training a domain adaptation model.
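Given a sequence of tokens $(w_1, \dots, w_n)$, we first embed these tokens into vector representations $(e_1, \dots, e_n)$. Let $h = f_\theta(e_1, \dots, e_n)$ be the output of the LSTM-RNN parameterized by $\theta$. The probability distribution of labels is produced by $\hat{y} = \mathrm{softmax}(g_\phi(h))$, where $g_\phi$ is a fully connected neural network with parameters $\phi$. The model is trained to minimize the cross entropy between predicted outputs and ground truth with $N$ training examples and $C$ classes: $L_{XE}(\hat{y}, y) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \log \hat{y}_{i,c}$.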
The work of [ben2010theory] shows that domain adaptation performance is related to source domain performance and source/target domain distance. The first part (source domain performance) is already handled by the cross entropy loss (Sec. 4.1), and it is thus natural to additionally encourage the model to minimize the representational distances between source and target samples. To that end, we augment the classification task’s loss function with a domain distance term. Given a sequence of tokens from the source domain $w_s$, a sequence of tokens from the target domain $w_t$, and a model parameterized by $\theta$, the new loss function for our DistanceNet (see Fig. 1) is then: $L(\theta) = L_{XE}(\hat{y}_s, y_s) + D(h_s, h_t)$, where $\hat{y}_s, y_s$ are the predicted and ground truth outputs of the source domain, $h_s, h_t$ are the representations of the source and target domains, and $D$ is the choice of distance measure from Sec. 3.
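The augmented objective can be illustrated with a small sketch that adds a distance term to the source-domain cross entropy. This is not the original implementation: the weight `beta` on the distance term is a hypothetical knob we introduce for illustration, the distance used is the mean-embedding Euclidean distance, and the probabilities and representations are made-up toy values rather than model outputs.

```python
import math

def cross_entropy(probs, labels):
    """Mean negative log-likelihood; probs[i] is a distribution over classes."""
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)

def mean_vector(samples):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(samples)
    return [sum(column) / n for column in zip(*samples)]

def l2_distance(h_s, h_t):
    """Euclidean distance between mean source and target representations."""
    mu_s, mu_t = mean_vector(h_s), mean_vector(h_t)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(mu_s, mu_t)))

def distancenet_loss(probs_s, labels_s, h_s, h_t,
                     distance=l2_distance, beta=1.0):
    """Source cross entropy plus a (weighted) source/target distance term.

    beta is an assumed weighting hyperparameter, not confirmed by the text.
    """
    return cross_entropy(probs_s, labels_s) + beta * distance(h_s, h_t)

# Toy batch: uniform predictions on two source examples, 2-d representations.
probs_s = [[0.5, 0.5], [0.5, 0.5]]
labels_s = [0, 1]
h_s = [[0.0, 0.0], [2.0, 0.0]]   # mean (1, 0)
h_t = [[4.0, 4.0], [4.0, 4.0]]   # mean (4, 4)
loss = distancenet_loss(probs_s, labels_s, h_s, h_t, beta=0.1)
print(loss)
```

Only the source examples contribute to the classification term, while the unlabeled target examples enter solely through the distance term, which is what makes the setup unsupervised with respect to the target domain.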
In the previous section, we described our method for fitting a model on a pair of source/target domains. However, when we have access to multiple source domains, we need a better way to take advantage of these extra learning signals. One simple method is to treat these multiple source domains as a single (big) source domain, and apply algorithms described previously as usual. But as the model representation changes throughout the training, the domain that can provide the most informative training signal might change over time and based on the training curriculum history. This is also related to learning importance weights [ben2010theory] of each source domain over time for the target domain. Thus, it might be more favorable to dynamically select the sequence of source domains to deliver the best outcome on the target domain task.
Here, we introduce a novel multi-armed bandit controller for dynamically changing the source domain during training (Fig. 2). We model the controller as an $M$-armed bandit (where $M$ is the number of candidate domains) whose goal is to select a sequence of actions/arms to maximize the expected future payoffs. At each round, the controller selects an action (candidate domain) based on noisy value estimates and observes a reward. More specifically, as the training progresses, the controller picks one of the training domains and has the task model train on the selected domain using the loss function specified in Eq. 2, and the performance on the validation data is used as the reward provided to the bandit as feedback. We use the upper confidence bound (UCB) [auer2002finite] bandit algorithm, which chooses the action (i.e., the source domain to use next) based on the performance upper bound: $a_t = \arg\max_{a \in \mathcal{A}} \left[ Q_t(a) + \sqrt{\frac{2 \ln t}{N_t(a)}} \right]$, where $a_t$ represents the action at iteration time $t$, $N_t(a)$ counts the number of times the action $a$ has been selected, and $\mathcal{A}$ represents the set of candidate actions (i.e., the set of candidate source domains). $Q_t(a)$ represents the action-value of the action $a$, and is calculated as the running average of rewards. (Footnote 3: One could also consider weighting each domain based on the distances, but these keep changing as DistanceNet’s training evolves (which minimizes the distance). Further, our bandit decides the arm to pull based on DistanceNet’s performance, thus already behaving similarly to the distance-weighting approach (while also automatically learning these weights as a curriculum).)
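The controller logic can be sketched as a small UCB class. This is a simplified illustration, not the original code: the class name `UCBController`, the default exploration constant (chosen to match the standard UCB1 bonus), and the fixed per-domain rewards standing in for validation performance are all assumptions.

```python
import math

class UCBController:
    """Minimal UCB bandit: one arm per candidate source domain."""

    def __init__(self, num_arms, exploration=math.sqrt(2.0)):
        self.counts = [0] * num_arms    # times each arm has been pulled
        self.values = [0.0] * num_arms  # running average reward per arm
        self.t = 0                      # total number of pulls so far
        self.exploration = exploration  # assumed exploration constant

    def select(self):
        # Pull every arm once before applying the confidence bound.
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm
        scores = [q + self.exploration * math.sqrt(math.log(self.t) / n)
                  for q, n in zip(self.values, self.counts)]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm, reward):
        self.t += 1
        self.counts[arm] += 1
        # Incremental form of the running average of observed rewards.
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

# Hypothetical fixed rewards standing in for validation accuracy after
# training on each of three source domains.
fake_rewards = [0.1, 0.9, 0.5]
controller = UCBController(num_arms=3)
for _ in range(500):
    arm = controller.select()
    controller.update(arm, fake_rewards[arm])
print(controller.counts, controller.values)
```

In an actual training loop the call to `update` would follow a training step on the selected domain and an evaluation on the small target validation set, so rewards would be noisy and non-stationary rather than fixed as in this toy run.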
Dataset: We evaluate our methods using the datasets collected by [liu2017adversarial] (Footnote 4: The datasets include an “unlabeled” split.), which contain product reviews [blitzer2007biographies] and movie reviews [maas2011learning, pang2005seeing], where the task is to classify these reviews as positive or negative. The performance of a model on this task is measured by accuracy. Since the number of experiments scales as $O(n^2)$ and $O(n)$ in the number of datasets $n$ for single- and multi-source experiments, we only evaluate on 3 and 5 datasets (Footnote 5: MR, Apparel, Baby for single-source experiments. MR, Apparel, Baby, Books, Camera for multi-source experiments.) for experiments in Sec. 7, respectively. (Footnote 6: Note that for $n$ tasks, there will be on the order of $n^2$ source/target domain pair experiments, and $n$ multi-source/single-target experiments.) However, we still use the full set of domains for the analysis in Sec. 6.
Training Details: Our baseline model is similar to that of [liu2017adversarial]. We use a single-layer bi-directional LSTM-RNN as the sentence encoder and a two-layer fully-connected network with a non-linearity layer to produce the final model outputs. The word embeddings are initialized with GloVe [pennington2014glove]. We train the model using the Adam optimizer [kingma2014adam]. Following [ruder2017learning] and [bousmalis2016domain], we chose to use a small number of target domain examples as the validation set (for both tuning as well as providing rewards for the multi-armed bandit controller). (Footnote 7: Note that the two models in Table 5 should be fairly comparable, since they have access to the same validation dataset for tuning or “refining” their hyper-parameters or as weak reward feedback. Further, there are scenarios in which querying the scalar rewards on a small validation dataset is easier than accessing the rich gradient information through them [bousmalis2016domain].) We use the adaptive experimentation platform Ax (Footnote 8: https://github.com/facebook/Ax) to tune the rest of the hyperparameters, and the search space for these hyperparameters is: learning rate , dropout rate , , and . We run each model for times. We use the average validation performance as our validation criterion, and report average test performance.
Given our 5 distance measures (described in Sec. 3), we first want to ask which of these distance measures are able to measure domain (dis)similarities.
Specifically, we conduct experiments to answer the following questions:
Q1. Is the distance measure able to differentiate samples from the same versus different domains?
Q2. Does the distance measure correlate well with empirical results?
These two questions are answered next in Sec. 6.1 and Sec. 6.2, respectively. After that, we will describe our unsupervised criteria for choosing a subset of distance measures (Sec. 6.3) to be used in the mixture of distance measures introduced in Sec. 3.6.
We are given two sets of source and target domain samples, $X_s$ and $X_t$, which are independently and identically distributed (i.i.d.) from $P$ and $Q$, respectively. The goal here is to find whether these samples come from the same domain or not. For this, we compute the distance between the source and target samples, $d_{s,t}$, via distance measure $D$ (selected from the distance measures defined in Sec. 3): $d_{s,t} = D(X_s, X_t)$. For distance measure $D$ to estimate domain similarity, we expect $d_{s,t}$ to be low when $P = Q$, and high otherwise (similar to a two-sample test statistic [gretton2012kernel]).
Fig. 3 visualizes the results of our experiments, where the distance between exhaustive source/target domain pairs are measured on datasets. We take examples from each domain (Footnote 9: We take source domain samples from the training set and target domain samples from the validation set to avoid overlapping examples when sampling from the same domain.), and embed the sentences using a pre-trained model (Footnote 10: https://tfhub.dev/google/tf2-preview/nnlm-en-dim128/1), after which the distances are calculated. In particular, the entries on the diagonal refer to the in-domain distances (i.e., the source and target domain are the same), and off-diagonal entries refer to the between-domain distances. As we want the in-domain distances to be small and between-domain distances to be large, we expect the visualization of a good distance measure to have a dark line on the diagonal (indicating low values) and bright otherwise. From the visualization plots (Fig. 3), we can see that and , are able to separate domains well. However, all these measures have different scales and sensitivity; hence, we next define two statistics to quantitatively compare different distance measures , which are denoted by and corresponding to method-1 and method-2, respectively. These statistics are shown in Table 1. We can see that most of these methods are able to separate domains, with the exception of . Next, we describe these methods.
The first method assesses whether distances between samples from the same domain are lower than those between the samples from different domains, . This statistic is appealing because it is invariant to scaling and translation, but it does not capture how much smaller in-domain distances are w.r.t. off-domain distances. Specifically, we compute the as:
We can see that achieves the highest score, whereas achieves the lowest.
The second method assesses how much smaller the value of is in comparison to . To compute this, we first standardize (Footnote 11: Subtracting the means and normalizing by the standard deviations.) the matrix , and then apply function to ensure that all entries are positive. Then we compute the sum of the diagonal entries of the transformed matrix as our second quantitative assessment (, note that smaller is better):
We can see that obtains the lowest/best value, and scores the largest value.
The methods described previously answer the question of whether the source and target samples come from the same distribution. However, the assessment we are ultimately interested in is whether the distance measures correlate with the true domain distances. As the true domain distance is latent, here we will use a proxy. We denote as the performance of the baseline model trained on the source domain and evaluated on the target domain. We want to measure the correlation between and . Specifically, we train and evaluate baseline models on all source/target domain pairs, and then compute the Pearson correlation coefficient between the results (averaged over three runs) and distance measures. The values are shown in Table 1, where we can see that most of the distance measures are correlated with actual performance, with having the lowest correlation with empirical performance (hence we ignore for all future experiments, given that it is the worst by large margins on all 3 analysis methods above).
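The correlation analysis itself is a standard Pearson computation, sketched below. The distance and accuracy values are entirely hypothetical and chosen to be perfectly anti-correlated; they are not results from the paper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)

# Hypothetical values: one entry per source/target domain pair.
pair_distances = [0.1, 0.2, 0.3, 0.4]   # distance measure per pair
pair_accuracy = [0.9, 0.8, 0.7, 0.6]    # baseline accuracy per pair
correlation = pearson(pair_distances, pair_accuracy)
print(correlation)
```

A useful distance measure would be expected to correlate negatively with transfer accuracy, since larger domain distance should make transfer harder.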
Lastly, we present the basis for deciding which distances to (not) include in the mixture formulation described in Sec. 3.6. Specifically, our goal is to remove redundant distance measures from the mixture, subject to the constraint that the reduced mixture still provides sufficient information about the distances between two domains. We approach this problem via estimating the ‘informativeness’ of each distance measure. This is analogous to influence functions, a classic technique from robust statistics [cook1980characterizations, koh2017understanding]. To motivate our approach, let’s say our mixture includes all distance measures which are previously defined (Sec. 3). Suppose we have a function which can give us an estimate of the quality of the mixture. Now, we proceed by removing one metric (say ) from the mixture and applying the function to give us an estimate of the quality of the reduced mixture, . We can now define an estimate of a distance measure’s informativeness:
If is small, we can say the removed metric is not informative given other components in the mixture. Here, we use the optimal statistics (which is unsupervised) defined in Sec. 6.1 as the mixture evaluation function (Footnote 12: We do not use because it is not differentiable (calculated as multiple binary comparisons), and already achieved almost-maximum scores (Table 1), thus making the optimization less useful. Also, since we evaluate using an unsupervised criterion, we decided not to use correlation because it is a supervised evaluation.):
where we estimate the maximum value using gradient descent (via the JAX library). We found that removing has far lower impact on the optimal , and thus in our experiments using mixture of distances, we do not include (see Table 2 for detailed scores of informativeness for all the distance measures).
In this section, we show domain-adaptation experimental results for the sentiment classification task on the target domain (using out-of-domain source training data). We start by comparing our (in-domain) baseline to previous work, where the source and target domains are the same. Then we will show the results of our DistanceNet (with both single distance and mixture-of-distance measures), when the source and target domains are different. Lastly, we will show the results of our multi-source DistanceNet baseline versus our multi-source DistanceNet bandit model which dynamically selects source domains. Based on the results of Sec. 6, we do not include in our DistanceNet experiments, and do not include both and in our DistanceNet with mixture-of-distance experiments.
In Table 3 we show the results of our (in-domain) baseline compared with similar models in [liu2017adversarial]. We can see that our baseline is stronger than comparable previous work in four of the five domains we considered.
Table 4 shows the results of baselines and DistanceNet models when the source and target domains are different, where the last column shows the average results. (Footnote 13: Note that the single-distance methods, e.g., MMD, have been used in previous works [bousmalis2016domain, tzeng2014deep, benaim2017one] and can also be considered as baselines.) First, comparing the numbers to those in Table 3, we can see that performance drops when there is a shift in the data distribution. Next, we can see that by adding our domain distance measure as an additional loss term, the model is able to reduce the gap between in-domain performance and out-of-domain performance. In particular, all of our models perform better than our baseline in terms of average results, with the MMD model better than the baseline by one corresponding standard deviation. (Footnote 14: To calculate the standard deviation of the average results, we first compute the average results for each run, and compute the standard deviation of the average results. This is equivalent to computing the standard deviation of a single large prediction by concatenating model outputs for all tasks as a single output.)
Table 4 shows the results of our DistanceNet with mixture of distance measures experiments. From the results, we can see that leveraging the power of multiple distance measures additionally improves the results in out-of-domain settings, and achieves the highest average results (better than the baseline by two standard deviations). We also compare our DistanceNet models to other domain-adaptation approaches. DANN encourages similar latent features by augmenting the model with a few standard layers and a new gradient reversal layer [ganin2016domain]. DataSel instead relies on data selection based on domain similarity and complexity variance [remus2012domain]. From the results, we can see that our DistanceNet with mixture of distance measures outperforms these approaches (better w.r.t. standard deviation margins).
Table 5 shows the results for our multi-source experiments, where the source domains include all but the target domain; thus we have one result for each target domain. Here the baseline is the DistanceNet with mixture of distance measures, which selects domains in a round-robin fashion. Our model instead applies a dynamic controller to select the source domain to use. We can see from the results that using the dynamic controller improves the individual results, and the average results (better by two standard deviations). (Footnote 15: Our single-source experiments suggested that “MR” and “Books” are not helpful for the learning of the other three tasks, thus we mask the DistanceNet loss from these domains when the target domain is not “MR” or “Books”.) In general, we observed that the bandit always improves over the non-bandit baseline (with two std. deviations) even when we simply reuse the best hyperparameters found in the single-source experiments, and when we employ a bandit without the DistanceNet loss (i.e., just cross-entropy).
Fig. 4 provides example visualizations of the usefulness of each source domain for a given target domain during the training trajectory of multi-source bandit experiments. We provide a brief summary of our observations from these examples here. When the target task is “MR”, we observed that “Books” and “Apparel” are more beneficial. When the target task is “Apparel”, we found that “Camera” as well as “Baby” are beneficial; moreover, the bandit learns to switch between “Books” and “MR” over time. When the target task is “Baby”, we see that “Camera” and “Apparel” are beneficial. When “Books” is the target task, we found that “MR” seemed to be less helpful. Finally, when the target task is “Camera”, we see that “Books” had the highest value.
In this work, we presented a study of multiple domain distance measures to address the problem of domain adaptation. We provided analyses of these measures based on their ability to separate same/different domains and correlation with results. Next, we introduced our model, DistanceNet, which augments the loss function with the distance measures. Later, we extended our DistanceNet to the multi-source setup via a multi-armed bandit controller. Our experiment results suggest that our DistanceNet, as well as its variant with the multi-armed bandit, is able to outperform corresponding baselines.
We thank the reviewers and Boyang Li for their helpful comments. This work was supported by DARPA (YFA17-D17AP00022), NSF-CAREER Award #1846185, ONR Grant #N00014-18-1-2871, Google, Facebook, Baidu, and Salesforce. The views contained in this article are those of the authors and not of the funding agency.