Knowledge Adaptation: Teaching to Adapt

02/07/2017 ∙ by Sebastian Ruder, et al.

Domain adaptation is crucial in many real-world applications where the distribution of the training data differs from the distribution of the test data. Previous Deep Learning-based approaches to domain adaptation need to be trained jointly on source and target domain data and are therefore unappealing in scenarios where models need to be adapted to a large number of domains or where a domain is evolving, e.g. spam detection where attackers continuously change their tactics. To fill this gap, we propose Knowledge Adaptation, an extension of Knowledge Distillation (Bucilua et al., 2006; Hinton et al., 2015) to the domain adaptation scenario. We show how a student model achieves state-of-the-art results on unsupervised domain adaptation from multiple sources on a standard sentiment analysis benchmark by taking into account the domain-specific expertise of multiple teachers and the similarities between their domains. When learning from a single teacher, using domain similarity to gauge trustworthiness is inadequate. To this end, we propose a simple metric that correlates well with the teacher's accuracy in the target domain. We demonstrate that incorporating high-confidence examples selected by this metric enables the student model to achieve state-of-the-art performance in the single-source scenario.




1 Introduction

In many real-world applications such as sentiment classification Pang and Lee (2008), a model trained on one domain may not work well when directly applied to another domain due to the difference in the data distribution between the domains. At the same time, labeled data in new domains is scarce or non-existent and manual labeling of large amounts of target domain data is expensive. Domain adaptation allows models to reduce the domain discrepancy and adapt to new domains. While fine-tuning is a commonly used method for supervised domain adaptation, there is no cheap equivalent in the unsupervised case as existing Deep Learning-based approaches need to be trained jointly on source and target domain data. This is prohibitive in scenarios with a large number of domains, such as sentiment classification on the plethora of real-world review categories, blog types, or communities Hamilton et al. (2016). Additionally, re-training a model on source data is infeasible for evolving domains, such as spam detection where attackers continuously adapt their strategy, scene classification where the scene changes over time Hoffman et al. (2014), or a conversational agent for a user with a rapidly evolving style, such as a child or second language learner.

Rather than re-training, we would like to be able to leverage our trained model in the source domain to inform the predictions of a new model trained on the target domain. This objective aligns organically with the idea of Knowledge Distillation Bucilua et al. (2006); Hinton et al. (2015), which we extend as Knowledge Adaptation to the domain adaptation scenario. While Knowledge Distillation concentrates on training a student model on the predictions of a (possibly larger) teacher model, Knowledge Adaptation focuses on determining what part of the teacher’s expertise can be trusted and applied to the target domain.

In this context, determining when to trust the teacher is key. This circumstance is paralleled in real-world teacher-student and adviser-advisee relationships: Children learn early on to trust familiar advisers but to moderate that trust depending on the adviser’s recent history of accuracy or inaccuracy Corriveau and Harris (2009), while adults may surround themselves with advisers, e.g. to make a financial investment and gradually learn whose expertise to trust Johnson and Grayson (2005).

We demonstrate how domain similarity metrics can be used as a measure of relative trust in a teacher for unsupervised domain adaptation with multiple source domains and show state-of-the-art results for a student model that learns from multiple domain-specific teachers.

When learning from a single teacher in the single-source scenario, using a general measure of domain similarity is inadequate as the student has no other, more relevant teacher to turn to for advice in case its teacher is untrustworthy. To this end, we propose a simple measure, which correlates well with the teacher’s accuracy in the target domain and allows the student to gauge the teacher’s confidence in its predictions. We demonstrate that by incorporating high-confidence examples selected by this metric in the training process, the student model is able to outperform the state-of-the-art in single-source unsupervised domain adaptation.

Crucially, our models are the first Deep Learning-based models for domain adaptation that perform adaptation without expensive re-training on the source domain data. They are thus able to make use of readily available trained source domain models and are particularly apt for scenarios where domains change or occur in large numbers.

2 Related work

Distilling knowledge. Bucilua et al. (2006) first proposed a method to compress the knowledge of a source model, which was later improved by Hinton et al. (2015). Romero et al. (2015) showed how this method can be adapted to train deep and thin models, while Kim and Rush (2016) apply the technique to sequence-level models. In addition, Hu et al. (2016) use it to constrain a student model with logic rules. Our goal differs from the previous methods due to the difference in data distributions between source and target data, which makes it necessary to learn from the teacher’s knowledge only insofar as it is useful for the target domain. Similar in spirit to Knowledge Distillation are the KL-divergence based objective of Yu et al. (2013) and Li et al. (2014) for adapting an acoustic model and the Adaptive Mixture of Experts model Nowlan and Hinton (1990), which also learns which expert to trust for a given example. Both, though, require labeled samples, which are scarce in domain adaptation, while our model is entirely unsupervised.

Domain adaptation. Domain adaptation has a long history of research: Blitzer et al. (2006) proposed a structural correspondence learning algorithm. Daumé III (2007) introduced a kernel function that maps source and target domain data to a space that encourages in-domain similarity, and Pan et al. (2010) proposed a spectral feature alignment algorithm to align domain-specific words into meaningful clusters, while Long and Wang (2015) use multi-task learning to avoid negative transfer.

Deep learning-based domain adaptation. Deep learning-based approaches to domain adaptation are more recent and have focused mainly on learning domain-invariant representations: Glorot et al. (2011) first employed stacked Denoising Auto-encoders (SDA) to extract meaningful representations. Chen et al. (2012) in turn extended SDA to marginalized SDA by addressing SDA’s high computational cost and lack of scalability to high-dimensional features, while Zhuang et al. (2015) proposed to use deep auto-encoders for transfer learning. Ajakan et al. (2016) added a Gradient Reversal Layer that hinders the model’s ability to discriminate between domains. Finally, Zhou et al. (2016) transferred the source examples to the target domain and vice versa using Bi-Transferring Deep Neural Networks, while Bousmalis et al. (2016) propose Domain Separation Networks. All of these approaches, however, require the model to be trained jointly on source and target data for every new target domain.

Domain adaptation from multiple sources. For domain adaptation from multiple sources, Mansour (2009) proposed a distribution weighted hypothesis with theoretical guarantees. Duan et al. (2009) proposed a method to learn a least-squares SVM classifier by leveraging source classifiers, while Chattopadhyay et al. (2012) assign pseudo-labels to the target data. Finally, Wu and Huang (2016) exploit general sentiment knowledge and word-level sentiment polarity relations for multi-source domain adaptation.

3 Knowledge Adaptation

3.1 Problem definition

In the following, we describe domain adaptation within the knowledge adaptation framework: We are provided with one or multiple source domains $\mathcal{D}_{S_i}$ and a target domain $\mathcal{D}_T$. For each of the source domains, we are provided with a teacher model $T_i$ that was trained on examples $X_{S_i} = \{x_{S_i}^1, \dots, x_{S_i}^n\}$ and their labels $\{y_{S_i}^1, \dots, y_{S_i}^n\}$ from $\mathcal{D}_{S_i}$. In the target domain $\mathcal{D}_T$, we only have access to the examples $X_T = \{x_T^1, \dots, x_T^m\}$ without knowledge of their labels. Note that we omit source and target domain indexes in the following for simplicity in cases where examples are unambiguous. Our task is now to train a student model $S$ that performs well on unseen examples from the target domain $\mathcal{D}_T$.

3.2 Single teacher-student model

Our teacher and student models are simple multilayer perceptrons (MLP). The basic MLP consists of an input layer, one or multiple intermediate layers, and an output layer. Each intermediate layer $l$ learns to embed the output of the previous layer $h_{l-1}$ into a latent representation $h_l = a(W_l h_{l-1} + b_l)$, where $W_l$ and $b_l$ are the weights and bias of the layer, while $a$ is the activation, typically ReLU for hidden layers and softmax units for the output layer.

In the single source setting, the teacher $T$ has an output softmax $P_T = \mathrm{softmax}(z_T)$, where $z_T$ are the logits of the teacher’s output layer. $T$ is trained to minimize the loss $\mathcal{L}_T = \mathcal{H}(y_S, P_T)$, where $\mathcal{H}$ refers to the cross-entropy and $y_S$ is the label of the training example in the source domain $\mathcal{D}_S$.

The student $S$ similarly models an output probability $P_S = \mathrm{softmax}(z_S)$, where $z_S$ are the logits of the student’s output layer. In the context of knowledge distillation Hinton et al. (2015), the student is trained so that its output is similar to the teacher’s output and to the true labels. In practice, the output probability of the teacher is smoothed with a temperature $\tau$ to soften the signal and provide more information during training. The same temperature is applied to the output of the student network for the comparison:

$$P_T^\tau = \mathrm{softmax}(z_T / \tau), \qquad P_S^\tau = \mathrm{softmax}(z_S / \tau) \tag{1}$$


For unsupervised domain adaptation, true labels in the target domain are not available. Thus the student is trained solely to mimic the teacher’s softened output with the following loss, which is similar to treating source input modalities as privileged information Lopez-Paz et al. (2016):

$$\mathcal{L}_{DIST} = \mathcal{H}(P_T^\tau, P_S^\tau) \tag{2}$$

where $P_T^\tau$ and $P_S^\tau$ are the temperature-softened outputs of the teacher and the student and $\mathcal{H}$ is the cross-entropy.

Figure 1: Training procedures for (a) the teacher model, (b) the student model, and (c) the student model with multiple teachers. The teacher is trained on examples $x_S$ and their true labels $y_S$ in the source domain $\mathcal{D}_S$, while the student is trained on the softened predictions of one or multiple teachers for examples $x_T$ in the target domain $\mathcal{D}_T$.
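
The distillation objective of §3.2 can be illustrated with a minimal sketch in plain Python. The function names and the example logits are hypothetical and chosen only for illustration; the temperature of 5 is the value reported in §4.2. This is not the authors’ implementation, only one straightforward reading of the softened softmax and the cross-entropy between teacher and student outputs.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over a list of logits, softened by dividing by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, predicted, eps=1e-12):
    """H(target, predicted) = -sum_i target_i * log(predicted_i)."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))

def distillation_loss(teacher_logits, student_logits, temperature=5.0):
    """Cross-entropy between the softened teacher and student distributions."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return cross_entropy(p_teacher, p_student)

# Hypothetical logits for one target-domain example (binary sentiment).
teacher_logits = [2.0, -1.0]
student_logits = [0.5, 0.1]
print(softmax(teacher_logits))        # sharp teacher distribution
print(softmax(teacher_logits, 5.0))   # softened teacher distribution
print(distillation_loss(teacher_logits, student_logits))
```

A higher temperature moves the teacher’s distribution closer to uniform, so the student also receives a signal about the class the teacher considers less likely.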

3.3 Multiple teacher-student model

The teacher-student paradigm lends itself naturally to the scenario with multiple source domains. Intuitively, the trust that a student should place in a teacher should be proportional to the degree of similarity between the teacher’s domain and the student’s domain.

To this end, we consider three measures of domain similarity, which have been successfully used in domain adaptation research: Jensen-Shannon divergence Remus (2012) and Renyi divergence Van Asch and Daelemans (2010), which are both based on Kullback-Leibler divergence and are computed with regard to the domains’ term distributions; and Maximum Mean Discrepancy Tzeng et al. (2014), which we compute with respect to the teacher’s latent representation. These measures are computed between the target domain and every source domain (additional information with regard to our choice and use of domain similarity measures can be found in Appendix A.1).

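The student model with multiple teachers is then trained to imitate the sum of the teachers’ individual predictions, weighted with the normalized similarity $\mathrm{sim}(\mathcal{D}_{S_i}, \mathcal{D}_T)$ of their respective source domain $\mathcal{D}_{S_i}$ to the target domain $\mathcal{D}_T$:

$$\mathcal{L}_{MUL} = \mathcal{H}\Big(\sum_{i} \mathrm{sim}(\mathcal{D}_{S_i}, \mathcal{D}_T) \cdot P_{T_i}^\tau,\; P_S^\tau\Big) \tag{3}$$

where the sum runs over all source domains and $P_{T_i}^\tau$ is the softened output of teacher $T_i$.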
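
As a rough illustration of the weighting described above, the following Python sketch builds the student’s soft target from several teachers. The similarity scores and teacher outputs are made-up values, and how a divergence is turned into a similarity score is left open here; the sketch only assumes that larger scores mean more similar domains and that the scores are normalized to sum to one.

```python
def normalize(similarities):
    """Scale similarity scores so that they sum to one."""
    total = sum(similarities)
    return [s / total for s in similarities]

def combine_teachers(teacher_probs, similarities):
    """Similarity-weighted sum of the teachers' (softened) output distributions."""
    weights = normalize(similarities)
    n_classes = len(teacher_probs[0])
    return [
        sum(w * probs[c] for w, probs in zip(weights, teacher_probs))
        for c in range(n_classes)
    ]

# Hypothetical softened outputs of three source-domain teachers for one
# target example, and hypothetical similarities of their domains to the target.
teacher_probs = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]]
similarities = [0.5, 0.3, 0.2]
soft_target = combine_teachers(teacher_probs, similarities)
print(soft_target)
```

The combined distribution then takes the place of the single teacher’s softened output as the target of the student’s cross-entropy loss.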


3.4 Leveraging a single teacher’s knowledge

General measures of domain similarity are useful in the multi-source setting, where we can rely on multiple teachers and choose to trust one more than the others. In the scenario with a single teacher, it is not helpful to know whether we can trust the teacher in general. We rather want a measure that allows us to determine if we can trust the teacher for a specific example.

To arrive at such a measure, we revisit the representations the teacher learns from the input data: In order to make accurate predictions, the teacher model learns to separate the representation of different output classes in its hidden representation (we use a one-layer MLP in our experiments as detailed in §4.2; in deeper networks, this would be an intermediate layer). Even though the teacher model is trained on the source domain, this separation still holds, albeit with decreased accuracy, in the target domain. This can be seen in Figure 2, where examples in the target domain that were predicted as positive and negative by the teacher form distinct clusters (refer to §4.1 for details with regard to the data and task). Importantly, many of these predictions are incorrect.

Figure 2: PCA visualization of a teacher’s latent representations of target domain examples for the K->D domain pair (see §4.1 for details). A darker color reflects a higher MCD value. Best viewed in close-up.
Figure 3: Accuracy of the teacher’s predictions on the top $n$ target domain examples with the highest MCD value for the K->D domain pair.

As evidenced in Figure 2, incorrect predictions are frequent along the decision boundary and infrequent along the cluster edges, where examples are less ambiguous. More precisely, the accuracy of the teacher’s predictions on the target domain is proportional to the absolute difference in similarity of the teacher’s representation with the cluster centroids, which we refer to as Maximum Cluster Difference (MCD) and define as follows:

$$\mathrm{MCD}_h = |\cos(c_p, h) - \cos(c_n, h)| \tag{4}$$

where $\cos(\cdot, \cdot)$ is the cosine similarity, $h$ is the teacher’s hidden representation of an example, and $c_p$ and $c_n$ are the centroids of the positive and negative cluster respectively as predicted by the teacher, i.e. the mean representation of all examples assigned to the cluster by the teacher. Note that while we are focusing on binary classification involving two clusters, the measure is equally applicable to the multi-class setting, as demonstrated in Appendix A.2.
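
A minimal Python sketch of the binary MCD computation follows. It assumes cosine similarity as the similarity function and uses small made-up two-dimensional vectors in place of the teacher’s hidden representations; the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def mcd_scores(representations, predicted_labels):
    """MCD per example; labels are the teacher's predictions (1 = positive, 0 = negative)."""
    positive = [h for h, y in zip(representations, predicted_labels) if y == 1]
    negative = [h for h, y in zip(representations, predicted_labels) if y == 0]
    c_p, c_n = centroid(positive), centroid(negative)
    return [abs(cosine(c_p, h) - cosine(c_n, h)) for h in representations]

# Made-up hidden representations; the last example lies near the decision boundary.
representations = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9], [0.6, 0.5]]
predicted_labels = [1, 1, 0, 0, 1]
scores = mcd_scores(representations, predicted_labels)
print([round(s, 3) for s in scores])
```

Examples deep inside a cluster receive a high score, while the ambiguous example near the boundary receives a low one, which is the behaviour the measure relies on.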

Evidence of the efficacy of this measure for obtaining the trustworthiness of a teacher for an example can be found in the PCA visualization in Figure 2 (a visualization using t-SNE revealed the same clusters, but PCA showed a clearer decision boundary), where incorrect predictions are far less common for (more darkly colored) examples with higher MCD values. Additionally, the MCD score of a target domain example and the accuracy of the teacher’s prediction correlate with an average Pearson’s $r$ of 0.33 across all domain pairs of the data described in §4.1. We furthermore plot the teacher’s accuracy for the top $n$ target domain examples with the highest MCD values in Figure 3. While the measure becomes less accurate as $n$ increases, it is very accurate for low $n$.

| Similarity measure | Book | DVD | Electronics | Kitchen |
|---|---|---|---|---|
| None | 0.7821 | 0.7913 | 0.8181 | 0.8529 |
| Renyi divergence | 0.7722 | 0.7727 | 0.8133 | 0.8420 |
| Maximum Mean Discrepancy | 0.7811 | 0.7839 | 0.7890 | 0.8273 |
| Jensen-Shannon divergence | 0.7918 | 0.7968 | 0.8203 | 0.8523 |

Table 1: Comparison of the impact of different domain similarity measures on the student’s performance when used for interpolating the predictions of the source domain teacher models. For the results in each column, the domain in the column header is used as target domain and the remaining three domains are used as source domains.

For this reason, rather than weighting all examples with MCD, we propose to add the $n$ unlabeled training examples with the highest MCD, together with their teacher-assigned labels, as pseudo-supervised examples on which we train the student with the following objective:

$$\mathcal{L}_{PSEUDO} = \lambda\, \mathcal{H}(P_T^\tau, P_S^\tau) + (1 - \lambda)\, \mathcal{H}(\mathbb{1}_i, P_S) \tag{5}$$

where $\mathbb{1}_i$ is the indicator array containing 1 at the index $i$ of the class predicted by the teacher and 0 at all other indexes, while $\lambda$ determines the contribution of the soft targets. This can be seen as a representation-based variant of instance adaptation Jiang and Zhai (2007), which uses MCD as a measure of confidence as it correlates better with teacher accuracy than teacher prediction probability. In practice, we alternate unsupervised training with the objective in equation 2 and pseudo-supervised training with the objective in equation 5, although other curricula are imaginable.
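
The selection and training signal described above can be sketched in Python as follows. The sketch assumes that the objective interpolates between a cross-entropy on the softened teacher output and a cross-entropy on the teacher-assigned hard label, with the interpolation weight placed on the soft targets; whether the hard-label term uses the student’s unsoftened output is likewise an assumption. All names and numbers are illustrative.

```python
import math

def cross_entropy(target, predicted, eps=1e-12):
    """H(target, predicted) = -sum_i target_i * log(predicted_i)."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))

def one_hot(index, n_classes):
    """Indicator array with 1.0 at `index` and 0.0 elsewhere."""
    return [1.0 if i == index else 0.0 for i in range(n_classes)]

def top_n_by_mcd(mcd_scores, n):
    """Indices of the n examples with the highest MCD scores."""
    order = sorted(range(len(mcd_scores)), key=lambda i: mcd_scores[i], reverse=True)
    return order[:n]

def pseudo_supervised_loss(teacher_soft, student_soft, teacher_probs, student_probs, lam=0.2):
    """lam weights the soft-target term, (1 - lam) the teacher-assigned hard label."""
    predicted_class = max(range(len(teacher_probs)), key=lambda i: teacher_probs[i])
    pseudo_label = one_hot(predicted_class, len(teacher_probs))
    soft_term = cross_entropy(teacher_soft, student_soft)
    hard_term = cross_entropy(pseudo_label, student_probs)
    return lam * soft_term + (1.0 - lam) * hard_term

# Hypothetical MCD scores for four unlabeled target examples.
mcd = [0.05, 0.72, 0.31, 0.64]
print(top_n_by_mcd(mcd, 2))

# Hypothetical teacher and student outputs for one selected example.
loss = pseudo_supervised_loss([0.65, 0.35], [0.55, 0.45], [0.95, 0.05], [0.70, 0.30])
print(loss)
```

With a small interpolation weight, most of the gradient signal for the selected examples comes from the teacher-assigned label rather than from the soft targets.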

4 Experiments

4.1 Data set

We use the Amazon product reviews sentiment analysis dataset of Blitzer et al. (2006), a common benchmark for domain adaptation. The dataset consists of 4 different domains: Book (B), DVD (D), Electronics (E) and Kitchen (K). We follow the conventions of past work and evaluate on the binary classification task where reviews with more than 3 stars are considered positive and reviews with 3 stars or fewer are considered negative. Each domain contains 1,000 positive, 1,000 negative, and approximately 4,000 unlabeled reviews. For fairness of comparison, we use the raw bag-of-words unigram/bigram features pre-processed with tf-idf as input Blitzer et al. (2006).

For single-source adaptation, we replicate the set-up of previous methods and train our teacher models on all 2,000 labeled examples, of which we reserve 200 as dev set. For domain adaptation from multiple sources, we follow the conventions of Bollegala et al. (2011) and limit the total number of training examples for all teachers to 1,600, i.e. given three source domains, each teacher is only trained on about 533 labeled samples. We also train a general teacher on the same 1,600 examples of the three domains. In both scenarios, the student is evaluated on all 2,000 labeled samples of the target domain. As we have not found a universally applicable way to optimize hyperparameters or perform early stopping for unsupervised domain adaptation, we choose to use a small number of unlabeled examples as a labeled validation set similar to Bousmalis et al. (2016).

4.2 Hyperparameters

Both student and teacher models are one-layer MLPs with 1,000 hidden dimensions. We use a vocabulary size of 10,000, a temperature of 5, a batch size of 10, and Adam Kingma and Ba (2015) as optimizer with a learning rate of 0.001. For every experiment, we report the average of 10 runs.
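
For orientation, the forward pass of such a one-layer MLP can be sketched in plain Python. The layer sizes below are deliberately tiny stand-ins for the 10,000 input features, 1,000 hidden units, and 2 output classes described in the text, and the weight initialization is an assumption, as the text does not specify one. Training with Adam is omitted.

```python
import math
import random

def relu(values):
    return [max(0.0, v) for v in values]

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def init_layer(n_in, n_out, rng):
    """Small uniform random weights and zero biases (an assumed initialization)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def dense(layer, inputs):
    """Affine transformation W x + b for a (weights, bias) pair."""
    weights, bias = layer
    return [sum(w * x for w, x in zip(row, inputs)) + b for row, b in zip(weights, bias)]

def forward(x, hidden_layer, output_layer, temperature=1.0):
    """Return the hidden representation and the (optionally softened) class probabilities."""
    hidden = relu(dense(hidden_layer, x))
    probs = softmax(dense(output_layer, hidden), temperature)
    return hidden, probs

rng = random.Random(0)
N_FEATURES, N_HIDDEN, N_CLASSES = 20, 8, 2  # stand-ins for 10,000 / 1,000 / 2
hidden_layer = init_layer(N_FEATURES, N_HIDDEN, rng)
output_layer = init_layer(N_HIDDEN, N_CLASSES, rng)
x = [rng.random() for _ in range(N_FEATURES)]  # stand-in for a tf-idf feature vector
hidden, probs = forward(x, hidden_layer, output_layer, temperature=5.0)
print(len(hidden), probs)
```

The hidden vector returned here corresponds to the representation on which MCD and MMD are computed, while the softened probabilities serve as the teacher’s signal to the student.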

4.3 Domain adaptation from multiple sources

As it is easier for the student to assign trust when learning from multiple teachers, we first conduct experiments on the sentiment analysis benchmark for domain adaptation from multiple sources. For each experiment, one of the four domains is used as the target domain, while the remaining ones are treated as source domains.

Domain similarity. We first evaluate the performance of our student depending on different measures of domain similarity, with which we interpolate the predictions of the teachers. As evidenced in Table 1, Jensen-Shannon divergence generally performs best. We thus use this measure for the remainder of the experiments.

| Model | Book | DVD | Electronics | Kitchen |
|---|---|---|---|---|
| SCL Blitzer et al. (2006) | 0.7457 | 0.7630 | 0.7893 | 0.8207 |
| SFA Pan et al. (2010) | 0.7598 | 0.7848 | 0.7808 | 0.8210 |
| SCL-com | 0.7523 | 0.7675 | 0.7918 | 0.8247 |
| SFA-com | 0.7629 | 0.7869 | 0.7864 | 0.8258 |
| SST Bollegala et al. (2011) | 0.7632 | 0.7877 | 0.8363 | 0.8518 |
| IDDIWP Yoshida et al. (2011) | 0.7524 | 0.7732 | 0.8167 | 0.8383 |
| DWHC Mansour (2009) | 0.7611 | 0.7821 | 0.8312 | 0.8478 |
| DAM Duan et al. (2009) | 0.7563 | 0.7756 | 0.8284 | 0.8419 |
| CP-MDA Chattopadhyay et al. (2012) | 0.7597 | 0.7792 | 0.8331 | 0.8465 |
| SDAMS-SVM Wu and Huang (2016) | 0.7786 | 0.7902 | 0.8418 | 0.8578 |
| SDAMS-Log Wu and Huang (2016) | 0.7829 | 0.7913 | 0.8406 | 0.8629 |
| Teacher-only | 0.7565 | 0.7765 | 0.7960 | 0.8210 |
| Student (source teachers) | 0.7918 | 0.7968 | 0.8203 | 0.8523 |
| Student (general teacher) | 0.8014 | 0.8062 | 0.8365 | 0.8675 |
| Student (source teachers + general) | 0.8010 | 0.8088 | 0.8311 | 0.8647 |

Table 2: Average results for domain adaptation from multiple sources for the comparison models and ours on the sentiment analysis benchmark. For the results in each column, the domain in the column header is used as target domain and the remaining three domains are used as source domains.

Our models. For multi-source domain adaptation, we first consider a teacher-only baseline (Teacher-only), where teacher sentiment probabilities are combined, weighted with Jensen-Shannon divergence, and the most likely sentiment is chosen. We further train our student on a) the source domain-specific teachers as detailed in §3.3, b) the general teacher trained on all source domains as described in §4.1, and on c) the combination of source domain and general teachers.

Comparison models. We compare our models against the following methods: domain adaptation with structural correspondence learning (SCL) Blitzer et al. (2006); domain adaptation based on spectral feature alignment (SFA) Pan et al. (2010); adaptations of SCL and SFA via majority voting to the multi-source scenario (SCL-com and SFA-com); cross-domain sentiment classification by constructing a sentiment-sensitive thesaurus (SST) Bollegala et al. (2011); multiple-domain sentiment analysis by identifying domain dependent/independent word polarity (IDDIWP) Yoshida et al. (2011); three general-purpose multiple source domain adaptation methods (DWHC, Mansour (2009)), (DAM, Duan et al. (2009)), (CP-MDA, Chattopadhyay et al. (2012)); cross-domain sentiment classification by transferring sentiment along a sentiment graph with hinge loss and logistic loss respectively (SDAMS-SVM and SDAMS-Log) Wu and Huang (2016). Numbers are taken from Wu and Huang (2016).

Results. All results are depicted in Table 2. Evaluating the combination of the source teacher models directly on the target domain (Teacher-only) produces the worst results among our models, which underscores the need for methods that allow adaptation to the target domain. Training the student model on the soft targets of the teachers allows us to improve upon the teacher-only baseline significantly, which demonstrates the appropriateness of the teacher-student paradigm to the domain adaptation scenario. In all domains except Electronics, the student model outperforms comparison methods that rely on source model predictions, either by combining them Mansour (2009) or by predicting them Duan et al. (2009). This showcases the usefulness of learning from soft targets in the domain adaptation scenario. Training on a general teacher model as well as on a combination of the general teacher and the source domain teachers allows us to improve results even further. Both models improve over existing approaches to domain adaptation from multiple sources and outperform approaches that rely on sentiment analysis-specific information Wu and Huang (2016) in all but the Electronics domain.

4.4 Single-source domain adaptation

We additionally evaluate the ability of the student to only learn from a single teacher. This scenario is more challenging as the student cannot consider other teachers that might provide more relevant predictions. For each target domain, each of the three other domains is used as source domain, yielding 12 domain pairs.

Our models. On these domain pairs, we first evaluate our student-teacher (TS) model. For training a model that incorporates high-confidence predictions of the teacher (TS-MCD), we cross-validate the interpolation parameter $\lambda$ in equation 5 and the number $n$ of examples with the highest MCD scores. We find that a low $\lambda$ (around 0.2) generally yields the best results in the domain adaptation setting, as the high-confidence predictions are helpful to guide the student’s learning during training. Additionally, using the top 500 unlabeled target domain examples with the highest MCD scores for pseudo-supervised training of the student produces the best results.

Figure 4: Average results for single-source domain adaptation for the comparison models and our models on the sentiment analysis benchmark. B: Book. D: DVD. E: Electronics. K: Kitchen.

Comparison models. For the single-source case, we similarly compare against SCL Blitzer et al. (2006) and SFA Pan et al. (2010), as well as against multi-label consensus training (MCT), which combines base classifiers trained with SCL Li and Zong (2008) and against an approach that links heterogeneous input features with points via non-negative matrix factorization (PJNMF) Zhou et al. (2015). We additionally compare against the following deep learning-based approaches: stacked denoising auto-encoders (SDA) Glorot et al. (2011); marginalized SDA (mSDA) Chen et al. (2012); transfer learning with deep auto-encoders (TLDA) Zhuang et al. (2015); and bi-transferring deep neural networks (BTDNN) Zhou et al. (2016).

Results. The results can be seen in Figure 4. The student trained on the source domain teacher (TS) achieves convincing results and outperforms the state-of-the-art on three domain pairs, twice with the Book domain as source domain, which suggests that knowledge acquired from the Book domain may be more easily transferable to a student model. For many domain pairs, the student still falls significantly short of the state-of-the-art, which highlights that solely relying on a single teacher’s predictions is insufficient to bridge the discrepancy between the domains. Instead, additional methods are necessary to tell the student when to trust the teacher’s predictions. Leveraging the teacher’s knowledge by incorporating high-confidence examples selected by MCD into the training (TS-MCD) significantly improves the performance of the student in almost all cases. This allows the student to outperform the state-of-the-art on 8 out of 12 domain pairs without expensive joint training on source and target data, depending solely on a single model trained on the source domain, which is typically readily available.

5 Conclusion

In this work, we have proposed Knowledge Adaptation, an extension of the Knowledge Distillation idea to the domain adaptation scenario. This method -- in contrast to prevalent domain adaptation methods -- is able to perform adaptation without re-training. We firstly demonstrated the benefit of this paradigm by showing that a student model that takes into account the predictions of multiple teachers and their domain similarities is able to outperform the state-of-the-art for multi-source unsupervised domain adaptation on a standard sentiment analysis benchmark. We additionally introduced a simple measure to gauge the trustworthiness of a single teacher and showed how this measure can be used to achieve state-of-the-art results on 8 out of 12 domain pairs for single-source unsupervised domain adaptation.


Appendix A

A.1 Domain similarity measures

We use three measures of domain similarity in our experiments: Jensen-Shannon divergence, Renyi divergence, and Maximum Mean Discrepancy (MMD).

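Jensen-Shannon divergence is a smoothed, symmetric variant of KL divergence. The Jensen-Shannon divergence between two different probability distributions $P$ and $Q$ can be written as:

$$D_{JS}(P \,\|\, Q) = \frac{1}{2}\big[D_{KL}(P \,\|\, M) + D_{KL}(Q \,\|\, M)\big]$$

where $M = \frac{1}{2}(P + Q)$, i.e. the average distribution of $P$ and $Q$, and $D_{KL}$ is the KL divergence:

$$D_{KL}(P \,\|\, Q) = \sum_{i} p_i \log \frac{p_i}{q_i}$$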
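Renyi divergence similarly generalizes KL divergence by assigning different weights to the probability distributions of the source and target domain and is defined as follows:

$$D_R(P \,\|\, Q) = \frac{1}{\alpha - 1} \log \Big(\sum_{i} \frac{p_i^{\alpha}}{q_i^{\alpha - 1}}\Big)$$

In the limit $\alpha \to 1$, Renyi divergence reduces to KL divergence. In our experiments, we set $\alpha$ to the value used by Van Asch and Daelemans (2010).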

These domain similarity measures are typically based on the term distributions of the source and target domains, i.e. the probability distribution of a domain is the term distribution $t \in \mathbb{R}^{|V|}$, where $t_i$ is the relative probability of word $w_i$ appearing in the domain and $|V|$ is the size of the vocabulary of the domain. The intuition behind using term distributions is that similar domains usually have more terms in common than dissimilar domains. While term distributions are efficient to compute and have proven effective in previous work Van Asch and Daelemans (2010); Wu and Huang (2016), they only capture shallow occurrence statistics.
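
The two term-distribution measures can be sketched in Python as follows. The toy reviews, the whitespace tokenization, the shared vocabulary, and the value of `alpha` are illustrative assumptions, not the settings used in the experiments; the divergence formulas themselves are the standard definitions.

```python
import math
from collections import Counter

def term_distribution(documents, vocabulary):
    """Relative frequency of each vocabulary word over whitespace-tokenized documents."""
    vocab = set(vocabulary)
    counts = Counter(tok for doc in documents for tok in doc.split() if tok in vocab)
    total = sum(counts.values())  # assumes at least one in-vocabulary token
    return [counts[w] / total for w in vocabulary]

def kl_divergence(p, q):
    """D_KL(p || q); terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence via the average distribution m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def renyi_divergence(p, q, alpha):
    """Renyi divergence; this sketch assumes 0 < alpha < 1 and overlapping support."""
    total = sum(pi ** alpha * qi ** (1 - alpha) for pi, qi in zip(p, q))
    return math.log(total) / (alpha - 1)

# Toy "domains"; the vocabulary is shared so both distributions align by index.
vocabulary = ["good", "bad", "book", "plot", "battery"]
books = ["good book good plot", "bad plot"]
electronics = ["good battery", "bad battery bad"]
p = term_distribution(books, vocabulary)
q = term_distribution(electronics, vocabulary)
print(js_divergence(p, q))
print(renyi_divergence(p, q, alpha=0.99))  # alpha chosen only for illustration
```

Jensen-Shannon divergence is symmetric and bounded by log 2 for natural logarithms, which makes it convenient to turn into a similarity score, whereas Renyi divergence is asymmetric in its two arguments.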

Other similarity metrics, such as MMD, are based on representations. MMD measures the distance between a source and target distribution with respect to a particular representation $\phi$. The MMD between the source data $X_S$ and the target data $X_T$ is defined as follows:

$$\mathrm{MMD}(X_S, X_T) = \Big\| \frac{1}{|X_S|} \sum_{x_s \in X_S} \phi(x_s) - \frac{1}{|X_T|} \sum_{x_t \in X_T} \phi(x_t) \Big\|$$

The representation $\phi$ is usually obtained by embedding the source data and target data in a Reproducing Kernel Hilbert Space via a specifically chosen kernel, e.g. Bousmalis et al. (2016) use a linear combination of RBF kernels. Similarly to Tzeng et al. (2014), we use the hidden representation of a neural network as basis for $\phi$, as we are interested in how well the teacher’s representation captures the difference in domain.
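
When the representation is an explicit vector, such as a network’s hidden layer, MMD reduces to the distance between the mean representations of the two data sets. The following Python sketch assumes the Euclidean norm and uses made-up hidden vectors; the function names are hypothetical.

```python
import math

def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def mmd(source_representations, target_representations):
    """Euclidean distance between the mean source and mean target representation."""
    mu_s = mean_vector(source_representations)
    mu_t = mean_vector(target_representations)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(mu_s, mu_t)))

# Made-up hidden representations of source and target examples.
source_hidden = [[0.2, 0.8, 0.0], [0.4, 0.6, 0.1]]
target_hidden = [[0.9, 0.1, 0.3], [0.7, 0.2, 0.5], [0.8, 0.3, 0.4]]
print(mmd(source_hidden, target_hidden))
```

A value close to zero indicates that the representation barely distinguishes the two domains on average.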

In our experiments, MMD does not outperform the more traditional term distribution-based similarity measure, which we attribute to two reasons: 1) Due to the limited amount of data, our teacher model is not deep enough to capture the difference in domain in its single hidden layer; Tzeng et al. (2014) in contrast identify the fully-connected layer in the AlexNet architecture as the layer minimizing MMD. 2) The teacher is only trained on the source domain data. Its representation is thus not sensitive enough to detect the domain shift to the target domain. Training a separate model to minimize MMD alleviates this, but incurs additional computational costs and requires retraining on the source data during adaptation, which we set out to avoid to enable efficient adaptation.

Another commonly used measure of domain similarity is $\mathcal{A}$-distance. Ben-David et al. (2007) show that computing the $\mathcal{A}$-distance between two domains reduces to minimizing the empirical risk of a classifier that tries to discriminate between the examples in those domains. Previous work Blitzer et al. (2007) uses the Huber loss and a linear classifier for computing the $\mathcal{A}$-distance. In our experiments, $\mathcal{A}$-distance did not outperform Jensen-Shannon divergence, while its reliance on training a classifier is a downside in our scenario with multiple or changing target domains, where we would prefer more efficient measures of domain similarity.

A.2 Multi-class MCD

Maximum cluster difference can be easily extended to the multi-class setting. For $k$ classes, we compute $k$ cluster centroids for the clusters whose members have been assigned the same class by the model. We then create a set $C$ containing all unique pairs of cluster centroids. Finally, we compute the sum of pair-wise differences of the model’s representation $h$ with regard to the cluster centroid pairs:

$$\mathrm{MCD}_h = \sum_{(c_i, c_j) \in C} |\cos(c_i, h) - \cos(c_j, h)|$$
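
The multi-class extension can be sketched in Python along the same lines as the binary case. Cosine similarity is assumed as the similarity function, and the three-class representations below are made up for illustration.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def multiclass_mcd(representations, predicted_classes):
    """Sum of absolute similarity differences over all unique centroid pairs."""
    clusters = {}
    for h, c in zip(representations, predicted_classes):
        clusters.setdefault(c, []).append(h)
    centroids = [centroid(members) for members in clusters.values()]
    pairs = list(combinations(centroids, 2))
    return [
        sum(abs(cosine(c_i, h) - cosine(c_j, h)) for c_i, c_j in pairs)
        for h in representations
    ]

# Made-up representations for three classes; the last example is ambiguous.
representations = [
    [1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0], [0.4, 0.3, 0.3],
]
predicted_classes = [0, 0, 1, 2, 0]
scores = multiclass_mcd(representations, predicted_classes)
print([round(s, 3) for s in scores])
```

With two classes there is a single centroid pair, so the sum reduces to the binary definition.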