Transfer Learning for Relation Extraction via Relation-Gated Adversarial Learning

08/22/2019 ∙ by Ningyu Zhang, et al. ∙ University of Oxford ∙ Zhejiang University

Relation extraction aims to extract relational facts from sentences. Previous models mainly rely on manually labeled datasets, seed instances or human-crafted patterns, and distant supervision. However, human annotation is expensive, human-crafted patterns suffer from semantic drift, and distant supervision samples are usually noisy. Domain adaptation methods enable leveraging labeled data from a different but related domain. However, different domains usually have different textual relation descriptions and different label spaces (the source label space is usually a superset of the target label space). To solve these problems, we propose a novel model of relation-gated adversarial learning for relation extraction, which extends adversarial-based domain adaptation. Experimental results show that the proposed approach outperforms previous domain adaptation methods in partial domain adaptation and can improve the accuracy of distantly supervised relation extraction through fine-tuning.


1 Introduction

Relation extraction (RE) is devoted to extracting relational facts from sentences, which can be applied to many natural language processing (NLP) applications such as knowledge base construction Wu and Weld (2010) and question answering Dai et al. (2016). Given a sentence with an entity pair, this task aims to identify the relation between the two entities.

Typically, existing methods follow the supervised learning paradigm, and they require extensive annotations from domain experts, which are expensive and time-consuming. To alleviate such drawbacks, bootstrap learning has been proposed to build relation extractors with a small set of seed instances or human-crafted patterns, but it suffers from the semantic drift problem Nakashole et al. (2011). Besides, distant supervision (DS) methods leverage existing relational pairs of Knowledge Graphs (KGs) such as Freebase to automatically generate training data Mintz et al. (2009). However, because of the incompleteness of KGs and the large number of relations among entities, generating sufficient noise-free labels via DS is still prohibitive.

Figure 1: Knowledge transfer for RE from the general domain (e.g., Wikipedia) to specific domains.

Domain adaptation (DA) methods Pan et al. (2010) enable leveraging labeled data from a different but related domain, which benefits RE in several ways. First, one can adapt from a fully labeled source domain to a similar but less labeled target domain. Second, one can adapt from a general domain (e.g., Wikipedia) to a specific domain (e.g., the financial or medical domain). Third, in the DS setting, one can adapt from a domain with high-quality labels to a domain with noisy labels. However, as shown in Figure 1, there are at least three challenges when adapting an RE system to a new domain that is short of labels or has noisy labels.

  • Linguistic variation. First, the same semantic relation can be expressed using different surface patterns in different domains. For example, the relation subsidiary can be expressed as "DeepMind is a subsidiary of Alphabet" or as "Microsoft is holding Mojang." It is challenging to learn general domain-invariant textual features that can disentangle the factors of linguistic variation underlying domains and close the linguistic gap between domains.

  • Imbalanced relation distribution. Second, the marginal distribution of relation types varies from domain to domain. For example, a domain about GEO locations may consist of a large number of relational facts about located_in, whereas a domain about persons may be more focused on was_born_in. Although the two domains may have the same set of relations, they probably have different marginal distributions on the relations. This can lead to a negative transfer phenomenon Rosenstein et al. (2005), where the out-of-domain samples degrade the performance on the target domain.

  • Partial Adaptation. Third, existing adaptation models generally assume the same label spaces across the source and target domains. However, it is a common requirement to partially adapt from a general domain such as Wikipedia to a small vertical domain such as news or finance that may have a smaller label space. For instance, as shown in Figure 1, by using Wikidata that consists of more than 4,000 relations as a general source domain and a scientific dataset in a specific domain with a few relations as a target domain, relation type capital_of will trigger the negative transfer problem when discriminating the target relation types subsidiary and was_born_in.

To address the aforementioned issues, we propose a general framework called relation-gated adversarial learning (R-Gated), which consists of three modules: (1) Instance encoder, which learns transferable features that can disentangle the explanatory factors of linguistic variations across domains. We implement the instance encoder with a convolutional neural network (CNN), considering both model performance and time efficiency. Other neural architectures such as recurrent neural networks (RNN) can also be used as sentence encoders. (2) Adversarial domain adaptation, which looks for a domain discriminator that can distinguish between samples having different relation distributions. Adversarial learning helps learn a neural network that can map a target sample to a feature space such that the discriminator can no longer distinguish it from a source sample. (3) The relation-gate mechanism, which identifies the unrelated source data and automatically down-weights their importance to tackle the problem of negative transfer introduced by either imbalanced relation distribution or partial transfer.

Figure 2: Overview of our approach. The parameters of the source instance encoder and the relation classifier are pre-learned and subsequently fixed. The yellow part denotes the probabilities of assigning the target data to the source classifier to obtain category weights. The orange part denotes the output probabilities of the auxiliary domain discriminator to obtain instance weights. The blue part denotes the relation-gate. The red part denotes the traditional adversarial domain classifier. GRL Ganin et al. (2016) denotes the gradient reversal layer.

2 Related Work

Relation Extraction. RE aims to detect and categorize semantic relations between a pair of entities. To reduce the reliance on annotations from human experts, weak supervision and distant supervision have been employed to automatically generate annotations based on KGs (or seed patterns/instances) Zeng et al. (2015); Lin et al. (2016); Ji et al. (2017); He et al. (2018); Zhang et al. (2018b); Zeng et al. (2018); Qin et al. (2018); Zhang et al. (2019). However, all these models focus merely on extracting facts from a single domain, ignoring the rich information in other domains.

Recently, there have been only a few studies on DA for RE Plank and Moschitti (2013); Nguyen et al. (2014); Nguyen and Grishman (2014); Nguyen et al. (2015); Fu et al. (2017). Of these, Nguyen et al. (2014) followed the supervised DA paradigm. In contrast, Plank and Moschitti (2013); Nguyen and Grishman (2014) worked on unsupervised DA. Fu et al. (2017); Rios et al. (2018) presented adversarial learning algorithms for unsupervised DA tasks. However, their methods suffer from the negative transfer bottleneck when they encounter partial DA. To the best of our knowledge, the current approach is the first partial DA work in RE, and even in NLP.

Adversarial Domain Adaptation. Generative adversarial nets (GANs) Goodfellow et al. (2014) have become a popular solution to reduce domain discrepancy through an adversarial objective concerning a domain classifier Ganin et al. (2016); Tzeng et al. (2017); Shen et al. (2018). Recently, only a few DA algorithms Cao et al. (2017); Chen et al. (2018) that can handle imbalanced relation distribution or partial adaptation have been proposed. Cao et al. (2018) proposed a method to alleviate negative transfer by down-weighting the data of outlier source classes at the category level. Zhang et al. (2018a) proposed an adversarial nets-based partial domain adaptation method that identifies the source samples at the instance level.

However, most of these studies concentrate on image classification. There is a lack of systematic research on adopting DA for NLP tasks. Unlike images, text is more diverse and noisier. We believe these methods may transfer to the RE setting, but the exact modifications required and their effects are not apparent. We make the very first attempt to investigate the empirical results of these methods for RE. Moreover, we propose a relation-gate mechanism to explicitly model both coarse-grained and fine-grained knowledge transfer to lower the negative transfer effects from categories and samples.

3 Methodology

3.1 Problem Definition

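We are given a source domain of labeled samples drawn from a source distribution and associated with a set of source classes, and a target domain of unlabeled samples drawn from a target distribution and associated with a set of target classes. In partial DA, the target label space is a subset of the source label space, so the source domain has at least as many classes as the target domain. We refer to classes that are in the source label space but not in the target label space as outlier classes. The goal of this paper is to design a deep neural network that enables learning of transferable features and an adaptive classifier for the target domain.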

3.2 Instance Encoder

Given a sentence as a sequence of words, the input is a matrix consisting of one vector per word, where each vector consists of the word embedding and its position embedding Zeng et al. (2014). We apply a non-linear transformation to this vector representation of the sentence to derive a feature vector. We choose two convolutional neural architectures, CNN Zeng et al. (2014) and PCNN Zeng et al. (2015), to encode input embeddings into instance embeddings. Other neural architectures such as RNN Zhang and Wang (2015) and more sophisticated approaches such as ELMo Peters et al. (2018) or BERT Devlin et al. (2018) can also be used.
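The position information mentioned above can be illustrated with a small sketch. It assumes the position features of Zeng et al. (2014), i.e., the signed distance from each word to the two entities, clipped and shifted so that it can index an embedding table. The function names and the `max_distance` parameter are hypothetical choices for illustration and are not taken from the paper.

```python
from typing import List, Tuple


def relative_positions(num_words: int, head_index: int, tail_index: int) -> List[Tuple[int, int]]:
    """Signed distance of every word to the head and tail entity positions."""
    return [(i - head_index, i - tail_index) for i in range(num_words)]


def position_ids(num_words: int, head_index: int, tail_index: int,
                 max_distance: int = 60) -> List[Tuple[int, int]]:
    """Clip distances to [-max_distance, max_distance] and shift them to be
    non-negative, so they can serve as row indices of an embedding table."""
    def to_id(distance: int) -> int:
        clipped = max(-max_distance, min(max_distance, distance))
        return clipped + max_distance

    return [(to_id(d_head), to_id(d_tail))
            for d_head, d_tail in relative_positions(num_words, head_index, tail_index)]


# Example sentence from the introduction; entity positions are word indices.
tokens = "DeepMind is a subsidiary of Alphabet".split()
features = relative_positions(len(tokens), head_index=0, tail_index=5)
ids = position_ids(len(tokens), head_index=0, tail_index=5, max_distance=3)
```

Each pair of ids would then be looked up in two position embedding tables and concatenated with the word embedding of the corresponding token.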

We adopt unshared feature extractors for the two domains, since unshared extractors are able to capture more domain-specific features Tzeng et al. (2017). We train the source discriminative model for the classification task by learning the parameters of the source feature extractor and the classifier:

(1)

where each source sample is paired with its label and the loss is a classification loss. Afterwards, the parameters of the source feature extractor and the classifier are fixed. Note that it is easy to obtain a pretrained RE model from the source domain, which is convenient in real scenarios.

3.3 Adversarial Domain Adaptation

To address the issue of linguistic variations between domains, we utilize adversarial DA, which is a popular solution in both computer vision Tzeng et al. (2017) and NLP Shah et al. (2018). The general idea of adversarial DA is to learn both class-discriminative and domain-invariant features, where the loss of the label predictor on the source data is minimized while the loss of the domain classifier is maximized.

3.4 Relation-Gate Mechanism

To address the issue of imbalanced relation distribution and partial adaptation, we introduce a relation-gate mechanism to explicitly model instance and category impact in the source domain.

Category Weights Learning. Given that not all classes in the source domain are beneficial and can be adapted to the target domain, it is intuitive to assign different weights to different classes to lower the negative transfer effect of outlier classes in the source domain, as the target label space is a subset of the source label space. For example, given that the relation capital_of in the source domain does not exist in the target domain, as shown in Figure 1, it is necessary to lower the weight of this relation to mitigate negative transfer.

We average the label predictions on all target data from the source classifier as class weights Cao et al. (2018). In practice, the output of the source classifier for a target sample is a probability distribution over the source label space. This distribution characterizes well the probability of assigning the sample to each of the source classes. We average the label predictions over all target data because the source classifier can make mistakes on some target data and assign large probabilities to false classes or even to outlier classes. The weights indicating the contribution of each source class to the training can be computed as follows:

(2)

where the result is a weight vector, with one dimension per source class, quantifying the contribution of each source class.
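The averaging step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the source classifier is assumed to emit one logit per source relation, a softmax turns logits into probabilities, and the logits and relation names below are hypothetical values made up for the example.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one row of logits."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def category_weights(target_logits: List[List[float]]) -> List[float]:
    """Average the source classifier's class probabilities over all target samples.

    The result has one entry per source class; classes that the classifier
    rarely predicts on target data receive a small weight.
    """
    probabilities = [softmax(row) for row in target_logits]
    num_samples = len(probabilities)
    num_classes = len(probabilities[0])
    return [sum(p[k] for p in probabilities) / num_samples for k in range(num_classes)]


# Hypothetical logits for three target sentences over three source relations:
# (subsidiary, was_born_in, capital_of). capital_of plays the outlier class.
target_logits = [
    [2.0, 0.5, -1.0],
    [0.2, 2.5, -0.5],
    [1.8, 1.0, -2.0],
]
weights = category_weights(target_logits)
```

Because every row of probabilities sums to one, the averaged weights also sum to one, and the outlier relation ends up with the smallest weight in this toy example.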

Instance Weights Learning. Although the category weights provide a global weighting mechanism to de-emphasize the effect of outlier classes, different instances have different impacts, and not all instances are transferable. Consider the relation educated_at as an example. Given the instance "James Alty graduated from Liverpool University" from the target domain, a semantically more similar instance such as "Chris Bohjalian graduated from Amherst College" will provide more reference, while a dissimilar instance such as "He was a professor at Reed College where he taught Steve Jobs" may contribute little. It is necessary to learn fine-grained instance weights to lower the effects of samples that are nontransferable.

Given the sentence encoders of the source and target domains, we utilize a pretrained auxiliary domain classifier for instance weights learning. We regard the output of the auxiliary domain classifier, taken at its optimal parameters, as the basis for instance weights. The intuition is that if the activation of the auxiliary domain classifier is large, the sample can be almost perfectly discriminated from the target domain by the discriminator, which means that the sample is likely to be nontransferable Zhang et al. (2018a).

In practice, given the source features learned by the instance encoder, a domain adversarial loss is used to reduce the shift between domains by optimizing the target feature extractor and the auxiliary domain classifier:

(3)

To avoid a degenerate solution, we initialize the target feature extractor using the parameters of the source feature extractor. The auxiliary domain classifier takes as input samples x from both the source and the target domains. If its output for a source sample is large, then it is likely that the sample is nontransferable, because it can be almost perfectly discriminated from the target distribution by the domain classifier. The contribution of these samples should be small. Hence, the weight function should be inversely related to the output of the auxiliary domain classifier, and a natural way to define the weights of the source samples is:

(4)
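To make the inverse relationship concrete, here is a small sketch. The exact form of Equation 4 is not reproduced; the sketch assumes one simple inversely related choice, namely one minus the auxiliary classifier's output, followed by rescaling so that the weights average to one. Both the functional form and the rescaling are assumptions made for illustration, and the probabilities are made-up values.

```python
from typing import List


def instance_weights(source_domain_probs: List[float]) -> List[float]:
    """Turn auxiliary-classifier outputs for source samples into instance weights.

    Each input is the (assumed) probability, in [0, 1], that the classifier
    recognizes the sample as coming from the source domain. A confident
    "source" prediction means the sample is easy to tell apart from target
    data, so it receives a small weight.
    """
    raw = [1.0 - p for p in source_domain_probs]
    mean = sum(raw) / len(raw)
    if mean == 0.0:
        # Every sample was perfectly discriminated; nothing to rescale.
        return raw
    return [w / mean for w in raw]


# Hypothetical outputs: the first sample is clearly source-specific,
# the last one looks much like target data.
probs = [0.95, 0.5, 0.1]
inst_weights = instance_weights(probs)
```

With these inputs the first sample is strongly down-weighted while the last one is emphasized, which is the qualitative behavior the paragraph above asks for.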

Relation-Gate. Both category and instance weights are helpful. However, weights of different granularity contribute differently to different target relations. On the one hand, for target relations (e.g., located_in) with relatively few semantically similar source relations, it is advantageous to strengthen the category weights to reduce the negative effects of outlier classes. On the other hand, for target relations (e.g., educated_in) with many semantically similar source relations (e.g., live_in, was_born_in), it is difficult to differentiate the impact of different source relations at the category level, which indicates the necessity of learning fine-grained instance weights.

For an instance in the source domain with a given label, the weight of this instance is:

(5)

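where the category term is the value in the dimension of the category weight vector that corresponds to the label of the instance. We normalize the resulting weight. The gate value is the output of the relation-gate, which explicitly balances the instance and category weights and is computed as below.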

(6)

where the gate is computed by applying an activation function to the product of a weight matrix and the instance features.
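The gating idea can be sketched as follows. Since Equations 5 and 6 are not reproduced here, the sketch assumes a sigmoid gate over a linear score of the instance features and a convex combination in which a larger gate value favors the instance weight, which matches the behavior described in the case study. The names `relation_gate` and `total_weight` and the parameter values are hypothetical.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def relation_gate(features: List[float], gate_params: List[float], bias: float = 0.0) -> float:
    """Gate value in (0, 1) computed from instance features (assumed form:
    a sigmoid applied to a linear score)."""
    score = sum(f * w for f, w in zip(features, gate_params)) + bias
    return sigmoid(score)


def total_weight(gate: float, instance_weight: float, category_weight: float) -> float:
    """Convex combination: a large gate favors the instance weight,
    a small gate favors the category weight."""
    return gate * instance_weight + (1.0 - gate) * category_weight


# Values taken from Table 4: (category weight, instance weight, gate value).
edu_in = total_weight(gate=0.9, instance_weight=0.7, category_weight=0.4)
capital_of = total_weight(gate=0.1, instance_weight=0.2, category_weight=0.1)

# A gate computed from hypothetical two-dimensional features.
gate_value = relation_gate([0.5, -1.0], [2.0, 1.0])
```

Under this assumed form, the edu_in instance keeps a weight close to its instance weight, while the capital_of instance stays close to its low category weight.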

3.5 Initialization and Implementation Details

The overall objectives of our approach are , and:

(7)

where the objective is the domain adversarial loss. Note that the weights are automatically computed and assigned to the source domain data to de-emphasize the outlier classes and nontransferable instances in partial DA, which can mitigate negative transfer. (The weights can be updated in an iterative fashion during training. However, we found no improvement in experiments, so we compute the weights once and fix them.) The overall training procedure is shown below. (Training details and hyper-parameter settings can be found in the supplementary materials.)

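1. Pre-train the source instance encoder and the relation classifier on the source domain and fix all parameters afterward.

2. Compute category weights by Equation 2.

3. Pre-train the target instance encoder and the auxiliary domain classifier by Equation 3 and compute instance weights, then fix the parameters of the auxiliary domain classifier.

4. Train the target instance encoder and the domain discriminator by Equation 7, updating the parameters of the target instance encoder through the GRL.

Algorithm 1 Overall Training Procedure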
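The gradient reversal layer (GRL) used in the last step of Algorithm 1 can be illustrated with a framework-free sketch. Following Ganin et al. (2016), the layer is the identity in the forward pass and multiplies the incoming gradient by a negative factor in the backward pass. The class name and the `scale` parameter are hypothetical, and a real implementation would hook into the autograd system of a deep learning framework rather than use explicit `forward` and `backward` methods.

```python
from typing import List


class GradientReversal:
    """Identity on the forward pass, sign-flipped gradient on the backward pass."""

    def __init__(self, scale: float = 1.0) -> None:
        # scale controls how strongly the reversed gradient reaches the encoder.
        self.scale = scale

    def forward(self, features: List[float]) -> List[float]:
        # Features pass through unchanged to the domain discriminator.
        return list(features)

    def backward(self, grad_output: List[float]) -> List[float]:
        # The encoder receives the negated discriminator gradient, so a step that
        # lowers the discriminator loss raises it from the encoder's point of view.
        return [-self.scale * g for g in grad_output]


grl = GradientReversal(scale=0.5)
out = grl.forward([0.2, -1.0])
grad = grl.backward([1.0, -2.0])
```

This is why a single backward pass can train the discriminator to separate domains while pushing the encoder toward domain-invariant features.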

4 Experiments

4.1 Datasets and Evaluation

ACE05 Dataset. We use the ACE05 dataset (https://catalog.ldc.upenn.edu/LDC2006T06) to evaluate our approach by dividing the articles from its six genres into respective domains: broadcast conversation (bc), broadcast news (bn), telephone conversation (cts), newswire (nw), usenet (un) and weblogs (wl). We use the same data split as Fu et al. (2017), in which bn and nw are used as the source domain, half of bc, cts, and wl are used as the target domain for training (no label is available in the unsupervised setting), and the other half of bc, cts, and wl are used as the target domain for testing. We hold out 10% of the training set as the development set to tune hyper-parameters. We conducted two kinds of experiments. The first is normal DA, in which the source and target domains have the same classes. The second is partial DA, in which the target domain has only half of the source domain classes.

Wiki-NYT Dataset. For the DS setting, we utilize two existing datasets: NYT-Wikidata Zeng et al. (2017), which aligns Wikidata with the New York Times corpus (NYT), and Wikipedia-Wikidata Sorokin and Gurevych (2017), which aligns Wikidata with Wikipedia. We filter 60 shared relations to construct a new dataset, Wiki-NYT (we will release our dataset), in which Wikipedia is the source domain and the NYT corpus is the target domain. We split the dataset into three sets: 80% training, 10% dev, and 10% test. We conducted partial DA experiments (60 source classes to 30 target classes). We randomly choose half of the classes to sample the target domain data.
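The partial DA setup described above (keep a random half of the relations for the target domain, then split 80/10/10) can be sketched as follows. The helper names, the fixed seed, and the placeholder relation names are assumptions for illustration, and the sketch does not reproduce the authors' actual sampling.

```python
import random
from typing import List, Sequence, Tuple


def sample_target_classes(relations: Sequence[str], fraction: float = 0.5,
                          seed: int = 0) -> List[str]:
    """Randomly keep a fraction of the source relations as the target label space."""
    rng = random.Random(seed)
    keep = int(len(relations) * fraction)
    return sorted(rng.sample(list(relations), keep))


def split_dataset(examples: Sequence, seed: int = 0) -> Tuple[list, list, list]:
    """Shuffle and split into 80% training, 10% dev, and 10% test."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_train = int(0.8 * len(shuffled))
    n_dev = int(0.1 * len(shuffled))
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_dev],
            shuffled[n_train + n_dev:])


# Placeholder names standing in for the 60 shared relations.
relations = [f"relation_{i}" for i in range(60)]
target_relations = sample_target_classes(relations)
train, dev, test = split_dataset(range(1000))
```

Target-domain examples would then be restricted to sentences labeled with one of the sampled relations, while the source domain keeps all 60.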

4.2 Parameter Settings

To fairly compare the results of our models with those baselines, we set most of the experimental parameters following Fu et al. (2017); Lin et al. (2016). We train GloVe Pennington et al. (2014) word embeddings on the Wikipedia and NYT corpus with 300 dimensions. In both the training and test set, we truncate sentences with more than 120 words into 120 words.

4.3 Evaluation Results on ACE05

To evaluate the performance of our proposed approach, we compared our model with various DA models: CNN+R-Gated is our approach, CNN+DANN is an unsupervised adversarial DA method Fu et al. (2017), Hybrid is a composition model that combines a traditional feature-based method, CNN and RNN Nguyen and Grishman (2016), and FCM is a compositional embedding model. From the evaluation results shown in Table 1, we observe that (1) our model achieves performance comparable to that of CNN+DANN, which is a state-of-the-art model, in the normal DA scenario and significantly outperforms the vanilla models without adversarial learning. This shows that domain adversarial learning is effective for learning domain-invariant features to boost performance. (2) Our model significantly outperforms the plain adversarial DA model, CNN+DANN, in partial DA. This demonstrates the efficacy of our hybrid weights mechanism. (Since the adversarial DA method significantly outperforms traditional methods Fu et al. (2017), we skip the performance comparison with FCM and Hybrid for partial DA.)

Normal DA      bc       wl       cts      avg
FCM            61.90    N/A      N/A      N/A
Hybrid         63.26    N/A      N/A      N/A
CNN+DANN       65.16    55.55    57.19    59.30
CNN+R-Gated    66.15*   56.56*   56.10    59.60*

Partial DA     bc       wl       cts      avg
CNN+DANN       63.17    53.55    53.32    56.68
CNN+R-Gated    65.32*   55.53*   54.52*   58.92*

Table 1: F1 score of normal and partial DA on ACE05 dataset. * indicates statistical significance under t-test evaluation.

4.4 Evaluation Results on Wiki-NYT

For DS setting, we consider the setting of (1) unsupervised adaptation in which the target labels are removed, (2) supervised adaptation in which the target labels are retained to fine-tune our model.

Unsupervised Adaptation. Target labels are unnecessary in unsupervised adaptation. We report the results of our approach and various baselines: PCNN+R-Gated is our unsupervised adaptation approach, and PCNN (No DA) and CNN (No DA) are the methods trained on the source domain by PCNN Lin et al. (2016) and CNN Zeng et al. (2014) and tested on the target domain. Following Lin et al. (2016), we perform both held-out evaluation, reported as the precision-recall curves shown in Figure 3, and manual evaluation, in which we manually check the top-500 prediction results, as shown in Table 2.
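The top-N precision used in the manual evaluation can be computed as in the sketch below. It assumes that each prediction carries a confidence score and a manually judged correctness flag; the function name and the sample data are hypothetical.

```python
from typing import List, Tuple


def precision_at_n(predictions: List[Tuple[float, bool]], n: int) -> float:
    """Precision among the n most confident predictions.

    Each prediction is a (confidence, is_correct) pair, where is_correct
    comes from manual checking.
    """
    ranked = sorted(predictions, key=lambda item: item[0], reverse=True)
    top = ranked[:n]
    if not top:
        return 0.0
    return sum(1 for _, correct in top if correct) / len(top)


# Hypothetical manually checked predictions.
predictions = [(0.95, True), (0.40, False), (0.90, True), (0.70, True), (0.80, False)]
top2 = precision_at_n(predictions, 2)
top4 = precision_at_n(predictions, 4)
```

Table 2 reports this quantity for N = 100, 200 and 500, together with the average of the three values.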

We observe that (1) our approach achieves the best performance among the unsupervised DA models, including CNN+DANN. This further demonstrates the effectiveness of the hybrid weights mechanism. (2) Our unsupervised DA model nearly matches the performance of the supervised CNN approach; however, it does not outperform PCNN. This setting could be advantageous because, in many practical applications, the knowledge bases in a vertical target domain may not exist at all or must be built from scratch.

Figure 3: Unsupervised adaptation results.
Figure 4: Supervised adaptation results.

Supervised Adaptation. Supervised adaptation does require labeled target data; however, the target labels might be few or noisy. In this setting, we fine-tune our model with target labels. We report the results of our approach and various baselines: PCNN+R-Gated with a percentage suffix (e.g., +25%) denotes fine-tuning our model using that percentage of the target domain data, PCNN and CNN are the methods trained on the target domain by PCNN Lin et al. (2016) and CNN Zeng et al. (2014), and Rank+ExATT is the method trained on the target domain that integrates PCNN with a pairwise ranking framework Ye et al. (2017).

As shown in Figure 4 and Table 2, we observe that (1) our fine-tuned model +100% outperforms both CNN and PCNN and achieves results comparable to those of Rank+ExATT. The case study results in Table 5 further show that our model can correct noisy labels to some extent due to the relatively high quality of the source domain data. (2) The improvement from using 0% to 25% of the target training data is consistently larger than that of later increments, such as from 25% to 50%, and the fine-tuned model with only thousands of labeled samples (+25%) matches the performance of training from scratch with 10 times more data, clearly demonstrating the benefit of our approach. (3) The top 100 precision of the fine-tuned model degrades from 75% to 100%. This indicates that there exist noisy data that contradict the data from the source domain. We will address this by adopting additional denoising mechanisms such as reinforcement learning, which will be part of our future work.

Precision       Top 100   Top 200   Top 500   Avg.
CNN (No DA)     0.62      0.60      0.59      0.60
PCNN (No DA)    0.66      0.63      0.61      0.63
CNN+DANN        0.80      0.75      0.67      0.74
CNN             0.85      0.80      0.69      0.78
PCNN            0.87      0.84      0.74      0.81
Rank+ExATT      0.89      0.84      0.73      0.82
PCNN+R-Gated    0.85*     0.83*     0.73*     0.80*
+25%            0.88      0.84      0.75      0.82
+50%            0.89      0.85      0.76      0.82
+75%            0.90      0.85      0.77      0.83
+100%           0.88      0.86*     0.77*     0.83*

Table 2: Precision values of the top 100, 200 and 500 sentences for unsupervised and supervised adaptation. * indicates statistical significance under t-test evaluation.

4.5 Ablation Study

To better demonstrate the performance of different strategies in our model, we separately remove the category and instance weights. The experimental results on the Wiki-NYT dataset are summarized in Table 3. PCNN+R-Gated is our method; w/o gate is the method without the relation-gate (the gate value is fixed); w/o category is the method without category weights; w/o instance is the method without instance weights; w/o both is the method without both weights. We observe that (1) the performance significantly degrades when we remove the relation-gate. This is reasonable because the category and instance weights play different roles for different relations, while w/o gate treats the weights equally, which hurts the performance. (2) The performance degrades when we remove the category weights or the instance weights. This is reasonable because different weights have different effects in de-emphasizing outlier classes or instances.

Precision       Top 100   Top 200   Top 500   Avg.
PCNN+R-Gated    0.85*     0.83*     0.73*     0.80*
w/o gate        0.85      0.79      0.70      0.78
w/o category    0.81      0.76      0.66      0.74
w/o instance    0.84      0.78      0.69      0.77
w/o both        0.80      0.75      0.65      0.73

Table 3: Precision values of the top 100, 200 and 500 sentences for ablation study. * indicates statistical significance under t-test evaluation.

4.6 Parameter Analysis

Relation-Gate. To further explore the effects of the relation-gate, we visualize the gate value for all target relations on the Wiki-NYT dataset. From the results shown in Figure 5 (a), we observe the following: (1) The instance and category weights have different influences on performance for different relations. Our relation-gate mechanism is able to find that instance weights are more important for some relations (e.g., educated_at and live_in, i.e., the relations with the highest scores in Figure 5 (a)), while category weights are more useful for other relations. (2) The category weights have relatively more influence on performance than the instance weights for most of the relations. This is due to the noise and variations in instances, whereas the category weights are averaged over all target data and are thus less noisy.

(a) Gate value w.r.t. #Relations
(b) F1 w.r.t. #Target Classes
Figure 5: Parameter analysis results.

Different Number of Target Classes. We investigated partial DA by varying the number of target classes in the Wiki-NYT dataset. (The target classes are sampled three times randomly, and the results are averaged.) Figure 5 (b) shows that when the number of target classes decreases, the performance of CNN+DANN degrades quickly, implying severe negative transfer. We observe that PCNN+R-Gated outperforms CNN+DANN when the number of target classes decreases. Note that PCNN+R-Gated performs comparably to CNN+DANN in standard DA, when the number of target classes is 60. This means that the weights mechanism will not wrongly filter out classes when there are no outlier classes.

4.7 Case Study

We select samples from shared and outlier relations for detailed analysis in Case 1, Case 2 and Case 3, and give examples to show that our model can correct noisy labels in Case 4.

Case 1: Relation-gate. We give some examples of how our relation-gate balances the weights for classes and instances. In Table 4, we display the gate values of different relations. For the relation capital_of, there are many dissimilar relations, so category weights are more important, which results in a small gate value. For the relation educated_in (edu_in), the instance difference is more important, so the gate value is relatively large.

Case 2: Category Weights. We give some examples of how our approach assigns different weights to classes to mitigate the negative effect of outlier classes. In Table 4, we display the sentences from shared classes and outlier classes. The relation capital_of is an outlier class, whereas director is a shared class. We observe that our model can automatically find outlier classes and assign lower weights to them.

Instances | Relations | Category weight | Instance weight | Gate value
He was born in Rio_de_Janeiro, Brazil to a German father and a Panamanian mother. | capital_of | 0.1 | 0.2 | 0.1
In 2014, he made his Tamil film debut in Malini_22_Palayamkottai directed by Sripriya. | director | 0.7 | 0.8 | 0.4
Sandrich was replaced by George_Stevens for the team's 1935 film The_Nitwits. | director | 0.7 | 0.5 | 0.5
Camp is a 2003 independent musical_film written and directed by Todd_Graff. | director | 0.7 | 0.3 | 0.4
Chris Bohjalian graduated from Amherst College | edu_in | 0.4 | 0.7 | 0.9

Table 4: Examples for Case 1, 2 and 3. The three numeric columns give the category weight, the instance weight and the relation-gate value, respectively.

Case 3: Instance Weights. We give some examples of how our approach assigns different weights to instances to de-emphasize the nontransferable samples. In Table 4, we observe that (1) our model can automatically assign lower weights to instances in outlier classes. (2) Our model can assign different weights to instances in the same class to down-weight the negative effect of nontransferable instances. (3) Although our model can identify some nontransferable instances, it still assigns incorrect weights to some instances (the last row in Table 4) that are semantically similar and transferable. We will address this by adopting additional mechanisms such as transferable attention Wang et al. (2019), which will be part of our future work.

Instances | DS | R-Gated
"They are trying to create a united front at home in the face of the pressures Syria is facing," said Sami Moubayed, a political analyst and writer here. | p_of_birth | NA
"Iran injected Syria with much confidence: stand up, show defiance," said Sami Moubayed, a political analyst and writer in Damascus. | p_of_birth | NA

Table 5: Examples for Case 4. p_of_birth is short for place_of_birth.

Case 4: Noise Reduction. We give some examples of how our approach corrects the noisy labels of the target domain. In Table 5, we display sentences that are wrongly labeled in the DS setting and show the labels predicted by our approach. We observe that our model can correct some noisy labels, verifying that our model can be used to adapt from a source domain with high-quality labels to a target domain with noisy distant labels. This is reasonable because Wikidata is only partly aligned with the NYT corpus, and entity pairs with fewer sentences are more likely to be false positives, which is the major noise factor. In contrast, Wikidata is relatively better aligned with Wikipedia, which yields more true positive samples.

5 Conclusion and Future Work

In this paper, we propose a novel model of relation-gated adversarial learning for RE. Extensive experiments demonstrate that our model achieves results that are comparable to those of state-of-the-art DA baselines and can improve the accuracy of distantly supervised RE through fine-tuning. In the future, we intend to improve DA using only a small amount of supervision, namely few-shot adversarial DA. It will also be promising to apply our method to other NLP scenarios.

References

  • Z. Cao, M. Long, J. Wang, and M. I. Jordan (2017) Partial transfer learning with selective adversarial networks. arXiv preprint arXiv:1707.07901. Cited by: §2.
  • Z. Cao, L. Ma, M. Long, and J. Wang (2018) Partial adversarial domain adaptation. In Proceedings of ECCV, Vol. 1, pp. 4. Cited by: §2, §3.4.
  • Q. Chen, Y. Liu, Z. Wang, I. Wassell, and K. Chetty (2018) Re-weighted adversarial adaptation network for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7976–7985. Cited by: §2.
  • Z. Dai, L. Li, and W. Xu (2016) CFO: conditional focused neural question answering with large-scale knowledge bases. Cited by: §1.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §3.2.
  • L. Fu, T. H. Nguyen, B. Min, and R. Grishman (2017) Domain adaptation for relation extraction with domain adversarial neural network. In Proceedings of IJCNLP, Vol. 2, pp. 425–429. Cited by: §2, §4.1, §4.2, §4.3, footnote 5.
  • Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky (2016) Domain-adversarial training of neural networks. JMLR 17 (1), pp. 2096–2030. Cited by: Figure 2, §2.
  • I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In processing of NIPS, pp. 2672–2680. Cited by: §2.
  • Z. He, W. Chen, Z. Li, M. Zhang, W. Zhang, and M. Zhang (2018) SEE: syntax-aware entity embedding for neural relation extraction. In Proceedings of AAAI. Cited by: §2.
  • G. Ji, K. Liu, S. He, J. Zhao, et al. (2017) Distant supervision for relation extraction with sentence-level attention and entity descriptions. In Proceedings of AAAI, pp. 3060–3066. Cited by: §2.
  • Y. Lin, S. Shen, Z. Liu, H. Luan, and M. Sun (2016) Neural relation extraction with selective attention over instances. In Proceedings of ACL, Vol. 1, pp. 2124–2133. Cited by: §2, §4.2, §4.4, §4.4.
  • M. Mintz, S. Bills, R. Snow, and D. Jurafsky (2009) Distant supervision for relation extraction without labeled data. In Proceedings of ACL, pp. 1003–1011. Cited by: §1.
  • N. Nakashole, M. Theobald, and G. Weikum (2011) Scalable knowledge harvesting with high precision and high recall. In Proceedings of WSDM, pp. 227–236. Cited by: §1.
  • M. L. Nguyen, I. W. Tsang, K. M. A. Chai, and H. L. Chieu (2014) Robust domain adaptation for relation extraction via clustering consistency. In Proceedings of ACL, Vol. 1, pp. 807–817. Cited by: §2.
  • T. H. Nguyen and R. Grishman (2014) Employing word representations and regularization for domain adaptation of relation extraction. In Proceedings of ACL, Vol. 2, pp. 68–74. Cited by: §2.
  • T. H. Nguyen and R. Grishman (2016) Combining neural networks and log-linear models to improve relation extraction. In Proceedings of IJCAI Workshop DLAI. Cited by: §4.3.
  • T. H. Nguyen, B. Plank, and R. Grishman (2015) Semantic representations for domain adaptation: a case study on the tree kernel-based method for relation extraction. In Proceedings of ACL, Vol. 1, pp. 635–644. Cited by: §2.
  • S. J. Pan, Q. Yang, et al. (2010) A survey on transfer learning. TKDE 22 (10), pp. 1345–1359. Cited by: §1.
  • J. Pennington, R. Socher, and C. Manning (2014) GloVe: global vectors for word representation. In Proceedings of EMNLP, pp. 1532–1543. Cited by: §4.2.
  • M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. arXiv preprint arXiv:1802.05365. Cited by: §3.2.
  • B. Plank and A. Moschitti (2013) Embedding semantic similarity in tree kernels for domain adaptation of relation extraction. In Proceedings of ACL, Vol. 1, pp. 1498–1507. Cited by: §2.
  • P. Qin, W. Xu, and W. Y. Wang (2018) DSGAN: generative adversarial training for distant supervision relation extraction. In Proceedings of ACL. Cited by: §2.
  • A. Rios, R. Kavuluru, and Z. Lu (2018) Generalizing biomedical relation classification with neural adversarial domain adaptation. Bioinformatics 1, pp. 9. Cited by: §2.
  • M. T. Rosenstein, Z. Marx, L. P. Kaelbling, and T. G. Dietterich (2005) To transfer or not to transfer. In NIPS 2005 workshop on transfer learning, Vol. 898, pp. 1–4. Cited by: 2nd item.
  • D. J. Shah, T. Lei, A. Moschitti, S. Romeo, and P. Nakov (2018) Adversarial domain adaptation for duplicate question detection. arXiv preprint arXiv:1809.02255. Cited by: §3.3.
  • J. Shen, Y. Qu, W. Zhang, and Y. Yu (2018) Wasserstein distance guided representation learning for domain adaptation. In Proceedings of AAAI. Cited by: §2.
  • D. Sorokin and I. Gurevych (2017) Context-aware representations for knowledge base relation extraction. In Proceedings of EMNLP, pp. 1784–1789. Cited by: §4.1.
  • E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell (2017) Adversarial discriminative domain adaptation. In Proceedings of CVPR, Vol. 1, pp. 4. Cited by: §2, §3.2, §3.3.
  • X. Wang, L. Li, W. Ye, M. Long, and J. Wang (2019) Transferable attention for domain adaptation. Cited by: §4.7.
  • F. Wu and D. S. Weld (2010) Open information extraction using Wikipedia. In Proceedings of ACL, pp. 118–127. Cited by: §1.
  • H. Ye, W. Chao, Z. Luo, and Z. Li (2017) Jointly extracting relations with class ties via effective deep ranking. In Proceedings of ACL, Vol. 1, pp. 1810–1820. Cited by: §4.4.
  • D. Zeng, K. Liu, Y. Chen, and J. Zhao (2015) Distant supervision for relation extraction via piecewise convolutional neural networks. In Proceedings of EMNLP, pp. 1753–1762. Cited by: §2, §3.2.
  • D. Zeng, K. Liu, S. Lai, G. Zhou, and J. Zhao (2014) Relation classification via convolutional deep neural network. In Proceedings of COLING, pp. 2335–2344. Cited by: §3.2, §4.4, §4.4.
  • W. Zeng, Y. Lin, Z. Liu, and M. Sun (2017) Incorporating relation paths in neural relation extraction. In Proceedings of EMNLP. Cited by: §4.1.
  • X. Zeng, S. He, K. Liu, and J. Zhao (2018) Large scaled relation extraction with reinforcement learning. In Proceedings of AAAI, Vol. 2, pp. 3. Cited by: §2.
  • D. Zhang and D. Wang (2015) Relation classification via recurrent neural network. arXiv preprint arXiv:1508.01006. Cited by: §3.2.
  • J. Zhang, Z. Ding, W. Li, and P. Ogunbona (2018a) Importance weighted adversarial nets for partial domain adaptation. In Proceedings of CVPR, pp. 8156–8164. Cited by: §2, §3.4.
  • N. Zhang, S. Deng, Z. Sun, G. Wang, X. Chen, W. Zhang, and H. Chen (2019) Long-tail relation extraction via knowledge graph embeddings and graph convolution networks. arXiv preprint arXiv:1903.01306. Cited by: §2.
  • N. Zhang, S. Deng, Z. Sun, X. Chen, W. Zhang, and H. Chen (2018b) Attention-based capsule networks with dynamic routing for relation extraction. In Proceedings of EMNLP. Cited by: §2.