Given a piece of text $t$ mentioning an entity pair $(e_1, e_2)$, the goal of relation extraction (RE) is to predict the relationship between $e_1$ and $e_2$ that can be inferred from $t$. Creating datasets for RE is difficult because human annotators must label each instance. One way to mitigate this bottleneck is to employ distant supervision (DS) (Mintz et al., 2009). Let $B(e_1, e_2)$ denote the set of all instances (pieces of text) in the corpus mentioning the entity pair $(e_1, e_2)$, and let $R(e_1, e_2)$ denote all the directed relation edges from node $e_1$ to node $e_2$ in the KB of the corpus. Distantly supervised relation extraction (DS-RE), under the “at-least-one” framework (Hoffmann et al., 2011), assumes that for every $r \in R(e_1, e_2)$ there exists at least one instance in $B(e_1, e_2)$ that expresses the relation $r$ between $e_1$ and $e_2$.
Under the MI-ML framework (Surdeanu et al., 2012), models for DS-RE are trained and evaluated on the task of predicting all possible relation classes for the ordered pair $(e_1, e_2)$ using multiple instances from $B(e_1, e_2)$ (often called a “bag” in the literature).
Early advances in MI-ML DS-RE revealed that aggregating instance representations and then performing multi-label classification improves performance, as opposed to classifying each instance separately and then pooling the results for the bag. A notable paper in this direction introduced intra-bag attention (Lin et al., 2016) for the instance-aggregation step.
The intra-bag attention formulation has been used in many state-of-the-art models for DS-RE, although it is often treated as a black box. In all such models, there is a clear distinction between the instance-encoding and instance-aggregation steps, and the innovation is often limited to the instance-encoding part. Let $\{s_1, \dots, s_n\}$ denote instances sampled from the bag $B(e_1, e_2)$ of the entity pair $(e_1, e_2)$. In all models using intra-bag attention for instance aggregation, each $s_j$ is independently encoded to form the instance representation $h_j$, following which the relation triple representation for the triple $(e_1, r, e_2)$ is given by $b_r = \sum_{j=1}^{n} \alpha_j^{r} h_j$. Here $r$ is any one of the relation classes present in the dataset and $\alpha_j^{r}$ is the normalized attention score allotted to instance representation $h_j$ by the relation query vector $q_r$ for relation $r$. The model then predicts whether the relation triple is a valid one by sending each $b_r$ through a feed-forward neural network. In some variants, $q_r$ is replaced with a query vector $q$ shared across all relation classes, resulting in a bag representation corresponding to $(e_1, e_2)$ as opposed to a triple representation.
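The aggregation step described above can be illustrated with a minimal sketch. It assumes plain dot-product scoring between the query and each instance vector followed by a softmax; the original intra-bag attention of Lin et al. (2016) may use a different scoring function, and the names `intra_bag_attention`, `instances` and `query` are hypothetical.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def intra_bag_attention(instance_vecs, query):
    """Weighted sum of independently encoded instance vectors.

    Assumption: attention scores are dot products with the relation query.
    """
    alphas = softmax([dot(h, query) for h in instance_vecs])
    dim = len(instance_vecs[0])
    pooled = [sum(a * h[d] for a, h in zip(alphas, instance_vecs))
              for d in range(dim)]
    return pooled, alphas

# Toy bag with three 2-dimensional instance representations.
instances = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]  # hypothetical query vector for one relation class
triple_vec, alphas = intra_bag_attention(instances, query)
```

Note that each instance is encoded before this step, so tokens of different instances never interact; only whole-instance vectors are mixed.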
The best performing models in DS-RE include RESIDE (Vashishth et al., 2018), DISTRE (Alt et al., 2019), REDSandT (Christou and Tsoumakas, 2021) and the current state-of-the-art for the task, CIL (Chen et al., 2021). All of these models use intra-bag attention for instance aggregation, with differences in the choice of encoders, pre-training, use of side-information and loss function.
The last three models mentioned above use pre-trained language models for encoding instances. Consider BERT (Devlin et al., 2019), a widely used pre-trained language model. BERT has enough capacity to generate contextualized embeddings of every token in a multi-sentence paragraph (of 512 tokens or less). Further, its pre-training objective includes the “next sentence prediction” (NSP) task, where 50% of the time, two sentences from different contexts are encoded together. These two properties suggest that encoders like BERT should have enough capacity to encode multiple instances from a bag in one go.
We test this hypothesis by formulating a new instance-aggregation scheme called Passage-Att. Under this scheme, we first construct a “passage of facts”, say $P(e_1, e_2)$, by concatenating all the sentences mentioning a particular entity pair $(e_1, e_2)$. The passage is sent through a transformer encoder to generate contextualized embeddings of each token. Using a relation query vector $q_r$ for a particular relation class $r$, we summarize the passage using dot-product attention and generate the triple representation for the triple $(e_1, r, e_2)$. We then use a binary classifier (shared across all relation classes) to predict whether $(e_1, r, e_2)$ is a valid triple or not. Such a formulation has the following additional properties compared with intra-bag attention based aggregation:
Tokens present in one sentence are able to exchange information with tokens present in any other sentence in the bag during the self-attention step in every layer of the BERT encoder.
Under our framework, the “at-least-one” assumption is slightly modified: for every relation $r$ that holds between $e_1$ and $e_2$ in the KB, there exists a sub-section of the passage that expresses the relation $r$ between $e_1$ and $e_2$. The tokens present in such a sub-section need not come from the same sentence. This is a softer assumption than the previous variant.
We train and test a BERT+Passage-Att model on three datasets: NYT-10d (Riedel et al., 2010), NYT-10m (Gao et al., 2021) and DiS-ReX (Bhartiya et al., 2021). The three datasets represent three different settings: (1) training and testing on a DS dataset, (2) training on a DS train set but testing on a manually annotated test set, and (3) training and testing on a multilingual DS dataset. Our model achieves state-of-the-art performance in each of the three settings. To encourage replicability of results, we publicly release our code and model (https://github.com/dair-iitd/DSRE).
2 Related Works
Mintz et al. (2009) proposed distant supervision as an alternative to manually annotating relation extraction datasets. The original paper also proposed a multi-class logistic regression model trained and tested on a distantly supervised dataset. NYT-10d (Riedel et al., 2010) is the premier dataset for this task and is still widely used as a benchmark for DS-RE models. One of the first neural network approaches to DS-RE under the MI-ML framework was PCNN (piece-wise convolutional neural network) (Zeng et al., 2015). Here the multi-label classification scores of each instance were calculated independently and then pooled together using max-pooling. Lin et al. (2016) showed that by aggregating instance representations using intra-bag attention, one can obtain a significant improvement in performance while using the same instance encoder. RESIDE, DISTRE, REDSandT and CIL belong to the category of models using intra-bag attention for instance aggregation. Each of these models obtained the then state-of-the-art performance on the NYT-10d dataset. RESIDE and REDSandT use side-information in addition to the sentences present in the dataset. Examples of side-information include entity-type information, sub-tree parses and relation aliases. DISTRE, on the other hand, pre-trains its instance encoder (OpenAI GPT; Radford et al., 2018) on the language modeling task using the sentences present in the DS-RE dataset before fine-tuning for the downstream task. The current state-of-the-art on mono-lingual DS-RE is CIL, which uses masked language modeling and contrastive learning losses as auxiliary losses during training. We compare these models with BERT+Passage-Att. We note that our model does not use additional pre-training, side-information or any auxiliary losses during training.
For multilingual DS-RE, DiS-ReX (Bhartiya et al., 2021) is the premier benchmarking dataset. Current baselines on the dataset again use intra-bag attention for instance aggregation. The state-of-the-art on DiS-ReX (taken from https://github.com/dair-iitd/DiS-ReX) uses the instance aggregation scheme first formulated in MNRE (Lin et al., 2017). MNRE slightly modifies the intra-bag attention scheme by separately attending over sentences of the same language inside a bag. Each (relation, language) tuple is assigned a query vector for the attention aggregation step. A bag is divided into sub-bags where each sub-bag contains the instances of the same language. In essence, a bag has $L$ sub-bags and each relation class corresponds to $L$ query vectors, where $L$ denotes the number of languages present in the dataset. These are then used to construct triple representations (using attention aggregation), which are scored independently. The final confidence score for a triple is the average of these triple scores. Since our aggregation scheme does not contain any language-specific machinery, we also train an mBERT+Passage-Att model on DiS-ReX and compare the results with the previously reported benchmarks on the dataset.
Evaluating DS-RE models poses yet another difficulty, since the test sets of DS-RE datasets are also automatically annotated, resulting in a significant amount of noise in the gold labels. Recent efforts to improve the evaluation protocol of DS-RE models include the work of Gao et al. (2021), who released the NYT-10m dataset, which has a manually annotated test set. In this setting, the models are trained on a distantly supervised train set and tested on a manually annotated test set. We compare our model with the already established baselines on the dataset and with CIL (the current state-of-the-art on NYT-10d).
3 Model Architecture
Our instance aggregation scheme is divided into three steps: Passage Construction, Passage Encoding and Passage Summarization. We describe each of the steps below. Figure 1 shows the sequence of steps as a flow diagram.
3.1 Passage construction
In this step, we construct a “passage of instances”, $P(e_1, e_2)$, for entity pair $(e_1, e_2)$ by selecting sentences from the bag $B(e_1, e_2)$. We initiate the passage construction process by randomly sampling instances from the bag without replacement. We terminate the sampling process if a) adding a new instance would exceed the maximum number of tokens allowed by the encoder (e.g., for BERT the limit is 512 tokens) or b) all of the instances from the bag have been added. We show in section 4 that truncating the passage does not affect model performance. In Figure 1, the leftmost arrow denotes the passage construction, where the bag contains three instances $\{s_1, s_2, s_3\}$. Here each instance $s_i$ is a list of tokens. We add a [CLS] token at the start of the passage and [SEP] tokens to separate the tokens of two adjacent instances. Entity mentions in an instance $s_i$ are surrounded by special tokens which denote head and tail entity markers. Finally, [PAD] tokens are appended to the passage (to ensure each data point has an equal number of tokens).
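The construction procedure above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the released implementation: instances are already tokenized, the entity markers `[H]`, `[/H]`, `[T]`, `[/T]` are hypothetical names, and a [SEP] is placed after every instance (including the last one). The function name `build_passage` and the `seed` parameter are also hypothetical.

```python
import random

CLS, SEP, PAD = "[CLS]", "[SEP]", "[PAD]"

def build_passage(bag, max_len=512, seed=0):
    """Concatenate randomly ordered instances until the token budget is hit."""
    rng = random.Random(seed)
    # Random order without replacement.
    order = rng.sample(range(len(bag)), len(bag))
    passage = [CLS]
    for idx in order:
        instance = bag[idx]
        # +1 accounts for the [SEP] that follows the instance.
        if len(passage) + len(instance) + 1 > max_len:
            break  # adding this instance would exceed the encoder limit
        passage.extend(instance)
        passage.append(SEP)
    n_real = len(passage)
    # Pad so that every data point has the same number of tokens.
    passage.extend([PAD] * (max_len - n_real))
    return passage, n_real

# Toy bag: two instances of 9 tokens each, with hypothetical entity markers.
bag = [
    ["[H]", "Obama", "[/H]", "was", "born", "in", "[T]", "Hawaii", "[/T]"],
    ["[T]", "Hawaii", "[/T]", "is", "home", "to", "[H]", "Obama", "[/H]"],
]
passage, n_real = build_passage(bag, max_len=16)
```

With a budget of 16 tokens only one of the two instances fits, so the passage holds [CLS], nine instance tokens, one [SEP] and five [PAD] tokens. The sketch does not handle a single instance that is longer than the budget on its own.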
3.2 Passage encoding
The constructed passage is then sent through the encoder to generate contextualized embeddings of each token in the passage. [PAD] tokens are ignored during the self-attention update in each layer of the encoder. This step corresponds to the middle section of Figure 1.
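Ignoring [PAD] tokens during self-attention is commonly expressed as a binary attention mask with 1 for real tokens and 0 for padding, which is the convention used by the Transformers library. The helper below is a small sketch of how such a mask could be derived from a token list; the function name is hypothetical.

```python
def attention_mask(tokens, pad_token="[PAD]"):
    # 1 marks tokens the encoder may attend to, 0 marks padding.
    return [0 if tok == pad_token else 1 for tok in tokens]

mask = attention_mask(["[CLS]", "born", "in", "[SEP]", "[PAD]", "[PAD]"])
```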
3.3 Passage summarization
We initialize relation query vectors for the passage-summarization step. Each relation class in the dataset has a query vector associated with it. These vectors are trained along with the rest of the model. Let the set of all relation query vectors be denoted by $\{q_1, \dots, q_k\}$, where $k$ denotes the number of classes in the dataset. For a given bag, let the contextualized embeddings of the tokens in the passage be $(x_1, \dots, x_T)$, where $T$ denotes the maximum sequence length. The passage summary for relation class $r$ is given by $z_r = \sum_{i=1}^{T} \alpha_i^{r} x_i$. Here $\alpha_i^{r}$ is the normalized attention weight allotted to token $i$ by the relation query vector $q_r$ (for relation class $r$) using dot-product attention ($q_r$ as the query and $x_i$ as the key). We note that $(x_1, \dots, x_T)$ also includes the embeddings of the [PAD] tokens (i.e., they are also considered during passage summarization). We justify the choice of including [PAD] tokens in the summarization step in section 4. This passage summarization step corresponds to the rightmost section of Figure 1.
After the three steps, a passage summary vector is obtained for each relation class $r$ in the dataset. These vectors are independently passed through a binary classifier followed by a sigmoid activation function to return the confidence for the triple $(e_1, r, e_2)$. The binary classifier is shared across all relation classes.
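A small sketch of the summarization and scoring steps follows. It assumes unscaled dot-product scores normalized with a softmax, and it stands in a single linear layer for the shared binary classifier; both are assumptions, and all names (`summarize_passage`, `score_triples`, `w`, `b`) are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def summarize_passage(token_embs, rel_queries):
    """One attention-weighted summary of the token embeddings per relation."""
    summaries = []
    for q in rel_queries:
        # [PAD] embeddings are part of token_embs and receive weight too.
        alphas = softmax([dot(q, x) for x in token_embs])
        dim = len(q)
        summaries.append([sum(a * x[d] for a, x in zip(alphas, token_embs))
                          for d in range(dim)])
    return summaries

def score_triples(summaries, w, b):
    # The same (w, b) is applied to every relation's summary.
    return [1.0 / (1.0 + math.exp(-(dot(w, z) + b))) for z in summaries]

# Toy passage of three token embeddings; the last plays the role of [PAD].
token_embs = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
rel_queries = [[5.0, 0.0], [0.0, 5.0]]  # two relation classes
summaries = summarize_passage(token_embs, rel_queries)
probs = score_triples(summaries, w=[1.0, -1.0], b=0.0)
```

Because the classifier parameters are shared, the only relation-specific parameters in this sketch are the query vectors.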
4 Experiments
We ask the following questions with regard to our proposed aggregation scheme for DS-RE:
How does our model compare against existing baselines for DS-RE?
How does our model (trained on distantly supervised data) perform when evaluated on manually annotated test data?
How does our model perform on the multilingual DS-RE task?
4.1 Results on popular benchmarking datasets
We train and test our model on two NVIDIA GeForce GTX 1080 Ti GPUs with a batch size of 16. We use a learning rate of 2e-5 and a weight decay of 1e-5 with AdamW (Loshchilov and Hutter, 2017; Kingma and Ba, 2014) as the optimizer. Our implementation uses PyTorch (Paszke et al., 2019), the Transformers library (Wolf et al., 2019) and OpenNRE (https://github.com/thunlp/OpenNRE; Han et al., 2019). We use the bert-base-uncased checkpoint for BERT initialization in the mono-lingual setting. For the multi-lingual setting, we use bert-base-multilingual-uncased.
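For reference, the reported settings can be collected into a small configuration object. This is only a convenience sketch: the class and field names are hypothetical and are not taken from the released code, and the values are those stated in the paragraph above plus the 512-token encoder limit mentioned in section 3.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    encoder_checkpoint: str = "bert-base-uncased"
    batch_size: int = 16
    learning_rate: float = 2e-5
    weight_decay: float = 1e-5
    optimizer: str = "AdamW"
    max_seq_len: int = 512

MONO_LINGUAL = TrainConfig()
# Only the checkpoint changes for the multi-lingual setting.
MULTI_LINGUAL = TrainConfig(encoder_checkpoint="bert-base-multilingual-uncased")
```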
The results presented in tables 1, 2 and 3 suggest that our model establishes new state-of-the-art performance on all three datasets. On NYT-10d, our model narrowly beats CIL in AUC, while showing significant improvements in P100, P200 and P300. This is also reflected in the PR curves (figure 2), where ours is the only model that achieves close to 100% precision for some threshold values. Our model beats REDSandT by 9 points in AUC, even though a) both are BERT-based models and b) REDSandT uses side-information in addition to the sentences present in the dataset (e.g., entity-type information, dependency parses, etc.).
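The P100, P200 and P300 figures refer to precision among the N most confident predicted triples. A minimal sketch of that computation is given below, assuming each prediction is a pair of a confidence score and a flag saying whether the triple is correct; the function name and the toy data are hypothetical.

```python
def precision_at_n(scored_triples, n):
    """Fraction of correct triples among the n highest-scoring predictions."""
    ranked = sorted(scored_triples, key=lambda t: t[0], reverse=True)
    top = ranked[:n]
    return sum(1 for _, is_correct in top if is_correct) / len(top)

# Toy predictions: (confidence, is_correct).
preds = [(0.9, True), (0.2, False), (0.8, True), (0.6, True), (0.7, False)]
p_at_2 = precision_at_n(preds, 2)
p_at_4 = precision_at_n(preds, 4)
```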
On NYT-10m, we compare our model with the BERT+Att, BERT+Avg and BERT+One schemes, presented in Gao et al. (2021) as the best performing models on NYT-10m. Each of the three models uses BERT to construct instance representations. BERT+Att aggregates the instance representations using the standard intra-bag attention formulation. BERT+Avg, on the other hand, weighs each instance representation uniformly, so that the bag representation is the average of the instance representations. Finally, BERT+One independently performs multi-label classification on each instance present in the bag and then aggregates the classification results by performing class-wise max-pooling (over sentence scores). In essence, BERT+One ends up picking one instance for each class (the one with the highest confidence for that particular class), hence the name. In addition to these models, we also run CIL on the NYT-10m dataset. We again observe that our model significantly outperforms CIL and the three models (close to a 4-point improvement in AUC compared to the second best). Improvements in Macro F1 are even more significant, suggesting that our model is better at dealing with class imbalance in this setting. PR curves (figure 3) again reveal that our model is the only model achieving precision values close to 100% for some threshold values.
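The difference between the Avg and One aggregation schemes can be made concrete with a short sketch. It operates on toy instance vectors and toy per-instance class scores; the function names are hypothetical and the sketch does not reproduce the implementation of Gao et al. (2021).

```python
def average_bag(instance_vecs):
    """Avg: the bag representation is the uniform mean of instance vectors."""
    n = len(instance_vecs)
    return [sum(col) / n for col in zip(*instance_vecs)]

def one_pooling(instance_class_scores):
    """One: classify each instance, then take the class-wise maximum."""
    return [max(col) for col in zip(*instance_class_scores)]

bag_vec = average_bag([[1.0, 3.0], [3.0, 5.0]])
# Rows are instances, columns are relation classes.
bag_scores = one_pooling([[0.1, 0.9], [0.7, 0.2]])
```

In the second call the first class takes its score from the second instance and the second class from the first instance, which is the "one instance per class" behaviour described above.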
In the multi-lingual domain, we again notice that our model achieves state-of-the-art performance, this time on the DiS-ReX dataset. Our model beats mBERT+MNRE by around 7 AUC points, despite MNRE being specifically geared towards the multi-lingual setting. We stress that we made no changes to our model architecture or hyperparameters to achieve this result, except for replacing BERT with mBERT.
4.1.1 OOD-Generalization: Entity permutation test
To understand how robust our trained model would be to changes in the KB, we design the entity permutation test (inspired by Ribeiro et al. (2020)). An ideal DS-RE model should be able to correctly predict the relationship between an entity pair by understanding the semantics of the text mentioning them. Since DS-RE models under the MI-ML setting are evaluated at the bag level, it might be the case that such models are simply memorizing the KB on which they are trained.
To test this hypothesis, we construct a new test set using NYT-10m by augmenting the KB. Let $B(e_1, e_2)$ denote a non-NA bag already existing in the test set of NYT-10m. We augment this bag to correspond to a new entity pair (which is not present in the combined KB of all three splits of this dataset). The augmentation can be of two different types: replacing $e_1$ with $e_1'$ or replacing $e_2$ with $e_2'$. We restrict such augmentations to the same type (i.e., the type of $e_i$ and $e_i'$ is the same for $i \in \{1, 2\}$). For each non-NA entity pair in the test set of NYT-10m, we select one such augmentation and appropriately modify each instance in $B(e_1, e_2)$ to have the new entity mentions. We note that since each instance in NYT-10m is manually annotated and our augmentation ensures that the type signature is preserved, this transformation is label preserving. For the NA bags, we use the ones already present in the original split. This entire transformation leaves us with an augmented test set having the same number of NA and non-NA bags as the original split. The non-NA entity pairs are not present in the KB on which the model is trained.
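The augmentation can be sketched as follows. This is an illustration under simplifying assumptions: each entity mention is a single token, entity types are given as a plain dictionary, and the first same-type candidate that yields an unseen pair is chosen. All function names, entity names and type labels are hypothetical toy data.

```python
def choose_same_type_entity(old, partner, types, kb_pairs, position):
    """Pick a replacement of the same type whose new pair is not in the KB."""
    for cand in sorted(types):
        if cand == old or types[cand] != types[old]:
            continue
        pair = (cand, partner) if position == "head" else (partner, cand)
        if pair not in kb_pairs:
            return cand
    return None

def permute_bag(bag, old, new):
    # Rewrite every instance so that it mentions the new entity.
    return [[new if tok == old else tok for tok in inst] for inst in bag]

types = {"Obama": "PER", "Merkel": "PER", "Hawaii": "LOC", "Hamburg": "LOC"}
kb_pairs = {("Obama", "Hawaii"), ("Merkel", "Hamburg")}
bag = [["Obama", "was", "born", "in", "Hawaii"]]

# Replace the head entity while keeping the tail entity fixed.
new_head = choose_same_type_entity("Obama", "Hawaii", types, kb_pairs, "head")
new_bag = permute_bag(bag, "Obama", new_head)
```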
We compare the drop in performance of our model with BERT+Att and CIL in table 5. We observe that our model still achieves the highest performance along with the lowest drop (percentage-wise). We believe that this robustness may be attributed to the fact that our aggregation scheme does not pay any special attention to the entity mentions (apart from the entity markers). In intra-bag attention models, however, the instance representations are generated by concatenating the hidden states corresponding to the entity mentions in the sentence.
4.1.2 Attention on [PAD] tokens
In the passage summarization step (described in section 3), we allow the relation query vector to also attend over the encodings of the [PAD] tokens present in the passage. We make this architectural choice in order to provide some structure to the relation-specific summaries created by our model. If a particular relation class $r$ is not a valid relation for entity pair $(e_1, e_2)$, then ideally we would want the attended summary of the passage created by the relation vector $q_r$ to represent some sort of null state (since information specific to that relation class is not present in the passage). Allowing [PAD] tokens to be a part of the attention provides enough flexibility for the model to represent such a state. We test our hypothesis by considering 1000 non-NA bags correctly labelled by our trained model in the test set of NYT-10d. Let $R(e_1, e_2)$ denote the set of valid relation classes for entity pair $(e_1, e_2)$ and let $\mathcal{R}$ denote all of the relation classes present in the dataset. We first calculate the percentage of attention given to [PAD] tokens for a given passage for all relation classes in $\mathcal{R}$. The results are condensed into two scores: the sum of scores for $r \in R(e_1, e_2)$ and the sum of scores for $r \in \mathcal{R} \setminus R(e_1, e_2)$. The results are aggregated over all 1000 bags and then averaged by dividing by the total number of positive triples and negative triples respectively. We find that on average only 0.07% of the attention weight is given to [PAD] tokens by relation vectors corresponding to valid relations, compared to 88.35% of the attention weight given by relation vectors corresponding to invalid relations. We obtain similar statistics on NYT-10m as well. This suggests that for invalid triples, passage summaries generated by the model resemble the embeddings of the [PAD] token. Furthermore, since we do not allow [PAD] tokens to be a part of the self-attention update inside BERT, the [PAD] embeddings at the output of the BERT encoder are not dependent on the passage, allowing for uniformity across all bags.
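The statistic described above can be computed along the following lines. The sketch assumes that, for each bag, the normalized attention weights of every relation over the passage tokens are available; the data layout, function names and toy relation names are hypothetical.

```python
PAD = "[PAD]"

def pad_share(alphas, tokens):
    # Total attention mass a relation query places on [PAD] positions.
    return sum(a for a, tok in zip(alphas, tokens) if tok == PAD)

def average_pad_share(bags):
    """Average [PAD] attention for valid and for invalid relations.

    Each bag is (tokens, {relation: attention weights}, set of valid relations).
    """
    pos_sum = neg_sum = 0.0
    pos_n = neg_n = 0
    for tokens, attn_by_rel, valid in bags:
        for rel, alphas in attn_by_rel.items():
            share = pad_share(alphas, tokens)
            if rel in valid:
                pos_sum += share
                pos_n += 1
            else:
                neg_sum += share
                neg_n += 1
    # Assumes at least one positive and one negative triple overall.
    return pos_sum / pos_n, neg_sum / neg_n

toy_bags = [(
    ["[CLS]", "born", PAD, PAD],
    {"born_in": [0.5, 0.5, 0.0, 0.0], "capital_of": [0.05, 0.05, 0.45, 0.45]},
    {"born_in"},
)]
pos_avg, neg_avg = average_pad_share(toy_bags)
```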
Finally, we train a model where we do not allow the relation query vectors to attend to the [PAD] token embeddings and notice a 3.5-point drop in AUC on NYT-10d. We also note that the performance is still significantly higher than that of models such as REDSandT and DISTRE, suggesting that our instance aggregation scheme still performs better than the baselines, even when not fully optimized.
4.1.3 Performance vs length of passage
Our instance aggregation scheme truncates the passage if the number of tokens exceeds the maximum number of tokens allowed by the encoder. One might therefore assume that our model is not suited for cases where the number of instances present in a bag is very large. To test this hypothesis, we divide the non-NA bags present in the NYT-10m data into 7 bins based on the number of tokens present in each bag. We then compare the performance with CIL on the examples present in each bin. The results in figure 4 indicate that a) our model beats CIL in each bin and b) the variation among different bins is the same for both models. This trend continues even for passages where the number of tokens exceeds the maximum number of tokens allowed for BERT (i.e., 512). These results indicate that 512 tokens provide sufficient information for correct classification of a triple. Moreover, models using the intra-bag attention aggregation scheme fix the number of instances sampled from the bag in practice. For CIL, the best performing configuration uses a bag size of 3. This analysis therefore indicates that our aggregation scheme does not suffer a drop in performance on large bags.
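Grouping bags by length can be sketched as below. The bin edges are an assumption made purely for illustration (the text only states that 7 bins were used), and the function and variable names are hypothetical.

```python
import bisect
from collections import defaultdict

# Hypothetical edges; six edges give seven bins.
BIN_EDGES = [100, 200, 300, 400, 512, 1000]

def bin_index(n_tokens):
    return bisect.bisect_right(BIN_EDGES, n_tokens)

def group_by_length(bags):
    """Map bin index to the names of bags whose total token count falls in it."""
    groups = defaultdict(list)
    for name, instances in bags.items():
        n_tokens = sum(len(inst) for inst in instances)
        groups[bin_index(n_tokens)].append(name)
    return dict(groups)

toy_bags = {
    "short": [["x"] * 50],
    "long": [["x"] * 300, ["x"] * 300],  # 600 tokens, beyond the 512 limit
}
groups = group_by_length(toy_bags)
```

Per-bin metrics would then be computed separately on each group for the models being compared.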
5 Conclusion
In this paper we introduced a novel aggregation scheme, Passage-Att, for the MI-ML formulation of DS-RE. Our aggregation scheme combined with BERT establishes a new state-of-the-art on NYT-10d, NYT-10m and DiS-ReX, each being a standard benchmarking dataset for DS-RE in one of three distinct settings. We support our architectural choices with further analysis. We believe that our model can serve as a backbone for new research in the field of DS-RE. We encourage future researchers to implement their ideas using our proposed paradigm as the aggregation scheme.
References
- Alt et al. (2019). Fine-tuning pre-trained transformer language models to distantly supervised relation extraction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 1388–1398.
- Bhartiya et al. (2021). DiS-ReX: a multilingual dataset for distantly supervised relation extraction.
- Chen et al. (2021). CIL: contrastive instance learning framework for distantly supervised relation extraction. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online, pp. 6191–6200.
- Christou and Tsoumakas (2021). Improving distantly-supervised relation extraction through BERT-based label and instance embeddings. IEEE Access 9, pp. 62574–62582.
- Devlin et al. (2019). BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186.
- Gao et al. (2021). Manual evaluation matters: reviewing test protocols of distantly supervised relation extraction. arXiv preprint arXiv:2105.09543.
- Han et al. (2019). OpenNRE: an open and extensible toolkit for neural relation extraction. In Proceedings of EMNLP-IJCNLP: System Demonstrations, pp. 169–174.
- Hoffmann et al. (2011). Knowledge-based weak supervision for information extraction of overlapping relations. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, Oregon, USA, pp. 541–550.
- Kingma and Ba (2014). Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Lin et al. (2017). Neural relation extraction with multi-lingual attention. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 34–43.
- Lin et al. (2016). Neural relation extraction with selective attention over instances. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, pp. 2124–2133.
- Loshchilov and Hutter (2017). Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
- Mintz et al. (2009). Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, Suntec, Singapore, pp. 1003–1011.
- Paszke et al. (2019). PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), pp. 8024–8035.
- Radford et al. (2018). Improving language understanding by generative pre-training.
- Ribeiro et al. (2020). Beyond accuracy: behavioral testing of NLP models with CheckList. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 4902–4912.
- Riedel et al. (2010). Modeling relations and their mentions without labeled text. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 148–163.
- Surdeanu et al. (2012). Multi-instance multi-label learning for relation extraction. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, Jeju Island, Korea, pp. 455–465.
- Vashishth et al. (2018). RESIDE: improving distantly-supervised neural relation extraction using side information. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 1257–1266.
- Wolf et al. (2019). Huggingface’s transformers: state-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.
- Zeng et al. (2015). Distant supervision for relation extraction via piecewise convolutional neural networks. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 1753–1762.