RGCL at SemEval-2020 Task 6: Neural Approaches to Definition Extraction

10/13/2020 ∙ by Tharindu Ranasinghe, et al. ∙ University of Surrey, University of Wolverhampton

This paper presents the RGCL team submission to SemEval 2020 Task 6: DeftEval, subtasks 1 and 2. The system classifies definitions at the sentence and token levels. It utilises state-of-the-art neural network architectures, which have some task-specific adaptations, including an automatically extended training set. Overall, the approach achieves acceptable evaluation scores, while maintaining flexibility in architecture selection.




1 Introduction

This work is licensed under a Creative Commons Attribution 4.0 International Licence. Licence details: http://creativecommons.org/licenses/by/4.0/.

Definition Extraction refers to the task in Natural Language Processing (NLP) of detecting and extracting a term and its definition in different types of text. A common use of automatic definition extraction is to help build dictionaries [7], but it can be employed for many other applications. For example, ontology building can benefit from methods that extract definitions [4, 11], whilst the fields of definition extraction and information extraction can employ similar methodologies. It is therefore unsurprising that there is growing interest in the task of definition extraction.

This paper describes our system that participated in two of the three subtasks of Task 6 at SemEval 2020 (DeftEval), a shared task focused on definition extraction from a specialised corpus. Our method employs state-of-the-art neural architectures in combination with automatic methods which extend and clean the provided dataset.

The remaining parts of this paper are structured as follows. First, we present related work in the area of definition extraction and the related field of relation extraction (Section 2). The three subtasks and the dataset provided by the task organisers are described in Section 3. Next, we describe our system (Section 4), followed by the results of the evaluation (Section 5) and a final conclusion (Section 6).

2 Related Work

The first efforts related to definition extraction happened in the field of hypernym extraction, which also dealt with relations that usually indicate a definition. This includes the X is a type of Y relation, such as salmon is a type of fish, where salmon is a hyponym of fish. Notable work includes Hearst (1992) [4], who automatically extracts hyponyms from large amounts of unstructured text using lexico-syntactic patterns. Inspired by this approach, Malaisé et al. (2004) describe a similar method to mine definitions in French, which are then classified in terms of their semantic relations, limited to the hypernymy and synonymy relations. The approach is also used for building ontologies [11].

The importance of the semantic relations between words for pattern-based approaches to definition extraction is highlighted in [17]. Here, the authors describe and explain definitional verbal patterns in Spanish, which they also propose to use for mining definitions. The proposed system is further presented in Alarcón et al. (2009) and is aimed at Spanish technical texts. The system uses the aforementioned verbal patterns, as well as corresponding tense and distance restrictions, in order to extract a set of candidate terms and their definitions. Once extracted, the system applies some filtering rules and a decision tree to further analyse the candidates. Finally, the results are ranked using heuristic rules. All aspects of the system were developed by analysing the Institut Universitari de Lingüística Aplicada technical corpus in Spanish, which is also used for evaluation.

Machine learning algorithms have also been used for definition extraction. Del Gaudio and Branco (2009) describe an approach that is said to be language independent and test it with decision trees and Random Forest, as well as Naïve Bayes, k-Nearest Neighbour and Support Vector Machines, using different sampling techniques with varying degrees of success. Kobyliński and Przepiórkowski (2008) [7] process Polish texts and use Balanced Random Forests, which bootstrap equal sets of positive and negative training examples for the classifier, as opposed to a larger group of unequal sets of training examples. Overall, while the approach is said to increase run time, it does bring minor increases in performance with some fine-tuning.

Most recently, Spala et al. (2019) have created DEFT, a corpus for definition extraction from unstructured and semi-structured texts. Citing some of the pattern-based approaches also mentioned here, the authors argue that the definitions studied so far have been well-defined and not necessarily representative of natural language. Therefore, a new corpus is presented that is said to represent natural language more accurately and includes messier examples of definitions. Parts of the DEFT corpus make up the dataset for this shared task, which is described in more detail in the following section.

3 Subtasks and Dataset

The DeftEval shared task is split into three subtasks. The first is Sentence Classification, where the task is to predict whether a given sentence contains a definition. Subtask 2 is a sequence labelling task, which requires participants to assign BIO tags to indicate which tokens in a sentence belong to terms and their definitions. Furthermore, the BIO tags are fine-grained, denoting whether terms and definitions are primary, secondary (the second time a term or definition has been seen in a text), referential or ordered (multiple terms that have inseparable definitions). The final subtask is Relation Classification, which requires participants to classify the relation between terms and their definitions. The relation tags are direct and indirect definitions (linking a term or a referential term to a definition, respectively), supplements (linking an indirect to a direct definition), refers-to (linking a referential term/definition to a term/definition) and AKA (linking an alias term to a term).
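To make the BIO scheme used in Subtask 2 concrete, the following is a minimal sketch of how a tag sequence can be collapsed into labelled spans. The tag names (B-Term, I-Term, B-Definition, I-Definition), the function name and the example sentence are assumptions chosen for illustration; the fine-grained tag set of the task is larger.

```python
from typing import List, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Collapse BIO tags into (label, start, end) spans, with end exclusive."""
    spans = []
    label, start = None, None
    for i, tag in enumerate(tags):
        # An I- tag continues the open span only if its label matches.
        if tag.startswith("I-") and label == tag[2:]:
            continue
        # Any other tag closes the span that is currently open.
        if label is not None:
            spans.append((label, start, i))
            label, start = None, None
        # A B- tag opens a new span; stray I- tags and O are treated as outside.
        if tag.startswith("B-"):
            label, start = tag[2:], i
    if label is not None:
        spans.append((label, start, len(tags)))
    return spans


# Toy sentence and tags, invented for illustration only.
tokens = ["A", "cell", "is", "the", "basic", "unit", "of", "life", "."]
tags = ["O", "B-Term", "O", "B-Definition", "I-Definition",
        "I-Definition", "I-Definition", "I-Definition", "O"]
spans = bio_to_spans(tags)
for label, start, end in spans:
    print(label, tokens[start:end])
```

Working with spans rather than individual tags makes it easier to inspect which tokens a model assigns to a term and which to its definition.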

The corpus provided by the organisers is made up of parts of the DEFT corpus described in Spala et al. (2019). This corpus has been compiled specifically for definition extraction tasks and is made up of legal contracts and textbook data. Citing a growing need for definition extraction corpora, the creators also developed an annotation scheme that is specific to the task of definition extraction.

4 Methodology

In this section we present the different approaches we employed for each subtask. The overall approach is based on a neural network architecture, but each subtask requires different methods of preprocessing the data, as well as task-specific tweaks to the data and architecture. Our implementation has been made available on GitHub (https://github.com/tharindudr/defteval).

4.1 Sentence Classification

This section describes the methodology employed for Subtask 1: Sentence Classification, as well as the experiments carried out in order to boost performance. We first present the methods used to process and extend the data, followed by a description of the main neural network architecture employed.

4.1.1 Data Processing and Cleaning

We first used the data conversion Python script provided by the organisers to convert the DEFT corpus into classification instances. We then concatenated all the files in the training folder into a single file used for training, while the concatenated file from the dev folder was used for evaluation. As the sentence classification task only requires predicting 1 (contains a definition) or 0 (does not contain a definition), it was feasible to perform some simple cleaning to increase classification performance without causing side effects. Upon analysis of the data we found that many sentences had some kind of numbering at the beginning, such as in the following example:

41. The evolution of various life forms on Earth can be summarized in a phylogenetic tree ([link])

Using a simple regular expression to match numbers and a punctuation mark at the beginning of a sentence, we removed these character strings across all sets. We used the same approach for finding and deleting character strings such as ([link]), which have been inserted by the task organisers to replace actual links to websites (see also the above example). In cases where the link replacement formed part of the sentence we did not perform a deletion:

Examples of some neutral atoms and their electron configurations are shown in [link].

This decision was made as a deletion would otherwise leave such sentences incomplete. After comparing the performance of our algorithm on both cleaned and uncleaned text, we observed a marginal increase across all evaluation metrics with the best-performing architecture. We did not carry out any additional cleaning, partly because we use BERT embeddings, whose vocabulary includes vectors for most characters, making it unnecessary to remove any others.
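The cleaning step described in this section can be sketched with two regular expressions. This is an illustration under stated assumptions rather than the expressions actually used in the system: the pattern names, the function name and the exact set of punctuation marks accepted after the number are hypothetical.

```python
import re

# Leading numbering such as "41. " or "3) ": digits, one punctuation mark, spaces.
LEADING_NUMBER = re.compile(r"^\s*\d+\s*[.):]\s*")
# Parenthesised link placeholder such as " ([link])"; a bare "[link]" is kept
# because it forms part of the sentence.
PAREN_LINK = re.compile(r"\s*\(\[link\]\)")


def clean_sentence(sentence: str) -> str:
    """Remove leading numbering and parenthesised link placeholders."""
    sentence = LEADING_NUMBER.sub("", sentence)
    sentence = PAREN_LINK.sub("", sentence)
    return sentence.strip()


numbered = ("41. The evolution of various life forms on Earth can be "
            "summarized in a phylogenetic tree ([link])")
inline = ("Examples of some neutral atoms and their electron "
          "configurations are shown in [link].")

print(clean_sentence(numbered))
print(clean_sentence(inline))
```

The first example loses both its numbering and the placeholder, while the second is left unchanged because the placeholder is not wrapped in parentheses.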

4.1.2 Data Augmentation

In order to improve the performance of our classification we extended the training set automatically. To achieve this, the sequence labelling part of the system (described in Section 4.2) was used to detect terms in the training data. Where possible, we extracted the first sentence of the corresponding Wikipedia article for each term by scraping Wikipedia, since the first sentence usually defines the term or item that the article is about. However, the approach did not help: it traded increases in precision for decreases in recall and decreased the F1-score by about 0.02. Because the augmentation process is completely automated and not manually checked, it introduces a certain level of noise into the dataset, which decreases performance.
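As a rough illustration of the augmentation step, the sketch below takes article texts that have already been retrieved, keeps the first sentence of each and adds it to the training set as a positive instance. The retrieval itself is omitted; the function names, the naive sentence boundary rule and the toy article text are assumptions for illustration.

```python
import re
from typing import List, Tuple

# Naive boundary: sentence-final punctuation, whitespace, then an upper-case letter.
# Abbreviations such as "U.S. Army" would be split incorrectly by this rule.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def first_sentence(article_text: str) -> str:
    """Return the text up to the first sentence boundary."""
    parts = SENTENCE_BOUNDARY.split(article_text.strip(), maxsplit=1)
    return parts[0]


def augment(training: List[Tuple[str, int]],
            articles: List[str]) -> List[Tuple[str, int]]:
    """Append the first sentence of each non-empty article with label 1."""
    extra = [(first_sentence(a), 1) for a in articles if a.strip()]
    return training + extra


# Toy article text, invented for illustration only.
article = "A salmon is a type of fish. Salmon live in rivers and seas."
extended = augment([("An unrelated sentence.", 0)], [article])
print(extended)
```

Because every added sentence is labelled 1 without manual checking, any article whose first sentence is not a definition becomes a noisy positive example, which is consistent with the drop in performance reported above.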

4.1.3 System Architecture

In order to determine the most suitable system architecture for the sentence classification task, we experimented with three different neural architectures: Convolutional Neural Network (CNN) [5], Recurrent Neural Network (RNN) [2] and Transformer [3]. After running various configurations, we found the Transformer architecture to perform best.

With the introduction of BERT [3], transformer architectures have been highly successful across a wide range of NLP tasks. They are pre-trained on general tasks such as language modelling and then fine-tuned for classification tasks [19, 14].

Transformer models take a sequence as input and output a representation of that sequence. The sequence consists of one or two segments: the first token is always [CLS], which carries the special classification embedding, and another special token, [SEP], separates the segments.

For text classification tasks, transformer models take the final hidden state h of the first token [CLS] as the representation of the whole sequence [19]. This [CLS] representation is then fed into a simple softmax classifier to predict the label of the whole sentence: whether it contains a definition or not.
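The classification head described above can be sketched in a few lines of plain Python. The hidden state h, the weight matrix W and the bias b below are toy values chosen for illustration (a real hidden state has hundreds of dimensions), and the bias term is an assumption that is not stated in the text.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(h: List[float], W: List[List[float]], b: List[float]) -> List[float]:
    """Class probabilities softmax(W h + b) for one [CLS] hidden state h."""
    logits = [sum(w * x for w, x in zip(row, h)) + bias
              for row, bias in zip(W, b)]
    return softmax(logits)


def nll(probs: List[float], gold: int) -> float:
    """Negative log-probability of the correct label (the training loss)."""
    return -math.log(probs[gold])


h = [0.5, -1.0, 2.0]                        # toy [CLS] hidden state
W = [[0.1, 0.2, -0.3], [-0.2, 0.4, 0.5]]    # one row per class (0 and 1)
b = [0.0, 0.0]
probs = classify(h, W, b)
label = max(range(len(probs)), key=probs.__getitem__)
print(label, probs)
```

Minimising nll over the training set is equivalent to maximising the log-probability of the correct label.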

We fine-tuned all the parameters of the transformer models jointly with the softmax classifier by maximising the log-probability of the correct label. For training, we used a batch size of eight, the Adam optimiser [6] and a linear learning rate warm-up over 10% of the training data. The models were trained using only the training data and were evaluated during training on an evaluation set containing one fifth of the rows in the training data. We performed early stopping if the evaluation loss did not improve over ten evaluation rounds. All the models were trained for three epochs. We experimented with several transformer architectures: BERT [3], XLNet [21], XLM [1], RoBERTa [10] and DistilBERT [16]. We used HuggingFace's implementation of the transformer models [20] and the pre-trained models available in the HuggingFace model repository (https://huggingface.co/models).
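The linear warm-up can be illustrated as a multiplier applied to the peak learning rate at each training step. The sketch below interprets "10% of the training data" as 10% of the optimisation steps and assumes a linear decay to zero after the warm-up, which is a common schedule but is not stated in the text; the function name is hypothetical.

```python
def warmup_factor(step: int, total_steps: int, warmup_fraction: float = 0.1) -> float:
    """Multiplier for the peak learning rate at a given training step."""
    warmup_steps = max(1, int(total_steps * warmup_fraction))
    if step < warmup_steps:
        # Linear increase from 0 to 1 during warm-up.
        return step / warmup_steps
    # Assumption: linear decay from 1 to 0 over the remaining steps.
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


total = 1000
for step in (0, 50, 100, 550, 1000):
    print(step, warmup_factor(step, total))
```

The actual learning rate at a step is the peak learning rate passed to the optimiser multiplied by this factor.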

4.2 Sequence Labelling

This section describes the experiments we conducted for Subtask 2: Sequence Labelling. We first present the data processing methods used, followed by the neural network architecture employed. Due to the structure of the data and the way the annotations had to be made (CoNLL-like format) and evaluated, no cleaning was performed for this task.

4.2.1 Data Processing and Augmentation

As a preliminary step, we concatenated all the files from the train folder of the DEFT corpus into a single file and used it as the training file. Similarly, we concatenated all the files from the dev folder of the DEFT corpus into a single file and used it for evaluation.

We also experimented with data augmentation for this subtask. We tried a similar approach as before, but with a bootstrapping focus: we used the classifier trained for this task to predict terms and extracted the first sentence from each corresponding Wikipedia article. Exploiting the structure of Wikipedia again, we automatically labelled the term in the corresponding sentence, thereby providing extra examples of the terms being used in a sentence. However, as in the previous case, this step did not improve our results due to the noise it introduces. We also assume that the added terms were always mentioned at the beginning of a sentence, which adds positional bias to the classifier.
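The automatic labelling step can be sketched as follows: given a tokenised sentence and a predicted term, the first occurrence of the term is tagged and every other token is tagged as outside. The tag names, the function name, the case-insensitive matching and the toy sentence are assumptions for illustration.

```python
from typing import List, Tuple


def label_term(tokens: List[str], term_tokens: List[str]) -> List[Tuple[str, str]]:
    """Tag the first occurrence of a term as B-Term/I-Term; other tokens get O."""
    tags = ["O"] * len(tokens)
    n = len(term_tokens)
    lowered = [t.lower() for t in tokens]
    target = [t.lower() for t in term_tokens]
    for start in range(len(tokens) - n + 1):
        if n and lowered[start:start + n] == target:
            tags[start] = "B-Term"
            for k in range(start + 1, start + n):
                tags[k] = "I-Term"
            break  # only the first occurrence is labelled
    return list(zip(tokens, tags))


# Toy sentence, invented for illustration only.
sentence = ["A", "phylogenetic", "tree", "is", "a", "branching", "diagram", "."]
labelled = label_term(sentence, ["phylogenetic", "tree"])
print(labelled)
```

In first sentences of encyclopedia articles the term tends to occupy the first few positions, which illustrates the positional bias mentioned above.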

4.2.2 System Architecture

We experimented with three different neural network architectures for the sequence labelling task: LSTM-CRF [8], Stack-LSTM [8] and Transformer [3]. In this task we also found that the Transformer architecture performs best.

Transformer architectures have proved effective in NER tasks [3], which are also sequence labelling tasks. In light of this, we implemented the approach suggested in the BERT paper [3] for this subtask: a transformer model combined with a token-level classifier. After the sentence has been processed by the transformer model, each word has a vector representation. We used this vector representation as the input to the token-level classifier over the label set available for Subtask 2. The token-level classifier consists of a dropout layer [18] and a linear classifier. We fine-tuned all the parameters of the transformer models jointly with the token-level classifier by maximising the log-probability of the correct label.

For training, we used a batch size of eight, the Adam optimiser [6] and a linear learning rate warm-up over 6% of the training data. The models were trained using only the training data and were evaluated during training on an evaluation set containing one fifth of the rows in the training data. As in Subtask 1, we performed early stopping if the evaluation loss did not improve over ten evaluation rounds. All the models were trained for three epochs. We experimented with several transformer architectures: BERT [3], XLNet [21], XLM [1], RoBERTa [10] and DistilBERT [16]. We used the HuggingFace TokenClassification interface [20] and the pre-trained models available in the HuggingFace model repository (https://huggingface.co/models).
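The early stopping rule used for both subtasks can be sketched as a small helper that counts evaluation rounds without improvement. The class and attribute names are hypothetical, and the losses in the example are toy values with a shortened patience of three so that the behaviour is easy to follow.

```python
class EarlyStopping:
    """Signal a stop after `patience` evaluation rounds without a new best loss."""

    def __init__(self, patience: int = 10):
        self.patience = patience
        self.best_loss = float("inf")
        self.rounds_without_improvement = 0

    def should_stop(self, eval_loss: float) -> bool:
        if eval_loss < self.best_loss:
            # New best loss: remember it and reset the counter.
            self.best_loss = eval_loss
            self.rounds_without_improvement = 0
        else:
            self.rounds_without_improvement += 1
        return self.rounds_without_improvement >= self.patience


losses = [1.0, 0.9, 0.95, 0.92, 0.91, 0.5]   # toy evaluation losses
stopper = EarlyStopping(patience=3)
stopped_at = None
for round_index, loss in enumerate(losses):
    if stopper.should_stop(loss):
        stopped_at = round_index
        break
print(stopped_at, stopper.best_loss)
```

In this toy run training stops before the final loss of 0.5 is ever seen, which shows the trade-off of a small patience value.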

We also experimented with adding a Conditional Random Field (CRF) layer [22] after the output of the Transformer. However, evaluation of several configurations showed that adding the CRF layer did not improve the results. Therefore, we did not pursue these experiments any further.

5 Evaluation

In this section we present the evaluation results that were obtained during testing. We also provide a brief look at the final submission results of the shared task.

5.1 Sentence Classification Results

Table 1 shows the evaluation of the different architectures we developed for the sentence classification task using the development set. We have also included baseline results, which were obtained using a Naive Bayes bag-of-words approach. XLNet clearly performs best overall, although the margin is small. Interestingly, we compared BERT-Large against XLNet-Base, meaning that our best architecture was much less resource-intensive to run.

In the final task evaluation on the test set, we placed 25th out of 56 participants on the F1-Macro score. Compared to our development results, this is a relatively large drop. We assume that our model overfitted to the training set we used.

             Not Definition      Definition          Weighted Average
Model        P     R     F1      P     R     F1      P     R     F1      F1 Macro
CNN          0.78  0.73  0.72    0.76  0.71  0.75    0.74  0.77  0.74    0.76
RNN-BILSTM   0.76  0.71  0.74    0.68  0.74  0.72    0.75  0.72  0.73    0.75
BERT         0.90  0.88  0.89    0.81  0.79  0.80    0.86  0.86  0.86    0.84
XLNet        0.91  0.90  0.90    0.82  0.80  0.81    0.87  0.88  0.87    0.86
Baseline     0.89  0.54  0.68    0.49  0.87  0.63    0.66  0.68  0.66    0.67
Table 1: Results for Subtask 1. For each model, Precision (P), Recall (R) and F1 are reported for each class and as weighted averages. Macro-F1 is also listed.
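Since Table 1 reports both weighted averages and Macro-F1, the following sketch shows how the two differ. It is a generic illustration of the metrics and not the official scorer of the task; the gold and predicted labels are toy values.

```python
from typing import List


def per_class_f1(gold: List[int], pred: List[int], label: int) -> float:
    """F1 for one class, computed from true/false positives and false negatives."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold: List[int], pred: List[int]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(gold))
    return sum(per_class_f1(gold, pred, l) for l in labels) / len(labels)


def weighted_f1(gold: List[int], pred: List[int]) -> float:
    """Mean of the per-class F1 scores, weighted by class frequency in gold."""
    labels = sorted(set(gold))
    return sum(per_class_f1(gold, pred, l) * gold.count(l) for l in labels) / len(gold)


gold = [0, 0, 0, 0, 1, 1]   # toy labels: 0 = no definition, 1 = definition
pred = [0, 0, 0, 1, 1, 0]
print(macro_f1(gold, pred), weighted_f1(gold, pred))
```

Macro-F1 treats both classes equally, so it is affected more strongly than the weighted average when the smaller Definition class is predicted poorly.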

5.2 Sequence Labelling Results

Table 2 shows the evaluation results for the different architectures we tested for the Sequence Labelling task. As before, we see XLNet with the best results, and again see that the less resource intense base version is almost on par with the large version. It should also be noted that the best results were achieved with shortened maximum sequence lengths, down from 128 to 64.

In the official evaluation on the test set we ranked 28th of 51 on F1-score. This is a significant drop in performance compared to the development results, possibly due to overfitting.

Model          P     R     F1
BERT           0.71  0.74  0.73
RoBERTa        0.67  0.70  0.69
XLNet - Base   0.71  0.75  0.73
XLNet - Large  0.72  0.76  0.74
Table 2: Results for Subtask 2. For each model, overall Precision (P), Recall (R) and F1 are reported.

6 Conclusion

We have presented the system the RGCL team prepared for SemEval-2020 Task 6. The design of the system allows for easy switching between architectures to accommodate the needs of the task at hand. For this task, we have shown that the Transformer architecture using XLNet is the most successful when working with limited resources. We have also shown that the data augmentation techniques we experimented with, while only slightly affecting overall performance, do not improve it. In a shared task setting, the extended data from Wikipedia was not useful; however, for a wider approach with higher recall, it could be more helpful.

We also tried to participate in the final subtask, Relation Classification. However, due to time constraints, we were not able to achieve a valid submission for this subtask. We approached it as a sequence pair classification task and employed a Siamese Neural Network, which has been shown to perform well in sequence pair classification tasks [12, 13]. The architecture we employed is similar to the architecture presented in [15]. When two sequences have a relation, we extracted the sequences and provided them as the input to the Siamese transformer architecture. We then used the classification objective function suggested in [15] and optimised the cross-entropy loss. Due to the complexity of this task, we managed to run only a baseline of the proposed architecture, which achieved very low evaluation scores on the development data. Therefore, we did not have a submission for this task and do not present any results here. In future, we hope to carry out further experiments with Siamese transformer architectures for relation classification tasks.
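The classification objective of [15] concatenates the two sequence embeddings u and v with their element-wise absolute difference and passes the result through a softmax layer trained with cross-entropy. The sketch below illustrates this with toy vectors and a zero weight matrix over the five relation tags of the task; all values and function names are hypothetical and this is not the baseline that was actually run.

```python
import math
from typing import List


def pair_features(u: List[float], v: List[float]) -> List[float]:
    """Concatenate u, v and |u - v| as in the classification objective of [15]."""
    return u + v + [abs(a - b) for a, b in zip(u, v)]


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def relation_loss(u: List[float], v: List[float],
                  W: List[List[float]], gold: int) -> float:
    """Cross-entropy of softmax(W [u; v; |u - v|]) against the gold relation."""
    x = pair_features(u, v)
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in W]
    return -math.log(softmax(logits)[gold])


u = [1.0, 2.0]                        # toy embedding of the first sequence
v = [0.5, 4.0]                        # toy embedding of the second sequence
W = [[0.0] * 6 for _ in range(5)]     # untrained weights, one row per relation tag
loss = relation_loss(u, v, W, gold=2)
print(pair_features(u, v), loss)
```

With untrained zero weights every relation is equally likely, so the loss equals log(5), the value a model has to improve on during training.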

Going forward, we also wish to use this system for further tasks across further languages. While we may not achieve the best performance, the system uses realistic system resources and is therefore very versatile. This holds particularly for the first subtask, where the difference to the best team is around 0.09; for Subtask 2 the best team is 0.36 ahead of us, indicating that our system is not competitive there. These experiments can easily be extended to a different domain using a transformer model pre-trained on that domain, provided that a corpus similar to the DEFT corpus is available. For example, our system should be easily adaptable to the biology domain using the BioBERT pre-trained transformer model [9] and a DEFT-like corpus for biology.


References

  • [1] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, and V. Stoyanov (2019) Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116.
  • [2] J. Cui, J. Long, E. Min, Q. Liu, and Q. Li (2018) Comparative study of CNN and RNN for deep learning based intrusion detection system. In Cloud Computing and Security, X. Sun, Z. Pan, and E. Bertino (Eds.), Cham, pp. 159–170. ISBN 978-3-030-00018-9.
  • [3] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • [4] M. A. Hearst (1992) Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 14th conference on Computational linguistics-Volume 2, pp. 539–545.
  • [5] Y. Kim (2014) Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, pp. 1746–1751.
  • [6] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv:1412.6980.
  • [7] Ł. Kobyliński and A. Przepiórkowski (2008) Definition extraction with balanced random forests. In International Conference on Natural Language Processing, pp. 237–247.
  • [8] G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, and C. Dyer (2016) Neural architectures for named entity recognition. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, pp. 260–270.
  • [9] J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. H. So, and J. Kang (2019) BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics. ISSN 1367-4803.
  • [10] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
  • [11] V. Malaisé, P. Zweigenbaum, and B. Bachimont (2007) Mining defining contexts to help structuring differential ontologies. Application-driven terminology engineering 2, pp. 19.
  • [12] J. Mueller and A. Thyagarajan (2016) Siamese recurrent architectures for learning sentence similarity. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI'16, pp. 2786–2792.
  • [13] T. Ranasinghe, C. Orasan, and R. Mitkov (2019) Semantic textual similarity with Siamese neural networks. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019), Varna, Bulgaria, pp. 1004–1011.
  • [14] T. Ranasinghe, M. Zampieri, and H. Hettiarachchi (2019) BRUMS at HASOC 2019: deep learning models for multilingual hate speech and offensive language identification. In Proceedings of the 11th annual meeting of the Forum for Information Retrieval Evaluation (December 2019).
  • [15] N. Reimers and I. Gurevych (2019) Sentence-BERT: sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing.
  • [16] V. Sanh, L. Debut, J. Chaumond, and T. Wolf (2019) DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.
  • [17] G. Sierra, R. Alarcón, C. Aguilar, and C. Bach (2008) Definitional verbal patterns for semantic relation extraction. Terminology. International Journal of Theoretical and Applied Issues in Specialized Communication 14 (1), pp. 74–98.
  • [18] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15 (56), pp. 1929–1958.
  • [19] C. Sun, X. Qiu, Y. Xu, and X. Huang (2019) How to fine-tune BERT for text classification?. In Chinese Computational Linguistics, M. Sun, X. Huang, H. Ji, Z. Liu, and Y. Liu (Eds.), Cham, pp. 194–206. ISBN 978-3-030-32381-3.
  • [20] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, and J. Brew (2019) HuggingFace's transformers: state-of-the-art natural language processing. ArXiv abs/1910.03771.
  • [21] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le (2019) XLNet: generalized autoregressive pretraining for language understanding. In Advances in neural information processing systems, pp. 5754–5764.
  • [22] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. S. Torr (2015) Conditional random fields as recurrent neural networks. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), ICCV '15, USA, pp. 1529–1537. ISBN 9781467383912.