ClaiRE at SemEval-2018 Task 7 - Extended Version

04/16/2018 ∙ by Lena Hettinger, et al.

In this paper we describe our post-evaluation results for SemEval-2018 Task 7 on classification of semantic relations in scientific literature for clean (subtask 1.1) and noisy data (subtask 1.2). Because of space limitations in the original paper, we publish an extended version of Hettinger et al. (2018) that includes further technical details and the changes made to the preprocessing step in the post-evaluation phase. With these changes, Classification of Relations using Embeddings (ClaiRE) achieved an improved F1 score of 75.11% for the first subtask and 81.44% for the second.




1 Introduction

The goal of SemEval-2018 Task 7 is to extract and classify semantic relations between entities into six categories that are specific to scientific literature (Gábor et al., 2018). In this work, we focus on the subtask of classifying relations between entities in manually annotated (subtask 1.1) and automatically annotated, and therefore noisy, data (subtask 1.2). Given a pair of related entities, the task is to classify the type of their relation among the following options: Compare, Model-Feature, Part_Whole, Result, Topic or Usage. Relation types are explained in detail in the task description paper (Gábor et al., 2018). The following sentence shows an example of a Result relation between the two entities combination methods and system performance:

Combination methods are an effective way of improving system performance.

This sentence is a good example of two challenges we face in this task. First, almost half of all entities consist of noun phrases, which has to be considered when constructing features. Second, the vocabulary is domain dependent, and therefore background knowledge should be incorporated.

Previous approaches for semantic relation classification tasks mainly employed two strategies. Either they made use of many hand-crafted features, or they utilized a neural network with as little background knowledge as possible. The winning system of an earlier SemEval challenge on relation classification (Hendrickx et al., 2009) adopted the first approach and achieved an F1 score of 82.2% (Rink and Harabagiu, 2010). Later, other works outperformed this approach by using CNNs with and without hand-crafted features (Santos et al., 2015; Xu et al., 2015) as well as RNNs (Miwa and Bansal, 2016).


We present two approaches that use different levels of preliminary information. Our first approach is inspired by the winning method of the SemEval-2010 challenge (Rink and Harabagiu, 2010). It models semantic relations by describing the two entities between which the semantic relation holds, as well as the words between those entities. We call those in-between words the context of the semantic relation. We classify relations by using an SVM on lexical features, such as part-of-speech tags. Additionally, we make use of semantic background knowledge and add pre-trained word embeddings to the SVM, as word embeddings have been shown to improve performance in a series of NLP tasks, such as sentiment analysis (Kim, 2014), question answering (Chen et al., 2017) or relation extraction (Dligach et al., 2017). Besides using existing word embeddings generated from a general corpus, we also train embeddings on scientific articles that better reflect scientific vocabulary.

In contrast, our second approach relies on word embeddings only, which are fed into a convolutional long short-term memory (C-LSTM) network, a model that combines convolutional and recurrent neural networks (Zhou et al., 2015). Therefore no hand-crafted features are used. Because both CNN and RNN models have shown good performance for this task, we assume that a combination of them will positively impact classification performance compared to the individual models.


By combining Lexical information and domain-Adapted Scientific word Embeddings, our system ClaiRE originally achieved an F1 score of 74.89% for the first subtask with manually annotated data and 78.39% for the second subtask with automatically annotated data (Hettinger et al., 2018). Improving our preprocessing lifted this performance to 75.11% and 81.44%, respectively. Our results make a strong case for domain-specific word embeddings, as using them improved our score by close to 5 percentage points.

Paper Structure

In Section 2, we describe the features that we used to characterize semantic relations. Section 3 shows how we classify the relation using an SVM and a C-LSTM neural network. Section 4 presents the results, which are discussed in Section 5. Finally, Section 6 concludes this work.

2 Features

Example sentence: Combination methods are an effective way of improving system performance.
Lexical feature set | Exemplary boolean features
BagOfWords () | an, be, effective, improve, of, way
POS path () | VDANAV
Distance () | 6
Levin classes () | 45
Entities without order () | combination methods, methods, system performance, performance
Start entity () | combination methods, methods
End entity () | system performance, performance
Similarity () | 0.43
Similarity bucket () | q50
Table 1: Examples for lexical context and entity features.

In this section, we describe the features which are used in our two approaches. All sentences are first preprocessed before constructing boolean lexical features on the one hand and word embedding vectors on the other. Both feature groups are based on the entities of relations as well as the context in which those entities appear.

Apart from the Compare relation, all relation types are asymmetric, and therefore the distinction between the start and end entity of a relation is important. If the entities appear in reverse order, that is, if the end entity of a relation appears first in the sentence, this is marked by a direction feature which is part of the data set.

In our introductory example, combination methods denotes the start entity, system performance the end entity, and are an effective way of improving the context.

2.1 Preprocessing

Early experiments showed that it is beneficial to filter the vocabulary of our data and reduce noise by leaving out infrequent context words. The best setting was found to be a frequency threshold of on lemmatized words. Therefore we discard a context word if its lemma appears less than times in the corpus of the respective subtask.

Post-Evaluation changes

Lemmas as well as POS tags were extracted with the help of SpaCy. We started and finished the challenge with version 2.0.2 and afterwards updated to version 2.0.9. This version update led to a change of POS tags, with which our results improved. During post-evaluation we also noticed an error in the preprocessing that caused two feature sets (bag of words and POS tags) to intermix: both the lemmas and the POS tags of pronouns were mapped to the same symbol 'PRON', so we had to explicitly separate these two sets. After resolving this overlap our results improved further.

2.2 Context features

First we will explain feature construction based on the context of a relation. Abbreviations for feature names are denoted in brackets. Context is defined as the words between two entities. Early tests showed that using those words described the relation better than the words surrounding the relation entities.


We construct several lexical boolean features which are illustrated in Table 1. First we apply a bag of words () approach where each lemmatized word forms one boolean feature, which for example takes 1 as value if the lemma improve is present and 0 if it is not. Second we determine whether the context words contain certain part-of-speech (POS) tags (), such as VERB. To represent the structure of the context phrase we add a path of POS tags feature, which contains the order in which POS tags appear (). The distance feature depicts whether the POS-path and therefore the context phrase has a certain length ().
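
The four context feature groups described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `context_features`, the feature-name prefixes and the hard-coded lemma/POS pairs are assumptions, and abbreviating each POS tag to its first letter is inferred from the `VDANAV` example in Table 1.

```python
def context_features(tagged_context):
    """Build boolean lexical features from (lemma, POS tag) pairs of the context."""
    lemmas = [lemma for lemma, _ in tagged_context]
    tags = [tag for _, tag in tagged_context]
    features = {}
    # Bag of words: one boolean feature per lemma.
    for lemma in lemmas:
        features["bow=" + lemma] = 1
    # POS tags: one boolean feature per tag that occurs in the context.
    for tag in tags:
        features["pos=" + tag] = 1
    # POS path: order of the tags, abbreviated to first letters (assumption).
    features["pos_path=" + "".join(tag[0] for tag in tags)] = 1
    # Distance: length of the context phrase.
    features["distance=" + str(len(tags))] = 1
    return features


# Context of the example sentence: "are an effective way of improving".
# The pairs are hard-coded here; in the paper they are produced by SpaCy.
example = [("be", "VERB"), ("an", "DET"), ("effective", "ADJ"),
           ("way", "NOUN"), ("of", "ADP"), ("improve", "VERB")]
feats = context_features(example)
```

Any feature that is absent from the dictionary is implicitly 0, which matches the boolean encoding described in the text.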

Additionally we add background knowledge by extracting the top-level Levin classes of intermediary verbs from VerbNet (), a verb lexicon compatible with WordNet. It contains explicitly stated syntactic and semantic information, using Levin verb classes to systematically construct lexical entries (Schuler, 2005). For example, the verb improve belongs to class 45.4, which is described by Levin as consisting of “alternating change of state” verbs.


Aside from lexical features we also use word embedding vectors to leverage information from the context of entities (). For each filtered context word we look up its word embedding in a set of pre-trained embeddings, where out-of-vocabulary (OOV) words are represented by the zero vector. The individual word vectors are later used to train a C-LSTM.

In contrast, for use in an SVM we found it beneficial to represent the context embedding features as the average over all context word embeddings.
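
A minimal sketch of this averaging step is shown below. The function name `average_embedding`, the toy two-dimensional vectors and the handling of an empty context (returning the zero vector) are assumptions made for illustration; only the zero vector for OOV words and the averaging itself come from the text.

```python
def average_embedding(words, embeddings, dim):
    """Average the vectors of `words`; OOV words contribute a zero vector."""
    if not words:
        return [0.0] * dim  # assumption: an empty context maps to the zero vector
    total = [0.0] * dim
    for word in words:
        vector = embeddings.get(word, [0.0] * dim)
        for i in range(dim):
            total[i] += vector[i]
    return [value / len(words) for value in total]


# Toy embedding table with two dimensions (real tables have 300).
toy = {"effective": [1.0, 0.0], "way": [0.0, 1.0]}
avg = average_embedding(["effective", "way", "unknownword"], toy, dim=2)
```

Because the OOV word is counted in the denominator, each component of `avg` is one third here rather than one half.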

2.3 Entity features

In the second set of features, we model the relation entities themselves as they may be connected to a certain relation class. For example, the token performance or one form of it mostly appears as an end entity of a Result relation, and in the rare case when it represents a start entity, it is almost always part of a Compare relation. Therefore we leverage information about entity position for the creation of lexical and embedding entity features.


For the creation of boolean lexical features, we first take the lowercased string of each entity and construct up to three distinct features from it: one feature which marks its general appearance in the corpus without order (), and one each if it occurs as the start () or end () entity of a relation, taking its direction into account. Additionally, if the entity consists of a nominal phrase, we add its head noun to the respective feature set to create greater overlap between instances.
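
The entity features can be sketched as below. The function name `entity_features`, the `reverse` flag standing in for the direction feature of the data set, and the feature-name prefixes are assumptions; taking the last token of a phrase as its head noun is a simplification and not necessarily how the head noun was determined in the paper.

```python
def entity_features(first_entity, second_entity, reverse=False):
    """Boolean entity features: unordered, start and end, with head nouns added."""
    # If the relation is marked as reversed, the entity that appears first
    # in the sentence is the end entity of the relation.
    if reverse:
        start, end = second_entity, first_entity
    else:
        start, end = first_entity, second_entity

    def variants(entity):
        text = entity.lower()
        tokens = text.split()
        # Assumption: the head noun of a nominal phrase is its last token.
        return {text, tokens[-1]} if len(tokens) > 1 else {text}

    features = {}
    for value in variants(start):
        features["entity=" + value] = 1
        features["start=" + value] = 1
    for value in variants(end):
        features["entity=" + value] = 1
        features["end=" + value] = 1
    return features


feats = entity_features("Combination methods", "system performance")
```

For the example sentence this yields the same entries as listed in Table 1, for instance `start=methods` and `end=performance`.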

Furthermore we measure the semantic similarity of the relation entities using the cosine of the corresponding word embedding vectors (). While the cosine can in theory take any value in [-1, 1], we cut it off after two digits to reduce the feature space and obtain 99 boolean similarity features for our corpus. To again enable learning across instances, we additionally discretize the similarity range and form another five boolean similarity features () that capture which bucket the similarity score falls into (values below zero are very rare in this corpus).
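
The two-digit similarity feature can be illustrated with a short sketch. The names `cosine` and `similarity_feature` and the `sim=` prefix are assumptions, and "cut off" is interpreted here as truncation rather than rounding. The bucket feature is not sketched.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if one of them is the zero vector."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_feature(u, v):
    """Name of the boolean feature for the cosine truncated to two decimals."""
    truncated = math.trunc(cosine(u, v) * 100) / 100
    return "sim=%.2f" % truncated


name = similarity_feature([1.0, 0.0], [1.0, 1.0])  # cosine is about 0.7071
```

Truncating to two decimals means that all entity pairs with a cosine between 0.70 and 0.71 share one boolean feature.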


Similar to the context features, we also want to add word embeddings of entities to our entity feature set. This is not straightforward, as more than 44% of all entities consist of nominal phrases, while a word embedding usually corresponds to a single word. By way of comparison, the proportion of nominals in the relation classification corpus of the SemEval-2010 challenge was only 5%. Thus we tested different strategies to obtain a word embedding for nominal phrases and found that averaging over the individual word vectors of the phrase yielded the best results for this task. These word embeddings for start () and end () entities of relations were then presented to our two classification methods, which will be described in detail in the following section.

3 Classification Methods

We utilize two different models for classifying semantic relations: an SVM which incorporates both the lexical and embedding features described in Section 2, and a Convolutional Long Short-Term Memory (C-LSTM) neural network that only uses word embedding vectors.

3.1 SVM

Figure 1: Feature vector used in the SVM, composed of lexical context features, 6816 lexical entity features and 900 embedding features (9886 features in total). Numbers hold true for subtask 1.1, including 1.2 data.

To fully exploit our hand-crafted lexical features we employ a traditional classifier. In comparison to Naive Bayes, Decision Trees and Random Forests, we found a Support Vector Machine to perform best for this task. Instead of utilizing the decision function of the SVM to predict test labels, we decided to make use of the probability estimates according to Wu et al. (2004), as this proved to be more successful. As mentioned before, the lexical features are fed into the SVM as boolean features, whereas the word embeddings are normalized to a fixed range using min-max scaling to make it easier for the SVM to handle both feature groups (Fig. 1).
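
How boolean and embedding features end up in one vector can be sketched as follows. The function name `min_max_scale`, the target range [0, 1] and the toy values are assumptions made for illustration.

```python
def min_max_scale(rows, low=0.0, high=1.0):
    """Scale each column of `rows` to [low, high]; constant columns map to `low`."""
    columns = list(zip(*rows))
    mins = [min(col) for col in columns]
    maxs = [max(col) for col in columns]
    scaled = []
    for row in rows:
        scaled.append([
            low + (high - low) * (x - mn) / (mx - mn) if mx > mn else low
            for x, mn, mx in zip(row, mins, maxs)
        ])
    return scaled


# Two toy instances: three boolean lexical features and two embedding dimensions.
lexical = [[1, 0, 1], [0, 1, 1]]
embedding = [[-0.4, 2.0], [0.6, 1.0]]

# Concatenate the unchanged boolean features with the scaled embedding features.
feature_vectors = [lex + emb for lex, emb in zip(lexical, min_max_scale(embedding))]
```

In practice the minima and maxima would be computed on the training data only and reused for the test data. As one possible realization, scikit-learn offers `MinMaxScaler` for the scaling step and `SVC(probability=True)` for multi-class probability estimates; whether the system described here was built with that library is not stated in the text.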

3.2 C-LSTM

In contrast to the SVM, neural network models do not necessarily rely on handcrafted features and are therefore faster to implement. We experiment with C-LSTM (Zhou et al., 2015), which extracts a sentence representation by combining a one-dimensional convolution and an LSTM network and uses this representation to perform the classification.

C-LSTM extracts a sentence representation in the following steps. First, embeddings for all words are obtained from a pre-computed embedding table whose dimensions are given by the size of the vocabulary and the embedding size. For entities that are nominal phrases, the average over the individual word embeddings is used. This results in a sequence of embedding vectors that consists of the embeddings representing the entities and the context word embeddings. Next, the embedding vectors in this sequence are concatenated to form an input matrix for the one-dimensional convolution layer. For computational reasons, this matrix is right-padded with a zero token to the maximum sequence length in the corpus. After that, feature maps are computed over the padded matrix using a one-dimensional convolution layer with filters of a fixed window size and stride. The resulting feature map matrix is then split along the second axis into a sequence of individual elements. Finally, this sequence is used as input to an LSTM network, with the last output being a representation of the input sequence. A softmax layer is used to compute label scores from the sentence representation. See Figure 2 for an illustration of the model.

Figure 2: An illustration of the model architecture of C-LSTM.
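
The forward pass of such a model can be sketched with NumPy as below. This is an illustration under several assumptions and not the trained system: all sizes and the random weights are toy values, the stride is assumed to be 1, a ReLU follows the convolution, biases other than the LSTM bias are omitted, and the LSTM also runs over padded positions. All function and parameter names are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def conv1d(x, filters):
    """Valid 1-D convolution, stride 1. x: (length, dim); filters: (width, dim, k)."""
    width = filters.shape[0]
    steps = x.shape[0] - width + 1
    out = np.stack([
        np.tensordot(x[t:t + width], filters, axes=([0, 1], [0, 1]))
        for t in range(steps)
    ])
    return np.maximum(out, 0.0)  # ReLU activation (assumption)


def lstm_last_output(seq, W, U, b):
    """Run a single-layer LSTM over seq (steps, k); return the last hidden state."""
    units = U.shape[0]
    h = np.zeros(units)
    c = np.zeros(units)
    for x_t in seq:
        z = x_t @ W + h @ U + b            # pre-activations of all four gates
        i, f, o, g = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
    return h


def c_lstm_scores(vectors, max_len, params):
    """Pad, convolve, run the LSTM and return softmax label scores."""
    padded = np.zeros((max_len, vectors.shape[1]))
    padded[:len(vectors)] = vectors        # right padding with zero vectors
    feature_maps = conv1d(padded, params["filters"])
    sentence = lstm_last_output(feature_maps, params["W"], params["U"], params["b"])
    logits = sentence @ params["V"]
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()


rng = np.random.default_rng(0)
dim, width, k, units, classes = 4, 3, 5, 6, 6   # toy sizes
params = {
    "filters": rng.normal(size=(width, dim, k)),
    "W": rng.normal(size=(k, 4 * units)),
    "U": rng.normal(size=(units, 4 * units)),
    "b": np.zeros(4 * units),
    "V": rng.normal(size=(units, classes)),
}
sequence = rng.normal(size=(5, dim))   # e.g. two entity vectors and three context words
probs = c_lstm_scores(sequence, max_len=8, params=params)
```

With a padded length of 8 and a window of 3, the convolution yields 6 time steps of 5 features each, and the LSTM reads these 6 steps in order before the softmax produces one score per relation class.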

4 Evaluation

After describing the two models we employ for relation classification, we now describe the data set we use and present results for both SVM and C-LSTM in detail. Results are reported as micro-F1 and macro-F1. The latter is the official evaluation score of the SemEval challenge. We describe the experimental setup for both models and compare different feature sets and pre-trained embeddings.

4.1 Data and Background Knowledge

label | subtask 1.1 | subtask 1.2 | total
COMPARE | 95 (8%) | 41 (3%) | 136 (5%)
MODEL-FEATURE | 326 (27%) | 174 (14%) | 500 (20%)
PART_WHOLE | 234 (19%) | 196 (16%) | 430 (17%)
RESULT | 72 (6%) | 123 (10%) | 195 (8%)
TOPIC | 18 (1%) | 243 (20%) | 261 (11%)
USAGE | 483 (39%) | 468 (38%) | 951 (38%)
Table 2: Distribution of class labels for training data as absolute and relative values.

We evaluate our approach on a set of scientific abstracts. It consists of 355 semantic relations for each subtask, which are distributed similarly to the respective training data set. As training data we received abstracts of scientific articles per subtask, which resulted in 1228 labeled training relations for subtask 1.1 and 1245 training instances for subtask 1.2 (cf. Table 2). We combine the data sets of both subtasks for training, resulting in 2473 training examples in total.

Background Knowledge

In our experiments, we compare different pre-trained word embeddings as a source of background knowledge. As a baseline, we employ a publicly available set of 300-dimensional word embeddings trained with GloVe (Pennington et al., 2014) on the Common Crawl data (CC). To better reflect the semantics of scientific language, we trained our own scientific embeddings using word2vec (Mikolov et al., 2013) on a large corpus of papers collected from arXiv.org (arXiv).

In order to create the scientific embeddings, we downloaded LaTeX sources for all papers published in 2016 on arXiv.org using the provided dumps. After originally trying to extract the plain text from the sources, we found that it was more feasible to first compile the sources to PDF (excluding all graphics etc.) and then use pdftotext to convert the documents to plain text. This resulted in a dataset of about papers. Using gensim (Řehůřek and Sojka, 2010), for each document we extracted tokens of minimum length 1 with the wikicorpus tokenizer and used word2vec to train 300-dimensional word embeddings on the data. We kept most hyper-parameters at their default values, but limited the vocabulary to words occurring at least 100 times in the dataset, which reduces, for example, the noise introduced by artifacts from equations.

4.2 SVM

data | context macro F1 | context micro F1 | + entities macro F1 | + entities micro F1
1.1 | 45.10 | 59.15 | 48.96 | 65.35
+1.2 | 46.95 | 61.97 | 66.03 | 70.14
CC | 51.14 | 64.79 | 70.31 | 73.24
arXiv | 51.55 | 64.79 | 75.11 | 77.46
Table 3: SVM results for subtask 1.1.

After an extensive grid search per cross validation the best parameters for the SVM were found to be a rbf-kernel with and for both tasks.

Our post-evaluation results of the SVM for subtask 1.1 are shown in Table 3. Adding entity features proves to be very beneficial compared to using only context features, as we could improve macro-F1 by 16 points on average. Results are further improved by enlarging the data set with the training samples of subtask 1.2 and by adding word embeddings to the feature set. While adding the CC embeddings enhances the macro-F1 by more than 4 points, our domain-adapted arXiv embeddings prove to perform even better and deliver the best result with a macro-F1 score of 75.11 and a micro-F1 of 77.46.

data | context macro F1 | context micro F1 | + entities macro F1 | + entities micro F1
1.2 | 68.61 | 71.27 | 73.49 | 81.41
+1.1 | 61.09 | 69.01 | 78.63 | 83.66
CC | 62.74 | 70.42 | 76.80 | 85.63
arXiv | 63.29 | 70.99 | 81.44 | 85.07
Table 4: SVM results for subtask 1.2.

Similar observations can be made for subtask 1.2, as shown in Table 4. Originally we achieved a macro-F1 score of 74.89% for the first subtask and 78.39% for the second (Hettinger et al., 2018), but applying the changes noted in Section 2.1 led to an improvement of about 1.6 points on average.

4.3 C-LSTM

We fix the batch size and the number of epochs for all trained models. Words are encoded using either arXiv or CC embeddings. The embeddings are not further optimized during training. Cross-entropy is used as the loss function and the model is optimized using Adam (Kingma and Ba, 2014) with a fixed initial learning rate.

To find the optimal hyperparameter configuration, we perform a random search (Bergstra and Bengio, 2012) over the hyperparameters number of filters, filter width, RNN cell units, dropout rate and L2 norm scale. For this study, we draw a stratified sample from the training set to serve as a validation set. All parameters were drawn from a uniform discrete or continuous distribution. The ranges and the parameters yielding the best performance on the validation set are given in Table 5.

parameter | min | max | selected
number of filters | 10 | 500 | 384
filter width | 2 | 5 | 3
rnn cell units | 16 | 500 | 93
dropout rate | 0.0 | 0.5 | 0.23
l2 normalization scale | 0.0 | 3.0 | 0.79
Table 5: C-LSTM parameters and settings selected by random search from search ranges of [min, max].
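
A random search over the ranges of Table 5 can be sketched as follows. The names `SEARCH_SPACE`, `sample_configuration` and `random_search`, the number of trials and the toy objective are assumptions; in the actual study the objective would be the validation score of a trained C-LSTM.

```python
import random

# Search ranges taken from Table 5; integer ranges are sampled inclusively.
SEARCH_SPACE = {
    "number_of_filters": (10, 500, int),
    "filter_width": (2, 5, int),
    "rnn_cell_units": (16, 500, int),
    "dropout_rate": (0.0, 0.5, float),
    "l2_scale": (0.0, 3.0, float),
}


def sample_configuration(rng):
    """Draw one configuration uniformly from the discrete or continuous ranges."""
    config = {}
    for name, (low, high, kind) in SEARCH_SPACE.items():
        if kind is int:
            config[name] = rng.randint(low, high)
        else:
            config[name] = rng.uniform(low, high)
    return config


def random_search(evaluate, trials, seed=0):
    """Return the sampled configuration with the highest score."""
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(trials):
        config = sample_configuration(rng)
        score = evaluate(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


# Toy objective standing in for "train the model and score it on the validation set".
best, score = random_search(lambda c: -abs(c["dropout_rate"] - 0.23), trials=50)
```

Each trial is independent of the others, so the trials can be run in parallel, which is one practical reason to prefer random search over a sequential tuning scheme.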

Using the determined optimal parameter settings, models with both types of embeddings were trained on the full training set and evaluated on the test set. Table 6 shows that the C-LSTM model performs well on the scientific embeddings, but consistently worse than the SVM model using handcrafted features.

embeddings | subtask 1.1 macro F1 | subtask 1.1 micro F1 | subtask 1.2 macro F1 | subtask 1.2 micro F1
CC | 54.42 | 67.61 | 74.42 | 78.87
arXiv | 67.49 | 70.96 | 67.02 | 74.37
Table 6: Results for C-LSTM models trained with CC and arXiv embeddings on both subtasks.

5 Discussion

We briefly discuss our approach during the training phase of the SemEval challenge and how label distribution and evaluation measure influence our results. Ahead of the final evaluation phase, where the concealed test data was presented to the participants, we were given a preliminary test partition as part of the training data. To be able to estimate our performance, we evaluated our system on this partition as well as in a 10-fold stratified cross-validation setting. We chose this procedure to be sure to pick the best system for submission at the challenge.

As some classes were strongly underrepresented in the training corpus and in the preliminary test partition, we assumed that this would also be true for the final test set. When in doubt, we therefore chose to optimize according to the cross-validation results, as cross validation is based on a slightly more balanced data set (the training data for subtasks 1.1 and 1.2). The best system we submitted for subtask 1.1 of the challenge achieved a macro-F1 of 75.05% during the training phase, which shows that we were able to estimate our final result quite closely.

During training we also noticed that for heavily skewed class distributions, as in this case, macro-F1 as an evaluation measure strongly depends on a good prediction of very small classes. For example, macro-F1 of subtask 1.1 increases by 5 points if we correctly predict one Topic instance out of three instead of none. Thus we pick a configuration that optimizes the small classes.
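
The sensitivity of macro-F1 to small classes can be reproduced on toy data. The label sequences below are invented for illustration and are not taken from the task data; the function name `f1_scores` is likewise an assumption. For single-label classification, micro-F1 equals accuracy, which the sketch makes use of.

```python
def f1_scores(gold, predicted):
    """Return (macro-F1, micro-F1) for single-label multi-class predictions."""
    pairs = list(zip(gold, predicted))
    per_class = []
    for label in sorted(set(gold)):
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        per_class.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    macro = sum(per_class) / len(per_class)
    micro = sum(1 for g, p in pairs if g == p) / len(pairs)
    return macro, micro


# Toy test set: seven USAGE instances and three TOPIC instances.
gold = ["USAGE"] * 7 + ["TOPIC"] * 3
none_right = ["USAGE"] * 10                            # no TOPIC instance found
one_right = ["USAGE"] * 7 + ["TOPIC"] + ["USAGE"] * 2  # one of three TOPIC found

macro_a, micro_a = f1_scores(gold, none_right)
macro_b, micro_b = f1_scores(gold, one_right)
```

In this toy setting, one additional correct prediction for the small class raises micro-F1 from 0.70 to 0.80, while macro-F1 rises from about 0.41 to about 0.69, because every class contributes equally to the macro average regardless of its size.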

We also omitted some lexical feature sets from our system, as performance on the preliminary and final test sets showed that they did not improve results. These features were hypernyms of context and entity tokens from WordNet and dependency paths between entities. Using tf-idf weighting instead of boolean values for lexical features also worsened our results.

The C-LSTM model performs quite well, considering that it relies only on very limited information, namely the sequence of entity and word embedding vectors, to perform the classification. For example, the model has no way of determining the direction of the relation, and we speculate that increasing the model complexity to include such information might increase the performance further. Additionally, the results for subtask 1.2 show that, in contrast to the SVM model, C-LSTM does not perform consistently better with arXiv embeddings, which warrants further investigation.

6 Conclusion

In this paper, we described our SemEval-2018 Task 7 system to classify semantic relations in scientific literature for clean (subtask 1.1) and noisy (subtask 1.2) data and its results during the post-evaluation phase. We constructed features based on relation entities and their context by means of hand-crafted lexical features as well as word embeddings. To better adapt to the scientific domain, we trained scientific word embeddings on a large corpus of scientific papers obtained from arXiv.org. We used an SVM to classify relations and additionally contrasted these results with those obtained from training a C-LSTM model on the scientific embeddings. Due to improved preprocessing we were able to obtain a macro-F1 score of 75.11% on clean data and 81.44% on noisy data. We finished the challenge as 4th out of 28 (subtask 1.1) and 6th out of 20 (subtask 1.2); these ranks are based on the results reported in Hettinger et al. (2018).

In future work, we will improve the tokenization of the scientific word embeddings and also take noun compounds into account, as they make up a large part of the scientific vocabulary. We will also investigate more complex neural network based models that can leverage additional information, for example relation direction and POS tags.