Named entity recognition in chemical patents using ensemble of contextual language models

07/24/2020 ∙ by Jenny Copara, et al.

Chemical patent documents describe a broad range of applications holding key information, such as chemical compounds, reactions, and specific properties. However, this key information must first be made accessible before it can be used in downstream tasks. Text mining provides means to extract relevant information from chemical patents through information extraction techniques. As part of the Information Extraction task of the Cheminformatics Elsevier Melbourne University challenge, in this work we study the effectiveness of contextualized language models to extract reaction information in chemical patents. We compare transformer architectures trained on a generic corpus with models specialised in chemistry patents, and propose a new model based on the combination of existing architectures. Our best model, based on the ensemble approach, achieves an exact F1-score of 92.30% and a relaxed F1-score of 96.24% on the official test set. These results suggest that an ensemble of contextualized language models provides an effective method to extract information from chemical patents. As a next step, we will investigate the effect of transformer language models pre-trained on chemical patents.




1 Introduction

Chemical patents represent a valuable information resource in downstream innovation applications, such as drug discovery and novelty checking. However, the discovery of chemical compounds described in patents is delayed by a few years [He2020]. Possible reasons include the recent increase in the number of chemical patents, which rules out manual curation, and the particular wording of patents. Additionally, the narrative in chemical patents contains meaningful concepts that are usually worded so as to protect the knowledge, whereas in scientific literature the text tends to be as clear as possible [Valentinuzzi2017]. In addition, chemical patents represent a complex source of information [Habibi2016]. In this landscape, information extraction methods, such as Named Entity Recognition (NER), provide a suitable solution to identify key information in patents.

NER aims to identify information of interest and its specific instances found in a document [Grishman2019, Okurowski1993]. It has often been addressed as a sequence classification task. One of the most successful approaches in sequence classification is Conditional Random Fields (CRF) [Lafferty2001, Sutton2012]. It was established as the state of the art in different NER domains for many years [Leaman2008, Rocktschel2012, Leaman2015, Ratinov2009, Guo2014, Habibi2016, Yadav2018]. In the chemical patent domain, CRF was explored by Zhang et al. [Zhang2016] on the CHEMDNER patent corpus [Krallinger2015]. Using a set of hand-crafted and unsupervised features derived from word embeddings and Brown clustering, their model achieved of F-score. With similar F-score performance, Akhondi et al. [Akhondi2016] explored CRF combined with dictionaries in the biomedical domain in the tmChem tool [Leaman2015] in order to select the best vocabulary for the CHEMDNER patent corpus. It has been shown [Habibi2016] that recognizing chemical entities in the full patent text is a harder task than in titles and abstracts, given the peculiarities of this kind of text. Evaluation on full patents was performed using the Biosemantics corpus [Akhondi2014] through neural approaches based on biLSTM-CRF [Habibi2017] and biLSTM-CNN-CRF [Zhai2019], where the former achieved and the latter of F-score. It is worth noting that in [Zhai2019] the authors used ELMo contextualized embeddings [Peters2018], while in [Habibi2017] the authors used word2vec embeddings [Mikolov2013] to represent features.

Over the years, neural language models have improved their ability to encode the semantics of words using large amounts of unlabeled text. They initially evolved from a straightforward model [Bengio2003] of one hidden layer that predicts the next word in a sequence, aiming to learn the distributed representation of the words (i.e., the word embedding vector), to an improved objective function that allows learning from a larger amount of text [Collobert2011], but with higher computational resource usage and longer training time. These developments encouraged the search for language models able to deliver high-quality word embeddings at lower computational cost (i.e., word2vec [Mikolov2013] and GloVe [Pennington2014]). However, natural language still presented challenges for language models, in particular concerning word contexts. Recently, a second type of word embedding has attracted attention in the literature, the so-called contextualized embeddings, such as ELMo [Peters2018], ULMFiT [Howard2018], GPT-2 [Radford2019], and BERT [Devlin2019]. In particular, the transformer architecture underlying BERT uses the attention mechanism to pre-train deep bidirectional representations, conditioning tokens on both the left and right context.

In this work, we explore contextualized language models to extract information from chemical patents as part of the ChEMU lab (Information Extraction from Chemical Patents) [He2020] at the 11th Conference and Labs of the Evaluation Forum 2020, Task 1: Named Entity Recognition. The entities in the corpus are example_label, other_compound, reaction_product, reagent_catalyst, solvent, starting_material, temperature, time, yield_other, and yield_percent. BERT-based models were used as pre-trained language models and fine-tuned on the ChEMU NER task to classify tokens according to the different entities. We investigate the combination of different architectures to improve NER performance. In the following sections, we describe the design and results of our experiments.

2 Methods and Data

2.1 NER model

2.1.1 Transformers with a token classification on top

We used five BERT-based language models [Devlin2019]. The first four models are bert-base-cased, bert-base-uncased, bert-large-cased and bert-large-uncased. They were pretrained on a large corpus of English text, with different model sizes for the base and large variants. The last pretrained language model is ChemBERTa, trained on a corpus of 100k SMILES strings from the ZINC benchmark dataset. ChemBERTa is a RoBERTa-based [Liu2019] model trained for 5 epochs. The RoBERTa architecture is a BERT-based model with training improvements in the hyperparameters, tokenizer and training task, to name but a few.

For fine-tuning, the NER model consists of a BERT module with a fully connected layer on top of the hidden states of each token, optimized with Adam [Kingma2015]. We used the implementation from the Hugging Face framework.

The language models were fine-tuned on the ChEMU Task 1 dataset. The first four language models were fine-tuned for 10 epochs, with a maximum sequence length of 256 tokens, a learning rate of and a warmup proportion of . The ChemBERTa model was fine-tuned for 29 epochs, with a maximum sequence length of 256 tokens, a learning rate of and a warmup proportion of . When evaluating the performance of our models, we increased the maximum sequence length to 512 tokens to take into account the longer entities in the data.
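To illustrate what a warmup proportion means in practice, the sketch below computes a learning rate per training step. It assumes a linear warmup followed by a linear decay, which is a common default but is not confirmed by the text; the function name `linear_warmup_lr` and all numeric values are hypothetical placeholders, not the settings used in our experiments.

```python
def linear_warmup_lr(step, total_steps, base_lr, warmup_proportion):
    """Learning rate at `step`, assuming linear warmup then linear decay."""
    warmup_steps = int(total_steps * warmup_proportion)
    if step < warmup_steps:
        # Ramp up from 0 to base_lr over the warmup steps.
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr down to 0 at the last step.
    remaining = total_steps - step
    return base_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))


# Hypothetical values chosen only for illustration.
schedule = [linear_warmup_lr(s, total_steps=100, base_lr=1e-4,
                             warmup_proportion=0.1) for s in range(101)]
print(schedule[0], schedule[5], schedule[10], schedule[100])
```

With these placeholder values, the rate rises during the first 10% of the steps, peaks at the base learning rate, and then decreases to zero.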

2.1.2 Ensemble model

Our ensemble method is based on a voting strategy, where each model votes with its prediction and a majority of votes is necessary to assign the prediction. In order to decide which model composition to use in our ensemble, we used the dev set and computed all possible ensemble predictions according to the majority rule. We retained the ensemble composition with the best overall F-score and used it for the test set.
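The voting and composition search described above can be sketched as follows. This is a minimal illustration under stated assumptions: votes are taken per token, a label needs a strict majority, and tokens without a majority fall back to the outside label "O". The text does not specify these details, and `token_accuracy` is a simple stand-in for the F-score used in our experiments. All function names and the toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def majority_vote(predictions, default="O"):
    """Combine per-model label sequences (all the same length) token by token."""
    n_models = len(predictions)
    voted = []
    for labels in zip(*predictions):
        label, count = Counter(labels).most_common(1)[0]
        # Assumption: a strict majority is required, otherwise fall back to "O".
        voted.append(label if count > n_models / 2 else default)
    return voted


def token_accuracy(pred, gold):
    """Stand-in score; the experiments in the text use F-score instead."""
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)


def best_composition(model_preds, gold, score_fn, min_size=2):
    """Try every subset of models and keep the best-scoring ensemble."""
    best_combo, best_score = None, -1.0
    names = sorted(model_preds)
    for size in range(min_size, len(names) + 1):
        for combo in combinations(names, size):
            voted = majority_vote([model_preds[m] for m in combo])
            score = score_fn(voted, gold)
            if score > best_score:
                best_combo, best_score = combo, score
    return best_combo, best_score


# Toy dev-set example with three hypothetical models.
gold = ["B-time", "O", "B-solvent"]
model_preds = {
    "a": ["B-time", "O", "O"],
    "b": ["B-time", "B-solvent", "B-solvent"],
    "c": ["O", "O", "B-solvent"],
}
print(best_composition(model_preds, gold, token_accuracy))
```

In this toy case no single pair of models recovers all labels, while the three-model vote does, which is the kind of complementarity the composition search is meant to find.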

During the test phase, we were unable to compute predictions for the bert-large models by the deadline, so we had to remove these models from the ensemble. The prediction models considered in the ensemble are bert-base-cased, bert-base-uncased and a convolutional neural network model.

2.1.3 Baseline

As baselines we evaluated two models, Conditional Random Fields and a Convolutional Neural Network. Conditional Random Fields (CRF) address sequence classification by estimating the conditional probability of a label sequence given a word sequence, considering a set of observed features in the latter [Lafferty2001, Sutton2012]. Our CRF classifier relies on the CRFSuite implementation and a set of standard features in a window of tokens [Copara2016, Guo2014], without taking into account part-of-speech tags or gazetteers. The features used are the token itself, the lower-cased word, the capitalization pattern, the type of token (i.e., digit, symbol, word), 1-4 character prefixes/suffixes, digit size (i.e., size 2 or 4), combinations of values (digit with alphanumeric, hyphen, comma, period), and binary features for upper/lower-cased letters, alphabetic/digit characters and symbols. Please refer to [Copara2016, Guo2014, Okazaki2007] for further details on the features used.
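A few of the features listed above can be sketched as a per-token feature dictionary, the input shape commonly used with CRFSuite wrappers. This is an illustrative approximation, not our exact feature set: the helper names, the feature keys and the window size of 1 are assumptions, and only a subset of the features is shown.

```python
import re


def token_type(token):
    """Coarse token type: digit, word, symbol, or mixed."""
    if token.isdigit():
        return "digit"
    if token.isalpha():
        return "word"
    if not any(c.isalnum() for c in token):
        return "symbol"
    return "mixed"


def cap_pattern(token):
    """Capitalization pattern, e.g. 'TBDMS-Cl' -> 'A-Aa' (runs collapsed)."""
    shape = "".join(
        "A" if c.isupper() else "a" if c.islower() else "0" if c.isdigit() else c
        for c in token
    )
    return re.sub(r"(.)\1+", r"\1", shape)


def token_features(tokens, i, window=1):
    """Feature dict for tokens[i]; `window` is a hypothetical setting."""
    feats = {}
    for offset in range(-window, window + 1):
        j = i + offset
        if j < 0 or j >= len(tokens):
            feats[f"{offset}:pad"] = True  # position outside the sentence
            continue
        tok = tokens[j]
        feats[f"{offset}:token"] = tok
        feats[f"{offset}:lower"] = tok.lower()
        feats[f"{offset}:shape"] = cap_pattern(tok)
        feats[f"{offset}:type"] = token_type(tok)
    tok = tokens[i]
    for n in range(1, 5):  # 1-4 character prefixes and suffixes
        if len(tok) >= n:
            feats[f"prefix{n}"] = tok[:n]
            feats[f"suffix{n}"] = tok[-n:]
    feats["has_upper"] = any(c.isupper() for c in tok)
    feats["has_digit"] = any(c.isdigit() for c in tok)
    feats["digit_size_2_or_4"] = tok.isdigit() and len(tok) in (2, 4)
    return feats


print(token_features(["stirred", "at", "80", "°C"], 2))
```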

The Convolutional Neural Network (CNN) used for NER relies on incremental parsing with Bloom embeddings. The convolutional layers use residual connections, layer normalization and maxout non-linearity. The input sequence is embedded in a vector composed of Bloom embeddings modeling the characters, prefix, suffix and part-of-speech of each word. The CNN applies 1D convolutional filters over the text in order to predict how the next words are going to change. Our implementation relies on spaCy NER, using the pretrained transformer bert-base-uncased for 30 epochs and a batch size of 4. During the Test Phase, we had to fix the maximum text length to 1500k in order to reserve enough RAM.
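The idea behind Bloom (hash) embeddings can be sketched in a few lines: each key is hashed several times into a small shared table, and the selected rows are summed, so that distinct keys rarely collide on all rows at once. The class below is a hypothetical, pure-Python illustration; the table sizes, the MD5-based hashing and the random initialization are assumptions made for the example, and spaCy's own implementation differs in its details (for instance, the table is learned during training).

```python
import hashlib
import random


class BloomEmbedding:
    """Toy hash embedding: sum of several hashed rows of a small table."""

    def __init__(self, n_rows=1000, width=8, n_hashes=4, seed=0):
        rng = random.Random(seed)
        self.n_rows, self.width, self.n_hashes = n_rows, width, n_hashes
        # Randomly initialized here; in a real model these rows are trained.
        self.table = [[rng.uniform(-0.1, 0.1) for _ in range(width)]
                      for _ in range(n_rows)]

    def _rows(self, key):
        """Row indices for `key`, one per hash function."""
        for i in range(self.n_hashes):
            digest = hashlib.md5(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "little") % self.n_rows

    def __call__(self, key):
        vector = [0.0] * self.width
        for row in self._rows(key):
            for d, value in enumerate(self.table[row]):
                vector[d] += value
        return vector


embed = BloomEmbedding()
print(embed("THF"))
```

The table stays small regardless of vocabulary size, which is the main reason for using this scheme with open-ended vocabularies such as chemical names.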

2.2 Data

The data in ChEMU Task 1: NER is provided as snippets sampled from 170 English patent documents from the European Patent Office and the United States Patent and Trademark Office [He2020]. Gold annotations were provided for the training (900 snippets) and development (225 snippets) sets, for a total of 20,186 entities (Table 1). The annotations are in the BRAT standoff format; Fig. 1 shows an example of a snippet and its annotation.

Figure 1: Data example with annotations for ChEMU NER task
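For readers unfamiliar with the BRAT standoff format, the following sketch parses text-bound entity lines (those starting with `T`) into typed spans. The sample sentence and annotations are invented for illustration and are not taken from the ChEMU data; the `Entity` type and `parse_brat_entities` function are hypothetical helpers. Discontinuous spans are reduced to their overall extent, which is a simplification.

```python
from typing import NamedTuple


class Entity(NamedTuple):
    id: str
    label: str
    start: int
    end: int
    text: str


def parse_brat_entities(ann_text):
    """Parse text-bound annotations: 'T1<TAB>LABEL start end<TAB>surface'."""
    entities = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue  # skip relations, events, notes, blank lines
        ent_id, type_and_span, surface = line.split("\t", 2)
        label, span = type_and_span.split(" ", 1)
        # Discontinuous spans use ';' between fragments; keep the full extent.
        offsets = [int(x) for frag in span.split(";") for x in frag.split()]
        entities.append(Entity(ent_id, label, min(offsets), max(offsets), surface))
    return entities


# Invented snippet and annotation, for illustration only.
text = "The mixture was stirred in THF at 80 °C for 2 h."
ann = "\n".join([
    "T1\tSOLVENT 27 30\tTHF",
    "T2\tTEMPERATURE 34 39\t80 °C",
    "T3\tTIME 44 47\t2 h",
])
for entity in parse_brat_entities(ann):
    print(entity.label, repr(text[entity.start:entity.end]))
```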

During the development phase, we used the official development set as our test set to evaluate the performance of our models. The official training set was split into train and dev sets in order to train our models. With this new setting, we obtain 800 snippets for the train set, 100 for the dev set and 225 for the test set. Table 1 shows the entity distribution during the Development Phase. Most annotations come from other_compound, reaction_product and starting_material, which together cover 52% of the entities in the Development Phase. In contrast, example_label, time and yield_percent entities represent about 18% of the entities (3,598 of 20,186 in Table 1). We used the new split to tune the hyper-parameters of the models used in the Test Phase.

Entity                   Train       Dev        Test       Total
EXAMPLE_LABEL (EL)       784/5       102/5      218/6      1104/5
OTHER_COMPOUND (OC)      4095/28     545/29     1080/28    5720/28
REACTION_PRODUCT (RP)    1816/13     236/12     506/13     2558/13
REAGENT_CATALYST (RC)    1135/8      146/8      289/8      1570/8
SOLVENT (So)             1001/7      139/7      250/7      1390/7
STARTING_MATERIAL (SM)   1543/11     211/11     413/11     2167/11
TEMPERATURE (Te)         1345/9      170/9      346/9      1861/9
TIME (Ti)                928/6       131/7      252/7      1311/6
YIELD_OTHER (YO)         940/7       121/6      261/7      1322/7
YIELD_PERCENT (YP)       848/6       107/6      228/6      1183/6
All                      14435/100   1908/100   3843/100   20186/100
Table 1: Entity distribution in Development Phase (count/percentage per split)
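The shares quoted in the text can be checked directly from the total counts in Table 1. The short script below is only a worked arithmetic example; the variable and function names are arbitrary.

```python
# Total counts per entity type, copied from the last column of Table 1.
totals = {
    "EXAMPLE_LABEL": 1104, "OTHER_COMPOUND": 5720, "REACTION_PRODUCT": 2558,
    "REAGENT_CATALYST": 1570, "SOLVENT": 1390, "STARTING_MATERIAL": 2167,
    "TEMPERATURE": 1861, "TIME": 1311, "YIELD_OTHER": 1322, "YIELD_PERCENT": 1183,
}
grand_total = sum(totals.values())


def share(names):
    """Percentage of all annotated entities covered by the given types."""
    return 100.0 * sum(totals[name] for name in names) / grand_total


major = share(["OTHER_COMPOUND", "REACTION_PRODUCT", "STARTING_MATERIAL"])
minor = share(["EXAMPLE_LABEL", "TIME", "YIELD_PERCENT"])
print(f"{grand_total} entities, major: {major:.1f}%, minor: {minor:.1f}%")
```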

2.3 Evaluation metrics

The metrics used to evaluate ChEMU Task 1: NER are precision, recall, and F-score. As can be seen in the example in Fig. 1, each entity has a span that the models are expected to identify, together with the correct entity type. To evaluate how accurately the predicted span matches the gold span, the evaluation includes both exact and relaxed span matching conditions. Our models were evaluated with the ChEMU web page system and the BRAT Eval tool.
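The difference between exact and relaxed matching can be illustrated with a small scoring function. This sketch assumes that a relaxed match means the predicted and gold spans overlap and share the same entity type, and that each gold entity can be matched at most once; the official evaluation tools may differ in such details. The function names and the toy spans are hypothetical.

```python
def spans_match(pred, gold, relaxed):
    """Entities are (type, start, end) tuples with end exclusive."""
    (ptype, pstart, pend), (gtype, gstart, gend) = pred, gold
    if ptype != gtype:
        return False
    if relaxed:
        return pstart < gend and gstart < pend  # any overlap counts
    return (pstart, pend) == (gstart, gend)


def precision_recall_f1(preds, golds, relaxed=False):
    """Greedy one-to-one matching of predictions against gold entities."""
    used = set()
    true_positives = 0
    for pred in preds:
        for index, gold in enumerate(golds):
            if index not in used and spans_match(pred, gold, relaxed):
                used.add(index)
                true_positives += 1
                break
    precision = true_positives / len(preds) if preds else 0.0
    recall = true_positives / len(golds) if golds else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy example: one exact hit, one partial span, one miss.
golds = [("SOLVENT", 27, 30), ("TEMPERATURE", 34, 39), ("TIME", 44, 47)]
preds = [("SOLVENT", 27, 30), ("TEMPERATURE", 34, 36), ("TIME", 50, 52)]
print(precision_recall_f1(preds, golds, relaxed=False))
print(precision_recall_f1(preds, golds, relaxed=True))
```

The partially detected temperature span counts only under relaxed matching, which is why relaxed scores are never lower than exact scores in this scheme.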

3 Results and discussion

3.1 Comparison in Development Phase

Table 2 shows the exact and relaxed F-scores for all the models explored for ChEMU NER. The reported results come from the ChEMU web page system except for CNN, bert-large-uncased, and ensemble models that come from the BRAT Eval tool.

We assess the performance of two baselines, i.e., the CRF and CNN models. CRF achieves an exact F-score of 0.8722 (Table 2). For the entities that make up the largest proportion of the data (starting_material, reaction_product, other_compound), the exact F-scores in Table 5 average 0.7850 for CRF and 0.8021 for CNN; CRF compensates for this with better scores on entities such as temperature, time and solvent.

Matching   CRF      CNN*     bert-base-cased   bert-base-uncased   bert-large-cased   bert-large-uncased*   ChemBERTa   Ensemble*
exact      0.8722   0.8182   0.9140            0.9113              0.9079             0.9052                0.6810      0.9285
relaxed    0.9450   0.8820   0.9732            0.9719              0.9706             0.9910                0.8500      0.9876
Table 2: F-scores in Development Phase. Models marked with * are evaluated using the BRAT Eval tool for the task; the remaining models are evaluated using the ChEMU web page system.

Among the BERT-based models, the ensemble shows our best F-score in the Development Phase. The entities time, yield_other and yield_percent were recognized with the highest F-scores. We attribute this to the nature of these entities and to the language models involved, given that the ensemble mainly relies on the bert-base models. On the other hand, the reaction_product, reagent_catalyst and starting_material entities were recognized less well, with exact F-scores of 0.8807, 0.8946 and 0.8470, respectively (Table 5). These are chemical entity types [He2020] (e.g., for starting_material: 4-(6-Bromo-3-methoxypyridin-2-yl)-6-chloropyrimidin-2-amine), yet they still exhibit patterns that were sufficient for the models to recognize them under exact matching.

We performed an analysis of statistical significance on the predictions of the studied models and found that among the ensemble approach, bert-base-cased, bert-base-uncased, bert-large-cased and bert-large-uncased, there is no statistically significant difference according to a two-tailed t-test. Our analysis takes into account the span and the entity type under exact matching.

We also investigated the performance of ChemBERTa, which was expected to achieve better results. However, even though it is a domain-specific language model (pre-trained with SMILES strings from the ZINC database), the specialized language of chemical patents goes in a different direction, leading to the lowest results among all the explored models (for both exact and relaxed metrics; see the exact F-scores in Table 5).

Even though our language models are not able to encode the specialized language of chemical patents, these results show the high ability of contextualized neural language models to perform chemical NER in patents, and they are promising with respect to domain-specific pre-training.

3.2 Test phase

We performed the evaluation on the released test set (9,999 files containing chemical narratives from patents), for which 3 runs were allowed. For run 1, we submitted our CRF baseline. For run 2, we used bert-base-cased, and for run 3, our ensemble based on the majority vote approach. Table 3 shows the official F-scores of our submissions for exact and relaxed span matching. The ensemble achieved an exact F-score of 0.9230, exceeding both our baseline (0.8056) and the best individual contextualized language model, bert-base-cased (0.9098).

For each of our submissions, the entity with the lowest exact F-score is starting_material, achieving 0.4957 with CRF, 0.8413 with bert-base-cased and 0.8701 with the ensemble. The difference between CRF and the ensemble suggests that the major advantage of language models based on attention mechanisms lies in exploiting the wealth of natural language without any specific design of hand-crafted features, as is necessary for CRF.

The five best-performing entities are example_label, temperature, time, yield_other and yield_percent, which is similar to the results in the development phase. These results suggest that the test set has an entity distribution similar to that of the train and dev sets, despite the vast number of test files provided.

Entity CRF bert-base-cased Ensemble
exact relaxed exact relaxed exact relaxed
EXAMPLE_LABEL 0.9190 0.9367 0.9617 0.9730 0.9669 0.9784
OTHER_COMPOUND 0.8310 0.9029 0.8780 0.9608 0.8920 0.9653
REACTION_PRODUCT 0.6462 0.7689 0.8593 0.9378 0.8766 0.9322
REAGENT_CATALYST 0.7598 0.8035 0.8791 0.9082 0.9022 0.9176
SOLVENT 0.8299 0.8323 0.9444 0.9491 0.9541 0.9541
STARTING_MATERIAL 0.4957 0.6752 0.8413 0.9343 0.8701 0.9394
TEMPERATURE 0.9499 0.9688 0.9692 0.9902 0.9729 0.9877
TIME 0.9698 0.9843 0.9868 0.9967 0.9879 0.9978
YIELD_OTHER 0.8984 0.8984 0.9799 0.9821 0.9842 0.9865
YIELD_PERCENT 0.9705 0.9807 0.9936 0.9962 0.9974 0.9974
all 0.8056 0.8683 0.9098 0.9596 0.9230 0.9624
Table 3: Official F-scores of our submissions

Our work was submitted to the competition under the team name BiTeM. The top 10 submissions in the competition, ranked by exact F-score, are shown in Table 4, which includes our runs 2 and 3. Our ensemble (exact F-score of 0.9230) is better than the ChEMU Task 1 NER baseline (0.8893) and behind the top-ranked submission (0.9570).

Rank  Name  Team  Precision         Recall            F1
                  exact  relaxed    exact  relaxed    exact  relaxed
1 Fi***st Melaxtech 0.9571 0.9690 0.9570 0.9687 0.9570 0.9688
2 fu***NE Melaxtech 0.9587 0.9697 0.9529 0.9637 0.9558 0.9667
3 mu***NE Melaxtech 0.9572 0.9688 0.9510 0.9624 0.9541 0.9656
4 00***AT VinAI 0.9462 0.9707 0.9405 0.9661 0.9433 0.9684
5 ta***on Lasige_BioTM 0.9327 0.9590 0.9457 0.9671 0.9392 0.9630
6 ru***le BiTeM 0.9378 0.9692 0.9087 0.9558 0.9230 0.9624
7 ru***ed BiTeM 0.9083 0.9510 0.9114 0.9684 0.9098 0.9596
8 te***pc NextMove Software 0.9042 0.9301 0.8924 0.9181 0.8983 0.9240
9 te***npc NextMove Software 0.9037 0.9294 0.8918 0.9178 0.8977 0.9236
10 BANNER Baseline 0.9071 0.9219 0.8723 0.8893 0.8893 0.9053
Table 4: Top 10 participant submissions

Our CRF baseline achieves an exact F-score of 0.8056 (Table 3), while the competition baseline achieves 0.8893 (Table 4). BANNER [Leaman2008], the competition baseline, is also based on CRF but is customized to biomedical NER, taking into account features such as part-of-speech, lemma, Roman numerals and names of Greek letters. These features likely give BANNER an advantage, as they better characterize chemical entities.

3.3 Error analysis

The gold annotations for the test set are not available, so we performed our error analysis on the development set. The results of all models with respect to each class are presented in Table 5. Among all models, ChemBERTa achieves the lowest performance. With the exception of ChemBERTa, the BERT-based models outperform the baseline models for most classes. The ensemble model outperforms the single models on nearly every class (the exception in Table 5 is time, where one single model reaches 1.0000). Relative to the CRF baseline, the ensemble achieves its largest improvement for reaction_product and starting_material, with an increase of over 12 points in F-score.

Entity CRF CNN bert-base-cased bert-base-uncased bert-large-cased bert-large-uncased ChemBERTa Ensemble
EL 0.963 0.9526 0.9862 0.9817 0.9793 0.9769 0.9631 0.9885
OC 0.8762 0.7409 0.8953 0.8938 0.8947 0.8925 0.7850 0.9052
RP 0.7535 0.8425 0.8586 0.8515 0.8410 0.8427 0.5957 0.8807
RC 0.833 0.8557 0.8595 0.8355 0.8498 0.8468 0.4673 0.8946
So 0.8949 0.7517 0.9447 0.9451 0.9407 0.9426 0.5945 0.9545
SM 0.7253 0.8229 0.8072 0.8153 0.7995 0.7813 0.4405 0.8470
Te 0.9796 0.6397 0.9842 0.9842 0.9827 0.9841 0.8105 0.9855
Ti 0.99 0.8533 1.0000 0.9941 0.9941 0.9941 0.8141 0.9980
YO 0.9046 0.9448 0.9905 0.9924 0.9811 0.9848 0.7135 0.9943
YP 0.9913 0.9693 0.9978 0.9978 0.9913 0.9892 0.7131 0.9978
Table 5: Evaluation results (F) on development set – Exact scores

The error analysis of the exact matches shows that most of the confusion occurs for starting_material, which is confused with reagent_catalyst more often than with any other class, while reaction_product is mistaken for other_compound (see Fig. 2). Some examples of detected entities with incorrect labels are presented in Table 6. For example, the ensemble model correctly detected the span of the passage isopropylamine but incorrectly tagged it as reagent_catalyst instead of starting_material. Similarly, the bert-base-cased model incorrectly tagged the passage TBDMS-Cl as reagent_catalyst and also failed to detect the span of the entity correctly.

Figure 2: Normalized confusion matrix for the ensemble model on the development set (only exact matches were considered).
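A normalized confusion matrix of this kind can be computed from aligned gold and predicted labels as sketched below. The sketch assumes row normalization, i.e., each gold class sums to 1; the normalization actually used for the figure is not stated in the text. The function name and the toy labels are hypothetical and do not reproduce our results.

```python
from collections import Counter, defaultdict


def normalized_confusion(gold_labels, pred_labels):
    """Row-normalized confusion: matrix[gold][pred] = fraction of gold class."""
    counts = defaultdict(Counter)
    for gold, pred in zip(gold_labels, pred_labels):
        counts[gold][pred] += 1
    return {
        gold: {pred: n / sum(row.values()) for pred, n in row.items()}
        for gold, row in counts.items()
    }


# Toy labels for entities whose spans were matched exactly.
gold_labels = ["SM", "SM", "SM", "SM", "RP", "RP"]
pred_labels = ["SM", "SM", "SM", "RC", "RP", "OC"]
for gold, row in normalized_confusion(gold_labels, pred_labels).items():
    print(gold, row)
```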

Model Gold Prediction Gold entity Predicted entity
bert-base-cased SM RC TBDMS-Cl -
bert-base-cased RP OC Aromatic Amine Derivative Amine
Ensemble SM RC isopropylamine isopropylamine
Ensemble RP OC 78 78
Table 6: Mislabelled examples – some passages are partially detected.

Entities with longer text, such as reagent_catalyst, other_compound, reaction_product, and starting_material, are more likely to be only partially detected by the BERT models, mainly BERT-large and ChemBERTa (see the example prediction in Fig. 3). In particular, the larger size of the bert-large models did not translate into more effective representations. Fig. 3 shows how different models detected a reagent_catalyst entity. BERT-large-uncased and ChemBERTa did not detect the entity. Both BERT-large-cased and BERT-base-cased were able to partially detect the entity.

Figure 3: An example of predictions by different models for a REAGENT_CATALYST annotation. The span detected by each model is color-coded.

4 Conclusions

In this task, we explored the use of contextualized language models based on the transformer architecture to extract information from chemical patents. The combination of language models resulted in an effective approach, outperforming both the baseline CRF model and the individual transformer models. Our experiments have shown that, without extensive pre-training in the chemical patent domain, the majority vote approach is able to leverage distinctive features that are present in the English language as well as in patents, and achieves an exact F-score of 0.9230 on the test set. It seems that the transformer models are able to take advantage of natural language contexts in order to capture the most relevant features without supervision in the chemical domain. Our next step will be to investigate models pre-trained on large chemical patent corpora to further improve NER performance.