Chemical patents represent a valuable information resource for downstream innovation applications, such as drug discovery and novelty checking. However, the discovery of chemical compounds described in patents is delayed by a few years [He2020]. Among the reasons are the recent increase in the number of chemical patents, which makes manual curation impractical, and their particular wording. Moreover, the narrative in chemical patents expresses key concepts in wording chosen to protect the underlying knowledge, whereas scientific literature tends to be as clear as possible [Valentinuzzi2017]. In addition, chemical patents represent a complex source of information [Habibi2016]. In this landscape, information extraction methods such as Named Entity Recognition (NER) provide a well-suited solution to identify key information in patents.
NER aims to identify information of interest and its specific instances found in a document [Grishman2019, Okurowski1993]. It has often been addressed as a sequence classification task. One of the most successful approaches to sequence classification is Conditional Random Fields (CRF) [Lafferty2001, Sutton2012], which was established as the state of the art in different NER domains for many years [Leaman2008, Rocktschel2012, Leaman2015, Ratinov2009, Guo2014, Habibi2016, Yadav2018]. In the chemical patent domain, CRF was explored by Zhang et al. [Zhang2016] on the CHEMDNER patent corpus [Krallinger2015]. Using a set of hand-crafted and unsupervised features derived from word embeddings and Brown clustering, their model achieved an F-score of . With similar F-score performance, Akhondi et al. [Akhondi2016] explored CRF combined with dictionaries, as implemented in the biomedical tool tmChem [Leaman2015], in order to select the best vocabulary for the CHEMDNER patent corpus. It has been shown [Habibi2016] that recognizing chemical entities in full patent texts is a harder task than in titles and abstracts, given the peculiarities of this kind of text. Evaluation on full patents was performed on the Biosemantics corpus [Akhondi2014] with neural approaches based on biLSTM-CRF [Habibi2017] and biLSTM-CNN-CRF [Zhai2019], where the former achieved an F-score of and the latter of . It is worth noting that in [Zhai2019] the authors used ELMo contextualized embeddings [Peters2018], while in [Habibi2017] the authors used word2vec embeddings [Mikolov2013] to represent features.
Over the years, neural language models have improved their ability to encode the semantics of words using large amounts of unlabeled text. They initially evolved from a straightforward model [Bengio2003] with one hidden layer that predicts the next word in a sequence, aiming to learn distributed representations of words (i.e., word embedding vectors), to an improved objective function that allows learning from a larger amount of text [Collobert2011], but at the cost of higher computational resource usage and longer training time. These developments motivated the search for language models able to produce high-quality word embeddings at lower computational cost (i.e., word2vec [Mikolov2013] and GloVe [Pennington2014]). However, natural language still presented challenges for language models, in particular concerning word contexts. Recently, a second type of word embedding has attracted attention in the literature, the so-called contextualized embeddings, such as ELMo [Peters2018], ULMFiT [Howard2018], GPT-2 [Radford2019], and BERT [Devlin2019]. In particular, BERT relies on the Transformer architecture and uses the attention mechanism to pre-train deep bidirectional representations, conditioning tokens on both the left and right context.
In this work, we explore contextualized language models to extract information from chemical patents as part of the lab ChEMU – Information Extraction from Chemical Patents [He2020] – at the 11th Conference and Labs of the Evaluation Forum (CLEF 2020), Task 1: Named Entity Recognition. The entities in the corpus are example_label, other_compound, reaction_product, reagent_catalyst, solvent, starting_material, temperature, time, yield_other, and yield_percent. BERT-based models were used as pre-trained language models and fine-tuned on the ChEMU NER task to classify tokens according to the different entities. We investigate the combination of different architectures to improve NER performance. In the following sections, we describe the design and results of our experiments.
2 Methods and Data
2.1 NER model
2.1.1 Transformers with a token classification on top
We used five BERT-based language models [Devlin2019]. The first four models are bert-base-cased, bert-base-uncased, bert-large-cased and bert-large-uncased. They were pretrained on a large corpus of English text, with different model sizes for base and large. The fifth pretrained language model is ChemBERTa, a RoBERTa [Liu2019] model trained on a corpus of 100k SMILES strings from the benchmark dataset ZINC.
The NER model consists of a BERT module with a fully connected layer on top of the hidden states of each token, fine-tuned with the Adam optimizer [Kingma2015]. We used the implementation from the Hugging Face framework.111https://huggingface.co/transformers/
The language models were fine-tuned on the ChEMU Task 1 dataset. The first four language models were fine-tuned for 10 epochs, with a maximum sequence length of 256 tokens, a learning rate of and a warmup proportion of . The ChemBERTa model was fine-tuned for 29 epochs, with a maximum sequence length of 256 tokens, a learning rate of and a warmup proportion of . When evaluating the performance of our models, we increased the maximum sequence length to 512 tokens to take into account the longer entities in the data.
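The maximum-sequence-length constraint can be illustrated with a minimal pure-Python sketch that chunks an aligned token/label sequence into windows that fit the model input; `split_into_windows` is a hypothetical helper for illustration, not part of our pipeline:

```python
def split_into_windows(tokens, labels, max_len=256):
    """Chunk an aligned (token, label) sequence into windows of at
    most max_len items, so each window fits the model's maximum
    sequence length."""
    assert len(tokens) == len(labels)
    return [
        (tokens[i:i + max_len], labels[i:i + max_len])
        for i in range(0, len(tokens), max_len)
    ]
```

In practice, the tokenizer's subword splitting also has to be taken into account so that labels stay aligned with the first subword of each token.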
2.1.2 Ensemble model
Our ensemble method is based on a voting strategy, where each model votes with its prediction and a majority of votes is necessary to assign the prediction. In order to decide which model composition to use in our ensemble, we used the dev set and computed all possible ensemble predictions according to the majority rule. We retained the ensemble composition with the best overall F-score and used it for the test set.
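The per-token majority vote can be sketched in pure Python as follows. The tie-handling fallback to the outside label "O" is our assumption for the sketch; the strict-majority requirement follows the description above:

```python
from collections import Counter

def majority_vote(predictions, fallback="O"):
    """Combine per-token label predictions from several models.

    predictions: list of label sequences, one per model, all aligned
    on the same tokens. A label is assigned only when a strict
    majority of models agree on it; otherwise the fallback label is
    used (tie handling is an assumption of this sketch).
    """
    n_models = len(predictions)
    combined = []
    for labels in zip(*predictions):
        label, count = Counter(labels).most_common(1)[0]
        combined.append(label if count > n_models / 2 else fallback)
    return combined
```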
During the test phase, as we were unable to compute predictions for the bert-large models by the deadline, we had to exclude them from the ensemble. The prediction models considered in the ensemble are bert-base-cased, bert-base-uncased and a convolutional neural network model.
As baselines, we evaluated two models: Conditional Random Fields and a Convolutional Neural Network. Conditional Random Fields (CRF) address sequence classification by estimating the conditional probability of a label sequence given a word sequence, considering a set of features observed in the latter [Lafferty2001, Sutton2012]. Our CRF classifier relies on the CRFSuite222http://www.chokkan.org/software/crfsuite/ implementation and a set of standard features in a window of tokens [Copara2016, Guo2014], without taking into account part-of-speech tags or gazetteers. The features used are the token itself, the lower-cased word, the capitalization pattern, the type of token (i.e., digit, symbol, word), 1-4 character prefixes/suffixes, digit size (i.e., size 2 or 4), combinations of values (digit with alphanumeric, hyphen, comma, period), and binary features for upper/lower-cased letters, alphabet/digit characters and symbols. Please refer to [Copara2016, Guo2014, Okazaki2007] for further details on the features used.
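A subset of these hand-crafted features can be sketched as a plain feature-extraction function (an illustrative sketch only; the full feature set also includes the windowed context and the combination features):

```python
def token_features(token):
    """Extract a subset of the hand-crafted CRF features described
    above for a single token (illustrative sketch)."""
    feats = {
        "token": token,
        "lower": token.lower(),
        # capitalization pattern: A=upper, a=lower, 0=digit, -=other
        "cap_pattern": "".join(
            "A" if c.isupper() else "a" if c.islower()
            else "0" if c.isdigit() else "-" for c in token
        ),
        "is_digit": token.isdigit(),
        # digit size feature fires only for sizes 2 and 4
        "digit_size": len(token) if token.isdigit() and len(token) in (2, 4) else 0,
    }
    for n in range(1, 5):  # 1-4 character prefixes/suffixes
        feats[f"prefix{n}"] = token[:n]
        feats[f"suffix{n}"] = token[-n:]
    return feats
```

In a real CRF pipeline these dictionaries would be fed, one per token, to an implementation such as CRFSuite.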
The Convolutional Neural Network (CNN) used for NER relies on incremental parsing with Bloom embeddings. The convolutional layers use residual connections, layer normalization and maxout non-linearity. The input sequence is embedded into vectors composed of Bloom embeddings modeling the characters, prefix, suffix and part-of-speech of each word. 1D convolutional filters are applied over the text in order to predict how the next words change the representation. Our implementation relies on spaCy NER333https://spacy.io, using the pretrained transformer bert-base-uncased for 30 epochs and a batch size of 4. During the Test Phase, we needed to fix the maximum text length to 1500k to reserve enough RAM.
2.2 Data
The data in ChEMU Task 1: NER is provided as snippets sampled from 170 English patent documents from the European Patent Office and the United States Patent and Trademark Office [He2020]. Gold annotations were provided for the training (900 snippets) and development (250 snippets) sets, for a total of entities. The annotation was done in the BRAT standoff format; Fig. 1 shows an example of a snippet and its annotation.
During the development phase, we used the official development set to evaluate the performance of our models, so it served as our test set in this phase. The official training set was split into train and dev sets in order to train our models. As a result of this new setting, we obtained 800 snippets for the train set, 100 for the dev set and 225 for the test set. Table 1 shows the entity distribution during the Development Phase. Most annotations come from other_compound, reaction_product and starting_material, covering 52% of the entities in the Development Phase. In contrast, the example_label, time and yield_percent entities represent of the entities in this phase. We used the new split to tune the hyper-parameters of the models used in the Test Phase.
2.3 Evaluation metrics
The metrics used to evaluate ChEMU Task 1: NER are precision, recall, and F-score. As can be seen in the example in Fig. 1, each entity has a span that the models are expected to identify, as well as the correct entity type. To evaluate how accurately the predicted span matches the real one, both exact and relaxed span matching conditions are included in the evaluation. Our models were evaluated with the ChEMU web page system444http://chemu.eng.unimelb.edu.au/ and the BRAT Eval tool.555https://bitbucket.org/nicta_biomed/brateval/src/master/
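The two matching conditions can be sketched as follows (a minimal pure-Python illustration of our reading of the matching rules, not the BRAT Eval implementation itself):

```python
def match(gold, pred, mode="exact"):
    """Check whether a predicted entity matches a gold one.

    gold, pred: (start, end, type) tuples with end-exclusive offsets.
    'exact' requires identical boundaries; 'relaxed' only requires
    span overlap. Both require the same entity type (an assumption
    of this sketch).
    """
    if gold[2] != pred[2]:
        return False
    if mode == "exact":
        return gold[:2] == pred[:2]
    # relaxed: any character overlap between the two spans
    return gold[0] < pred[1] and pred[0] < gold[1]
```

Precision, recall and F-score are then computed from the counts of matched predictions under each condition.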
3 Results and discussion
3.1 Comparison in Development Phase
Table 2 shows the exact and relaxed F-scores for all the models explored for ChEMU NER. The reported results come from the ChEMU web page system except for CNN, bert-large-uncased, and ensemble models that come from the BRAT Eval tool.
We assess the performance of the two baselines, i.e., the CRF and CNN models. CRF achieves an overall F-score of ; for the entities with the largest proportion in the data (starting_material, reaction_product, other_compound) it achieves an average F-score of , while CNN achieves an average of , compensated by entities such as temperature, time and solvent.
Among the BERT-based models, the ensemble shows our best F-score in the Development Phase. The time, yield_other and yield_percent entities were recognized with the highest F-scores. We attribute this to the nature of these entities and the language models involved, given that the ensemble mainly relies on bert-base models. On the other hand, the reaction_product, reagent_catalyst and starting_material entities were recognized less well, with , and of F-score, respectively. These entities are chemical entity types [He2020] (e.g., for starting_material: 4-(6-Bromo-3-methoxypyridin-2-yl)-6-chloropyrimidin-2-amine), but they still present some patterns that were sufficient to recognize them under exact matching.
We performed an analysis of statistical significance on the predictions of the studied models and found that among the ensemble approach, bert-base-cased, bert-base-uncased, bert-large-cased and bert-large-uncased, there is no statistically significant difference ( by two-tailed t-test). Our analysis takes into account the span and entity type in exact matching.
We also investigated the performance of ChemBERTa, which was expected to achieve better results; however, even though it is a domain-specific language model (pre-trained on SMILES strings from the ZINC database), the specialized language of chemical patents goes in a different direction, leading to the lowest results among all the explored models (exact and relaxed metrics; see exact F-scores in Table 5).
Even if our language models are not able to fully encode the specialized language of chemical patents, these results show the strong ability of contextualized neural language models to perform chemical NER in patents, and the results are promising for domain-specific pre-training.
3.2 Test phase
We performed the evaluation on the released test set (9,999 files containing chemical narratives from patents), where 3 runs were allowed. For run 1, we submitted our CRF baseline. For run 2, we used bert-base-cased, and for run 3, our ensemble based on the majority vote approach. Table 3 shows the official F-scores of our submissions for exact and relaxed span matching. The ensemble achieved an exact F-score of , exceeding both our baseline and the best individual contextualized language model (bert-base-cased).
For each of our submissions, the entity with the lowest exact F-score is starting_material, achieving in CRF, in bert-base-cased and in the ensemble. The difference between CRF and the ensemble shows that the major advantage of language models based on attention mechanisms lies in their ability to exploit the richness of natural language without any specific design of hand-crafted features, as is necessary for CRF.
The top-5 best performing entities are example_label, temperature, time, yield_other and yield_percent, similar to the results in the development phase. These results suggest that the entity distribution is similar across the train, dev and test sets, despite the vast number of test files provided.
Our work was presented in the competition as the BiTeM team. The top 10 submissions, ranked by exact F-score, are shown in Table 4, where our runs 2 and 3 are included. Our ensemble outperforms the ChEMU Task 1 NER baseline but lies behind the top-ranked submission in terms of exact F-score.
Our CRF baseline achieves an exact F-score of , while the competition baseline achieves . The competition baseline is BANNER [Leaman2008], also based on CRF but customized for biomedical NER, taking into account features such as part-of-speech tags, lemmas, Roman numerals and names of Greek letters. Indeed, those features give BANNER an advantage, as they better characterize chemical entities.
3.3 Error analysis
The gold annotations for the test set are not available, so we performed our error analysis on the development set. The results of all models with respect to each class are presented in Table 5. Among all models, ChemBERTa achieves the lowest performance. All the BERT-based models outperform the baseline models for all classes. The ensemble model consistently outperforms the single models, achieving the highest improvement for reaction_product and starting_material, with over a 12-point increase in F-score.
The error analysis of the exact matches shows that most confusion occurred for starting_material, which is confused with reagent_catalyst more than with any other class, while reaction_product is mistaken for other_compound (see Fig. 2). Some examples of detected entities with incorrect labels are presented in Table 6: e.g., the ensemble model correctly detected the span of the passage isopropylamine; however, it incorrectly tagged it as reagent_catalyst instead of starting_material. Similarly, the bert-base-cased model tagged the passage TBDMS-Cl incorrectly as reagent_catalyst, and it also did not correctly detect the span of the entity.
| Model | Gold | Prediction | Gold entity | Predicted entity |
|---|---|---|---|---|
| bert-base-cased | RP | OC | Aromatic Amine Derivative | Amine |
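Counting such gold/predicted label disagreements over aligned spans can be sketched in pure Python (an illustrative helper, not the BRAT Eval confusion analysis itself; it assumes the two label lists are aligned on the same detected spans):

```python
from collections import Counter

def label_confusions(gold_labels, pred_labels):
    """Count (gold, predicted) label pairs where the entity type
    disagrees, e.g. starting_material predicted as reagent_catalyst."""
    return Counter(
        (g, p) for g, p in zip(gold_labels, pred_labels) if g != p
    )
```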
Entities with longer text, such as reagent_catalyst, other_compound, reaction_product and starting_material, are more likely to be only partially detected by the BERT models, mainly BERT-large and ChemBERTa (see the example prediction in Fig. 3). In particular, the larger size of the BERT-large models did not translate into more effective representations. Fig. 3 shows how different models detected a reagent_catalyst entity: BERT-large-uncased and ChemBERTa did not detect the entity, while both BERT-large-cased and BERT-base-cased were able to detect it partially.
4 Conclusion
In this task, we explored the use of contextualized language models based on the Transformer architecture to extract information from chemical patents. The combination of language models proved an effective approach, outperforming not only the baseline CRF model but also the individual transformer models. Our experiments have shown that, without extensive pre-training in the chemical patent domain, the majority vote approach is able to leverage distinctive features, present in general English as well as in patents, achieving an exact F-score of . It seems that transformer models are able to take advantage of natural language contexts to capture the most relevant features without supervision in the chemical domain. Our next step will be to investigate models pre-trained on large chemical patent corpora to further improve NER performance.