Extracting entities from scientific texts is an important task in Natural Language Processing. While traditional methods were based on manually engineered features, pre-trained language models such as BERT (devlin-etal-2019-bert) have recently achieved competitive results. However, since BERT was pre-trained on general text from Wikipedia and BookCorpus, several previous works have shown its performance on domain-specific tasks to be suboptimal (beltagy-etal-2019-scibert; Lee_2019). These empirical findings have motivated the development of domain-specific pre-trained language models. In particular, several domain-specific transformers have been made publicly available through the Hugging Face Transformers library: for example, SciBERT (beltagy-etal-2019-scibert) for the scientific domain, BioBERT (Lee_2019) for the biomedical domain, and FinBERT (yang2020finbert) for the financial domain.
Pre-trained transformers such as SciBERT or BioBERT have achieved excellent performance in scientific NER compared to previous work. Despite this success, however, their traditional fine-tuning for NER can be suboptimal, as they typically classify the first-subtoken representation of each word to label sequences. Some work has attempted to avoid this problem by casting NER as span-based classification instead of sequence labeling (Eberts2020SpanbasedJE). However, these methods are more difficult to implement and require many additional hyperparameters, such as the number of negative samples and the window size. In this work, we adopt the conventional sequence labeling approach using the BIO scheme, but instead of directly classifying word labels, we add a transformer layer (vaswani2017attention) on top of the subword representations to better encode word-level interactions. More specifically, the first subtoken of every word is passed through an additional transformer layer before word classification. This small architectural change can yield a significant additional performance gain.
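The core idea above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the authors' released code: the tensor shapes and index values are hypothetical, and we assume the encoder output is already available.

```python
import torch
import torch.nn as nn

# Hypothetical encoder output for one sentence of 12 subtokens;
# first_idx holds the position of each word's first subtoken.
subword_repr = torch.randn(1, 12, 768)        # (batch, num_subtokens, hidden)
first_idx = torch.tensor([[0, 2, 3, 7, 9]])   # (batch, num_words)

# Gather the first-subtoken vector of every word.
idx = first_idx.unsqueeze(-1).expand(-1, -1, 768)
word_repr = subword_repr.gather(1, idx)       # (batch, num_words, hidden)

# Encode word-level interactions with one extra transformer layer
# before classification.
word_layer = nn.TransformerEncoderLayer(d_model=768, nhead=8, batch_first=True)
word_repr = word_layer(word_repr)             # (batch, num_words, hidden)
```

The gather step is the only change relative to standard first-subtoken classification; the extra layer then lets word representations attend to one another before labeling.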
In this paper, we treat the NER task as a classification of BIO tags. Our model consists of three main elements: a pre-trained transformer layer, a word-level interaction layer and a classification layer composed of a linear layer and a CRF layer.
Pre-trained transformer layer
The first part of our architecture is a pre-trained transformer model that encodes input subwords into contextualized embeddings. Usually, a model like BERT (devlin-etal-2019-bert) or RoBERTa (liu2019roberta) is considered for this step, but since we deal with scientific documents, we used a domain-specific model called SciBERT (beltagy-etal-2019-scibert).
Word-level interaction layer
This component takes as input the first-subword representation of each word from the previous layer and encodes their interaction with a single-layer transformer (vaswani2017attention). Unlike previous models that directly classify the first subtoken, this additional layer provides a better representation for sequence labeling.
Classification layer
Finally, this component classifies each word representation into the corresponding label using a linear layer. Furthermore, a CRF layer (10.5555/645530.655813) ensures that all output labels are valid by adding constraints during inference (e.g., label O must not precede label I), which also makes evaluation easier.
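The BIO validity constraint mentioned above can be expressed as a transition mask over label pairs. The sketch below uses a small hypothetical label set for illustration; a CRF implementation would use such a mask to forbid invalid transitions during Viterbi decoding.

```python
# Hypothetical BIO label set for illustration.
labels = ["O", "B-Task", "I-Task", "B-Metric", "I-Metric"]

def allowed_transition(prev: str, curr: str) -> bool:
    """True when label `curr` may follow label `prev` under the BIO scheme."""
    if curr.startswith("I-"):
        # I-X must continue an entity of the same type X.
        return prev in (f"B-{curr[2:]}", f"I-{curr[2:]}")
    return True  # O and B-* may follow any label

# allowed[i][j]: label j may follow label i.
mask = [[allowed_transition(p, c) for c in labels] for p in labels]
```

For example, `O -> I-Task` is disallowed, which is exactly the constraint cited in the text (label O must not precede label I).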
SciERC is a dataset introduced by luan-etal-2018-multi for scientific information extraction. It consists of 500 abstracts extracted from papers related to Artificial Intelligence, annotated with scientific entities, their relations, and coreference clusters.
TDM is a NER dataset that was recently published (hou2021tdmsci) for detecting Tasks, Datasets, and Metrics (TDM) from Natural Language Processing papers. It consists of sentences extracted from the full text of natural language processing papers, not just abstracts.
NCBI (DOGAN20141) is a NER dataset that is designed to identify disease mentions in biomedical texts from PubMed article abstracts.
3.2 Implementation details
For all experiments, we used the recommended version of SciBERT (beltagy-etal-2019-scibert), which contains 12 transformer layers and a hidden dimension of 768, as the pre-trained model. For the word-level layer, we employ a one-layer bidirectional transformer with a hidden dimension of 768 and 8 attention heads. We also experiment with a Bi-LSTM (10.1162/neco.1997.9.8.1735) with a hidden dimension of 768 in place of the word-level transformer layer to measure the difference in performance. The final layer is a linear layer that projects the word representations into the label space before feeding them into a CRF layer for decoding.
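The two word-level layer variants can be instantiated as follows. This is a sketch, not the authors' code; in particular, we assume the Bi-LSTM's "hidden dimension of 768" refers to the concatenated bidirectional output (384 units per direction), so that both variants produce representations of the same width.

```python
import torch
import torch.nn as nn

HIDDEN = 768  # SciBERT hidden dimension

# Variant 1: one bidirectional transformer layer with 8 attention heads.
word_transformer = nn.TransformerEncoderLayer(
    d_model=HIDDEN, nhead=8, batch_first=True)

# Variant 2: one Bi-LSTM layer; 384 units per direction so the
# concatenated output keeps the same 768-dim width (an assumption).
word_bilstm = nn.LSTM(HIDDEN, HIDDEN // 2,
                      bidirectional=True, batch_first=True)

x = torch.randn(2, 10, HIDDEN)    # (batch, num_words, hidden)
out_tr = word_transformer(x)      # same shape as x
out_lstm, _ = word_bilstm(x)      # same shape as x
```

Either output can be fed unchanged to the linear projection and CRF layer, which makes the two variants directly comparable.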
We did not conduct an extensive hyperparameter search, but rather used the values recommended by devlin-etal-2019-bert. For all datasets, we used a learning rate of 3e-5, with a batch size of 4 for TDM and SciERC and a batch size of 16 for the NCBI dataset. We trained all our models for up to 25 epochs and, upon completion, selected the checkpoints with the best F1 score on the validation set.
Inspired by research on semi- and self-supervised learning (tarvainen2018mean; he2020momentum), we maintain an exponential moving average (EMA) of all parameters during learning. We find that this simple technique yields additional performance at almost no cost. More formally, after each gradient step, the averaged parameters are updated as $\theta'_k = \alpha \theta'_{k-1} + (1 - \alpha) \theta_k$, where $\theta_k$ and $\theta'_k$ represent the parameters of the model and of the moving average at step $k$, respectively. Furthermore, $\alpha$ is a hyperparameter for the EMA update that is usually set to a value close to 1.0; in our experiments, we used $\alpha = 0.99$.
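The EMA update is a few lines of PyTorch. This is a minimal sketch under the standard formulation (theta' <- alpha * theta' + (1 - alpha) * theta), not the authors' implementation; the helper name is ours.

```python
import copy
import torch

ALPHA = 0.99  # EMA decay used in the paper

@torch.no_grad()
def ema_update(ema_model, model, alpha=ALPHA):
    """Apply theta'_k = alpha * theta'_{k-1} + (1 - alpha) * theta_k
    to every parameter, called after each gradient step."""
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.mul_(alpha).add_(p, alpha=1 - alpha)

# Usage sketch: keep a separate copy of the model and update it each step.
model = torch.nn.Linear(4, 2)
ema_model = copy.deepcopy(model)
# ... inside the training loop, after optimizer.step():
ema_update(ema_model, model)
```

At evaluation time, the EMA copy (not the live model) is the one whose checkpoint is scored, which is what makes the technique essentially free: it adds one parameter-sized buffer and one cheap update per step.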
All of our models were trained with a single V100 GPU.
We evaluate the models on the exact match between gold and predicted entities. Furthermore, in line with previous research, we exclude non-entity spans. We therefore report precision, recall, and F-score as metrics, as implemented by the seqeval library (seqeval). For SciERC, we report both micro and macro averages; however, we report only the micro average for TDM and NCBI-Disease for comparison with previous work.
3.4 Main results
The main results of our experimentation are presented in Table 2. It shows that our best models are able to outperform the state of the art on SciERC and TDM and obtain a competitive result on the NCBI dataset.
On SciERC, our model surpasses SciBERT (beltagy-etal-2019-scibert) by a significant margin (+2 F1-macro). Our method also outperforms span-based approaches such as PURE (zhong-chen-2021-frustratingly) and SpERT (Eberts2020SpanbasedJE) while being much simpler. More specifically, we achieve 70.91 F1-micro, whereas SpERT, the current state of the art, achieves 70.33.
Our model also exceeds the state of the art on the TDM dataset by 5 points on F1-micro without using any data augmentation technique.
Finally, our approach also showed competitive results on the NCBI dataset, while not exceeding the state of the art. Our model performs roughly on par with the original SciBERT NER model. However, the current state of the art, the BioBERT NER model, outperforms our model by more than one point on F1-micro.
3.5 Bi-LSTM vs Transformer
Table 2 also reports our experiments using a Bi-LSTM as the word-level layer. We can see that the transformer is more effective than the Bi-LSTM layer on all datasets. However, the additional gain of the transformer may be due to the fact that it contains more parameters.
3.6 Ablation studies
Here, we undertake an ablation study to assess the contribution of the different components of our proposed model. In detail, we examine the addition of the word-level layer and the effectiveness of the exponential moving average of the model parameters. The reported results are the micro-F1 averaged across three different random seeds.
Word level layer
In this study, we examine the effectiveness of our word-level interaction approach by comparing it to subword-level interaction with the same number of parameters. More concretely, the subword-level interaction is obtained by simply adding another transformer layer on top of the SciBERT representation. The following table shows the comparison between the two approaches:
[Table 3: word-level vs. subword-level interaction, micro-F1]
As shown in Table 3, encoding the information at the word level is indeed beneficial. The results of this experiment demonstrate that the additional performance provided by our approach is not only due to the larger number of parameters, but to the fact that it produces better word-level representations for predicting NER tags. Intuitively, our approach forces each first subtoken to encode the information of the entire word through the self-attention mechanism, and the additional transformer layer then models the interactions between words.
Exponential Moving Average
We also investigate the effectiveness of keeping an Exponential Moving Average (EMA) of the parameters during training, i.e., updating a moving average of the model parameters after each gradient step.
[Table 4: base parameters vs. EMA parameters, micro-F1]
Table 4 shows that keeping an exponential moving average of the parameters during training can provide an additional performance gain almost for free. The EMA version outperforms the baseline on all datasets on which we trained our model. In fact, keeping a moving average of the parameters can be seen as a soft ensembling of all previous model checkpoints. Furthermore, keeping an EMA of the model parameters could be employed for any deep learning task and is not limited to Named Entity Recognition. It would be interesting to investigate its effectiveness across a wider range of NLP tasks, but this is outside the scope of our study, so we leave it for future work.
4 Related Work
Early works in NER relied on handcrafted features and CRF-like models. The advent of deep learning allowed for end-to-end model training. In the early days of deep learning, the majority of NER models used an LSTM-based architecture with word- and character-level encoding and a CRF layer for decoding (lample-etal-2016-neural; huang2015bidirectional). More recently, the arrival of BERT has dramatically transformed the field of NLP. For NER, BERT's paper proposed classifying the hidden representation of the first subtoken of each word for sequence labeling. This approach has since been adapted to different domains, such as the scientific domain, through the works of beltagy-etal-2019-scibert and Lee_2019, among others. For example, beltagy-etal-2019-scibert use SciBERT as an encoder, project the first subtoken into the label space, and then employ a CRF layer for decoding. Some work has employed a span-based approach as opposed to traditional sequence labeling (Eberts2020SpanbasedJE; zhong-chen-2021-frustratingly). In the span-based approach, each span in a sequence up to a maximum span length is enumerated, aggregated, and then classified. These methods are particularly useful for modeling joint entity and relation extraction (Wadden2019EntityRA) and nested named entity recognition.
In this paper, we proposed a new architecture for scientific named entity recognition. Our model outperforms the state of the art on two scientific NER benchmarks, namely SciERC and TDM, and achieves competitive performance on the NCBI-Disease dataset. The advantage of our model is that it is simple and does not require additional data or external resources such as gazetteers to achieve competitive results. Our empirical results also show that maintaining an exponential moving average of the model parameters during learning can provide an additional performance gain at a negligible computational cost.