
End-to-End Entity Detection with Proposer and Regressor

by   Xueru Wen, et al.
Jilin University

Named entity recognition is a traditional task in natural language processing. In particular, nested entity recognition has received extensive attention because nesting is widespread in real text. The latest research migrates the well-established set-prediction paradigm from object detection to cope with entity nesting. However, these approaches are limited by manually created query vectors, which cannot adapt to the rich semantic information in the context. This paper presents an end-to-end entity detection approach with a proposer and a regressor to tackle these issues. First, the proposer utilizes a feature pyramid network to generate high-quality entity proposals. Then, the regressor refines the proposals into the final predictions. The model adopts an encoder-only architecture and thus gains the advantages of semantically rich queries, precise entity localization, and ease of training. Moreover, we introduce novel spatially modulated attention and progressive refinement for further improvement. Extensive experiments demonstrate that our model achieves advanced performance on flat and nested NER, with a new state-of-the-art F1 score of 80.74 on the GENIA dataset and 72.38 on the WeiboNER dataset.



1 Introduction

Named entity recognition, which identifies text spans of specific entity categories, is a fundamental task in natural language processing. It plays a crucial role in many downstream tasks such as relation extraction [fu-etal-2019-graphrel], information retrieval [10.1145/3331184.3331333], and entity linking [chen2020improving]. Models based on sequence labeling [liu2022tfm, yan2021named] have achieved great success in this task. Mature and efficient as these models are, they fail to handle nested entities, a non-negligible scenario in real-world language. Some recent studies [shen-etal-2021-locate] have noted the formal similarity between object detection and NER. Figure 1 shows instances where entities overlap with each other just as detection boxes intersect.

Figure 1: Examples for object detection and named entity recognition under flat and nested circumstances. Examples are obtained from GENIA [10.5555/1289189.1289260] and COCO2017 [10.1007/978-3-319-10602-1_48].

A few previous works have designed proprietary structures to deal with nested entities, such as the constituency graph [finkel-manning-2009-nested] and the hypergraph [HUANG2021200]. Other works [alex-etal-2007-recognising, fisher-vlachos-2019-merge] capture entities through layered models containing multiple recognition layers. Despite their success, these approaches inevitably require sophisticated transformations and costly decoding, introducing extra errors compared with end-to-end methods.

Seq2Seq methods [ju-etal-2018-neural] can address various NER subtasks in a unified form. However, these methods have difficulty defining the order of the outputs due to the natural conflict between sets and sequences, which limits their performance. Span-based approaches [xu-etal-2017-local, sohrab-miwa-2018-deep], which identify entities by enumerating all candidate spans in a sentence and classifying them, have also received much attention. Although enumeration can be theoretically exhaustive, its high computational complexity burdens these methods. Moreover, these methods mainly focus on learning span representations without the supervision of entity boundaries [Tan2020BoundaryEN]. Further, enumerating all subsequences of a sentence generates many negative samples, which reduces the recall rate. Some recent work, including set prediction networks, has attempted to address these defects.

The latest works [shen-etal-2022-piqn] treat information extraction as a reading comprehension task, extracting entities and relations through manually constructed queries. The set prediction network [DianboSui2020JointEA] has been introduced to entity and relation extraction. Because these techniques accommodate the unordered character of the prediction target, they achieve great success. However, most still confront problems caused by query vectors: random initialization leaves the queries without sufficient semantic information and makes proper attention patterns difficult to learn.

This paper presents an end-to-end entity detection network that predicts all entities in a single pass and is therefore unaffected by prediction order. The proposed model transforms the NER task into a set prediction problem. First, we utilize a feature pyramid network to build the proposer, which generates high-quality entity proposals with semantically rich query vectors, high-overlap spans, and category logits. High-quality proposals significantly ease training and accelerate convergence. Then, the encoder-only regressor, constructed as an iterative transformer, performs a regression procedure on the entity proposals. In contrast to span-based methods that discard partially matched proposals, the regressor adjusts such proposals to improve model performance. The prediction head computes probability distributions for each entity proposal to identify the entities. In the training phase, we dynamically assign prediction targets to each proposal.

Moreover, we introduce a novel spatially modulated attention mechanism. It guides the model to learn more reasonable attention patterns and enhances the sparsity of the attention map by making full use of spatial prior knowledge, which improves performance. We also correct the entity proposals at every layer of the regressor network, a strategy we call progressive refinement. It increases the precision of the model and facilitates gradient backpropagation.

Our contribution can be summarized as follows:

  • We design a proposer built on a feature pyramid to incorporate multi-scale features and initialize high-quality proposals with high-overlap spans and strongly correlated queries. Compared with previous works that randomly initialize query vectors, the proposer network significantly accelerates convergence by reducing training difficulty.

  • We deploy an encoder-only framework in the regressor, which avoids handcrafted construction of query vectors and the difficulty of learning appropriate query representations, thereby notably expediting convergence and improving performance. An iterative refinement strategy is further applied in the regressor to improve precision and promote gradient backpropagation.

  • We introduce a novel spatially modulated attention mechanism that helps the model learn proper attention patterns. It dramatically improves performance by integrating spatial prior knowledge to increase attention sparsity.

2 Related work

2.1 Named Entity Recognition

Since traditional NER methods with sequence labeling [liu2022ltp, wang2017named] have been well studied, many works [AlejandroMetkeJimenez2016ConceptIA] have been devoted to extending sequence tagging methods to nested NER. One of the most explored methods is the layered approach [wang-etal-2020-pyramid]. Other works deploy proprietary structures to handle the nested entities, such as the hypergraph [HUANG2021200]. Although these methods have achieved advanced performance, they are still not flexible enough due to the need for manually designed labeling schemes.

The Seq2Seq approach [yan-etal-2021-unified-generative] unifies different forms of the nested entity problem as sequence generation. This strategy avoids complicated annotation schemes while achieving considerable performance improvement. However, sensitivity to the order of the output sequence poses a barrier to further improving models of this kind.

Span-based approaches [ChuanqiTan2020BoundaryEN, LI202126], which classify candidate spans to identify entities, also draw broad interest. One of the most noteworthy approaches [shen-etal-2021-locate] proposes a two-stage identifier that fully exploits partially matched entity proposals and introduces a regression procedure. It reveals the potential of migrating advanced methods from computer vision to formally symmetric tasks in natural language processing. These methods, however, confront high computational complexity due to the enumeration of subsequences.

2.2 Set Prediction

Several recent pieces of research [ijcai2021-542] have deployed set prediction networks in information extraction tasks and proved their effectiveness. These works can be seen as variants of DETR [10.1007/978-3-030-58452-8_13], which was proposed for object detection and uses a transformer decoder to update manually created query vectors that generate detection boxes and corresponding categories.

Models based on set prediction networks, especially DETR, have been extensively studied. Slow convergence due to the random initialization of the object queries is the fundamental obstacle of DETR. A two-stage model [ZhiqingSun2020RethinkingTS] with a feature pyramid [TsungYiLin2016FeaturePN] was proposed to generate high-quality queries and introduce multi-scale features; that work also questions the necessity of cross-attention and suggests that an encoder-only network can achieve equally satisfactory results. Spatially modulated co-attention [9709993] integrates spatial prior knowledge, increasing the sparsity of attention to accelerate training. The thought-provoking deformable attention [XizhouZhu2021DeformableDD] shows that models can learn the spatial structure of attention, and further improves performance by iteratively refining the detection boxes.

3 Method

In this section, we detail our method. The general framework of our model is shown in Figure 2 and consists of the following parts:

Figure 2: Architecture of our proposed model for end-to-end entity detection.
  • Sentence Encoder We utilize hybrid embedding to encode the sentence. The generated embeddings are then fused by the BiGRU [cho2014learning] to produce the final multi-granularity representation of the sentence.

  • Proposer We build up the feature pyramid through the stack of BiGRU and CNN [kim-2014-convolutional] to constitute the proposer network. The proposer exploits the multi-scale features to initialize the entity proposals.

  • Regressor We design the regressor that refines the proposals progressively to locate and classify spans more accurately. The regressor is built by stacking update layers constructed by the spatially modulated attention mechanism.

  • Prediction Head The prediction head outputs the span location probability distribution based on the refined proposals. This distribution is combined with the probabilities derived from the category logits to compute a joint probability distribution, from which we obtain the final predictions.

3.1 Sentence Encoder

The goal of this component is to transform the original sentence into dense hidden representations. Given the input sentence, we represent each token with the concatenation of multi-granularity embeddings as follows:


The character-level embedding is generated by fusing each character's embedding through a recurrent neural network and average-pooling the outputs as follows:

where the average is taken over the characters constituting the token. The character-level embedding helps the model cope with out-of-vocabulary words.

The contextualized representation is generated by the pre-trained language model BERT [devlin2019bert]. We follow [wang-etal-2020-hit] and obtain it by encoding the sentence together with its surrounding tokens. BERT splits tokens into subtokens by WordPiece partitioning [wu2016google]. The subtoken representations are average-pooled to create the contextualized embedding as follows:

where the average is over the subtokens forming the token. The pre-trained model aids the generation of more contextually relevant text representations.

For the word-level embedding, we exploit pre-trained word vectors such as GloVe [pennington-etal-2014-glove]. To introduce the semantics of part-of-speech, we also embed each token's POS tag.

The multi-granularity embeddings are then fed into the BiGRU network to produce the hybrid embedding for the final representation of the sentence as follows:


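As a concrete illustration of Section 3.1, the sketch below builds the concatenated multi-granularity representation for a toy sentence. The dimensions and the random stand-in lookups are assumptions for the sketch; in the paper, the four pieces come from a char-BiRNN with pooling, pre-trained word vectors, pooled BERT subtokens, and a POS-tag embedding, and a BiGRU then fuses the concatenation.

```python
import numpy as np

tokens = ["IL-2", "gene", "expression"]        # toy sentence (illustrative)
d_char, d_word, d_bert, d_pos = 8, 16, 32, 4   # assumed toy dimensions

rng = np.random.default_rng(0)
def embed(dim):
    # Stand-in for the real lookups: char-RNN pooling, GloVe, BERT subtoken
    # pooling, and a POS-tag embedding table.
    return rng.normal(size=(len(tokens), dim))

# Concatenate the multi-granularity embeddings per token; a BiGRU would then
# fuse this into the final sentence representation.
x = np.concatenate(
    [embed(d_char), embed(d_word), embed(d_bert), embed(d_pos)], axis=-1)
```

The fused vector per token simply stacks the four granularities, so its width is the sum of the component dimensions.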
3.2 Proposer

We tailor the pyramid network [wang-etal-2020-pyramid] to build our proposer, which integrates multi-scale features and reasonably initializes the proposals. Figure 3 illustrates the structure of the feature pyramid. The pyramid is constructed in both a bottom-up and a top-down manner; the bidirectional construction allows better message passing between layers. We selectively merge the features at different layers to yield initial proposals, in a way similar to the attention mechanism.

Figure 3: Data flow of feature pyramid and detailed structure of blocks.

3.2.1 Forward Block

The feature pyramid is first built from the bottom up. It consists of stacked layers, each with two main components: a BiGRU and a CNN. At each layer, the BiGRU models the interconnections of spans of the same size, and the CNN aggregates neighboring hidden states, which are then passed to the higher layer. Each feature vector therefore represents a span of original tokens and can be calculated as:


One may note that the pyramid structure provides an inherent inductive bias: the higher the layer, the shorter its input sequence, so higher-level feature vectors represent long entities and lower-level ones represent short entities. Moreover, since the input scales of the layers differ, we apply Layer Normalization [JimmyBa2016LayerN] before feeding the hidden states into the BiGRU.

As described above, the forward block can be formalized as follows:



where the activation function is GELU [hendrycks2016gaussian]. In particular, the initial input to the pyramid network is exactly the output of the sentence encoder.
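The span geometry of the forward pass can be made explicit with a small helper. Assuming, for this sketch, kernel-size-2 convolutions (the paper does not state the kernel size here), a sentence of n tokens yields n - l feature vectors at layer l, each covering a span of l + 1 consecutive tokens:

```python
# For a sentence of n tokens and a stack of kernel-size-2 CNNs (an assumed
# kernel size for this sketch), layer l of the pyramid holds n - l feature
# vectors, each representing a span of l + 1 consecutive tokens.
def pyramid_spans(n_tokens, n_layers):
    return [(n_tokens - l, l + 1) for l in range(n_layers)]
```

This is the inductive bias noted above: higher layers hold fewer, longer spans.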

3.2.2 Backward Block

The backward block allows layers to receive feedback from their higher-level neighbors, which helps the model learn better representations at each layer by enhancing communication between neighboring tokens. In the bottom-up propagation, the sequence shortens each time it passes through a CNN; thus corresponding transposed CNNs are required to reconstruct the representations of the text spans.

Specifically, in each inverse layer, we first feed the hidden states transmitted down from the layer above into a BiGRU to capture the spans' interactions. We then concatenate the outputs with the outcomes of the corresponding forward layer. The concatenated representations are linearly transformed and summed with the pooled residuals to obtain the layer's outcomes, which are passed through the transposed CNN and sent to the next layer. As before, Layer Normalization is deployed because of the diversity of input scales.

The overall structure can be formalized as follows:


where the activation function is again GELU and the concatenation operation merges the features. The residuals are aligned in length with the feature representations of the current layer using average pooling. This residual connection is designed to avoid model degradation as the number of feature pyramid layers increases.

3.2.3 Proposal Fuse

After building the feature pyramid, we can initialize the entity proposals. In contrast to works [DianboSui2020JointEA, ijcai2021-542] that manually create the query vectors, our work deploys an encoder-only architecture in which query vectors also serve as semantic feature vectors.

Note that an entity proposal in this paper includes not only a query vector but also category logits and a span location; the set of categories includes one additional entry for the None type. The span location indicates the start and end positions of the entity. The proposals are initialized with the help of the prepared feature pyramid.

Each feature vector in the pyramid corresponds to a specific text span. Our model computes a score vector for each feature and uses the scores to calculate weight vectors for each proposal as follows:


where each proposal corresponds to a token, and the weights are computed over the positions of features whose text span contains that token. The aggregation coefficients at the start and end positions are calculated separately, which enhances the expressiveness of the aggregation process and widens the range of the initial span.

With the weight vectors, the span location can be initialized as follows:


where the element-wise product aggregates the span locations corresponding to the features. One may notice that the position of the initial span is bounded by the leftmost and rightmost spans containing the token. This limitation helps the model give more reasonable proposals.

For the query vectors, we directly employ the features at the bottom layer, since they result from the whole propagation and therefore contain the most semantic information in the pyramid. Once the query vector is obtained, the category logits can be naturally derived using a multi-layer perceptron:


The category logits will be further utilized and corrected in the regressor network.
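The span-initialization step of Section 3.2.3 can be sketched numerically: candidate spans containing a token are averaged with separately softmaxed start and end weights, so the result stays within the leftmost and rightmost candidate boundaries. The spans and scores below are made-up toy values.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # numerically stable softmax
    e = np.exp(z)
    return e / e.sum()

# Candidate spans from the pyramid that contain a given token, with separate
# (hypothetical) scores for the start and end aggregation.
spans = np.array([[2.0, 2.0], [2.0, 3.0], [1.0, 3.0]])   # (start, end) pairs
start_scores = np.array([0.5, 1.0, 0.2])
end_scores = np.array([0.1, 2.0, 0.3])

# Start and end positions are aggregated with separately computed weights,
# so the initial span stays inside the candidate bounds.
start = softmax(start_scores) @ spans[:, 0]
end = softmax(end_scores) @ spans[:, 1]
```

Because the weights are a convex combination, the initialized start (end) position can never leave the range spanned by the candidate starts (ends), which is exactly the boundedness property noted above.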

3.3 Regressor

Previous works [ZhiqingSun2020RethinkingTS, ZihangJiang2020ConvBERTIB] have proven the importance of attention-map sparsity for accelerating training. Based on their conclusions, we introduce spatially modulated attention in the regressor. We employ an encoder-only architecture in which semantic feature vectors directly play the role of queries instead of being created manually. Inspired by successful practice with auxiliary losses [10.1007/978-3-030-58452-8_13, RamiAlRfou2018CharacterLevelLM], we further introduce progressive refinement into the regressor. The overall architecture of the regressor is shown in Figure 4.

Figure 4: General view of regressor and detailed structure of spatially modulated attention.

3.3.1 Category Embedding

Each proposal is input alongside its category logits, which carry the semantic information of the proposal's category. We follow the way the Transformer embeds positional information [2017Attention], using an element-wise sum to integrate the query vector with the category information:


where the category weight vector of each proposal multiplies the type embedding matrix, and the result is added to the query vector to produce the category-embedded query vectors.

In an encoder-only design, query vectors are also semantic feature vectors, making it difficult for the model to discriminate the role played by each query vector. We introduce category embeddings based on this understanding to clarify the function of each query vector during forward propagation.
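A minimal sketch of this category embedding, under the reading that the category weights are the softmax of the proposal's logits (an assumption; the paper does not spell out the weighting here): the weights select a soft mixture of type-embedding rows, which is added element-wise to the query.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

n_types, d = 4, 6                          # toy sizes (assumed)
rng = np.random.default_rng(2)
E_type = rng.normal(size=(n_types, d))     # type embedding matrix
query = rng.normal(size=d)                 # one proposal's query vector
logits = np.array([2.0, 0.1, -1.0, 0.5])   # its category logits

# Softmax the logits into category weights, take the weighted sum of the
# type-embedding rows, and add it element-wise to the query vector.
q_embedded = query + softmax(logits) @ E_type
```

The addition keeps the query's dimensionality unchanged, mirroring how positional encodings are injected in the Transformer.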

3.3.2 Spatially Modulated Attention

The central idea of the spatially modulated attention mechanism is to use spatial guidance from the span location of each entity proposal to learn reasonable, sparse attention maps.

Dynamic spatial weight maps

A Gaussian-like distribution is generated for each proposal in each attention head. The Gaussian-like distribution in this paper is given as:

G(x) = exp( -(x - mu)^T P (x - mu) / 2 )

where x denotes the independent variable (a position on the attention map), the expectation vector mu represents the center of the distribution, and the precision matrix P (namely the inverse of the covariance matrix) describes the shape of the distribution.

The main difference between the Gaussian-like distribution and the multidimensional Gaussian distribution is that we remove the probability normalization term, for two reasons. First, the normalization coefficient of the multidimensional Gaussian is derived in the continuous case, whereas the attention map here is discrete. Second, the subsequent Softmax operation performs the normalization anyway.

Sets of parameters are needed to generate the distributions; we produce them as follows:


where a head-specific linear transformation of the category-embedded query from Formula 11 (which also serves as the query in QKV attention [2017Attention]) predicts the offset between the proposal's span location and the center of the map. From Formula 13, the resulting precision matrix is symmetric positive definite, satisfying the requirement for the precision matrix of a Gaussian distribution.

It can be observed that the model generates the spatial weight map dynamically, implying that we expect the model to learn the way to make better use of the spatial prior.

Spatially modulated attention

The spatially modulated attention mechanism aims to ease the difficulty of learning proper attention patterns by enhancing the sparsity of the attention map with the spatial weight map. With the dynamically generated spatial prior, spatially modulated attention is conducted as follows:


where the computation is performed independently in each attention head. The spatially modulated attention mechanism performs an element-wise sum of the spatial map and the dot-product attention scores and then applies Softmax normalization. With the spatial map, each query vector assigns relatively high weight to the query vectors whose corresponding span locations are closer. This limits the search space of attention patterns and thus accelerates convergence.
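The mechanism above can be sketched end to end: a Gaussian-like log-weight (no normalizer) is computed per proposal over the key positions, added to the dot-product scores, and softmaxed. The fixed identity precision matrix and zero attention logits are simplifying assumptions for the sketch; in the model both are produced dynamically.

```python
import numpy as np

def spatial_prior(centers, precision, positions):
    # Gaussian-like log-weight without the normalizer:
    # -0.5 * (p - mu)^T P (p - mu) for every (proposal, key-position) pair.
    diff = positions[None, :, :] - centers[:, None, :]        # (Q, K, 2)
    return -0.5 * np.einsum('qki,ij,qkj->qk', diff, precision, diff)

Q, K = 2, 5                                                   # proposals, keys
centers = np.array([[1.0, 2.0], [3.0, 4.0]])                  # span centers
precision = np.eye(2)                                         # assumed fixed
positions = np.stack([np.arange(K, dtype=float)] * 2, -1)     # key (start, end)

logits = np.zeros((Q, K))                                     # dot-product scores
biased = logits + spatial_prior(centers, precision, positions)
biased -= biased.max(-1, keepdims=True)                       # stable softmax
attn = np.exp(biased) / np.exp(biased).sum(-1, keepdims=True)
```

Keys whose positions lie near a proposal's span center receive exponentially more weight than distant ones, which is exactly the sparsity the spatial prior is meant to induce.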

Multi-head iteration

We aggregate the results of each attention head to produce new query vectors and span locations. This process is similar to the way the proposer fuses the proposals. First, we compute the weight vectors for each head:


where the score of each head in each proposal is computed from that head's aggregation result, and Softmax normalization over the heads yields the weight vectors.

Using the weight vectors, we conduct the iteration of span location. Similarly, we perform aggregation for the start and end positions separately. The procedure can be formulated as follows:


The spatially modulated attention implements iteration of span location based on the weighted average of each head’s center. Fusing iterations on span location and multi-head attention mechanisms increases the flexibility of span updates and benefits the accuracy of model predictions.

As for the query vectors, we simply follow the customary way of aggregating the outputs of the heads as follows:


3.3.3 Gated Update

We replace the feed-forward module of the Transformer with the gate mechanism of the GRU [cho-etal-2014-learning] to strengthen the expressive ability of the model. The gate mechanism can be formulated as follows:


where the gate combines the current input with the hidden state to output the final new state.

Denoting the above mechanism as a gate function, the new query vectors are obtained as follows:


where the outputs of the spatially modulated attention in Formula 17 serve as the input and the category-embedded query representations from Formula 11 serve as the hidden state. In each iteration, the attention outcome thus updates the query representations, with the original state representations acting as hidden states.
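The gated update can be sketched with a standard GRU cell, with the attention output as the input and the previous query as the hidden state. The dimensions and weight initializations are toy assumptions; the real model learns separate matrices per layer.

```python
import numpy as np

def gru_gate(x, h, Wz, Wr, Wn):
    # GRU-style gate used in place of the Transformer feed-forward module:
    # x is the attention output, h is the previous query representation.
    sig = lambda v: 1.0 / (1.0 + np.exp(-v))
    xh = np.concatenate([x, h], axis=-1)
    z = sig(xh @ Wz)                                  # update gate
    r = sig(xh @ Wr)                                  # reset gate
    n = np.tanh(np.concatenate([x, r * h], -1) @ Wn)  # candidate state
    return (1 - z) * n + z * h                        # new query state

d = 4
rng = np.random.default_rng(1)
x, h = rng.normal(size=d), rng.normal(size=d)
Wz, Wr, Wn = (rng.normal(size=(2 * d, d)) * 0.1 for _ in range(3))
out = gru_gate(x, h, Wz, Wr, Wn)
```

The update gate z interpolates between keeping the old query state and adopting the candidate state, which is what lets the regressor refine queries smoothly across iterations.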

3.3.4 Logarithms Iteration

After updating the query vectors and span locations, we further iterate the category logits. As with the span locations, the module predicts the difference of the logits rather than directly outputting the result. The updating procedure can be formulated as follows:


We undertake the logits iteration on the assumption that each entity proposal may change its responsibility during the iterations.

3.4 Prediction Head

The prediction head produces predictions based on the joint probability distribution indicated by the entity proposals. First, we use Softmax normalization to convert the category logits into a probability distribution:


As for the span location distribution, it is natural to extend the mechanism of spatially modulated attention into a joint pointer network, which resembles a fusion of the smooth boundary [zhu-li-2022-boundary] and the biaffine decoder [yu-etal-2020-named]. Specifically, we calculate the joint probability distribution of the span location in the following way:


where the spatial map is created in the same way as in spatially modulated attention, and the score derived from the joint pointer network is essentially the sum of two pointer networks.

3.4.1 Train

In the training phase, determining the best match for each proposal is a one-to-many linear assignment problem. We formulate the search procedure as follows:



The target set consists of the entities existing in the sentence, padded with a None triple; the prediction set consists of the proposals output by the model, each comprising a span location, category logits, and a query. The matching cost is calculated in the following way:


The assignment mappings in Formula 23 must satisfy certain constraints depending on the number of proposals. When there are at least as many proposals as targets, the mapping has to be surjective, meaning every candidate entity is assigned to a proposal. Otherwise, the mapping must be injective, so every proposal is assigned to an entity. Given the constraints and optimization objectives, we use Algorithm 1 to solve the assignment problem.

Algorithm 1 Assignment: build the cost matrix of proposal-target matching costs, solve the assignment with the Hungarian algorithm, and then greedily assign each remaining proposal to the target (including the padding) with the minimal matching cost.

In the algorithm, the cost matrix records the matching cost of each proposal-entity pair. HUG denotes the Hungarian algorithm [HaroldWKuhn1955TheHM], which solves the assignment problem given the cost matrix. When there are more proposals than entities, we use a greedy strategy: each proposal left unassigned by the Hungarian algorithm takes as its prediction target the entity (including padding) with the minimal matching cost.
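A toy sketch of this one-to-many assignment, assuming more proposals than entities: a simple greedy pass stands in for the exact Hungarian step (in practice something like `scipy.optimize.linear_sum_assignment` would do that step), and every leftover proposal then takes its cheapest target, including the None padding column.

```python
import numpy as np

def assign(cost):
    # cost: (P proposals) x (E targets); column E-1 is the None padding.
    # Assumes P >= E - 1. Greedy per-entity matching approximates the
    # Hungarian step of Algorithm 1 for this sketch.
    P, E = cost.shape
    assigned = {}
    for e in range(E - 1):                       # real entities first
        free = [p for p in range(P) if p not in assigned]
        best = min(free, key=lambda p: cost[p, e])
        assigned[best] = e
    for p in range(P):                           # leftover proposals take
        assigned.setdefault(p, int(np.argmin(cost[p])))  # their cheapest target
    return assigned

cost = np.array([[1.0, 9.0, 5.0],
                 [8.0, 2.0, 5.0],
                 [7.0, 6.0, 0.5],
                 [9.0, 9.0, 0.1]])   # 4 proposals, 2 entities + padding column
m = assign(cost)
```

Every proposal ends up with a target, and every real entity receives at least one proposal, matching the surjectivity constraint for the P >= E case.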

With the optimal pairing, the final bipartite loss function is simply the matching cost:


3.4.2 Predict

In the prediction phase, we use the probability distributions output by the model to produce the final results through Algorithm 2.

Algorithm 2 Prediction: for each proposal, compare the product of its maximum span-location probability and maximum category probability with its probability of being None, and output the corresponding entity when the former is larger.

We examine each proposal to see whether it is more likely to represent an entity than to be padding. In particular, we take the probability that a proposal represents an entity to be the product of its corresponding category probability and its localization probability, and the probability that it is padding to be the probability that its category is None. A proposal is therefore considered to signify an entity if and only if the maximum of its span location probability times the maximum of its category probability exceeds its probability of being None.
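The decision rule reduces to a one-line comparison; the helper and its arguments below are illustrative names, not the paper's notation.

```python
def is_entity(p_span_max, p_cat_types, p_none):
    # p_span_max: max joint span-location probability of the proposal
    # p_cat_types: probabilities of the real entity types (None excluded)
    # p_none: probability that the proposal's category is None
    # Keep the proposal iff P(best span) * P(best type) > P(None).
    return p_span_max * max(p_cat_types) > p_none
```

For instance, a proposal with span probability 0.9 and best-type probability 0.7 beats a None probability of 0.1 and is emitted, while one with 0.5 * 0.2 against a None probability of 0.7 is discarded.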

4 Experiments

This section describes the datasets, baselines, and settings used in our experiments. We perform a series of experiments to illustrate the characteristics of our model. Table 1 displays the statistics of the datasets.

Info. GENIA CoNLL03 WeiboNER
Train Test Dev Train Test Dev Train Test Dev
#S 15203 1669 1854 14041 3250 3453 1350 270 270
#AS 25.4 24.6 26.0 14.5 15.8 13.4 33.6 33.2 33.8
#NE 46142 4367 5506 23499 5942 5648 1834 371 405
#AE 1.94 2.14 2.08 1.43 1.43 1.42 1.24 1.19 1.20
#ME 21 14 25 20 31 20 17 13 12
Table 1: Statistical information on the datasets. #S: the number of sentences. #AS: the average length of sentences. #NE: the number of entities. #AE: the average length of entities. #ME: the maximum number of entities in a sentence.

4.1 Datasets

Experiments were carried out using public datasets including GENIA [10.5555/1289189.1289260], CoNLL03 [tjong-kim-sang-de-meulder-2003-introduction], and WeiboNER [peng-dredze-2015-named].

GENIA is a biological dataset including five entity types: DNA, RNA, protein, cell line, and cell type. It contains a high proportion of nested entities. The same experimental setup as [yu-etal-2020-named] was used. We did not conduct experiments on additional authoritative nested English datasets because we lacked licensed access to them.

CoNLL03 is a large, frequently used dataset consisting mainly of news reports from the Reuters RCV1 corpus. Its entity types are locations, organizations, persons, and miscellaneous. We follow the setting of [yan-etal-2021-unified-generative], combining the training and validation sets. Experiments on this dataset demonstrate the generalizability of our approach to flat NER.

WeiboNER is a Chinese dataset drawn from the Weibo social platform. It includes named and nominal mentions of persons, organizations, locations, and geopolitical entities. The setup of our experiments on this dataset is identical to [li-etal-2020-flat]. We use this Chinese dataset to evaluate the model's cross-language ability.

4.2 Training Details

We use the pre-trained BERT models provided by the open framework Transformers [ThomasWolf2019HuggingFacesTS] in our experiments. To acquire better contextual representations of biomedical text, we substituted BioBERT [10.1093/bioinformatics/btz682] for BERT on the GENIA dataset and used BIO-word2vec [chiu-etal-2016-train] to decrease the number of out-of-vocabulary words. We use GloVe [pennington-etal-2014-glove] as pre-trained word vectors for the English datasets, and the word vectors developed by [li-etal-2018-analogical] for the Chinese dataset. The models are trained with the AdamW optimizer, and max gradient normalization is applied in all experiments. The learning rate is adjusted over the course of training under a cosine warm-up-decay learning rate schedule.

4.3 Comparison

We compare our model to many powerful state-of-the-art models for the GENIA dataset, including LocateAndLabel [shen-etal-2021-locate], BARTNER [yan-etal-2021-unified-generative], SequenceToSet [ijcai2021-542], NER-DP [yu-etal-2020-named], LUKE [yamada-etal-2020-luke], MRC [li-etal-2020-unified], BioBART []. The above baselines’ reported results are taken straight from the original published literature.

Model GENIA
Prec. Rec. F1
LocateAndLabel[shen-etal-2021-locate] 80.19 80.89 80.54
Pyramid[wang-etal-2020-pyramid] 79.45 78.94 79.19
BARTNER[yan-etal-2021-unified-generative] 78.89 79.60 79.23
SequenceToSet[ijcai2021-542] 82.31 78.66 80.44
NER-DP[yu-etal-2020-named] 81.80 79.30 80.50
BioBART[] - - 79.93
LogSumExpDecoder[wang-etal-2021-nested] 79.20 78.67 78.93
Our Model 81.74 79.76 80.74
Table 2: The performance on the GENIA dataset.

Table 2 shows the advanced results achieved by our model, which improves the F1 score over the best previous algorithm on the GENIA dataset. The GENIA dataset contains numerous nested entities, about 20% of all entities, so the improvement illustrates our model's ability to extract nested entities. Compared to the previous state-of-the-art model LocateAndLabel, we achieve this progress while avoiding the enumeration of candidate spans through the set prediction network, thus reducing computational complexity. Set prediction can be seen as a soft enumeration with the flexibility that the manual preparation of candidate spans lacks. Compared with SequenceToSet, another work that achieves excellent results, we deploy an encoder-only architecture that improves performance while bypassing the difficulty of training query vectors. The encoder-only architecture accelerates the convergence of training: the following experiments show that the model produces comparable results even when trained for only a small number of epochs.

For the CoNLL03 dataset, the LocateAndLabel [shen-etal-2021-locate], PIQN [shen-etal-2022-piqn], NER-DP [yu-etal-2020-named], LUKE [yamada-etal-2020-luke], MRC [li-etal-2020-unified], and KNN-NER [2203.17103] are compared with our model. The results are taken directly from the original published literature. Note that [yan-etal-2021-unified-generative] provides the results of LUKE, NER-DP, and MRC.

Model CoNLL03
Prec. Rec. F1
LocateAndLabel[shen-etal-2021-locate] 92.13 93.79 92.94
PIQN[shen-etal-2022-piqn] 93.29 92.46 92.87
NER-DP[yu-etal-2020-named] 92.85 92.15 92.50
LUKE[yamada-etal-2020-luke] - - 92.87
MRC[li-etal-2020-unified] 92.47 93.27 92.87
KNN-NER[2203.17103] 92.82 92.99 92.93
Our Model 92.86 93.13 93.00
Table 3: The performance on the CoNLL03 dataset.

Table 3 shows the outcomes. Since CoNLL03 is a well-established and widely investigated dataset, the improvement on it demonstrates that our model generalizes well to traditional flat datasets. Although our model introduces a certain number of invalid query vectors, the satisfactory performance on this dataset shows that the additional entity proposals do not generally impair the model's ability to recognize flat entities. Compared with the recent works PIQN and KNN-NER, our model still achieves a small improvement in F1.

As for the WeiboNER dataset, we compare our model to various strong baselines, including TFM [liu2022tfm], LocateAndLabel [shen-etal-2021-locate], KNN-NER [2203.17103], SLK-NER[DouHu2020SLKNERES], BoundaryDet [chen-kong-2021-enhancing], ChineseBERT [sun-etal-2021-chinesebert] and AESINER [YuyangNie2020ImprovingNE]. We present outcomes published in the original literature.

Model WeiboNER
Prec. Rec. F1
TFM[liu2022tfm] 71.29 67.07 71.12
LocateAndLabel[shen-etal-2021-locate] 70.11 68.12 69.16
KNN-NER[2203.17103] 75.00 69.92 72.03
SLK-NER[DouHu2020SLKNERES] 61.80 66.30 64.00
BoundaryDet [chen-kong-2021-enhancing] - - 70.14
ChineseBERT[sun-etal-2021-chinesebert] 68.75 72.97 71.26
AESINER[YuyangNie2020ImprovingNE] - - 69.78
Our Model 72.93 71.85 72.38
Table 4: The performance on the WeiboNER dataset.

Table 4 shows that our approach reaches a new state-of-the-art performance on the WeiboNER dataset, demonstrating that the advantages of our model are not limited to English datasets. Even compared with approaches such as ChineseBERT that specifically target Chinese, our model still produces a significant improvement. More importantly, since WeiboNER is a small, low-resource dataset, the success on it further demonstrates the easiness of training our model. We claim this excellent performance stems from our model's emphasis on localization, which is produced by the spatially modulated attention. Unlike the other datasets, WeiboNER is sampled from a social platform, where the occurrence of entities is relatively loosely related to the overall sentence meaning. Moreover, entity lengths tend to be shorter due to the nature of the Chinese language, as seen in Table 1. The spatially modulated attention introduces a Gaussian-like distribution that increases the sparsity of the attention map by focusing more on local information, which caters well to the characteristics of this dataset and thus achieves excellent results.
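The spatially modulated attention described above can be illustrated with a minimal sketch: a Gaussian prior centered on each query's focus position is added to the raw attention logits before the softmax, concentrating weight on nearby tokens. The function name, the log-Gaussian additive form, and the per-query center are our assumptions for illustration, not the paper's implementation:

```python
import numpy as np

def spatially_modulated_attention(scores, centers, sigma):
    """Add a Gaussian spatial prior to raw attention scores.

    scores:  (n_queries, n_keys) raw dot-product logits
    centers: (n_queries,) position each query should focus on (assumed)
    sigma:   Gaussian width; smaller values make attention more local
    """
    n_keys = scores.shape[1]
    positions = np.arange(n_keys)
    # Log of a Gaussian centered at each query's focus position.
    prior = -((positions[None, :] - centers[:, None]) ** 2) / (2.0 * sigma ** 2)
    logits = scores + prior
    # Numerically stable softmax over the key axis.
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)
```

With a small sigma, even uniform scores yield attention sharply peaked around each query's center, which is the sparsifying effect the text describes.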

4.4 Detailed Result

As indicated in Table 5, we also examine the performance on entities of various lengths. Our model produces the best results on short entities across all datasets, and the F1 score gradually decreases as entity length increases.

Len. GENIA CoNLL03 WeiboNER
Prec. Rec. F1 Prec. Rec. F1 Prec. Rec. F1
86.59 80.46 83.41 93.55 92.68 93.11 80.52 74.61 77.46
83.01 80.19 81.58 94.04 94.32 94.18 57.31 61.84 59.49
77.57 78.07 77.82 90.04 91.22 90.63 - - -
75.49 78.40 76.92 92.10 94.59 93.33 - - -
72.99 78.99 75.87 85.18 92.00 88.46 - - -
All 81.74 79.76 80.74 92.86 93.13 93.00 72.93 71.85 72.38
Table 5: Results on entities of different lengths.

This is expected, because short entities can be detected immediately by the proposer, while long entities require the regressor to perform several regression steps to identify. The sensitivity of the model's performance to entity length is a side effect of our suggested method's design. The effect of entity length on recognition performance is also directly influenced by the kernel sizes of the proposer: from Formulas 8 and 9, the kernel sizes directly determine the expected length of the entities output by the model. The expected length in our experiments is short, and thus the results in the table are quite reasonable.

4.5 Analysis and Discussion

4.5.1 Ablation Study

We examine the contribution of each module in the model. First, we delete the backward block from the feature pyramid to see how it affects communication across features of different layers. Second, we test the importance of spatial prior information by removing spatial modulation from the attention mechanism. Third, we remove the gated update to explore the importance of the gate mechanism in enhancing the model's expressive ability. Moreover, we remove the category embedding to investigate its significance in clarifying the role of each query vector. Finally, we investigate the impact of the logarithm and location iterations on the model's efficacy. The results are shown in Table 6.

Model GENIA CoNLL03 WeiboNER
Prec. Rec. F1 Prec. Rec. F1 Prec. Rec. F1
Origin 81.50 79.07 80.27 92.48 93.00 92.74 72.93 71.85 72.38
-Backward block 80.58 79.65 80.11 91.72 92.66 92.19 71.42 70.37 70.89
-Spatial modulation 81.46 78.71 80.06 92.00 92.95 92.47 71.50 69.38 70.42
-Gated Update 80.85 78.78 79.80 92.36 93.11 92.73 71.18 71.35 71.27
-Category embedding 81.39 78.58 79.96 92.20 92.79 92.49 69.64 67.40 68.50
-Locations iteration 81.62 78.67 80.12 92.02 93.18 92.60 71.92 70.86 71.39
-Logarithms iteration 81.07 79.11 80.08 92.05 92.66 92.35 73.76 70.12 71.89
Table 6: Ablation study. '-' means removing the module or substituting it with another.

The reduction in performance after removing the backward block demonstrates its use in integrating the features of different scales. In addition, since we use the bottom-level features of the feature pyramid as the query vectors used in the subsequent modules, removing the backward module reduces the depth of the network to some extent and weakens the ability of the model to express semantics.

The decrease in F1 scores after removing spatial modulation emphasizes the need for guidance from spatial prior knowledge. Spatial modulation contributes to the sparsity of the attention map and facilitates the learning of proper attention patterns. The performance drop after removing spatial modulation is far more significant on the WeiboNER dataset than on the others, which is closely related to that dataset's focus on local information.

The reduction of the F1 score after removing the gate mechanism proves it helps improve the model’s expressive ability. It actually plays the same role as the feed-forward layer in the Transformer, i.e., it increases the capacity and nonlinearity of the model.
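A minimal sketch of such a gated update follows; the weight shapes and the exact parameterization are our assumptions. A learned gate interpolates between the previous state and the new candidate, adding capacity and nonlinearity without discarding old information:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_update(h_old, h_new, W_g, b_g):
    """Interpolate between the previous state and the new candidate.

    g near 1 accepts the candidate; g near 0 keeps the old state.
    In this sketch W_g has shape (2*d, d) and b_g has shape (d,).
    """
    # The gate is computed from both the old and the new representation.
    g = sigmoid(np.concatenate([h_old, h_new], axis=-1) @ W_g + b_g)
    return g * h_new + (1.0 - g) * h_old
```

Driving the gate toward 1 recovers a plain feed-forward update, while intermediate values blend the two states, which is where the extra expressive power comes from.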

The performance deterioration after removing category embedding demonstrates the importance of explaining the role of each query vector. Because of the encoder-only architecture used in the proposed approach, the query vector also functions as a semantic vector. This architecture has the potential to introduce confusion about the roles of the query vectors, which is well mitigated by the category embedding.

The F1 value drops when the logarithm and location iterations are removed, which emphasizes the significance of iteration in the regression network. We claim they play a similar role to the auxiliary loss [10.1007/978-3-030-58452-8_13] and box refinement [ZacharyTeed2022RAFTRA]. With the auxiliary loss, the output of each layer is decoded and a loss is computed, enabling supervised learning for every layer. In box refinement, the detection box is gradually adjusted until the final prediction is made, which resembles the iterative process in this paper. Both apply gradient updates directly to each layer, while the latter makes predictions based on the combination of outputs from all layers. These strategies reduce the difficulty of model training on the one hand and improve model performance on the other.
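The auxiliary-loss idea referenced above can be sketched in a few lines: every intermediate layer's decoded prediction is supervised against the same target, so gradients flow directly into each refinement step. The mean-squared-error loss here is only a placeholder for the model's actual bipartite set loss:

```python
import numpy as np

def auxiliary_loss(per_layer_preds, target, loss_fn):
    """Sum the loss over every refinement layer's decoded prediction,
    supervising each layer directly instead of only the last one."""
    return sum(loss_fn(pred, target) for pred in per_layer_preds)

def mse(pred, target):
    """Placeholder per-layer loss; the paper uses a bipartite set loss."""
    return float(((pred - target) ** 2).mean())
```

Compared with supervising only the final layer, this shortens the gradient path to early layers, which is why it eases the training of deep iterative decoders.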

4.5.2 Impact of Kernel Sizes

We explore the effect of kernel sizes in the pyramid and report the results in Table 7. We test different numbers of pyramid layers and various combinations of kernel sizes. All models are trained for the same number of epochs, with the number of heads and layers held fixed.

Kernel Sizes GENIA CoNLL03 WeiboNER
Prec. Rec. F1 Prec. Rec. F1 Prec. Rec. F1
[] 81.35 78.60 79.95 92.21 92.95 92.58 73.35 71.35 72.34
[2] 80.88 78.44 79.69 92.34 92.88 92.61 72.44 70.12 71.26
[2,2] 81.50 79.07 80.27 92.48 93.00 92.74 72.58 68.64 70.55
[2,2,2] 81.23 78.55 79.87 91.49 92.77 92.13 72.93 71.85 72.38
[2,2,2,2] 81.24 78.67 79.94 92.24 92.63 92.43 72.19 69.87 71.01
[2,3] 80.09 79.09 80.01 92.00 92.74 92.37 72.68 71.60 72.13
[2,3,2] 81.19 78.35 79.74 91.75 92.45 92.10 71.89 71.35 71.62
Table 7: Impact of kernel sizes. [2,2] denotes a three-layer feature pyramid with two convolution kernels of size 2; the other symbols indicate structures in the same way.

The model reaches satisfactory performance with a two- or three-layer pyramid of kernel size 2 ([2,2] and [2,2,2] in Table 7). We believe this is because, under these settings, entities of different lengths are almost fully covered and the expected length of the entities output by the model is close to the average entity length in the datasets.

In addition, we observe that a shallower pyramid can outperform a deeper one. We believe the cause of this phenomenon is twofold. On the one hand, different kernel sizes lead to different expected proposal lengths, and the model performs better when the expected proposal length is close to the average entity length in the dataset. On the other hand, a deeper feature pyramid makes the network deeper, which leads to relatively more difficult training and slower convergence.

4.5.3 Influence of Regressor Layers

To investigate the influence of the number of regressor layers, we experiment with models of different depths on the GENIA dataset. All models are trained for the same number of epochs, with the kernel sizes and number of heads held fixed. Figure 5 shows the results on entities of different lengths for models with different numbers of layers. Our model reaches comparable results once the number of layers is sufficiently large; to balance time cost and model performance, we fix the number of layers accordingly in the other experiments.

Figure 5: Results for models with different numbers of layers. The highest value of each curve is marked with a red dot.

It can be directly observed that the recognition of short entities is less affected by the number of regression layers, while the recognition of longer entities is more sensitive to it. This is because the proposer can make almost accurate proposals directly for shorter entities, whereas long entities require the regressor's refinement to be identified correctly.

4.5.4 Analysis of Attention Heads

We also evaluate how the number of attention heads affects the results on the GENIA dataset. All models are trained for the same number of epochs, with the kernel sizes and number of layers held fixed. The association between the number of heads and model performance is shown in Figure 6. Our model produces the best results at a particular number of heads, which we adopt in the other experiments.

Figure 6: Results for models with different numbers of heads. The highest value of each curve is marked with a red dot.

Similarly, the longer the entity, the more its recognition performance is affected by the number of attention heads. This is because the number of attention heads directly affects the iterative process of the regressor, which is indispensable for long entities.

4.5.5 Case Study

We present a case study in Table 8 to show the ability of our model to identify entities in various cases, and we also analyze the errors demonstrated in the table.

As can be seen in the first example, our model is capable of recognizing multiple entities in long sentences, and the proposer is able to produce high-quality proposals. The iterative process of the entity proposals shows that the regressor's refinement plays a crucial role in pinpointing entities, even relatively short ones.

Several significant flaws are demonstrated in the second and third cases. The first problem is that our model has difficulty classifying entities: while it correctly locates the entities in all of its predictions, it makes errors in predicting their types. We claim this issue arises from the model's overemphasis on local information, whereas the determination of the entity type usually depends on the global context.

The second problem is that in some circumstances the model fails to identify highly overlapped nested entities of relatively short length. This is understandable because, in this scenario, there are considerably more negative samples than positive ones in the training phase. In most instances where proposals overlap substantially, only one proposal represents a genuine entity while the remaining proposals do not correspond to any entity. This is partly due to the assignment methodology of the entity proposals and the encoder-only architecture.

Sentences with Entities Predictions
Treatment of T cells with the selective PKC inhibitor GF109203X abrogates the PMA-induced IkB alpha phosphorylation/degradation irrespective of activation of Ca(2+)-dependent pathways, but not the phosphorylation and degradation of IkB alpha induced by TNF-alpha, a PKC -independent stimulus.
Costimulation with anti- CD28 MoAb greatly enhanced the proliferative response of neonatal T cells to levels equivalent to those of adult T cells , whereas adult T cells showed only slight increases.
Point mutations of either the PU.1 site or the C/EBP site that abolish the binding of the respective factors result in a significant decrease of GM-CSF receptor alpha promoter activity in myelomonocytic cells only.
Table 8: Case study. In the left column, the category of the entity is indicated by the label at the bottom right of the right square bracket and the location of the left and right boundary words is shown by the superscript of the square bracket. The iterative process of the entity proposals, as well as the relationship between them and the predicted entities, are shown in the right column.

5 Conclusion and future work

An end-to-end entity detection approach with a proposer and a regressor is presented in this study. We employ a proposer that incorporates multi-scale information through the feature pyramid to predict high-quality entity proposals, which significantly speeds up the training process and boosts performance. Our work proposes an encoder-only framework that introduces spatial prior knowledge into the attention mechanism, avoiding the unfavorable impact of randomly initialized query vectors. To increase prediction accuracy, we explore iterations over span locations and category logarithms in the joint model. We model the assignment between entity proposals and prediction targets as a Linear Assignment Problem and compute a bipartite loss during the training phase, so the model predicts all entities in a single run. Experiments on datasets of different characteristics demonstrate the nature of our model and the effectiveness of our approach. Our work reveals the potential of integrating spatial priors into NLP research, and we expect the findings will contribute to a better understanding of set prediction networks and iterative refinement. We will further explore the relevance between NLP and CV tasks in future work.
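The Linear Assignment Problem mentioned above can be illustrated with a tiny brute-force solver: it finds the one-to-one matching between proposals and gold entities that minimizes total cost. Practical implementations use the Hungarian algorithm (e.g. SciPy's `linear_sum_assignment`); this exhaustive version, which assumes at least as many proposals as gold entities, only shows the objective:

```python
from itertools import permutations

def match_proposals(cost):
    """Brute-force Linear Assignment.

    cost[p][g] is the matching cost between proposal p and gold entity g.
    Returns the list of proposal indices assigned to gold entities
    0..n_gold-1 and the minimal total cost. Assumes n_prop >= n_gold.
    """
    n_prop, n_gold = len(cost), len(cost[0])
    best, best_cost = None, float("inf")
    # Try every one-to-one assignment of gold entities to proposals.
    for perm in permutations(range(n_prop), n_gold):
        total = sum(cost[p][g] for g, p in enumerate(perm))
        if total < best_cost:
            best, best_cost = list(perm), total
    return best, best_cost
```

The bipartite loss is then computed only on the matched pairs, while unmatched proposals are trained toward the "no entity" class.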


This work was supported by the National Natural Science Foundation of China under Grant 62072211, Grant 51939003, and Grant U20A20285.

Code Availability


Conflicts of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.