Entity Typing (ET) is the process of identifying the semantic types of every entity within a corpus. In contrast to Named Entity Recognition, where each token in a sentence is labelled with zero or one class label, ET involves labelling each entity mention with one or more class labels, which are typically arranged in a hierarchy. These fine-grained class labels encapsulate more semantic information than singular labels, and allow for entities to be labelled with mutually exclusive types. The results obtained by ET are therefore highly valuable for many downstream natural language processing tasks such as information extraction, knowledge graph construction, and text mining.
Despite the widespread success of entity recognition, research into effective entity typing is still ongoing. End-to-end entity typing, whereby every token is labelled with zero or more types, is considerably more challenging than entity recognition as it is a multi-class, multi-label task. This difficulty has resulted in the majority of state-of-the-art entity typing systems assuming the segmentation step and operating only at the mention level. In other words, each entity mention in the dataset has already been identified and is labelled with one or more semantic classes.
There are, however, a number of limitations that arise from performing entity typing at the mention level rather than end-to-end. Firstly, the state-of-the-art technique for mention-level entity typing is to train on fixed windows of tokens centred around the entity mention using a three-part Bi-LSTM, in an attempt to prevent the model from training on irrelevant information. This not only makes the model highly sensitive to the size of the context window, but also means that it can never incorporate contextual information from outside the window, even when that information is important. An end-to-end model employing a bidirectional gated recurrent unit (GRU) would be capable of learning to harness this context effectively via its update and reset gates. Secondly, to perform entity typing from scratch with a mention-level system one must also train a segmentation model and combine both systems, which is slower and more complicated than training one end-to-end system. We show that an end-to-end model is capable of outperforming a mention-level model given the right architecture, whilst also simplifying the task.
Existing models are also hindered by their input representations. The input representation of each word in state-of-the-art mention-level typing models is generated using context-independent embedding models, such as GloVe. The effectiveness of state-of-the-art context-dependent embedding models, such as BERT, in entity typing has not yet been investigated, despite their proven success in many other NLP tasks such as Named Entity Recognition. In addition to explicitly learning a context representation using an end-to-end model, we therefore also investigate the effectiveness of contextualised word embeddings on mention-level entity typing.
We therefore carry out an extensive ablative study to demonstrate the effectiveness of contextualised embeddings for mention-level entity typing, and show the competitiveness of end-to-end entity typing despite it being a more challenging learning task. We accomplish this by introducing two models: a mention-level model which embeds the left, right, and mention contexts using BERT, and an end-to-end entity typing (E2EET) model that determines the type(s) of all tokens in a sentence.
We describe our two models in detail in Section 3. In Section 5 we evaluate our mention-level model and show that it outperforms state-of-the-art mention-level entity typing models. We also show that E2EET is effective on clean datasets and is capable of outperforming mention-level entity typing models despite not knowing a priori which tokens in each sentence are entities.
2 Related Work
Initial entity typing research treated the task as a multi-class, multi-label classification problem, i.e. identifying the type(s) of every entity within a document. The Fine-Grained Entity Recognition (FIGER) model follows a pipeline-based approach: it first identifies the entity mentions via a segmentation step, and then predicts a list of class labels (types) associated with each mention. The segmentation is performed using a conditional random field (CRF) trained on a variety of handcrafted features such as token length and contextual bi-grams. Label prediction is performed using a multi-layer perceptron. FIGER’s reliance on handcrafted features to perform segmentation makes it infeasible for domains and applications where these features are not readily available.
More recent research focuses on mention-level entity typing. In contrast to the two-stage pipeline of FIGER, which performs entity segmentation (i.e. entity recognition) followed by label prediction, mention-level models are trained on already-segmented data and aim to predict the type(s) of each entity mention given its context.
State-of-the-art entity typing models typically employ a three-part bidirectional Long Short-Term Memory (Bi-LSTM) model. The left, right, and mention contexts are each fed through a Bi-LSTM layer to obtain an encoded representation, which is then decoded and fed through a linear layer to obtain a set of labels for the corresponding mention. This architecture was introduced as part of the Hybrid Neural Model (HNM), which comprises two components: a recurrent-based mention model that obtains a vector representation of the entity mention given its context, and a context model that generates a single vector for both the left and right contexts of the mention. The outputs of the mention model and context model are concatenated and fed through a softmax layer to obtain a probability distribution over all possible types. The type with the highest probability is selected as the prediction.
METIC likewise passes the contexts through Bi-LSTM layers and concatenates the output. Two constraints then form the basis of an integer linear programming model, which is applied to the output of the final dense layer. The type disjointness constraint ensures that an entity is not labelled as two mutually exclusive types. The mutual exclusivity of each pair of types is determined via an external knowledge base. The type hierarchy constraint ensures that an entity is not labelled as a certain type if it is not also labelled as that type’s parent category.
In contrast to other mention-level entity typing systems, Automatic Fine-Grained Entity Typing (AFET) does not employ a Bi-LSTM. It instead introduces a novel, heuristic-based method to separate clean and noisy mentions, whilst also introducing hierarchy-based partial label embeddings to improve performance. AFET takes advantage of the noisy labels in the dataset; mentions are separated into clean and noisy sets depending on whether their ground truth labels form a single path in the category hierarchy, i.e. are not mutually exclusive. The loss function of the model differs for each training example depending on whether it is from the clean or noisy set. AFET notably relies on handcrafted features (such as POS tags and Brown clusters), unlike other systems, which limits its functionality on datasets that have not been labelled by hand.
In summary, FIGER is a pipeline-based entity typing system that is heavily reliant on feature collection. HNM, METIC and AFET are all deep learning-based mention-level entity typing systems. HNM and METIC use a three-part Bi-LSTM for representation learning, while AFET relies on handcrafted features.
3 Entity Typing Models
This paper introduces two Entity Typing models: a mention-level model, which determines the type(s) of an entity given an entity mention and its surrounding context, and an end-to-end model (E2EET), which determines the type(s) (if any) of all tokens in a sentence. We begin this section by providing an overview of the embedding layer that is common between both models, and then explain each model in detail.
The embedding layer plays a crucial role in our models, allowing for a context-dependent, deep representation of input tokens. To facilitate such a representation, we use BERT. BERT is based upon the encoder stack of the transformer model, an encoder-decoder model structure that combines feed-forward layers with a multi-headed attention mechanism. In contrast to other state-of-the-art context-dependent embedding models, such as ELMo, BERT learns deep bidirectional representations by training on an objective known as the “masked language model”. For every input sentence to the model some number of terms are masked at random, and the model learns to predict the original terms.
BERT embeddings allow our models to incorporate valuable contextual information in the embedding layer itself, rather than learning it via a recurrent layer as is common in existing entity typing systems. The embeddings generated by BERT and fed into our model are context-dependent, meaning the embedding of each word is generated with respect to its surrounding context. Polysemous words are therefore embedded according to the sense implied by their context, providing a richer representation than the context-independent embedding models that are currently used in many state-of-the-art entity typing models.
An important distinction between BERT and other embedding models is that it is trained at the “wordpiece” level (a subword scheme closely related to Byte Pair Encoding), as opposed to the word level. Each word must therefore be tokenized into wordpieces before being fed through the BERT model, and a word that is not in the vocabulary is split into several pieces. For example, “Johanson” becomes “Johan” “##son”.
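The splitting of a word into wordpieces can be illustrated with a minimal sketch of greedy longest-match-first lookup. The function name `wordpiece_tokenize` and the toy vocabulary are assumptions made for illustration; the real BERT vocabulary is far larger, and its tokenizer applies additional normalisation steps that are omitted here.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into wordpieces by greedy longest-match-first lookup."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, shrinking until a vocab hit.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary, invented for illustration only.
toy_vocab = {"Johan", "##son", "John", "##s", "the"}
print(wordpiece_tokenize("Johanson", toy_vocab))  # ['Johan', '##son']
```

Continuation pieces keep the `##` prefix so that the original word boundaries can be recovered later.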
3.1 Mention-level model
The mention-level model, as shown in Figure 1, predicts the label(s) of a given entity mention and its surrounding context. It accomplishes this using a three-part context model, consisting of the left, right, and mention context, inspired by the state-of-the-art mention-level systems discussed in Section 2. However, our model does not employ a recurrent neural network, and instead uses two feed-forward layers: one to learn an encoded representation of the combined left, right, and mention contexts, and another to map this encoded representation back to the label space.
3.1.1 Context vectors
We first build the left and right context windows by taking $w$ wordpieces to the left and right of the mention, respectively, where $w$ is a fixed context window size. For the mention context window, we take the first $w$ wordpieces of the mention. If any window is not of length $w$, it is padded with [PAD] tokens. If its length is greater than $w$, excess wordpieces are trimmed. Each of the left, right, and mention context windows is then encoded via a pre-trained BERT model to obtain three embedding matrices of size $w \times d$, where $d$ is the embedding dimension. We then take the average across each of these matrices, yielding three vectors of size $d$. These three vectors are hereafter denoted as the left, right, and mention context vectors $\mathbf{c}_l$, $\mathbf{c}_r$, and $\mathbf{c}_m$.
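The window construction and averaging described above can be sketched as follows. This is a minimal illustration under stated assumptions: `toy_encoder` is a hypothetical stand-in that returns a deterministic pseudo-random vector per wordpiece (a real system would call a pre-trained BERT model instead), and taking the wordpieces nearest the mention for the left window is an assumption about how the window is trimmed.

```python
import random

PAD = "[PAD]"


def fit_to_window(pieces, w, keep="first"):
    """Trim or pad a list of wordpieces to exactly w items."""
    if keep == "last":
        trimmed = pieces[-w:] if w > 0 else []
    else:
        trimmed = pieces[:w]
    return trimmed + [PAD] * (w - len(trimmed))


def build_windows(pieces, mention_start, mention_end, w):
    """Return the left, mention, and right windows, each of length w."""
    left = fit_to_window(pieces[:mention_start], w, keep="last")
    mention = fit_to_window(pieces[mention_start:mention_end], w)
    right = fit_to_window(pieces[mention_end:], w)
    return left, mention, right


def toy_encoder(window, d=4):
    """Hypothetical stand-in for BERT: one d-dimensional vector per wordpiece."""
    rows = []
    for piece in window:
        rng = random.Random(piece)  # seeded by the piece, so the output is repeatable
        rows.append([rng.uniform(-1, 1) for _ in range(d)])
    return rows


def mean_pool(matrix):
    """Average a w x d matrix (a list of rows) into one d-dimensional vector."""
    d = len(matrix[0])
    return [sum(row[j] for row in matrix) / len(matrix) for j in range(d)]


pieces = ["He", "met", "Johan", "##son", "in", "Perth", "."]
left, mention, right = build_windows(pieces, 2, 4, w=3)
c_l, c_m, c_r = (mean_pool(toy_encoder(win)) for win in (left, mention, right))
print(left, mention, right)
```

Note that in this sketch the [PAD] positions take part in the average, which mirrors averaging across the full embedding matrix.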
3.1.2 Attention mechanisms
The mention-level model may be augmented with one of two attention mechanisms: scalar and dynamic.
The scalar attention mechanism learns the extent to which each context (left, right, and mention) is important when predicting the labels of each mention, and each context vector is then scaled according to its relevance to the task. To do this, a scalar value $a_i$ is learned for each of the context vectors $\mathbf{c}_l$, $\mathbf{c}_r$, and $\mathbf{c}_m$. We normalise the attention weights using the softmax operation so that they sum to 1, i.e. $\alpha_i = \exp(a_i) / \sum_{j} \exp(a_j)$ (here, $i$ and $j$ range over the contexts $l$, $r$, and $m$). Each context vector is multiplied by the corresponding attention value, and the results are then concatenated to form the combined representation, as shown in Figure 1.
The three attention weights are applied to every mention regardless of the mention itself. This results in low complexity but has the downside of assuming that every mention will benefit from the same attention weights. For example, if the attention value is high for the left context and low for the mention and right contexts, and a mention has no left context (i.e. it is at the start of the sentence), the predictions for this particular mention may be adversely affected by the attention mechanism.
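A minimal sketch of the scalar attention mechanism is given below. The function names and the example values are assumptions for illustration; in the model the three scalars would be trainable parameters rather than fixed numbers.

```python
import math


def softmax(values):
    """Numerically stable softmax over a short list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def scalar_attention(c_l, c_r, c_m, scalars):
    """Scale each context vector by its normalised scalar, then concatenate."""
    weights = softmax(scalars)
    combined = []
    for weight, vec in zip(weights, (c_l, c_r, c_m)):
        combined.extend(weight * x for x in vec)
    return weights, combined


# Hypothetical learned scalars for the left, right, and mention contexts.
weights, combined = scalar_attention(
    [0.2, 0.4], [0.1, 0.3], [0.9, 0.5], scalars=[0.0, 0.0, 1.0]
)
print(weights, len(combined))
```

Because the same three weights are shared by every mention, the cost is only three extra parameters, which is the low-complexity property noted above.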
In light of this issue, we propose a dynamic attention mechanism. Rather than learning one fixed weight per context layer, the dynamic variant uses a small feed-forward network to assign weights to each context layer based upon the mention context. The input to this network is the average of the embeddings of the wordpieces in the mention context. The output is a vector of three weights corresponding to the left, right, and mention contexts, which are normalised with softmax and applied to the context vectors $\mathbf{c}_l$, $\mathbf{c}_r$, and $\mathbf{c}_m$.
The dynamic attention mechanism allows for the predictions of polysemous mentions to be more heavily influenced by their surrounding context, whilst also reducing irrelevant contextual information for non-polysemous mentions. For example, given the mention “Apple” (which could be a company or a fruit depending on the context), the normalised weights of the attention layer’s output might place most of the weight on the left and right contexts. Given another example, where the entity types may be easily inferred from the mention itself (such as “Barack Obama”), the normalised weights might place most of the weight on the mention context.
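The dynamic variant can be sketched in the same style. Here a single linear layer maps the averaged mention vector to three logits; the use of exactly one layer, the parameter values, and the function names are assumptions for illustration rather than details of the model.

```python
import math


def softmax(values):
    """Numerically stable softmax over a short list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_attention_weights(c_m, weight_rows, biases):
    """Map the mention vector to three logits (left, right, mention), then normalise."""
    logits = [
        sum(w * x for w, x in zip(row, c_m)) + b
        for row, b in zip(weight_rows, biases)
    ]
    return softmax(logits)


def apply_attention(weights, c_l, c_r, c_m):
    """Scale each context vector by its weight and concatenate the results."""
    combined = []
    for weight, vec in zip(weights, (c_l, c_r, c_m)):
        combined.extend(weight * x for x in vec)
    return combined


# Hypothetical parameters for a 2-dimensional mention vector.
weight_rows = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
biases = [0.0, 0.0, 0.0]
c_l, c_r, c_m = [0.2, 0.4], [0.1, 0.3], [2.0, 1.0]

weights = dynamic_attention_weights(c_m, weight_rows, biases)
combined = apply_attention(weights, c_l, c_r, c_m)
print(weights)
```

Unlike the scalar variant, the weights here change with every mention, since they are computed from the mention vector itself.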
3.1.3 Hidden and output layers
After being multiplied by the attention weights, the three context vectors are concatenated to form the combined representation. This representation is passed through the hidden layer and then the output layer, which produces a vector containing one weight for each label $l \in L$, where $L$ is the set of all labels.
3.1.4 Loss function
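After applying the sigmoid function to normalise the weights to between 0 and 1, the loss of the model is calculated using binary cross entropy:

$$\mathcal{L} = -\frac{1}{|L|} \sum_{i=1}^{|L|} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right] \qquad (1)$$

Here, $y_i$ is the correct label of the class of index $i$, $\hat{y}_i$ is the prediction score associated with the class of index $i$, and $L$ is the set of labels.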
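As a worked illustration, binary cross entropy averaged over the labels of a single mention can be computed as below. The clipping constant `eps` is an assumption added to avoid taking the logarithm of zero; it is not part of the definition.

```python
import math


def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean binary cross entropy over the labels of one example."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


# Two labels, both predicted with score 0.5: the loss equals log(2).
print(binary_cross_entropy([1, 0], [0.5, 0.5]))
```

Each label contributes independently, which is what allows a mention to carry several labels at once.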
3.1.5 Prediction layer
Given a set of prediction weights $\hat{y}_i$, one for each label $i$ in $L$, the prediction layer outputs 1 for a label when its weight reaches the decision threshold and 0 when it does not.
One minor difference between the mention-level model and the end-to-end model is that in mention-level typing, each entity mention is guaranteed to have at least one label. To account for this, we adjust the prediction layer of our mention-level model to output its highest-weighted label in the event that no prediction weight reaches the threshold.
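The prediction rule, including the fallback used by the mention-level model, can be sketched as follows. The threshold value of 0.5 and the use of a greater-than-or-equal comparison are assumptions made for this illustration, as are the function and parameter names.

```python
def predict_labels(scores, threshold=0.5, require_one=False):
    """Turn per-label scores into 0/1 predictions.

    threshold=0.5 is an assumed value. With require_one=True (the
    mention-level case), the highest-scoring label is switched on
    whenever no score reaches the threshold.
    """
    predicted = [1 if s >= threshold else 0 for s in scores]
    if require_one and not any(predicted):
        best = max(range(len(scores)), key=lambda i: scores[i])
        predicted[best] = 1
    return predicted


print(predict_labels([0.9, 0.2, 0.6]))                    # [1, 0, 1]
print(predict_labels([0.1, 0.3, 0.2]))                    # [0, 0, 0]
print(predict_labels([0.1, 0.3, 0.2], require_one=True))  # [0, 1, 0]
```

The end-to-end model would use the plain rule (`require_one=False`), since most tokens are not entities and should receive no label.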
3.2 End-to-end model (E2EET)
The end-to-end entity typing model (E2EET), as shown in Figure 2, predicts the type(s) of every token in a sentence. In contrast to the mention-level model, it does not require the entities to be segmented and can operate on a dataset composed purely of raw text. Rather than taking an entity mention and its context as input, it takes an entire sentence as input and outputs a set of zero or more labels for each token in the sentence.
E2EET is similar in architecture to our mention-level model, but uses a bidirectional gated recurrent unit (GRU) instead of a feed-forward network. This allows for both forward and backward contexts (independent of window size) to be taken into account when predicting the label(s) of each token.
3.2.1 Loss function
The loss of E2EET is calculated using binary cross entropy in the same manner as the mention-level model as per Equation 1. However, rather than averaging across a single set of label predictions, the loss of E2EET is averaged across the predictions of all wordpiece tokens in the current batch.
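The difference in averaging can be made concrete with a short sketch. The nested-list batch layout used here (batch, then sentence, then wordpiece, then label) is an assumption for illustration, and the handling of padded positions is left out.

```python
import math


def bce(y_true, y_pred, eps=1e-12):
    """Mean binary cross entropy over the labels of one wordpiece."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


def batch_loss(batch_true, batch_pred):
    """Average the per-wordpiece loss over every wordpiece in the batch."""
    losses = [
        bce(t, p)
        for sent_t, sent_p in zip(batch_true, batch_pred)
        for t, p in zip(sent_t, sent_p)
    ]
    return sum(losses) / len(losses)


# Two sentences (2 and 1 wordpieces), two labels per wordpiece.
batch_true = [[[1, 0], [0, 0]], [[0, 1]]]
batch_pred = [[[0.5, 0.5], [0.5, 0.5]], [[0.5, 0.5]]]
print(batch_loss(batch_true, batch_pred))
```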
3.2.2 Concatenation layer
As opposed to many embedding models, BERT operates at the wordpiece level. The outputs of the E2EET must be a set of labels per token, not per wordpiece. Our concatenation layer therefore takes the average predictions of each wordpiece label corresponding to a particular word as shown in Figure 2.
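The averaging step can be sketched as below, assuming that continuation wordpieces are marked with a `##` prefix as in the tokenization example given earlier in the paper. The function name and the scores are illustrative.

```python
def merge_wordpiece_predictions(pieces, piece_scores):
    """Average per-wordpiece label scores into one score vector per word."""
    words, groups = [], []
    for piece, scores in zip(pieces, piece_scores):
        if piece.startswith("##") and groups:
            words[-1] += piece[2:]     # glue the continuation onto the word
            groups[-1].append(scores)
        else:
            words.append(piece)        # a new word starts here
            groups.append([scores])
    # Column-wise mean of the score vectors belonging to each word.
    word_scores = [[sum(col) / len(group) for col in zip(*group)] for group in groups]
    return words, word_scores


pieces = ["Johan", "##son", "runs"]
piece_scores = [[0.75, 0.25], [0.25, 0.25], [0.0, 0.5]]
words, word_scores = merge_wordpiece_predictions(pieces, piece_scores)
print(words)        # ['Johanson', 'runs']
print(word_scores)  # [[0.5, 0.25], [0.0, 0.5]]
```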
[Table 1: number of training, development, and test examples (# Train, # Dev, # Test) for each dataset (Name), in its Original and Modified versions.]
We evaluate our mention-level model on the benchmark datasets Wiki, Ontonotes, and BBN, obtained from https://github.com/INK-USC/AFET. We use a portion of the training datasets as validation sets (434 for Wiki, and 10% for Ontonotes and BBN). These datasets are summarised in Table 1 in the Original column.
Initial data exploration of the training sets of the Ontonotes and BBN datasets found that the data contained a high proportion of incorrect labels. The testing sets, however, appear to be free from error as a result of being manually annotated. We therefore created our own versions of these datasets, hereafter known as the “modified” datasets and prefixed with an “M”. The datasets are summarised in Table 1 in the Modified column.
The training set of the Wiki dataset is relatively clean when compared with the training sets of the original Ontonotes and BBN datasets. However, we found that due to the complexity of our end-to-end model, it was necessary to trim the large Wiki dataset in order for it to fit in memory. We therefore constructed the M-Wiki dataset by taking the first 50,000 documents of the original 1.5 million document training set as the new training set and the following 434 as the new validation set. The test set is the same as in the original dataset. For the M-Ontonotes and M-BBN datasets, the training sets comprise the first 80% of the original test data. The validation sets comprise the following 10%, and the test sets comprise the remaining 10%.
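The 80/10/10 split used for M-Ontonotes and M-BBN can be sketched as below. The function name is hypothetical, and the way boundaries are rounded when the document count is not divisible by ten is an assumption.

```python
def split_80_10_10(documents):
    """Split an ordered list into the first 80%, the next 10%, and the last 10%."""
    n = len(documents)
    train_end = n * 8 // 10   # integer arithmetic avoids float rounding surprises
    dev_end = n * 9 // 10
    return documents[:train_end], documents[train_end:dev_end], documents[dev_end:]


train, dev, test = split_80_10_10(list(range(20)))
print(len(train), len(dev), len(test))  # 16 2 2
```

The split is positional rather than random, matching the description of taking the first 80% and the following 10%.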
4.2 Model parameters
After parameter tuning we found that the best performance on the development set for both of our models was achieved with a learning rate of 0.0001, a hidden dimension size of 768, and 0.5 dropout prior to the final layer. The models were optimised using Adam. The batch size was 100 for the mention-level model and 10 for the end-to-end model. For the mention-level model we used a context window size of 10 for the left, right, and mention contexts. The end-to-end model was trained with a maximum sequence length of 100, allowing for 99% of the data to be included without dramatically increasing training time.
4.3 Embedding techniques
In order to evaluate the effectiveness of the BERT embeddings in our models, we evaluate our end-to-end model with four different embedding techniques. Uniform, the baseline, assigns a uniform distribution of embedding weights for each token. The GloVe embeddings are pretrained on Wikipedia 2014 + Gigaword 5 (https://nlp.stanford.edu/projects/glove/). The Word2Vec embeddings are pretrained on the Wikipedia corpus (https://wikipedia2vec.github.io/wikipedia2vec/pretrained/). The embedding dimension of each of these three techniques was 300. The BERT embeddings are generated from the pre-trained BERT-Base, Cased model (https://github.com/google-research/bert), which provides embeddings of dimension 768. We used Bert-as-service (https://github.com/hanxiao/bert-as-service) to embed the sentences per batch. We did not fine-tune BERT on our datasets as we found it did not provide a performance improvement.
4.4 Evaluation metrics
We evaluate our model using three standard metrics for entity typing systems, each reported as an F1 score: Strict Accuracy, Loose Macro, and Loose Micro, described in detail in the FIGER paper. Strict Accuracy only considers the prediction for a token correct when the set of predicted classes matches the set of ground truth classes exactly. Loose Macro computes precision and recall from the overlap between predicted and ground truth classes individually for each entity mention and averages them across all mentions, whereas Loose Micro pools the overlaps across the whole corpus before computing the score. Loose Macro tends to be penalised when new unseen categories appear in the test set, and is therefore sensitive to unbalanced datasets. In such cases, Loose Micro is a fairer metric.
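The three metrics can be sketched as follows, using the commonly used definitions over sets of labels. The label names are invented, and treating the strict score as the fraction of exact matches is a simplifying assumption for this illustration.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall, or 0.0 when both are zero."""
    total = precision + recall
    return 2 * precision * recall / total if total > 0 else 0.0


def strict_accuracy(gold, pred):
    """Fraction of items whose predicted label set matches the gold set exactly."""
    return sum(1 for g, p in zip(gold, pred) if g == p) / len(gold)


def loose_macro(gold, pred):
    """Per-item precision and recall, averaged over items, then combined."""
    precisions = [len(g & p) / len(p) for g, p in zip(gold, pred) if p]
    recalls = [len(g & p) / len(g) for g, p in zip(gold, pred) if g]
    precision = sum(precisions) / len(precisions) if precisions else 0.0
    recall = sum(recalls) / len(recalls) if recalls else 0.0
    return f1(precision, recall)


def loose_micro(gold, pred):
    """Overlap counts pooled over the whole corpus before dividing."""
    overlap = sum(len(g & p) for g, p in zip(gold, pred))
    n_pred = sum(len(p) for p in pred)
    n_gold = sum(len(g) for g in gold)
    precision = overlap / n_pred if n_pred else 0.0
    recall = overlap / n_gold if n_gold else 0.0
    return f1(precision, recall)


# Hypothetical gold and predicted label sets for two mentions.
gold = [{"/person", "/person/artist"}, {"/location"}]
pred = [{"/person"}, {"/location", "/location/city"}]
print(strict_accuracy(gold, pred), loose_macro(gold, pred), loose_micro(gold, pred))
```

In this example neither prediction is an exact match, so the strict score is 0, while the loose scores give partial credit for the overlapping labels.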
4.5 Baseline systems
We compare our mention-level model to the state-of-the-art systems evaluated in the METIC paper, which include AFET, HNM, and METIC itself.
5.1 Mention-level model performance
Our first set of investigations is to determine how our mention-level model’s performance compares to existing systems. We also investigate the effectiveness of the proposed attention mechanisms.
Table 2 shows the results of our model when compared to state-of-the-art mention-level models, with results for existing systems taken from previously published work. AFET, which relies on handcrafted features, is highlighted in grey.
Our model outperforms METIC, the top-performing system that does not rely on handcrafted features, in every experiment except the micro-F1 metric on the Wiki dataset. We attribute the success to the combination of the context-dependent BERT embedding vectors and the three-part context model. In contrast to existing state-of-the-art systems, our model is able to encapsulate contextual information via the context-dependent embeddings provided by BERT’s transformer-based architecture.
Despite not relying on handcrafted features like AFET does, our model mostly outperforms AFET on the Wiki and BBN datasets. However, it performs substantially worse on the Ontonotes dataset. This is most likely due to the high quality of the handcrafted features present in Ontonotes which help to boost AFET’s performance.
The attention mechanisms (scalar and dynamic) generally had very little impact on performance. Additionally, despite its complexity, the dynamic attention mechanism was often outperformed by the scalar variant. During training, the dynamic attention mechanism performed best on the validation sets (which comprise 10% of the training data), but there was a substantial difference (0.2) in F1 scores between the validation and test sets. The most likely cause of this phenomenon is that the training and testing sets are too distinct from one another, leading to rapid overfitting in more complex models such as attention-based models. This is supported by the fact that the training sets of BBN and Ontonotes were automatically labelled, whereas the testing sets were manually labelled.
The results clearly show that our mention-level entity typing model outperforms all current state-of-the-art techniques that do not rely on handcrafted features. It is also highly competitive with AFET, a top-performing system that uses handcrafted features.
5.2 End-to-end entity typing (E2EET) performance
5.2.1 Baseline performance
To evaluate E2EET we test the model after its embedding layer has been initialised with four different embedding techniques. Table 3 shows the results of E2EET on the modified Wiki, Ontonotes and BBN datasets. Here, the strict accuracy, macro, and micro F1 scores are calculated with respect to the model’s predictions of every single token in the corpus.
The BERT embeddings significantly outperform the other embedding techniques on every dataset. It is clear that the context-dependent embeddings provided by the BERT model are highly effective when used to support an entity typing model.
The results indicate that the accuracy and macro-F1 metrics used for mention-level models do not accurately reflect the performance of an end-to-end model. The accuracy is extremely high because the vast majority of tokens are not entities, and the model successfully predicts no labels for these tokens. Macro-F1 is similarly misleading, being consistently low due to the scores being divided by the total number of tokens. The only useful metric for evaluating E2EET appears to be micro F1, which disregards non-entities by dividing by the sum of ground truth labels for each token.
Disregarding the misleading accuracy and macro-F1 scores, E2EET performs well on the M-Ontonotes and M-BBN datasets. It did not fare well on the M-Wiki dataset, however, as it is considerably noisier. Overall, the results indicate that the model is capable of performing end-to-end entity typing, but future research regarding end-to-end models should investigate and devise more suitable evaluation metrics.
5.2.2 Comparison with mention-level model
Table 4 shows a comparison between our mention-level model and E2EET when trained and evaluated upon the modified datasets. Here, E2EET is using BERT embeddings and is evaluated using the same F1 calculations as the mention-level model, i.e. the scores are only calculated across entities as opposed to across all tokens. This means that any non-entities that are incorrectly labelled as entities by E2EET are ignored.
E2EET is competitive with the mention-level model on the relatively clean M-Ontonotes and M-BBN datasets, even outperforming the mention-level model on M-BBN. This is surprising considering E2EET has a vastly more difficult training objective, and does not know which tokens are entity mentions. The most likely explanation for this result is that the context of the entire sequence plays a pivotal role in the model’s success, particularly in M-BBN. It allows E2EET to more effectively classify each token than the mention-level model which was trained on smaller context windows.
Another noticeable result is that, for the mention-level typing models, there is a clear relationship between the complexity of the attention model and the overall performance. This is in contrast to the results on the original datasets (Table 2), where the attention mechanism had little impact on performance. The dynamic attention model in particular excels on the modified datasets, which are significantly cleaner than the original datasets as a result of being taken from the hand-labelled test set of their respective original dataset. The dynamic attention mechanism is clearly most effective when the dataset is clean and when there is consistency between the training and testing sets.
Overall, the results of the comparison show that E2EET is highly competitive with the mention-level model as a result of its ability to incorporate the context of the entire sequence. It performs well on clean datasets where there is little disparity between the training and evaluation sets, and provides a strong foundation for future research into entity typing.
In this paper we have carried out an extensive ablative study demonstrating the effectiveness of contextualised embeddings for mention-level entity typing and have shown the competitiveness of our proposed end-to-end system for entity typing. Our mention-level model embeds the left, right, and mention contexts using BERT and employs two novel attention mechanisms in order to predict the labels associated with each entity mention. Our end-to-end model (E2EET), on the other hand, effectively determines the type(s) of all tokens in a sentence. Results show that our mention-level model outperforms state-of-the-art mention-level entity typing models. Our end-to-end model performs well on clean datasets and is capable of outperforming the mention-level model despite not knowing which tokens in each sentence are entities.
In future we plan to run ten-fold cross validation to ensure statistical significance in the experiments. It would also be interesting to investigate the effectiveness of other recent context-dependent embeddings such as XLNet, to replace the Bi-GRU with transformers for context representation learning, and to incorporate hierarchical encoding techniques to improve prediction accuracy.
- (2014) Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
- (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- (2015) A hybrid neural model for type classification of entity mentions. In Twenty-Fourth International Joint Conference on Artificial Intelligence.
- (2012) Fine-grained entity recognition. In AAAI, Vol. 12, pp. 94–100.
- (2018) Hierarchical losses and new resources for fine-grained entity typing and linking. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1, pp. 97–109.
- (2014) GloVe: global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543.
- (2018) Deep contextualized word representations. In Proc. of NAACL.
- (2016) AFET: automatic fine-grained entity typing by hierarchical partial-label embedding. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 1369–1378.
- (2008) Information extraction. Foundations and Trends in Databases 1 (3), pp. 261–377.
- (2015) Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.
- (2016) Neural architectures for fine-grained entity type classification. arXiv preprint arXiv:1606.01341.
- (2019) ICDM 2019 knowledge graph contest: team UWA. arXiv preprint arXiv:1909.01807.
- (2017) An interactive web-based toolset for knowledge discovery from short text log data. In International Conference on Advanced Data Mining and Applications, pp. 853–858.
- (2009) Mining multi-label data. In Data Mining and Knowledge Discovery Handbook, pp. 667–685.
- (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.
- (2018) METIC: multi-instance entity typing from corpus. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pp. 903–912.
- (2019) XLNet: generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.