Named Entity Sequence Classification

12/06/2017 ∙ by Mahdi Namazifar, et al. ∙ 0

Named Entity Recognition (NER) aims at locating and classifying named entities in text. In some use cases of NER, including cases where detected named entities are used in creating content recommendations, it is crucial to have a reliable confidence level for the detected named entities. In this work we study the problem of finding confidence levels for detected named entities. We refer to this problem as Named Entity Sequence Classification (NESC). We frame NESC as a binary classification problem, and we use NER as well as recurrent neural networks to find the probability that a candidate named entity is a real named entity. We apply this approach to Tweet texts and show how named entities can be found in Tweets with high confidence levels.




1 Introduction

In Named Entity Recognition (NER) the goal is to locate and classify named entities (defined as physical or abstract objects that can be expressed with proper nouns) in a given text. NER as an area of Natural Language Processing (NLP) has been studied extensively. A survey of studies of NER based on classical NLP approaches can be found in Nadeau and Sekine (2007). NER approaches based on deep learning have also been frequently reported in the NLP literature dos Santos and Guimarães (2015); Socher et al. (2012). These approaches generally rely on pre-trained word embeddings as well as sequence modeling techniques based on Recurrent Neural Networks (RNNs) or Temporal Convolutions.

NER for Tweets has also been studied extensively Ritter et al. (2011); Liu et al. (2011, 2013); Limsopatham and Collier (2016). Due to the limit on the number of characters in a Tweet, heavy use of slang and emojis, lack of proper capitalization, and the informal style of writing, detecting named entities in Tweet text is significantly more challenging than in other types of text (news, books, web pages, etc.).

Our focus in this work is on detecting named entities in Tweets. The architecture that we use for our NER model is similar to the architecture proposed in Huang et al. (2015) in that we use bidirectional LSTMs and Conditional Random Fields (CRF) to tag Tweet tokens with named entity labels. We show that our trained NER model performs quite well relative to the standard open source NER solution by the Stanford NLP group Manning et al. (2014).

One of the applications of NER on Tweets for Twitter is recommending content to users. Example content recommendation use cases where NER is valuable include notifying users that their network is Tweeting about a specific named entity, or grouping Tweets about a certain named entity to show to users. For recommending content it is crucial to have very high confidence that the detected named entities are in fact true named entities. Unfortunately, it is not straightforward to measure the confidence that a sequence of tokens detected by our NER model (which is a sequence tagging based approach) is a named entity. In fact, none of the studies on sequence tagging that we found address this requirement, and that is the motivation behind the problem that we call Named Entity Sequence Classification (NESC).

We define NESC as follows: given a text and a sequence of tokens in that text, determine if the sequence of tokens is a named entity. The idea is that NESC as a binary classification problem would provide the probability of the sequence of tokens (named entity candidate) being in fact a named entity. For the NESC model we use the output of the trained bidirectional LSTM of the NER model for a context window around the candidate named entity (sequence of tokens that we want to classify as a named entity or not) and we train another LSTM for the binary classification problem of NESC.

It should be mentioned that this binary classification of sub-sequences of a sequence is not unique to named entity recognition and could be applied to any sequence tagging problem where consecutive elements of the sequence could constitute a tag of interest.

In the rest of this paper, we first present our NER model for Tweets. Next we discuss the NESC problem definition, our proposed model architecture for NESC, and details on how to train the proposed NESC model. Finally we show some experimental results on the performance of our NER and NESC models.

2 NER for Tweets

By definition, NER not only tries to locate named entities in text, but also tries to classify the detected named entities into a pre-determined set of entity types. Different NER models have different sets of possible entity types. In this work our NER model covers the following entity types: Person, Place, Product, Organization, and Other; and we use the Inside–Outside–Beginning (IOB) format IOB (2013) for entity boundaries.

In order to build an NER model for Tweets we follow the general model architecture proposed by many authors Tran et al. (2017); Chiu and Nichols (2015); Limsopatham and Collier (2016), which includes pre-trained word embeddings for Tweet tokens followed by a recurrent neural network (a bidirectional LSTM in our case) and a fully connected layer followed by a softmax layer that produces a discrete probability distribution over the possible labels of each Tweet token.

Tweet texts are first tokenized using a Twitter internal text processing tool, which is a heuristic-based text tokenizer for tens of languages. Each token is then vectorized using our pre-trained word embeddings. Our word embeddings are 200-dimensional vectors that are trained using GloVe Pennington et al. (2014) on over 1 billion Tweets. Other than dense word embeddings, the vector representing each token also includes some sparse variables in the form of 2 one-hot vectors. The first one-hot vector indicates whether the token is one of the special characters (%, /, ., !, ?, …), a hashtag, an @handle, whether its first character is capitalized, or whether the entire token is capitalized (if the entire token is capitalized, the value of the indicator for the first character being capitalized is 0). This one-hot vector has 36 dimensions. The other one-hot vector specifies the Part of Speech (POS) tag associated with each of the tokens. These POS tags are also provided by Twitter’s internal text processing tools, and each token gets one of 17 possible POS tags; as a result this one-hot vector has 17 dimensions. For the non-zero value of the one-hot vectors we use 0.1 so that this number is in the same range as the values in the dense word embedding vectors. Figure 1 depicts the parts of a vectorized token.
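The layout of a vectorized token described above can be sketched as follows. This is a minimal sketch under stated assumptions: the dummy embedding, the slot assigned to each indicator inside the 36-dimensional block, and the POS tag indices are hypothetical placeholders, since the tokenizer, embeddings, and POS tagger are internal Twitter tools.

```python
EMBED_DIM = 200    # GloVe embedding size given in the text
FEATURE_DIM = 36   # special character / hashtag / @handle / capitalization block
POS_DIM = 17       # one slot per POS tag
ACTIVE = 0.1       # non-zero value used in the sparse blocks instead of 1.0


def one_hot(index, size, active=ACTIVE):
    """Return `size` zeros with `active` at `index` (all zeros if index is None)."""
    vec = [0.0] * size
    if index is not None:
        vec[index] = active
    return vec


def vectorize_token(embedding, feature_index, pos_index):
    """Concatenate the dense embedding with the two sparse blocks."""
    if len(embedding) != EMBED_DIM:
        raise ValueError("expected a 200-dimensional embedding")
    return list(embedding) + one_hot(feature_index, FEATURE_DIM) + one_hot(pos_index, POS_DIM)


# Hypothetical inputs: a dummy embedding, feature slot 3, POS tag 5.
token_vector = vectorize_token([0.05] * EMBED_DIM, feature_index=3, pos_index=5)
print(len(token_vector))  # 253 = 200 + 36 + 17
```

Using 0.1 rather than 1.0 for the active slots keeps the sparse features on roughly the same scale as the embedding values, as the text explains.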

Figure 1: Vectorized token

Vectorized Tweet tokens next go through our NER model, whose architecture is shown in Figure 2. As the figure shows, vectorized tokens go through a bidirectional LSTM with dropout, and the output goes through a fully connected layer and then a softmax layer to create an 11-dimensional probability distribution for each token. These 11 dimensions correspond to the beginning and inside tags of the 5 entity types plus the one not-an-entity tag. Next these probabilities go through a Conditional Random Field (CRF), which learns the correct order of entity labels, and the result is the final NER label for each of the tokens. For instance, it learns that a not-an-entity label is most likely not followed by an inside-an-entity label (e.g., I-Person).

Figure 2: NER model architecture
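The 11 labels and the kind of ordering constraint the CRF is expected to pick up can be illustrated with a short sketch. Note the assumption: the CRF in the text learns label order from data as soft preferences, whereas this sketch hard-codes the IOB rule (in the variant where every entity starts with a B- tag) purely for illustration. The function name is hypothetical.

```python
ENTITY_TYPES = ["Person", "Place", "Product", "Organization", "Other"]
OUTSIDE = "O-not-an-entity"

# One B- and one I- label per entity type, plus the not-an-entity label.
LABELS = [OUTSIDE] + [f"{prefix}-{t}" for t in ENTITY_TYPES for prefix in ("B", "I")]


def is_valid_transition(prev_label, next_label):
    """An I-X label may only follow B-X or I-X of the same type X."""
    if not next_label.startswith("I-"):
        return True
    entity_type = next_label[2:]
    return prev_label in (f"B-{entity_type}", f"I-{entity_type}")


print(len(LABELS))                                 # 11
print(is_valid_transition(OUTSIDE, "I-Person"))    # False
print(is_valid_transition("B-Place", "I-Place"))   # True
```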

3 NESC

3.1 Problem Definition

For each token, the output of the NER model discussed in the previous section provides a label along with the probability that the softmax layer assigns to that label. These scores, however, cannot be used directly, for example in a multiplicative way, to get a confidence for multi-token entities. As an example, if we run the sentence “I love San Francisco” through the NER model, the output of the softmax layer (the probability of each label for each token) is shown in Table 1.

| Label | I | love | San | Francisco |
|---|---|---|---|---|
| O-not-an-entity | 0.985 | 0.988 | 0.012 | 0.026 |
| B-Person | 0.001 | 0.001 | 0.023 | 0.043 |
| I-Person | 0.001 | 0.001 | 0.001 | 0.087 |
| B-Place | 0.002 | 0.001 | 0.483 | 0.005 |
| I-Place | 0.001 | 0.001 | 0.006 | 0.659 |
| B-Product | 0.003 | 0.001 | 0.276 | 0.021 |
| I-Product | 0.001 | 0.001 | 0.159 | 0.032 |
| B-Organization | 0.001 | 0.002 | 0.009 | 0.123 |
| I-Organization | 0.001 | 0.001 | 0.011 | 0.002 |
| B-Other | 0.001 | 0.001 | 0.012 | 0.001 |
| I-Other | 0.003 | 0.002 | 0.008 | 0.001 |

Table 1: Example of NER labels and their associated probabilities

From the labels in Table 1 we can see that the model predicts that “San Francisco” is a place, but it is not straightforward to calculate the likelihood of “San Francisco” being an entity from the softmax probabilities provided for the token labels. The reason is that these probabilities are calculated using the entire context of the text, so the probabilities of all tokens and their labels are correlated.
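To make the difficulty concrete, the naive multiplicative score for “San Francisco” can be computed from the Table 1 values. This is only an illustration of the approach the text argues against: multiplying the per-token probabilities treats them as independent, which they are not. The variable names are hypothetical.

```python
# Softmax probabilities from Table 1 for the two candidate tokens.
SAN = {
    "O-not-an-entity": 0.012, "B-Person": 0.023, "I-Person": 0.001,
    "B-Place": 0.483, "I-Place": 0.006, "B-Product": 0.276,
    "I-Product": 0.159, "B-Organization": 0.009, "I-Organization": 0.011,
    "B-Other": 0.012, "I-Other": 0.008,
}
FRANCISCO = {
    "O-not-an-entity": 0.026, "B-Person": 0.043, "I-Person": 0.087,
    "B-Place": 0.005, "I-Place": 0.659, "B-Product": 0.021,
    "I-Product": 0.032, "B-Organization": 0.123, "I-Organization": 0.002,
    "B-Other": 0.001, "I-Other": 0.001,
}

# Most likely label per token.
best_san = max(SAN, key=SAN.get)
best_francisco = max(FRANCISCO, key=FRANCISCO.get)

# Naive product of the two winning probabilities (assumes independence).
naive_score = SAN[best_san] * FRANCISCO[best_francisco]
print(best_san, best_francisco, round(naive_score, 3))  # B-Place I-Place 0.318
```

The product of about 0.318 would suggest low confidence even though the tagger labels both tokens as a place, which is why a separate classifier is used to score the sequence as a whole.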

The NESC problem aims at calculating likelihoods for named entities that are proposed by the output of the NER model. We define the NESC problem as follows:


Given a text $T$ and a sub-sequence of tokens $S$ in $T$, what is the probability of $S$ being a named entity in $T$?

For instance, in our previous example “I love San Francisco”, what is the probability of the sequence containing the two tokens “San” and “Francisco” being a named entity? In this example $T$ = “I love San Francisco”, $S$ = (“San”, “Francisco”), and $S$ is the output of the NER model.

3.2 NESC Model Architecture

By definition, NESC is a binary classification problem, and in the rest of this section we discuss how we build a model for this problem. First we define a context window around the candidate sub-sequence $S$ with context size $c$. For instance, if the index of the starting token of $S$ is $i$ and the index of the ending token of $S$ is $j$, then the context window of $S$ would be tokens $i - c$ to $j + c$. Let’s call this context window $W$. Needless to say, we first pad the input text $T$ on both sides with pad size $c$ to make sure that we can define $W$ for any sub-sequence of tokens in $T$. Figure 3 shows an example of a context window around a candidate named entity of size 2 tokens and context size ($c$) of 2.

Figure 3: NESC context window
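The padding and windowing step can be sketched as below. This is a minimal sketch assuming a hypothetical `<pad>` token and 0-based, inclusive start and end indices; it operates on tokens for readability, but the same slicing would apply to the per-token vectors that the model actually consumes.

```python
PAD = "<pad>"  # hypothetical padding token


def context_window(tokens, start, end, context_size):
    """Return tokens start-context_size .. end+context_size (inclusive), padding both sides."""
    if not 0 <= start <= end < len(tokens):
        raise ValueError("candidate span must lie inside the text")
    padded = [PAD] * context_size + list(tokens) + [PAD] * context_size
    # After padding, original index k sits at position k + context_size,
    # so the window starts at `start` and ends at `end + 2 * context_size`.
    return padded[start:end + 2 * context_size + 1]


tokens = ["I", "love", "San", "Francisco"]
print(context_window(tokens, start=2, end=3, context_size=2))
# ['I', 'love', 'San', 'Francisco', '<pad>', '<pad>']
```

The window always has length (candidate length + 2 × context size), so the candidate sits in the middle regardless of where it appears in the text.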

NESC as a binary classification problem now boils down to deciding, given a context window $W$ with context size $c$, whether the middle part of the window (everything except the $c$ tokens on each side) is a named entity or not. To approach this problem we need to vectorize the context window defined by the sequence of candidate tokens and the other tokens in the context window. To do this we use the output of the pre-trained NER model’s bidirectional LSTM, and we build a sequence of the bidirectional LSTM layer output vectors that are associated with the tokens of the context window $W$. In other words, the vectorization of the context window is simply the slice of the pre-trained NER model’s bidirectional LSTM layer output that coincides with the context window.

Remember that for each token, the NER model’s bidirectional LSTM vector associated with it has captured information from all other tokens that come before and after it. Therefore the vectorization of the context window $W$ contains information from the entire text $T$ and not only from the context window $W$.

Now that we have the context window represented as a sequence of vectors, the NESC problem can be viewed as a binary classification problem on sequences. For this we build a model that is an LSTM from which we take the internal state vector and follow that by a fully connected layer and a softmax layer that outputs the probability of the candidate sub-sequence being a named entity. Figure 4 depicts the full architecture of the NESC model.

Figure 4: NESC model architecture

3.3 Model Training

We build the training set for NESC from the labeled data of the NER problem. Each NESC training sample is a sequence of vectors that are the output of the bidirectional LSTM layer of the pre-trained NER model for a context window, along with a target binary value which indicates whether the center of the context window (with a given context size) is a named entity or not. For positive samples (i.e., sub-sequences of tokens that are named entities in the text that they appear in), we can directly use named entities in the NER training data. Each labeled named entity in that data becomes a record in the NESC training set with positive target value.

On the other hand, the negative samples are generated in two different ways. The first way of generating negative samples involves perturbations of positive samples. More specifically, if we consider the window of tokens of a positive sample in the text, we can get negative samples by extending, shrinking, and moving this window. Table 2 shows some negative samples that can be created by perturbing the correct named entity.

| Tweet text | Sample type |
|---|---|
| homeless population in San Francisco is surging | Positive NESC Sample |
| homeless population in San Francisco is surging | Negative NESC Sample |
| homeless population in San Francisco is surging | Negative NESC Sample |
| homeless population in San Francisco is surging | Negative NESC Sample |
| homeless population in San Francisco is surging | Negative NESC Sample |

Table 2: Positive and negative samples for NESC from NER labeled data

As shown, for each named entity in the NER labeled data, one positive and several negative NESC samples can be created.
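The perturbation step can be sketched as follows. This is a minimal sketch, not the authors' exact procedure: the `max_shift` parameter (how far each boundary may move) is a hypothetical choice, and the sketch does not check whether a perturbed span happens to coincide with another labeled entity in the same text.

```python
def perturbed_negatives(num_tokens, start, end, max_shift=1):
    """Generate (start, end) spans near a positive span by extending, shrinking, and moving it."""
    candidates = set()
    for d_start in range(-max_shift, max_shift + 1):
        for d_end in range(-max_shift, max_shift + 1):
            s, e = start + d_start, end + d_end
            # Skip the positive span itself and anything empty or out of range.
            if (s, e) != (start, end) and 0 <= s <= e < num_tokens:
                candidates.add((s, e))
    return sorted(candidates)


tokens = "homeless population in San Francisco is surging".split()
# "San Francisco" occupies token indices 3..4 (inclusive).
for s, e in perturbed_negatives(len(tokens), start=3, end=4):
    print(" ".join(tokens[s:e + 1]))
```

With a shift of one token this yields spans such as “in San Francisco” (extended), “San” (shrunk), and “Francisco is” (moved).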

The second approach to generating negative samples for NESC is simply taking a random sub-sequence of tokens from the NER labeled text. For each random sub-sequence we first sample its size from the empirical discrete distribution of the size (number of tokens) of the named entities in the NER labeled data. Next we select a random sub-sequence of tokens of the length that we just sampled. Lastly we check that the selected random sub-sequence is not in fact a named entity. This check is easy to automate, since the random sub-sequences are selected from NER labeled texts and we only need to inspect the labels of their tokens.
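The three steps of this second approach can be sketched as below. This is a sketch under assumptions: the helper names, the retry limit, and the example length counts are hypothetical, labels are assumed to be well-formed IOB tags, and a sampled span is rejected only when it exactly matches a labeled entity span.

```python
import random
from collections import Counter


def entity_spans(labels):
    """Extract inclusive (start, end) entity spans from well-formed IOB labels."""
    spans, start = [], None
    # The trailing sentinel closes an entity that ends at the last token.
    for i, label in enumerate(list(labels) + ["O-not-an-entity"]):
        if start is not None and not label.startswith("I-"):
            spans.append((start, i - 1))
            start = None
        if label.startswith("B-"):
            start = i
    return spans


def sample_random_negative(labels, length_counts, rng, max_tries=100):
    """Sample a span with an entity-like length that is not a labeled entity."""
    positives = set(entity_spans(labels))
    lengths = list(length_counts)
    weights = [length_counts[n] for n in lengths]
    for _ in range(max_tries):
        length = rng.choices(lengths, weights=weights, k=1)[0]  # step 1: sample size
        if length > len(labels):
            continue
        start = rng.randrange(len(labels) - length + 1)         # step 2: pick span
        span = (start, start + length - 1)
        if span not in positives:                                # step 3: reject entities
            return span
    return None


labels = ["O-not-an-entity"] * 3 + ["B-Place", "I-Place"] + ["O-not-an-entity"] * 2
length_counts = Counter({1: 5, 2: 3, 3: 1})  # hypothetical entity-length counts
print(sample_random_negative(labels, length_counts, random.Random(0)))
```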

Due to the imbalance between the numbers of positive and negative training samples created, we use a weighted cross entropy loss function for NESC in which the weights of the two classes are calculated based on the number of positive and negative samples in the training set.

3.4 NER Labeled Data

In order to create the labeled data for NER we first took a sample of Tweets. Twitter’s Human Computations Team (HCOMP) labeled the tokens of each Tweet sample based on the IOB schema. Each sample was initially labeled by two different individuals, and for samples on which the two sets of labels did not match, a third labeler also labeled the sample. We ended up with around 100,000 labeled English Tweets on which at least 2 labelers completely agreed on the labels. From this set we created our training, validation, and test sets.

In our NER training set there are 62,507 named entities, each of which would become a positive sample for NESC. On the other hand, using the approaches discussed earlier we create 226,067 negative NESC samples, and as a result our NESC training set contains 288,574 records.
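With these counts, the class weighting mentioned in Section 3.3 can be illustrated. The text does not state the exact weighting formula, so the inverse-frequency scheme below (weight = total / (2 × class count)) is an assumption, as are the function and constant names.

```python
import math

NUM_POSITIVE = 62_507    # positive NESC samples reported in the text
NUM_NEGATIVE = 226_067   # negative NESC samples reported in the text
TOTAL = NUM_POSITIVE + NUM_NEGATIVE

# Assumed scheme: inverse class frequency, so both classes carry equal total weight.
W_POS = TOTAL / (2 * NUM_POSITIVE)
W_NEG = TOTAL / (2 * NUM_NEGATIVE)


def weighted_cross_entropy(p_entity, is_entity):
    """Weighted binary cross entropy for one sample.

    p_entity is the predicted probability that the candidate is an entity.
    """
    eps = 1e-12
    p = min(max(p_entity, eps), 1 - eps)  # clamp to avoid log(0)
    if is_entity:
        return -W_POS * math.log(p)
    return -W_NEG * math.log(1 - p)


print(TOTAL)  # 288574
print(W_POS > W_NEG)  # True: the rarer positive class gets the larger weight
```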

4 Results

We first report the performance of our NER model on our labeled test set and compare it with the performance of Stanford NLP’s Manning et al. (2014) NER model. Table 3 summarizes this performance comparison. Here we look at 3 performance measures:

  • Untyped Token Level: The classification problem here is defined on token labels (whether a token gets an entity or a not-an-entity label). A token’s label would either associate it with an entity (label starts with “B-” or “I-”) or associate it with not an entity (label O-not-an-entity). In this case relevant instances are tokens that have labels other than O-not-an-entity in the test set.

  • Untyped Entity Level: The classification problem here is defined on named entities (whether an entity is correctly identified). Here named entity types (Person, Place, etc.) are disregarded. In this case relevant instances are full labeled named entities without their types in the test set. Here the focus is on detecting full named entities and not on their detected type.

  • Typed Entity Level: The classification problem here is defined on typed named entities (whether an entity and its type are correctly identified). In this case relevant instances are full labeled named entities and their types in the test set. Here the focus is on detecting full named entities along with their types.
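The difference between the untyped and typed entity-level measures can be sketched on a toy example. This is a minimal sketch assuming well-formed IOB labels and exact span matching; the helper names and the example sentence labels are hypothetical, not taken from the test set.

```python
def extract_entities(labels):
    """Return a set of (start, end, type) tuples from well-formed IOB labels."""
    entities, start, entity_type = set(), None, None
    # The trailing sentinel closes an entity that ends at the last token.
    for i, label in enumerate(list(labels) + ["O-not-an-entity"]):
        if start is not None and not label.startswith("I-"):
            entities.add((start, i - 1, entity_type))
            start = None
        if label.startswith("B-"):
            start, entity_type = i, label[2:]
    return entities


def untyped(entities):
    """Drop the entity type so only span boundaries are compared."""
    return {(s, e) for s, e, _ in entities}


def precision_recall_f1(gold, predicted):
    """Precision, recall, and F1 where a prediction counts only on exact match."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


gold = ["B-Person", "I-Person", "O-not-an-entity", "B-Place", "I-Place"]
pred = ["B-Person", "I-Person", "O-not-an-entity", "B-Organization", "I-Organization"]

g, p = extract_entities(gold), extract_entities(pred)
print(precision_recall_f1(untyped(g), untyped(p)))  # (1.0, 1.0, 1.0)
print(precision_recall_f1(g, p))                    # (0.5, 0.5, 0.5)
```

In this toy case both spans are found, so the untyped scores are perfect, while the mistyped second entity halves the typed scores.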

| Measure | Model | Precision | Recall | F1 Score |
|---|---|---|---|---|
| Untyped Token Level | Twitter NER | 0.84 | 0.78 | 0.81 |
| Untyped Token Level | Stanford NLP NER | 0.77 | 0.53 | 0.63 |
| Untyped Entity Level | Twitter NER | 0.76 | 0.71 | 0.73 |
| Untyped Entity Level | Stanford NLP NER | 0.63 | 0.39 | 0.48 |
| Typed Entity Level | Twitter NER | 0.69 | 0.64 | 0.66 |
| Typed Entity Level | Stanford NLP NER | 0.54 | 0.34 | 0.42 |

Table 3: Our NER model (Twitter NER) vs Stanford NLP NER model performance on Tweets

As we can see from these numbers, our NER model performs significantly better than Stanford NLP’s NER model on Tweets with respect to precision, recall, and F1 score at both typed and untyped entity levels. This performance difference is especially apparent in the recall numbers: for instance, the typed entity recall of our model is 0.64 whereas the same value for Stanford NLP NER is 0.34.

Next we study the performance of our NESC model. We created training, validation, and test sets for NESC by applying the method mentioned in Section 3.3 on our NER’s training, validation, and test sets, respectively. We calculate the precision and recall of the NESC model after isotonic calibration on the validation set at different classification threshold values for the test set. Figure 5 shows this Precision–Recall curve.

Figure 5: NESC Precision–Recall curve

One could see from this curve that at 0.90 precision the recall is around 0.52 which means that more than half of the entities in the Tweets in the test set are detected with 0.90 precision. The sparsity in this precision–recall curve is due to the small size of the test set (built using only 2000 Tweets) and could be remedied by getting more labeled Tweets in order to get a higher resolution precision–recall curve.
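Each point on such a curve comes from fixing a classification threshold and counting outcomes, which can be sketched as below. The scores and targets here are hypothetical values for illustration only, not results from the paper, and the calibration step is omitted.

```python
def precision_recall_at(scores, targets, threshold):
    """Precision and recall when samples with score >= threshold are called entities."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for p, t in zip(predicted, targets) if p and t)
    fp = sum(1 for p, t in zip(predicted, targets) if p and not t)
    fn = sum(1 for p, t in zip(predicted, targets) if not p and t)
    # Convention: precision is 1.0 when nothing is predicted positive.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical NESC scores and gold targets (1 = entity), for illustration only.
scores = [0.99, 0.95, 0.90, 0.70, 0.60, 0.30, 0.10]
targets = [1, 1, 0, 1, 0, 1, 0]

for threshold in (0.5, 0.8, 0.95):
    print(threshold, precision_recall_at(scores, targets, threshold))
```

Raising the threshold trades recall for precision, which is the trade-off the application exploits by choosing a high operating point.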

| # | Entities detected by NER with NESC probabilities in tokenized Tweets | Other candidates probability |
|---|---|---|
| 1 | NowPlaying No Cigarette Smoking In My Room - Stephen Marley ft Melonie Fiona 16 : 39 | |
| 2 | @SInow : Tony Romo reportedly is unlikely to play Sunday vs . the Steelers | |
| 3 | @GuardianBooks : The Essex Serpent beats Harry Potter to win Waterstones book of the year | |
| 4 | Marco Republic Paris Memory Foam Cushion Womens Mary Jane Platform Wedges Heels Comfort Pumps #pumps #flats #heels | Marco Republic Paris Memory Foam Cushion Womens Mary Jane Platform Wedges Heels Comfort Pumps |
| 5 | See How Bobrisky And Lolo Was Dancing On Stage As Fans React [ Video ] | |
| 6 | @90sNiallftafi : when calum and Michael got Ashton to get a spider out of the bathroom & Ashton scared calum | |
| 7 | @DaiIyRap : Childish Gambino is dropping his new album “ Awaken , My Love ” Next Month | |
| 8 | Palmetto Packings Compression Seals - 5 / 16 ” Square - FDA Listed - 20 Feet - New | Palmetto Packings Compression Seals |
| 9 | First impressions : Russell Wilson and Seahawks sputter on offense in loss to Bucs | |
| 10 | Power Forward with us tonight at our President’s Community Lecture featuring Bill Ritter . | |
| 11 | Happy Veterans day to every soldier who has fought for the rights of our country @Justin__martin3 love you | |
| 12 | Long Island Volleyball College Showcase at SPORTIME Thanks to all the Players and College Coaches for making it a huge success ! #LIVCS | Long Island; Long Island Volleyball; Long Island Volleyball College Showcase |

Table 4: NER and NESC results on sample Tweets

Table 4 shows the result of NER and NESC on a number of sample Tweets. Each row in this table contains a tokenized Tweet text along with the named entities in the text detected by NER (highlighted in yellow), the entity type detected by NER (shown as a subscript in red), and the probability, found by NESC, of the detected entity being in fact an entity (also shown as a subscript in red). For instance, “Barack Obama (Person, 0.993) is the 44th president of the United States” indicates that NER has detected that “Barack Obama” is a person’s name, because NER’s label for “Barack” is B-Person, for “Obama” is I-Person, and for “is” is O-not-an-entity. Moreover, according to NESC the probability of the substring “Barack Obama” in the string “Barack Obama is the 44th president of the United States” being an entity is 0.993.

In row 1 of Table 4 we see that NER has correctly detected the song name “No Cigarette Smoking In My Room” as an entity of type Other, and NESC gives a probability of 0.961 to this entity. Note that this is a rather long name with 6 tokens, which NER has correctly identified. There are also 2 other named entities (persons) in the Tweet that are correctly identified by NER and receive high NESC scores. Row 4 shows a Tweet in which NER does not identify a product name correctly, and consequently the NESC score of the incorrect subsequence is lower than the NESC score of the correct entity (last column). In row 5 we see a Tweet text in which all the words are capitalized, which is quite common in Tweets. In this example 2 person names are correctly identified by NER and both of them receive a score of 1.0 from NESC. In row 6 we see that a lower-case named entity, “calum”, is correctly identified by NER and receives very high scores from NESC. In row 8 we see that NER incorrectly returns the sequence “Palmetto Packings Compression” as a product name, whereas the correct product name is “Palmetto Packings Compression Seals”. However, it is worth noting that the NESC score for the incorrect product name is quite low at 0.645, whereas if we query the correct product name from NESC the returned score is 0.837 (column 3). Moreover, in this example NER fails to detect “FDA” as an entity, but if we query NESC for “FDA” in that Tweet, the returned score is quite high at 0.943. In row 11 we can see that NER correctly identifies “Veterans day” as an entity of type Other, which has a high NESC score of 0.961. Finally, in row 12 we see that NER incorrectly identifies “Island Volleyball” as an organization, but the NESC score of that subsequence is very low at 0.148.

For our application of recommending named entities from Tweets to users, these results show that we can rely on NESC scores with a high threshold. In particular, if for a collection of Tweets NESC gives very high scores (say all 0.95 or higher) to a given sequence of tokens, then we can confidently assume that this subsequence of tokens is in fact an entity, and build user recommendations based on that entity accordingly.
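The acceptance rule described above can be sketched as a simple filter over scored candidates. This is a minimal sketch: the function name and the example (sequence, score) pairs are hypothetical, and requiring every observed score to clear the threshold is one reading of “all 0.95 or higher”.

```python
from collections import defaultdict


def confident_entities(scored_candidates, threshold=0.95):
    """Keep token sequences whose NESC score is >= threshold in every Tweet where they appear."""
    scores_by_sequence = defaultdict(list)
    for sequence, score in scored_candidates:
        scores_by_sequence[sequence].append(score)
    return sorted(
        sequence
        for sequence, scores in scores_by_sequence.items()
        if min(scores) >= threshold
    )


# Hypothetical (token sequence, NESC score) pairs collected over several Tweets.
observations = [
    ("San Francisco", 0.99), ("San Francisco", 0.97),
    ("Island Volleyball", 0.148),
    ("Golden Gate", 0.96), ("Golden Gate", 0.90),
]
print(confident_entities(observations))  # ['San Francisco']
```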

5 References