Robust Named Entity Recognition in Idiosyncratic Domains

by Sebastian Arnold, et al.
FSI Beuth Hochschule

Named entity recognition often fails in idiosyncratic domains. This causes problems for downstream tasks such as entity linking and relation extraction. We propose a generic and robust approach for high-recall named entity recognition. Our approach is easy to train and offers strong generalization over diverse domain-specific language, such as news documents (e.g. Reuters) or biomedical text (e.g. Medline). Our approach is based on deep contextual sequence learning and utilizes stacked bidirectional LSTM networks. Our model is trained with only a few hundred labeled sentences and does not rely on further external knowledge. We report F1 scores in the range of 84–94% on standard datasets.




1 Introduction

Information extraction tasks have become very important not only on the Web, but also in in-house enterprise settings. One of the crucial steps towards understanding natural language is named entity recognition (NER), which aims to extract mentions of entity names in text. NER is necessary for many higher-level tasks such as entity linking, relation extraction, building knowledge graphs, question answering and intent-based search. In these scenarios, NER recall is critical, as candidates that are never generated cannot be recovered later [Hachey et al.2013].


Pink et al. (2014) show that NER components can reduce the search space for slot filling tasks by 99.8% with a recall loss of 15%. However, large effort is required to adapt most annotators to specialized domains, such as biomedical documents. When focusing on recall for these domains, we face three major problems. First, the language used in the documents is often idiosyncratic and cannot be effectively identified by standard natural language processing (NLP) tools [Prokofyev et al.2014]. Second, training for these domains is difficult: data is sparse, may contain a large number of non-linkable entity mentions (NILs), and large labeled gold standards are hardly available. Third, applications vary greatly and we cannot standardize annotation guidelines to meet all of their requirements [Ling et al.2015]. For example, NER on news texts might focus on proper named entity annotation (e.g. people, companies and locations), whereas phrase recognition on medical text might include the annotation of common concepts (e.g. medical terms and treatments). We therefore focus on a generalized NER component with high recall, which can be trained ad-hoc with only a few labeled examples.

Common error analysis.

Ling et al. (2015) point out common errors of NER systems, which yield non-recognized mentions (false negatives), invalid detections (false positives), wrong boundaries (e.g. multi-word mentions, missing determiners) and annotation errors from human labelers (e.g. correct answers not marked as correct, unclear annotation guidelines). Consider the following example taken from the biomedical GENIA corpus [Ohta et al.2002], with underlined named entity mentions:

Example: Engagement of the Lewis X antigen (CD15) results in monocyte activation. Nuclear extracts of anti-CD15 cross-linked cells demonstrated enhanced levels of the transcriptional factor activator protein-1, minimally changed nuclear factor-kappa B, and did not affect SV40 promoter specific protein-1.

We observe that common errors originate from a manifold of sources, most frequently:


  • non-verbatim mentions (e.g. misspellings, alternate writings: monoyctes, Lewis-X)

  • part-of-speech (POS) tagging errors (e.g. unidentified NP tags: monoycte/JJ)

  • wrong capitalization (e.g. uppercase headlines, lowercase proper names)

  • unseen or novel words (e.g. idiosyncratic language: anti-CD15)

  • irregular word context (e.g. collapsed lists, semi-structured data, invalid segmentation)

Our contribution.

We contribute DATEXIS-NER, a generic annotator for robust named entity recognition that can be trained for various domains with low human labeling effort. DATEXIS-NER does not depend on domain-specific rules, dictionaries, fine-tuning, syntactic annotation or external knowledge bases. Instead, our approach is built from scratch and is based on core character features of text. We train our model for the news and biomedical domains with raw text data and a few hundred labels. We report performance equal to state-of-the-art NER annotators, with F1 scores in the high 90% range on common NER corpora such as CoNLL2003, KORE50, ACE2004 and MSNBC. On the highly domain-specific biomedical GENIA corpus, we show that our approach adapts to various idiosyncratic domains. In particular, we observe that bidirectional long short-term memory (LSTM) networks capture useful distributional context for NER applications and that generic letter-trigram word encoding with surface forms compensates for typing and capitalization errors. With a combination of these techniques, we achieve better context representation than word2vec models trained on significantly larger corpora.

The rest of this paper is structured as follows. In Section 2, we discuss related work. We introduce our approach of robust named entity recognition in Section 3. In Section 4, we evaluate our approach compared to state-of-the-art annotators and discuss the most common errors in the components of our system. We conclude in Section 5 and propose future work on our approach.

2 Related Work

Named entity recognition.

The task of NER has been extensively studied in various evaluation campaigns over the last decades: MUC-6, MUC-7, CoNLL2002, CoNLL2003 and ACE. The standard approach to NER is the application of discriminative tagging [Collins2002] to the task [McCallum and Li2003], often with linear-chain Conditional Random Fields (CRF), Hidden Markov Models (HMM) or Maximum Entropy Hidden Markov Models (MEMM). Later, Bengio et al. (2003) used continuous-space language models, where type-to-vector word mappings can be learned using backpropagation. Mikolov et al. (2013) achieved a more effective vector representation using the skip-gram model, which optimizes the likelihood of tokens over a window surrounding a given token. This training process produces a linear classifier that predicts surrounding words conditioned on the central token's vector representation.

Recall bounds for idiosyncratic entity linking.

Named entity linking is the task of matching textual mentions of named entities to a knowledge base [Shen et al.2015]. This task requires a set of candidate mentions from sentences. As a result, the recall of the underlying NER system constitutes an upper bound for entity linking accuracy [Hachey et al.2013]. Moreover, Pink et al. (2014) show that "state-of-the-art systems are substantially limited by low recall" and perform especially poorly on idiosyncratic data, while Prokofyev et al. (2014) highlight that terms with high novelty or high specificity cannot efficiently be linked by current systems.

State-of-the-art NER implementations.

We distinguish between three broad categories of candidate entity generation. Babelfy [Moro et al.2014], Entityclassifier.eu [Dojchinovski and Kliegr2013], DBpedia Spotlight [Mendes et al.2011] and TagMe2 [Ferragina and Scaiella2010] spot noun chunks and filter them with dictionaries, often derived from Wikipedia. Stanford NER [Manning et al.2014] and LingPipe utilize discriminative tagging approaches. FOX [Speck and Ngomo2014] and NERD-ML [Van Erp et al.2013] combine several approaches in an ensemble learner to enhance precision. The GENIA tagger is specifically tuned for biomedical text; it is trained on the GENIA-based BioNLP/NLPBA 2004 data set [Kim et al.2004], which includes named entity recognition for biomedical text. The biomedical NER system of Zhou and Su (2004) is built using an HMM and an additional SVM with sigmoid output; it uses lexical-level features, e.g. word formation and morphological patterns, and utilizes dictionaries. The system of Finkel et al. (2004) uses a MEMM. Settles (2004) uses CRF classifiers with syntactic features and synset dictionaries. In principle, all of these systems could benefit from our work.

3 Robust Contextual Word Labeling

We abstract the task of NER as a sequential word labeling problem. Figure 1 illustrates the sequential transformation of a sentence into word labels. We express each sentence in a document as a sequence of words w = (w_1, ..., w_T), e.g. w_1 = Aspirin. We define a mention as the longest possible span of adjacent tokens that refer to an entity or relevant concept of a real-world object, such as Aspirin (ASA). We further assume that mentions are non-recursive and non-overlapping. To encode the boundaries of a mention span, we adopt the idea of Ramshaw and Marcus (1995), which was standardized as BIO2 in the CoNLL2003 shared task [Tjong Kim Sang and De Meulder2003]. We assign labels to each token to mark begin, inside and outside of a mention from left to right. We use the input sequence together with a target sequence of the same length that contains a BIO2 label y_t ∈ {B, I, O} for each word, e.g. y_1 = B. To predict the most likely label y_t of a token with regard to its context, we utilize recurrent neural networks.
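The BIO2 labeling of mention spans can be sketched as follows (a minimal illustration; the token-index span representation and function name are ours, not from the paper):

```python
def bio2_encode(tokens, mentions):
    """Convert non-overlapping mention spans into BIO2 labels.

    mentions: list of (start, end) token index pairs, end exclusive.
    Every token starts as outside (O); the first token of each mention
    becomes B, all following tokens of that mention become I.
    """
    labels = ["O"] * len(tokens)
    for start, end in mentions:
        labels[start] = "B"
        for j in range(start + 1, end):
            labels[j] = "I"
    return labels

# e.g. a single mention "Aspirin" at the sentence start:
print(bio2_encode(["Aspirin", "has", "an", "antiplatelet", "effect", "."], [(0, 1)]))
# → ['B', 'O', 'O', 'O', 'O', 'O']
```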

3.1 Robust Word Encoding Methods

We have shown that the most common sources of recall loss are misspellings, POS errors, capitalization, unseen words and irregular context. Therefore we generalize our model across three layers: robust word encoding, in-sentence word context and contextual sequence labeling.


Figure 1: Architecture of the LSTM network used for named entity recognition. The character stream “Aspirin has an antiplatelet effect.” is tokenized into words and converted into word vectors using letter-trigram hashing. These vectors are propagated through a recurrent neural network with four layers: (1) feed-forward encoding of word vectors, (2+3) bidirectional LSTM layers for context representation using forward and backward passes, (4) LSTM decoder layer for context-sensitive label prediction. The output labels follow the BIO2 standard and represent mention begin (B), inside (I) and outside (O) per token.

Letter-trigram word hashing to overcome spelling errors.

Dictionary-based word vectorization methods suffer from sparse training sets, especially in the case of non-verbatim mentions, rare words, typing and capitalization errors. For example, the word2vec model of Mikolov et al. (2013) generalizes insufficiently for rare words in idiosyncratic domains or for misspelled words, since no vector representation is learned for these words at training time. In the GENIA data set, we notice 27% unseen words (dictionary misses) in the pretrained word2vec model (GoogleNews-vectors-negative300 embeddings, 3.6 GB). As training data generation is expensive, we investigate a generic approach for the generation of word vectors. We use letter-trigram word hashing as introduced by Huang et al. (2013). This technique goes beyond words and generates word vectors as a composite of discriminative three-letter "syllables", which may also include misspellings. It is therefore robust against dictionary misses and has the advantage (despite its name) of grouping syntactically similar words in similar vector spaces. We compare this approach to word embedding models such as word2vec.
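A letter-trigram encoder along these lines can be sketched as follows (a simplified illustration; the boundary marker and the count-vector layout are our assumptions, not the exact scheme of Huang et al. 2013):

```python
from collections import Counter

def letter_trigrams(word):
    """Split a word into overlapping three-letter "syllables",
    padded with boundary markers (#)."""
    padded = "#" + word.lower() + "#"
    return [padded[k:k + 3] for k in range(len(padded) - 2)]

def trigram_vector(word, vocab):
    """Count vector of a word over a fixed trigram vocabulary.
    A dictionary miss affects only single trigram dimensions,
    never the whole vector."""
    counts = Counter(letter_trigrams(word))
    return [counts[t] for t in vocab]
```

For example, `letter_trigrams("aspirin")` yields #as, asp, spi, pir, iri, rin, in#; a misspelled variant still shares most of these trigrams with the correct spelling.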

Surface form features for word vector robustness.

The most important features for NER are word shape properties, such as length, initial capitalization, all-word uppercase, in-word capitalization and the use of numbers or punctuation [Ling and Weld2012]. Mixed-case word encodings implicitly include capitalization features. However, this approach impedes generalization, as words appear in various surface forms, e.g. capitalized at the beginning of sentences, uppercase in headlines, lowercase in social media text. The strong coherence between uppercase and lowercase characters (they may have identical semantics) is not encoded in the embedding. Therefore, we encode words using lowercase letter-trigrams. To keep the surface information, we add flag bits to the vector that indicate initial capitalization, uppercase, lowercase or mixed case.
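The surface-form flag bits can be sketched as follows (our own minimal encoding of the four cases named above):

```python
def surface_flags(word):
    """Flag bits for initial capitalization, all-uppercase, all-lowercase
    and mixed case. Appended to the lowercase letter-trigram vector, they
    preserve surface information that lowercasing would otherwise discard."""
    init_cap = word[:1].isupper() and word[1:].islower()
    all_upper = word.isupper()
    all_lower = word.islower()
    mixed = not (init_cap or all_upper or all_lower)
    return [int(init_cap), int(all_upper), int(all_lower), int(mixed)]
```

For example, a headline token like FED sets the uppercase bit, while an idiosyncratic token like anti-CD15 falls into the mixed-case class.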

3.2 Deep Contextual Sequence Learning

With sparse training data in an idiosyncratic domain, we expect input data with high variance. Therefore, we require strong generalization for the syntactic and semantic representation of language. To reach the high 80–90% NER F1 performance range, long-range context-sensitive information is indispensable. We apply the computational model of recurrent neural networks, in particular long short-term memory networks (LSTMs) [Hochreiter and Schmidhuber1997, Gers et al.2002], to the problem of sequence labeling. Like feed-forward neural networks, LSTMs are able to learn complex parameters using gradient descent, but they include additional recurrent connections between cells that influence weight updates over adjacent time steps. With their ability to memorize and forget over time, LSTMs have proven to generalize well over context-sensitive sequential data [Graves2012, Lipton and Berkowitz2015].

Figure 1 shows an unfolded representation of the steps through a sentence. We feed the LSTM with letter-trigram vectors as input data, one word at a time. The hidden layer of the LSTM represents context from long range dependencies over the entire sentence from left to right. However, to achieve deeper contextual understanding over the boundaries of multi-word annotations and at the beginning of sentences, we require a backwards pass through the sentence. We therefore implement a bidirectional LSTM and feed the output of both directions into a second LSTM layer for combined label prediction.

Bidirectional sequence learning.

For use in the neural network, word encodings and labels are real-valued vectors. To predict the most likely label y_t of a token, we utilize an LSTM with input nodes g_t, input gates i_t, forget gates f_t, output gates o_t and internal state s_t. For the bidirectional case, all gates are duplicated and combined into a forward state and a backward state. The network is trained using backpropagation through time (BPTT) by adapting weight and bias parameters to fit the training examples. We iterate over labeled sentences in mini-batches and update the weights accordingly. The network is then used to predict label probabilities P(y_t | w_1, ..., w_T) for unseen word sequences.
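A single LSTM time step with these gates can be sketched in plain Python (an illustrative forward pass only; the parameter layout and names are ours, not Deeplearning4j's):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step over plain Python lists.

    W, U, b hold the parameters for the four parts, keyed "g" (input node),
    "i" (input gate), "f" (forget gate) and "o" (output gate); W[k] is a
    list of rows (one per cell), U[k] likewise, b[k] a list of biases.
    """
    def affine(k, j):
        # pre-activation of cell j for part k: W[k][j]·x + U[k][j]·h_prev + b[k][j]
        return (sum(w * xi for w, xi in zip(W[k][j], x))
                + sum(u * hi for u, hi in zip(U[k][j], h_prev))
                + b[k][j])

    n = len(h_prev)
    g = [math.tanh(affine("g", j)) for j in range(n)]  # candidate input node
    i = [sigmoid(affine("i", j)) for j in range(n)]    # input gate
    f = [sigmoid(affine("f", j)) for j in range(n)]    # forget gate
    o = [sigmoid(affine("o", j)) for j in range(n)]    # output gate
    c = [f[j] * c_prev[j] + i[j] * g[j] for j in range(n)]  # internal state
    h = [o[j] * math.tanh(c[j]) for j in range(n)]          # hidden output
    return h, c
```

A bidirectional layer runs such steps over the sentence once left-to-right and once right-to-left and concatenates the two hidden states per token before label prediction.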

3.3 Implementation of NER Components

To show the impact of our bidirectional LSTM model, we measure annotation performance with three different neural network configurations. We implement all components using the Deeplearning4j framework (version 0.4-rc3.9-SNAPSHOT). For preprocessing (sentence and word tokenization), we use Stanford CoreNLP version 3.6.0 [Manning et al.2014]. We test the sequence labeler using three input encodings:

  • DICT: We build a dictionary over all words in the corpus and generate the input vector using 1-hot encoding for each word.

  • EMB: We use the GoogleNews word2vec embeddings, which encode each word as a vector of size 300.

  • TRI: We implement letter-trigram word hashing as described in Section 3.1.

During training and test, we group all tokens of a sentence as mini-batch. We evaluate three different neural network types to show the impact of the bidirectional sequence learner.


  • FF: As a baseline, we train a non-sequential feed-forward model based on a fully connected multilayer perceptron network with 3 hidden layers of size 150 with ReLU activation, feeding into a 3-class softmax classifier. We train the model using backpropagation with stochastic gradient descent and a learning rate of 0.005.

  • LSTM: We use a configuration of a single feed-forward layer of size 150 with two additional layers of single-direction LSTM with 20 cells and a 3-class softmax classifier. We train the model using backpropagation-through-time (BPTT) with stochastic gradient descent and a learning rate of 0.005.

  • BLSTM: Our final configuration consists of a single feed-forward layer of size 150 with one bidirectional LSTM layer with 20 cells and an additional single-direction LSTM with 20 cells, feeding into a 3-class softmax classifier. The BLSTM model is trained in the same way as the single-direction LSTM.
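The three configurations above can be summarized as hyperparameter dictionaries (the layer sizes and learning rate are taken from the text; the field names and layer-spec strings are our own illustrative notation):

```python
# Summary of the three evaluated network configurations.
# "dense:150" means a fully connected layer with 150 units,
# "lstm:20" / "bilstm:20" an (bi-)LSTM layer with 20 cells,
# "softmax:3" the 3-class BIO2 output classifier.
CONFIGS = {
    "FF":    {"layers": ["dense:150", "dense:150", "dense:150", "softmax:3"],
              "learning_rate": 0.005, "training": "SGD + backpropagation"},
    "LSTM":  {"layers": ["dense:150", "lstm:20", "lstm:20", "softmax:3"],
              "learning_rate": 0.005, "training": "SGD + BPTT"},
    "BLSTM": {"layers": ["dense:150", "bilstm:20", "lstm:20", "softmax:3"],
              "learning_rate": 0.005, "training": "SGD + BPTT"},
}
```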

4 Evaluation

We evaluate nine configurations of our model on five gold standard evaluation data sets. We show that the combination of letter-trigram word hashing with bidirectional LSTM yields the best results and outperforms sequence learners based on dictionaries or word2vec. To highlight the generalization of our model to idiosyncratic domains, we run tests on common-typed data sets as well as on specialized medical documents. We compare our system on these data sets with specialized state-of-the-art systems.

4.1 Evaluation Set Up

We train two models with identical parameterization, each with 2000 randomly chosen labeled sentences from a standard data set. To show the effectiveness of the components, we evaluate different configurations of this setting on 2000 random sentences from the remaining set. The models were trained using Deeplearning4j with the nd4j-x86 backend. Training the TRI+BLSTM configuration on a commodity Intel i7 notebook with 4 cores at 2.8 GHz takes approximately 50 minutes.

Evaluation data sets.

Table 1 gives an overview of the standard data sets we use for training. The GENIA corpus [Ohta et al.2002] contains biomedical abstracts from the PubMed database. We use GENIA technical term annotations 3.02, which cover linguistic expressions referring to entities of interest in molecular biology, e.g. proteins, genes and cells. CoNLL2003 [Tjong Kim Sang and De Meulder2003] is a standard NER dataset based on the Reuters RCV-1 news corpus. It covers named entities of the types person, location, organization and misc.

For testing the overall annotation performance, we utilize CoNLL2003-testA and a 50 document split from GENIA. Additionally, we test on the complete KORE50 [Hoffart et al.2012], ACE2004 [Mitchell et al.2005] and MSNBC data sets using the GERBIL evaluation framework [Usbeck et al.2015].


Table 1: Overview of CoNLL2003 and GENIA training datasets and sizes of word encodings. We use 2000 sentences of each set for training.

4.2 Measurements

We measure precision, recall and F1 score of our DATEXIS-NER system and state-of-the-art annotators introduced in Section 2. For the comparison with black box systems, we evaluate annotation results using weak annotation match. For a more detailed in-system error analysis, we measure BIO2 labeling performance based on each token.


Table 2: Comparison of annotators trained for common English news texts (micro-averaged scores on match per annotation span). The table shows micro-precision, recall and NER-style F1 for CoNLL2003, KORE50, ACE2004 and MSNBC datasets.

Measuring annotation performance using NER-style F1.

We measure the overall performance of mention annotation using the evaluation measures defined by Cornolti et al. (2013), which are also used by Ling et al. (2015). Let D be a set of documents with gold standard mention annotations G_d, with a total of N examples. Each mention m is defined by its start position start(m) and end position end(m) in the source document d. To quantify the performance of the system, we compare G_d to the set of predicted annotations P_d with mentions p:

  tp = { p ∈ P_d | ∃ g ∈ G_d : match(p, g) }

We compare using a weak annotation match, which counts a predicted mention as correct if its span overlaps a gold mention:

  match_weak(p, g) ⇔ start(p) ≤ end(g) ∧ end(p) ≥ start(g)

We measure micro-averaged precision (P_micro), recall (R_micro) and NER-style F1 (F1_micro) score:

  P_micro = |tp| / Σ_d |P_d|    R_micro = |tp| / Σ_d |G_d|    F1_micro = 2 · P_micro · R_micro / (P_micro + R_micro)
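The weak-match micro scores can be computed as follows (a sketch over (start, end) character spans with exclusive ends; the dictionary layout and function names are ours):

```python
def overlaps(a, b):
    """Weak annotation match: two (start, end) spans overlap (end exclusive)."""
    return a[0] < b[1] and b[0] < a[1]

def micro_prf(gold_by_doc, pred_by_doc):
    """Micro-averaged precision, recall and NER-style F1 under weak match.

    gold_by_doc / pred_by_doc map a document id to a list of mention spans.
    """
    tp_p = sum(1 for d in pred_by_doc for p in pred_by_doc[d]
               if any(overlaps(p, g) for g in gold_by_doc.get(d, [])))
    tp_r = sum(1 for d in gold_by_doc for g in gold_by_doc[d]
               if any(overlaps(g, p) for p in pred_by_doc.get(d, [])))
    n_pred = sum(len(v) for v in pred_by_doc.values())
    n_gold = sum(len(v) for v in gold_by_doc.values())
    prec = tp_p / n_pred if n_pred else 0.0
    rec = tp_r / n_gold if n_gold else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1
```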
Measuring BIO2 labeling performance.

Tuning the model configuration with the annotation match measurement is not always feasible. We therefore measure true positives tp_c, false positives fp_c, true negatives tn_c and false negatives fn_c separately for each label class c ∈ {B, I, O} in our classification model and calculate binary classification precision P_c = tp_c / (tp_c + fp_c), recall R_c = tp_c / (tp_c + fn_c) and F1_c scores. To avoid skewed results from the expectedly large O class, we use macro-averaging over the three classes:

  P_macro = (1/3) Σ_c P_c    R_macro = (1/3) Σ_c R_c    F1_macro = (1/3) Σ_c F1_c
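The macro-averaged BIO2 scores can be computed per token label as follows (a sketch; treating empty denominators as zero is our choice):

```python
def macro_prf(gold_labels, pred_labels, classes=("B", "I", "O")):
    """Per-class precision/recall/F1 on token labels, macro-averaged
    over the B, I and O classes to avoid skew from the large O class."""
    p_sum = r_sum = f_sum = 0.0
    for c in classes:
        tp = sum(1 for g, p in zip(gold_labels, pred_labels) if g == c and p == c)
        fp = sum(1 for g, p in zip(gold_labels, pred_labels) if g != c and p == c)
        fn = sum(1 for g, p in zip(gold_labels, pred_labels) if g == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        p_sum += prec
        r_sum += rec
        f_sum += f1
    k = len(classes)
    return p_sum / k, r_sum / k, f_sum / k
```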
4.3 Evaluation Results

We now discuss the evaluation of our DATEXIS-NER system on common and idiosyncratic data.

Overall model performance on common types.

Table 2 shows the comparison of DATEXIS-NER with eight state-of-the-art annotators on four common news data sets. Both the common and the medical model are configured identically and trained on only 2000 labeled sentences, without any external prior knowledge. We observe that DATEXIS-NER achieves the highest recall scores of all tested annotators, with 95%–98% on all measured data sets. Moreover, DATEXIS-NER precision scores are equal to or better than the median. Overall, we achieve high micro-F1 scores of 84%–94% on news entity recognition, which is slightly better than ontology-based NER and reveals better generalization than the 3-type Stanford NER with distributional semantics. We notice that systems specialized in word-sense disambiguation (Babelfy, DBpedia Spotlight) do not perform well on "raw" untyped entity recognition tasks. The highest precision scores are reached by Stanford NER. We also notice a low precision of all annotators on the ACE2004 dataset and a high variance in MSNBC performance, which are probably caused by differing annotation standards.

Biomedical recognition performance.

Table 3 shows the results of biomedical entity recognition compared to the participants of the JNLPBA 2004 bio-entity recognition task [Kim et al.2004]. We notice that for these well-written Medline abstracts, there is no strong skew between precision and recall. Our DATEXIS-NER system outperforms the HMM, MEMM, CRF and CDN based models with a micro-F1 score of 84%. However, the highly specialized GENIA chunker for LingPipe achieves higher scores. This chunker is a very simple generative model predictor based on a sliding window of two tokens, word shape and dictionaries. We interpret this score as strong overfitting towards a dictionary of the well-defined GENIA terms; given its simplicity, the model will hardly generalize. We can confirm this presumption on the common data sets, where the MUC-6-trained HMM LingPipe chunker shows only average performance on unseen data.


Table 3: Comparison of annotators trained for biomedical text. The table shows NER annotation results for 50 documents from the GENIA dataset.

Evaluation of system components.

We evaluate different configurations of the components that we describe in Section 3.3. Table 4 shows the results of experiments on both CoNLL2003 and GENIA data sets. We report the highest macro-F1 scores for BIO2 labeling for the configuration of letter-trigram word vectors and bidirectional LSTM. We notice that dictionary-based word encodings (DICT) work well for idiosyncratic medical domains, whereas they suffer from high word ambiguity in the news texts. Pretrained word2vec embeddings (EMB) perform well on news data, but cannot adapt to the medical domain without retraining, because of a large number of unseen words. Therefore, word2vec generally achieves a high precision on news texts, but low recall on medical text. The letter-trigram approach (TRI) combines both word vector generalization and robustness towards idiosyncratic language.

We observe that the contextual LSTM model achieves scores throughout in the 85%–94% range and significantly outperforms the feed-forward (FF) baseline that shows a maximum of 75%. Bidirectional LSTMs can further improve label classification in both precision and recall.


Table 4: Comparison of nine configurations from our implementation (macro-averaged scores on BIO2 classification per token).

4.4 Discussion and Error Analysis

We investigate different aspects of the DATEXIS-NER components by manual inspection of classification errors in the context of the document. For the error classes described in the introduction (false negative detections, false positives and invalid boundaries), we observe the following causes:

Unseen words and misspellings.

In dictionary-based configurations (e.g. the 1-hot word vector encoding DICT), we observe false negative predictions caused by dictionary misses for words that do not exist in the training data. Causes can be rare, unseen or novel words (e.g. T-prolymphocytic cells) or misspellings (e.g. strengthnend). These words yield a null vector from the encoder and can therefore not be distinguished by the LSTM. The error increases when using word2vec, because these models are trained with stop words filtered out. This implies that mentions surrounded by or containing a determiner (e.g. The Sunday Telegraph quoted Majorie Orr) are highly error-prone with respect to the detection of their boundaries. We resolve this error with the letter-trigram approach. Unseen trigrams (e.g. thh) may still be missing from the word vector, but they affect only single dimensions as opposed to the vector as a whole.
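This robustness can be illustrated by comparing the trigram sets of a word and its misspelling, using the letter-trigram split described in Section 3.1 (the Jaccard similarity measure is our own illustration, not part of the system):

```python
def letter_trigrams(word):
    """Boundary-padded letter trigrams, as used for word hashing."""
    padded = "#" + word.lower() + "#"
    return {padded[k:k + 3] for k in range(len(padded) - 2)}

def trigram_similarity(a, b):
    """Jaccard overlap of the trigram sets of two words: a misspelling
    still shares most trigrams with the correct spelling, while a
    dictionary lookup would miss it entirely."""
    ta, tb = letter_trigrams(a), letter_trigrams(b)
    return len(ta & tb) / len(ta | tb)
```

A misspelling such as monoyctes still shares several trigrams with monocytes, whereas an unrelated word shares none.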

Misleading surface form features.

Surface forms encode important features for NER (e.g. the capitalization of "new" in Alan Shearer was named as the new England captain vs. New York beat the Angels). However, case-sensitive word vectorization methods yield a large number of false positive predictions caused by incorrect capitalization in the input data. An uppercase headline (e.g. TENNIS - U.S. TEAM ON THE ROAD FOR 1997 FED CUP) is encoded completely differently from a lowercase one (e.g. U.S. team on the road for Fed Cup). Because of that, we achieve the best results with lowercase word vectors and additional surface form feature flags, as described in Section 3.1.

Syntagmatic word relations.

We observe mentions that are composed of co-occurring words with high ambiguity (e.g. degradation of IkB alpha in T cell lines). These groups encode strong syntagmatic word relations [Sahlgren2008] that can be leveraged to resolve word sense and homonyms from sentence context. Therefore, correct boundaries in these groups can effectively be identified only with contextual models such as LSTMs.

Paradigmatic word relations.

Orthogonal to the previous problem, different words in a paradigmatic relation [Sahlgren2008] can occur in the same context (e.g. cyclosporin A-treated cells / HU treated cells). These groups are efficiently represented in word2vec. However, letter-trigram vectors cannot encode paradigmatic groups and therefore require a larger training sample to capture these relations.

Context boundaries.

Often, synonyms can only be resolved with regard to a larger document context than the local sentence context known to the LSTM. In these cases, word sense is redefined by a topic model local to the paragraph (e.g. sports: Tiger was lost in the woods after divorce.). This problem does not heavily affect NER recall, but it is crucial for named entity disambiguation and coreference resolution.


The proposed DATEXIS-NER model is restricted to recognizing boundaries of generic mentions in text. We evaluate the model on annotations of isolated types (e.g. persons, organizations, locations) for comparison purposes only; we do not approach NER-style typing. Instead, we aim to detect mentions without type information. The detection of specific types can be realized by training multiple independent models on a selection of labels per type and nesting the resulting annotations using a longest-span semantic type heuristic [Kholghi et al.2015].

5 Summary

Ling et al. (2015) show that the task of NER is not clearly defined and rather depends on a specific problem context. In contrast, most NER approaches are specifically trained on fixed datasets in a batch mode. Worse, they often suffer from poor recall [Pink et al.2014]. Ideally, one could personalize the task of recognizing named entities, concepts or phrases according to the specific problem. "Personalizing" and adapting such annotators should require very limited human labeling effort, in particular for idiosyncratic domains with sparse training data.

Our work follows this line. From our results we report F1 scores between 84% and 94% when using bidirectional multi-layered LSTMs, letter-trigram word hashing and surface form features on only a few hundred training examples.

This work is only a preliminary step towards the vision of personalizing annotation guidelines for NER [Ling et al.2015]. In our future work, we will focus on additional important idiosyncratic domains, such as health, life science, fashion, engineering or automotive. For these domains, we will consider the process of detecting mentions and linking them to an ontology as a joint task and we will investigate simple and interactive workflows for creating robust personalized named entity linking systems.


Our work is funded by the Federal Ministry of Economic Affairs and Energy (BMWi) under grant agreement 01MD15010B (Project: Smart Data Web).


  • [Bengio et al.2003] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A Neural Probabilistic Language Model. Journal of Machine Learning Research, 3:1137–1155.
  • [Collins2002] Michael Collins. 2002. Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms. In EMNLP’02, pages 1–8, Stroudsburg, PA, USA. ACL.
  • [Cornolti et al.2013] Marco Cornolti, Paolo Ferragina, and Massimiliano Ciaramita. 2013. A Framework for Benchmarking Entity-Annotation Systems. In WWW’13, pages 249–260. ACM.
  • [Dojchinovski and Kliegr2013] Milan Dojchinovski and Tomáš Kliegr. 2013. Entityclassifier. eu: Real-Time Classification of Entities in Text with Wikipedia. In Machine Learning and Knowledge Discovery in Databases, pages 654–658. Springer.
  • [Ferragina and Scaiella2010] Paolo Ferragina and Ugo Scaiella. 2010. TAGME: On-the-fly Annotation of Short Text Fragments (by Wikipedia Entities). In CIKM’10, pages 1625–1628, New York, NY, USA. ACM.
  • [Finkel et al.2004] Jenny Finkel, Shipra Dingare, Huy Nguyen, Malvina Nissim, Christopher Manning, and Gail Sinclair. 2004. Exploiting Context for Biomedical Entity Recognition: from Syntax to the Web. In JNLPBA’04, pages 88–91. ACL.
  • [Gers et al.2002] Felix Gers, Juan Antonio Perez-Ortiz, Douglas Eck, and Jürgen Schmidhuber. 2002. Learning Context Sensitive Languages with LSTM Trained with Kalman Filters.

  • [Graves2012] Alex Graves. 2012. Supervised Sequence Labelling with Recurrent Neural Networks, volume 385. Springer, Berlin Heidelberg.
  • [Hachey et al.2013] Ben Hachey, Will Radford, Joel Nothman, Matthew Honnibal, and James R. Curran. 2013. Evaluating Entity Linking with Wikipedia. Artificial intelligence, 194:130–150.
  • [Hochreiter and Schmidhuber1997] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation, 9(8):1735–1780.
  • [Hoffart et al.2012] Johannes Hoffart, Stephan Seufert, Dat Ba Nguyen, Martin Theobald, and Gerhard Weikum. 2012. KORE: Keyphrase Overlap Relatedness for Entity Disambiguation. In CIKM’12, pages 545–554. ACM.
  • [Huang et al.2013] Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. 2013. Learning Deep Structured Semantic Models for Web Search using Clickthrough Data. In CIKM’13, pages 2333–2338. ACM.
  • [Kholghi et al.2015] Mahnoosh Kholghi, Laurianne Sitbon, Guido Zuccon, and Anthony Nguyen. 2015. External Knowledge and Query Strategies in Active Learning: a Study in Clinical Information Extraction. In CIKM’15, pages 143–152. ACM.
  • [Kim et al.2004] Jin-Dong Kim, Tomoko Ohta, Yoshimasa Tsuruoka, Yuka Tateisi, and Nigel Collier. 2004. Introduction to the Bio-Entity Recognition Task at JNLPBA. In JNLPBA’04, pages 70–75. ACL.
  • [Ling and Weld2012] Xiao Ling and Daniel S. Weld. 2012. Fine-Grained Entity Recognition. In AAAI’12.
  • [Ling et al.2015] Xiao Ling, Sameer Singh, and Daniel S Weld. 2015. Design Challenges for Entity Linking. ACL’15, 3:315–328.
  • [Lipton and Berkowitz2015] Zachary C. Lipton and John Berkowitz. 2015. A Critical Review of Recurrent Neural Networks for Sequence Learning. arXiv:1506.00019 [cs.LG].
  • [Manning et al.2014] Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP Natural Language Processing Toolkit. In ACL System Demonstrations, pages 55–60.
  • [McCallum and Li2003] Andrew McCallum and Wei Li. 2003. Early Results for Named Entity Recognition with Conditional Random Fields, Feature Induction and Web-enhanced Lexicons. In CoNLL’03, pages 188–191, Stroudsburg, PA, USA. ACL.
  • [Mendes et al.2011] Pablo N Mendes, Max Jakob, Andrés Garcia-Silva, and Christian Bizer. 2011. DBpedia Spotlight: Shedding Light on the Web of Documents. In I-Semantics 2011, pages 1–8. ACM.
  • [Mikolov et al.2013] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781 [cs.CL].
  • [Mitchell et al.2005] Alexis Mitchell, Stephanie Strassel, Shudong Huang, and Ramez Zakhary. 2005. ACE 2004 Multilingual Training Corpus. LDC, Philadelphia, 1:1–1.
  • [Moro et al.2014] Andrea Moro, Alessandro Raganato, and Roberto Navigli. 2014. Entity Linking meets Word Sense Disambiguation: a Unified Approach. ACL’14, 2:231–244.
  • [Ohta et al.2002] Tomoko Ohta, Yuka Tateisi, and Jin-Dong Kim. 2002. The GENIA Corpus: An Annotated Research Abstract Corpus in Molecular Biology Domain. In International Conference on Human Language Technology Research 2002, pages 82–86. Morgan Kaufmann Publishers Inc.
  • [Pink et al.2014] Glen Pink, Joel Nothman, and James R. Curran. 2014. Analysing Recall Loss in Named Entity Slot Filling. In EMNLP’14, pages 820–830, Doha, Qatar. ACL.
  • [Prokofyev et al.2014] Roman Prokofyev, Gianluca Demartini, and Philippe Cudré-Mauroux. 2014. Effective Named Entity Recognition for Idiosyncratic Web Collections. In WWW’14, pages 397–408, Geneva, Switzerland. IW3C2.
  • [Ramshaw and Marcus1995] Lance A. Ramshaw and Mitchell P. Marcus. 1995. Text Chunking Using Transformation-Based Learning. In WVLC’95. ACL.
  • [Sahlgren2008] Magnus Sahlgren. 2008. The Distributional Hypothesis. Italian Journal of Linguistics, 20(1):33–54.
  • [Settles2004] Burr Settles. 2004. Biomedical Named Entity Recognition Using Conditional Random Fields and Rich Feature Sets. In JNLPBA’04, pages 104–107. ACL.
  • [Shen et al.2015] Wei Shen, Jianyong Wang, and Jiawei Han. 2015. Entity Linking with a Knowledge Base: Issues, Techniques, and Solutions. IEEE Transactions on Knowledge and Data Engineering, 27(2):443–460, February.
  • [Speck and Ngomo2014] René Speck and Axel-Cyrille Ngonga Ngomo. 2014. Ensemble Learning for Named Entity Recognition. In ISWC’14, pages 519–534. Springer.
  • [Tjong Kim Sang and De Meulder2003] Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition. In CoNLL’03, pages 142–147. ACL.
  • [Usbeck et al.2015] Ricardo Usbeck, Michael Röder, Axel-Cyrille Ngonga Ngomo, Ciro Baron, Andreas Both, Martin Brümmer, Diego Ceccarelli, Marco Cornolti, Didier Cherix, Bernd Eickmann, Paolo Ferragina, Christiane Lemke, Andrea Moro, Roberto Navigli, Francesco Piccinno, Giuseppe Rizzo, Harald Sack, René Speck, Raphaël Troncy, Jörg Waitelonis, and Lars Wesemann. 2015. GERBIL: General Entity Annotator Benchmarking Framework. In WWW’15, pages 1133–1143, Geneva, Switzerland. IW3C2.
  • [Van Erp et al.2013] Marieke Van Erp, Giuseppe Rizzo, and Raphaël Troncy. 2013. Learning with the Web: Spotting Named Entities on the Intersection of NERD and Machine Learning. In #MSM’13, pages 27–30, Rio de Janeiro, Brazil. ACM.
  • [Zhou and Su2004] GuoDong Zhou and Jian Su. 2004. Exploring Deep Knowledge Resources in Biomedical Name Recognition. In JNLPBA’04, pages 96–99. ACL.