Named Entity Recognition (NER) constitutes a core component of many NLP pipelines and is employed in a broad range of applications such as information extraction Raiman and Raiman (2018), question answering Mollá et al. (2006), document de-identification Stubbs et al. (2015), machine translation Ugawa et al. (2018) and even conversational models Ghazvininejad et al. (2018). Given a document, the goal of NER is to identify and classify spans referring to entities belonging to pre-specified categories such as persons, organisations or geographical locations.
NER models often rely on convolutional or recurrent neural architectures, sometimes completed by a CRF layer Chiu and Nichols (2016); Lample et al. (2016); Yadav and Bethard (2018). More recently, deep contextualised representations relying on bidirectional LSTMs Peters et al. (2018), transformers Devlin et al. (2019); Yan et al. (2019) or contextual string embeddings Akbik et al. (2019) have also been shown to achieve state-of-the-art performance on NER tasks.
These neural architectures require large corpora annotated with named entities, such as Ontonotes Weischedel et al. (2011) or CoNLL 2003 Tjong Kim Sang and De Meulder (2003). When only modest amounts of training data are available, transfer learning approaches can transfer the knowledge acquired from related tasks into the target domain, using techniques such as simple transfer Rodriguez et al. (2018), discriminative fine-tuning Howard and Ruder (2018), adversarial transfer Zhou et al. (2019) or layer-wise domain adaptation approaches Yang et al. (2017); Lin and Lu (2018).
However, in many practical settings, we wish to apply NER to domains where we have no labelled data, making such transfer learning methods difficult to apply. This paper presents an alternative approach using weak supervision to bootstrap named entity recognition models without requiring any labelled data from the target domain. The approach relies on labelling functions that automatically annotate documents with named-entity labels. A hidden Markov model (HMM) is then trained to unify the noisy labelling functions into a single (probabilistic) annotation, taking into account the accuracy and confusions of each labelling function. Finally, a sequence labelling model is trained using a cross-entropy loss on this unified annotation.
As in other weak supervision frameworks, the labelling functions allow us to inject expert knowledge into the sequence labelling model, which is often critical when data is scarce or non-existent Hu et al. (2016); Wang and Poon (2018). New labelling functions can easily be inserted to leverage the knowledge sources at our disposal for a given textual domain. Furthermore, labelling functions can often be ported across domains, which is not the case for manual annotations, which must be redone for every target domain.
The contributions of this paper are as follows:
- A broad collection of labelling functions for NER, including neural models trained on various textual domains, gazetteers, heuristic functions, and document-level constraints.
- A novel weak supervision model suited for sequence labelling tasks and able to include probabilistic labelling predictions.
- An open-source implementation of these labelling functions and the aggregation model that can scale to large datasets (https://github.com/NorskRegnesentral/weak-supervision-for-NER).
2 Related Work
Unsupervised domain adaptation:
Unsupervised domain adaptation attempts to adapt knowledge from a source domain to predict new instances in a target domain which often has substantially different characteristics. Earlier approaches often adapt the feature space using pivots Blitzer et al. (2006, 2007); Ziser and Reichart (2017) to create domain-invariant representations of predictive features. Others learn low-dimensional transformations of the feature space Guo et al. (2009); Glorot et al. (2011); Chen et al. (2012); Yu and Jiang (2016); Barnes et al. (2018). Finally, some approaches divide the feature space into general and domain-dependent features Daumé III (2007). Multi-task learning can also improve cross-domain performance Peng and Dredze (2017).
Recently, Han and Eisenstein (2019) proposed domain-adaptive fine-tuning, where contextualised embeddings are first fine-tuned to both the source and target domains with a language modelling loss and subsequently fine-tuned on source domain labelled data. This approach outperforms several strong baselines trained on the target domain of the WNUT 2016 NER task Strauss et al. (2016).
Aggregation of annotations:
Approaches that aggregate annotations from multiple sources have largely concentrated on noisy crowd-sourced annotations, with some annotators possibly being adversarial. The Bayesian Classifier Combination approach of Kim and Ghahramani (2012) combines multiple independent classifiers using a linear combination of predictions. Hovy et al. (2013) learn a generative model able to aggregate crowd-sourced annotations and estimate the trustworthiness of annotators. Rodrigues et al. (2014) present an approach based on Conditional Random Fields (CRFs) whose model parameters are learned jointly using EM. Nguyen et al. (2017a) propose a hidden Markov model to aggregate crowd-sourced sequence annotations and find that explicitly modelling the annotator leads to improvements for POS-tagging and NER. Finally, Simpson and Gurevych (2019) propose a fully Bayesian approach to the problem of aggregating multiple sequential annotations, using variational EM to compute posterior distributions over the model parameters.
Weak supervision:
The aim of weakly supervised modelling is to reduce the need for hand-annotated data in supervised training. A particular instance of weak supervision is distant supervision, which relies on external resources such as knowledge bases to automatically label documents with entities that are known to belong to a particular category Mintz et al. (2009); Ritter et al. (2013); Shang et al. (2018). Ratner et al. (2017, 2019) generalised this approach with the Snorkel framework, which combines various supervision sources using a generative model to estimate the accuracy (and possible correlations) of each source. These aggregated supervision sources are then employed to train a discriminative model. Current frameworks are, however, not easily adaptable to sequence labelling tasks, as they typically require data points to be independent. One exception is the work of Wang and Poon (2018), which relies on deep probabilistic logic to perform joint inference on the full dataset. Finally, Fries et al. (2017) presented a weak supervision approach to NER in the biomedical domain. However, unlike the model proposed in this paper, their approach relies on an ad-hoc mechanism for generating candidate spans to classify.
The approach most closely related to this paper is Safranchik et al. (2020), who describe a similar weak supervision framework for sequence labelling based on an extension of HMMs called linked hidden Markov models. The authors introduce a new type of noisy rules, called linking rules, to determine how sequence elements should be grouped into spans of the same tag. The main differences between their approach and this paper are the linking rules, which are not employed here, and the choice of labelling functions, in particular the document-level relations detailed in Section 3.1.
The proposed approach is also loosely related to ensemble methods such as bagging, boosting and random forests Sagi and Rokach (2018). These methods rely on multiple classifiers run simultaneously, whose outputs are combined at prediction time. In contrast, our approach (as in other weak supervision frameworks) only requires the labelling functions to be aggregated once, as an intermediary step to create training data for the final model. This is a non-trivial difference, as running all labelling functions at prediction time is computationally costly due to the need to run multiple neural models along with gazetteers extracted from large knowledge bases.
3 Approach
The proposed model collects weak supervision from multiple labelling functions. Each labelling function takes a text document as input and outputs a series of spans associated with NER labels. These outputs are then aggregated using a hidden Markov model (HMM) with multiple emissions (one per labelling function) whose parameters are estimated in an unsupervised manner. Finally, the aggregated labels are employed to learn a sequence labelling model. Figure 1 illustrates this process. The process is performed on documents from the target domain, e.g. a corpus of financial news.
Labelling functions are typically specialised to detect only a subset of possible labels. For instance, a gazetteer based on Wikipedia will only detect mentions of persons, organisations and geographical locations, and will ignore entities such as dates or percentages. This marks a departure from existing aggregation methods, which were originally designed for crowd-sourced data where annotators are assumed to make use of the full label set. In addition, unlike previous weak supervision approaches, we allow labelling functions to produce probabilistic predictions instead of deterministic values. The aggregation model described in Section 3.2 directly captures these properties in the emission model associated with each labelling function.
We first briefly describe the labelling functions integrated into the current system. We review in Section 3.2 the aggregation model employed to combine the labelling predictions. The final labelling model is presented in Section 3.3. The complete list of 52 labelling functions employed in the experiments is available in Appendix A.
3.1 Labelling functions
Out-of-domain NER models
The first set of labelling functions are sequence labelling models trained on domains for which labelled data is available. In the experiments detailed in Section 4, we use four such models, respectively trained on Ontonotes Weischedel et al. (2011), CoNLL 2003 Tjong Kim Sang and De Meulder (2003) (this model is of course deactivated for the experimental evaluation on CoNLL 2003), the Broad Twitter Corpus Derczynski et al. (2016) and an NER-annotated corpus of SEC filings Salinas Alvarado et al. (2015).
For the experiments in this paper, all aforementioned models rely on a transition-based NER architecture Lample et al. (2016) which extracts features with a stack of four convolutional layers with filter size three and residual connections. The model uses attention features and a multi-layer perceptron to select the next transition. It is initialised with GloVe embeddings Pennington et al. (2014) and implemented in spaCy Honnibal and Montani (2017). However, the proposed approach does not impose any constraints on the model architecture, and alternative approaches based on e.g. contextualised embeddings can also be employed.
Gazetteers
As in distant supervision approaches, we include a number of gazetteers from large knowledge bases to identify named entities. Concretely, we use resources from Wikipedia Geiß et al. (2018), Geonames Wick (2015), the Crunchbase Open Data Map and DBPedia Lehmann et al. (2015), along with lists of countries, languages, nationalities and religious or political groups.
To efficiently search for occurrences of these entities in large text collections, we first convert each knowledge base into a trie data structure. Prefix search is then applied to extract matches (using both case-sensitive and case-insensitive mode, as they have distinct precision-recall trade-offs).
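As a minimal sketch of this prefix-search strategy, the following example builds a nested-dict trie over token sequences and extracts the longest match at each position; the toy knowledge base, labels and whitespace tokenisation are illustrative assumptions, not the actual implementation:

```python
def build_trie(entries):
    """Build a nested-dict trie from (token-tuple, label) pairs."""
    root = {}
    for tokens, label in entries:
        node = root
        for tok in tokens:
            node = node.setdefault(tok, {})
        node["_label"] = label  # marks the end of a full entity name
    return root

def find_matches(tokens, trie):
    """Return (start, end, label) for the longest match at each position."""
    matches = []
    for i in range(len(tokens)):
        node, longest = trie, None
        for j in range(i, len(tokens)):
            if tokens[j] not in node:
                break
            node = node[tokens[j]]
            if "_label" in node:
                longest = (i, j + 1, node["_label"])
        if longest:
            matches.append(longest)
    return matches

trie = build_trie([(("New", "York"), "LOC"), (("New", "York", "Times"), "ORG")])
print(find_matches("The New York Times building".split(), trie))  # [(1, 4, 'ORG')]
```

A case-insensitive variant would simply lower-case both the trie keys and the input tokens, at the cost of some precision, which matches the trade-off mentioned above.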
Heuristic functions
We also include various heuristic functions, each specialised in the recognition of specific types of named entities. Several functions are dedicated to the recognition of proper names based on casing, part-of-speech tags or dependency relations. In addition, we integrate a variety of handcrafted functions relying on regular expressions to detect occurrences of various entities (see Appendix A for details). A probabilistic parser specialised in the recognition of dates, times, money amounts, percentages and cardinal/ordinal values Braun et al. (2017) is also incorporated.
Document-level relations
All labelling functions described above rely on local decisions over tokens or phrases. However, texts are not loose collections of words; they exhibit a high degree of internal coherence Grosz and Sidner (1986); Grosz et al. (1995), which can be exploited to further improve the annotations.
In particular, named entities occurring multiple times in a document have a high probability of belonging to the same category. For instance, while Komatsu may refer to either a Japanese town or a multinational corporation, a text including this mention will either be about the town or the company, but rarely both at the same time. To capture these non-local dependencies, we define the following label consistency model: given a text span occurring in a given document, we look for all other spans in the document that contain the same string. The (probabilistic) output of the labelling function then corresponds to the relative frequency of each label for that string in the document.
This relative frequency depends on a label distribution over the other spans, which can be defined on the basis of other labelling functions. Alternatively, a two-stage model similar to Krishnan and Manning (2006) could be employed to first aggregate local labelling functions and subsequently apply document-level functions on the aggregated predictions.
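To make the label-consistency function concrete, here is a minimal sketch of the relative-frequency computation; the spans and labels below are illustrative, and a flat list of (string, label) pairs stands in for the span labels produced by another labelling function:

```python
from collections import Counter

def label_consistency(spans):
    """spans: list of (string, label) pairs observed in one document.
    Returns, for each string, the relative frequency of each label."""
    counts = {}
    for text, label in spans:
        counts.setdefault(text, Counter())[label] += 1
    return {text: {lab: n / sum(c.values()) for lab, n in c.items()}
            for text, c in counts.items()}

# Toy document: "Komatsu" labelled twice as ORG and once as LOC elsewhere.
doc_spans = [("Komatsu", "ORG"), ("Komatsu", "ORG"), ("Komatsu", "LOC")]
print(label_consistency(doc_spans)["Komatsu"])  # ORG ≈ 0.667, LOC ≈ 0.333
```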
Another insight from Grosz and Sidner (1986) is the importance of the attentional structure. When introduced for the first time, named entities are often referred to in an explicit and univocal manner, while subsequent mentions (once the entity is part of the focus structure) frequently rely on shorter references. The first mention of a person in a given text is, for instance, likely to include the person's full name, which is often shortened to the person's last name in subsequent mentions. As in Ratinov and Roth (2009), we determine whether a proper name is a substring of another entity mentioned earlier in the text. If so, the labelling function replicates the label distribution of the earlier entity.
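This first-mention heuristic can be sketched as follows; the helper name, mention list and label distributions are hypothetical illustrations, not the actual implementation:

```python
def first_mention_label(span, earlier_mentions):
    """earlier_mentions: list of (text, label_distribution) pairs, in
    document order. Returns the label distribution of the first earlier
    entity of which `span` is a proper substring, or None."""
    for text, dist in earlier_mentions:
        if span != text and span in text:
            return dist
    return None

mentions = [("Barack Obama", {"PER": 1.0})]
print(first_mention_label("Obama", mentions))  # {'PER': 1.0}
```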
3.2 Aggregation model
The outputs of these labelling functions are then aggregated into a single layer of annotation through an aggregation model. As we do not have access to labelled data for the target domain, this model is estimated in a fully unsupervised manner.
We assume a list of labelling functions and a list of mutually exclusive NER labels. The aggregation model is represented as an HMM, in which the states correspond to the true underlying labels. This model has multiple emissions (one per labelling function), assumed to be mutually independent conditional on the latent underlying label.
Formally, for each token and labelling function, the vector of label probabilities produced by that function is assumed to follow a Dirichlet distribution whose parameters depend on the latent state, i.e. there is a separate parameter vector for each labelling function and latent state. The latent states are assumed to have a Markovian dependence structure across tokens, which results in an HMM that can be viewed as a dependent mixture of Dirichlet distributions. The transition probability matrix controls, for a given state, the probability of transitioning to each subsequent state. Figure 2 illustrates the model structure.
The learnable parameters of this HMM are (a) the transition matrix between states and (b) the Dirichlet parameter vectors associated with each labelling function. The transition matrix has one row and column per label, and there is one Dirichlet vector (of dimension equal to the number of labels) per labelling function and latent state. The parameters are estimated with the Baum-Welch algorithm, a variant of the EM algorithm that relies on the forward-backward algorithm to compute the statistics for the expectation step.
To ensure faster convergence, we introduce an additional constraint on the likelihood function: at each token position, the corresponding latent label must have a non-zero probability in at least one labelling function (the likelihood of this label is otherwise set to zero for that position). In other words, the aggregation model will only predict a particular label if this label is produced by at least one labelling function. This simple constraint facilitates EM convergence, as it restricts the state space to a few possible labels at every time step.
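The restriction of the state space can be sketched as follows; the label set and toy emission probabilities are assumptions for illustration:

```python
def admissible_labels(emissions):
    """emissions: one entry per labelling function, each a list of
    per-token {label: probability} dicts. Returns, for each token, the
    set of labels that at least one function scores above zero."""
    n_tokens = len(emissions[0])
    allowed = []
    for t in range(n_tokens):
        labels = set()
        for fn in emissions:
            labels |= {lab for lab, p in fn[t].items() if p > 0}
        allowed.append(labels)
    return allowed

emissions = [
    [{"PER": 0.9, "ORG": 0.1}, {"O": 1.0}],    # labelling function 1
    [{"ORG": 1.0}, {"PER": 0.5, "ORG": 0.5}],  # labelling function 2
]
print(admissible_labels(emissions))
```

During EM, the likelihood of any latent label outside these per-token sets would simply be set to zero, as described above.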
The HMM described above can be provided with informative priors. In particular, the initial distribution over the latent states can be defined as a Dirichlet based on the label counts of the most reliable labelling function, which in our experiments was the NER model trained on Ontonotes 5.0.
The prior for each row of the transition probability matrix is likewise a Dirichlet based on the frequencies of transitions between the observed classes for this most reliable labelling function.
Finally, to facilitate convergence of the EM algorithm, informative starting values can be specified for the emission model of each labelling function, based on rough estimates of the function's recall and precision on each label. The probability that a labelling function emits a given label is then proportional to its recall when the true label matches the emitted one. Otherwise (i.e. when the labelling function made an error), the probability of emitting that label is inversely proportional to the precision of the labelling function.
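A plausible initialisation consistent with this description can be sketched as follows; the exact formula used for the emission parameters is not reproduced here, and the recall and precision values are assumptions:

```python
def init_emission(recall, precision, n_labels):
    """Sketch of an informative emission initialisation: one row per latent
    label, diagonal entries from the function's estimated recall, and the
    remaining (error) mass spread inversely to its precision."""
    off = (1 - precision) / (n_labels - 1)
    return [[recall if i == j else off for j in range(n_labels)]
            for i in range(n_labels)]

alpha = init_emission(recall=0.8, precision=0.9, n_labels=4)
print(alpha[0])  # row for the first latent label
```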
3.3 Sequence labelling model
Once the labelling functions are aggregated on documents from the target domain, we can train a sequence labelling model on the unified annotations, without imposing any constraints on the type of model to use. To take advantage of the posterior marginal distribution over the latent labels, the optimisation should seek to minimise the expected loss of the model outputs with respect to this distribution. This is equivalent to minimising the cross-entropy error between the outputs of the sequence labelling model and the probabilistic labels produced by the aggregation model.
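As a minimal sketch, this objective can be written as a soft cross-entropy between predicted and aggregated label distributions; the probability values below are toy numbers, and a real implementation would use the neural model's per-token output distributions:

```python
import math

def soft_cross_entropy(pred, target, eps=1e-12):
    """pred, target: per-token lists of label probabilities.
    Returns the mean cross-entropy of pred against the soft targets."""
    total = 0.0
    for p_row, t_row in zip(pred, target):
        total -= sum(t * math.log(p + eps) for p, t in zip(p_row, t_row))
    return total / len(pred)

pred = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]    # model outputs
target = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]  # aggregated (soft) labels
print(soft_cross_entropy(pred, target))
```

With one-hot targets this reduces to the standard cross-entropy loss, which is why no change to the training loop is needed beyond accepting probabilistic labels.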
4 Evaluation
We evaluate the proposed approach on two English-language datasets, namely the CoNLL 2003 dataset and a collection of sentences from Reuters and Bloomberg news articles annotated with named entities through crowd-sourcing. We include a second dataset in order to evaluate the approach with a more fine-grained set of NER labels than those in CoNLL 2003. As the objective of this paper is to compare approaches to unsupervised domain adaptation, we do not rely on any labelled data from these two target domains.
CoNLL 2003
The CoNLL 2003 dataset Tjong Kim Sang and De Meulder (2003) consists of 1163 documents, including a total of 35089 entities spread over four labels: ORG, PER, LOC and MISC.
Reuters & Bloomberg
We additionally crowd-annotate 1054 sentences from Reuters and Bloomberg news articles from Ding et al. (2014). We instructed the annotators to tag sentences with the following 9 Ontonotes-inspired labels: PERSON, NORP, ORG, LOC, PRODUCT, DATETIME, PERCENT, MONEY, QUANTITY. Note that the DATE and TIME labels from Ontonotes are merged into DATETIME, and the LOC and GPE labels are similarly merged into LOC. Each sentence was annotated by at least two annotators, and a qualifying test with gold-annotated questions was conducted for quality control. Cohen's κ for sentences with two annotators is 0.39, while Krippendorff's α for three annotators is 0.44. We had to remove QUANTITY labels from the annotations, as the crowd results for this particular label were highly inconsistent.
Ontonotes-trained NER
The first baseline corresponds to a neural sequence labelling model trained on the Ontonotes 5.0 corpus. We use here the same model as in Section 3.1, which is the single best-performing labelling function (that is, without aggregating multiple predictions).
We also experimented with other neural architectures, but these performed similarly or worse than the transition-based model, presumably because they are more prone to overfitting on the source domain.
Majority voting (MV)
The simplest method for aggregating outputs is majority voting, i.e. outputting the most frequent label among those predicted by the labelling functions. However, specialised labelling functions output O for most tokens, which means that the majority label is typically O. To mitigate this problem, we first select tokens that are marked with a non-O label by a minimum number of labelling functions (a threshold tuned experimentally), and then apply majority voting over this set of non-O labels.
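The thresholded majority vote described above can be sketched as follows; the threshold value and toy predictions are assumptions:

```python
from collections import Counter

def majority_vote(predictions, min_votes=2):
    """predictions: one label list per labelling function (one label per
    token). Tokens with fewer than min_votes non-O predictions stay O;
    otherwise the majority among the non-O labels wins."""
    n_tokens = len(predictions[0])
    out = []
    for i in range(n_tokens):
        non_o = [p[i] for p in predictions if p[i] != "O"]
        if len(non_o) >= min_votes:
            out.append(Counter(non_o).most_common(1)[0][0])
        else:
            out.append("O")
    return out

preds = [["PER", "O", "LOC"], ["PER", "O", "O"], ["ORG", "O", "LOC"]]
print(majority_vote(preds))  # ['PER', 'O', 'LOC']
```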
The Snorkel framework Ratner et al. (2017) does not directly support sequence labelling tasks, as data points are required to be independent. However, heuristics can be used to extract named-entity candidates and then apply labelling functions to infer their most likely labels Fries et al. (2017). For this baseline, we use the three functions nnp_detector, proper_detector and compound_detector (see Appendix A) to generate candidate spans. We then create a matrix expressing the prediction of each labelling function for each span (including a specific “abstain” value to denote the absence of predictions) and run the matrix-completion-style approach of Ratner et al. (2019) to aggregate the predictions.
mSDA is a strong domain adaptation baseline Chen et al. (2012) which augments the feature space of a model with intermediate representations learned using stacked denoising autoencoders. In our case, we learn the mSDA representations on the unlabelled source and target domain data. These 800-dimensional vectors are concatenated to 300-dimensional word embeddings and fed as input to a two-layer LSTM with a skip connection. Finally, we train the LSTM on the labelled source data and test on the target domain.
This baseline corresponds to a state-of-the-art unsupervised domain adaptation approach (AdaptaBERT) Han and Eisenstein (2019). The approach first uses unlabelled data from both the source and target domains to domain-tune a pretrained BERT model. The model is then task-tuned in a supervised fashion on the source domain labelled data (Ontonotes). At inference time, the model can draw on both the pretraining and the domain tuning to predict entities in the target domain. In our experiments, we use the cased version of the base BERT model (trained on Wikipedia and news text) and perform three fine-tuning epochs for both domain-tuning and task-tuning. We additionally include an ensemble model that averages the predictions of five BERT models fine-tuned with different random seeds.
Mixtures of multinomials
Following the notation from Section 3.2, we take the most probable label predicted for each word by each labelling function, and model it with a multinomial probability distribution. The first four baselines listed below assume the latent labels of different tokens to be independent, i.e. they use a mixture-of-multinomials model; the fifth assumes a Markovian dependence between the latent states.
Accuracy model (ACC)
Rodrigues et al. (2014) constrain the emission model such that each labelling function has a single accuracy parameter, shared across all tokens and labels.
Confusion vector (CV)
Nguyen et al. (2017a) extend ACC by relying on separate success probabilities for each token label.
Confusion matrix (CM)
Dawid and Skene (1979) allow for distinct accuracies conditional on the latent states, resulting in a full confusion matrix for each labelling function.
Sequential Confusion Matrix (SEQ)
This model extends the CM model of Simpson and Gurevych (2019) by including an “auto-regressive” component in the observed part of the model: the emission probabilities depend on a covariate indicating whether the label predicted by a given source has remained unchanged from the previous token.
Dependent confusion matrix (DCM)
This model corresponds to the CM model above, but with a Markovian dependence structure between the latent states.
Table 1: Evaluation results on CoNLL 2003 (token-level precision, recall and F1; token-level cross-entropy CE; entity-level precision, recall and F1).

| Model | Token P | Token R | Token F1 | CE | Entity P | Entity R | Entity F1 |
|---|---|---|---|---|---|---|---|
| Majority voting (MV) | 0.815 | 0.675 | 0.738 | 2.047 | 0.751 | 0.619 | 0.678 |
| Confusion Matrix (CM) | 0.786 | 0.746 | 0.766 | 1.964 | 0.713 | 0.700 | 0.706 |
| Sequential Confusion Matrix (SEQ) | 0.736 | 0.716 | 0.726 | 2.254 | 0.642 | 0.668 | 0.654 |
| Dependent Confusion Matrix (DCM) | 0.785 | 0.744 | 0.764 | 1.983 | 0.710 | 0.698 | 0.704 |
| HMM-aggregated labels (only NER models) | 0.658 | 0.720 | 0.688 | 2.653 | 0.642 | 0.599 | 0.620 |
| HMM-aggregated labels (only gazetteers) | 0.759 | 0.394 | 0.518 | 3.678 | 0.687 | 0.367 | 0.478 |
| HMM-aggregated labels (only heuristics) | 0.722 | 0.771 | 0.746 | 1.989 | 0.718 | 0.683 | 0.700 |
| HMM-aggregated labels (all but doc-level) | 0.714 | 0.778 | 0.744 | 1.878 | 0.713 | 0.693 | 0.702 |
| HMM-aggregated labels (all functions) | 0.719 | 0.794 | 0.754 | 1.812 | 0.721 | 0.713 | 0.716 |
| Neural net trained on HMM-agg. labels | 0.712 | 0.790 | 0.748 | 2.282 | 0.715 | 0.707 | 0.710 |
Table 2: Evaluation results on the Reuters and Bloomberg sentences (same metrics as Table 1).

| Model | Token P | Token R | Token F1 | CE | Entity P | Entity R | Entity F1 |
|---|---|---|---|---|---|---|---|
| Majority voting (MV) | 0.832 | 0.713 | 0.768 | 2.454 | 0.699 | 0.644 | 0.670 |
| Confusion Matrix (CM) | 0.816 | 0.702 | 0.754 | 2.708 | 0.667 | 0.636 | 0.652 |
| Sequential Confusion Matrix (SEQ) | 0.741 | 0.630 | 0.682 | 3.261 | 0.535 | 0.547 | 0.540 |
| Dependent Confusion Matrix (DCM) | 0.819 | 0.706 | 0.758 | 2.702 | 0.673 | 0.641 | 0.656 |
| HMM-aggregated labels (all functions) | 0.804 | 0.823 | 0.814 | 2.219 | 0.749 | 0.697 | 0.722 |
| Neural net trained on HMM-agg. labels | 0.805 | 0.827 | 0.816 | 2.448 | 0.749 | 0.701 | 0.724 |
The evaluation results are shown in Tables 1 and 2, respectively for the CoNLL 2003 data and the sentences extracted from Reuters and Bloomberg. The metrics are the (micro-averaged) precision, recall and F1 scores at both the token and entity level. In addition, we indicate the token-level cross-entropy error (in log-scale). As the labelling functions are defined on a richer annotation scheme than the four labels of CoNLL 2003, we map GPE to LOC, and EVENT, FAC, LANGUAGE, LAW, NORP, PRODUCT and WORK_OF_ART to MISC.
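This label mapping can be sketched as a simple lookup (an illustration only, not the actual evaluation code):

```python
# Map the richer annotation scheme down to the four CoNLL 2003 labels.
TO_CONLL = {"GPE": "LOC",
            **{lab: "MISC" for lab in
               ["EVENT", "FAC", "LANGUAGE", "LAW", "NORP",
                "PRODUCT", "WORK_OF_ART"]}}

def map_label(label):
    """Labels outside the mapping (PER, ORG, LOC, O, ...) pass through."""
    return TO_CONLL.get(label, label)

print([map_label(lab) for lab in ["GPE", "LAW", "PERSON"]])  # ['LOC', 'MISC', 'PERSON']
```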
The results for the ACC and CV baselines are not included in the two tables as the parameter estimation did not converge (and thus did not provide reliable estimates of the parameters).
Table 1 further details the results obtained using only a subset of labelling functions. Of particular interest is the positive contribution of document-level functions, boosting the entity-level F1 from 0.702 to 0.716. This highlights the importance of document-level relations in NER.
The last line of the two tables reports the performance of the neural sequence labelling model (described in Section 3.3) trained on the basis of the aggregated labels. We observe that the performance of this neural model remains close to the performance of the HMM-aggregated labels. This result shows that the knowledge from the labelling functions can be injected into a standard neural model without substantial loss.
Although not shown in the results due to space constraints, we also analysed whether the informative priors described in Section 3.2 influenced the performance of the aggregation model. We found informative and non-informative priors to yield similar performance for CoNLL 2003. However, the performance of non-informative priors was very poor on the Reuters and Bloomberg sentences (F1 at 0.12), thereby demonstrating the usefulness of informative priors for small datasets.
We provide in Figure 3 an example with a few selected labelling functions. In particular, we can observe that the Ontonotes-trained NER model mistakenly labels “Heidrun” as a product. This erroneous label is, however, counter-balanced by other labelling functions, notably a document-level function looking at the global label frequency of this string throughout the document. We do, however, notice a few remaining errors, e.g. the labelling of “Status Weekly” as an organisation.
Figure 4 illustrates the pairwise agreement and disagreement between labelling functions on the CoNLL 2003 dataset. If two labelling functions make the same prediction on a given token, we count this as an agreement, whereas conflicting predictions (ignoring O labels) are counted as disagreements. Large differences may exist between these functions for specific labels, especially MISC. The functions with the highest overlap are those making predictions on all labels, while labelling functions specialised to a few labels (such as legal_detector) often have less overlap. We also observe that the two gazetteers from Crunchbase and Geonames disagree in about 15% of cases, presumably due to company names that are also geographical locations, as in the earlier Komatsu example.
In terms of computational efficiency, the estimation of the HMM parameters is relatively fast, requiring less than 30 minutes on the entire CoNLL 2003 data. Once the aggregation model is estimated, it can be applied directly to new texts with a single forward-backward pass, and can therefore scale to datasets with hundreds of thousands of documents. This runtime performance is an important advantage compared to approaches such as AdaptaBERT Han and Eisenstein (2019), which are relatively slow at inference time. The proposed approach can also be ported to languages other than English, although the heuristic functions and gazetteers will need to be adapted to the target language.
5 Conclusion
This paper presented a weak supervision model for sequence labelling tasks such as Named Entity Recognition. To leverage all possible knowledge sources available for the task, the approach uses a broad spectrum of labelling functions, including data-driven NER models, gazetteers, heuristic functions, and document-level relations between entities. Labelling functions may be specialised to recognise specific labels while ignoring others. Furthermore, unlike previous weak supervision approaches, labelling functions may produce probabilistic predictions. The outputs of these labelling functions are then merged using a hidden Markov model whose parameters are estimated with the Baum-Welch algorithm. A neural sequence labelling model can finally be learned on the basis of these unified predictions.
Evaluation results on two datasets (CoNLL 2003 and news articles from Reuters and Bloomberg) show that the method can boost NER performance by about 7 percentage points of entity-level F1. In particular, the proposed model outperforms the unsupervised domain adaptation approach through contextualised embeddings of Han and Eisenstein (2019). Of specific linguistic interest is the contribution of document-level labelling functions, which take advantage of the internal coherence and narrative structure of the texts.
Future work will investigate how to take into account potential correlations between labelling functions in the aggregation model, as done in e.g. Bach et al. (2017). Furthermore, some of the labelling functions can be rather noisy and model selection of the optimal subset of the labelling functions might well improve the performance of our model. Model selection approaches that can be adapted are discussed in Adams and Beling (2019); Hubin (2019). We also wish to evaluate the approach on other types of sequence labelling tasks beyond Named Entity Recognition.
Acknowledgements
The research presented in this paper was conducted as part of the innovation project “FinAI: Artificial Intelligence tool to monitor global financial markets” in collaboration with Exabel AS (www.exabel.com). This collaboration is supported through the funding programme for “User-driven Research based Innovation” of the Research Council of Norway.
Additionally, this work is supported by the SANT project (Sentiment Analysis for Norwegian Text), funded by the Research Council of Norway (grant number 270908).
References
- Adams and Beling (2019) Stephen Adams and Peter A Beling. 2019. Artificial Intelligence Review, 52(3):1739–1779.
- Akbik et al. (2019) Alan Akbik, Tanja Bergmann, and Roland Vollgraf. 2019. Pooled contextualized embeddings for named entity recognition. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 724–728, Minneapolis, Minnesota. Association for Computational Linguistics.
- Bach et al. (2017) Stephen H. Bach, Bryan He, Alexander Ratner, and Christopher Ré. 2017. Learning the structure of generative models without labeled data. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML'17, pages 273–282. JMLR.org.
- Barnes et al. (2018) Jeremy Barnes, Roman Klinger, and Sabine Schulte im Walde. 2018. Projecting embeddings for domain adaption: Joint modeling of sentiment analysis in diverse domains. In Proceedings of the 27th International Conference on Computational Linguistics, pages 818–830, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
- Blitzer et al. (2007) John Blitzer, Mark Dredze, and Fernando Pereira. 2007. Biographies, Bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 440–447, Prague, Czech Republic. Association for Computational Linguistics.
- Blitzer et al. (2006) John Blitzer, Ryan McDonald, and Fernando Pereira. 2006. Domain adaptation with structural correspondence learning. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 120–128, Sydney, Australia. Association for Computational Linguistics.
- Braun et al. (2017) Daniel Braun, Adrian Hernandez Mendez, Florian Matthes, and Manfred Langen. 2017. Evaluating natural language understanding services for conversational question answering systems. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pages 174–185, Saarbrücken, Germany. Association for Computational Linguistics.
- Chen et al. (2012) Minmin Chen, Zhixiang Xu, Kilian Q. Weinberger, and Fei Sha. 2012. Marginalized denoising autoencoders for domain adaptation. In Proceedings of the 29th International Coference on International Conference on Machine Learning, ICML’12, pages 1627–1634, USA. Omnipress.
- Chiu and Nichols (2016) Jason P.C. Chiu and Eric Nichols. 2016. Named entity recognition with bidirectional LSTM-CNNs. Transactions of the Association for Computational Linguistics, 4:357–370.
- Daumé III (2007) Hal Daumé III. 2007. Frustratingly easy domain adaptation. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 256–263, Prague, Czech Republic. Association for Computational Linguistics.
- Dawid and Skene (1979) A. P. Dawid and A. M. Skene. 1979. Maximum likelihood estimation of observer error-rates using the EM algorithm. Applied Statistics, 28(1):20–28.
- Derczynski et al. (2016) Leon Derczynski, Kalina Bontcheva, and Ian Roberts. 2016. Broad twitter corpus: A diverse named entity recognition resource. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 1169–1179, Osaka, Japan. The COLING 2016 Organizing Committee.
- Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
- Ding et al. (2014) Xiao Ding, Yue Zhang, Ting Liu, and Junwen Duan. 2014. Using structured events to predict stock price movement: An empirical investigation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1415–1425, Doha, Qatar. Association for Computational Linguistics.
- Fries et al. (2017) Jason Fries, Sen Wu, Alex Ratner, and Christopher Ré. 2017. Swellshark: A generative model for biomedical named entity recognition without labeled data.
- Geiß et al. (2018) Johanna Geiß, Andreas Spitz, and Michael Gertz. 2018. Neckar: A named entity classifier for wikidata. In Language Technologies for the Challenges of the Digital Age, pages 115–129, Cham. Springer International Publishing.
- Ghazvininejad et al. (2018) Marjan Ghazvininejad, Chris Brockett, Ming-Wei Chang, Bill Dolan, Jianfeng Gao, Scott Wen-tau Yih, and Michel Galley. 2018. A knowledge-grounded neural conversation model. In AAAI.
- Glorot et al. (2011) Xavier Glorot, Antoine Bordes, and Yoshua Bengio. 2011. Domain adaptation for large-scale sentiment classification: A deep learning approach. In Proceedings of the 28th International Conference on International Conference on Machine Learning, ICML’11, pages 513–520, USA. Omnipress.
- Grosz et al. (1995) Barbara J. Grosz, Aravind K. Joshi, and Scott Weinstein. 1995. Centering: A framework for modeling the local coherence of discourse. Computational Linguistics, 21(2):203–225.
- Grosz and Sidner (1986) Barbara J. Grosz and Candace L. Sidner. 1986. Attention, intentions, and the structure of discourse. Computational Linguistics, 12(3):175–204.
- Guo et al. (2009) Honglei Guo, Huijia Zhu, Zhili Guo, Xiaoxun Zhang, Xian Wu, and Zhong Su. 2009. Domain adaptation with latent semantic association for named entity recognition. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 281–289, Boulder, Colorado. Association for Computational Linguistics.
- Han and Eisenstein (2019) Xiaochuang Han and Jacob Eisenstein. 2019. Unsupervised domain adaptation of contextualized embeddings for sequence labeling. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4237–4247, Hong Kong, China. Association for Computational Linguistics.
- Honnibal and Montani (2017) Matthew Honnibal and Ines Montani. 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear.
- Hovy et al. (2013) Dirk Hovy, Taylor Berg-Kirkpatrick, Ashish Vaswani, and Eduard Hovy. 2013. Learning whom to trust with MACE. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1120–1130, Atlanta, Georgia. Association for Computational Linguistics.
- Howard and Ruder (2018) Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 328–339, Melbourne, Australia. Association for Computational Linguistics.
- Hu et al. (2016) Zhiting Hu, Xuezhe Ma, Zhengzhong Liu, Eduard Hovy, and Eric Xing. 2016. Harnessing deep neural networks with logic rules. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2410–2420, Berlin, Germany. Association for Computational Linguistics.
- Hubin (2019) Aliaksandr Hubin. 2019. An adaptive simulated annealing EM algorithm for inference on non-homogeneous hidden Markov models. In Proceedings of the International Conference on Artificial Intelligence, Information Processing and Cloud Computing, pages 1–9.
- Kim and Ghahramani (2012) Hyun-Chul Kim and Zoubin Ghahramani. 2012. Bayesian classifier combination. In Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, volume 22 of Proceedings of Machine Learning Research, pages 619–627, La Palma, Canary Islands. PMLR.
- Krishnan and Manning (2006) Vijay Krishnan and Christopher D. Manning. 2006. An effective two-stage model for exploiting non-local dependencies in named entity recognition. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages 1121–1128, Sydney, Australia. Association for Computational Linguistics.
- Lample et al. (2016) Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 260–270, San Diego, California. Association for Computational Linguistics.
- Lehmann et al. (2015) Jens Lehmann, Robert Isele, Max Jakob, Anja Jentzsch, Dimitris Kontokostas, Pablo N. Mendes, Sebastian Hellmann, Mohamed Morsey, Patrick van Kleef, Sören Auer, and Christian Bizer. 2015. Dbpedia - a large-scale, multilingual knowledge base extracted from wikipedia. Semantic Web, 6(2):167–195.
- Lin and Lu (2018) Bill Yuchen Lin and Wei Lu. 2018. Neural adaptation layers for cross-domain named entity recognition. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2012–2022, Brussels, Belgium. Association for Computational Linguistics.
- Mintz et al. (2009) Mike Mintz, Steven Bills, Rion Snow, and Daniel Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 1003–1011, Suntec, Singapore. Association for Computational Linguistics.
- Mollá et al. (2006) Diego Mollá, Menno van Zaanen, and Daniel Smith. 2006. Named entity recognition for question answering. In Proceedings of the Australasian Language Technology Workshop 2006, pages 51–58, Sydney, Australia.
- Nguyen et al. (2017b) An Thanh Nguyen, Byron Wallace, Junyi Jessy Li, Ani Nenkova, and Matthew Lease. 2017b. Aggregating and predicting sequence labels from crowd annotations. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 299–309, Vancouver, Canada. Association for Computational Linguistics.
- Peng and Dredze (2017) Nanyun Peng and Mark Dredze. 2017. Multi-task domain adaptation for sequence tagging. In Proceedings of the 2nd Workshop on Representation Learning for NLP, pages 91–100, Vancouver, Canada. Association for Computational Linguistics.
- Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.
- Peters et al. (2018) Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.
- Rabiner (1990) Lawrence R. Rabiner. 1990. A tutorial on hidden Markov models and selected applications in speech recognition. In Alex Waibel and Kai-Fu Lee, editors, Readings in Speech Recognition, pages 267–296. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.
- Raiman and Raiman (2018) Jonathan Raiman and Olivier Raiman. 2018. Deeptype: Multilingual entity linking by neural type system evolution. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pages 5406–5413.
- Ratinov and Roth (2009) Lev Ratinov and Dan Roth. 2009. Design challenges and misconceptions in named entity recognition. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL-2009), pages 147–155, Boulder, Colorado. Association for Computational Linguistics.
- Ratner et al. (2017) Alexander Ratner, Stephen H. Bach, Henry Ehrenberg, Jason Fries, Sen Wu, and Christopher Ré. 2017. Snorkel: Rapid training data creation with weak supervision. Proc. VLDB Endow., 11(3):269–282.
- Ratner et al. (2019) Alexander Ratner, Stephen H. Bach, Henry Ehrenberg, Jason Fries, Sen Wu, and Christopher Ré. 2019. Snorkel: rapid training data creation with weak supervision. The VLDB Journal.
- Ritter et al. (2013) Alan Ritter, Luke Zettlemoyer, Mausam, and Oren Etzioni. 2013. Modeling missing data in distant supervision for information extraction. Transactions of the Association for Computational Linguistics, 1:367–378.
- Rodrigues et al. (2014) Filipe Rodrigues, Francisco Pereira, and Bernardete Ribeiro. 2014. Sequence labeling with multiple annotators. Mach. Learn., 95(2):165–181.
- Rodriguez et al. (2018) Juan Diego Rodriguez, Adam Caldwell, and Alexander Liu. 2018. Transfer learning for entity recognition of novel classes. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1974–1985, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
- Safranchik et al. (2020) Esteban Safranchik, Shiying Luo, and Stephen H. Bach. 2020. Weakly supervised sequence tagging from noisy rules. In AAAI Conference on Artificial Intelligence (AAAI).
- Sagi and Rokach (2018) Omer Sagi and Lior Rokach. 2018. Ensemble learning: A survey. WIREs Data Mining and Knowledge Discovery, 8(4):e1249.
- Salinas Alvarado et al. (2015) Julio Cesar Salinas Alvarado, Karin Verspoor, and Timothy Baldwin. 2015. Domain adaption of named entity recognition to support credit risk assessment. In Proceedings of the Australasian Language Technology Association Workshop 2015, pages 84–90, Parramatta, Australia.
- Shang et al. (2018) Jingbo Shang, Liyuan Liu, Xiaotao Gu, Xiang Ren, Teng Ren, and Jiawei Han. 2018. Learning named entity tagger using domain-specific dictionary. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2054–2064, Brussels, Belgium. Association for Computational Linguistics.
- Simpson and Gurevych (2019) Edwin D. Simpson and Iryna Gurevych. 2019. A Bayesian approach for sequence tagging with crowds. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1093–1104, Hong Kong, China. Association for Computational Linguistics.
- Strauss et al. (2016) Benjamin Strauss, Bethany Toma, Alan Ritter, Marie-Catherine de Marneffe, and Wei Xu. 2016. Results of the WNUT16 named entity recognition shared task. In Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT), pages 138–144, Osaka, Japan. The COLING 2016 Organizing Committee.
- Stubbs et al. (2015) Amber Stubbs, Christopher Kotfila, and Özlem Uzuner. 2015. Automated systems for the de-identification of longitudinal clinical narratives. Journal of Biomedical Informatics, 58(S):S11–S19.
- Tjong Kim Sang and De Meulder (2003) Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pages 142–147.
- Ugawa et al. (2018) Arata Ugawa, Akihiro Tamura, Takashi Ninomiya, Hiroya Takamura, and Manabu Okumura. 2018. Neural machine translation incorporating named entity. In Proceedings of the 27th International Conference on Computational Linguistics, pages 3240–3250, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
- Wang and Poon (2018) Hai Wang and Hoifung Poon. 2018. Deep probabilistic logic: A unifying framework for indirect supervision. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1891–1902, Brussels, Belgium. Association for Computational Linguistics.
- Wang et al. (2018) Limin Wang, Shoushan Li, Qian Yan, and Guodong Zhou. 2018. Domain-specific named entity recognition with document-level optimization. ACM Trans. Asian Low-Resour. Lang. Inf. Process., 17(4):33:1–33:15.
- Weischedel et al. (2011) R. Weischedel, E. Hovy, M. Marcus, M. Palmer, R. Belvin, S. Pradhan, L. Ramshaw, and N. Xue. 2011. OntoNotes: A large training corpus for enhanced processing. In Handbook of Natural Language Processing and Machine Translation: DARPA Global Autonomous Language Exploitation. Springer.
- Wick (2015) Marc Wick. 2015. Geonames ontology.
- Yadav and Bethard (2018) Vikas Yadav and Steven Bethard. 2018. A survey on recent advances in named entity recognition from deep learning models. In Proceedings of the 27th International Conference on Computational Linguistics, pages 2145–2158, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
- Yan et al. (2019) Hang Yan, Bocao Deng, Xiaonan Li, and Xipeng Qiu. 2019. TENER: Adapting transformer encoder for named entity recognition. arXiv preprint arXiv:1911.04474.
- Yang et al. (2017) Zhilin Yang, Ruslan Salakhutdinov, and William W. Cohen. 2017. Transfer learning for sequence tagging with hierarchical recurrent networks. In International Conference on Learning Representations.
- Yu and Jiang (2016) Jianfei Yu and Jing Jiang. 2016. Learning sentence embeddings with auxiliary tasks for cross-domain sentiment classification. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 236–246, Austin, Texas. Association for Computational Linguistics.
- Zhou et al. (2019) Joey Tianyi Zhou, Hao Zhang, Di Jin, Hongyuan Zhu, Meng Fang, Rick Siow Mong Goh, and Kenneth Kwok. 2019. Dual adversarial neural transfer for low-resource named entity recognition. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3461–3471, Florence, Italy. Association for Computational Linguistics.
- Ziser and Reichart (2017) Yftah Ziser and Roi Reichart. 2017. Neural structural correspondence learning for domain adaptation. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 400–410, Vancouver, Canada. Association for Computational Linguistics.
Appendix A Labelling functions
Appendix B Label matching problem
The baseline models relying on mixtures of multinomials have to address the so-called label matching problem, which needs some extra care.
The following approach was employed in the experiments from Section 4:
First, we assign strong initial values to the probabilities of individual classes, based on the frequency with which these classes appear in the most reliable labelling function. This is expected to increase the probability that EM explores the mode around the initialised values.
Second, we perform post-processing: each latent state is mapped to the label with which it has the highest pairwise correlation, where the reference labels are taken from one of three options:
the most reliable labelling function (Ontonotes-trained NER);
the majority voting labelling function;
the suggested Dirichlet dependent mixture model.
Additionally, if this highest correlation falls below a given threshold, the O label is assigned to the corresponding state. We empirically observed that the best-performing label matching technique was to map the states to the labels produced by the majority voter (based on the pairwise correlations).
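As an illustration of this matching step, the sketch below (our own minimal Python, with hypothetical function names and an arbitrary threshold value, since the paper does not fix one here) maps each latent state to the reference label whose indicator sequence it correlates with most strongly, falling back to O otherwise:

```python
import numpy as np

# Named-entity label inventory; index 0 is the "outside" label O.
LABELS = ["O", "PER", "LOC", "ORG", "MISC"]

def match_states_to_labels(state_seq, reference_seq, n_states, threshold=0.1):
    """Map each latent state to the reference label it correlates with most.

    state_seq:     latent-state index per token (from the mixture model)
    reference_seq: reference label index per token (e.g. from majority voting)
    Returns a dict {state index -> label index}; states whose best pairwise
    correlation is below the threshold are mapped to O (index 0).
    """
    state_seq = np.asarray(state_seq)
    reference_seq = np.asarray(reference_seq)
    mapping = {}
    for s in range(n_states):
        indicator_s = (state_seq == s).astype(float)
        best_label, best_corr = 0, -1.0  # default to O
        for l in range(len(LABELS)):
            indicator_l = (reference_seq == l).astype(float)
            # Pearson correlation between the two indicator sequences;
            # skip degenerate (constant) indicators to avoid NaNs
            if indicator_s.std() == 0 or indicator_l.std() == 0:
                continue
            corr = np.corrcoef(indicator_s, indicator_l)[0, 1]
            if corr > best_corr:
                best_label, best_corr = l, corr
        mapping[s] = best_label if best_corr >= threshold else 0
    return mapping
```

In this sketch the same routine can be run with any of the three reference options above simply by changing which label sequence is passed as `reference_seq`.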
Appendix C Detailed results
In Table 4, we provide the detailed results broken down by NER label for the CoNLL 2003 data, which were presented in micro-averaged form in Table 1 of the main paper.
| Label | Freq. | Model | Token-level P | Token-level R | Token-level F1 | Entity-level P | Entity-level R | Entity-level F1 |
|---|---|---|---|---|---|---|---|---|
| LOC | 30.3% | Ontonotes-trained NER | 0.767 | 0.812 | 0.788 | 0.764 | 0.800 | 0.782 |
| | | Majority voting (MV) | 0.740 | 0.839 | 0.786 | 0.739 | 0.828 | 0.780 |
| | | Sequential Confusion Matrix | 0.681 | 0.856 | 0.758 | 0.664 | 0.848 | 0.744 |
| | | Dependent Confusion Matrix | 0.718 | 0.890 | 0.794 | 0.710 | 0.886 | 0.788 |
| | | HMM (only NER models) | 0.601 | 0.825 | 0.696 | 0.650 | 0.733 | 0.690 |
| | | HMM (only gazetteers) | 0.707 | 0.632 | 0.668 | 0.694 | 0.630 | 0.660 |
| | | HMM (all but doc-level) | 0.701 | 0.862 | 0.774 | 0.724 | 0.838 | 0.776 |
| | | HMM (all functions) | 0.726 | 0.859 | 0.786 | 0.738 | 0.839 | 0.786 |
| | | NN trained on HMM | 0.736 | 0.851 | 0.790 | 0.734 | 0.850 | 0.788 |
| PER | 28.7% | Ontonotes-trained NER | 0.850 | 0.833 | 0.842 | 0.787 | 0.741 | 0.764 |
| | | Majority voting (MV) | 0.915 | 0.871 | 0.892 | 0.831 | 0.775 | 0.802 |
| | | Sequential Confusion Matrix | 0.849 | 0.879 | 0.864 | 0.730 | 0.789 | 0.758 |
| | | Dependent Confusion Matrix | 0.892 | 0.920 | 0.906 | 0.806 | 0.834 | 0.820 |
| | | HMM (only NER models) | 0.837 | 0.860 | 0.848 | 0.770 | 0.744 | 0.756 |
| | | HMM (only gazetteers) | 0.917 | 0.452 | 0.606 | 0.835 | 0.391 | 0.532 |
| | | HMM (all but doc-level) | 0.859 | 0.917 | 0.888 | 0.814 | 0.782 | 0.798 |
| | | HMM (all functions) | 0.857 | 0.947 | 0.900 | 0.820 | 0.826 | 0.822 |
| | | NN trained on HMM | 0.856 | 0.946 | 0.898 | 0.814 | 0.824 | 0.818 |
| ORG | 26.6% | Ontonotes-trained NER | 0.536 | 0.517 | 0.526 | 0.437 | 0.306 | 0.360 |
| | | Majority voting (MV) | 0.725 | 0.512 | 0.600 | 0.610 | 0.434 | 0.508 |
| | | Sequential Confusion Matrix | 0.632 | 0.590 | 0.610 | 0.485 | 0.515 | 0.500 |
| | | Dependent Confusion Matrix | 0.696 | 0.613 | 0.652 | 0.567 | 0.536 | 0.552 |
| | | HMM (only NER models) | 0.516 | 0.549 | 0.532 | 0.425 | 0.333 | 0.374 |
| | | HMM (only gazetteers) | 0.648 | 0.304 | 0.414 | 0.512 | 0.235 | 0.322 |
| | | HMM (all but doc-level) | 0.565 | 0.631 | 0.596 | 0.551 | 0.494 | 0.520 |
| | | HMM (all functions) | 0.542 | 0.665 | 0.598 | 0.545 | 0.527 | 0.536 |
| | | NN trained on HMM | 0.539 | 0.665 | 0.596 | 0.537 | 0.519 | 0.528 |
| MISC | 14.4% | Ontonotes-trained NER | 0.676 | 0.599 | 0.636 | 0.702 | 0.583 | 0.636 |
| | | Majority voting (MV) | 0.861 | 0.187 | 0.308 | 0.809 | 0.193 | 0.312 |
| | | Sequential Confusion Matrix | 0.850 | 0.320 | 0.464 | 0.791 | 0.333 | 0.468 |
| | | Dependent Confusion Matrix | 0.893 | 0.318 | 0.468 | 0.844 | 0.330 | 0.474 |
| | | HMM (only NER models) | 0.667 | 0.544 | 0.600 | 0.708 | 0.518 | 0.598 |
| | | HMM (only gazetteers) | 0.745 | 0.011 | 0.022 | 0.594 | 0.008 | 0.016 |
| | | HMM (all but doc-level) | 0.714 | 0.596 | 0.650 | 0.781 | 0.575 | 0.662 |
| | | HMM (all functions) | 0.814 | 0.571 | 0.672 | 0.830 | 0.565 | 0.672 |
| | | NN trained on HMM | 0.852 | 0.577 | 0.688 | 0.866 | 0.583 | 0.696 |
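For reference, the entity-level scores above follow the standard exact-match convention: a predicted entity counts as correct only if both its span and its type match a gold entity. A minimal sketch of this metric (our own helper, representing entities as (start, end, label) tuples):

```python
def entity_prf(gold, pred):
    """Entity-level precision, recall and F1 over exact (start, end, label) matches."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)  # entities matching both span and label exactly
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Example: one of two predicted entities has the wrong type
gold = [(0, 2, "PER"), (5, 6, "LOC")]
pred = [(0, 2, "PER"), (5, 6, "ORG")]
# -> precision = recall = F1 = 0.5
```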