Lightly-supervised Representation Learning with Global Interpretability

05/29/2018 · Marco A. Valenzuela-Escarcega et al. · The University of Arizona

We propose a lightly-supervised approach for information extraction, in particular named entity classification, which combines the benefits of traditional bootstrapping, i.e., use of limited annotations and interpretability of extraction patterns, with the robust learning approaches proposed in representation learning. Our algorithm iteratively learns custom embeddings for both the multi-word entities to be extracted and the patterns that match them from a few example entities per category. We demonstrate that this representation-based approach outperforms three other state-of-the-art bootstrapping approaches on two datasets: CoNLL-2003 and OntoNotes. Additionally, using these embeddings, our approach outputs a globally-interpretable model consisting of a decision list, by ranking patterns based on their proximity to the average entity embedding in a given class. We show that this interpretable model performs close to our complete bootstrapping model, proving that representation learning can be used to produce interpretable models with small loss in performance.

1 Introduction

One strategy for mitigating the cost of supervised learning in information extraction (IE) is to bootstrap extractors with light supervision from a few provided examples (or seeds). Traditionally, bootstrapping approaches iterate between learning extraction patterns such as word n-grams, e.g., the pattern “@ENTITY , former president” could be used to extract person names (in this work we use surface patterns, but the proposed algorithm is agnostic to the types of patterns learned), and applying these patterns to extract the desired structures (entities, relations, etc.) (Carlson et al., 2010; Gupta and Manning, 2014, 2015, inter alia). One advantage of this direction is that these patterns are interpretable, which mitigates the maintenance cost associated with machine learning systems (Sculley et al., 2014).

On the other hand, representation learning has proven to be useful for natural language processing (NLP) applications (Mikolov et al., 2013; Riedel et al., 2013; Toutanova et al., 2015, 2016, inter alia). Representation learning approaches often include a component that is trained in an unsupervised manner, e.g., predicting words based on their context from large amounts of data, mitigating the brittle statistics affecting traditional bootstrapping approaches. However, the resulting real-valued embedding vectors are hard to interpret.

Here we argue that these two directions are complementary, and should be combined. We propose such a bootstrapping approach for information extraction (IE), which blends the advantages of both directions. As a use case, we instantiate our idea for named entity classification (NEC), i.e., classifying a given set of unknown entities into a predefined set of categories (Collins and Singer, 1999). The contributions of this work are:

(1) We propose an approach for bootstrapping NEC that iteratively learns custom embeddings for both the multi-word entities to be extracted and the patterns that match them from a few example entities per category. Our approach changes the objective function of a neural network language model (NNLM) to include a semi-supervised component that models the known examples, i.e., by attracting entities and patterns in the same category to each other and repelling them from elements in different categories, and it adds an external iterative process that “cautiously” augments the pools of known examples (Collins and Singer, 1999).

(2) We demonstrate that our representation learning approach is suitable for semi-supervised NEC. We compare our approach against several state-of-the-art semi-supervised approaches on two datasets: CoNLL-2003 (Tjong Kim Sang and De Meulder, 2003) and OntoNotes (Pradhan et al., 2013). We show that, despite its simplicity, our method outperforms all other approaches.

(3) Our approach also outputs an interpretation of the learned model, consisting of a decision list of patterns, where each pattern receives a score per class based on the proximity of its embedding to the average entity embedding in the given class. This interpretation is global, i.e., it explains the entire model rather than local predictions. We show that this decision-list model performs comparably to the complete model on the two datasets, which guarantees that the resulting system can be understood, debugged, and maintained by non-machine-learning experts. Further, this model considerably outperforms an interpretable model that uses pretrained embeddings, demonstrating that our custom embeddings help interpretability.

2 Related Work

Bootstrapping is an iterative process that alternates between learning representative patterns and acquiring new entities (or relations) belonging to a given category (Riloff, 1996; McIntosh, 2010). Patterns and extractions are ranked using either formulas that measure their frequency and association with a category, or classifiers, which increases robustness due to regularization (Carlson et al., 2010; Gupta and Manning, 2015).

Distributed representations of words (Mikolov et al., 2013; Levy and Goldberg, 2014a) serve as the underlying representation for many NLP tasks such as information extraction and question answering (Riedel et al., 2013; Toutanova et al., 2015, 2016; Sharp et al., 2016). However, most of the works that customize embeddings for a specific task rely on some form of supervision. In contrast, our approach is lightly supervised, with only a few seed examples per category. Batista et al. (2015) perform bootstrapping for relation extraction using pre-trained word embeddings, but they do not learn custom embeddings for the multi-word entities and patterns involved. We show that customizing embeddings for the learned patterns is important for interpretability.

Recent work has focused on explanations of machine learning models that are model-agnostic but local, i.e., they interpret individual model predictions (Ribeiro et al., 2016a, 2018). In contrast, our work produces a global interpretation, which explains the entire extraction model rather than individual decisions.

Lastly, our work addresses the interpretability aspect of information extraction methods. Interpretable models mitigate the technical debt of machine learning (Sculley et al., 2014); for example, they allow domain experts to make manual, gradual improvements to the models. This is why rule-based approaches are commonly used in industry applications, where software maintenance is crucial (Chiticariu et al., 2013). Furthermore, the need for interpretability also arises in critical systems, e.g., recommending treatment to patients, where machine learning is deployed to aid human decision makers (Lakkaraju and Rudin, 2016). The benefits of interpretability have encouraged efforts to either extract interpretable models from opaque ones (Craven and Shavlik, 1996), or to explain their decisions (Ribeiro et al., 2016b).

As machine learning models become more complex, the focus on interpretability has become more important, with new funding programs focused on this topic (e.g., DARPA’s Explainable AI program: http://www.darpa.mil/program/explainable-artificial-intelligence). Our approach for exporting an interpretable model (§3) is similar to Valenzuela-Escárcega et al. (2016), but we start from distributed representations, whereas they started from a logistic regression model with explicit features.

3 Approach

Bootstrapping with representation learning

Our algorithm iteratively grows a pool of multi-word entities (E_c) and a pool of n-gram patterns (P_c) for each category of interest c, and learns custom embeddings for both, which we will show are crucial for both performance and interpretability.

The entity pools are initialized with a few seed examples for each category. For example, in our experiments we initialize the pool for a person-names category with 10 names such as Mother Teresa. Then the algorithm iteratively applies the following three steps for T epochs:

(1) Learning custom embeddings: The algorithm learns custom embeddings for all entities and patterns in the dataset, using the current pools E_c as supervision. This is a key contribution, and is detailed in the second part of this section.

(2) Pattern promotion: We generate the patterns that match the entities in each pool E_c, rank those patterns using point-wise mutual information (PMI) with the corresponding category, and select the top ranked patterns for promotion to the corresponding pattern pool P_c.
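To make the promotion step concrete, the following is a minimal sketch of PMI-based pattern ranking, assuming pattern–entity co-occurrence counts have already been collected from the corpus; the function and variable names are ours, not the authors’.

```python
import math

def rank_patterns_by_pmi(matches, pool, k=10):
    """Rank patterns by PMI with a category's entity pool.

    matches: dict pattern -> dict entity -> co-occurrence count.
    pool: set of entities currently promoted for the category.
    """
    total = sum(n for ents in matches.values() for n in ents.values())
    cat = sum(n for ents in matches.values()
              for e, n in ents.items() if e in pool)
    scored = []
    for pat, ents in matches.items():
        pat_count = sum(ents.values())
        joint = sum(n for e, n in ents.items() if e in pool)
        if joint:
            # PMI = log P(pattern, category) / (P(pattern) * P(category))
            pmi = math.log(joint * total / (pat_count * cat))
            scored.append((pmi, pat))
    return [p for _, p in sorted(scored, reverse=True)[:k]]
```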

(3) Entity promotion: Entities are promoted to E_c using a multi-class classifier that estimates the likelihood of an entity belonging to each class (Gupta and Manning, 2015). Our feature set includes, for each category c: (a) edit distance between the candidate entity e and the current entities in E_c, (b) the PMI (with c) of the patterns in P_c that matched e in the training documents, and (c) similarity between e and the entities in E_c in a semantic space. For the latter feature group, we use two sets of vector representations for entities. The first is the set of embedding vectors learned in step (1). The second includes pre-trained word embeddings; for multi-word entities and patterns, we simply average the embeddings of the component words. We use these vectors to compute the cosine similarity score of a given candidate entity e to the entities in E_c, and add the average and maximum similarities as features. The top 10 entities classified with the highest confidence for each class are promoted to the corresponding E_c after each epoch.
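Putting the three steps together, the overall loop can be sketched as follows. The two stubs stand in for the components described above (and detailed next), and rank_patterns_by_pmi is the sketch from the previous snippet, so this illustrates the control flow rather than the authors’ implementation.

```python
def learn_embeddings(entity_pools, pattern_pools, matches):
    """Placeholder for the custom embedding learning of step (1)."""
    raise NotImplementedError

def promote_entities(cat, candidates, entity_pools, pattern_pools, emb, k):
    """Placeholder for the entity promotion classifier of step (3)."""
    raise NotImplementedError

def bootstrap(seeds, candidates, matches, n_epochs=20, k=10):
    entity_pools = {c: set(s) for c, s in seeds.items()}  # E_c, seeded
    pattern_pools = {c: set() for c in seeds}             # P_c, initially empty
    for _ in range(n_epochs):
        # (1) learn custom embeddings, supervised by the current pools
        emb = learn_embeddings(entity_pools, pattern_pools, matches)
        # (2) promote the top-k patterns per category by PMI
        for c in entity_pools:
            pattern_pools[c].update(
                rank_patterns_by_pmi(matches, entity_pools[c], k))
        # (3) promote the top-k entities per category with the classifier
        for c in entity_pools:
            entity_pools[c].update(
                promote_entities(c, candidates, entity_pools,
                                 pattern_pools, emb, k))
    return entity_pools, pattern_pools, emb
```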

Learning custom embeddings

We train our embeddings for both entities and patterns by maximizing the objective function J:

J = SG + Attract + Repel    (1)

where SG, Attract, and Repel are individual components of the objective function designed to model both the unsupervised, language-model part of the task as well as the light supervision coming from the seed examples, as detailed below.

The SG term captures the original objective function of the Skip-Gram model of Mikolov et al. (2013), but, crucially, adapted to operate over multi-word entities and contexts consisting not of bags of context words, but of the patterns that match each entity:

SG = \sum_{e} \Big[ \sum_{p \in P^{+}(e)} \log \sigma(v_e \cdot v_p) + \sum_{p' \in P^{-}(e)} \log \sigma(-v_e \cdot v_{p'}) \Big]    (2)

where e represents an entity, p represents a positive pattern, i.e., a pattern that matches entity e in the training texts, p' represents a negative pattern, i.e., one that has not been seen with this entity, v_x denotes the embedding of x, and \sigma is the sigmoid function. Intuitively, this component forces the embeddings of entities to be similar to the embeddings of the patterns that match them, and dissimilar to the negative patterns.
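As an illustration, the SG term (under the notation reconstructed in Eq. 2) can be computed as follows in numpy; E and P are assumed entity and pattern embedding matrices, and the index-pair lists are assumptions of ours. Training would maximize this quantity by gradient ascent rather than just evaluating it.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sg_term(E, P, pos_pairs, neg_pairs):
    """Skip-gram-style log-likelihood over (entity, pattern) index pairs.

    E, P: embedding matrices (one row per entity / pattern).
    pos_pairs: (entity, pattern) pairs observed in the training texts.
    neg_pairs: sampled (entity, pattern) pairs never observed together.
    """
    pos = sum(np.log(sigmoid(E[e] @ P[p])) for e, p in pos_pairs)
    neg = sum(np.log(sigmoid(-E[e] @ P[p])) for e, p in neg_pairs)
    return pos + neg
```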

The second component, Attract, encourages entities or patterns in the same pool to be close to each other. For example, if we have two entities in the pool known to be person names, they should be close to each other in the embedding space:

Attract = \sum_{C} \sum_{x_1, x_2 \in C} \log \sigma(v_{x_1} \cdot v_{x_2})    (3)

where C is the entity/pattern pool for a category, and x_1, x_2 are entities/patterns in said pool.

Lastly, the third term, Repel, encourages the pools to be mutually exclusive; this is a soft version of the counter-training approach of Yangarber (2003) or the weighted mutual-exclusion bootstrapping algorithm of McIntosh and Curran (2008). For example, person names should be far from organization names in the semantic embedding space:

Repel = \sum_{C_1 \neq C_2} \sum_{x_1 \in C_1} \sum_{x_2 \in C_2} \log \sigma(-v_{x_1} \cdot v_{x_2})    (4)

where C_1 and C_2 are different pools, and x_1 and x_2 are entities/patterns in C_1 and C_2, respectively.
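The two supervised terms have the same shape as the SG sketch above. Below is a sketch under the same assumed notation, where V is a shared embedding matrix and pools maps each category to the row indices of its members; again, names are ours.

```python
import numpy as np
from itertools import combinations

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attract_term(V, pools):
    """Eq. 3: pull members of the same pool toward each other."""
    return sum(np.log(sigmoid(V[i] @ V[j]))
               for pool in pools.values()
               for i, j in combinations(sorted(pool), 2))

def repel_term(V, pools):
    """Eq. 4: push members of different pools apart."""
    return sum(np.log(sigmoid(-V[i] @ V[j]))
               for c1, c2 in combinations(sorted(pools), 2)
               for i in pools[c1] for j in pools[c2])
```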

We term the complete algorithm that learns and uses custom embeddings Emboot (Embeddings for bootstrapping), and the stripped-down version without them EPB (Explicit Pattern-based Bootstrapping). (EPB is similar to Gupta and Manning (2015); the main difference is that we use pretrained embeddings in the entity promotion classifier rather than Brown clusters.)

Interpretable model

In addition to its primary output (the entity pools E_c), Emboot produces custom entity and pattern embeddings that can be used to construct a decision-list model, which provides a global, deterministic interpretation of what Emboot learned.

This interpretable model is constructed as follows. First, we produce an average embedding per category by averaging the embeddings of the entities in each pool E_c. Second, we estimate the cosine similarity between each of the pattern embeddings and these category embeddings, and convert them to a probability distribution using a softmax function; P(c | p) is the resulting probability of pattern p for class c. Third, each candidate entity to be classified, e, receives a score for a given class c from all learned patterns that match it. The entity score aggregates the relevant pattern probabilities using Noisy-Or:

Score(e, c) = 1 - \prod_{p \in P(e)} (1 - P(c \mid p))    (5)

where P(e) is the set of learned patterns that match e. Each entity is then assigned to the category with the highest overall score.
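A compact sketch of this construction, assuming the embeddings learned above are available as dicts/arrays; the centroid, softmax, and noisy-or steps follow the description above, but the function names are ours.

```python
import numpy as np

def category_centroids(entity_pools, entity_emb):
    """Average (then L2-normalized) entity embedding per category."""
    cats = sorted(entity_pools)
    M = np.stack([np.mean([entity_emb[e] for e in entity_pools[c]], axis=0)
                  for c in cats])
    return cats, M / np.linalg.norm(M, axis=1, keepdims=True)

def pattern_probs(p_vec, centroids):
    """P(c | p): softmax over cosine similarities to the category centroids."""
    sims = centroids @ (p_vec / np.linalg.norm(p_vec))
    exp = np.exp(sims - sims.max())
    return exp / exp.sum()

def classify(matching_pattern_vecs, cats, centroids):
    """Eq. 5: noisy-or of P(c | p) over the patterns matching an entity."""
    none_fire = np.ones(len(cats))
    for v in matching_pattern_vecs:
        none_fire *= 1.0 - pattern_probs(v, centroids)
    return cats[int(np.argmax(1.0 - none_fire))]
```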

4 Experiments

Figure 1: t-SNE visualizations of the entity embeddings at three stages during training: (a) embeddings initialized randomly; (b) bootstrapping epoch 5; (c) bootstrapping epoch 10. Markers denote the categories LOC, ORG, PER, and MISC.

Figure 2: Overall results on the CoNLL and OntoNotes datasets. Throughput is the number of entities classified, and precision is the proportion of entities that were classified correctly. Please see Sec. 4 for a description of the systems listed in the legend.

We evaluate the above algorithms on the task of named entity classification from free text.

Datasets:

We used two datasets: the CoNLL-2003 shared task dataset (Tjong Kim Sang and De Meulder, 2003), which contains 4 entity types, and the OntoNotes dataset (Pradhan et al., 2013), which contains 11 (we excluded numerical categories such as DATE). These datasets mark entity boundaries and provide a label for each marked entity; here we use only the entity boundaries, not the labels, during the training of our bootstrapping systems. To simulate learning from large texts, we tuned hyperparameters on the development partitions, but ran the actual experiments on the train partitions.

Baselines:

In addition to the EPB algorithm, we compare against the approach proposed by Gupta and Manning (2014) (https://nlp.stanford.edu/software/patternslearning.shtml). This algorithm is a simpler version of the EPB system, where entities are promoted with a PMI-based formula rather than an entity classifier. (We did not run this system on the OntoNotes dataset because it uses a built-in NE classifier with a predefined set of labels that did not match the OntoNotes labels.) Further, we compare against label propagation (LP) (Zhu and Ghahramani, 2002), using the implementation available in the scikit-learn package (http://scikit-learn.org/stable/modules/generated/sklearn.semi_supervised.LabelPropagation.html). In each bootstrapping epoch, we run LP, select the entities with the lowest entropy, and add them to their top category. Each entity is represented by a feature vector that contains the co-occurrence counts of the entity and each of the patterns that match it in text. (We experimented with other feature values, e.g., pattern PMI scores, but all performed worse than raw counts.)
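For concreteness, a minimal sketch of this LP baseline with scikit-learn, on toy inputs of our own making: X holds per-entity pattern co-occurrence counts, and -1 marks unlabeled entities (scikit-learn's convention).

```python
import numpy as np
from sklearn.semi_supervised import LabelPropagation

# Toy data: rows are entities, columns are pattern co-occurrence counts.
X = np.array([[2., 0., 1.],
              [0., 3., 0.],
              [1., 1., 0.],
              [0., 0., 2.]])
y = np.array([0, 1, -1, -1])  # -1 = unlabeled

lp = LabelPropagation(kernel="rbf").fit(X, y)

# Promote the lowest-entropy unlabeled entities to their predicted category.
dist = lp.label_distributions_
entropy = -(dist * np.log(dist + 1e-12)).sum(axis=1)
unlabeled = np.flatnonzero(y == -1)
order = unlabeled[np.argsort(entropy[unlabeled])]
print(list(zip(order, lp.transduction_[order])))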

Settings:

For all baselines and proposed models, we used the same set of 10 seeds per category, which were manually chosen from the most frequent entities in the dataset. We used dependency-based word embeddings (Levy and Goldberg, 2014b) of size 300 as the pretrained embedding vectors in the entity promotion classifier. For the custom embedding features, we used randomly initialized 15-dimensional embeddings. Here we consider patterns to be n-grams of up to 4 tokens on either side of an entity. For instance, “@ENTITY , former President” is one of the patterns learned for the class person. We ran all algorithms for 20 bootstrapping epochs, and the embedding learning component for 100 epochs within each bootstrapping epoch. We add 10 entities and 10 patterns to each category during every bootstrapping epoch.
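To illustrate this pattern definition, a small hypothetical sketch that generates the surface contexts of an entity (up to 4 tokens on one side, with the entity replaced by @ENTITY); it works over a flat token list and ignores any further preprocessing the authors may apply.

```python
def surface_patterns(tokens, start, end, max_len=4):
    """Patterns for the entity spanning tokens[start:end]."""
    patterns = []
    for n in range(1, max_len + 1):
        left = tokens[max(0, start - n):start]
        right = tokens[end:end + n]
        if len(left) == n:    # full n-token left context available
            patterns.append(" ".join(left + ["@ENTITY"]))
        if len(right) == n:   # full n-token right context available
            patterns.append(" ".join(["@ENTITY"] + right))
    return patterns

# e.g. surface_patterns(["Clinton", ",", "former", "President"], 0, 1)
# -> ["@ENTITY ,", "@ENTITY , former", "@ENTITY , former President"]
```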

5 Discussion and Conclusion

Before we discuss overall results, we provide a qualitative analysis of the learning process of Emboot on the CoNLL dataset in Figure 1. The figure shows t-SNE visualizations (van der Maaten and Hinton, 2008) of the entity embeddings at several stages of the algorithm. This visualization matches our intuition: as training advances, entities belonging to the same category are indeed grouped together. In particular, Figure 1(c) shows five clusters, four of which are dominated by one category (and centered around the corresponding seeds), and one, in the upper left corner, with the entities that have not yet been added to any of the pools.
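The visualization itself is straightforward to reproduce; a sketch with scikit-learn and matplotlib, using random stand-ins for the learned 15-dimensional embeddings and pool assignments:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
emb = rng.normal(size=(200, 15))       # stand-in for learned embeddings
pools = rng.integers(0, 4, size=200)   # stand-in for pool membership

xy = TSNE(n_components=2, random_state=0).fit_transform(emb)
for c, name in enumerate(["LOC", "ORG", "PER", "MISC"]):
    pts = xy[pools == c]
    plt.scatter(pts[:, 0], pts[:, 1], s=10, label=name)
plt.legend()
plt.show()
```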

A quantitative comparison of the different models on the two datasets is shown in Figure 2. Emboot considerably outperforms LP and Gupta and Manning (2014), and shows a small but consistent improvement over EPB. This demonstrates the value of our approach, and the importance of custom embeddings.

Importantly, we compare Emboot against: (a) its interpretable version (Emboot_int), which is constructed as a decision list containing the patterns learned (and scored) after each bootstrapping epoch, and (b) an interpretable system built similarly for EPB (EPB_int), using the pretrained Levy and Goldberg embeddings rather than our custom ones. This analysis shows that Emboot_int performs close to Emboot on both datasets, demonstrating that most of the benefits of representation learning are available in an interpretable model. Further, the large gap between Emboot_int and EPB_int indicates that the custom embeddings are critical for the interpretable model.

Note that Emboot_int’s decisions are easy to interpret. Due to the sparsity of patterns, the majority of predictions are triggered by 1 or 2 patterns. For example, the entity “Syrian” is correctly classified as MISC (which includes demonyms) due to two patterns matching it in the CoNLL dataset: “@ENTITY President” and “@ENTITY troops”. In general, for the CoNLL dataset, 59% of Emboot_int’s predictions are triggered by 1 or 2 patterns; 84% are generated by 5 or fewer patterns; only 1.1% of predictions are generated by 10 or more patterns.

This work demonstrates that representation learning can be successfully combined with traditional, pattern-based bootstrapping, yielding models that perform well despite the limited supervision, and that are interpretable, i.e., end users can understand why an extraction was generated.

References

  • Batista et al. (2015) David S Batista, Bruno Martins, and Mário J Silva. 2015. Semi-supervised bootstrapping of relationship extractors with distributional semantics. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).
  • Carlson et al. (2010) Andrew Carlson, Justin Betteridge, Richard C Wang, Estevam R Hruschka Jr, and Tom M Mitchell. 2010. Coupled semi-supervised learning for information extraction. In Proceedings of the Third ACM International Conference on Web Search and Data Mining. ACM, pages 101–110.
  • Chiticariu et al. (2013) Laura Chiticariu, Yunyao Li, and Frederick R Reiss. 2013. Rule-based information extraction is dead! Long live rule-based information extraction systems! In EMNLP, pages 827–832.
  • Collins and Singer (1999) Michael Collins and Yoram Singer. 1999. Unsupervised models for named entity classification. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.
  • Craven and Shavlik (1996) Mark W Craven and Jude W Shavlik. 1996. Extracting tree-structured representations of trained networks. In Advances in Neural Information Processing Systems, pages 24–30.
  • Gupta and Manning (2014) Sonal Gupta and Christopher D Manning. 2014. Improved pattern learning for bootstrapped entity extraction. In CoNLL. pages 98–108.
  • Gupta and Manning (2015) Sonal Gupta and Christopher D. Manning. 2015. Distributed representations of words to guide bootstrapped entity classifiers. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics.
  • Lakkaraju and Rudin (2016) Himabindu Lakkaraju and Cynthia Rudin. 2016. Learning cost-effective treatment regimes using Markov decision processes. CoRR abs/1610.06972. http://arxiv.org/abs/1610.06972.
  • Levy and Goldberg (2014a) Omer Levy and Yoav Goldberg. 2014a. Dependency-based word embeddings. In ACL (2). pages 302–308.
  • Levy and Goldberg (2014b) Omer Levy and Yoav Goldberg. 2014b. Dependency-based word embeddings. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Association for Computational Linguistics, Baltimore, Maryland, pages 302–308. http://www.aclweb.org/anthology/P14-2050.
  • McIntosh (2010) Tara McIntosh. 2010. Unsupervised discovery of negative categories in lexicon bootstrapping. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, pages 356–365.
  • McIntosh and Curran (2008) Tara McIntosh and James R Curran. 2008. Weighted mutual exclusion bootstrapping for domain independent lexicon and template acquisition. In Proceedings of the Australasian Language Technology Association Workshop. volume 2008.
  • Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.
  • Pradhan et al. (2013) Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Hwee Tou Ng, Anders Björkelund, Olga Uryupina, Yuchen Zhang, and Zhi Zhong. 2013. Towards robust linguistic analysis using OntoNotes. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning. Association for Computational Linguistics, Sofia, Bulgaria, pages 143–152. http://www.aclweb.org/anthology/W13-3516.
  • Ribeiro et al. (2016a) Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016a. “Why should I trust you?”: Explaining the predictions of any classifier. In Knowledge Discovery and Data Mining (KDD).
  • Ribeiro et al. (2016b) Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016b. “Why should I trust you?”: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, pages 1135–1144.
  • Ribeiro et al. (2018) Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2018. Anchors: High-precision model-agnostic explanations. In AAAI Conference on Artificial Intelligence (AAAI).
  • Riedel et al. (2013) Sebastian Riedel, Limin Yao, Andrew McCallum, and Benjamin M Marlin. 2013. Relation extraction with matrix factorization and universal schemas. In Proceedings of NAACL-HLT.
  • Riloff (1996) Ellen Riloff. 1996. Automatically generating extraction patterns from untagged text. In Proceedings of the National Conference on Artificial Intelligence, pages 1044–1049.
  • Sculley et al. (2014) D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, and Michael Young. 2014. Machine learning: The high interest credit card of technical debt. In SE4ML: Software Engineering for Machine Learning (NIPS 2014 Workshop).
  • Sharp et al. (2016) Rebecca Sharp, Mihai Surdeanu, Peter Jansen, Peter Clark, and Michael Hammond. 2016. Creating causal embeddings for question answering with minimal supervision. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).
  • Tjong Kim Sang and De Meulder (2003) Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Walter Daelemans and Miles Osborne, editors, Proceedings of CoNLL-2003. Edmonton, Canada, pages 142–147.
  • Toutanova et al. (2015) Kristina Toutanova, Danqi Chen, Patrick Pantel, Hoifung Poon, Pallavi Choudhury, and Michael Gamon. 2015. Representing text for joint embedding of text and knowledge bases. In EMNLP, volume 15, pages 1499–1509.
  • Toutanova et al. (2016) Kristina Toutanova, Xi Victoria Lin, Wen-tau Yih, Hoifung Poon, and Chris Quirk. 2016. Compositional learning of embeddings for relation paths in knowledge bases and text. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. volume 1, pages 1434–1444.
  • Valenzuela-Escárcega et al. (2016) Marco A Valenzuela-Escárcega, Gus Hahn-Powell, Dane Bell, and Mihai Surdeanu. 2016. Snaptogrid: From statistical to interpretable models for biomedical information extraction. In Proceedings of the 15th Workshop on Biomedical Natural Language Processing. pages 56–65.
  • van der Maaten and Hinton (2008) Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research 9:2579–2605.
  • Yangarber (2003) Roman Yangarber. 2003. Counter-training in discovery of semantic patterns. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.
  • Zhu and Ghahramani (2002) X. Zhu and Z. Ghahramani. 2002. Learning from labeled and unlabeled data with label propagation. Technical Report CMU-CALD-02-107, Carnegie Mellon University. citeseer.ist.psu.edu/zhu02learning.html.