Entity extraction is one of the most important NLP components. Most NLP tools (e.g., NLTK, Stanford CoreNLP), including commercial services (e.g., Google Cloud API, Alchemy API), provide entity extraction functions to recognize named entities (e.g., PERSON, LOCATION, ORGANIZATION) in texts. Some studies have defined fine-grained entity types and developed extraction methods (Ling & Weld, 2012) based on these types. However, these methods cannot comprehensively cover domain-specific entities. For instance, a real estate search engine needs housing equipment names so that it can index these terms and offer fine-grained search conditions. There is significant demand for constructing such user-specific entity dictionaries; another example is cuisine and ingredient names for restaurant services. A straightforward solution is to prepare a set of these entity names as a domain-specific dictionary. Therefore, this paper focuses on the entity population task, which is the task of collecting entities that belong to an entity type required by a user.
We develop LUWAK, a lightweight tool for effective interactive entity population. Its key features include:
- An entity table dashboard for quickly viewing and modifying a dictionary
- A feedback table dashboard for supporting effective interactive entity population
- Entity highlighting on documents for quickly viewing the performance of the current entity dictionary
We think these features are key components for effective interactive entity population.
We choose an interactive user feedback strategy for entity population in LUWAK. A major approach to entity population is bootstrapping, which uses several entities that have been prepared as a seed set to find new entities. These new entities are then integrated into the initial seed set to create a new seed set. The bootstrapping approach usually repeats this procedure until it has collected a sufficient number of entities. Without user interaction between iterations, the framework cannot prevent the incorporation of incorrect entities that do not belong to the entity type. This problem is commonly called semantic drift (Curran et al., 2007). Therefore, we consider user interaction, in which feedback is given on expanded candidates, essential to maintaining the quality of an entity set. LUWAK implements the fundamental functions for entity population, including (a) importing an initial entity set, (b) generating entity candidates, (c) obtaining user feedback, and (d) publishing the populated entity dictionary.
We treat the user’s total workload as the key metric of an entity population tool and aim to reduce it. That is, an entity population tool should provide the easiest and fastest way to collect entities of a particular entity type. User interaction cost is a dominant factor in the overall workload of an interactive tool. Thus, we carefully design the user interface so that users can give feedback to the tool intuitively. Furthermore, we also consider end-to-end user cost reduction. We adhere to the concept of installation-free software so that the tool can be distributed among a wide variety of users, including nontechnical users. This lightweight design of LUWAK might speed up the whole interactive entity population workflow. It might also make it easier to continuously improve the whole pipeline of an interactive entity population system.
2 LUWAK: A lightweight tool for interactive entity population
Our framework adopts the interactive entity expansion approach. This approach organizes the collaboration of a human worker and entity expansion algorithms to generate a user-specific entity dictionary efficiently. We show the basic workflow of LUWAK in Figure 1. (Step 1) LUWAK assumes that a user prepares an initial seed set manually. The seed set is shown in the Entity table. (Step 2) A user can send entities in the Entity table to an Expansion API to obtain entity candidates. (Step 3) LUWAK shows the entity candidates in the Candidate table for user interaction. Then, the user clicks accept/reject buttons to update the Entity table. After the judgments are submitted, LUWAK shows the Entity table again. The user can directly add, edit, or delete entities in the table at any time. (Step 4) The user can also easily see how the entities stored in the Entity table appear in a document. (Step 5) After repeating the same procedure (Steps 2–4) a sufficient number of times, the user can publish the Entity table as an output.
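The loop formed by Steps 2 and 3 can be sketched in a few lines of Python. This is only an illustration of the workflow, not LUWAK's implementation: the `populate` function, the `ExpandFn` and `JudgeFn` types, and the toy neighbor table are all hypothetical, and the `judge` callback stands in for the user's accept/reject clicks.

```python
from typing import Callable, Dict, List, Set

# Hypothetical types: an expansion function maps positive seeds to candidates,
# and a judge function stands in for the user's accept/reject decision.
ExpandFn = Callable[[Set[str]], List[str]]
JudgeFn = Callable[[str], bool]


def populate(seeds: Set[str], expand: ExpandFn, judge: JudgeFn, rounds: int) -> Dict[str, bool]:
    """Repeat expand -> feedback -> update for a fixed number of rounds."""
    table: Dict[str, bool] = {s: True for s in seeds}  # entity -> has positive label
    for _ in range(rounds):
        positives = {e for e, ok in table.items() if ok}
        for cand in expand(positives):
            if cand not in table:          # skip entities that were already judged
                table[cand] = judge(cand)  # user feedback decides the label
    return table


# Toy stand-ins for an Expansion API and a human judge (assumed data).
neighbors = {"kitchen": ["bath", "garage"], "bath": ["shower", "beach"]}


def toy_expand(positives: Set[str]) -> List[str]:
    return [c for e in sorted(positives) for c in neighbors.get(e, [])]


accepted = {"bath", "shower"}
result = populate({"kitchen"}, toy_expand, lambda c: c in accepted, rounds=2)
positives = sorted(e for e, ok in result.items() if ok)
```

Because rejected candidates are stored with a negative label and never passed back to `expand`, an incorrect entity such as `beach` cannot seed later rounds, which is the point of placing feedback between iterations.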
2.2 LUWAK Dashboard
LUWAK has a dashboard for quickly viewing an entity dictionary in progress. The dashboard consists of two tables: the Entity table and the Feedback table. The Entity table provides efficient ways to construct and modify an entity dictionary. Figure 2 shows a screenshot of the Entity table. The table shows the entities in the current entity set. Each row corresponds to an entity entry. Each entry has a label, which denotes whether the entity is a positive or a negative example of the predefined entity type; an original entity, which is the entity that was used to find it; and a score, which denotes the confidence score. A user can directly edit the table by adding, renaming, and deleting entities. Moreover, the entity inactivation function allows a user to manually inactivate entities so that entity expansion algorithms do not use them. The table implements page switching, search, and sorting functions to keep entities easy to find even when the table contains a large number of them.
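One way to picture an Entity table row is as a small record with the fields described above. The following sketch is an assumption for illustration only: the class name `EntityEntry`, its field names, and the `expansion_seeds` helper are hypothetical and do not describe LUWAK's internal data model.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EntityEntry:
    """Hypothetical row of the Entity table."""
    name: str
    positive: bool = True            # label: positive or negative example
    original: Optional[str] = None   # entity that was used to find this one
    score: Optional[float] = None    # confidence score reported by the expansion
    active: bool = True              # inactivated entries are kept but not expanded


def expansion_seeds(table: List[EntityEntry]) -> List[str]:
    """Entities that would be sent to an expansion algorithm: positive and active."""
    return [e.name for e in table if e.positive and e.active]


table = [
    EntityEntry("kitchen"),
    EntityEntry("bath", original="kitchen", score=0.81),
    EntityEntry("beach", positive=False, original="bath", score=0.42),
    EntityEntry("garage", original="kitchen", score=0.77, active=False),
]
seeds = expansion_seeds(table)
```

Keeping inactivated entries in the table, rather than deleting them, preserves the user's earlier decisions while excluding those entries from further expansion.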
2.3 Entity Candidate Generation
We design the entity candidate generation module as an external API (Expansion API). The Expansion API receives a set of entities with positive labels and returns the top-k entity candidates.
As an initial implementation, we used GloVe (Pennington et al., 2014) word embedding models to implement an Expansion API. This API calculates the cosine similarity between a set of positive entities and entity candidates to generate a ranked list. We prepared models trained on the CommonCrawl corpus and the Twitter corpus (http://nlp.stanford.edu/projects/glove/). Note that the expansion algorithm is not limited to the one described in this paper, as LUWAK treats the Expansion API as an external function.
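A minimal sketch of embedding-based expansion is given below. The text does not specify how the set of positive entities is aggregated, so this sketch assumes the seeds are averaged into a centroid vector before scoring; the function names and the two-dimensional toy vectors are hypothetical and stand in for real GloVe vectors.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand(positives: Sequence[str],
           embeddings: Dict[str, Sequence[float]],
           k: int) -> List[Tuple[str, float]]:
    """Rank non-seed words by cosine similarity to the seed centroid (an assumption)."""
    seeds = [embeddings[p] for p in positives if p in embeddings]
    if not seeds:
        return []
    dim = len(seeds[0])
    centroid = [sum(v[i] for v in seeds) / len(seeds) for i in range(dim)]
    scored = [(word, cosine(centroid, vec))
              for word, vec in embeddings.items() if word not in positives]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors for illustration only.
toy = {
    "kitchen": [1.0, 0.1],
    "bath": [0.9, 0.2],
    "shower": [0.8, 0.3],
    "beach": [0.1, 1.0],
}
top = expand(["kitchen", "bath"], toy, k=1)
```

In this toy space `shower` lies close to the seed centroid and is returned first, while `beach` points in a different direction and is ranked below it.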
Moreover, we also utilize a category-based expansion module, which uses the is-a relationship between an ontological category and each entity to expand seeds at the category level. For example, when we develop a job skill name dictionary, if most of the entities already inserted in the dictionary share the same category, such as Programming Languages, the system suggests that the other "Programming Language" entities be inserted in the dictionary as well. Category-based entity expansion helps the user avoid judging candidate entities one by one. We used Yago (Hoffart et al., 2013) as an existing knowledge base.
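The category-level idea can be sketched as follows. This is an illustrative assumption rather than the module's actual logic: the flat `is_a` mapping stands in for a knowledge base such as Yago, and the majority `threshold` of 0.5 is a hypothetical parameter chosen for the example.

```python
from collections import Counter
from typing import Dict, List, Optional, Set


def dominant_category(entities: Set[str], is_a: Dict[str, str],
                      threshold: float = 0.5) -> Optional[str]:
    """Return the category shared by more than `threshold` of the entities, if any."""
    counts = Counter(is_a[e] for e in entities if e in is_a)
    if not counts:
        return None
    category, n = counts.most_common(1)[0]
    return category if n / len(entities) > threshold else None


def expand_by_category(entities: Set[str], is_a: Dict[str, str],
                       threshold: float = 0.5) -> List[str]:
    """Suggest every other member of the dominant category in one step."""
    category = dominant_category(entities, is_a, threshold)
    if category is None:
        return []
    return sorted(e for e, c in is_a.items() if c == category and e not in entities)


# Toy is-a relations (assumed data, not taken from Yago).
is_a = {
    "Python": "Programming Language",
    "Java": "Programming Language",
    "Haskell": "Programming Language",
    "Rust": "Programming Language",
    "Excel": "Software",
}
suggestions = expand_by_category({"Python", "Java", "Excel"}, is_a)
```

Here two of the three dictionary entries are programming languages, so the remaining members of that category are suggested together instead of being surfaced one candidate at a time.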
External API. In our design, Expansion APIs are placed outside LUWAK as external functions. There are three reasons why we adopt this design. First, we want LUWAK to remain a corpus-free tool. Users do not have to download any corpora or models to start using LUWAK, since launching an Expansion API server takes too much time. Second, LUWAK’s design allows external contributors to build their own Expansion APIs that are compatible with LUWAK’s interface. We developed the initial version of the LUWAK package to contain an entity Expansion API so that users can also launch their Expansion APIs internally. Third, the separation between LUWAK and the Expansion APIs enables Expansion APIs to use predetermined options for algorithms, including non-embedding-based methods (e.g., pattern-based methods). We can use more than one entity expansion model to find related entities. For instance, general embedding models, such as those built on Wikipedia, might be a good choice in early iterations, whereas more domain-specific models trained on domain-specific corpora might be helpful in later iterations. LUWAK makes it easy to switch between Expansion APIs and to use more than one at a time. This design also makes it easy for us to refine the entity expansion module continuously.
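To illustrate how several interchangeable Expansion APIs might be combined, the sketch below defines a hypothetical interface and merges the ranked lists from two backends. The `ExpansionAPI` protocol, the `StaticExpansion` stand-in, and the max-score merge rule are assumptions made for this example; LUWAK's actual interface and merge behavior are not specified here.

```python
from typing import Dict, List, Protocol, Sequence, Tuple


class ExpansionAPI(Protocol):
    """Hypothetical interface: positive entities in, ranked (candidate, score) out."""
    def expand(self, positives: Sequence[str], k: int) -> List[Tuple[str, float]]: ...


class StaticExpansion:
    """Stand-in backend that returns a fixed ranking, for illustration only."""
    def __init__(self, ranked: List[Tuple[str, float]]) -> None:
        self._ranked = ranked

    def expand(self, positives: Sequence[str], k: int) -> List[Tuple[str, float]]:
        return [(c, s) for c, s in self._ranked if c not in positives][:k]


def merge(apis: Sequence[ExpansionAPI], positives: Sequence[str],
          k: int) -> List[Tuple[str, float]]:
    """Query every backend and keep each candidate's best score (an assumed rule)."""
    best: Dict[str, float] = {}
    for api in apis:
        for cand, score in api.expand(positives, k):
            best[cand] = max(score, best.get(cand, float("-inf")))
    return sorted(best.items(), key=lambda pair: (-pair[1], pair[0]))[:k]


general = StaticExpansion([("bath", 0.9), ("shower", 0.6), ("garden", 0.5)])
domain = StaticExpansion([("walk-in closet", 0.8), ("shower", 0.7)])
merged = merge([general, domain], ["bath"], k=3)
```

Scores from different models are not necessarily on a comparable scale, so a real deployment might prefer rank-based merging or might simply show each backend's candidates separately.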
2.4 Example: Housing Equipment Entity Population
We show an example of populating housing equipment entities using LUWAK to improve a real estate search engine. The preliminary step is to prepare seed entities that belong to the housing equipment entity type (e.g., kitchen, bath). In this case, a user is supposed to provide several entities (about 10) as an initial set for the category. LUWAK first asks the user to upload an initial seed set. The user can add, rename, and delete entities in the Entity table as needed. The user can also choose a set of entity expansion models at any time. Figure 2 shows the entity dashboard in this example.
When the user submits the current entity set by clicking the Expand Seed Set button (Figure 2), LUWAK sends a request to the selected external Expansion APIs to obtain expanded entities. The returned values are stored in the Feedback table, as Figure 2 shows. The Feedback table provides a function to capture user feedback intuitively. The user can click the + or - buttons to assign positive or negative labels to the entity candidates. The score column stores the similarity score calculated by the Expansion API, which serves as reference information for users. The user can also see how the candidates were generated by looking at the original entities in the original column. The original entity information can be used to detect semantic drift. For instance, if the user finds that the candidates generated from a particular original entity receive negative labels, the user might consider inactivating that entity to prevent semantic drift.
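The use of the original column for spotting semantic drift can be made concrete with a small helper. This is one possible reading of the idea, not a feature the tool is stated to automate: the `drift_suspects` function, the `Feedback` tuple layout, and the rejection-ratio threshold are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical feedback record: (candidate, original entity, accepted by user?)
Feedback = Tuple[str, str, bool]


def drift_suspects(feedback: List[Feedback], min_rejected_ratio: float = 0.5) -> List[str]:
    """Original entities whose candidates were mostly rejected by the user."""
    accepted: Dict[str, int] = defaultdict(int)
    rejected: Dict[str, int] = defaultdict(int)
    for _, original, ok in feedback:
        if ok:
            accepted[original] += 1
        else:
            rejected[original] += 1
    suspects = []
    for original in set(accepted) | set(rejected):
        total = accepted[original] + rejected[original]
        if rejected[original] / total > min_rejected_ratio:
            suspects.append(original)
    return sorted(suspects)


feedback = [
    ("shower", "bath", True),
    ("bathtub", "bath", True),
    ("island", "kitchen", False),
    ("counter", "kitchen", True),
    ("sand", "terrace", False),
    ("ocean", "terrace", False),
]
suspects = drift_suspects(feedback)
```

In this toy data every candidate derived from `terrace` was rejected, so `terrace` is flagged as an entity the user might want to inactivate.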
In the next step, the user applies the feedback by clicking the Submit Feedback button. The user then sees the entity dashboard with the newly added entities, as shown in Figure 2. The user can inactivate an entity by clicking the inactivate button. The user can sort rows by column values to get a quick overview of the current entity set. The entity dashboard also provides a search function for finding an entity to act on. The user can also check how entities appear in a test document. As shown in Figure 2, LUWAK highlights the entities in the current entity set. Once the user is satisfied with the size of the current entity set in the table, the Export button allows the user to download the entire table, including inactivated entities.
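Entity highlighting in a test document can be approximated with a single regular expression built from the current entity set. The sketch below is an assumption about one simple way to do this: the `highlight` function and the `[[ ]]` markers are hypothetical, and matching is case-insensitive on word boundaries, which works for plain word-like names but not for names that begin or end with punctuation.

```python
import re
from typing import Iterable


def highlight(text: str, entities: Iterable[str],
              left: str = "[[", right: str = "]]") -> str:
    """Wrap every occurrence of a dictionary entity in marker strings."""
    # Longest names first so that "system kitchen" wins over "kitchen".
    names = sorted({e for e in entities if e}, key=len, reverse=True)
    if not names:
        return text
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(n) for n in names) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: f"{left}{m.group(0)}{right}", text)


doc = "The unit has a system kitchen and a bath with a shower."
marked = highlight(doc, ["kitchen", "system kitchen", "bath"])
```

Words left unmarked in the output, such as `shower` here, give the user a quick visual cue about entities that the current dictionary does not yet cover.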
3 Related Work and Discussion
Entity population is an important practical problem in NLP. Generated entity dictionaries can be used in various applications, including search engines, named entity extraction, and entity linking. Iterative seed expansion is known to be an efficient approach to constructing user-specific entity dictionaries. Previous studies have aimed to construct a high-quality entity dictionary from a small number of seed entities (Ghahramani & Heller, 2005; He & Xin, 2011; Tao et al., 2015; Rong et al., 2016). As stated in Section 2.3, LUWAK is flexible with respect to the types of algorithms used for entity population. A user can select any combination of methods as long as Expansion APIs for those methods are available.
Stanford Pattern-based Information Extraction and Diagnostics (SPIED) (Gupta & Manning, 2014) is a pattern-based entity population system. Because it uses a pattern-based approach, SPIED requires not only an initial seed set but also a document collection. After a user inputs initial seed entities, SPIED generates regular expression patterns to find entity candidates in the given document collection. This approach incurs a huge computational cost, since the scores of every regular expression pattern and every entity candidate must be calculated in each iteration. Furthermore, SPIED adopts a bootstrapping approach that does not involve user feedback in each iteration, which can easily result in semantic drift.
Interactive Knowledge Extraction (IKE) (Dalvi et al., 2016) is an interactive bootstrapping tool for collecting relation-extraction patterns. IKE also provides a search-based entity extraction function and an embedding-based entity expansion function for entity population. A user can interactively add entity candidates generated by an embedding-based algorithm to an entity dictionary. LUWAK is a more lightweight tool than IKE and focuses only on the entity population task. LUWAK has features, such as the choice of multiple entity expansion models, that are not implemented in IKE. Moreover, LUWAK is a corpus-free tool that does not require a document collection for entity population. Thus, we differentiate LUWAK from IKE as a more lightweight entity population tool.
- Curran et al. (2007) Curran, James R, Murphy, Tara, and Scholz, Bernhard. Minimising semantic drift with Mutual Exclusion Bootstrapping. In Proceedings of the 10th Conference of the Pacific Association for Computational Linguistics (PACLING ’07), pp. 172–180, 2007.
- Dalvi et al. (2016) Dalvi, Bhavana, Bhakthavatsalam, Sumithra, Clark, Chris, Clark, Peter, Etzioni, Oren, Fader, Anthony, and Groeneveld, Dirk. IKE - An Interactive Tool for Knowledge Extraction. In 5th AKBC Workshop, pp. 12–17, 2016.
- Ghahramani & Heller (2005) Ghahramani, Zoubin and Heller, Katherine A. Bayesian sets. In Advances in Neural Information Processing Systems 18 (NIPS ’05), pp. 435–442, 2005.
- Gupta & Manning (2014) Gupta, Sonal and Manning, Christopher D. SPIED: Stanford Pattern-based Information Extraction and Diagnostics. In Proceedings of the ACL 2014 Workshop on Interactive Language Learning, Visualization, and Interfaces (ACL-ILLVI), pp. 38–44, 2014. ISBN 9781941643150.
- He & Xin (2011) He, Yeye and Xin, Dong. SEISA: Set Expansion by Iterative Similarity Aggregation. In Proceedings of the 20th international conference on World wide web (WWW ’11), pp. 427. ACM Press, 2011. ISBN 9781450306324.
- Hoffart et al. (2013) Hoffart, Johannes, Suchanek, Fabian M., Berberich, Klaus, and Weikum, Gerhard. YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia. Artificial Intelligence, 194:28–61, jan 2013. ISSN 00043702. doi: 10.1016/j.artint.2012.06.001.
- Ling & Weld (2012) Ling, Xiao and Weld, DS. Fine-Grained Entity Recognition. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI ’12), pp. 94–100, 2012. ISBN 9781577355687.
- Pennington et al. (2014) Pennington, Jeffrey, Socher, Richard, and Manning, Christopher. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP ’14), pp. 1532–1543. Association for Computational Linguistics, 2014.
- Rong et al. (2016) Rong, Xin, Chen, Zhe, Mei, Qiaozhu, and Adar, Eytan. EgoSet: Exploiting Word Ego-networks and User-generated Ontology for Multifaceted Set Expansion. In Proceedings of the Ninth ACM International Conference on Web Search and Data Mining (WSDM ’16), pp. 645–654, New York, New York, USA, 2016. ACM Press. ISBN 9781450337168.
- Tao et al. (2015) Tao, Fangbo, Zhao, Bo, Fuxman, Ariel, Li, Yang, and Han, Jiawei. Leveraging Pattern Semantics for Extracting Entities in Enterprises. In Proceedings of the 24th International Conference on World Wide Web (WWW ’15), pp. 1078–1088. ACM Press, 2015. ISBN 9781450334693.