An Operator for Entity Extraction in MapReduce

12/15/2015
by Ndapandula Nakashole, et al.

Dictionary-based entity extraction involves finding mentions of dictionary entities in text. Text mentions are often noisy, containing spurious or missing words. Efficient algorithms for detecting approximate entity mentions follow one of two general techniques. The first approach builds an index on the entities and performs index lookups of document substrings. The second approach recognizes that the number of substrings generated from documents can grow very large; to get around this, it uses a filter to prune the many substrings that cannot match any dictionary entity, and then verifies, by means of a text join, whether the remaining substrings are mentions of dictionary entities. The choice between the index-based approach and the filter-and-verification approach must be made case by case, as the best approach depends on the characteristics of the input entity dictionary, for example the frequency of entity mentions. Choosing the right approach for the setting can make a substantial difference in execution time. Making this choice is, however, non-trivial, as the parameters within each approach make the space of possible choices very large. In this paper, we present a cost-based operator for choosing among execution plans for entity extraction. Since we need to deal with large dictionaries and even larger datasets, our operator is designed for distributed algorithms implemented in MapReduce.
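To make the two techniques concrete, the following is a minimal single-machine sketch, not the paper's operator or its MapReduce implementation. Several things here are assumptions made for illustration only: token-set Jaccard similarity as the approximate-match criterion, the particular vocabulary-overlap filter, and all function names (`index_based`, `filter_and_verify`, and the helpers).

```python
from typing import Dict, Iterator, List, Set, Tuple


def tokens(text: str) -> List[str]:
    """Lowercase whitespace tokenization (an assumption for this sketch)."""
    return text.lower().split()


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Token-set Jaccard similarity; stands in for any approximate-match measure."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def substrings(doc_tokens: List[str], max_len: int) -> Iterator[List[str]]:
    """Enumerate every token substring of length 1..max_len."""
    for i in range(len(doc_tokens)):
        for j in range(i + 1, min(i + max_len, len(doc_tokens)) + 1):
            yield doc_tokens[i:j]


def index_based(doc: str, dictionary: List[str], threshold: float,
                max_len: int) -> Set[Tuple[str, str]]:
    """Approach 1: index the entities, then look up each document substring."""
    entity_sets = [set(tokens(e)) for e in dictionary]
    index: Dict[str, Set[int]] = {}
    for eid, ent in enumerate(entity_sets):
        for tok in ent:
            index.setdefault(tok, set()).add(eid)

    matches: Set[Tuple[str, str]] = set()
    for sub in substrings(tokens(doc), max_len):
        sub_set = set(sub)
        # Index lookup: only entities sharing at least one token are candidates.
        candidates: Set[int] = set()
        for tok in sub_set:
            candidates |= index.get(tok, set())
        for eid in candidates:
            if jaccard(sub_set, entity_sets[eid]) >= threshold:
                matches.add((" ".join(sub), dictionary[eid]))
    return matches


def filter_and_verify(doc: str, dictionary: List[str], threshold: float,
                      max_len: int) -> Set[Tuple[str, str]]:
    """Approach 2: prune substrings with a cheap filter, then verify by a join."""
    entity_sets = [set(tokens(e)) for e in dictionary]
    vocabulary: Set[str] = set().union(*entity_sets) if entity_sets else set()

    # Filter: a substring can only reach the threshold against some entity if at
    # least threshold * |substring| of its distinct tokens occur in the dictionary.
    survivors = []
    for sub in substrings(tokens(doc), max_len):
        sub_set = set(sub)
        if len(sub_set & vocabulary) >= threshold * len(sub_set):
            survivors.append((sub, sub_set))

    # Verify: join the surviving substrings against every entity.
    matches: Set[Tuple[str, str]] = set()
    for sub, sub_set in survivors:
        for eid, ent in enumerate(entity_sets):
            if jaccard(sub_set, ent) >= threshold:
                matches.add((" ".join(sub), dictionary[eid]))
    return matches


dictionary = ["new york city", "san francisco"]
doc = "I flew from new york to san francisco bay yesterday"
by_index = index_based(doc, dictionary, threshold=0.6, max_len=3)
by_filter = filter_and_verify(doc, dictionary, threshold=0.6, max_len=3)
```

Both functions return the same matches on this input; they differ only in where the work is done. Which one is cheaper depends on factors such as how many substrings survive the filter and how large the index is, which is the kind of trade-off a cost-based choice between plans would have to weigh.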


research
11/21/2019

Entity Extraction with Knowledge from Web Scale Corpora

Entity extraction is an important task in text mining and natural langua...
research
01/14/2022

The Lokahi Prototype: Toward the automatic Extraction of Entity Relationship Models from Text

Entity relationship extraction envisions the automatic generation of sem...
research
05/03/2018

Towards Better Text Understanding and Retrieval through Kernel Entity Salience Modeling

This paper presents a Kernel Entity Salience Model (KESM) that improves ...
research
08/01/2017

A Lightweight Front-end Tool for Interactive Entity Population

Entity population, a task of collecting entities that belong to a partic...
research
11/24/2022

Detecting Entities in the Astrophysics Literature: A Comparison of Word-based and Span-based Entity Recognition Methods

Information Extraction from scientific literature can be challenging due...
research
02/28/2022

GausSetExpander: A Simple Approach for Entity Set Expansion

Entity Set Expansion is an important NLP task that aims at expanding a s...
research
06/05/2022

Story Beyond the Eye: Glyph Positions Break PDF Text Redaction

In the past redaction involved the use of black or white markers or pape...
