Skyline Operators for Document Spanners

04/12/2023
by Antoine Amarilli, et al.

When extracting a relation of spans (intervals) from a text document, a common practice is to filter out tuples of the relation that are deemed dominated by others. The domination rule is defined as a partial order that varies across systems and tasks. For example, we may state that a tuple is dominated by tuples that extend it, either by assigning additional attributes or by assigning larger intervals. The result of filtering the relation is then the skyline according to this partial order. As this filtering may remove most of the extracted tuples, we study whether we can improve the performance of the extraction by compiling the domination rule into the extractor. To this end, we introduce the skyline operator for declarative information extraction tasks expressed as document spanners. We show that this operator can be expressed via regular operations when the domination partial order can itself be expressed as a regular spanner, which covers several natural domination rules. Yet, we show that the skyline operator incurs a computational cost (under combined complexity). First, there are cases where the operator requires an exponential blowup in the number of states needed to represent the spanner as a sequential variable-set automaton. Second, the evaluation may become computationally hard. More precisely, our analysis identifies classes of domination rules for which the combined complexity is tractable or intractable.
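To make the filtering step concrete, here is a minimal Python sketch of the naive approach the abstract starts from: extract all tuples first, then discard the dominated ones. It is not the construction studied in the paper, which compiles the domination rule into the extractor. All names (`Span`, `Mapping`, `skyline`, and the two domination predicates) are hypothetical. The sketch assumes a tuple is a dictionary from variable names to half-open spans, and it encodes the two example rules mentioned above: larger intervals, and additional attributes.

```python
from typing import Callable, Dict, List, Tuple

# Assumption: a span is a half-open interval [start, end) of character positions.
Span = Tuple[int, int]
# Assumption: an extracted tuple maps variable names to spans (possibly partially).
Mapping = Dict[str, Span]


def dominated_by_span_inclusion(t: Mapping, u: Mapping) -> bool:
    """True if u strictly dominates t: same variables, and every span
    of t is contained in the corresponding span of u."""
    if t == u or t.keys() != u.keys():
        return False
    return all(u[x][0] <= t[x][0] and t[x][1] <= u[x][1] for x in t)


def dominated_by_variable_inclusion(t: Mapping, u: Mapping) -> bool:
    """True if u strictly extends t: u agrees with t on t's variables
    and assigns at least one additional variable."""
    return len(u) > len(t) and all(x in u and u[x] == t[x] for x in t)


def skyline(
    tuples: List[Mapping],
    dominated: Callable[[Mapping, Mapping], bool],
) -> List[Mapping]:
    """Keep the tuples that no other tuple dominates (quadratic post-filter)."""
    return [t for t in tuples if not any(dominated(t, u) for u in tuples)]


# Span inclusion: (0, 3) and (1, 3) both lie inside (0, 5) and are filtered out.
by_span = skyline(
    [{"x": (0, 3)}, {"x": (0, 5)}, {"x": (1, 3)}, {"x": (6, 8)}],
    dominated_by_span_inclusion,
)

# Variable inclusion: {"x": (0, 3)} is extended by the tuple that also assigns y.
by_vars = skyline(
    [{"x": (0, 3)}, {"x": (0, 3), "y": (4, 6)}, {"x": (1, 2)}],
    dominated_by_variable_inclusion,
)
```

This post-filter compares every pair of extracted tuples, so its cost grows with the size of the full relation even when the skyline itself is small. That is the overhead that motivates asking whether the domination rule can be pushed into the extractor.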
