Complexity Bounds for Relational Algebra over Document Spanners

01/14/2019
by   Liat Peterfreund, et al.
0

We investigate the complexity of evaluating queries in Relational Algebra (RA) over the relations extracted by regex formulas (i.e., regular expressions with capture variables) over text documents. Such queries, also known as the regular document spanners, were shown to have an evaluation with polynomial delay for every positive RA expression (i.e., consisting of only natural joins, projections and unions); here, the RA expression is fixed and the input consists of both the regex formulas and the document. In this work, we explore the implication of two fundamental generalizations. The first is adopting the "schemaless" semantics for spanners, as proposed and studied by Maturana et al. The second is going beyond the positive RA to allowing the difference operator. We show that each of the two generalizations introduces computational hardness: it is intractable to compute the natural join of two regex formulas under the schemaless semantics, and the difference between two regex formulas under both the ordinary and schemaless semantics. Nevertheless, we propose and analyze syntactic constraints, on the RA expression and the regex formulas at hand, such that the expressive power is fully preserved and, yet, evaluation can be done with polynomial delay. Unlike the previous work on RA over regex formulas, our technique is not (and provably cannot be) based on the static compilation of regex formulas, but rather on an ad-hoc compilation into an automaton that incorporates both the query and the document. This approach also allows us to include black-box extractors in the RA expression.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
02/20/2020

The Complexity of Aggregates over Extractions by Regular Expressions

Regular expressions with capture variables, also known as "regex formula...
research
12/21/2017

Recursive Programs for Document Spanners

A document spanner models a program for Information Extraction (IE) as a...
research
03/15/2020

Grammars for Document Spanners

A new grammar-based language for defining information-extractors from te...
research
03/15/2020

Grammars for Document Spanenrs

A new grammar-based language for defining information-extractors from te...
research
08/30/2019

Weight Annotation in Information Extraction

The framework of document spanners abstracts the task of information ext...
research
08/08/2020

dxo: A System for Relational Algebra and Differentiation

We present dxo, a relational system for algebra and differentiation, wri...
research
08/30/2019

Annotated Document Spanners

We introduce annotated document spanners, which are document spanners th...

Please sign up or login with your details

Forgot password? Click here to reset