Recursive Programs for Document Spanners

12/21/2017
by Liat Peterfreund, et al.

A document spanner models a program for Information Extraction (IE) as a function that takes as input a text document (a string over a finite alphabet) and produces a relation of spans (intervals in the document) over a predefined schema. A well-studied language for expressing spanners is that of the regular spanners: relational algebra over regex formulas, which are obtained by adding capture variables to regular expressions. Equivalently, the regular spanners are those expressible in non-recursive Datalog over regex formulas, where the regex formulas extract relations from the input document that play the role of EDBs. In this paper, we investigate the expressive power of recursive Datalog over regex formulas. Our main result is that such programs capture precisely the document spanners computable in polynomial time. Additional results compare recursive programs to known formalisms such as the language of core spanners (which extends regular spanners with tests for string equality) and its closure under difference. Finally, we extend our main result to a recently proposed framework that generalizes both the relational model and document spanners.
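To make the idea of recursive rules over extracted span relations concrete, here is a minimal Python sketch. It is not the paper's construction: the function names (`token_spans`, `adjacent`, `chain`), the sample document, and the representation of a span as a half-open `(start, end)` pair of 0-based offsets are all assumptions made for illustration. Python's `re.finditer` also returns only non-overlapping matches, so it only approximates how a regex formula assigns spans. The recursive part is a naive least-fixpoint evaluation of two Datalog-style rules.

```python
import re

def token_spans(doc):
    """EDB-like relation: one (start, end) span per maximal run of word characters."""
    return {m.span() for m in re.finditer(r"\w+", doc)}

def adjacent(doc, tokens):
    """Pairs of token spans separated by exactly one space character."""
    return {(a, b) for a in tokens for b in tokens
            if a[1] + 1 == b[0] and doc[a[1]] == " "}

def chain(adj):
    """Naive least fixpoint of the (hypothetical) recursive rules:
         Chain(x, y) :- Adjacent(x, y).
         Chain(x, z) :- Chain(x, y), Adjacent(y, z).
    """
    result = set(adj)
    while True:
        # Apply the recursive rule once to everything derived so far.
        derived = {(x, z) for (x, y) in result for (y2, z) in adj if y == y2}
        if derived <= result:  # nothing new: fixpoint reached
            return result
        result |= derived

doc = "ab cd ef"
tokens = token_spans(doc)
pairs = chain(adjacent(doc, tokens))
```

On this sample document, `tokens` holds three spans and `pairs` relates each token to every later token reachable through single-space gaps. The loop terminates because a document of length n has only polynomially many spans, so the derived relation can grow only finitely many times.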


