Constant-Delay Enumeration for Nondeterministic Document Spanners

07/24/2018
by Antoine Amarilli, et al.

We consider the information extraction approach known as document spanners, and study the problem of efficiently computing the results of the extraction from an input document, where the extraction task is described as a sequential variable-set automaton (VA). We pose this problem in the setting of enumeration algorithms, where we can first run a preprocessing phase and must then produce the results with a small delay between any two consecutive results. Our goal is to have an algorithm that is tractable in combined complexity, i.e., in the size of the input document and of the VA, while ensuring the best possible data complexity bounds in the input document, in particular constant delay in the document. Several recent works at PODS'18 proposed such algorithms but with linear delay in the document or with an exponential dependency on the (generally nondeterministic) input VA. Florenzano et al. suggest that our desired runtime guarantees cannot be met for general sequential VAs. We refute this and show that, given a nondeterministic sequential VA and an input document, we can enumerate the mappings of the VA on the document with the following bounds: the preprocessing is linear in the document size and polynomial in the size of the VA, and the delay is independent of the document and polynomial in the size of the VA. The resulting algorithm thus achieves tractability in combined complexity and the best possible data complexity bounds. Moreover, it is rather easy to describe, in particular for the restricted case of so-called extended VAs.
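
To make the task concrete, the following is a minimal brute-force sketch of what "the mappings of the VA on the document" means. It is not the constant-delay algorithm of the paper: it explores all runs exhaustively and gives no delay guarantee. The transition encoding, the function name `naive_mappings`, and the toy automaton are assumptions made for illustration only; spans are written as half-open intervals (start, end).

```python
from typing import List, Tuple

# Hypothetical encoding (an assumption, not taken from the paper):
# a transition is (source_state, label, target_state), where a label is
# either a one-character string (reads one letter of the document) or a
# marker ("open", x) / ("close", x) for a variable x (reads nothing).
Mapping = Tuple[Tuple[str, Tuple[int, int]], ...]


def naive_mappings(doc, transitions, initial, finals) -> List[Mapping]:
    """Collect all mappings by exhaustive search over valid runs."""
    results = set()
    seen = set()
    # A configuration is (state, position, set of markers placed so far).
    stack = [(initial, 0, frozenset())]
    while stack:
        config = stack.pop()
        if config in seen:
            continue
        seen.add(config)
        state, pos, marks = config
        opened = {v for (v, kind, _) in marks if kind == "open"}
        closed = {v for (v, kind, _) in marks if kind == "close"}
        # Accept when the document is consumed in a final state and
        # every opened variable has been closed.
        if pos == len(doc) and state in finals and opened == closed:
            starts = {v: p for (v, kind, p) in marks if kind == "open"}
            ends = {v: p for (v, kind, p) in marks if kind == "close"}
            results.add(tuple(sorted((v, (starts[v], ends[v])) for v in starts)))
        for src, label, dst in transitions:
            if src != state:
                continue
            if isinstance(label, str):
                # Letter transition: consume one character if it matches.
                if pos < len(doc) and doc[pos] == label:
                    stack.append((dst, pos + 1, marks))
            else:
                # Marker transition: open a variable at most once, and
                # close it only after it was opened.
                kind, var = label
                may_open = kind == "open" and var not in opened
                may_close = kind == "close" and var in opened and var not in closed
                if may_open or may_close:
                    stack.append((dst, pos, marks | {(var, kind, pos)}))
    return sorted(results)


# Toy nondeterministic VA: variable x captures any span made only of 'a'.
toy_transitions = [
    (0, "a", 0), (0, "b", 0), (0, ("open", "x"), 1),
    (1, "a", 1), (1, ("close", "x"), 2),
    (2, "a", 2), (2, "b", 2),
]
toy_result = naive_mappings("ab", toy_transitions, initial=0, finals={2})
print(toy_result)
# [(('x', (0, 0)),), (('x', (0, 1)),), (('x', (1, 1)),), (('x', (2, 2)),)]
```

On the document "ab" the toy automaton yields four mappings, three of which assign x an empty span. The search above may revisit the same mapping through many different runs, which is precisely the source of the exponential dependency on nondeterministic VAs that the abstract mentions for earlier approaches.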

research
03/15/2020

Grammars for Document Spanners

A new grammar-based language for defining information-extractors from te...
research
01/25/2021

Spanner Evaluation over SLP-Compressed Documents

We consider the problem of evaluating regular spanners over compressed d...
research
09/25/2022

Constant-delay enumeration for SLP-compressed documents

We study the problem of enumerating results from a query over a compress...
research
05/08/2020

On the complexity of computing integral bases of function fields

Let 𝒞 be a plane curve given by an equation f(x,y)=0 with f∈ K[x][y] a m...
research
10/12/2020

Constant-delay enumeration algorithms for document spanners over nested documents

Some of the most relevant document schemas used online, such as XML and ...
research
10/25/2016

How Document Pre-processing affects Keyphrase Extraction Performance

The SemEval-2010 benchmark dataset has brought renewed attention to the ...
