Spanner Evaluation over SLP-Compressed Documents

01/25/2021
by   Markus L. Schmid, et al.

We consider the problem of evaluating regular spanners over compressed documents, i.e., we wish to solve evaluation tasks directly on the compressed data, without decompression. As compressed forms of the documents we use straight-line programs (SLPs), a lossless compression scheme for textual data that is widely used in different areas of theoretical computer science and is particularly well-suited for algorithmics on compressed data. In terms of data complexity, our results are as follows. For a regular spanner M and an SLP S that represents a document D, we can solve the tasks of model checking and of checking non-emptiness in time O(size(S)). Computing the set M(D) of all span-tuples extracted from D can be done in time O(size(S) · size(M(D))), and enumeration of M(D) can be done with linear preprocessing O(size(S)) and a delay of O(depth(S)), where depth(S) is the depth of S's derivation tree. Note that size(S) can be exponentially smaller than the document's size |D|; and, due to known balancing results for SLPs, we can always assume that depth(S) = O(log(|D|)), independent of D's compressibility. Hence, our enumeration algorithm has a delay logarithmic in the size of the non-compressed data and a preprocessing time that is at best (i.e., in the case of highly compressible documents) also logarithmic, but at worst still linear. Therefore, from a big-data perspective, our enumeration algorithm for SLP-compressed documents may nevertheless beat the known linear-preprocessing, constant-delay algorithms for non-compressed documents.
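To make the quantities size(S), depth(S) and |D| from the abstract concrete, here is a minimal Python sketch of an SLP and of random access into the document it represents, without decompression. It is not the paper's spanner-evaluation algorithm. The dictionary encoding of rules and the helper names (`power_slp`, `derived_lengths`, `depth`, `char_at`, `expand`) are assumptions made purely for illustration.

```python
# Hypothetical encoding (an assumption, not taken from the paper):
# an SLP is a dict mapping each nonterminal to either a single character
# (terminal rule) or a pair of nonterminals (binary rule). Rules are
# inserted bottom-up, so every nonterminal is defined before it is used.

def power_slp(k):
    """SLP with k + 3 rules whose start symbol X{k} derives (ab)^(2^k)."""
    rules = {"A": "a", "B": "b", "X0": ("A", "B")}
    for i in range(1, k + 1):
        rules[f"X{i}"] = (f"X{i - 1}", f"X{i - 1}")
    return rules

def derived_lengths(rules):
    """Length of the string derived by each nonterminal, in O(size(S))."""
    lens = {}
    for name, rhs in rules.items():
        lens[name] = 1 if isinstance(rhs, str) else lens[rhs[0]] + lens[rhs[1]]
    return lens

def depth(rules):
    """Depth of each nonterminal's derivation tree (terminal rules have depth 0)."""
    d = {}
    for name, rhs in rules.items():
        d[name] = 0 if isinstance(rhs, str) else 1 + max(d[rhs[0]], d[rhs[1]])
    return d

def char_at(rules, lens, start, i):
    """Return the i-th character of the document in O(depth(S)) steps."""
    if not 0 <= i < lens[start]:
        raise IndexError(i)
    name = start
    while True:
        rhs = rules[name]
        if isinstance(rhs, str):
            return rhs
        left, right = rhs
        if i < lens[left]:
            name = left          # position lies in the left subtree
        else:
            i -= lens[left]      # skip the left subtree entirely
            name = right

def expand(rules, name):
    """Full decompression; only sensible for small documents."""
    rhs = rules[name]
    return rhs if isinstance(rhs, str) else expand(rules, rhs[0]) + expand(rules, rhs[1])

rules = power_slp(20)
lens = derived_lengths(rules)
print(len(rules), lens["X20"], depth(rules)["X20"])
print(char_at(rules, lens, "X20", 0), char_at(rules, lens, "X20", lens["X20"] - 1))
```

In this example, 23 rules describe a document of 2^21 characters, and a single character is located by walking one root-to-leaf path of the derivation tree. This is the same size(S) versus |D| gap and the same depth(S)-bounded traversal that the stated complexity bounds refer to.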


