# Constant-Delay Enumeration for Nondeterministic Document Spanners

We consider the information extraction approach known as document spanners, and study the problem of efficiently computing the results of the extraction from an input document, where the extraction task is described as a sequential variable-set automaton (VA). We pose this problem in the setting of enumeration algorithms, where we can first run a preprocessing phase and must then produce the results with a small delay between any two consecutive results. Our goal is to have an algorithm which is tractable in combined complexity, i.e., in the input document and in the VA; while ensuring the best possible data complexity bounds in the input document, in particular constant delay in the document. Several recent works at PODS'18 proposed such algorithms but with linear delay in the document or with an exponential dependency in the (generally nondeterministic) input VA. Florenzano et al. suggest that our desired runtime guarantees cannot be met for general sequential VAs. We refute this and show that, given a nondeterministic sequential VA and an input document, we can enumerate the mappings of the VA on the document with the following bounds: the preprocessing is linear in the document size and polynomial in the size of the VA, and the delay is independent of the document and polynomial in the size of the VA. The resulting algorithm thus achieves tractability in combined complexity and the best possible data complexity bounds. Moreover, it is rather easy to describe, in particular for the restricted case of so-called extended VAs.

## 1 Introduction

Information extraction from text documents is an important problem in data management. One approach to this task has recently attracted a lot of attention: it uses document spanners, a declarative logic-based approach first implemented by IBM in their tool SystemT [24] and whose core semantics were later formalized in [8]. The spanner approach uses variants of regular expressions (e.g., regex formulas with variables), compiles them to variants of finite automata (e.g., variable-set automata, for short VAs), and evaluates them on the input document to extract the data of interest. After this extraction phase, algebraic operations like joins, unions and projections can be performed. The formalization of the spanner framework in [8] has led to a thorough investigation of its properties by the theoretical database community [11, 13, 19, 12, 9].

We here consider the basic task in the spanner framework of efficiently computing the results of the extraction, i.e., computing without duplicates all tuples of ranges of the input document (called mappings) that satisfy the conditions described by a VA. As many algebraic operations can also be compiled into VAs [13], this task actually solves the whole data extraction problem for so-called regular spanners [8]. While the extraction task is intractable for general VAs [11], it is known to be tractable if we impose that the VA is sequential [13, 9], which requires that all accepting runs actually describe a well-formed mapping; we will make this assumption throughout our work. Even then, however, it may still be unreasonable in practice to materialize all mappings: if there are k variables to extract, then mappings are k-tuples of spans and there may be up to n^{2k} mappings on an input document of size n, which is unrealistic if k is large. For this reason, recent works [19, 9, 13] have studied the extraction task in the setting of enumeration algorithms: instead of materializing all mappings, we enumerate them one by one while ensuring that the delay between two results is always small. Specifically, [13, Theorem 3.3] has shown how to enumerate the mappings with delay linear in the input document and quadratic in the VA, i.e., given a document d and a functional VA A (a subclass of sequential VAs), the delay is O(|d| × |A|²).

Although this result ensures tractability in both the size of the input document and the automaton, the delay may still be long, as |d| is generally very large. By contrast, enumeration algorithms for database tasks often enforce stronger tractability guarantees in data complexity [25, 26], in particular linear preprocessing and constant delay. Such algorithms consist of two phases: a preprocessing phase which precomputes an index data structure in linear data complexity, and an enumeration phase which produces all results so that the delay between any two consecutive results is always constant, i.e., independent of the input data. It was recently shown in [9] that this strong guarantee can be achieved when enumerating the mappings of VAs if we only focus on data complexity, i.e., for any fixed VA, we can enumerate its mappings with linear preprocessing and constant delay in the input document. However, the preprocessing and delay in [9] are exponential in the VA because they first determinize it [9, Propositions 4.1 and 4.3]. This is problematic because the VAs constructed from regex formulas [8] are generally nondeterministic.

Thus, to efficiently enumerate the results of the extraction, we would ideally want to have the best of both worlds: ensure that the combined complexity (in the sequential VA and in the document) remains polynomial, while ensuring that the data complexity (in the document) is as small as possible, i.e., linear time for the preprocessing phase and constant time for the delay of the enumeration phase. However, there is no known algorithm that satisfies these requirements while working on nondeterministic sequential VAs. Further, [9] suggests that such an algorithm is unlikely to exist, because the related task of counting the number of mappings is SpanL-hard for such VAs.

The question of nondeterminism is also unsolved for the related problem of enumerating the results of monadic second-order (MSO) queries on words and trees: there are several approaches for this task where the query is given as an automaton, but they require the automaton to be deterministic [4, 1] or their delay is not constant in the input document [17]. Hence, also in the context of MSO enumeration, it is not known whether we can achieve linear preprocessing and constant delay in data complexity while remaining tractable in the (generally nondeterministic) automaton.

### Contributions.

In this work, we show that nondeterminism is in fact not an obstacle to enumerating the results of document spanners: we present an algorithm that enumerates the mappings of a nondeterministic sequential VA in polynomial combined complexity while ensuring linear preprocessing and constant delay in the input document. This answers the open question of [9], and improves on the bounds of [13]. More precisely, we show:

Let ω be an exponent for Boolean matrix multiplication. Let A be a sequential VA with variable set V and with state set Q, and let d be an input document. We can enumerate the mappings of A on d with preprocessing time linear in |d| and polynomial in |A| (with degree depending on ω), and with delay independent of |d| and polynomial in |A|, i.e., linear preprocessing and constant delay in the input document, and polynomial preprocessing and delay in the input VA.

The existence of such an algorithm is surprising but in hindsight not entirely unexpected: remember that, in formal language theory, when we are given a word and a nondeterministic finite automaton, then we can evaluate the automaton on the word with tractable combined complexity by determinizing the automaton “on the fly”, i.e., computing at each position of the word the set of states where the automaton can be. Our algorithm generalizes this intuition, and extends it to the task of enumerating mappings without duplicates: we first present it for so-called extended sequential VAs, a variant of sequential VAs introduced in [9], before generalizing it to sequential VAs. Our overall approach is to construct a kind of product of the input document with the extended VA, similarly to [9]. We then use several tricks to ensure the constant delay bound despite nondeterminism; in particular we precompute a jump function that allows us to skip quickly the parts of the document where no variable can be assigned. The resulting algorithm is rather simple and has no large hidden constants. Note that our enumeration algorithm does not contradict the counting hardness results of [9, Theorem 5.2]: while our algorithm enumerates mappings with constant delay and without duplicates, we do not see a way to adapt it to count the mappings efficiently.
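For intuition, here is the classical "on the fly" subset construction for plain NFAs that the paragraph above alludes to, as a minimal Python sketch; the encoding of the automaton (a dictionary of transitions) is our own illustrative assumption, not the paper's:

```python
# Evaluating a nondeterministic finite automaton on a word in polynomial
# combined complexity: at each position, track the set of states the
# automaton can be in (determinization "on the fly").
def nfa_accepts(transitions, initial, final, word):
    """transitions: dict mapping (state, letter) -> set of successor states."""
    current = {initial}
    for letter in word:
        current = {q2 for q in current for q2 in transitions.get((q, letter), set())}
    return bool(current & final)

# Toy nondeterministic automaton for words over {a, b} ending in "ab":
delta = {
    (0, "a"): {0, 1}, (0, "b"): {0},
    (1, "b"): {2},
}
```

Our algorithm generalizes this idea from Boolean acceptance to the enumeration of mappings without duplicates.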

To extend our result to sequential VAs that are not extended, one possibility would be to convert them to extended VAs, but this necessarily entails an exponential blowup [9, Proposition 4.2]. We avoid this by adapting our algorithm to work with non-extended sequential VAs directly. Our idea for this is to efficiently enumerate at each position the possible sets of markers that can be assigned by the VA: we do so by enumerating paths in the VA, relying on the fact that the VA is sequential so these paths are acyclic. The challenge is that the same set of markers can be captured by many different paths, but we explain how we can explore efficiently the set of distinct paths with a technique known as flashlight search [18, 23]: the key idea is that we can efficiently determine which partial sets of markers can be extended to the label of a path (Lemma 6).
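As a generic illustration of the flashlight-search principle (not the paper's Lemma 6 itself), the following sketch enumerates, without duplicates, all members of a hidden family of sets, assuming an oracle that tells whether a partial choice can still be extended to a solution; the oracle and the toy solution family are hypothetical:

```python
# Flashlight (backtracking) search: enumerate all solutions without duplicates.
# extendable(chosen, rest) is an assumed oracle: it must report whether some
# solution S satisfies set(chosen) <= S <= set(chosen) | set(rest).
def flashlight(items, extendable, chosen=()):
    if not items:
        yield set(chosen)  # the oracle guarantees this leaf is a solution
        return
    first, rest = items[0], items[1:]
    if extendable(chosen + (first,), rest):  # branch: include the first item
        yield from flashlight(rest, extendable, chosen + (first,))
    if extendable(chosen, rest):             # branch: exclude the first item
        yield from flashlight(rest, extendable, chosen)

# Hypothetical solution family and the corresponding oracle:
SOLUTIONS = [{"a"}, {"a", "b"}, {"c"}]
def ext(chosen, rest):
    c = set(chosen)
    return any(c <= s <= c | set(rest) for s in SOLUTIONS)
```

Since every branch that is explored is guaranteed to lead to at least one solution, the delay between two outputs is bounded by the number of items times the cost of the oracle.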

Of course, our main theorem (Theorem 1) implies analogous results for all spanner formalisms that can be translated to sequential VAs. In particular, spanners are not usually written as automata by users, but instead given in a form of regular expressions called regex-formulas, see [8] for exact definitions. As we can translate sequential regex-formulas to sequential VAs in linear time [8, 13, 19], our results imply that we can also evaluate them:

Let ω be an exponent for Boolean matrix multiplication. Let γ be a sequential regex-formula with variable set V, and let d be an input document. We can enumerate the mappings of γ on d with preprocessing time linear in |d| and polynomial in |γ| (with degree depending on ω), and with delay independent of |d| and polynomial in |γ|, i.e., linear preprocessing and constant delay in the input document, and polynomial preprocessing and delay in the input regex-formula.

Another direct application of our result is for so-called regular spanners which are unions of conjunctive queries (UCQs) posed on regex-formulas, i.e., the closure of regex-formulas under union, projection and joins. We again point the reader to [8, 13] for the full definitions. As such UCQs can in fact be evaluated by VAs, our result also implies tractability for such representations, as long as we only perform a bounded number of joins:

For every fixed k ∈ ℕ, let UCQ_k denote the class of document spanners represented by UCQs over functional regex-formulas with at most k applications of the join operator. Then the mappings of a spanner in UCQ_k can be enumerated with linear preprocessing and constant delay in the document size, and with polynomial preprocessing and delay in the size of the spanner representation.

### Paper structure.

In Section 2, we formally define spanners, VAs, and the enumeration problem that we want to solve on them. In Sections 3 to 5, we prove our main result (Theorem 1) for extended VAs, where the sets of variables that can be assigned at each position are specified explicitly. We first describe in Section 3 the main part of our preprocessing phase, which converts the extended VA and input document to a mapping DAG whose paths describe the mappings that we wish to enumerate. We then describe in Section 4 how to enumerate these paths, up to having precomputed a so-called jump function whose computation is explained in Section 5. Last, we adapt our scheme in Section 6 for sequential VAs that are not extended. We conclude in Section 7.

## 2 Preliminaries

### Document spanners.

We fix a finite alphabet Σ. A document d = d₁⋯dₙ is just a word over Σ. A span of d is a pair [i, j⟩ with 1 ≤ i ≤ j ≤ n + 1, which represents the substring (contiguous subsequence) of d starting at position i and ending at position j − 1. To describe the possible results of an information extraction task, we will use a finite set V of variables, and define a result as a mapping from these variables to spans of the input document. Following [9, 19] but in contrast to [8], we will not require mappings to assign all variables: formally, a mapping of V on d is a function from some domain V′ ⊆ V to spans of d. We define a document spanner to be a function assigning to every input document d a set of mappings, which denotes the set of results of the extraction task on the document d.
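To make the span convention concrete, here is a small illustrative helper (our own, purely for intuition), assuming the usual convention from the spanner literature that a span [i, j⟩ with 1 ≤ i ≤ j ≤ n + 1 denotes the possibly empty substring starting at position i and ending just before position j:

```python
# Illustrative only: extracting the substring denoted by a span [i, j⟩,
# with document positions starting at 1.
def span_content(d, i, j):
    assert 1 <= i <= j <= len(d) + 1
    return d[i - 1:j - 1]
```

For instance, on the document "abcdef", the span [2, 4⟩ denotes "bc", and any span [i, i⟩ denotes the empty string.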

### Variable-set automata.

We will represent document spanners using variable-set automata (or VAs). The transitions of a VA can carry letters of Σ or variable markers, which are either of the form x⊢ for a variable x ∈ V (denoting the start of the span assigned to x) or ⊣x (denoting its end). Formally, a variable-set automaton (or VA) is then defined to be an automaton A = (Q, q_init, F, δ) where the transition relation δ consists of letter transitions of the form (q, a, q′) for q, q′ ∈ Q and a ∈ Σ, and of variable transitions of the form (q, x⊢, q′) or (q, ⊣x, q′) for q, q′ ∈ Q and x ∈ V. A configuration of a VA is a pair (q, i) where q ∈ Q and i is a position of the input document d. A run of A on d is then a sequence of configurations

(q₀, i₀) →^{σ₁} (q₁, i₁) →^{σ₂} ⋯ →^{σₘ} (qₘ, iₘ)

where q₀ = q_init, i₀ = 1, iₘ = n + 1, and where for every 1 ≤ j ≤ m:

• Either σⱼ is a letter of Σ, we have iⱼ = iⱼ₋₁ + 1, we have σⱼ = d_{iⱼ₋₁}, and (qⱼ₋₁, σⱼ, qⱼ) is a letter transition of A;

• Or σⱼ is a variable marker, we have iⱼ = iⱼ₋₁, and (qⱼ₋₁, σⱼ, qⱼ) is a variable transition of A. In this case we say that the variable marker σⱼ is read at position iⱼ.

As usual, we say that a run is accepting if qₘ ∈ F. A run is valid if it is accepting, every variable marker is read at most once, and whenever an open marker x⊢ is read at a position i then the corresponding close marker ⊣x is read at a position j with i ≤ j. From each valid run, we define a mapping where each variable x is mapped to the span [i, j⟩ such that x⊢ is read at position i and ⊣x is read at position j; if these markers are not read then x is not assigned by the mapping (i.e., it is not in the domain of the mapping). The document spanner of the VA A is then the function that assigns to every document d the set of mappings defined by the valid runs of A on d: note that the same mapping can be defined by multiple different runs. The task studied in this paper is the following: given a VA A and a document d, enumerate without duplicates the mappings that are assigned to d by the document spanner of A. The enumeration must write each mapping as a set of pairs (m, i) where m is a variable marker and i is a position of d.

### Sequential VAs.

We cannot hope to efficiently enumerate the mappings of arbitrary VAs, because it is already NP-complete to decide if, given a VA A and a document d, there are any valid runs of A on d [11]. For this reason, we will restrict ourselves to so-called sequential VAs [19]. A VA A is sequential if for every document d, every accepting run of A on d is also valid: this implies that the document spanner of A can simply be defined following the accepting runs of A. If we are given a VA, then we can convert it to an equivalent sequential VA (i.e., one that defines the same document spanner) with an unavoidable exponential blowup in the number of variables (not in the number of states), using existing results:

Given a VA A on variable set V, letting r := |V| and letting k be the number of states of A, we can compute an equivalent sequential VA with at most 3^r × k states. Conversely, for any r ∈ ℕ, there exists a VA with 1 state on a variable set with r variables such that any sequential VA equivalent to it has at least 3^r states.

###### Proof.

This can be shown exactly like [11, Proposition 12] and [10, Proposition 3.9]. In short, the upper bound is shown by modifying A to remember in the automaton state which variables have been opened or closed, and by re-wiring the transitions to ensure that the run is valid: this creates 3^{|V|} copies of every state because each variable can be either unseen, opened, or closed. For the lower bound, [10, Proposition 3.9] gives a VA for which any equivalent sequential VA must remember the status of all variables in this way. ∎

All VAs studied in this work will be sequential, and we will further assume that they are trimmed, in the sense that for every state q there is a document and an accepting run of the VA in which the state q appears. This condition can be enforced in linear time on any input sequential VA: we do a graph traversal to identify the accessible states (the ones that are reachable from the initial state), we do another graph traversal to identify the co-accessible states (the ones from which we can reach a final state), and we remove all states that are not accessible or not co-accessible. We will silently assume that all sequential VAs have been trimmed, which implies that they cannot contain any cycle of variable transitions (as such a cycle would otherwise appear in a run, which would not be valid).
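The trimming step just described can be sketched as two graph traversals; the graph encoding below (a list of source/target pairs, ignoring transition labels) is an illustrative assumption:

```python
from collections import defaultdict

# Keep only states that are both accessible (reachable from the initial state)
# and co-accessible (from which some final state is reachable).
def trim(initial, final, edges):
    """edges: list of (source, target) pairs; returns the set of useful states."""
    fwd, bwd = defaultdict(set), defaultdict(set)
    for s, t in edges:
        fwd[s].add(t)
        bwd[t].add(s)

    def reach(sources, adj):
        seen, stack = set(sources), list(sources)
        while stack:
            v = stack.pop()
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        return seen

    accessible = reach({initial}, fwd)
    co_accessible = reach(set(final), bwd)
    return accessible & co_accessible
```

Both traversals are linear in the number of transitions, so the whole step is linear time.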

### Extended VAs.

We will first prove our results for a variant of sequential VAs introduced by [9], called sequential extended VAs. An extended VA on alphabet Σ and variable set V is an automaton A = (Q, q_init, F, δ) where the transition relation δ consists of letter transitions as before, and of extended variable transitions (or ev-transitions) of the form (q, M, q′) where M is a possibly empty set of variable markers. Intuitively, on ev-transitions, the automaton reads multiple markers at once. Formally, a run of A on d = d₁⋯dₙ is a sequence of configurations (defined like before) where letter transitions and ev-transitions alternate:

(q₀, 1) →^{M₁} (q′₀, 1) →^{d₁} (q₁, 2) →^{M₂} ⋯ →^{dₙ} (qₙ, n+1) →^{Mₙ₊₁} (q′ₙ, n+1)

where (qᵢ₋₁, Mᵢ, q′ᵢ₋₁) is an ev-transition of A for all 1 ≤ i ≤ n + 1, and (q′ᵢ₋₁, dᵢ, qᵢ) is a letter transition of A for all 1 ≤ i ≤ n, where Mᵢ is the set of variable markers read at position i. Accepting and valid runs are defined like before, and the extended VA is sequential if all accepting runs are valid, in which case its document spanner is defined like before.

Our definition of extended VAs generalizes slightly that of [9] because we do not impose that ev-transitions that read the empty set do not change the state. We will also make a small additional assumption to simplify our proofs: we require that the states of extended VAs are partitioned between ev-states, from which only ev-transitions originate (i.e., the states q of the ev-transitions (q, M, q′) above), and letter-states, from which only letter transitions originate (i.e., the states q of the letter transitions (q, a, q′) above); and we impose that the initial state is an ev-state and the final states are all letter-states. This requirement can be imposed in linear time on any input extended VA by rewriting each state to one letter-state and one ev-state, and re-wiring the transitions and changing the initial/final status of states appropriately. This rewriting preserves sequentiality and guarantees that any path in the rewritten extended VA must alternate between letter transitions and ev-transitions. Hence, we silently make this assumption on all extended VAs from now on.

### Matrix multiplication.

The complexity bottleneck for some of our results will be the complexity of multiplying two Boolean matrices, which is a long-standing open problem, see e.g. [14] for a recent discussion. When stating our results, we will often denote by ω an exponent for Boolean matrix multiplication: this is a constant such that the product of two n-by-n Boolean matrices can be computed in time O(n^ω). For instance, we can take ω = 3 if we use the naive algorithm for Boolean matrix multiplication, and it is obvious that we must have ω ≥ 2. The best known upper bound is currently ω < 2.373, see [15].
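The naive cubic algorithm, which witnesses that we can take ω = 3, can be sketched as follows (matrices are encoded here as lists of lists of Booleans, an illustrative choice):

```python
# Naive Boolean matrix multiplication in O(n^3): entry (i, j) of the product
# is true iff some k has a[i][k] and b[k][j] both true.
def bool_matmul(a, b):
    n = len(a)
    return [[any(a[i][k] and b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]
```

Asymptotically faster schemes (Strassen-like algorithms) lower the exponent below 3, which is how the constant ω enters our complexity bounds.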

## 3 Computing Mapping DAGs for Extended VAs

We start our paper by studying extended VAs, which are easier to work with because the set of markers that can be assigned at every position is explicitly written as the label of a single transition. We accordingly show Theorem 1 for the case of extended VAs in Sections 3 to 5. We will then cover the case of non-extended VAs in Section 6.

To show Theorem 1 for extended VAs, we will reduce the problem of enumerating the mappings captured by the extended VA A to that of enumerating path labels in a special kind of directed acyclic graph (DAG), called a mapping DAG:

A mapping DAG consists of a set of vertices, an initial vertex v_init, a final vertex v_fin, and a set of edges where each edge has a source vertex, a target vertex, and a label that may be ϵ (in which case we call the edge an ϵ-edge) or a finite (possibly empty) set. We require that the graph is acyclic. We say that a mapping DAG is alternating if every path from the initial vertex to the final vertex starts with a non-ϵ-edge, ends with an ϵ-edge, and alternates between ϵ-edges and non-ϵ-edges.

The mapping of a path π in the mapping DAG is the union of the labels of the non-ϵ-edges of π: we require of any mapping DAG that, for every path π, this union is disjoint. Given a set U of vertices of the mapping DAG, we write M(U) for the set of mappings of paths from a vertex of U to the final vertex; note that the same mapping may be captured by multiple different paths. The set of mappings captured by the mapping DAG is then M({v_init}) where v_init is its initial vertex.

We will reduce our enumeration problem on extended VAs to that of enumerating the set of mappings captured by a mapping DAG. To do so, we will build a mapping DAG that is a variant of the product of A and of the document d, where we represent simultaneously the position in the document and the corresponding state of A. As letter transitions will no longer be important, we will erase their labels to call them ϵ-transitions, and we will label the other transitions with sets of markers that also indicate the position in the document. Here is our formal construction:

Let A = (Q, q_init, F, δ) be a sequential extended VA and let d = d₁⋯dₙ be an input document. The product DAG of A and d is the alternating mapping DAG whose vertex set is (Q × {1, …, n + 1}) ∪ {(v_f, n + 2)} for some fresh value v_f. Its edges are:

• For every letter-transition (q, a, q′) in A, for every 1 ≤ i ≤ n such that dᵢ = a, there is an ϵ-edge from (q, i) to (q′, i + 1);

• For every ev-transition (q, M, q′) in A, for every 1 ≤ i ≤ n + 1, there is an edge from (q, i) to (q′, i) labeled with the (possibly empty) set {(m, i) | m ∈ M}.

• For every final state q ∈ F, an ϵ-edge from (q, n + 1) to (v_f, n + 2).

The initial vertex of the product DAG is (q_init, 1) and the final vertex is (v_f, n + 2).
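To make the construction concrete, here is a small illustrative sketch in Python; the encodings of transitions, vertices, and edge labels (with `None` standing for ϵ) are our own assumptions, not the paper's:

```python
# Sketch of the product DAG construction: one vertex per (state, position),
# plus a fresh final vertex; letter transitions become ϵ-edges (label None),
# ev-transitions become set-labeled edges tagged with the position.
def product_dag(letter_transitions, ev_transitions, final_states, q_init, d):
    n = len(d)
    edges = []  # triples (source_vertex, label, target_vertex)
    for (q, a, q2) in letter_transitions:
        for i in range(1, n + 1):
            if d[i - 1] == a:
                edges.append(((q, i), None, (q2, i + 1)))
    for (q, markers, q2) in ev_transitions:
        for i in range(1, n + 2):
            edges.append(((q, i), frozenset((m, i) for m in markers), (q2, i)))
    for q in final_states:
        edges.append(((q, n + 1), None, ("final", n + 2)))
    return (q_init, 1), ("final", n + 2), edges
```

Note that this sketch materializes the full product, which is linear in |d| and in the number of transitions of the extended VA, matching the preprocessing budget.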

Note that, contrary to [9], we do not contract the ϵ-edges but keep them throughout our algorithm. It is easy to see that this construction satisfies the definition:

The product DAG of A and d is an alternating mapping DAG.

The mapping DAG is acyclic and alternating because its edges follow the transitions of the extended VA, which we had preprocessed to distinguish letter-states and ev-states. Paths in the mapping DAG cannot contain multiple occurrences of the same label, because the labels in the mapping DAG include the position in the document.

###### Proof.

It is immediate that the product DAG is indeed acyclic, because the second component of the vertices is nondecreasing along every edge, and an edge where the second component does not increase (corresponding to an ev-transition of the VA) must be followed by an edge where it does (corresponding to a letter-transition of the VA). What is more, we claim that no path in the product DAG can include two edges whose labels contain the same pair, so that the unions used to define the mappings of the mapping DAG are indeed disjoint. To see this, consider a path containing an edge e with label L ≠ ϵ at position i and a later edge e′ with label L′ ≠ ϵ at position i′: as e and e′ are separated by at least one ϵ-edge, we have i < i′, and L and L′ are disjoint because all elements of L contain the position i and all elements of L′ contain the position i′. Further, the product DAG is also alternating because A is an extended VA that we have preprocessed to distinguish letter-states and ev-states. ∎

Further, the product DAG clearly captures what we want to enumerate. Formally:

The set of mappings of A on d is exactly the set of mappings captured by the product DAG of A and d.

###### Proof.

This is immediate as there is a clear bijection between the accepting runs of A on d and the paths from the initial vertex of the product DAG to its final vertex, and this bijection ensures that the mapping of each path is exactly the mapping defined by the corresponding accepting run. ∎

Our task is to enumerate the mappings captured by the product DAG without duplicates, and this is still non-obvious: because of nondeterminism, the same mapping in the product DAG may be witnessed by exponentially many paths, corresponding to exponentially many runs of the nondeterministic extended VA A. We will present in the next section our algorithm to perform this task on the product DAG. To do this, we will need to preprocess the product DAG by trimming it, and introduce the notion of levels to reason about its structure.

First, we present how to trim the mapping DAG. We say that it is trimmed if every vertex v is both accessible (there is a path from the initial vertex to v) and co-accessible (there is a path from v to the final vertex). Given a mapping DAG, we can clearly trim it in linear time by two linear-time graph traversals. Hence, we will always silently assume that the mapping DAG is trimmed. If the mapping DAG is empty once trimmed, then there are no mappings to enumerate, so our task is trivial. Hence, we assume in the sequel that the mapping DAG is non-empty after trimming. Further, if every edge label is ϵ or the empty set, then the only possible mapping is the empty mapping and we can produce it at that stage, so in the sequel we assume that some edge label is non-empty.

Second, we present an invariant on the structure of  by introducing the notion of levels:

A mapping DAG is leveled if its vertices are pairs (x, i) whose second component i is a nonnegative integer called the level of the vertex and written lvl(v), and where the following conditions hold:

• For the initial vertex (which has no incoming edges), the level is 1;

• For every ϵ-edge from u to v, we have lvl(v) = lvl(u) + 1;

• For every non-ϵ-edge from u to v, we have lvl(v) = lvl(u);

• Whenever two edges e and e′ have labels which are not ϵ or the empty set, and when these labels share some element, then the source vertices (and, hence, the target vertices) of e and e′ must have the same level.

The depth of the mapping DAG is its maximal level. Its width is the maximal number of vertices that have the same level.

The following is then immediate by construction:

The product DAG of A and d is leveled, and its depth is |d| + 2 and its width is at most |Q|.

###### Proof.

This is clear because the second component of every vertex of the product DAG indicates how many letters of d have been read so far, and because each level of the product DAG corresponds to a copy of the state set Q, so it has at most |Q| vertices. ∎

In addition to levels, we will need the notion of a level set:

A level set is a non-empty set Λ of vertices in a leveled alternating mapping DAG that all have the same level (written lvl(Λ)) and which are all the source of some non-ϵ-edge. The singleton of the final vertex is also considered as a level set.

In particular, letting v_init be the initial vertex, the singleton {v_init} is a level set. Further, if we consider a level set Λ which is not the singleton of the final vertex, then we can follow non-ϵ-edges from all vertices of Λ (and only such edges) to get to other vertices, and follow ϵ-edges from these vertices (and only such edges) to get to a new level set Λ′ at the next level.

## 4 Enumeration for Mapping DAGs

In the previous section, we have reduced our enumeration problem for extended VAs on documents to an enumeration problem on alternating leveled mapping DAGs. In this section, we describe our main enumeration algorithm on such DAGs and show the following:

Let ω be an exponent for Boolean matrix multiplication. Given an alternating leveled mapping DAG G of depth D and width W, we can enumerate M(G) without duplicates with preprocessing linear in the size of G and polynomial in W (with degree depending on ω), and with delay polynomial in W and in r + 1, where r is the size of each produced mapping.

Remember that, as part of our preprocessing, we have ensured that the leveled alternating mapping DAG has been trimmed. We will also preprocess it to ensure that, given any vertex, we can access its adjacency list (i.e., the list of its outgoing edges) in some sorted order on the labels, where we assume that edges labeled with the empty set come last. This sorting can be done in linear time in the RAM model [16, Theorem 3.1], so this preprocessing is linear in the size of the mapping DAG.

Our general enumeration algorithm is then presented as Algorithm 1. We explain the missing pieces next. The function Enum is initially called with the level set containing only the initial vertex, and with the current mapping being the empty list.

For simplicity, let us assume for now that the Jump function just computes the identity, i.e., Jump(Λ) = Λ. As for the call NextLevel(Λ), it returns the pairs (locmark, Λ′′) where:

• The label set locmark is a (non-ϵ) edge label such that there is an edge labeled with locmark that starts at some vertex of Λ;

• The level set Λ′′ is formed of all the vertices at the next level that can be reached from such an edge followed by an ϵ-edge. Formally, a vertex v′′ is in Λ′′ if and only if there is an edge labeled locmark from some vertex of Λ to some vertex v′, and there is an ϵ-edge from v′ to v′′.

Remember that, as the mapping DAG is alternating, we know that all edges starting at vertices of the level set Λ are non-ϵ-edges (several of which may have the same label); and for any target v′ of these edges, all edges that leave v′ are ϵ-edges whose targets v′′ are at the next level.
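Ignoring the Jump optimization for now (i.e., taking Jump to be the identity), the recursion of the enumeration procedure described above can be sketched as follows; the encoding of level sets as frozensets and the interface of `next_level` are our own illustrative assumptions:

```python
# Schematic enumeration: recurse through the levels, extending the current
# mapping with each possible label set locmark, and output a mapping whenever
# the final vertex is reached.
def enum(level_set, mapping, next_level, final_vertex, output):
    if level_set == {final_vertex}:
        output(list(mapping))
        return
    for locmark, next_set in next_level(level_set):
        enum(next_set, mapping + sorted(locmark), next_level, final_vertex, output)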

It is easy to see that the NextLevel function can be computed efficiently:

Given a leveled trimmed alternating mapping DAG with width W, and a level set Λ, we can enumerate without duplicates all the pairs (locmark, Λ′′) of NextLevel(Λ) with delay O(W² × (|locmark| + 1)) per pair, in an order such that (∅, Λ′′) comes last if it is returned.

###### Proof.

We simultaneously go over the sorted lists of the outgoing edges of each vertex of Λ, of which there are at most W, and we merge them. Specifically, as long as we are not done traversing all lists, we consider the smallest label locmark (according to the order) that occurs at the current position of one of the lists. Then, we move forward in each list until the list is empty or the edge label at the current position is no longer equal to locmark, and we consider the set N of all vertices v′ that are the targets of the edges that we have seen. This considers at most W² edges and reaches at most W vertices (which are at the same level as Λ), and the total time spent reading edge labels is in O(W² × |locmark|), so the process is in O(W² × (|locmark| + 1)) so far. Now, we consider the outgoing edges of all vertices of N (all are ϵ-edges) and return the set Λ′′ of the vertices v′′ to which they lead: this only adds O(W²) to the running time because we consider at most W vertices v′ with at most W outgoing edges each. Last, (∅, Λ′′) comes last because of our assumption on the order of adjacency lists. ∎
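A direct (unoptimized) way to realize the NextLevel semantics, without the sorted-list merging used for the delay bound, is to group the non-ϵ-edges leaving Λ by label and then follow the ϵ-edges of their targets; the adjacency encoding below is an illustrative assumption:

```python
from collections import defaultdict

# Unoptimized sketch of NextLevel: group non-ϵ-edges leaving the level set by
# label, follow one ϵ-edge from each target, and put the empty label last.
def next_level(level_set, out_edges, eps_out):
    """out_edges[v]: list of (label, target) non-ϵ-edges; eps_out[v]: ϵ-successors."""
    by_label = defaultdict(set)
    for v in level_set:
        for label, v1 in out_edges.get(v, []):
            by_label[label].update(eps_out.get(v1, []))
    pairs = [(set(label), targets) for label, targets in by_label.items()]
    # the enumeration algorithm requires the empty label, if any, to come last
    pairs.sort(key=lambda p: len(p[0]) == 0)
    return pairs
```

The merging procedure in the proof above achieves the same result with the stated per-pair delay.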

The design of Algorithm 1 is justified by the fact that, for any level set Λ, the set M(Λ) can be partitioned based on the value of locmark. Formally:

For any level set Λ of the mapping DAG which is not the singleton of the final vertex, we have:

M(Λ) = ⋃_{(locmark, Λ′′) ∈ NextLevel(Λ)} locmark ⋅ M(Λ′′)   (1)

Furthermore, this union is disjoint, non-empty, and none of its terms is empty.

###### Proof.

The definition of a level set and of an alternating mapping DAG ensures that we can decompose any path π from Λ to the final vertex as a non-ϵ-edge e from a vertex of Λ to some vertex v′, an ϵ-edge from v′ to some vertex v′′, and a path π′ from v′′ to the final vertex. Further, for each label locmark, the set of such vertices v′′ is clearly a level set. Hence, the left-hand side of Equation (1) is included in the right-hand side. Conversely, given such e, v′, v′′, and π′, we can combine them into a path π, so the right-hand side is included in the left-hand side. This proves Equation (1).

The fact that the union is disjoint is because, by definition of a leveled mapping DAG, the labels of edges starting at vertices in Λ cannot occur at a different level, i.e., they cannot occur on the remainder of the path after the first two edges; so the mappings in M(Λ) are indeed partitioned according to their intersection with the set of labels that occur at the level of Λ.

The fact that the union is non-empty is because is non-empty and its vertices must be co-accessible so they must have some outgoing non--edge, which implies that is non-empty.

The fact that none of the terms of the union is empty is because, for each , we know that is non-empty because the mapping DAG is trimmed so all vertices are co-accessible. ∎

Thanks to this claim, we could easily prove by induction that Algorithm 1 correctly enumerates when Jump is the identity function. However, this algorithm would not achieve the desired delay bounds: indeed, it may be the case that only contains , and then the recursive call to Enum would not make progress in constructing the mapping, so the delay would not generally be linear in the size of the mapping. To avoid this issue, we use the Jump function to directly “jump” to a place in the mapping DAG where we can read a label different from . Let us first give the relevant definitions:

Given a level set in a leveled mapping DAG , the jump level of  is the first level containing a vertex such that some has a path to  and such that is either the final vertex or has an outgoing edge with a label which is and . In particular we have if some vertex in  already has an outgoing edge with such a label, or if is the singleton set containing only the final vertex.

The jump set of  is then if , and otherwise is formed of all vertices at level  to which some have a directed path whose last edge is labeled . This ensures that is always a level set.
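As a naive reference implementation of this definition (quadratic in the DAG, unlike the efficient scheme of Section 5), the jump set can be computed by advancing a frontier through the two skippable labels until a real label or the final vertex is found. The markers `"eps"` and `"empty"` below are hypothetical stand-ins for those two labels.

```python
EPS, EMPTY = "eps", "empty"  # hypothetical names for the skippable labels

def jump(level_set, adj, final):
    """Naive reference for Jump: advance through (EMPTY, EPS) edge pairs
    until some frontier vertex is final or has a real outgoing label.
    adj[v] is the list of (label, target) pairs of v."""
    frontier = set(level_set)
    while (frontier and final not in frontier
           and not any(lbl not in (EPS, EMPTY)
                       for v in frontier for lbl, _ in adj[v])):
        mids = {t for v in frontier for lbl, t in adj[v] if lbl == EMPTY}
        # last edge of the traversed path is an EPS-edge, as required
        frontier = {t for m in mids for lbl, t in adj[m] if lbl == EPS}
    return frontier
```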

The definition of Jump ensures that we can jump from to  when enumerating mappings, and it will not change the result because we only jump over -edges and -edges:

For any level set of , we have .

###### Proof.

As contains all vertices from level that can be reached from , any path from a vertex to the final vertex can be decomposed into a path from to a vertex and a path from  to . By definition of , we know that all edges in  are labeled with  or , so . Hence, we have .

Conversely, given a path from a vertex to the final vertex, the definition of  ensures that there is a vertex and a path from  to , which again consists only of -edges or -edges. Hence, letting be the concatenation of and , we have and is a path from to the final vertex. Thus, we have , concluding the proof. ∎

Claims 4 and 4 imply that Algorithm 1 is correct with this implementation of Jump:

Enum correctly enumerates (without duplicates).

###### Proof.

We show the stronger claim that for every level set , and for every sequence of labels, we have that Enum enumerates (without duplicates) the set . The base case is when is the final vertex, and then and the algorithm correctly returns .

For the induction case, let us consider a level set which is not the final vertex, and some sequence of labels. We let , and by Claim 4 we have that . Now we know by Claim 4 that can be written as in Equation (1) and that the union is disjoint; the algorithm evaluates this union. So it suffices to show that, for each , the corresponding iteration of the for loop enumerates (without duplicates) the set . By induction hypothesis, the call Enum enumerates (without duplicates) the set . So this establishes that the algorithm is correct. ∎
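The recursion just proven correct can be sketched as follows, with NextLevel and Jump passed in as functions and Jump taken to be the identity for simplicity; the cons-cell accumulator mirrors the linked-list representation discussed later, which lets recursive calls share the partial mapping without copying. All names here are ours.

```python
def enum(level_set, next_level, jump, final, acc=()):
    """Sketch of Algorithm 1: yield every mapping captured below
    level_set, one per root-to-final branch of the recursion.
    acc is a cons-style linked list (label, parent) of labels read so
    far; None stands for the empty label and is not recorded."""
    level_set = jump(level_set)
    if level_set == {final}:
        mapping, node = set(), acc
        while node:                  # unwind the linked list
            label, node = node
            mapping.add(label)
        yield mapping
        return
    for label, nxt in next_level(level_set):
        new_acc = (label, acc) if label is not None else acc
        yield from enum(nxt, next_level, jump, final, new_acc)
```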

What is more, Algorithm 1 now achieves the desired delay bounds, as we will show. Of course, this relies on the fact that the Jump function can be efficiently precomputed and evaluated. We only state this fact for now, and prove it in the next section:

Given a leveled mapping DAG with width , we can preprocess in time such that, given any level set of , we can compute the jump set of  in time .

We can now conclude the proof of Theorem 4 by showing that the preprocessing and delay bounds are as claimed. For the preprocessing, this is clear: we do the preprocessing in presented at the beginning of the section (i.e., trimming, and computing the sorted adjacency lists), followed by that of Proposition 4. For the delay, we claim:

Algorithm 1 has delay , where is the size of the mapping of each produced path. In particular, the delay is independent of the size of .

The time to call Jump is in by Proposition 4, and the time spent to move to the next iteration of the for loop with a label set is in time using Proposition 4: now the operations in the loop body run in constant time if we represent as a linked list so that we do not have to copy it when making the recursive call. As Proposition 4 ensures that  comes last, when producing the first solution, we make at most  calls to produce a solution of size , and the time is in . We adapt this argument to show that each successive solution is also produced within that bound: note that when we use in the for loop (which does not contribute to ) then the next call to Enum either reaches the final vertex or uses a non-empty set which contributes to . What is more, as is considered last, the corresponding call to Enum is tail-recursive, so we can ensure that the size of the stack (and hence the time to unwind it) stays .

###### Proof.

Let us first bound the delay to produce the first solution. When we enter the Enum function, we call the Jump function to produce in time by Proposition 4, and either is the final vertex or some vertex in  must have an outgoing edge with a label different from . Then we enumerate with delay for each  using Proposition 4. Remember that Proposition 4 ensures that the label comes last; so by definition of Jump the first value of  that we consider is different from . At each round of the for loop, we recurse in constant time: in particular, we do not copy when writing , because we represent it as a linked list. Eventually, after calls, by definition of a leveled mapping DAG, must be the final vertex, and then we output a mapping of size  in time : the delay is indeed in because the sizes of the values of  seen along the path sum up to , and the unions of and are always disjoint by definition of a mapping DAG.

Let us now bound the delay to produce the next solution. To do so, we will first observe that when enumerating a mapping of cardinality , then the size of the recursion stack is always . This is because Proposition 4 ensures that the value is always considered last in the for loop on . Thanks to this, every call to Enum where is actually a tail recursion, and we can avoid putting another call frame on the call stack using tail recursion elimination. This ensures that each call frame on the stack (except possibly the last one) contributes to the size of the currently produced mapping, so that indeed when we reach the final vertex of  then the call stack is no greater than the size of the mapping that we produce.

Now, let us use this fact to bound the delay between consecutive solutions. When we move from one solution to another, it means that some for loop has moved to the next iteration somewhere in the call stack. To identify this, we must unwind the stack: when we produce a mapping of size , we unwind the stack until we find the next for loop that can move forward. By our observation on the size of the stack, the unwinding takes time , where is the size of the previously produced mapping; so we simply account for this unwinding time as part of the computation of the previous mapping. Now, to move to the next iteration of the for loop and do the computations inside the loop, we spend a delay by Proposition 4. Let be the current size of , including the current . The for loop iteration finishes with a recursive call to Enum, and we can re-apply our argument about the first solution above to argue that this call identifies a mapping of some size in delay . However, because the argument to the recursive call had size , the mapping which is enumerated actually has size and it is produced in delay . This means that the overall delay to produce the next solution is indeed in where is the size of the mapping that is produced, which concludes the proof. ∎

### Memory usage.

We briefly discuss the memory usage of the enumeration phase, i.e., the maximal amount of working memory that we need to keep throughout the enumeration phase, not counting the precomputation phase. Indeed, in enumeration algorithms the memory usage can generally grow to be very large even if one adds only a constant amount of information at every step. We will show that this does not happen here, and that the memory usage throughout the enumeration remains polynomial in  and constant in the input document size.

All our memory usage during enumeration is in the call stack, and thanks to tail recursion elimination (see the proof of Claim 4) we know that the stack depth is at most , where is the size of the produced mapping as in the statement of Theorem 4. The local space in each stack frame must store and , which have size , and the status of the enumeration of NextLevel in Proposition 4, i.e., for every vertex , the current position in its adjacency list: this also has total size , so the total memory usage of these structures over the whole stack is in . Last, we must also store the variables  and , but the total size of these variables across the stack is clearly , and the same holds for because each occurrence is stored as a linked list (with a pointer to the previous stack frame). Hence, the total memory usage is , i.e., in terms of the extended VA.

## 5 Jump Function

The only missing piece in the enumeration scheme of Section 4 is the proof of Proposition 4. We first explain the preprocessing for the Jump function, and then the computation scheme.

### Preprocessing scheme.

Recall the definition of the jump level and jump set of a level set  (Definition 4). We assume that we have precomputed in the mapping associating each vertex  to its level , as well as, for each level , the list of the vertices such that .

The first part of the preprocessing is then to compute, for every individual vertex , the jump level , i.e., the minimal level containing a vertex such that is reachable from  and is either the final vertex or has an outgoing edge which is neither an -edge nor an -edge. We claim:

We can precompute in the jump level of all vertices of .

We do the computation along a reverse topological order: we have for the final vertex , we have if has an outgoing edge which is not an -edge or an -edge, and otherwise we have .

###### Proof.

This construction can be performed iteratively from the final vertex to the initial vertex : we have for the final vertex , we have if has an outgoing edge which is not an -edge or an -edge, and otherwise we have .

This computation can be performed along a reverse topological order, which by [CormenLRS09, Section 22.4] takes linear time in . However, note that has at most vertices, and we only traverse -edges and -edges: we just check the existence of edges with other labels but we do not traverse them. Now, as each vertex has at most outgoing edges labeled  and at most  outgoing edges labeled , the number of edges in the DAG that we actually traverse is only , which shows our complexity bound and concludes the proof. ∎
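This reverse-topological computation can be sketched as follows; the markers `"eps"` and `"empty"` are again hypothetical stand-ins for the two skippable labels, and real-labeled edges are only tested for existence, never traversed.

```python
SKIP = ("eps", "empty")  # hypothetical names for the skippable labels

def jump_levels(order, adj, level, final):
    """Compute the jump level jlev[v] of every vertex.
    order: the vertices in reverse topological order (final first);
    adj[v]: list of (label, target) pairs; level[v]: level of v."""
    jlev = {final: level[final]}
    for v in order:
        if v == final:
            continue
        if any(lbl not in SKIP for lbl, _ in adj[v]):
            jlev[v] = level[v]  # v itself has a real outgoing label
        else:
            # otherwise inherit the minimum over the skippable edges
            jlev[v] = min(jlev[t] for lbl, t in adj[v] if lbl in SKIP)
    return jlev
```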

The second part of the preprocessing is to compute, for each level  of , the reachable levels , which we can clearly do in linear time in the number of vertices of , i.e., in . Note that the definition clearly ensures that we have .

Last, the third step of the preprocessing is to compute a reachability matrix from each level to its reachable levels. Specifically, for any two levels of , let be the Boolean matrix of size at most which describes, for each with and , whether there is a path from  to  whose last edge is labeled . We claim that we can efficiently compute these matrices:

We can precompute in time the matrices for all pairs of levels such that .

We compute them in decreasing order on : the matrix can be computed in time from the edge relation, and matrices with can be computed in time as the product of and : note that has been precomputed because easily implies that .

###### Proof.

We compute the matrices in decreasing order on , then for each fixed  in arbitrary order on :

• if , then

is the identity matrix;

• if , then can be computed from the edge relation of in time , because it suffices to consider the edges labeled  and  between levels  and ;

• if , then is the product of and , which can be computed in time .

In the last case, the crucial point is that has already been precomputed, because we are computing in decreasing order on , and because we must have . Indeed, if , then there is a vertex with such that , and the inductive definition of implies that has an edge to a vertex such that and , which witnesses that .

The total running time of this scheme is in : indeed we consider each of the levels of , we compute at most matrices for each level of because we have for any , and each matrix is computed in time at most . ∎
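The decreasing-order computation of the matrices can be sketched with Boolean matrices as nested lists; the names `step` (the one-level edge matrix from level i to level i+1), `sizes`, and `rlevels` are ours, not the paper's notation.

```python
def bmul(A, B):
    """Boolean matrix product."""
    return [[any(a and b for a, b in zip(row, col))
             for col in zip(*B)] for row in A]

def reach_matrices(sizes, step, rlevels):
    """R[(i, j)]: Boolean reachability matrix from level i to level j,
    computed for decreasing i so that R[(i + 1, j)] is always ready.
    sizes[i]: number of vertices at level i; step(i): one-step matrix
    from level i to i+1; rlevels[i]: the reachable levels of level i."""
    R = {}
    for i in reversed(range(len(sizes))):
        for j in rlevels.get(i, []):
            if j == i:
                R[(i, j)] = [[r == c for c in range(sizes[i])]
                             for r in range(sizes[i])]      # identity
            elif j == i + 1:
                R[(i, j)] = step(i)                          # edge relation
            else:
                R[(i, j)] = bmul(step(i), R[(i + 1, j)])     # product
    return R
```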

### Evaluation scheme.

We can now describe our evaluation scheme for the jump function. Given a level set , we wish to compute . Let be the level of , and let be which we compute as . If , then and there is nothing to do. Otherwise, by definition there must be such that , so witnesses that , and we know that we have precomputed the matrix . Now are the vertices at level  to which the vertices of  (at level ) have a directed path whose last edge is labeled , which we can simply compute in time  by taking the union of the rows that correspond to the vertices of  in the matrix .
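This evaluation step, i.e., OR-ing together the rows of the precomputed matrix that correspond to the vertices of the level set, can be sketched as follows (all names are ours):

```python
def jump_set(level_set, index_of, R_matrix):
    """Evaluate the jump set: union the rows of the reachability
    matrix R_matrix indexed by the vertices of level_set, and return
    the column indices (target vertices) that are reached."""
    cols = len(R_matrix[0])
    reached = [False] * cols
    for v in level_set:
        row = R_matrix[index_of[v]]
        reached = [a or b for a, b in zip(reached, row)]
    return {j for j, hit in enumerate(reached) if hit}
```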

This concludes the proof of Proposition 4 and completes the presentation of our scheme to enumerate the set captured by mapping DAGs (Theorem 4). Together with Section 3, this proves Theorem 1 in the case of extended sequential VAs.

## 6 From Extended Sequential VAs to General Sequential VAs

In this section, we adapt our main result (Theorem 1) to work with sequential non-extended VAs rather than sequential extended VAs. Remember that we cannot tractably convert extended VAs into non-extended VAs [9, Proposition 4.2], so we must modify our construction in Sections 3–5 to work with sequential non-extended VAs directly. Our general approach will be the same: compute the mapping DAG and trim it like in Section 3, then precompute the jump level and jump set information as in Section 5, and apply the enumeration scheme of Section 4. The difficulty is that non-extended VAs may assign multiple markers at the same word position by taking multiple variable transitions instead of one single ev-transition. Hence, when enumerating all possible values for in Algorithm 1, we need to consider all possible sequences of variable transitions. The challenge is that there may be many different transition sequences that assign the same set of markers, which could lead to duplicates in the enumeration. Thus, our goal will be to design a replacement to Proposition 4 for non-extended VAs, i.e., enumerating possible values for  at each level without duplicates.

We start as in Section 3 by computing the product DAG of and of the input document with vertex set with for some fresh value , and with the following edge set:

• For every letter-transition of , for every such that , there is an -edge from to ;

• For every variable-transition of  (where is a marker), for every , there is an edge from to  labeled with .

• For every final state , there is an -edge from to .

The initial vertex of  is and the final vertex is . Note that the edge labels are now always singleton sets or ; in particular there are no longer any -edges.

We can then adapt most of Claim 3: the product DAG is acyclic because all letter-transitions make the second component increase, and because we know that there cannot be a cycle of variable-transitions in the input sequential VA (remember that we assume VAs to be trimmed). We can also trim the mapping DAG in linear time as before, and Claim 3 also adapts to show that the resulting mapping DAG correctly captures the mappings that we wish to enumerate. Last, as in Claim 3, the resulting mapping DAG is still leveled, the depth (number of levels) is still , and the width (maximal size of a level) is still ; we will also define the complete width of  in this section as the maximal size, over all levels , of the sum of the number of vertices with level  and of the number of edges with a source vertex having level : clearly we have . The main change in Section 3 is that the mapping DAG is no longer alternating, i.e., we may follow several non--edges in succession (staying at the same level) or follow several -edges in succession (moving to the next level each time). Because of this, we change Definition 3 and redefine level sets to mean any non-empty set of vertices that are at the same level.

We then reuse the enumeration approach of Sections 4 and 5. Even though the mapping DAG is no longer alternating, it is not hard to see that with our new definition of level sets we can reuse the jump function from Section 5 as-is, and we can also reuse the general approach of Algorithm 1. However, to accommodate the different structure of the mapping DAG, we will need a new definition for NextLevel: instead of following exactly one non--edge before an -edge, we want to be able to follow any (possibly empty) path of non--edges before an -edge. We formalize this notion as an -path:

For a set of labels, an -path in the mapping DAG is a path of edges that includes no -edges and where the labels of the path are exactly the elements of  in some arbitrary order. Recall that the definition of a mapping DAG ensures that there can be no duplicate labels on the path, and that the start and end vertices of an -path must have the same level because no -edge is traversed in the path.

For a level set, is the set of all pairs where:

• is a set of labels such that there is an -path that goes from some vertex  of  to some vertex  which has an outgoing -edge;

• is the level set containing exactly the vertices that are targets of these -edges, i.e., there is an -path from some vertex to some vertex , and there is an -edge from  to .

Note that these definitions are exactly equivalent to what we would obtain if we converted to an extended VA and then used our original construction. This directly implies that the modified enumeration algorithm is correct (i.e., Proposition 4 extends). In particular, the modified algorithm still uses the jump pointers as computed in Section 5 to jump over positions where the only possibility is , i.e., positions where the sequential VA makes no variable-transitions. The only thing that remains is to establish the delay bounds, for which we need to enumerate NextLevel efficiently without duplicates (and replace Proposition 4). To present our method for this, we will introduce the alphabet size as the maximal number, over all levels of the mapping DAG , of the different labels that can occur in non--edges between vertices at level ; in our construction this value is bounded by the number of different markers, i.e., . We can now state the claim:

Given a leveled trimmed mapping DAG with complete width and alphabet size , and a level set , we can enumerate without duplicates all the pairs with delay in an order such that comes last if it is returned.

###### Proof.

Clearly if is the singleton level set consisting only of the final vertex, then the set to enumerate is empty and there is nothing to do. Hence, in the sequel we assume that this is not the case.

Let be the level of . We call the set of possible labels at level , with being no greater than the alphabet size of . We fix an arbitrary order on the elements of . Remember that we want to enumerate , i.e., all pairs of a subset of  such that there is an -path in  from a vertex in to a vertex (which will be at level ) with an outgoing -edge; and the set of the targets of these -edges (at level ). Let us consider the complete decision tree on : it is a complete binary tree of height , where, for all , every edge at height  is labeled with if it is a right child edge and with otherwise. For every node in the tree, we consider the path from the root of  to , and call the positive set of  the labels such that appears in the path, and the negative set  of  the labels such that appears in the path: it is immediate that for every node  of  the sets and are a partition of where is the depth of  in .

We say that a node of  is good if there is some -path in  starting at a vertex of and leading to a vertex which has an outgoing -edge. Our goal of determining can then be rephrased as finding the set of all positive sets for all good leaves  of  (and the corresponding level set ), because there is a clear one-to-one correspondence that sends each subset to a leaf of  such that .

Observe now that we can use Lemma 6 to determine in time , given a node  of , whether it is good or bad: call the procedure on the subgraph of that is induced by level (it has size ) and with the sets and , then check in  whether one of the vertices returned by the procedure has an outgoing -edge. A naive solution to find the good leaves would then be to test them one by one using Lemma 6; but a more efficient idea is to use the structure of  and the following facts:

• The root of  is always good. Indeed, is trimmed, so we know that any has a path to some -edge.

• If a node is good then all its ancestors are good. Indeed, if is an ancestor of , and there is a -path in  starting at a vertex of , then this path is also a path, because and .

• If a node is good, then it must have at least one good descendant leaf . Indeed, taking any -path that witnesses that  is good, we can take the leaf to be such that is exactly the set of labels that occur on the path, so that the same path witnesses that  is indeed good.

Our flashlight search algorithm will rely on these facts. We explore  depth-first, constructing it on-the-fly as we visit it, and we use Lemma 6 to guide our search: at a node  of  (inductively assumed to be good), we call Lemma 6 on its two children to determine which of them are good (from the facts above, at least one of them must be), and we explore recursively the first good child, and then the second good child if there is one. When the two children are good, we first explore the child labeled before exploring the child labeled : this ensures that if the empty set is produced as a label set in then we always enumerate it last, as we should. Once we reach a leaf  (inductively assumed to be good) then we output its positive set of labels .

It is clear that the algorithm only enumerates label sets which occur in . What is more, as the set of good nodes is upwards-closed in , the depth-first exploration visits all good nodes of , so it visits all good leaves and produces all label sets that should occur in . Now, the delay is bounded by : indeed, whenever we are exploring at any node , we know that the next good leaf will be reached in at most calls to the procedure of Lemma 6, and we know that the subgraph of  induced by level  has size bounded by the complete width of  so each call takes time , including the time needed to verify if any of the reachable vertices  has an outgoing -edge: this establishes the delay bound of that we claimed. Last, while doing this verification, we can produce the set of the targets of these edges in the same time bound. This set  is correct because any such vertex has an outgoing -edge and there is a -path from some vertex to . Now, as and the path cannot traverse an -edge, then these paths are actually -paths (i.e., they exactly use the labels in ), so  is indeed the set that we wanted to produce according to Definition 6. This concludes the proof. ∎
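The flashlight search itself is generic: given any oracle `good(pos, neg)` deciding whether some solution contains all of `pos` and none of `neg` (implemented in the proof via Lemma 6), the depth-first exploration of the decision tree can be sketched as follows; exploring the positive child first makes the empty set come last, as required. All names are ours.

```python
def flashlight(labels, good):
    """Enumerate, without duplicates, every label set accepted by the
    oracle good(pos, neg), spending at most O(len(labels)) oracle
    calls between consecutive outputs."""
    def rec(i, pos, neg):
        if i == len(labels):
            yield set(pos)           # good leaf: output its positive set
            return
        lab = labels[i]
        # visit the "include lab" child first so the empty set, if any,
        # is enumerated last
        if good(pos | {lab}, neg):
            yield from rec(i + 1, pos | {lab}, neg)
        if good(pos, neg | {lab}):
            yield from rec(i + 1, pos, neg | {lab})
    if good(frozenset(), frozenset()):  # root is good iff a solution exists
        yield from rec(0, frozenset(), frozenset())
```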

With this runtime, the delay of Theorem 4 becomes , and we know that , that , that , and that ; so this leads to the overall delay of in Theorem 1.

The idea to prove Proposition 6 is to use a general approach called flashlight search [18, 23]: we will use a search tree on the possible sets of labels on  to iteratively construct the set  that can be assigned at the current position, and we will avoid useless parts of the search tree by using a lemma to efficiently check if a partial set of labels can be extended to a solution. To formalize the notion of extending a partial set, we will need the notion of -paths:

For