Enumeration algorithms are a common way to compute large query results on databases, see, e.g., . Instead of computing all results, these algorithms compute results one after the other, while ensuring that the time between two successive results (the delay) remains small. Ideally, the delay should be linear in the size of each produced solution, and independent of the size of the input database. To make this possible, enumeration algorithms can build an index structure on the database during a preprocessing phase that ideally runs in linear time.
Most enumeration algorithms assume that the input database will not change. If we update the database, we must re-run the preprocessing phase from scratch, which is unreasonable in practice. Losemann and Martens  proposed the first enumeration algorithm that addresses this issue: they study monadic second-order (MSO) query evaluation on trees, and show that the index structure for enumeration can be maintained under updates. More precisely, they can update the index in time polylogarithmic in the input tree (much better than re-running the linear preprocessing). The tradeoff is that their delay is also polylogarithmic in , whereas the delay can be independent of when there are no updates .
This result of  leads to a natural question: does the support for updates inherently increase the delay of enumeration algorithms? This is not always the case: e.g., when evaluating first-order queries (plus modulo-counting quantifiers) on bounded-degree databases, updates can be applied in constant time  and the delay is constant, as in the case without updates [18, 22]. However, when evaluating conjunctive queries (CQs) on arbitrary databases, supporting updates has a cost: under complexity-theoretic assumptions, the class of CQs with efficient enumeration under updates  is a strict subclass of the class of CQs for the case without updates . Could the same be true of MSO on trees, as  would suggest?
In this work, we answer this question in the negative, for a restricted update language. Specifically, we show an enumeration algorithm for MSO on trees with the same delay as in the case without updates , while supporting updates with a better complexity than  (see detailed comparison of results in Section 3). The tradeoff is that we only allow updates that change the labels of nodes, called relabelings, unlike  where updates can also insert and delete leaves. We still show how these relabelings are useful to evaluate practical query languages, such as parameterized queries and group-by queries with aggregates. A parameterized query allows the user to specify some parameters for the evaluation (e.g., select some positions on the tree). Our results support such queries: we can model the parameters as labels and apply relabeling updates when the user changes the parameters. A group-by query with aggregates partitions the set of results into groups based on an attribute, and computes some aggregate quantity on each group (e.g., a sum). We show how to enumerate the results of such queries. For groups, our techniques can handle them with one single enumeration structure using relabelings to switch groups. For aggregates, we can efficiently compute and maintain them in arbitrary semirings; this problem was left open by  even for counting, and is practically relevant in its own right . Of course, by Courcelle’s theorem , our results generalize to MSO queries on bounded-treewidth data (see ), where relabelings mean adding or removing unary facts (i.e., the tree decomposition is unchanged).
The proof of our main result follows the approach of  and is inspired by knowledge compilation in artificial intelligence and by factorized representations in database theory. Specifically, we encode knowledge (in our case, the query result) as a circuit in a restricted class, and we then use the circuit for efficient reasoning and for aggregates as in. In , we have used this circuit-based approach to recapture existing enumeration results for MSO on trees [8, 23]. In this work, we refine the approach and show that it can support updates. Our key new ingredient is the notion of hybrid circuits: they have both set-valued gates that represent the values to enumerate, and Boolean gates that encode the tree labels which can be updated. We first show that we can efficiently compute such circuits to capture the possible results of an MSO query under all possible labelings of a tree. Second, we show how to efficiently enumerate the set of assignments captured by these circuits, also supporting updates that toggle the Boolean gates affected by a relabeling. We also introduce some standalone tools, e.g., a lemma to balance the input trees to MSO queries (Lemma 4), ensuring that hybrid circuits have logarithmic depth so that changes can be propagated quickly; and a constant-delay enumeration algorithm for reachability in forests under updates (Section 7).
We start with preliminaries in Section 2, and define our problem and give our main result in Section 3. In Section 4, we review the set-valued provenance circuits of , and show our balancing lemma. We introduce hybrid circuits in Section 5, and show in Section 6 how to use them for enumeration under updates, using a standalone reachability indexing scheme on forests given in Section 7. Having shown our main result, we outline its consequences for application-oriented query languages in Section 8 and conclude in Section 9.
Trees, queries, answers, assignments.
In this work, unless otherwise specified, a tree is always binary, rooted, ordered, and full. Let be a finite set called a tree alphabet. A -tree is a pair of a tree and of a labeling function that maps each node of to a set of labels . We often abuse notation and identify the tree with its node set, e.g., write as a function from to the powerset of ; we may also omit and write the -tree as just .
We consider queries in monadic second-order logic (MSO) on the signature of -trees: it features two binary relations and denoting the first and second child of each internal node, and a unary relation for each denoting the nodes that carry label (i.e., nodes for which ). MSO extends first-order logic, which builds formulas from atoms of this signature and from equality atoms, using the Boolean connectives and existential and universal quantification over nodes. Formulas in MSO can also use second-order quantification over sets of nodes, written as second-order variables. For instance, on , we can express in MSO that every node carrying labels and has a descendant carrying label .
In this work, we study MSO queries, i.e., MSO formulas with free variables. The free variables can be first-order or second-order, but we can rewrite any MSO query to ensure that all free variables are second-order: for instance as , where asserts that is exactly the singleton set . Hence, we usually assume without loss of generality that MSO queries only have second-order free variables.
Given a -tree and an MSO query , an -tuple of subsets of is an answer of on , written , if satisfies in the usual logical sense. It will be more convenient to represent each answer as an assignment, which is a set of pairs called singletons that indicate that an element is in the interpretation of a variable. Formally, given an -tuple of subsets of , the corresponding assignment is . We can convert each assignment in linear time to the corresponding answer and vice-versa, so we will use the assignment representation throughout this work. Our goal is to compute the set of assignments of on , which we call the output of on ; we abuse notation and write it . We measure the complexity of this task in data complexity, i.e., as a function of the input tree , with the query being fixed.
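The conversion between answers and assignments can be sketched as follows. This is only an illustration: encoding a singleton as a pair (variable index, node) and the function names are assumptions of this sketch, not notation from the text.

```python
def answer_to_assignment(answer):
    """Turn a tuple of node sets (one per free variable) into a set of singletons.

    A singleton (i, n) states that node n is in the interpretation of the
    i-th free variable (hypothetical encoding).
    """
    return frozenset((i, n) for i, nodes in enumerate(answer) for n in nodes)


def assignment_to_answer(assignment, arity):
    """Inverse conversion; `arity` is the number of free variables."""
    sets = [set() for _ in range(arity)]
    for i, n in assignment:
        sets[i].add(n)
    return tuple(frozenset(s) for s in sets)
```

Both directions make a single pass over their input, which matches the linear-time conversion claimed above.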
The output of an MSO query can be huge, so we work in the setting of enumeration algorithms [31, 28] which we present following . As usual for enumeration algorithms , we work in the RAM model with uniform cost measure (see, e.g., ), where pointers, numbers, labels for elements and facts, etc., have constant size.
An enumeration algorithm with linear-time preprocessing for a fixed MSO query on -trees takes as input a -tree and computes the output of on . It consists of two phases. First, the preprocessing phase takes as input and produces in linear time a data structure called the index, and an initial state . Second, the enumeration phase repeatedly calls an algorithm . Each call to takes as input the index and the current state , and returns one assignment and a new state : a special state value indicates that the enumeration is over so should not be called again. The assignments produced by the successive calls to must be exactly the elements of , with no duplicates.
We say that the enumeration algorithm has linear delay if the time to produce each new assignment is linear in its cardinality , and is independent of . In particular, if all answers to are tuples of singleton sets (for instance, if is the translation of an MSO query where all free variables are first-order), then the cardinality of each assignment is constant (it is the arity of ). In this case, the enumeration algorithm must produce each assignment with constant delay: this is called constant-delay enumeration. The memory usage of an enumeration algorithm is the maximum number of memory cells used during the enumeration phase (not counting the index , which resides in read-only memory), expressed as a function of the size of the largest assignment (as in ): we say that the enumeration algorithm has linear memory if its memory usage is linear in the size of the largest assignment.
Previous works have studied enumeration for MSO on trees. Bagan  showed that for any fixed MSO query , given a -tree , we can enumerate the output of on with linear delay and memory, i.e., constant delay and memory when all free variables are first-order. This result was re-proven by Kazana and Segoufin  via a result of Colcombet , and a third proof via provenance circuits was recently proposed by the present authors .
3 Problem Statement and Main Result
Our goal is to address a limitation of these existing results, namely, the assumption that the input -tree will never change. Indeed, if is updated, these results must discard the index and re-run the preprocessing phase on the new tree. To improve on this, we want our enumeration algorithm to support update operations on , and to update accordingly instead of recomputing it from scratch. Specifically, an algorithm for enumeration under updates on a tree has a preprocessing phase that produces the index as usual, but has two algorithms during the enumeration phase: (i.) an enumeration algorithm as presented before, and (ii.) an update algorithm . When we want to change the tree , we call with a description of the changes: modifies accordingly, updates the index , and resets the enumeration state (so enumeration starts over on the new tree, and all working memory of the enumeration phase is freed). The update time of the enumeration algorithm is the complexity of : like preprocessing, but unlike delay, it is a function of the size of the (current) tree .
To our knowledge, the only published result on enumeration for MSO queries under updates is the work of Losemann and Martens , which applies to words and to trees, for MSO queries with only free first-order variables. They show an enumeration algorithm with linear-time preprocessing: on words, the update complexity and delay are ; on trees, these complexities become . Thus the delay is worse than in the case without updates , and in particular it is no longer independent of .
In this work, we show that enumeration under updates for MSO queries on trees can be performed with a better complexity that matches the case without updates: linear-time preprocessing, linear delay and memory (in the assignments), and update time in . This improves on the bounds of  (and uses entirely different techniques). However, in exchange for the better complexity, we only support a weaker update language: we can change the labels of tree nodes, called a relabeling, but we cannot insert or delete leaf nodes as in , which we leave for future work (see the conclusion in Section 9). We show in Section 8 that relabelings are still useful to derive results for some practical query languages.
Formally, a relabeling on a -tree is a pair of a node and a label . To apply it, we change the label of by adding if , and removing it if . In other words, the tree never changes, and updates only modify . Our main result is then:
For any fixed tree alphabet and MSO query on -trees, given a -tree , we can enumerate the output of on with linear-time preprocessing, linear delay and memory, and logarithmic update time for relabelings.
See Appendix 7.2 for the proof of this result. ∎
In other words, after preprocessing in time to compute the index , we can:
Enumerate the assignments of on , using , with delay linear in the size of each assignment, so constant if the assignments to have constant size.
Toggle a label of a node of , update , and reset the enumeration, in time .
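To make the interface concrete (preprocessing, an enumeration step that returns an assignment and a new state, and an update that resets the enumeration), here is a minimal sketch for a deliberately trivial query: "return the nodes carrying label a". It is not the index structure of this paper; the class and method names, the choice of query, and the use of `None` as the end state are all assumptions made for illustration. The index is a doubly linked list of matching nodes, so each enumeration step and each relabeling takes constant time.

```python
END = None  # special state value: enumeration is over (nodes must not be None)


class ToyIndex:
    """Index for the toy query 'nodes carrying a given label' (illustration only)."""

    def __init__(self, labels, label="a"):
        # Preprocessing: one linear pass building a linked list of matches.
        self.label = label
        self.labels = {n: set(ls) for n, ls in labels.items()}
        self.prev, self.next, self.head = {}, {}, None
        for n, ls in self.labels.items():
            if label in ls:
                self._link(n)

    def _link(self, n):
        self.prev[n], self.next[n] = None, self.head
        if self.head is not None:
            self.prev[self.head] = n
        self.head = n

    def _unlink(self, n):
        p, q = self.prev.pop(n), self.next.pop(n)
        if p is None:
            self.head = q
        else:
            self.next[p] = q
        if q is not None:
            self.prev[q] = p

    def initial_state(self):
        return self.head

    def enum(self, state):
        """One enumeration step: return (assignment, new state)."""
        return frozenset({(0, state)}), self.next[state]

    def update(self, node, label):
        """Relabeling (node, label): toggle the label, repair the index, reset."""
        was_present = label in self.labels[node]
        self.labels[node].symmetric_difference_update({label})
        if label == self.label:
            (self._unlink if was_present else self._link)(node)
        return self.initial_state()


def enumerate_all(index):
    """Driver loop: call enum until the end state is reached."""
    state, out = index.initial_state(), []
    while state is not END:
        assignment, state = index.enum(state)
        out.append(assignment)
    return out
```

For a general MSO query the index is far more involved, but the contract between preprocessing, enumeration, and update has the same shape.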
4 Provenance Circuits
Our general technique for enumeration follows our earlier work : from the query and input tree, we compute in linear time a structure called a provenance circuit to represent the results to enumerate, we observe that it falls in a restricted circuit class, and we conclude by showing a general enumeration result for circuits of this class. In this section, we review our construction of provenance circuits in , with some additional observations that will be useful for updates. In particular, we show an independent balancing lemma on input trees, which allows us to bound a parameter of the circuit called dependency size. We will extend the formalism of this section to so-called hybrid circuits in the next section; and we will show our enumeration result for such circuits in Sections 6 and 7.
We start with some preliminaries about circuits. A circuit is a directed acyclic graph whose vertices are called gates, whose edges are called wires, where is the output gate, and where is a function giving a type to each gate of (the possible types depend on the kind of circuit). The inputs to a gate are and the fan-in of is its number of inputs .
We define set-valued circuits, which are an equivalent rephrasing of the circuits in zero-suppressed semantics used in . They can also be seen to be isomorphic to arithmetic circuits, and generalize factorized representations used in database theory . The type function of a set-valued circuit maps each gate to one of , , . We require that -gates have fan-in 0 or 2, and that -gates have fan-in 0: the latter are called the variables of , with denoting the set of variables. Each gate of captures a set of assignments, where each assignment is a subset of . These sets are defined bottom-up as follows:
For a variable gate , we have .
For a -gate , we have . In particular, if then .
For a -gate with no inputs, we have .
For a -gate with two inputs and , we have , which we write (this is the relational product).
The set captured by is for the output gate of . Note that each assignment of is a satisfying assignment of when seen in the usual semantics of monotone circuits.
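The bottom-up semantics above can be sketched in Python as follows. The encoding of a circuit as a dictionary from gate names to (type, inputs) and the type names "var", "union", and "times" are assumptions of this sketch. It materializes the captured sets explicitly, which is only meant to illustrate the definition: the point of enumeration is precisely to avoid doing this.

```python
from itertools import product


def captured(circuit, gate, memo=None):
    """Set of assignments captured by `gate`, computed bottom-up.

    `circuit` maps each gate name to (type, inputs), where type is one of
    "var", "union", "times" (hypothetical encoding).
    """
    if memo is None:
        memo = {}
    if gate in memo:
        return memo[gate]
    kind, inputs = circuit[gate]
    if kind == "var":
        # A variable captures the single assignment containing only itself.
        result = {frozenset({gate})}
    elif kind == "union":
        # Union of the inputs; with no inputs this is the empty set.
        result = set()
        for g in inputs:
            result |= captured(circuit, g, memo)
    elif kind == "times" and not inputs:
        # Only the empty assignment is captured.
        result = {frozenset()}
    else:
        # "times" with two inputs: relational product of the two sets.
        left, right = (captured(circuit, g, memo) for g in inputs)
        result = {a | b for a, b in product(left, right)}
    memo[gate] = result
    return result


example = {
    "x": ("var", []),
    "y": ("var", []),
    "t": ("times", ["x", "y"]),
    "u": ("union", ["x", "t"]),
}
print(captured(example, "u"))
```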
Before defining our provenance circuits, we introduce some structural restrictions that they will respect, and that will be useful for enumeration.
The first requirement is that the circuit is a d-DNNF. Our definition of d-DNNF is inspired by  but applies to set-valued circuits, as in  (see also the z-st-d-DNNFs of ). For each gate of a set-valued circuit , we define the domain of as the variable gates having a directed path to . In particular, for , we have , and if then . We now call a -gate decomposable if it has no inputs or if, letting be its two inputs, the domains and are disjoint. This ensures that no variable of occurs both in an assignment of and in an assignment of . We call a -gate deterministic if, for any two inputs of , the sets and are disjoint, i.e., there is no assignment that occurs in both sets. We call a d-DNNF if every -gate is decomposable and every -gate is deterministic. This assumption allows us, e.g., to tractably compute the cardinality of the set captured by .
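As an illustration of why the d-DNNF assumption makes counting tractable, the following sketch computes the number of captured assignments in one pass, without materializing them. It assumes a hypothetical encoding of circuits as a dictionary from gate names to (type, inputs), with type names "var", "union", and "times" chosen for this sketch. Determinism justifies adding the counts at union gates (the input sets are disjoint), and decomposability justifies multiplying them at product gates (distinct pairs yield distinct unions because the domains are disjoint). On a circuit that is not a d-DNNF, the returned number may overcount.

```python
def count(circuit, gate, memo=None):
    """Number of assignments captured by `gate`, assuming a d-DNNF."""
    if memo is None:
        memo = {}
    if gate not in memo:
        kind, inputs = circuit[gate]
        if kind == "var":
            memo[gate] = 1
        elif kind == "union":
            # Deterministic: inputs capture disjoint sets, so counts add up.
            memo[gate] = sum(count(circuit, g, memo) for g in inputs)
        else:
            # Decomposable "times": counts multiply (empty product is 1).
            value = 1
            for g in inputs:
                value *= count(circuit, g, memo)
            memo[gate] = value
    return memo[gate]
```

Thanks to memoization, each gate is processed once, so the running time is linear in the number of wires (counting arithmetic operations as unit cost).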
The second requirement on circuits is called upwards-determinism and was introduced in . In that paper, it was used to show an improved memory bound; in the present paper, we will always be able to enforce it. A wire in a set-valued circuit is called pure if:
is a -gate; or
is a -gate and, letting be the other input of , we have , i.e., captures the empty assignment.
We say that a gate is upwards-deterministic if there is at most one gate such that is pure. We call upwards-deterministic if every gate of is.
The third requirement concerns the maximal fan-in of circuits, which is simply defined for a set-valued circuit as the maximal fan-in of a gate of . We will require that the maximal fan-in is bounded by a constant.
The fourth and last requirement concerns a new parameter called dependency size. To introduce this, we define the dependent gates of a gate in a set-valued circuit as the gates such that there is a directed path from to . Intuitively, the set captured by may then depend on the set captured by . The dependency size of is , i.e., the maximal number of gates that are dependent on any given gate . We will require this parameter to be bounded in terms of the height of the input tree.
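The definition of dependency size can be sketched directly. The sketch below assumes a hypothetical encoding of circuits as a dictionary from gate names to (type, inputs), and assumes that a gate counts as dependent on itself (a path of length zero); if the intended definition excludes the gate itself, subtract one. The computation is quadratic in the worst case and is meant only to illustrate the parameter, not as part of the algorithm.

```python
def dependency_size(circuit):
    """Maximal number of gates reachable upwards from any single gate."""
    # Reverse the wires: parents[g] lists the gates that have g as an input.
    parents = {g: [] for g in circuit}
    for g, (_, inputs) in circuit.items():
        for i in inputs:
            parents[i].append(g)

    def dependents(g):
        # Depth-first search along reversed wires, starting from g.
        seen, stack = {g}, [g]
        while stack:
            for p in parents[stack.pop()]:
                if p not in seen:
                    seen.add(p)
                    stack.append(p)
        return seen

    return max(len(dependents(g)) for g in circuit)
```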
Set-valued provenance circuits.
We can now define provenance circuits as in . A set-valued circuit is a provenance circuit of an MSO query on a -tree if:
The variables of correspond to the possible singletons, formally: ; and
The set of assignments captured by is the output of on , formally: . Equivalently, for any tuple of subsets of , we have iff the assignment is in .
Consider the unlabeled tree of Figure fig:tree, the alphabet , and the MSO query with one free first-order variable asking for the leaf nodes whose -annotation is different from that of their parent (i.e., the node carries label and the parent does not, or vice-versa). Consider the labeling mapping to and and to . A set-valued circuit capturing the provenance of on is given in Figure fig:set.
We then know from  that provenance circuits can be computed efficiently, and they can be made to respect our structural requirements:
[(from , Theorem 7.3)] For any fixed MSO query on -trees, given a -tree , we can compute in time a set-valued provenance circuit of on . Further, is a d-DNNF, it is upwards-deterministic, its maximal fan-in is constant, and its dependency size is in , where denotes the height of .
We recall the main proof technique: we convert to a bottom-up deterministic tree automaton on -trees, and we add nodes to to describe the possible valuations of variables. The provenance circuit then captures the possible ways that can read depending on the valuation: we compute it with the construction of , and is a d-DNNF thanks to automaton determinism (see ). Upwards-determinism is shown as in .
The bounds on fan-in and dependency size are not stated in [3, 4] but already hold there. Specifically, the maximal fan-in is a function of the transition function of , i.e., it does not depend on . The bound on dependency size holds because is constructed following the structure of : we create for each tree node a gadget whose size depends only on , and we connect these gadgets precisely following the structure of , so that for any gate of can only contain gates from the node of or from ancestors of in the tree.
In the context of updates, the bound on dependency size will be crucial: intuitively, it describes how many gates need to be updated when an update operation modifies a gate of the circuit. As this bound depends on the height of the input tree, we will conclude this section with a balancing lemma that ensures that this height can always be made logarithmic (which matches our desired update complexity). We will then add support for updates in the next section by extending circuits to hybrid circuits.
In this appendix, we prove Lemma 4:
Our balancing lemma is a general observation on MSO query evaluation on trees, and is in fact completely independent from provenance circuits. It essentially says that the input tree can be assumed to be balanced. Formally, we will show that we can rewrite any MSO query on -trees to an MSO query on a larger tree alphabet so that any input tree for can be rewritten in linear time to a balanced tree on which returns exactly the same output. Because we intend to support update operations, the input tree will be unlabeled, and the rewritten tree will work for any labeling of . Formally:
For any tree alphabet and MSO query on -trees, we can compute a tree alphabet and MSO query on -trees such that the following holds. Given any unlabeled tree with node set , we can compute in linear time a -tree with node set , such that and such that, for any labeling function , we have , where maps to if and otherwise.
We prove Lemma 4 by seeing the input tree as a relational structure of treewidth 1, and invoking the result by Bodlaender  to compute in linear time a constant-width tree decomposition of which is of logarithmic height. We then translate the query to an MSO query on tree encodings of this width, and compute from the tree encoding corresponding to the tree decomposition (we rename some nodes of to ensure that the nodes of are reflected in ). Note that the balanced tree decompositions of  were already used for similar purposes elsewhere, e.g., in , end of Section 2.3.
To prove Lemma 4, we will need to introduce preliminaries about relational instances [abiteboul1995foundations], tree decompositions, and tree encodings.
A relational signature is a set of relation names together with an associated arity (a non-zero natural number). We fix a relational signature that codes unlabeled trees, consisting of two binary relations and indicating the first and second child of each internal node. For any tree alphabet , we let denote a signature to represent labels of , i.e., one unary relation for each . Last, for a tuple of second-order variables, we let denote a signature to represent the interpretation of these variables, i.e., one unary relation for each . By monadic second-order logic (MSO) over , we denote MSO with the relations of and equality in the usual way.
A relational instance of a relational signature is a set of -facts of the form where are elements, is a relation in , and is the arity of . The domain of is the set of elements that occur in .
Given a -tree , we can easily compute in linear time a pair where is a -instance describing the unlabeled tree in the expected way (in particular, is exactly the set of nodes of ), and is the -instance .
A tree decomposition of an undirected graph is a tree (whose nodes are called bags) and a labeling function such that:
For every , there is such that
For every , the set is a connected subtree of .
We still assume for convenience that tree decompositions are rooted, ordered, binary, and full trees. Specifically, they will be computed as rooted binary trees by , they can be made full without loss of generality (in linear time and without impacting the height) by adding empty bags, and we can add an arbitrary order on the children of each internal bag to make them ordered. The width of is , and the treewidth of is the smallest width of a tree decomposition of .
A tree decomposition of a relational instance is a tree decomposition of its Gaifman graph, i.e., the graph on vertex set where there is an edge between any two elements that occur together in some fact. The treewidth of is that of its Gaifman graph.
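The two conditions in the definition of a tree decomposition, together with the width, can be checked with the following sketch. It assumes the standard definition (every edge is contained in some bag; the bags containing a given vertex form a nonempty connected subtree; the width is the maximal bag size minus one). The representation of the bag tree as a dictionary from bag identifiers to lists of children, and all names, are hypothetical.

```python
def is_tree_decomposition(vertices, edges, bag_tree, bags):
    """Check the two conditions of a tree decomposition.

    bag_tree: bag id -> list of child bag ids (rooted tree of bags).
    bags:     bag id -> set of graph vertices in that bag.
    """
    # Condition 1: every edge is contained in some bag.
    for u, v in edges:
        if not any({u, v} <= content for content in bags.values()):
            return False
    # Condition 2: the bags containing each vertex form a connected subtree.
    parent = {c: b for b, children in bag_tree.items() for c in children}
    for x in vertices:
        holding = {b for b, content in bags.items() if x in content}
        # A nonempty set of nodes of a rooted tree is connected iff exactly
        # one of its nodes has no parent inside the set (the topmost bag).
        tops = [b for b in holding if parent.get(b) not in holding]
        if len(tops) != 1:
            return False
    return True


def width(bags):
    """Width of a decomposition: maximal bag size minus one."""
    return max(len(content) for content in bags.values()) - 1
```

The unique element of `tops` in the second check is the topmost bag containing the vertex, which is the bag that the next paragraph refers to.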
The definition of tree decompositions ensures that, for any relational instance and tree decomposition , for any , we can talk of the topmost bag of such that ; we write this bag . This mapping can be computed explicitly in linear time given and by [flum2002query, Lemma 3.1].
We will make a standard assumption on our tree decompositions, namely, that the function is an injective function: in other words, the root bag contains only one element, and for any non-root bag with parent bag , we have . This requirement can be enforced on a tree decomposition in linear time using standard techniques, without impacting the width of , and only multiplying the height of by a constant (assuming that the width is constant): specifically, we replace each bag violating the condition by a chain of bags where the new elements are introduced one after the other. Hence, we will always make this assumption.
We now recall the result of Bodlaender , which is the key to our construction:
[from ] For any relational signature , given a relational instance on of width , we can compute in linear time in a tree decomposition of of width , such that is in .
If we fix a relational signature and a treewidth bound , we can compute an alphabet , called the alphabet of tree encodings for and , which ensures the following: given any -instance with a tree decomposition of width , we can translate and in linear time to a -tree (called a tree encoding of ) that can be decoded back in linear time to an instance isomorphic to . What is more, Boolean MSO formulas on -instances (i.e., MSO formulas without free variables) can be translated to Boolean MSO formulas on -trees that are equivalent through encoding and decoding. An example of such a scheme is given in [flum2002query]; we will use a different scheme, detailed in , which ensures a property dubbed subinstance-compatibility: intuitively, removing a fact from amounts to toggling labels on a node of the tree encoding that corresponds to (without changing the skeleton of the tree encoding). The labels of intuitively consist of a pair comprising a domain, i.e., a subset of elements among fixed element names, and an optional fact on the elements of the domain. We omit the formal definition of ; see Section 3.2.1 of  for details.
We are now ready to conclude the proof of Lemma 4:
Proof of Lemma 4.
Let be the input query on -trees. Let . We let be the Boolean MSO query on -instances obtained from in the expected way, making it Boolean by replacing each second-order variable with the unary relation of . Given an input tree , we compute in linear time the -instance which represents it. It is clear that, given a labeling , recalling our earlier definition of the -instance from , the output of on is equal to the set of -instances of -facts on (seeing each such instance as a set of singletons of the form ) such that satisfies .
Let be the width of the tree decomposition obtained when applying Theorem 4 to an input tree decomposition of width (note that we have not specified the input yet). Let us compute from the Boolean MSO query on the alphabet of tree encodings for width which is equivalent to on -instances (up to encoding and decoding), i.e., an instance on satisfies iff its encoding as a -tree satisfies . We take to consist of plus a special label , to be used later.
Now, as is a tree, the treewidth of is . Let us define an instance by adding to the instance of all possible -facts on , plus the instance of all possible -facts on . As all these additional facts are unary, the instance still has treewidth . Hence, by Theorem 4, we can compute in linear time in a tree decomposition of of width and logarithmic height. We also compute in linear time the mapping , and a tree encoding of , i.e., a -tree.
Thanks to subinstance-compatibility, we know that, for any labeling and answer tuple of subsets of , letting and be the - and -instances that respectively denote it, then we can obtain a tree encoding of by toggling the labels of some nodes of . Specifically, each fact of corresponds to one node of whose label has to be changed; further, this mapping can be computed in linear time (see , Lemma 3.2.6).
The last thing to argue is that we can rename the nodes of so that they correspond to the nodes of associated to them, ensuring that, given a labeling function of the tree , we can use it to relabel . (This differs slightly from the original construction of , because we want each node of to be associated to one single node in , carrying all possible variables and labels; by contrast, in the construction of , every fact corresponds to a specific node of .) To fix this, we modify in linear time to another -tree : for each , letting , we replace by a gadget with two copies and of , with being the left child of . The label of is that of , and the label of is made of the same domain as but without any fact; see the exact definition of in Section 3.2.1 of  for details. We then add a right child to which is a new node identified with the element in , which itself corresponds to the node in ; the label of is the fixed special label . This construction is well-defined because the function is injective. We must now argue that the query can be modified (independently from ) to an MSO query on -trees to read labels and variable assignments from these new nodes: specifically, instead of reading (the encodings of) the -facts about (the encoding of) an element , the query should read the label in of the new node of identified with ; likewise, instead of reading (the encodings of) the -facts on an element directly from , the query should read the -annotation of this same new node in identified with . To do this, the translations of the atoms from and in are replaced in by a gadget which finds the bag where the corresponding element was introduced (i.e., the one for which it is in the image of ), finds the new node that we added with label , and reads the label and annotation of this node.
We also add a conjunct to to assert that the only nodes that can be part of the interpretation of the are the new nodes in with label , thus ensuring that the set of answers of on any labeling of is correct. This concludes the proof. ∎
5 Hybrid Circuits for Updates
In this section, we extend set-valued circuits to support updates, defining hybrid circuits. We then extend Theorem 4 to these circuits. Last, we introduce a new structural notion of homogenization of hybrid circuits and show how to enforce it. We close the section by stating our main enumeration result on hybrid circuits, which implies our main theorem (Theorem 3), and is proved in the next two sections.
A hybrid circuit is intuitively similar to a set-valued circuit, but it additionally has Boolean variables (which can be toggled when updating), Boolean gates (, , ), and gates labeled which keep or discard a set of assignments depending on a Boolean value. Formally, a hybrid circuit is a circuit where the possible gate types are (set-valued variables), (Boolean variables), , , , , , and . We call a gate Boolean if its type is , , , or ; and set-valued otherwise. We require that the output gate is set-valued and that the following conditions hold:
-gates and -gates have fan-in exactly 0;
All inputs to -gates, -gates, and -gates are Boolean, and -gates have fan-in exactly ;
All inputs to and -gates are set-valued, and -gates have fan-in either 0 or 2;
-gates have one set-valued input and one Boolean input (so they have fan-in exactly 2).
We write to denote the gates of of type , called the Boolean variables, and define likewise the set-valued variables . An example hybrid circuit is illustrated in Figure fig:hybrid.
Unlike set-valued circuits, which capture only one set of assignments, hybrid circuits capture several different sets of assignments, depending on the value of the Boolean variables (intuitively corresponding to the tree labels). This value is given by a valuation of , i.e., a function . Given such a valuation , each Boolean gate captures a Boolean value , computed bottom-up in the usual way: we set for , and otherwise is the result of the Boolean operation given by the type of , applied to the Boolean values captured by the inputs of (in particular, a -gate with no inputs always has value , and a -gate with no inputs always has value ).
We then define the evaluation of under as the set-valued circuit obtained as follows. First, replace each Boolean gate of by a -gate with no inputs (capturing ) if , and by a -gate with no inputs (capturing ) if . Second, relabel each -gate of to be a -gate. Using , for each set-valued gate of , we define the set captured by under : it is the set of assignments (subsets of ) that captures in . The set captured by under is then , for the output gate of .
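As an illustration of this two-step evaluation, here is a minimal Python sketch. It rests on assumptions that are not spelled out in this section: gate types carry the hypothetical names `svar`, `bvar`, `union`, `times`, `select`, `and`, `or`, `not`; a true Boolean gate is replaced by an input-free product gate (capturing the set that contains only the empty assignment) and a false one by an input-free union gate (capturing the empty set); and set-valued circuits have the usual semantics where a variable gate captures its own singleton, a union gate takes the union of its inputs, and a product gate combines one assignment from each input.

```python
BOOLEAN_KINDS = {"bvar", "and", "or", "not"}


def boolean_value(gates, name, valuation):
    """Bottom-up Boolean value of a Boolean gate under a valuation."""
    kind, inputs = gates[name]
    if kind == "bvar":
        return valuation[name]
    values = [boolean_value(gates, i, valuation) for i in inputs]
    if kind == "and":
        return all(values)   # a conjunction with no inputs is true
    if kind == "or":
        return any(values)   # a disjunction with no inputs is false
    if kind == "not":
        return not values[0]
    raise ValueError(f"{name} is not a Boolean gate")


def evaluate(gates, valuation):
    """Set-valued circuit obtained from a hybrid circuit under a valuation."""
    result = {}
    for name, (kind, inputs) in gates.items():
        if kind in BOOLEAN_KINDS:
            is_true = boolean_value(gates, name, valuation)
            result[name] = ("times", []) if is_true else ("union", [])
        elif kind == "select":
            result[name] = ("times", list(inputs))  # relabel the gate
        else:
            result[name] = (kind, list(inputs))
    return result


def captured(circuit, name):
    """Set of assignments captured by a gate of a set-valued circuit."""
    kind, inputs = circuit[name]
    if kind == "svar":
        return {frozenset([name])}
    if kind == "union":
        return set().union(*(captured(circuit, i) for i in inputs))
    if kind == "times":
        result = {frozenset()}
        for i in inputs:
            result = {a | b for a in result for b in captured(circuit, i)}
        return result
    raise ValueError(f"{name} is not a set-valued gate")


hybrid = {
    "x": ("svar", []),
    "y": ("svar", []),
    "b": ("bvar", []),
    "s": ("select", ["x", "b"]),
    "out": ("union", ["s", "y"]),
}
with_b = captured(evaluate(hybrid, {"b": True}), "out")
without_b = captured(evaluate(hybrid, {"b": False}), "out")
```

In this toy circuit, toggling the Boolean variable `b` decides whether the assignment `{x}` is kept, while `{y}` is captured under both valuations.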
Last, we lift the structural definitions from set-valued circuits to hybrid circuits. The maximal fan-in and dependency size of a hybrid circuit are defined as before (these definitions do not depend on the kind of circuit). A hybrid circuit is a d-DNNF (resp., is upwards-deterministic) if, for every valuation of , the set-valued circuit has the same property. For instance, the hybrid circuit in Figure fig:hybrid is upwards-deterministic and is a d-DNNF.
Hybrid provenance circuits.
We can now use hybrid circuits to define provenance with support for updates. The set-valued variables of the circuit will correspond to singletons as before, describing the interpretation of the free variables of the query; and the Boolean variables stand for a different kind of singleton, describing which labels are carried by each node. To describe this formally, we will consider an unlabeled tree , and define a labeling assignment of for a tree alphabet as a set of singletons of the form where and . Given a labeling assignment , we can define a labeling function for , which maps each node to . Now, we say that a hybrid circuit is a provenance circuit of an MSO query on an unlabeled tree if:
The set-valued variables of correspond to the possible singletons in an assignment, formally ;
The Boolean variables of correspond to the possible singletons in a labeling assignment, formally ;
For any labeling assignment , let be the Boolean valuation of mapping each to or depending on whether or not, and let be the labeling function on defined as above. Then we require that the set of assignments captured by under is exactly the output of on , formally, .
In other words, for each labeling of the tree , under the valuation that sets the Boolean variables of accordingly, the evaluation is a provenance circuit for on .
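The correspondence between labeling assignments, labeling functions, and Boolean valuations can be sketched in Python as follows. This is an illustration under assumptions: singletons are encoded as `(label, node)` pairs, the labeling function is taken to map each node to the set of labels it carries, and the helper `relabel` is a hypothetical name for toggling a single singleton.

```python
def labeling_function(nodes, labeling_assignment):
    """Map each node to the set of labels it carries (assumed reading)."""
    labels = {node: set() for node in nodes}
    for label, node in labeling_assignment:
        labels[node].add(label)
    return labels


def boolean_valuation(nodes, alphabet, labeling_assignment):
    """One Boolean variable per (label, node) pair, true iff the pair is present."""
    return {(label, node): (label, node) in labeling_assignment
            for label in alphabet for node in nodes}


def relabel(labeling_assignment, label, node):
    """Toggle one singleton: add the label if absent, remove it otherwise."""
    return labeling_assignment ^ {(label, node)}


nodes = [1, 2, 3]
alphabet = {"B"}
assignment = frozenset({("B", 2)})

labels = labeling_function(nodes, assignment)
valuation = boolean_valuation(nodes, alphabet, assignment)
updated = relabel(assignment, "B", 3)
```

Under this encoding, a relabeling update changes the value of exactly one Boolean variable of the circuit and leaves the circuit itself untouched.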
Recall the query and alphabet of Example 4, and the tree of Figure fig:tree. A hybrid circuit capturing the provenance of on is given in Figure fig:hybrid (with variable gates being drawn at multiple places for legibility): square leaves correspond to Boolean variables testing node labels, and circle leaves correspond to set-valued variables capturing a singleton of the form for some . In particular, for the labeling of Example 4, the corresponding valuation maps to and and to , and the evaluation of under captures the same set as the circuit of Figure fig:set.
We can now extend Theorem 4 to compute a hybrid provenance circuit as follows:
5.1 Proof of the Provenance Circuit Theorem
In this appendix, we prove Theorem 5.1:
For any fixed MSO query on -trees, given an unlabeled tree , we can compute in time a hybrid provenance circuit which is a d-DNNF, is upwards-deterministic, has constant maximal fan-in, and has dependency size in .
The proof is analogous to that of Theorem 4. The only difference is that the automaton now reads the label of each node as if it were a variable, so that the provenance circuit also reflects these label choices as Boolean variables.
The general idea is that, given the MSO query on -trees, writing , we define a query on unlabeled trees, where , with one second-order variable corresponding to each . The construction is simply that we replace each unary predicate in by the corresponding second-order variable . It is now obvious that, for any labeled tree , defining for each , for any set of subsets of , we have iff . In other words, we have simply turned node labels into second-order variables.
Now, at a high level, we can simply construct a provenance circuit of on in the sense of Theorem 4, replace the input gates corresponding to the variables by a Boolean input gate, and observe that the desired properties hold. We will now give a self-contained proof of the construction, to make sure that we reflect the changes in definitions between the present work and [3, 4].
We will need to introduce some prerequisites about tree automata. Given a tree alphabet , a bottom-up deterministic tree automaton on , or -bDTA, is a tuple where is a finite set of states, are the final states, is the initial function, and is the transition function. The run of a -bDTA on a -tree is the function defined inductively as when is a leaf, and when is an internal node with children and . We say that accepts the tree if the run of on maps the root of to a final state.
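To make the definition of a run concrete, here is a small Python sketch of a bottom-up deterministic tree automaton on full binary trees. The class `Node`, the dictionary encoding of the initial and transition functions, and the argument order `(label, left state, right state)` are assumptions made for illustration; the toy parity automaton is not taken from the paper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    label: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def run(node, initial, delta):
    """State reached at `node`, computed bottom-up."""
    if node.left is None:  # leaf (full binary trees are assumed)
        return initial[node.label]
    q_left = run(node.left, initial, delta)
    q_right = run(node.right, initial, delta)
    return delta[(node.label, q_left, q_right)]


def accepts(root, initial, delta, final):
    """The automaton accepts iff the run maps the root to a final state."""
    return run(root, initial, delta) in final


# Toy automaton over the alphabet {"a", "b"}: the state is the parity of the
# number of "a"-labeled nodes, and a tree is accepted when this number is odd.
initial = {"a": 1, "b": 0}
delta = {(label, q1, q2): (q1 + q2 + (label == "a")) % 2
         for label in "ab" for q1 in (0, 1) for q2 in (0, 1)}
final = {1}

one_a = Node("b", Node("a"), Node("b"))
two_a = Node("a", Node("a"), Node("b"))
```

The tree `one_a` contains a single `a`-labeled node and is accepted, whereas `two_a` contains two and is rejected.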
We will use bDTAs to capture our non-Boolean query on unlabeled trees. Let be the set of variables, and let , where denotes the powerset of . Letting be an unlabeled tree, we call a -annotation of a function : the annotation intuitively describes the interpretation of the variables of by annotating each node with the set of variables to which it belongs. Letting be a -bDTA, be an unlabeled tree, and be a -annotation of , we say that is a satisfying annotation of on if accepts . In this case, we see as defining an assignment , which is the set . The output of on , written , is the set of assignments corresponding to its satisfying annotations. Following Thatcher and Wright [thatcher1968generalized], and determinizing the automaton using standard techniques [tata], the output of an MSO query (here, on an unlabeled tree) can be computed as the output of an automaton for that query. Formally:
[[thatcher1968generalized, tata]] Given an MSO query on unlabeled trees, we can compute a -bDTA such that, for any unlabeled tree , we have .
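The notion of output can be illustrated by a brute-force Python sketch that tries every annotation of a small unlabeled tree and collects the assignments of the satisfying ones. This is only meant to make the definition concrete: it takes time exponential in the tree, unlike the constructions of this paper. The tree encoding (a dictionary from node to a pair of children, or `None` for leaves), the `(variable, node)` encoding of singletons, and the toy automaton are assumptions.

```python
from itertools import combinations, product


def powerset(variables):
    """All subsets of the variable set, as frozensets."""
    items = sorted(variables)
    return [frozenset(c) for k in range(len(items) + 1)
            for c in combinations(items, k)]


def run(tree, node, annotation, initial, delta):
    """State reached at `node` when each node is labeled by its annotation."""
    children = tree[node]
    if children is None:
        return initial(annotation[node])
    left, right = children
    return delta(annotation[node],
                 run(tree, left, annotation, initial, delta),
                 run(tree, right, annotation, initial, delta))


def output(tree, root, variables, initial, delta, final):
    """Assignments defined by the satisfying annotations (brute force)."""
    nodes = sorted(tree)
    results = set()
    for labels in product(powerset(variables), repeat=len(nodes)):
        annotation = dict(zip(nodes, labels))
        if run(tree, root, annotation, initial, delta) in final:
            results.add(frozenset((var, node)
                                  for node in nodes
                                  for var in annotation[node]))
    return results


# Toy automaton for the variable set {"x"}: the state counts the annotated
# nodes, capped at 2, and the automaton accepts when exactly one node is chosen.
def initial(label):
    return min(len(label), 2)


def delta(label, q_left, q_right):
    return min(len(label) + q_left + q_right, 2)


tree = {"r": ("l1", "l2"), "l1": None, "l2": None}
answers = output(tree, "r", {"x"}, initial, delta, final={1})
```

On this three-node tree, the output consists of the three assignments that place `x` on exactly one node.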
Restricting to Boolean annotations.
It will be more convenient in the sequel to assume that each tree node carries one single Boolean annotation rather than many, and to distinguish the annotations corresponding to (the original variables of , called enumerable), and those corresponding to (the labels of the input tree, called updatable). We will do this by creating -copies of each tree node , to stand for each separate singleton . To do this, we will consider the fixed alphabet . Intuitively, will be the label of nodes whose annotation corresponds to a variable of , will be the label of nodes whose annotation corresponds to a variable of , and will be the label of nodes whose annotation does not code any variable and should be ignored. Given a -tree , we will write , , and to refer to the set of nodes carrying each label. We will then consider -trees, where , the alphabet of -trees annotated with a Boolean value at each node: as promised, each node carries one single value. Now, a Boolean annotation of a -tree is a function , and we see as a -tree defined in the expected way.
We want to rephrase the evaluation of on an unlabeled tree as a problem on -trees, where variable valuations are coded in Boolean annotations. This process is formalized in the following lemma, whose construction is illustrated in Figure 1; it is analogous to Lemma E.2 of :
For any variable set , given a -bDTA , we can compute a -bDTA such that the following holds: given an unlabeled tree , we can compute in linear time a tree of height and an injective function such that:
is exactly the set of nodes such that for some and ;
is exactly the set of nodes such that for some and ;
is exactly the set of nodes not in the image of , and it includes all internal nodes.
Further, for any -annotation , let be the Boolean valuation of defined by:
If is in the image of , then letting , we set iff ;
If is not in the image of , we set .
Then accepts iff accepts .
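The Boolean valuation of the lemma can be sketched in Python as follows. The sketch assumes a reading of the two cases that the statement leaves implicit here: a node in the image of the injective function is set to true exactly when its variable belongs to the annotation of its original node, and every other node is set to false. The dictionary encoding of the injective function and the node names are hypothetical.

```python
def boolean_valuation(new_nodes, phi, annotation):
    """Valuation of the nodes of the rewritten tree induced by an annotation.

    `phi` maps (original node, variable) pairs injectively to new nodes.
    """
    inverse = {image: pair for pair, image in phi.items()}
    valuation = {}
    for new_node in new_nodes:
        if new_node in inverse:
            node, variable = inverse[new_node]
            valuation[new_node] = variable in annotation[node]
        else:
            # Nodes outside the image code no variable (assumed to be false).
            valuation[new_node] = False
    return valuation


# One original node "n" with two variables, coded by two leaves of its gadget.
phi = {("n", "x"): "n_x", ("n", "y"): "n_y"}
new_nodes = ["n_root", "n_x", "n_y", "pad"]
annotation = {"n": {"x"}}
valuation = boolean_valuation(new_nodes, phi, annotation)
```

Because the function is injective, the inverse lookup is well defined, so each node of the rewritten tree codes at most one pair of an original node and a variable.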
Given an input tree , we change it following the idea of Figure 1: we replace each node by a gadget of nodes labeled with , having two subtrees: one whose leaves are labeled and code the variables in order, and another whose leaves are labeled and code the variables in order. This gadget can be completed to a full binary tree by adding leaves labeled as necessary. Now we can clearly rewrite the -bDTA to a -bDTA which is equivalent in the sense required by the lemma. The states of consist of the states of , the pairs of states of , and annotation states which consist of binary sequences of length up to . The final states are the final states of . The initial function and transition function are informally defined as follows. The initial function maps each node labeled or for to the singleton binary sequence consisting of its Boolean value, and it maps each node labeled for to the empty binary sequence. The transition function is defined only on nodes labeled for , because all internal nodes of carry such a label (as required); and it is defined as follows (where we ignore the Boolean annotation of the node):
Given two states and of , the new state is the pair ;
Given two states that are binary sequences of length , the new state is their concatenation;
Given a binary sequence of length and a pair of states , the new state is the state of , where is the transition function of ;
Given a binary sequence of length and an empty binary sequence, the new state is the state .
In Figure 1, the automaton would reach state on , reach state on and reach state on . Letting and be the states that reaches respectively on and , it reaches state on . Hence, on node , it reaches . This figure illustrates the translation when is an internal node with children and . The case where is a leaf is handled by the last bullet point and is analogous: the leaf in is translated to a node in with one left child that is the root of the tree describing the valuation of , and one right child labeled which is a leaf of .
Now, it is easy to show that is equivalent to in the sense of the lemma statement, which concludes the proof. ∎
We now have a -bDTA to run on a -tree . We can now rephrase our desired provenance result as a provenance result on such automata. We say that a hybrid circuit is a provenance circuit of a -bDTA on a -tree if:
The set-valued variables of correspond to the nodes of with label , formally,
The Boolean variables of correspond to the nodes of with label , formally,
For any Boolean valuation of such that for each , the automaton accepts iff, letting be the restriction of to , and letting be the set of nodes of corresponding to the restriction of to , we have .
We can now rephrase our desired result. Note that the statement of this result implies that our construction is also tractable in the automaton, as we mentioned in the conclusion (Section 9):
Given a -bDTA and a -tree where all internal nodes are labeled , we can compute in time a hybrid circuit which