 # Proving that a Tree Language is not First-Order Definable

We explore from an algebraic viewpoint the properties of the tree languages definable with a first-order formula involving the ancestor predicate, using the description of these languages as those recognized by iterated block products of forest algebras defined from finite counter monoids. Proofs of nondefinability are infinite sequences of sets of forests, one for each level of the hierarchy of quantification levels that defines the corresponding variety of languages. The forests at a given level are built recursively by inserting forests from the previous level at the ports of a suitable set of multicontexts. We show that a recursive proof exists for the syntactic algebra of every non-definable language. We also investigate certain types of uniform recursive proofs. For this purpose, we define from a forest algebra an algebra of mappings and an extended algebra, which we also use to redefine the notion of aperiodicity in a way that generalizes the existing ones.

## 1 Introduction

Words and trees are used almost universally in Computer Science, and logical formalisms are among the most convenient tools for specifying these objects, or sets thereof. Automata constitute another class of tools, procedural in nature, widely used to define languages; underlying this formalism is a rich algebraic theory, through which further tools from other areas of Mathematics can be used to better understand the properties of word and tree languages. Moreover, the most significant classes of these languages happen to have descriptions in several formalisms. For example, the regular word languages are exactly those that are recognized by finite automata and monoids, and those that are definable by monadic second-order formulas with a unary predicate for each letter and a “left-of” binary positional predicate. Similarly the regular tree languages are simultaneously those that are definable by monadic second-order formulas with two positional predicates (“ancestor-of” and “next-sibling-of”) and those that are recognized by finite tree automata, as well as various sorts of algebraic structures (e.g. finite term algebras, finite forest algebras, finitary preclones).
In the same vein, the word languages definable with first-order logical formulas with the “left-of” predicate have several algebraic and combinatorial descriptions, see [pi97, sc65, krrh65, rhti89, th82]. In particular, these languages are precisely those whose syntactic monoid is aperiodic; thanks to this property, whether a regular word language is first-order definable can be determined with a straightforward algorithm. In the world of tree languages, however, none of the definitions for aperiodicity of the syntactic algebra tried so far has managed to characterize precisely the first-order definable languages [he91, bostwa09], and the techniques invented to show that certain important subclasses of these languages are indeed decidable [bese09, bosest08, bose08, plse11, plse15] did not seem to extend to the whole class.
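The aperiodicity test mentioned above can be illustrated on a finite monoid given by its multiplication. The following is a minimal sketch, not the decision procedure of any cited work: it assumes the monoid is supplied as a list of hashable elements and a Python function `mul`, and the names `index_and_period`, `is_aperiodic`, `boolean_or` and `add_mod_2` are hypothetical helpers introduced only for illustration. It computes, for each element, the threshold (index) and period of the cyclic semigroup it generates, and declares the monoid aperiodic when every period equals 1.

```python
def index_and_period(x, mul):
    """Return (i, p): the least i >= 1 and p >= 1 such that x^(i+p) == x^i."""
    first_seen = {}            # power of x -> exponent at which it first appeared
    power, k = x, 1
    while power not in first_seen:
        first_seen[power] = k
        power = mul(power, x)
        k += 1
    i = first_seen[power]      # here x^k == x^i with i < k
    return i, k - i


def is_aperiodic(elements, mul):
    """A finite monoid is aperiodic iff every element has period 1."""
    return all(index_and_period(x, mul)[1] == 1 for x in elements)


def boolean_or(x, y):          # the Boolean OR monoid on {0, 1}
    return x | y


def add_mod_2(x, y):           # the cyclic group of order 2 on {0, 1}
    return (x + y) % 2


if __name__ == "__main__":
    print(is_aperiodic([0, 1], boolean_or))   # True
    print(is_aperiodic([0, 1], add_mod_2))    # False
```

The Boolean OR monoid passes the test, while the cyclic group of order 2 fails it because its generator has period 2.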
Forest algebras combine two monoids (horizontal and vertical) in a way which makes it easy for researchers to apply techniques from the theory of monoids and word languages. An encouraging harvest of results has already been obtained with this tool [bowa08, bosest08, bostwa09]. In this paper, we look at a counterpart, in the world of trees and forests, to the description of the aperiodic monoids as the variety of monoids generated by iterated block products of semilattices [rhti89]. This is a description of the first-order definable languages, developed in terms of the variety of finitary preclones [eswe10] generated by iterated block products of preclones that count occurrences of node labels, regardless of the actual tree structure. Intuitively, a block product works as the combination of two tree automata where at every node , the second automaton, besides the label of , also reads the current state reached by the first automaton after reading the subtree rooted at (“below” ) and the outcome of the processing, by the first automaton, of the context of within the tree (“above” ).
The description of first-order definable languages developed in [hath87, mora03] suggests that the threshold-, period- numerical congruences are a fundamental feature in the combinatorics of first-order definable languages. Consistently with this, we use the same kind of counting and the corresponding quotients, the counter monoids (in these notations, the Boolean OR and the cyclic group are respectively and ). In our formalism, we denote by the one-dimensional algebra where is the horizontal monoid, by the variety of forest algebras generated by , by the variety generated by iterated block products of algebras from , and by the closure of these varieties under join. Let and denote both the class of forest languages definable by first-order formulas built with the “ancestor” positional predicate and the usual quantifiers only (for ) or the same with the , , modular quantifiers, and the varieties generated by their syntactic algebras. Using the formalism of finitary preclones they introduced in [eswe05], Ésik and Weil established in [eswe10] the correspondences and .
We explore two ways of defining from another algebra, where multicontexts are the underlying objects. The algebra of mappings , and the “multivertical monoid” of  derived from it, enable us to define notions of pumping and aperiodicity that generalize two of the known necessary conditions for membership in , namely aperiodicity of the vertical monoid and the “absence of vertical confusion on uniform multicontexts” defined in [bostwa09]. The extended algebra , where is the powerset of , makes explicit some properties of  that are not directly visible in  or . An example is described in the section devoted to Potthoff's language: this is a language whose syntactic algebra has aperiodic vertical and multivertical monoids, but where is divided by the group . The language is defined with a formula that, among other things, counts the parity of the length of certain node-to-leaf paths; a pair of elements of does precisely this counting.
An algebra  lies outside of the variety if, and only if, there exists an infinite sequence of sets , one for each , of forests belonging to different languages recognized by , such that the elements of cannot be told apart by any forest algebra in . This sequence is usually described through an Ehrenfeucht-Fraïssé game. We call such a sequence a proof of non-membership in . An Ehrenfeucht-Fraïssé game actually builds a recursive proof, where each forest of is built by inserting copies of the elements of at the ports of the corresponding element of a set of multicontexts. Such a proof can be specified with an infinite sequence of such sets ; we denote by the proposition that states the existence of a that has the required properties; RC stands for “recursive construction”. We prove that if, and only if, holds for every , that is, every algebra that is not first-order has a recursive proof of non-membership. Next, we observe that in the existing proofs, the circuit is either identical to , in which case each forest of is built from copies of the same, finite set of multicontexts (“proof-by-copy”), or is obtained by pumping a starting set of multicontexts (“proof-by-pumping”). The questions of the existence of a proof-by-copy and of a restricted form of proof-by-pumping are both recursively enumerable.
Section 2 contains background on forests, multicontexts and circuits, and on forest algebras and the varieties . In Section 3, we define the algebras and , and the related notions of pumping and aperiodicity. In Section 4, we prove that an algebra is outside if, and only if, this can be asserted with a recursive proof; we then explore the notions of proof-by-copy and proof-by-pumping. In Section 5, we discuss in our formalism some typical examples of non-membership proofs. We conclude with some comments and open questions.

## 2 Definitions and Background

### 2.1 Forests, Multicontexts, and Circuits

We consistently work with a finite alphabet , which we assume to always contain a neutral letter , such that for every forest homomorphism , is the identity mapping. Let be another alphabet, disjoint from . A multicontext over is a sequence of trees in which a subset of the leaves consists of ports, where every non-port node carries a label . We denote by the set of all nodes in , by the set of its ports and by the set of the non-port nodes. We work with multicontexts where each port either has a label , or has several labels, each specified as a mapping from to a set that is disjoint from . A forest is a multicontext without ports; a context in the usual sense is a forest with a unique port that carries the special label , called its -port. Throughout the paper, this port is considered apart from the others. Given , we denote by the multicontext consisting of all subtrees of rooted at the sons of . The subtree rooted at , i.e. plus the node , is denoted . The context of within , with notation , is built from and by replacing with a -port. The ancestors of this port constitute the trunk of the context. If we deal with a set of multicontexts instead of an individual , we use the notations , , , etc. The sets of all forests and of all contexts over are respectively denoted  and . We use the notations and , respectively, for the set of all multicontexts over and for the set of all multicontexts with a -port (the contexts-in-multicontexts, so to speak). We use the standard representation for individual forests or multicontexts, where nodes are listed in preorder and where concatenation and represent the father-son relation and horizontal addition, respectively. For example, is a tree with a root labelled and two sons labelled and , while is a forest of two trees, where nodes and are roots, and nodes and are leaves.
Inserting in a context consists in replacing the -port of with a copy of ; the resulting forest is denoted , or . Insertion in multicontexts is done here either on a wholesale basis, i.e. something is inserted at every port, or on a selective basis, when insertion occurs at a pre-specified set of ports. The latter method is defined in Section 3; the former is associated with circuits and the construction of witnesses, as follows.
Let , and be three sets. A circuit over is a set with an element for every ; this component is a multicontext . We can regard as having an input wire for every , an output wire for every , and is the result of unraveling into a tree all those nodes of from which the output wire is accessible. A set of forests over is defined similarly, with an element for every . The insertion of in a circuit over consists in inserting a copy of the forest at every ; the result is a set of forests over , denoted . If and are circuits over then inserting in builds a circuit over . It can be verified, using standard methods, that this operation is associative.
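Insertion of a forest at the distinguished port of a context can be sketched concretely. This is an illustrative encoding only, under the following assumptions that are not part of the paper: a forest is a tuple of trees, a tree is a pair `(label, children)` where `children` is itself a forest, and the port of a context is a childless node carrying the hypothetical marker `HOLE`. The helper names `insert` and `show` are likewise hypothetical.

```python
HOLE = "[]"   # hypothetical marker for the distinguished port of a context


def insert(context, forest):
    """Replace the port of `context` with a copy of `forest`."""
    result = []
    for label, children in context:
        if label == HOLE:
            result.extend(forest)          # the port is replaced by all trees of the forest
        else:
            result.append((label, insert(children, forest)))
    return tuple(result)


def show(forest):
    """Render a forest with '+' between siblings and parentheses around children."""
    parts = []
    for label, children in forest:
        parts.append(label + ("(" + show(children) + ")" if children else ""))
    return " + ".join(parts)


if __name__ == "__main__":
    p = (("a", (("b", ()), (HOLE, ()))),)      # the context a(b + [])
    s = (("c", ()), ("d", ()))                 # the forest c + d
    print(show(insert(p, s)))                  # a(b + c + d)
```

Since a context inserted into a context is again a context in this encoding, the same function also composes contexts, and the associativity noted above can be checked on small instances by comparing `insert(insert(p, q), s)` with `insert(p, insert(q, s))`.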

### 2.2 Forest algebras

The reader is assumed to be familiar with the notions of semigroups and monoids, and their relations with regular languages, word congruences and monoid homomorphisms (see [pi84, pi97]). Two types of notations are used for the monoids discussed in the article. There is an additive, or “horizontal” notation where the identity and operation are denoted and , respectively, although this in no way implies that the latter is commutative. In the multiplicative, or “vertical” notation, the neutral element is denoted and the operation is written with or by concatenation of the arguments.
A transformation of a set is a mapping , i.e. an element of the monoid . A translation in a monoid (with the additive notation) is a mapping , where , defined by . If is commutative, then the translations are of the form . The set with the composition of functions is the translation monoid of .
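The translation monoid of a small finite monoid can be enumerated directly. The sketch below assumes that a translation has the two-sided form h ↦ g1 + h + g2, which is the usual reading of the definition above; it represents each translation as the tuple of its images on a fixed listing of the elements. The function names and the three-element example monoid (an identity `e` with two left zeros `a`, `b`) are hypothetical and chosen only because the monoid is non-commutative.

```python
from itertools import product


def translations(elements, add):
    """All maps h -> g1 + h + g2, each encoded as the tuple of its images."""
    elems = list(elements)
    return {
        tuple(add(add(g1, h), g2) for h in elems)
        for g1, g2 in product(elems, repeat=2)
    }


def compose(f, g, elements):
    """Return the map 'f after g', both given as tuples of images."""
    elems = list(elements)
    position = {h: i for i, h in enumerate(elems)}
    return tuple(f[position[g[i]]] for i in range(len(elems)))


def left_zero_with_identity(x, y):
    """Monoid on {e, a, b}: e is the identity, a and b absorb on the left."""
    return y if x == "e" else x


if __name__ == "__main__":
    M = ["e", "a", "b"]
    T = translations(M, left_zero_with_identity)
    print(len(T))                                              # 5
    print(all(compose(f, g, M) in T for f in T for g in T))    # True
```

The set of translations contains the identity map and is closed under composition, so it is a submonoid of the monoid of all transformations of the underlying set.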

###### Definition 2.1

A forest algebra is a pair where is a monoid and is a submonoid of which contains .

Monoids and are the horizontal and vertical monoids of , respectively. Because is a submonoid of , its action on is faithful. Forest algebras were introduced in [bowa08] as pairs of abstract monoids; in that case, faithfulness has to be specified in the definition.
A forest algebra homomorphism from to is a pair of mappings where and are monoid homomorphisms and , respectively, and such that for every . The free forest algebra over is ; since it is generated from , a homomorphism is completely specified once  and every , , are known. A forest congruence in  is a pair of equivalence relations, both denoted by , such that in   iff for every context , and in   iff for every forest . A congruence refines another congruence over the same domain, when for all . A homomorphism defines its nuclear congruence: , and conversely a congruence defines a homomorphism from  to . A set is recognized by  if there exist a homomorphism and a subset such that for all , . A context language is recognized in the same way, with an accepting set . The syntactic congruences of these languages are refined by .
A variety of forest algebras is a class of finite forest algebras closed under finite direct product and division. Given forest algebras and , we say that  is a subalgebra of  iff and , and that it divides , with notation , if it is the homomorphic image of a subalgebra of . A variety of forest languages is formally defined as a mapping such that, for every alphabet , is closed under finite boolean operations, inverse homomorphism of free algebras and context quotients. With a language and , the context quotient of by is the set ; a forest algebra which recognizes also recognizes . The lattices of varieties of forest algebras and of varieties of forest languages are isomorphic [bostwa12].
Let be a forest algebra and let . An element is accessible from when for some . A set is strongly connected when its elements are mutually accessible; a strongly connected component of is a subset that is maximal for this property. Let be such a set: we define from it the set of all elements from which is accessible, and its complement , which is an ideal, that is, a subset of closed under the action of every element of . Let . The leaf-completion of a multicontext through a mapping is the forest , obtained by labeling every port with . Consistently with this, the leaf-extension of a homomorphism to is built by defining , for every and one-node tree with label . Then is the image by of the leaf-completion of through .

### 2.3 Block product congruences

It is known that the equivalence relation over  where every class consists of all forests that model the same set of formulas of quantifier depth is a forest congruence. A generalized version of this congruence is , where are integers, defined as follows:

• it is built around the threshold-, period- counting congruence over , defined by

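$$p \equiv_{\tau,\pi} q \;\Longleftrightarrow\; p = q \,\lor\, \bigl(p \ge \tau \,\land\, q \ge \tau \,\land\, (p - q) \equiv 0 \pmod{\pi}\bigr);$$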

the quotient monoid is denoted ;

• given , we have if, and only if, for every , the number of nodes with label in and in are congruent under ; the quotient algebra is denoted ; the corresponding surjective homomorphism is ;

• for , given that and the quotient algebra are already known, we define a relabeling operation which consists in replacing, at every node of , the label with the triple

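$$\lambda\bigl(s^{\alpha^{n}_{\tau,\pi}}, x\bigr) = \bigl\langle \lambda(s,x),\ \alpha^{n}_{\tau,\pi}(\Delta(s,x)),\ \alpha^{n}_{\tau,\pi}(\nabla(s,x)) \bigr\rangle;$$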

this defines the relabeling alphabet ; the same is done in a context ; however, the new label of is different depending on whether is on the trunk, so that is a context over , where and are disjoint copies of ;

• for and , we have if, and only if and .

Example: we have and , which illustrates the distinction between trunk and off-trunk nodes.
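The base level of the construction above, counting node labels up to threshold τ and period π, can be sketched as follows. This is an illustration under stated assumptions, not the paper's formal definition: forests are encoded as tuples of `(label, children)` pairs, the neutral letter is not treated specially, and the names `counts_equiv`, `normal_form`, `label_counts` and `level1_equiv` are hypothetical.

```python
from collections import Counter


def counts_equiv(p, q, tau, pi):
    """Threshold-tau, period-pi congruence on non-negative integers."""
    return p == q or (p >= tau and q >= tau and (p - q) % pi == 0)


def normal_form(p, tau, pi):
    """Canonical representative of p in the quotient counter monoid."""
    return p if p < tau else tau + (p - tau) % pi


def label_counts(forest):
    """Number of nodes carrying each label, regardless of the tree structure."""
    counts = Counter()
    for label, children in forest:
        counts[label] += 1
        counts.update(label_counts(children))
    return counts


def level1_equiv(s, t, tau, pi):
    """Two forests are equivalent when all their label counts are congruent."""
    cs, ct = label_counts(s), label_counts(t)
    return all(counts_equiv(cs[a], ct[a], tau, pi) for a in set(cs) | set(ct))


if __name__ == "__main__":
    s = (("a", (("b", ()), ("b", ()))),)          # a(b + b)
    t = (("a", ()), ("b", ()), ("b", ()))         # a + b + b
    print(level1_equiv(s, t, tau=2, pi=1))        # True
```

The two forests in the usage example have different shapes but the same label counts, so the base-level congruence cannot separate them; distinguishing them requires the relabeling step of the higher levels.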
The quotient algebra is isomorphic to the one-dimensional algebra . A one-dimensional forest algebra is a pair such that for every homomorphism and every , there exists such that . (Such algebras are also called flat algebras in previous works on the topic: the homomorphic image of a forest is the image in a monoid of a “flattened”, “one-dimensional” version of the forest. The wording is also a reference to the notion that an algebra recognizes forest languages that are “more two-dimensional” than those recognized by .) In such an algebra, the homomorphic image of a forest is independent of its structure, that is, the algebra only considers the string of its node labels, given in a predetermined order (e.g. in preorder). Therefore, associates to a monoid homomorphism , such that . We denote by the (unique) one-dimensional algebra built from .
The congruences can be defined algebraically, as follows. Let and denote respectively the variety of monoids generated by and the variety of forest algebras generated by the algebras where . Then for every language , its syntactic forest algebra belongs to if, and only if refines its syntactic congruence, or equivalently, iff divides . Next, every algebra is a block product , with . We use to denote the variety generated by block products of the form with . Finally, . We will make abundant use of the following.

###### Proposition 2.2

The following statements on a finite-index congruence over  are equivalent: ; ; the congruence refines .

Let denote the variety of all forest languages definable with first-order logic formulas with the and quantifiers and the ‘ancestor’ positional predicate; for , let denote the variety defined in terms of the same sort of formulas, where now the , , modular quantifiers are also allowed. It was proved in [eswe10] that the syntactic preclones of the languages in generate the same variety as the iterated block products of preclones defined in terms of counting under threshold one (i.e. the monoid ); adding to the generating preclones those defined with counting under the congruence yields a characterization for . It can be verified that these equivalences translate into and .
Remark. Actually, , where is the Boolean OR monoid, and similarly , so that working in terms of nontrivial thresholds is not mandatory. However, doing so makes it possible to follow more closely the counting-under-threshold that seems to be inherent to the construction of proofs of non-membership in , and is reminiscent of the description of the first-order definable forest languages developed in [hath87, mora03]. Note that a characterization also exists, where is the cyclic group of order ; we put aside this special case in the current version of this paper.

## 3 Algebras for Multicontexts

Forest algebras were designed as tools to handle trees, forests, and contexts over . Dealing with multicontexts over as we do in this article demands that a suitable algebraic structure be developed to describe how a forest algebra works on them. A first approach consists, given a forest algebra , in regarding a multicontext as a specification for a multivariate mapping from to . This defines the algebra of mappings ; it is used to define the notion of pumping, which underlies the construction of certain Ehrenfeucht-Fraïssé games, and to associate to  a threshold and a period that are consistent with those used so far in the literature. A second approach consists in considering that a port label specifies which elements of are allowed as inputs at that port. This leads to the definition of the extended algebra , which we use to generalize once more the notions of threshold, period, and aperiodicity. Necessary conditions for first-order definability, which supersede some of the existing ones, are defined from the latter.

### 3.1 Multicontexts

We use both and to denote the pair consisting of a multicontext , where every interior node carries a label , and a port labeling . When this pair is equipped with a second port labeling , we denote the resulting tuple when is fixed and the emphasis is on as a whole, and when it is understood that is fixed and is one of several possible second port labelings. Next, instead of labeling a port directly with a horizontal monoid element, as it was done in the previous section, we take and in sets and , respectively, such that , and are pairwise disjoint; when dealing with specific algebras, leaf extensions of the appropriate homomorphisms are then defined on and . Note that we are ultimately interested in the recognition of languages over , so that and are artefacts used in this process and the ultimate results should not depend on them. The tuples over and , along with the contexts defined from them by replacing a leaf with a special port , constitute a forest algebra ; those over and constitute ; the reader can verify that both are free algebras.
Besides the insertion in a context, i.e. the monoid operations in , and , we define an operation that does multiple, simultaneous insertions in a multicontext from . Given sets of multicontexts and and , with for every , we denote by the set of all multicontexts that can be built by taking an element , inserting at each port a multicontext and replacing the label of with the neutral letter ; with this new label, has no effect on the image by a homomorphism of , while it remains available to be used in reasonings and proofs. No other label is modified, so that in particular if is a copy of and , then the counterpart of in satisfies .
Let and let . Then with , we use the notations and for and , respectively. Given a congruence , we say that is -stable when every pair of ports satisfies .
Next, let and let be the set of all ports with label in . With we define the set obtained by pumping times the set at , by: and . This definition of pumping is consistent with the definition of the vertical monoid of (where both and are singletons), with the “vertical confusion” defined in [bostwa09] (where and are singletons), and with the “vertical confusion on uniform multicontexts” also discussed in [bostwa09] (where and are singletons and the ports of are indistinguishable by any congruence).

### 3.2 The algebra of mappings

Let be a finite algebra and a surjective homomorphism. We look for a reasonable way of extending  to , besides the one that consists in defining a leaf extension of to . With this in mind, we define the algebra of mappings of , which we denote . To do so, we show how to translate a congruence in into a congruence in , and vice versa. Define a mapping from to , that turns into a forest by replacing all port labels with the neutral letter ; define from to in exactly the same way; both mappings constitute a surjective homomorphism from to . Thus, given a forest congruence over , a forest congruence is defined in a natural way over . In the other direction, let be finite and let

be simultaneously regarded as a vector

and as a mapping . Given , define by . From and , we build a forest over by replacing in every port label with . We build from in the same way; the -port of retains its label. We extend to by defining for every , so that in fact is one of the leaf extensions for that can be built on , and define a mapping by ; similarly, we define by . Next, we define and , the sets of all mappings and , respectively, and given and , the operations and vertical action , , and , so that the pair constitutes a forest algebra and defined by and is a homomorphism. (A notation that mentions , e.g. , would actually be more accurate, if more cumbersome.) Let be the nuclear congruence of : we define from it an equivalence between nodes, also denoted . Let where is a set of multicontexts closed under :

iff and and ;

nodes equivalent under this relation “cannot be told apart by ”. We also write and in order to specify where the nodes are located. Given and we define the mapping

$$\gamma[m; h_1,\dots,h_{N-1}] : K \to K \quad\text{by}\quad \gamma[m; h_1,\dots,h_{N-1}](k) = \gamma[m](h_1,\dots,h_{N-1},k).$$

Since is closed under , is the same for every and we can use the notation . Then with , we observe

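$$\gamma[M \cdot_{\underline{Z}} M'; h_1,\dots,h_{N-1}] = \gamma[M; h_1,\dots,h_{N-1}] \circ \gamma[M'; h_1,\dots,h_{N-1}]$$

and

$$\gamma[M \cdot_{\underline{Z}} M'](h_1,\dots,h_N) = \gamma[M]\bigl(h_1,\dots,h_{N-1}, \gamma[M'](h_1,\dots,h_N)\bigr).$$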

Therefore, every operation satisfies the compatibility property for [busa00, Definition 5.1].

###### Proposition 3.1

Let and be built from and . Then the nuclear congruence of is refined by the congruence built recursively over , as in Section 2.3. Hence, .

Proof. By induction on . Recall that a given is regarded both as a mapping and as a vector . For the case, we associate to every a vector with components in and labels in , where , , and , , are respectively the number of nodes with label and the number of ports with . The algebra is isomorphic to , so that, with some abuse of notation, we can write , and given a mapping , the image of can be represented as a vector . Within , there is an equivalence class under for every possible value of , i.e. every vector in . Then given ,

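$$\alpha^{1}_{\tau,\pi}[m](\xi) = \alpha^{1}_{\tau,\pi}(m) + \sum_{x \in \mathrm{ports}(m)} \xi(\nu(x)) = \sum_{a \in A} p^{1}(m)_a + \sum_{b \in B} \xi^{1}(b) \cdot p^{1}(m)_b.$$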

From there, if , then . With , the induction hypothesis states that if , then for every mapping , the leaf completions of and through satisfy . Assume that . Two nodes or ports of and of receive the same label in the versions of and relabeled according to iff , and . By the induction hypothesis, the last two items imply, for every :

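$$\breve{\xi}(\Delta(m,x)) \approx^{n-1}_{\tau,\pi} \breve{\xi}(\Delta(m',x')) \quad\text{and}\quad \breve{\xi}(\nabla(m,x)) \approx^{n-1}_{\tau,\pi} \breve{\xi}(\nabla(m',x')),$$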

which means that and receive the same label in the versions of and relabeled according to , that is, .

The algebras and are not isomorphic, however. To see this, let and , so that , and , so that is isomorphic to , where , and finally . With and , we have , while is the constant function that maps onto .

### 3.3 Equivalence under pumping

We use the algebra of mappings to define a “threshold , period equivalence under pumping” congruence within . First, let denote the relation where two forests are equivalent iff they are the same up to horizontal permutations within a sum (we might as well use instead of ). Then we consider a special case of a multicontext where any two ports satisfy , that is, their contexts within are indistinguishable. Then the -stable sets of ports are exactly the sets , ; we say that is suitable for pumping. Pumping the singleton along a -stable set of ports , we obtain for each a singleton . (Note that this formalism also covers the case where pumping is done “horizontally”, i.e. where we are dealing with a multicontext of the form and where .) We now define , for every . First, coincides with . Next, the forest congruence is generated by the pairs and the corresponding context congruence by the pairs , where , and . Then recursively for , given a set of multicontexts closed under and a set that is -stable, is the congruence generated by the pairs and where , , , and . We denote by the (infinite) quotient algebra and by the corresponding surjective homomorphism.

###### Proposition 3.2

Every congruence of finite index over  is refined by a congruence .

Proof. Let be finite. Let and let be suitable for pumping. Assume that ; we pump the singleton along . Given we define and the mapping

$$\zeta : H \to H, \qquad \zeta(k) = \alpha[m](h_1,\dots,h_{N-1},k).$$

Observe that and in general,

$$\alpha\bigl[m^{(\theta,Z)}\bigr](h_1,\dots,h_N) = \zeta^{\theta-1}(g).$$

The mapping generates a subsemigroup of ; from the threshold and period of we obtain integers and such that, for every combination of and , we have