In this paper, we explore the problem of finding a first-order formula that describes a given sample of classified strings. This problem is meaningful because strings may be used to model sequences of symbolic data such as biological sequences. For instance, in Table 1, we present a sample of classified strings.
The sample in Table 1 represents biological sequences which have been associated with a group of diseases called amyloidosis . The first-order sentence below represents that occurs in a string, and it describes the sample. Variables range over positions in strings, is true if the symbol occurs in position , and represents the successor relation over positions.
An algorithm to deal with the problem of finding a formula of minimal quantifier rank consistent with a given sample of structures over an arbitrary vocabulary is introduced in . As this algorithm works for arbitrary finite relational structures, it runs in exponential time. This algorithm is applied in a general system for learning formulas defining board game rules. These results are also used in finding reductions and polynomial-time programs [12, 13]. The work in  investigates a variation of the problem introduced in  when the class of structures is fixed. This study in  considers monadic structures, equivalence structures, and disjoint unions of linear orders.
In this paper, we study the problem introduced in  when the sample consists of strings represented by finite structures with a successor relation and a finite number of pairwise disjoint unary predicates. We call such a structure a successor string structure. A sample consists of two disjoint, finite sets of successor string structures. Given a sample , the problem is to find a first-order sentence of minimal quantifier rank that is consistent with , i.e., it holds in all structures in and does not hold in any structure in . The size of the sample is the sum of the lengths of all strings in the sample. We intend to solve this problem in polynomial time in the size of .
Ehrenfeucht–Fraïssé games (EF games)  are a fundamental technique in finite model theory [4, 7] for proving the inexpressibility of certain properties in first-order logic. For instance, first-order logic cannot express that a finite structure has even cardinality. The Ehrenfeucht–Fraïssé game is played on two structures by two players, the Spoiler and the Duplicator. If the Spoiler has a winning strategy for $r$ rounds of such a game, then the structures can be distinguished by a first-order sentence $\varphi$ whose quantifier rank is at most $r$, i.e., $\varphi$ holds in exactly one of these structures.
Besides providing a tool to measure the expressive power of a logic, Ehrenfeucht–Fraïssé games allow one to investigate the similarity between structures. In , explicit conditions are provided for the characterization of winning strategies for the Duplicator on successor string structures. Using these conditions, the minimum number of rounds such that the Spoiler has a winning strategy in a game between two such structures can be computed in polynomial time in the size of the structures. This allows one to define a notion of similarity between successor string structures using Ehrenfeucht–Fraïssé games.
An essential part of the algorithm in  is the computation of -Hintikka formulas from structures. An -Hintikka formula is a formula obtained from a structure and a positive integer that describes the properties of on the Ehrenfeucht–Fraïssé game with rounds . An -Hintikka formula has size exponential in the size of and holds exactly on all structures such that the Duplicator has a winning strategy for the Ehrenfeucht–Fraïssé game with rounds on and . Besides, Hintikka formulas are representative because any first-order formula is equivalent to a disjunction of Hintikka formulas.
We use results on Ehrenfeucht–Fraïssé games over successor string structures  to design an algorithm that finds, in polynomial time, a sentence consistent with the sample. Since the size of a Hintikka formula is exponential in the size of a given structure, our algorithm does not use Hintikka formulas. Instead, we define what we call distinguishability formulas. They are defined for two successor string structures , and a natural number based on conditions characterizing the winning strategies for the Spoiler on successor string structures . We show that distinguishability formulas hold on , do not hold on , and have quantifier rank at most . This result is crucial both for the definition of our algorithm and for guaranteeing its correctness. Our algorithm returns a disjunction of conjunctions of distinguishability formulas. We also show that any first-order formula over successor string structures is equivalent to a boolean combination of distinguishability formulas. This result suggests that our approach has the potential to find any first-order sentence.
Our framework is close to grammatical inference. Research in this area investigates the problem of finding a language model of minimal size consistent with a given sample of strings . A language model can be a deterministic finite automaton (DFA) or a context-free grammar, for instance. Grammatical inference has applications in many areas because strings may be used to model text data, traces of program executions, biological sequences, and sequences of symbolic data in general.
A recent model-theoretic approach to grammatical inference is introduced in . This approach also uses successor string structures to represent strings and first-order sentences to represent formal languages. Thus, our approach may also be seen as a model-theoretic framework for grammatical inference. The first main difference is that we work with full first-order logic, while the approach in  uses a fragment called CNPL. Formulas of CNPL have the form such that each is a first-order sentence which defines exactly all strings such that is a substring of . Also, CNPL is less expressive than first-order logic. Second, given , the goal of the framework in  is to find a CNPL formula such that and is the length of . Our goal, in contrast, is to find a first-order sentence of minimal quantifier rank.
It is well known that a language is definable in first-order logic over successor string structures if and only if it is a locally threshold testable (LTT) language . A language is LTT if membership of a string can be tested by inspecting its prefixes, suffixes, and infixes up to some length, and counting infixes up to some threshold. The class of LTT languages is a subregular class, i.e., a subclass of the regular languages . A grammatical inference algorithm that returns a DFA may return an automaton which recognizes a language not in LTT. Therefore, our results can be useful when one desires to find a model of an LTT language from a sample of strings. We believe that this is the first work on finding a language model of LTT languages from positive and negative strings.
. In this framework, a sample consists of classified elements from only one structure. The problem is to find a hypothesis consistent with the classified elements where this hypothesis is a formula from some logic. Recall that, in our framework, samples consist of many classified structures. Another logical framework for a similar problem is Inductive Logic Programming (ILP) [16, 3]. ILP uses logic programming as a uniform representation for the sample and hypotheses. As far as we know, our work has no direct relationship with these frameworks.
This paper is organized as follows. In Section 2, we give the necessary definitions of formal language theory and finite model theory used in this paper. Section 2 also presents an EF game characterization on strings, which we translate into first-order sentences in Section 3. In Section 3, we also introduce the concept of distinguishability formulas and establish some of their useful properties. In Section 4, we introduce our algorithm, give an example of how the algorithm works, and show its correctness. Furthermore, in that section, we briefly discuss how to find a formula with the minimum number of conjunctions. We conclude in Section 5.
2 Formal Languages and EF Games on Strings
We consider strings over an alphabet . The set of all such finite strings is denoted by , and the empty string by . If is a string, then is the length of . Let denote the concatenation of strings and . For all , , , , if , then is a substring of . Moreover, if (resp. ) we say that is a prefix (resp. suffix) of . We denote the prefix (resp. suffix) of length of by (resp. ). Let and be positions in a string. The distance between and , denoted by , is . A formal language is a subset of . A language is locally threshold testable (LTT) if it is a boolean combination of languages of the form is prefix of , for some , is suffix of , for some , and has as infix at least times , for some and . Therefore, membership of a string can be tested by inspecting its prefixes, suffixes and infixes up to some length, and counting infixes up to some threshold. We assume some familiarity with formal languages. See  for details.
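The three kinds of tests that make up an LTT language can be sketched in a few lines. This is only an illustration: the function names and the example language are hypothetical and do not come from the paper, and occurrences of an infix are assumed to be counted with overlaps.

```python
def has_prefix(w: str, u: str) -> bool:
    """True if u is a prefix of w."""
    return w.startswith(u)


def has_suffix(w: str, u: str) -> bool:
    """True if u is a suffix of w."""
    return w.endswith(u)


def count_infix(w: str, u: str, threshold: int) -> int:
    """Count occurrences of u as an infix of w, stopping at the threshold.

    Overlapping occurrences are counted (an assumption of this sketch).
    """
    count = 0
    for i in range(len(w) - len(u) + 1):
        if count >= threshold:
            break
        if w[i:i + len(u)] == u:
            count += 1
    return count


def has_infix_at_least(w: str, u: str, d: int) -> bool:
    """True if w has u as an infix at least d times."""
    return count_infix(w, u, d) >= d


def in_example_language(w: str) -> bool:
    """A hypothetical LTT language: a boolean combination of the three tests."""
    return (has_prefix(w, "a")
            and has_infix_at_least(w, "ab", 2)
            and not has_suffix(w, "b"))
```

Because counting stops at the threshold, the test never needs to know the exact number of occurrences beyond that bound, which is what "counting up to some threshold" means here.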
We view a string as a logical structure over the vocabulary with domain , that is, the elements of are positions of . The predicate is the successor relation and each is a unary predicate for positions labeled with . The constants and are interpreted as the positions and , respectively. We call these structures successor string structures. We assume some familiarity with first-order logic (), and we use this logic over successor string structures. For details on first-order logic see [4, 5]. The size of a first-order formula is the number of symbols occurring in . By the quantifier rank of a formula, we mean the depth of nesting of its quantifiers as in the following.
Definition 1 (Quantifier Rank).
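Let $\varphi$ be a first-order formula. The quantifier rank of $\varphi$, written $qr(\varphi)$, is defined as
$$qr(\varphi) = \begin{cases} 0 & \text{if } \varphi \text{ is atomic}, \\ qr(\psi) & \text{if } \varphi = \neg\psi, \\ \max(qr(\psi_1), qr(\psi_2)) & \text{if } \varphi = \psi_1 \,\square\, \psi_2 \text{ with } \square \in \{\wedge, \vee, \rightarrow\}, \\ qr(\psi) + 1 & \text{if } \varphi = Qx\,\psi \text{ with } Q \in \{\exists, \forall\}. \end{cases}$$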
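The standard inductive computation of quantifier rank can be sketched as follows. The encoding of formulas as nested tuples, with tags such as `"exists"` and `"and"`, is an assumption made for this illustration and is not notation from the paper.

```python
# Formulas are nested tuples (a hypothetical encoding for this sketch):
#   ("S", t1, t2), ("P", symbol, t), ("eq", t1, t2)   atomic formulas
#   ("not", f), ("and", f, g), ("or", f, g)           connectives
#   ("exists", var, f), ("forall", var, f)            quantifiers
ATOMIC = {"S", "P", "eq"}


def quantifier_rank(phi) -> int:
    """Depth of nesting of quantifiers in phi."""
    op = phi[0]
    if op in ATOMIC:
        return 0
    if op == "not":
        return quantifier_rank(phi[1])
    if op in ("and", "or"):
        return max(quantifier_rank(phi[1]), quantifier_rank(phi[2]))
    if op in ("exists", "forall"):
        return 1 + quantifier_rank(phi[2])
    raise ValueError(f"unknown formula tag: {op!r}")


# "a is immediately followed by b somewhere": two nested quantifiers.
AB_OCCURS = ("exists", "x", ("exists", "y",
             ("and", ("S", "x", "y"),
              ("and", ("P", "a", "x"), ("P", "b", "y")))))

# Two quantifiers side by side do not nest, so the rank stays at 1.
A_AND_B = ("and", ("exists", "x", ("P", "a", "x")),
                  ("exists", "y", ("P", "b", "y")))
```

The second example shows why quantifier rank differs from the number of quantifiers: only nesting increases it.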
Given a first-order sentence over successor string structures, the formal language defined by is simply . In general, we do not distinguish between successor string structures and strings. As an example, if , then . LTT languages can be defined in terms of first-order logic. A language is definable by a sentence of over successor string structures if and only if it is LTT .
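To make the link between sentences and languages concrete, here is a minimal sketch of a brute-force evaluator for first-order formulas over successor string structures. The tuple encoding of formulas, the function names, and the example sentence are assumptions of this sketch. Positions are numbered from 1, and the constants `min` and `max` denote the first and last positions.

```python
from itertools import product


def term_value(term, w, env):
    """Interpret a term: the constants min and max, or a variable."""
    if term == "min":
        return 1
    if term == "max":
        return len(w)
    return env[term]


def holds(phi, w, env=None):
    """Decide whether phi holds in the non-empty string w under env."""
    env = {} if env is None else env
    op = phi[0]
    if op == "S":   # successor: second term is the position right after the first
        return term_value(phi[2], w, env) == term_value(phi[1], w, env) + 1
    if op == "P":   # the position is labeled with the given symbol
        return w[term_value(phi[2], w, env) - 1] == phi[1]
    if op == "eq":
        return term_value(phi[1], w, env) == term_value(phi[2], w, env)
    if op == "not":
        return not holds(phi[1], w, env)
    if op == "and":
        return holds(phi[1], w, env) and holds(phi[2], w, env)
    if op == "or":
        return holds(phi[1], w, env) or holds(phi[2], w, env)
    if op in ("exists", "forall"):
        results = (holds(phi[2], w, {**env, phi[1]: i})
                   for i in range(1, len(w) + 1))
        return any(results) if op == "exists" else all(results)
    raise ValueError(f"unknown formula tag: {op!r}")


def language_up_to(phi, alphabet, max_len):
    """All non-empty strings of length at most max_len that satisfy phi."""
    return ["".join(t)
            for n in range(1, max_len + 1)
            for t in product(alphabet, repeat=n)
            if holds(phi, "".join(t))]


# Hypothetical example sentence: "ab occurs as a substring".
AB_OCCURS = ("exists", "x", ("exists", "y",
             ("and", ("S", "x", "y"),
              ("and", ("P", "a", "x"), ("P", "b", "y")))))
```

Evaluation time grows exponentially with the quantifier rank in this naive form, so the sketch is meant for small examples only.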
Now, we can formally define the problem we are interested in. A sample $S = (P, N)$ consists of two disjoint, finite sets $P$ and $N$ of classified strings over an alphabet $\Sigma$. Intuitively, $P$ contains positively classified strings, and $N$ contains negatively classified strings. The size of a sample is the sum of the lengths of all strings it includes. We use $|S|$ to denote the size of the sample $S$. A sentence $\varphi$ is consistent with a sample $S$ if $P \subseteq L(\varphi)$ and $N \cap L(\varphi) = \emptyset$. Therefore, a sentence is consistent with a sample if it holds in all strings in $P$ and does not hold in any string in $N$. Given a sample $S$, the problem consists of finding a first-order sentence $\varphi$ of minimum quantifier rank such that $\varphi$ is consistent with $S$.
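The notions of sample size and consistency can be sketched directly. In this illustration a sentence is represented by a Python predicate deciding whether a string belongs to the language it defines. The sample and the predicates are made up for the example.

```python
def sample_size(positives, negatives) -> int:
    """Sum of the lengths of all strings in the sample."""
    return sum(len(w) for w in positives) + sum(len(w) for w in negatives)


def is_consistent(sentence, positives, negatives) -> bool:
    """True if the sentence holds in every positive string and in no negative one."""
    return (all(sentence(w) for w in positives)
            and not any(sentence(w) for w in negatives))


# Hypothetical sample and candidate sentences.
positives = {"aab", "bab"}
negatives = {"ba", "bbb"}


def contains_ab(w: str) -> bool:
    """Stands for a sentence stating that ab occurs in the string."""
    return "ab" in w


def contains_a(w: str) -> bool:
    """Stands for a sentence stating that a occurs in the string."""
    return "a" in w
```

Here `contains_ab` is consistent with the sample, while `contains_a` is not, because it also holds in the negative string `ba`.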
It is well known that every finite structure can be characterized in first-order logic up to isomorphism, i.e., for every finite structure , there is a first-order sentence such that for all structures we have iff and are isomorphic. Since samples are finite sets of finite structures, one can easily build in polynomial-time a first-order sentence consistent with a given sample. For example, let and . The sentence is consistent with the sample. Unfortunately, the quantifier rank of is the number of elements in the domain of plus one. Then, is also consistent with the sample and . Therefore, is not a solution to the problem.
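To see where the extra quantifier comes from, consider the string $ab$ as a small illustration. It is chosen here for brevity and is not taken from the sample above. Assuming equality is available, the following sentence holds in exactly this string among successor string structures, and the universal quantifier that closes the domain is nested inside the two existential ones.

```latex
\varphi_{ab} \;=\; \exists x_1\, \exists x_2\, \bigl( S(x_1, x_2) \wedge P_a(x_1) \wedge P_b(x_2)
  \wedge \forall y\, ( y = x_1 \vee y = x_2 ) \bigr),
\qquad qr(\varphi_{ab}) = 2 + 1 = 3 .
```

A string of length $n$ described in this way needs $n$ existential quantifiers plus the closing universal one, which gives quantifier rank $n + 1$.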
Now, we turn to Ehrenfeucht–Fraïssé games and their importance for solving the problem we are considering. Let be an integer such that , and two successor string structures. The Ehrenfeucht–Fraïssé game is played by two players called the Spoiler and the Duplicator. Each play of the game has rounds and, in each round, the Spoiler plays first and picks an element from the domain of , or from the domain of . Then, the Duplicator responds by picking an element from the domain of the other structure. Let and be the two elements picked by the Spoiler and the Duplicator in the th round. The Duplicator wins the play if the mapping is an isomorphism between the substructures induced by and , respectively. Otherwise, the Spoiler wins this play. We say that a player has a winning strategy in if it is possible for her to win each play whatever choices are made by the opponent. In this work, we always assume that is different from . Note that if , then the Spoiler has a winning strategy. Therefore, we assume that is bounded by . Now, we define formulas describing the properties of a structure in EF games.
Definition 2 (Hintikka Formulas).
Let be a structure, , and a tuple of variables,
A Hintikka formula describes the isomorphism type of the substructure generated by in . We write whenever . Given a string and a positive integer , the size of the -Hintikka formula is . Therefore, since is bounded by , the size of is exponential in the size of . The following theorems are important to prove our main results. They are presented in  (Theorem 2.2.8 and Theorem 2.2.11).
Theorem 1 (Ehrenfeucht’s Theorem).
Given and , and , the following are equivalent:
the Duplicator has a winning strategy in .
If is a sentence of quantifier rank at most , then iff .
Theorem 2.
Let be a sentence of quantifier rank at most . Then, there exist structures , …, such that
We use Theorem 2 in order to show that any first-order formula over successor string structures is equivalent to a boolean combination of distinguishability formulas. EF games are essential in our framework because if the Spoiler has a winning strategy in a game on strings and with rounds, then there exists a first-order sentence of quantifier rank at most that holds in and does not hold in . Also, in this case, the sentence is an example of such a sentence. Unfortunately, over arbitrary vocabularies, the problem of determining whether the Spoiler has a winning strategy is -complete .
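For small strings, the game described in this section can be decided by exhaustive search, which makes the definitions concrete. The sketch below is a naive solver written for illustration only: it takes time exponential in the number of rounds and is not the polynomial-time characterization discussed next. The function names are hypothetical, positions are numbered from 1, and the constants for the first and last positions are treated as part of every partial map.

```python
def is_partial_isomorphism(u, v, xs, ys) -> bool:
    """Check that min, max, and the picked positions match up in u and v."""
    a = (1, len(u)) + tuple(xs)
    b = (1, len(v)) + tuple(ys)
    for i in range(len(a)):
        if u[a[i] - 1] != v[b[i] - 1]:                    # same label
            return False
        for j in range(len(a)):
            if (a[i] == a[j]) != (b[i] == b[j]):          # equality preserved
                return False
            if (a[j] == a[i] + 1) != (b[j] == b[i] + 1):  # successor preserved
                return False
    return True


def duplicator_wins(u, v, rounds, xs=(), ys=()) -> bool:
    """True if the Duplicator has a winning strategy for the remaining rounds."""
    if not u or not v:
        raise ValueError("strings must be non-empty")
    if not is_partial_isomorphism(u, v, xs, ys):
        return False
    if rounds == 0:
        return True
    pos_u = range(1, len(u) + 1)
    pos_v = range(1, len(v) + 1)
    # The Spoiler picks in u, and the Duplicator must have an answer in v.
    for x in pos_u:
        if not any(duplicator_wins(u, v, rounds - 1, xs + (x,), ys + (y,))
                   for y in pos_v):
            return False
    # The Spoiler picks in v, and the Duplicator must have an answer in u.
    for y in pos_v:
        if not any(duplicator_wins(u, v, rounds - 1, xs + (x,), ys + (y,))
                   for x in pos_u):
            return False
    return True


def spoiler_min_rounds(u, v, max_rounds):
    """Least number of rounds in which the Spoiler wins, or None up to the bound."""
    for r in range(max_rounds + 1):
        if not duplicator_wins(u, v, r):
            return r
    return None
```

For instance, on `aba` and `abba` the Spoiler wins in one round by picking the middle `b` of `aba`, which is adjacent to both ends, while no `b` in `abba` has that property.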
However, it is possible to do better in the particular case of EF games on successor string structures. For details see . First, we need the following definitions. Let . A partition of is a collection of subsets of such that each element of is included in exactly one subset. An -segmentation of is a partition of with the minimum number of subsets such that for all in the same subset, and if are in the same subset and , then . Each subset in the partition is called a segment.
In the following, we consider substrings over such that for some . Let be a string such that , for . An occurrence of is centered on a position in a string if . An occurrence of centered on a position in is free if and . The set of free occurrences of in is . The free multiplicity of in , denoted by , is the number of free occurrences of in , i.e., . The free scattering of in , denoted by , is the number of segments in a -segmentation of .
Let and . Note that . The occurrence of centered on position in is not free because . However, the occurrence of centered on position in is free. The set of free occurrences of in is . Therefore, . A -segmentation of is . Then, .
Now, we have a result of EF games on successor string structures.
Theorem 3.
Let be a natural number, and be strings. The Duplicator has a winning strategy in if and only if the following conditions hold:
or and ;
and for all such that and or .
Besides the importance of EF games on strings to our framework, we also use the above result to define the distinguishability formulas. These formulas are defined based on the conditions characterizing a winning strategy for the Spoiler on successor string structures. In , this result is also used to define a notion of similarity between successor string structures using Ehrenfeucht–Fraïssé games. The EF-similarity between strings and , written , is the minimum number of rounds such that the Spoiler has a winning strategy in the game . Then, the EF-similarity between two strings can be computed in polynomial time in the size of the strings in the following way.
Given two strings and , can be computed in , that is, it can be computed in polynomial time . Our algorithm’s first step is to compute the sufficient quantifier rank to distinguish between any two strings and . Then, the fact that can be computed in polynomial time is important to show that our algorithm runs in polynomial time as well.
It is easy to build a first-order sentence consisting of a disjunction of Hintikka formulas of minimal quantifier rank that is consistent with a given sample. For example, let , , , and . The sentence is a first-order sentence of minimal quantifier rank that is consistent with . Unfortunately, the size of is exponential in the size of . Therefore, cannot be built in polynomial time in the size of the sample. This motivates the introduction of distinguishability formulas in Section 3.
3 Distinguishability Formulas
In this section, we define distinguishability formulas for strings , and a natural number . Distinguishability formulas hold on , do not hold on , and have quantifier rank at most . The first step is to show that the conditions of Theorem 3 can be expressed by first-order formulas. These formulas are defined recursively in order to reduce the quantifier rank. The recursive definitions could all be replaced by direct definitions with higher quantifier ranks but, in that case, we cannot guarantee that the quantifier rank is adequate. These formulas also aid the exposition and improve the readability of the sentences returned by our algorithm.
We also set , , , and . Clearly, and for . Besides, the size of is . For example, for and strings and such that and , we have that , , and . Then, and . Therefore, the Spoiler has a winning strategy for .
Now, we turn to the cases in which substrings are important. These cases are conditions 2 and 3 from Theorem 3. Formulas hold in a string when the string between and is . Formulas and express that a string occurs immediately on the right and immediately on the left of a term , respectively.
With respect to the quantifier rank, we have . Furthermore, the size of these formulas is . Now, we define sentences to handle the prefix and suffix of strings. These sentences express that the prefix of length is and the suffix of length is , respectively.
We also set abbreviations and . Therefore, and iff , where . Analogously for . Also, the size of and is . We use these formulas to express condition 2 of Theorem 3. To see why, let , and . Thus, and . Then, and, from condition 2 of Theorem 3, it follows that the Spoiler has a winning strategy in .
Now, we need sentences regarding free multiplicity and free scattering. Let be a string such that each , and for as in condition 3 from Theorem 3. Now, we set the formula describing that a string occurs centered on position . Then, we give an example of a formula .
Let . Then,
Note that and the size of is . Now, we can use formulas to define expressing that has at least free occurrences. To this end, we need pairwise distinct variables, and each variable must be at a proper distance from and .
Now, we need to deal with formulas expressing that the scattering of is at least . First, in the following, we set an auxiliary formula in order to make the presentation simpler. The formula below indicates that occurs centered on a position on the left of and at least distant from . This formula is important in ensuring a proper distance from other occurrences of , that is, greater than . Furthermore, the distance between and or must be greater than in order to occur free.
With respect to the quantifier rank, we have . Now, we can define the sentence . After that, we give an example of and .
Let and . Thus,
We also define the following abbreviations and . Then, and