Deterministic Regular Expressions With Back-References

02/05/2018
by   Dominik D. Freydenberger, et al.

Most modern libraries for regular expression matching allow back-references (i.e., repetition operators) that substantially increase expressive power, but also lead to intractability. In order to find a better balance between expressiveness and tractability, we combine these with the notion of determinism for regular expressions used in XML DTDs and XML Schema. This includes the definition of a suitable automaton model, and a generalization of the Glushkov construction. We demonstrate that, compared to their non-deterministic superclass, these deterministic regular expressions with back-references have desirable algorithmic properties (i.e., efficiently solvable membership problem and some decidable problems in static analysis), while, at the same time, their expressive power exceeds that of deterministic regular expressions without back-references.


1 Introduction

Regular expressions were introduced in 1956 by Kleene [34] and quickly found wide use in both theoretical and applied computer science, including applications in bioinformatics [41], programming languages [49], model checking [48], and XML schema languages [47]. While the theoretical interpretation of regular expressions remains mostly unchanged (as expressions that describe exactly the class of regular languages), modern applications use variants that vary greatly in expressive power and algorithmic properties. This paper tries to find common ground between two of these variants with opposing approaches to the balance between expressive power and tractability.

Regex

The first variant that we consider is regex: regular expressions that are extended with a back-reference operator. This operator is supported in almost all modern programming languages (e. g. Java, PERL, and .NET). For example, the regex $\langle x\colon (a \vee b)^{*}\rangle \cdot \&x$ defines $\{ww \mid w \in \{a,b\}^{*}\}$, as $(a \vee b)^{*}$ can create a $w \in \{a,b\}^{*}$, which is then stored in the variable $x$ and repeated with the reference $\&x$. Hence, back-references allow the definition of non-regular languages, but with the side effect that the membership problem is NP-complete (cf. Aho [2]).
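To make the effect of a back-reference concrete, the following minimal sketch uses Python's `re` module, where a named group plays the role of a variable binding and `(?P=x)` plays the role of a reference to it. The pattern and the helper name `is_square` are illustrative assumptions and do not appear in the paper.

```python
import re

# The named group x records a word over {a, b};
# (?P=x) then has to repeat exactly the recorded word.
SQUARE = re.compile(r"(?P<x>[ab]*)(?P=x)")

def is_square(word: str) -> bool:
    """Return True if word has the form ww for some w over {a, b}."""
    return SQUARE.fullmatch(word) is not None

if __name__ == "__main__":
    for candidate in ["", "abab", "aba", "abba"]:
        print(repr(candidate), is_square(candidate))
```

The language of all such squares is not regular, so no pattern without back-references can replace `SQUARE`.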

Regex were first examined from a theoretical point of view by Aho [2], but without fully defining the semantics. There were various proposals for semantics, of which we mention the first by Câmpeanu, Salomaa, Yu [10], and the recent one by Schmid [46], which is the basis for this paper. Apart from defining the semantics, there was work on the expressive power [10, 11, 25], the static analysis [11, 23, 24], and the tractability of the membership problem (investigated in terms of a strongly restricted subclass of regex) [21, 22]. They have also been compared to related models in database theory, e. g. graph databases [4, 26] and information extraction [20, 24].

Deterministic Regular Expressions

The second variant, deterministic regular expressions (also known as 1-unambiguous regular expressions), uses an opposite approach, and achieves a more efficient membership problem than regular expressions by defining only a strict subclass of the regular languages.

Intuitively, a regular expression is deterministic if, when matching a word from left to right with no lookahead, it is always clear where in the expression the next symbol must be matched. This property has a characterization via the Glushkov construction, which converts every regular expression $\alpha$ into a (potentially non-deterministic) finite automaton $M(\alpha)$ by treating each terminal position in $\alpha$ as a state. Then $\alpha$ is deterministic if $M(\alpha)$ is deterministic. As a consequence, the membership problem for deterministic regular expressions can be solved more efficiently than for regular expressions in general (more details can be found in [31]). Hence, in spite of their limited expressive power, deterministic regular expressions are used in actual applications: Originally defined for the ISO standard for SGML (see Brüggemann-Klein and Wood [9]), they are a central part of the W3C recommendations on XML DTDs [7] and XML Schema [27] (see Murata et al. [42]). Following the original paper [9], deterministic regular expressions have been studied extensively. Aspects include computing the Glushkov automaton and deciding the membership problem (e. g. [8, 31, 44]), static analysis (cf. [40]), deciding whether a regular language is deterministic (e. g. [16, 31, 39]), closure properties and descriptional complexity [37], and learning (e. g. [5]). One noteworthy extension is counter operators (e. g. [29, 31, 36]), which we briefly address in Section 8.
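The determinism test via the Glushkov construction can be sketched for classical regular expressions (without back-references) as follows. This is a minimal illustration, not the construction for regex developed in this paper; the tuple encoding of expressions and the names `glushkov` and `is_deterministic` are assumptions made for this example.

```python
from itertools import count

def glushkov(expr):
    """Return (first, follow, symbol_of) of the position automaton of expr.

    Expressions are nested tuples (a hypothetical encoding):
    ("sym", a), ("cat", r, s), ("alt", r, s), ("star", r).
    """
    symbol_of, follow = {}, {}
    fresh = count(1)

    def visit(e):
        # Returns (nullable, first positions, last positions) of e.
        kind = e[0]
        if kind == "sym":
            p = next(fresh)  # every terminal occurrence becomes a state
            symbol_of[p], follow[p] = e[1], set()
            return False, {p}, {p}
        if kind == "alt":
            n1, f1, l1 = visit(e[1])
            n2, f2, l2 = visit(e[2])
            return n1 or n2, f1 | f2, l1 | l2
        if kind == "cat":
            n1, f1, l1 = visit(e[1])
            n2, f2, l2 = visit(e[2])
            for p in l1:
                follow[p] |= f2
            first = (f1 | f2) if n1 else f1
            last = (l1 | l2) if n2 else l2
            return n1 and n2, first, last
        if kind == "star":
            _, f, l = visit(e[1])
            for p in l:
                follow[p] |= f
            return True, f, l
        raise ValueError(f"unknown node {kind!r}")

    _, first, _ = visit(expr)
    return first, follow, symbol_of

def is_deterministic(expr):
    """True iff no state has two successors labelled with the same symbol."""
    first, follow, symbol_of = glushkov(expr)
    for successors in [first, *follow.values()]:
        symbols = [symbol_of[p] for p in successors]
        if len(symbols) != len(set(symbols)):
            return False
    return True

a, b = ("sym", "a"), ("sym", "b")
# (a|b)*a: after reading an a, it is unclear which occurrence of a was matched.
nondeterministic = ("cat", ("star", ("alt", a, b)), a)
# b*a(b*a)*: describes the same language with a deterministic expression.
deterministic = ("cat", ("cat", ("star", b), a), ("star", ("cat", ("star", b), a)))
```

Both example expressions describe the words over {a, b} that end in a, which shows that determinism is a property of the expression and not only of the language.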

Deterministic Regex

The goal of this paper is to find common ground between these two variants by combining the capability of back-references with the concept of determinism in regular expressions. Generally, our definition of determinism for regex mimics that for classical regular expressions, i. e., we define a Glushkov-like conversion from regex into suitable automata and then say that a regex is deterministic if and only if its Glushkov-automaton is. The class of deterministic regex defined in this way is denoted by DRX, and $\mathcal{L}(\mathsf{DRX})$ refers to the corresponding language class.

The underlying automaton model for this approach is a slight modification of the memory automata (MFA) proposed by Schmid [46] as a characterisation for the class of regex-languages. More precisely, we introduce memory automata with trap-state (TMFA), for which the deterministic variant, the DTMFA, is better suited for complementation than the deterministic MFA.

As indicated by the title of this subsection, it is our hope to preserve, on the one hand, the increased expressive power provided by back-references, and, on the other hand, the tractability that usually comes with determinism. While it is not surprising that DRX do not achieve these goals to the full extent, a comprehensive study reveals that their expressive power clearly exceeds that of (deterministic) regular expressions, while, at the same time, they are much more tractable than the full class of regex.

We shall now outline our main results according to these two aspects, and we start with the algorithmic features of DRX:

  1. We can decide in time , whether a regex with variables and over alphabet is deterministic (if so, we can construct its Glushkov-automaton in the same time).

  2. We can decide in time , whether can be generated by an with variables and occurrences of terminal symbols or variable references.

  3. The intersection-emptiness problem for DRX is undecidable, but in PSPACE for variable-star-free DRX (as are the inclusion and equivalence problems). Here, a regex is variable-star-free if each of its sub-regexes under a Kleene-star contains no variable operations (see Section 6).

Results 1 and 2 are a consequence of the Glushkov-construction for regex. In view of the NP-hardness of the membership problem for the full class of regex, result 2 demonstrates that the membership problem for DRX can be solved almost as efficiently as for deterministic regular expressions (for efficient algorithms for the latter, see [8, 44] and [31]). The positive results of 3 are based on encoding the intersection-emptiness problem in the existential theory of concatenation with regular constraints. With respect to 3, we observe that it is in fact determinism that makes the inclusion and equivalence problems for variable-star-free DRX decidable, since these problems are known to be undecidable for non-deterministic variable-star-free regex (see [23]). Moreover, result 3 also yields a minimization algorithm for variable-star-free DRX (enumerate all smaller candidates and check equivalence).

Throughout the paper, there are numerous examples that demonstrate the expressive power of DRX. We also provide a tool for proving non-expressibility, which does not use a pumping argument. In fact, it can be shown that, despite the automata-theoretic characterisation of deterministic regex, $\mathcal{L}(\mathsf{DRX})$ contains infinite languages that cannot be pumped (in the sense in which regular languages are “pumpable”). In addition, we show the following results with respect to the expressiveness of DRX:

  1. There are regular languages that are not in $\mathcal{L}(\mathsf{DRX})$.

  2. $\mathcal{L}(\mathsf{DRX})$ contains regular languages that are not deterministic regular.

  3. $\mathcal{L}(\mathsf{DRX})$ contains all unary regular languages.

  4. $\mathcal{L}(\mathsf{DRX})$ is not closed under union, concatenation, reversal, complement, homomorphism, inverse homomorphism, intersection, and intersection with deterministic regular languages.

While result 1 fits the situation that there are also regular languages that are not deterministic regular languages (and, thus, points out that our definition of determinism restricts regex in a similar way as classical regular expressions), result 2 points out that in the case of regex this restriction is not as strong. With respect to result 3, note that not all unary regular languages are deterministic regular (see [37]). From a technical point of view, it is worth mentioning that in some of our proofs, we use subtleties of the back-reference operator in novel ways. Intuitively speaking, defining and referencing variables under a Kleene-star allows for shifting around factors between different variables (even arbitrarily often in loops), which makes it possible to abuse variables for generating seemingly non-deterministic regular structures in a deterministic way, instead of generating non-regular structures.

As a last strong point in favour of our definition of determinism for regex, we examine a natural relaxation of the definition of determinism (i. e., requiring determinism only with respect to a constant look-ahead). We prove that checking whether a regex is deterministic under this more general notion is intractable (even for the class of variable-star-free regex).

Summing up, from the perspective of deterministic regular expressions, we propose a natural extension that significantly increases the expressive power, while still having a tractable membership problem. From a regex point of view, we restrict regex to their deterministic core, thus obtaining a tractable subclass. Hence, the authors intend this paper as a starting point for further work, as it opens a new direction of research into making regex tractable.

Structure of the paper

In Section 2, we define some basic concepts and the syntax and semantics of regex; in addition, since this aspect is often neglected in the existing literature, which has caused misunderstandings, we also provide a thorough discussion of existing variants of regex in theory and practice. Section 3 is devoted to the definition of memory automata with trap state and their deterministic subclass. In addition, we provide an extensive automata-theoretic toolbox in this section that, besides showing interesting facts about TMFA, shall play an important role for our further results. Next, in Section 4, we define deterministic regex and provide the respective Glushkov-construction. The expressive power of DRX (and of related classes resulting from different variants of the underlying automaton model) is investigated in Section 5, and the decidability and hardness results of the static analysis of DRX are provided in Section 6. Finally, in Section 7, we discuss the above-mentioned relaxation of determinism, and we conclude the paper in Section 8.

2 Preliminaries

We use to denote the empty word. The subset and proper subset relation are denoted by and , respectively. Let be a finite terminal alphabet (unless otherwise noted, we assume ) and let be an infinite variable alphabet with . For a word and for every , , denotes the symbol at position of . We define and for all , and, for with , let for all and all with . A is a factor of if there exist with . If , is also a prefix of .

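We use the notions of deterministic and non-deterministic finite automata (DFA and NFA) as in [32]. If an NFA can have $\varepsilon$-transitions, we call it an $\varepsilon$-NFA. Given a class $\mathcal{C}$ of language description mechanisms (e. g., a class of automata or regular expressions), we use $\mathcal{L}(\mathcal{C})$ to denote the class of all languages $\mathcal{L}(C)$ with $C \in \mathcal{C}$. The membership problem for $\mathcal{C}$ is defined as follows: Given a $C \in \mathcal{C}$ and a $w \in \Sigma^{*}$, is $w \in \mathcal{L}(C)$?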

Next, we define the syntax and semantics of regular expressions with backreferences.

Definition 1 (Syntax of regex).

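The set $\mathsf{RX}$ of regex over $\Sigma$ and $\Xi$ is recursively defined as follows:
Terminals and $\varepsilon$: $a \in \mathsf{RX}$ and $\mathsf{var}(a) = \emptyset$ for every $a \in (\Sigma \cup \{\varepsilon\})$.
Variable reference: $\&x \in \mathsf{RX}$ and $\mathsf{var}(\&x) = \{x\}$ for every $x \in \Xi$.
Concatenation: $(\alpha \cdot \beta) \in \mathsf{RX}$ and $\mathsf{var}(\alpha \cdot \beta) = \mathsf{var}(\alpha) \cup \mathsf{var}(\beta)$ if $\alpha, \beta \in \mathsf{RX}$.
Disjunction: $(\alpha \vee \beta) \in \mathsf{RX}$ and $\mathsf{var}(\alpha \vee \beta) = \mathsf{var}(\alpha) \cup \mathsf{var}(\beta)$ if $\alpha, \beta \in \mathsf{RX}$.
Kleene plus: $(\alpha^{+}) \in \mathsf{RX}$ and $\mathsf{var}(\alpha^{+}) = \mathsf{var}(\alpha)$ if $\alpha \in \mathsf{RX}$.
Variable binding: $\langle x\colon \alpha \rangle \in \mathsf{RX}$ and $\mathsf{var}(\langle x\colon \alpha \rangle) = \mathsf{var}(\alpha) \cup \{x\}$ if $\alpha \in \mathsf{RX}$ with $x \in \Xi \setminus \mathsf{var}(\alpha)$.
In addition, we allow $\emptyset$ as a regex (with $\mathsf{var}(\emptyset) = \emptyset$), but we do not allow $\emptyset$ to occur in any other regex. An $\alpha \in \mathsf{RX}$ with $\mathsf{var}(\alpha) = \emptyset$ is called a proper regular expression, or just regular expression. We use $\mathsf{REG}$ to denote the set of all regular expressions. We add and omit parentheses freely, as long as the meaning remains clear. We use the Kleene star $\alpha^{*}$ as shorthand for $\alpha^{+} \vee \varepsilon$, and $A$ as shorthand for $\bigvee_{a \in A} a$ for non-empty $A \subseteq \Sigma$.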

We define the semantics of regex using the ref-words (short for reference words) by Schmid [46]. A ref-word is a word over $(\Sigma \cup \Xi \cup \Gamma)$, where $\Gamma := \{[_x, ]_x \mid x \in \Xi\}$. Intuitively, the symbols $[_x$ and $]_x$ mark the beginning and the end of the match that is stored in the variable $x$, while an occurrence of $x$ represents a reference to that variable. Instead of defining the language of a regex $\alpha$ directly, we first treat $\alpha$ as a generator of ref-words by defining its ref-language $\mathcal{R}(\alpha)$ as follows.

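  • For every $a \in \Sigma \cup \{\varepsilon\}$, $\mathcal{R}(a) := \{a\}$.

  • For every $x \in \Xi$, $\alpha \in \mathsf{RX}$,

    • $\mathcal{R}(\&x) := \{x\}$,

    • $\mathcal{R}(\langle x\colon \alpha \rangle) := [_x\, \mathcal{R}(\alpha)\, ]_x$.

  • For every $\alpha, \beta \in \mathsf{RX}$,

    • $\mathcal{R}(\alpha \cdot \beta) := \mathcal{R}(\alpha) \cdot \mathcal{R}(\beta)$,

    • $\mathcal{R}(\alpha \vee \beta) := \mathcal{R}(\alpha) \cup \mathcal{R}(\beta)$, and

    • $\mathcal{R}(\alpha^{+}) := \mathcal{R}(\alpha)^{+}$.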

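In particular, if $\alpha$ is a regular expression, then $\mathcal{R}(\alpha) = \mathcal{L}(\alpha)$. An alternative definition of the ref-language would be $\mathcal{R}(\alpha) := \mathcal{L}(\alpha_{\mathcal{R}})$, where $\alpha_{\mathcal{R}}$ is the proper regular expression obtained from $\alpha$ by replacing each sub-regex $\langle x\colon \beta \rangle$ by $[_x\, \beta\, ]_x$, and each $\&x$ by $x$.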

Intuitively speaking, every occurrence of a variable $x$ in some $r \in \mathcal{R}(\alpha)$ functions as a pointer to the next factor $[_x\, v\, ]_x$ to the left of this occurrence (or to $\varepsilon$ if no such factor exists). In this way, a ref-word $r$ compresses a word over $\Sigma$, the so-called dereference $\mathcal{D}(r)$ of $r$, which can be obtained by replacing every variable occurrence $x$ by the corresponding factor $v$ (note that $v$ might again contain variable occurrences, which need to be replaced as well), and removing all symbols $[_x, ]_x \in \Gamma$ afterwards. See [46] for a more detailed definition, or the following Example 1 for an illustration. Finally, the language of a regex $\alpha$ is defined by $\mathcal{L}(\alpha) := \{\mathcal{D}(r) \mid r \in \mathcal{R}(\alpha)\}$.
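The dereference of a ref-word, as described informally above, can be sketched as a single left-to-right scan. The token encoding and the function name `deref` are assumptions made for this illustration; see [46] for the formal definition.

```python
def deref(ref_word):
    """Dereference a ref-word given as a list of tokens.

    Token encoding (hypothetical): a plain string is a terminal symbol,
    ("open", x) and ("close", x) are the two brackets of variable x,
    and ("ref", x) is an occurrence of the variable x.
    """
    memory = {}        # variable -> most recently recorded factor
    open_vars = set()  # variables whose bracket is currently open
    output = []

    def emit(text):
        # Text that is produced inside an open bracket is also recorded.
        output.append(text)
        for x in open_vars:
            memory[x] += text

    for token in ref_word:
        if isinstance(token, str):
            emit(token)
            continue
        kind, x = token
        if kind == "open":
            memory[x] = ""  # a new binding replaces the previous one
            open_vars.add(x)
        elif kind == "close":
            open_vars.discard(x)
        elif kind == "ref":
            # A variable without a binding to its left yields the empty word.
            emit(memory.get(x, ""))
        else:
            raise ValueError(f"unknown token kind {kind!r}")
    return "".join(output)

example = [("open", "x"), "a", "b", ("close", "x"), ("ref", "x"),
           ("open", "y"), ("ref", "x"), "c", ("close", "y"), ("ref", "y")]
print(deref(example))
```

The binding of `y` in the example contains a reference to `x`, which illustrates that a recorded factor may itself contain variable occurrences that have to be replaced.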

Example 1.

Let . Then

or, equivalently, , with .

An interesting example is the regex with ref-language , where . For example, for

we have with . Using induction, we can verify that . Thus, .

Hence, unlike regular expressions, regex can define non-regular languages. The expressive power comes at a price: their membership problem is NP-complete (follows from Angluin [3]), and various other problems are undecidable (Freydenberger [23]). Starting with Aho [2], there have been various approaches to specifying syntax and semantics of regex. While [2] only sketched the intuition behind the semantics, the first formal definition (using parse trees) was proposed by Câmpeanu, Salomaa, Yu [10], followed by the ref-words of Schmid [46]. In the following, we provide a more detailed discussion of the different approaches and actual implementations of regex.

2.1 Regex in Theory and Practice

In this section, we motivate the choice of the formalization of regex syntax and semantics that are used in the current paper, in particular in comparison to [10], and then connect these to the use of back-references in actual implementations. Note that in order to explain how our results and concepts can be adapted to various alternative definitions of syntax and semantics, we anticipate some of the technical content of our paper.

2.1.1 Choices behind the definition

We begin with a discussion of semantics of back-references, which most actual implementations define in terms of the used matching algorithm (from a theory point of view, this might be considered a rather generous use of the term “define”). For a theoretical analysis, this approach is not satisfactory.

To the authors’ knowledge, the first theoretical analysis of regular expressions with back-references is due to Aho [2], who defined the semantics informally. Câmpeanu, Salomaa, Yu [10] then proposed a definition using parse trees, which was precise, but rather technical and unwieldy. Schmid [46] then introduced the definition with ref-words that we use in the current paper. The two definitions differ only in some semantic details, which we discuss further down.

The most obvious difference in approaches to syntax is that some formalizations, like [10], do not use variables, but numbered back-references. For example, would be written as , where refers to the content of the first pair of parentheses (called the first group).

After working with this definition for some time, the authors of the present paper came to the conclusion that using numbered back-references instead of named variables is inconvenient (both when reading and writing regex). The developers of actual implementations seem to agree with this sentiment: While using numbered back-references was well-motivated when considering PERL at the time [10] was published, most current regex dialects allow the use of named groups, which basically act like our variables (depending on the actual dialect, see below). The choice between variables and numbered groups is independent of the choice of semantics, as parse trees can also be used with variables, see [23]. Hence, using variables instead of numbers is a natural choice.

Building on this, the next question is whether the same variable can be bound at different places in the regex (which is automatically excluded by the use of numbered groups as in [10]), i. e., whether one allows expressions like

While some implementations that have developed from back-references forbid these constructions to certain degrees (see Section 2.1.2 below), there seems to be no particular reason for this decision when approaching this question without this historical baggage. In fact, one can argue from an application point of view that expressions like the following make sense (abstracting away some details that would be needed in actual use):

In fact, these constructions are explicitly allowed in the regex formulas of Fagin et al. [20], that are closely related to regex. In particular, both the semantic definitions (ref-words and parse trees) allow this choice. Thus, there seems to be no particular practical reason to disallow these constructions when considering only the model (instead of its algorithmic properties).

In addition to disallowing the repeated binding of the same variable described above, the regex definition in [10] also includes a syntactic restriction that changes the expressive power considerably: It requires that a backreference can only appear in a regex if it occurs to the right of corresponding group number . In [10], otherwise, the expression is called a “semi-regex”. Consider from Example 1. In the numbered notation of [10], this would be expressed as , when adding group numbers to the groups to increase readability. But using definitions from [10], is only a semi-regex, as the reference occurs to the left of group 3.

The motivation behind this restriction is not explained in [10]. While one might argue that this was chosen to avoid referencing unused groups, the definition of semantics in [10] still needs to deal with this problem in regexes like , and handles them by assigning  (like the definition from [46], which we use as well). Hence, even on “semi-regex”, the parse tree semantics behave like the ref-word semantics.

Arguably, the restriction has an advantage from a theoretical point of view, as it allows Câmpeanu, Salomaa, Yu [10] and Carle and Narendran [11] to define pumping lemmas for this class. Using these, it is possible to show that languages like from Example 1 or the language from Lemma 3 cannot be expressed with the regex model from [10]. But in other areas, there seems to be no advantage in this choice: Even under this restriction, the membership problem is NP-complete (since it is still possible to describe Angluin’s pattern languages [3]), the undecidability results from [23] on various problems of static analysis are unaffected by this choice, and even the proof of Theorem 9 (the undecidability of the disjointness problem for deterministic ) directly works on this subclass. In summary, the authors of the current paper see no reason to adopt this restriction.

For full disclosure, the second author points out that in his own articles [45, 46], using [10] as reference for a full definition of regex is not entirely correct, since the restrictions of [10] discussed above are not used in these papers; instead they talk about the language class of the current paper.

The last choice in the definition that we need to address is how we deal with referencing undefined variables. Both [10] and [46] default those references to $\varepsilon$ (as do others, like [23]); but there is also literature, like [11], that uses $\emptyset$ as default value (under these semantics, a ref-word that contains a variable that de-references to $\emptyset$ cannot generate any terminal words; the same holds for a parse tree that contains such a reference). This choice can easily be implemented in both semantics by discarding a ref-word or parse tree that contains such a reference; and a memory automaton (see Section 3) can reject if a run encounters a reference to such an undefined memory.

While these “$\emptyset$-semantics” are also used in some actual implementations, the authors of the current paper are against this approach. One of the reasons is that using $\emptyset$ as default allows the use of curious synchronization effects that distract from the main message of this paper. For example, let for some , and define

If unbound variables default to $\emptyset$, this regex generates the language

as every variable needs to be assigned exactly once (otherwise, a reference would return $\emptyset$ and block). Hence, using this semantics, even variables that are bound only to $\varepsilon$ can be used for synchronization effects. While this can lead to interesting constructions, the authors think that it provides more insight to study the effects of back-references on lower bounds without relying on these additional features. This way, there is no question whether the hardness of the examined problems is due to the effects of the $\emptyset$-semantics.

Furthermore, all examples in the present paper can be adapted from the used -semantics to -semantics: Given an with the variables , define . First, we observe that the language that is defined by under -semantics is the same language that defines under -semantics. Furthermore, note that if is deterministic (in the sense as shall be defined in Section 4), is also deterministic. The analogous construction can be used for (and ).

While it is possible to adapt most of the results in the current paper directly to this alternative semantics, the authors chose to keep the paper focused on $\varepsilon$-semantics.

2.1.2 Actual implementations

We now give a brief overview of how back-references are used in some actual implementations. For a good introduction on various dialects, the authors recommend [30], in particular the section on back-references and named groups. As this behavior is often under-defined, badly documented, and implementation dependent, this can only be a very short and superficial summary of some behavior.

Before we go into details, we address why back-references are used, in spite of the resulting NP-hard membership problem: Most regex libraries use a backtracking algorithm that can have exponential running time, even on many proper regular expressions (see Cox [15]). From this point of view, back-references can be added with little implementation effort and without changing the efficiency of the program.

Most modern dialects of regex not only support numbered back-references as used by [10], but also named capture groups, which work like our variables. In some dialects, such as Python, PERL, and PCRE, these act as aliases for back-references with numbers; hence, would be interpreted as . As a consequence, each name resolves to a well-defined number. As some of these dialects assign the empty set as default value of unbound back-references (or group names), the resulting behavior is similar to implicitly requiring the restriction from [10]. This implementation of named capture groups seems to be mostly for historical reasons (as back-references were introduced earlier).
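As a small illustration of this behavior in one such dialect, the following sketch uses Python's `re` module. It shows that a named group can also be referenced by its number, and that a reference to a group that did not take part in the match makes the match fail, which corresponds to the empty set as default value. The concrete patterns and the helper `matches` are assumptions chosen for this example.

```python
import re

# A named group is also a numbered group, so both references agree.
by_name = re.compile(r"(?P<x>a+)b(?P=x)")
by_number = re.compile(r"(?P<x>a+)b\1")

# The group x is only set if the left branch of the disjunction is taken.
optional_binding = re.compile(r"(?:(?P<x>a)|b)(?P=x)")

def matches(pattern, word):
    """Return True if the whole word is matched by the compiled pattern."""
    return pattern.fullmatch(word) is not None

if __name__ == "__main__":
    print(matches(by_name, "aabaa"), matches(by_number, "aabaa"))
    # Under the semantics used in this paper, an unbound reference would
    # yield the empty word and "b" would be accepted; Python rejects it.
    print(matches(optional_binding, "aa"), matches(optional_binding, "b"))
```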

In contrast to this, there are other dialects that use numbered back-references and explicitly allow references to access groups that occur to their right in the expression. For example, the W3C recommendation for XPath and XQuery functions and operators [33] defines regular expressions with back-references for the use in fn:matches. There, it is possible to refer to capture groups that occur to the right of the reference (although only for the capture groups 1 to 9, but not for 10 to 99, which might be considered a peculiar decision). As this dialect defaults unbound references to , it is possible to directly express by renaming the variable references to back-references.

Furthermore, .NET allows the same name to be used for different groups, for example . While .NET defaults unset variables to , it is possible to express , by using an expression like . In the same way, every regex in the sense of our paper can be converted into an equivalent .NET regex.

Finally, in 2007 (just four years after the publication of [10]), PERL 5.10 introduced branch reset groups (which were also adopted in PCRE). These reset the numbering inside disjunctions, and allow expressions that behave like the expression . This allows PERL regex to replicate a large part of the behavior of .NET regex.

In conclusion, it seems that almost every formalization of regex syntax and semantics can be justified by finding the right dialect; but every restriction might be superseded by the continual evolution of regex dialects. Hence, the current paper attempts to avoid restrictions; and when in doubt, we choose natural definitions over trying to directly emulate a single dialect. Therefore, we use variables instead of numbered back-references, and allow multiple uses of the same variable name.

The authors acknowledge that most actual implementations of “regular expressions” allow additional operators. Common features are counters, which allow constructions like e. g. that define the language , character classes and ranges, which are shortcuts for sets like “all alphanumeric symbols” or “all letters from b to y”, and lookahead and lookbehind, which can be understood as allowing the expression to call another expression as a kind of subroutine.
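For readers who are unfamiliar with these operators, the following minimal sketch shows one instance of each in Python's `re` syntax. The patterns are illustrative assumptions and are not taken from the paper.

```python
import re

counter = re.compile(r"a{2,5}")        # between two and five copies of a
letter_range = re.compile(r"[b-y]+")   # non-empty words over the letters b to y
needs_b = re.compile(r"(?=a*b)[ab]+")  # lookahead: after leading a's, a b must follow

def accepts(pattern, word):
    """Return True if the whole word is matched by the compiled pattern."""
    return pattern.fullmatch(word) is not None

if __name__ == "__main__":
    print(accepts(counter, "aaa"), accepts(counter, "a"))
    print(accepts(letter_range, "by"), accepts(letter_range, "az"))
    print(accepts(needs_b, "aab"), accepts(needs_b, "aaa"))
```

The lookahead in `needs_b` checks a condition on the upcoming input without consuming it, which is why it behaves like a call to a separate expression.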

While these operators are outside of the scope of the current paper, we briefly address the issue of counters. These are used in XML DTDs and XML Schema, and were studied in connection to determinism. In particular, Gelade, Gyssens, Martens [29] described how counters can be added to finite automata and proposed an appropriate extension of determinism and Glushkov construction to this model. Although the current paper does not address this matter (in order to keep the paper focussed), the that we introduce in Section 3 can also be extended with counters (like the extension to in [29]). Likewise, the Glushkov constructions of [29] and the current paper can be combined, as can the notions of determinism. The membership problem for the resulting class of deterministic regex with counters can then be solved as efficiently as for deterministic regex (see Theorem 5).

3 Memory Automata with Trap State

In this section, we define memory automata with trap-state, the deterministic variant of which will be the algorithmic foundation for deterministic regex. Before moving on to the actual definition of deterministic regex in Section 4 and the applications of memory automata with trap-state, we subject this automaton model to a thorough theoretical analysis. Most of the thus obtained insights will have immediate consequences and applications for proving the main results regarding deterministic regex, while others have the mere purpose of supporting our understanding of memory automata (and therefore the important class of regex languages).

Memory automata [46] are a simple automaton model that characterizes the class of regex languages. Intuitively speaking, these are classical finite automata that can record consumed factors in memories, which can be recalled later on in order to consume the same factor again. However, for our applications, we need to slightly adapt this model to memory automata with trap-state.

Definition 2.

For every , a -memory automaton with trap-state, denoted by , is a tuple , where is a finite set of states that contains the trap-state , is a finite alphabet, is the initial state, is the set of final states and is the transition function (where denotes the power set of a set ), which satisfies , for every , and , for every , . The elements , , and are called memory instructions (they stand for opening, closing and reseting a memory, respectively, and leaves the memory unchanged).

A configuration of is a tuple , where is the current state, is the remaining input and, for every , , is the configuration of memory , where is the content of memory and is the status of memory (i. e., means that memory is open and means that it is closed). The initial configuration of (on input ) is the configuration , a configuration is an accepting configuration if and .

can change from a configuration to a configuration , denoted by , if there exists a transition with either ( and ) or (, and ), and, for every , ,

  • ,

  • ,

  • ,

  • ,

  • .

Furthermore, can change from a configuration to the configuration , if for some , and , such that with and .

A transition is an -transition if and is called consuming, otherwise (if all transitions are consuming, then is called -free). If , it is called a memory recall transition and the situation that a memory recall transition leads to the state , is called a memory recall failure.

The symbol denotes the reflexive and transitive closure of . A is accepted by if , where is the initial configuration of on and is an accepting configuration. The set of words accepted by is denoted by .

Note that executing the open action on a memory that already contains some word discards the previous contents of that memory. A crucial part of a TMFA is the trap-state, in which computations terminate if a memory recall failure happens. If the trap-state is not accepting, then TMFA are (apart from negligible formal differences) identical to the memory automata introduced in [46], which characterize the class of regex languages. If, on the other hand, the trap-state is accepting, then every computation with a memory recall failure is accepting (independently of the remaining input). While it seems counter-intuitive to define the words of a language via “failed” back-references, the possibility of having an accepting trap-state yields closure under complement for deterministic TMFA (see Theorem 3). It will be convenient to partition TMFA into two subclasses: those with a rejecting trap-state and those with an accepting trap-state.

Next, we illustrate the concept of memory automata with trap-state by some examples (further illustrations can be found in [46]). For the sake of convenience, we present these automata in the form of the usual automata diagrams (initial states are marked by an unlabeled incoming arc, accepting states by an additional circle, and arcs are labelled with the transition tuples).

Intuitively speaking, in a single step of a computation of a memory automaton with trap-state, we first change the memory statuses according to the memory instructions of the transition, and then a (possibly empty) prefix of the remaining input is consumed and appended to the content of every memory that is currently open (note that here the new statuses after applying the memory instructions count). This prefix is either a single symbol of the alphabet or the empty word, or it equals the content of some memory that, according to the definition, has been closed by the same transition. The changes of memory configurations caused by a transition are illustrated in Figure 1.
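The effect of a single step on the memories can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the instruction names `"o"` (open), `"c"` (close), `"r"` (reset) and `"-"` (leave unchanged), the `Memory` class and the function names are hypothetical, and reset is assumed to leave the memory empty and closed.

```python
from dataclasses import dataclass

OPEN, CLOSED = "O", "C"  # assumed names for the two memory statuses


@dataclass(frozen=True)
class Memory:
    content: str = ""
    status: str = CLOSED


def apply_instruction(mem, instr):
    """Apply one memory instruction (names are assumptions of this sketch)."""
    if instr == "o":   # opening discards any previous content
        return Memory("", OPEN)
    if instr == "c":   # closing keeps the content
        return Memory(mem.content, CLOSED)
    if instr == "r":   # reset: assumed to yield an empty, closed memory
        return Memory("", CLOSED)
    if instr == "-":   # no instruction for this memory
        return mem
    raise ValueError(f"unknown memory instruction: {instr!r}")


def step_memories(memories, instrs, consumed):
    """First update the statuses, then append the consumed prefix
    to every memory that is open after the update."""
    updated = [apply_instruction(m, s) for m, s in zip(memories, instrs)]
    return [
        Memory(m.content + consumed, m.status) if m.status == OPEN else m
        for m in updated
    ]
```

For instance, `step_memories([Memory("ab", CLOSED), Memory("x", OPEN)], ("o", "c"), "a")` opens the first memory (discarding `ab`) and closes the second before the consumed symbol is appended, so only the first memory records `a`.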

Figure 1: Possible configuration changes of a fixed memory. Note that by and , we denote an empty or non-empty memory content, respectively; the instruction is omitted. Moreover, the diagram only shows configuration changes caused by memory instructions (in particular, can only change into by consuming transitions).
Example 2.

Consider the following   with two memories over :

This works as follows. First, in state , we record a non-empty word over in the first memory, then, in state , a non-empty word over in the second memory, and then, by moving through states and , these words are repeated in reverse order by first recalling the second and then the first memory (note that in the transition from to , an already closed memory is closed again, since according to Definition 2, every memory that is recalled must be closed in the same transition). Due to the -transition from to , describes the Kleene-plus of such words, i. e., , where .

Note that each of the two memory recall transitions closes the respective memory. This is required by definition, as a transition can only recall a memory if it ensures that it is closed.
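For readers who think in terms of practical regex libraries, the shape of the language in Example 2 (two recorded non-empty words that are repeated in reverse order, under a Kleene-plus) can be imitated with back-references in Python's `re` module. This is a hedged analogue rather than the automaton itself: the alphabet {a, b} for both recorded words and the names `PATTERN` and `in_language` are assumptions of this sketch.

```python
import re

# Assumption: both recorded words range over {a, b}.
# Group 1 plays the role of the first memory, group 2 of the second;
# \2 and \1 recall them in reverse order, and (?: ... )+ is the Kleene-plus.
PATTERN = re.compile(r"(?:([ab]+)([ab]+)\2\1)+")


def in_language(word):
    """Return True if the whole word has the form (v1 v2 v2 v1)+."""
    return PATTERN.fullmatch(word) is not None
```

With this pattern, `in_language("abba")` holds (v1 = a, v2 = b), whereas `in_language("abab")` does not. The backtracking matcher has to guess where each recorded word ends, which mirrors the nondeterminism of the automaton.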

Example 3.

Consider the following   with two memories over :

The behavior of can be described as follows: First, opens memory 1 and reads , . After that, opens the second memory, reads (which is stored in both memories), , closes the first memory, reads , , and closes the second memory. Hence, after reading , the first memory contains , and the second . Finally, recalls memory 1 and then 2. Hence, .

Now, note that in each input word , memory 2 is opened and closed after memory 1. Hence, if , the areas in where the two memories are open overlap, instead of being nested. This cannot happen in a regex, as the syntax of variable bindings ensures that these “areas” in the word are properly nested. For this reason, it seems impossible to express with a regex with only two variables. But this does not mean that is not a regex language, as for . In other words, the key idea is to express each memory with two variables (one for the overlapping parts of the memories, and one for each rest). Proving that these overlaps can always be resolved is the main step in showing the equivalence of memory automata and regex, which is provided in [46] (see also the discussion at the end of the proof of Theorem 1).

Next, we shall see that every memory automaton with an accepting trap-state can be transformed into an equivalent one with a rejecting trap-state; thus, it follows from [46] that memory automata with trap-state characterize the class of regex languages. The idea of this construction is as follows. Every memory is simulated by two memories, which store a (nondeterministically guessed) factorisation of the content of the original memory. This allows us to guess and verify whether a memory recall failure occurs, i. e., the first of the two memories stores the longest prefix that can be matched and the second one starts with the first mismatch. For correctness, it is crucial that every possible factorisation of the content of a memory can be guessed.
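The factorisation that has to be guessed can be made concrete with a small sketch. The function names below are hypothetical; the code only computes, for a given memory content and remaining input, the split into the longest matching prefix and the rest, and lists all factorisations among which a nondeterministic automaton would have to guess.

```python
def split_at_first_mismatch(content, remaining):
    """Split content into (longest prefix that matches the remaining input, rest).

    If the rest is non-empty and the input is not exhausted, the rest
    starts with the first mismatching symbol.
    """
    n = 0
    while n < len(content) and n < len(remaining) and content[n] == remaining[n]:
        n += 1
    return content[:n], content[n:]


def all_factorisations(content):
    """All ways to write content as a concatenation of two factors."""
    return [(content[:i], content[i:]) for i in range(len(content) + 1)]
```

For the content `abcab` and the remaining input `abxab`, the split is `("ab", "cab")`, and this pair is one of the six entries of `all_factorisations("abcab")`.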

We first need the following definition. A memory automaton with trap-state is in normal form if no empty memory is recalled, no open memory is opened, no memory is reset, and, for every transition ,

  • if , then , ,

  • if , for some , , then and , for every , , .

Proposition 1.

Any memory automaton with trap-state can be transformed into an equivalent one in normal form.

Proof.

An arbitrary memory automaton with trap-state can be changed into an equivalent one in normal form as follows. By introducing ε-transitions, we can make sure that every transition is of the form stated in the proposition. Furthermore, by adding states, we can keep track of the memory configurations (i. e., their status and whether or not they are empty; this simple technique is also explained in more detail in the proof of Theorem 3). This allows us to replace transitions that recall an empty memory by ε-transitions. Furthermore, transitions that open an open memory are replaced by transitions applying the memory instructions and in this order to memory , and transitions that reset a memory are replaced by transitions applying the memory instructions , and in this order to memory (the correctness of this can easily be checked with the help of Figure 1). The automaton is then in normal form and, by definition, these modifications do not change the accepted language. ∎

Now, we can formally prove the claimed characterisation.

Theorem 1.

.

Proof.

We first note that follows from [46] (we briefly discuss this at the end of the proof). Since and , it only remains to prove . To this end, let be a in normal form. First, we replace every memory , , by two memories and and we implement in the finite state control a list with entries from , which initially satisfies , . Then, we change the transitions of such that the new memories and simulate the old memory , i. e., memory stores some word if and only if memories and store and , respectively, with . Moreover, the element always equals the first symbol of the content of memory . More precisely, this can be done as follows. Let be an original transition of .

  • If or , for some , , then instead we open memory or close memory , respectively.

  • If , then, for every open memory , we nondeterministically choose to close it and open memory instead and set . Then we read from the input and change to state .

  • If , then we first recall memory and then, for every open memory , we nondeterministically choose to close it and open memory instead and set . Then, we recall memory and change to state .

All these modifications can be done by introducing intermediate states and using ε-transitions, and the accepted language does not change.

The automaton now stores some content of an original memory factorised into two factors and in the memories and , respectively. For the sake of convenience, we simply say that is stored in in order to describe this situation. Next, we show that if is stored in , then any way of how is factorised into the content of and is possible. More precisely, we show that, for every with , can reach state by consuming with stored in if and only if can reach state by consuming with and stored in and , respectively .

The if part of this statement is trivial. We now assume that can reach state by consuming with stored in . This implies that we reach the situation that is open, currently stores and the next consuming transition consumes , where and with . If , then can choose to close and then open , which results in and being stored in and , respectively. If, on the other hand, , then the next transition recalls memories and such that is stored in . If and are stored in and , respectively, then first recalls , chooses to close and open , and then recalls , which results in and being stored in and . Consequently, we have to repeat this argument for memories and , i. e., we have to show that it is possible that is stored in in such a way that is stored in and is stored in . Repeating this argument, we will eventually arrive at a memory that is not filled by any memory recalls; thus, we necessarily have the case .

Now, we turn into a , i. e., the state becomes non-accepting, and, in addition, we add a new accepting state (simulating the old accepting ) with , , and we change all ordinary transitions (i. e., transitions that are not recall failure transitions) of the former that lead to such that they now lead to . Furthermore, we change this such that for every memory recall, there is also the nondeterministic choice to only recall , then check whether does not equal the next symbol on the input and, if this is the case, enter state . Obviously, this simulates the memory recall failure of .

Every word accepted by without memory recall failures can be accepted by in the same way, every word accepted by due to a recall failure can be accepted by by guessing and simulating this memory recall failure. On the other hand, if accepts a word with a simulated memory recall failure, then will accept this word by a proper memory recall failure, and if accepts a word without a simulated memory recall failure, then, since , there is no memory recall failure in the computation and can accept the word by the same computation.

This completes the proof of .

We shall conclude this proof by briefly sketching why holds. For an , it is straightforward to obtain an equivalent : Transform into a proper regular expression with (by just renaming variable bindings and references), then transform into an equivalent  , and finally interpret as a by interpreting transition labels , as memory instructions and transition labels as memory recalls. The other direction relies on first resolving overlaps of memories (i. e., the case that two memories store factors that overlap in the input word, see also Example 3) and then transforming the into a proper regular expression for a ref-language that dereferences to , which can then directly be interpreted as a regex (due to the non-overlapping property of memories, which translates into a well-formed nesting of the parentheses , ). This works in the same way as for the case of memory automata without trap-states (see [46] for details). ∎

A consequence of the proof is that memory automata with trap-state inherit the NP-hardness of the membership problem from regex. We do not devote more attention to this, as we focus on deterministic automata.

3.1 Deterministic Memory Automata with Trap-State

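A memory automaton with trap-state is deterministic if its transition function δ satisfies |δ(q, b)| ≤ 1 for every state q and every transition label b (for the sake of convenience, we then interpret δ as a partial function), and, furthermore, for every state q, if δ(q, x) is defined for some label x that is ε or a memory, then δ(q, y) is undefined for every label y ≠ x. Note that in [46], deterministic memory automata without trap-state are considered. Analogously to the general case, we partition the deterministic memory automata with trap-state into those with a rejecting and those with an accepting trap-state.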

Clearly, the automata of Examples 2 and 3 are not deterministic (in Example 2, there are different transitions for the same state that consume the same symbol, and in Example 3, there are states for which ε-transitions exist in addition to other transitions). By minor changes to the automaton of Example 2, a deterministic automaton can easily be constructed for the language ; the details are left to the reader.

The algorithmically most important feature of deterministic memory automata with trap-state is that their membership problem can be solved efficiently by running the automaton on the input word. However, for each processed input symbol, there might be a delay of at most steps, due to ε-transitions and recalls of empty memories, which leads to . Removing such non-consuming transitions first is possible, but problematic. In particular, recalls of empty memories depend on the specific input word and could only be determined beforehand by storing for each memory whether it is empty, which is too expensive. However, by an preprocessing, we can compute the information that is needed in order to determine in where to jump if certain memories are empty, and which memories are currently empty can be determined on-the-fly while processing the input. This leads to a delay of only , the number of memories:

Theorem 2.

Given with states and memories, and , we can decide whether or not

  • in time without preprocessing, or

  • in time after an preprocessing.
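The variant without preprocessing amounts to simply running the automaton on the input. The following sketch shows such a run; it does not implement the preprocessing and makes no claim about matching the stated time bounds. The encoding is an assumption of this sketch: transitions are a dictionary from `(state, label)` to `(target, instructions)`, where a label is a symbol, `None` for ε, or a 0-based memory number, and the instruction names are `"o"`, `"c"`, `"r"` and `"-"`. It is also assumed that a recall during which the input ends gets stuck and rejects, and that only a genuine mismatch counts as a memory recall failure.

```python
TRAP = "trap"  # assumed name of the trap-state


def accepts(delta, start, accepting, k, word):
    """Run a deterministic memory automaton with trap-state on word."""
    # Determinism: a state with an epsilon or recall transition has no other one.
    special = {q: lab for (q, lab) in delta if lab is None or isinstance(lab, int)}
    content, is_open = [""] * k, [False] * k
    state, pos, idle = start, 0, 0
    idle_limit = (k + 1) * len(delta)  # generous cut-off for non-consuming loops
    while True:
        if pos == len(word) and state in accepting:
            return True
        if state in special:
            label = special[state]
        elif pos < len(word) and (state, word[pos]) in delta:
            label = word[pos]
        else:
            return False  # no applicable transition
        target, instrs = delta[(state, label)]
        if label is None:
            consumed = ""
        elif isinstance(label, int):
            recalled, rest = content[label], word[pos:]
            n = 0
            while n < len(recalled) and n < len(rest) and recalled[n] == rest[n]:
                n += 1
            if n < len(recalled):
                # Mismatch: recall failure, the trap-state decides.
                # Input exhausted inside the recall: stuck (assumption).
                return TRAP in accepting if n < len(rest) else False
            consumed = recalled
        else:
            consumed = label
        for i, s in enumerate(instrs):  # statuses change first ...
            if s == "o":
                content[i], is_open[i] = "", True
            elif s == "c":
                is_open[i] = False
            elif s == "r":
                content[i], is_open[i] = "", False
        for i in range(k):              # ... then open memories record the input
            if is_open[i]:
                content[i] += consumed
        pos += len(consumed)
        state = target
        idle = 0 if consumed else idle + 1
        if idle > idle_limit:
            return False  # caught in a loop of non-consuming transitions


# Hypothetical example automaton for { v c v : v in {a, b}* } with one memory.
DELTA = {
    ("q0", None): ("q1", ("o",)),
    ("q1", "a"): ("q1", ("-",)),
    ("q1", "b"): ("q1", ("-",)),
    ("q1", "c"): ("q2", ("c",)),
    ("q2", 0): ("q3", ("c",)),
}
```

With the rejecting trap-state, `accepts(DELTA, "q0", {"q3"}, 1, "abcab")` holds and `accepts(DELTA, "q0", {"q3"}, 1, "abcaa")` does not; adding `TRAP` to the accepting set makes `abcaa` accepted, since its computation ends in a memory recall failure.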

Proof.

We first modify the automaton with respect to its ε-transitions as follows. Let be a state with an ε-transition that is followed by another ε-transition. If is contained in a cycle of ε-transitions, we simply replace this cycle by a single state (i. e., all incoming edges of any , , then point to ) that is accepting if and only if some , , is (note that, since is deterministic, no