Decision Procedures for Path Feasibility of String-Manipulating Programs with Complex Operations

The design and implementation of decision procedures for checking path feasibility in string-manipulating programs is an important problem, whose applications include symbolic execution and automated detection of cross-site scripting (XSS) vulnerabilities. A (symbolic) path is a finite sequence of assignments and assertions (i.e. without loops), and checking its feasibility amounts to determining the existence of inputs that yield a successful execution. We give two general semantic conditions which together ensure the decidability of path feasibility: (1) each assertion admits regular monadic decomposition, and (2) each assignment uses a (possibly nondeterministic) function whose inverse relation preserves regularity. We show these conditions are expressive since they are satisfied by a multitude of string operations. They also strictly subsume existing decidable string theories, and most existing benchmarks (e.g. most of Kaluza's, and all of SLOG's, Stranger's, and SLOTH's). We give a simple decision procedure and an extensible architecture of a string solver in that a user may easily incorporate his/her own string functions. We show the general fragment has a tight, but high complexity. To address this, we propose to allow only partial string functions (i.e., prohibit nondeterminism) in condition (2). When nondeterministic functions are needed, we also provide a syntactic fragment that provides a support of nondeterministic functions but can be reduced to an existing solver SLOTH. We provide an efficient implementation of our decision procedure for deterministic partial string functions in a new string solver OSTRICH. It provides built-in support for concatenation, reverse, functional transducers, and replaceall and provides a framework for extensibility to support further string functions. We demonstrate the efficacy of our new solver against other competitive solvers.

0.1 Introduction

Strings are a fundamental data type in virtually all programming languages. Their generic nature can, however, lead to many subtle programming bugs, some with security consequences, e.g., cross-site scripting (XSS), which is among the OWASP Top 10 Application Security Risks [55]. One effective automatic testing method for identifying subtle programming errors is based on symbolic execution [37] and combinations with dynamic analysis called dynamic symbolic execution [51, 29, 15, 52, 14]. See [16] for an excellent survey. Unlike purely random testing, which runs only concrete program executions on different inputs, the techniques of symbolic execution analyse static paths (also called symbolic executions) through the software system under test. Such a path can be viewed as a constraint φ (over appropriate data domains) and the hope is that a fast solver is available for checking the satisfiability of φ (i.e. to check the feasibility of the static path), which can be used for generating inputs that lead to certain parts of the program or an erroneous behaviour.

Constraints from symbolic execution on string-manipulating programs can be understood in terms of the problem of path feasibility over a bounded program S with neither loops nor branching (e.g. see [10]). That is, S is a sequence of assignments and conditionals/assertions, i.e., generated by the grammar

 S ::= y := f(x1,…,xr) | assert(g(x1,…,xr)) | S; S (1)

where f is a partial string function and g is a string relation. The following is a simple example of a symbolic execution S which uses string variables (x, y, and the z's) and string constants (letters a and b), and the concatenation operator (∘):

 z1 := x∘ba∘y; z2 := y∘ab∘x; assert(z1 == z2) (2)

The problem of path feasibility/satisfiability asks whether, for a given program S, there exist input strings (e.g. x and y in (2)) that can successfully take S to the end of the program while satisfying all the assertions. (It is equivalent to satisfiability of string constraints in the SMT framework [23, 7, 39]: simply convert a symbolic execution S into Static Single Assignment (SSA) form, i.e. use a new variable on the l.h.s. of each assignment, and treat assignments as equalities; e.g., the formula for the above example is z1 = x∘ba∘y ∧ z2 = y∘ab∘x ∧ z1 = z2, where ∘ denotes the string concatenation operation.) The path (2) can be satisfied by assigning y (resp. x) to b (resp. the empty string). In this paper, we will also allow nondeterministic functions since nondeterminism can be a useful modelling construct. For example, consider the code in Figure 1. It ensures that each element in s1 (construed as a list delimited by -) is longer than each element in s2. If f is a function that nondeterministically outputs a substring delimited by -, our symbolic execution analysis can be reduced to feasibility of the path:

 x := f(s1); y := f(s2); assert(len(x) ≤ len(y))
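
As a sanity check on path (2), the following minimal sketch enumerates short inputs by brute force and keeps those that reach the end of the path. It is only an illustration of what "path feasible" means; the function names and the length bound are our own assumptions, and a decision procedure would of course not enumerate strings.

```python
from itertools import product


def strings_up_to(alphabet, max_len):
    """Yield every string over `alphabet` of length at most `max_len`."""
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            yield "".join(letters)


def feasible_inputs(max_len=2):
    """Return all (x, y) with |x|, |y| <= max_len that satisfy path (2)."""
    witnesses = []
    for x in strings_up_to("ab", max_len):
        for y in strings_up_to("ab", max_len):
            z1 = x + "ba" + y        # z1 := x . ba . y
            z2 = y + "ab" + x        # z2 := y . ab . x
            if z1 == z2:             # assert(z1 == z2)
                witnesses.append((x, y))
    return witnesses
```

The assignment mentioned in the text (x the empty string, y = b) is among the witnesses returned by `feasible_inputs()`.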

In the last few decades much research on the satisfiability problem of string constraints suggests that it takes very little for a string constraint language to become undecidable. For example, although the existential theory of concatenation and regular constraints (i.e. an atomic expression is either E = E', where E and E' are concatenations of string constants and variables, or x ∈ L, where L is a regular language) is decidable and in fact pspace-complete [48, 25, 33], the theory becomes undecidable when enriched with letter-counting [12], i.e., expressions of the form |x|_a = |y|_b, where |·|_a is a function mapping a word to the number of occurrences of the letter a in the word. Similarly, finite-state transductions [42, 22, 31] are crucial for expressing many functions used in string-manipulating programs, including autoescaping mechanisms (e.g. backslash escape, and HTML escape in JavaScript) and the replaceall function with a constant replacement pattern. Nevertheless, checking a simple formula of the form R(x, x), for a given rational transduction R, can easily encode the Post Correspondence Problem [45], and therefore is undecidable. (A rational transduction is a transduction defined by a rational transducer, namely, a finite automaton over the alphabet (Σ ∪ {ε})², where ε denotes the empty string.)

Despite the undecidability of allowing various operations in string constraints, in practice it is common for a string-manipulating program to contain multiple operations (e.g. concatenation and finite-state transductions), and so a path feasibility solver nonetheless needs to be able to handle them. This is one reason why some string solving practitioners opted to support more string operations and settle with incomplete solvers (e.g. with no guarantee of termination) that could still solve some constraints that arise in practice, e.g., see [53, 54, 63, 62, 9, 50, 60, 1, 59, 36, 40, 2]. For example, the tool S3 [53, 54] supports general recursively-defined predicates and uses a number of incomplete heuristics to detect unsatisfiable constraints. As another example, the tool Stranger [59, 60] supports concatenation, replaceall (but with both pattern and replacement strings being constants), and regular constraints, and performs widening (i.e. an overapproximation) when a concatenation operator is seen in the analysis. Despite the excellent performance of some of these solvers on several existing benchmarks, there are good reasons for designing decision procedures with stronger theoretical guarantees, e.g., in the form of decidability (perhaps accompanied by a complexity analysis). One such reason is that string constraint solving is a research area in its infancy with an insufficient range of benchmarking examples to convince us that if a string solver works well on existing benchmarks, it will also work well on future benchmarks. A theoretical result provides a kind of robustness guarantee upon which a practical solver could further improve and optimise.

Fortunately, recent years have seen the possibility of recovering some decidability of string constraint languages with multiple string operations, while retaining applicability for constraints that arise in practical symbolic execution applications. This is done by imposing syntactic restrictions including acyclicity [6, 3], solved form [28], and straight-line [42, 30, 18]. These restrictions are known to be satisfied by many existing string constraint benchmarks, e.g., Kaluza [50], Stranger [59], SLOG [57, 30], and mutation XSS benchmarks of [42]. However, these results are unfortunately rather fragmented, and it is difficult to extend the comparatively limited number of supported string operations. In the following, we will elaborate this point more precisely. The acyclic logic of [6] permits only rational transductions, in which the replaceall function with constant pattern/replacement strings and regular constraints (but not concatenation) can be expressed. On the other hand, the acyclic logic of [3] permits concatenation, regular constraints, and the length function, but neither the replaceall function nor transductions. This logic is in fact quite related to the solved-form logic proposed earlier by [28]. The straight-line logic of [42] unified the earlier logics by allowing concatenation, regular constraints, rational transductions, and length and letter-counting functions. It was pointed out by [18] that this logic cannot express the replaceall function with the replacement string provided as a variable, which was never studied in the context of verification and program analysis. Chen et al. proceeded by showing that a new straight-line logic with the more general replaceall function and concatenation is decidable, but becomes undecidable when the length function is permitted.

Although the aforementioned results have been rather successful in capturing many string constraints that arise in applications (e.g. see the benchmarking results of [28] and [42, 30]), many natural problems remain unaddressed. To what extent can one combine these operations without sacrificing decidability? For example, can a useful decidable logic permit the more general replaceall, rational transductions, and concatenation at the same time? To what extent can one introduce new string operations without sacrificing decidability? For example, can we allow the string-reverse function (a standard library function, e.g., in Python), or more generally functions given by two-way transducers (i.e. the input head can also move to the left)? Last but not least, since there is a plethora of complex string operations, it is impossible for a solver designer to incorporate all the string operations that will be useful in all application domains. Thus, can (and, if so, how do) we design an effective string solver that can easily be extended with user-defined string functions, while providing a strong completeness/termination guarantee? Our goal is to provide theoretically-sound and practically implementable solutions to these problems.

Contributions. We provide two general semantic conditions (see Section 0.3) which together ensure decidability of path feasibility for string-manipulating programs:

1. the conditional in each assertion admits a regular monadic decomposition, and

2. each assignment uses a function whose inverse relation preserves “regularity”.

Before describing these conditions in more detail, we comment on the four main features (4Es) of our decidability result: (a) Expressive: the two conditions are satisfied by most string constraint benchmarks (existing and new ones including those of [50, 59, 57, 30, 42]) and strictly generalise several expressive and decidable constraint languages (e.g. those of [42, 18]), (b) Easy: it leads to a decision procedure that is conceptually simple (in particular, substantially simpler than many existing ones), (c) Extensible: it provides an extensible architecture of a string solver that allows users to easily incorporate their own user-defined functions to the solver, and (d) Efficient: it provides a sound basis of our new fast string solver OSTRICH that is highly competitive on string constraint benchmarks. We elaborate the details of the two aforementioned semantic conditions, and our contributions below.

The first semantic condition simply means that the relation g in each assertion can be effectively transformed into a finite union of Cartesian products of regular languages. (Note that this is not the intersection/product of regular languages.) A relation that is definable in this way is often called a recognisable relation [17], which is one standard extension of the notion of regular languages (i.e. unary relations) to general k-ary relations. The framework of recognisable relations can express interesting conditions that might at a first glance seem beyond "regularity" (e.g. the length constraint in Example 0.3.2 below). Furthermore, there are algorithms (called monadic decompositions in [56]) for deciding whether a given relation represented in highly expressive symbolic representations (e.g. a synchronised rational relation or a deterministic rational relation) is recognisable and, if so, output a symbolic representation of the recognisable relation [17]. On the other hand, the second condition means that the pre-image of a regular language under the function f is an r-ary recognisable relation. This is an expressive condition (see Section 0.4) satisfied by many string functions including concatenation, the string reverse function, one-way and two-way finite-state transducers, and the replaceall function where the replacement string can contain variables. Therefore, we obtain strict generalisations of the decidable string constraint languages in [42] (concatenation, one-way transducers, and regular constraints) and in [18] (concatenation, the replaceall function, and regular constraints). In addition, many string solving benchmarks (both existing and new ones) derived from practical applications satisfy our two semantic conditions, including the benchmarks of SLOG [57] with replace and replaceall, the benchmarks of Stranger [59], most of the Kaluza benchmarks [50], and the transducer benchmarks of [42, 30].

We provide a simple and clean decision procedure (see Section 0.3) which propagates the regular language constraints in a backward manner via the regularity-preserving pre-image computation. Our semantic conditions also naturally lead to an extensible architecture of a string solver: users can extend our solver with their own string functions by simply providing code that computes the pre-image of an input regular language, without worrying about other parts of the solver.

Having talked about the Expressive, Easy, and Extensible features of our decidability result (first three of the four Es), our decidability result does not immediately lead to an Efficient decision procedure and a fast string solver. A substantial proportion of the remaining paper is dedicated to analysing the cause of the problem and proposing ways of addressing it which are effective from both theoretical and practical standpoints.

Our hypothesis is that allowing general string relations (instead of just partial functions), although broadening the applicability of the resulting theory (e.g. see Figure 1), makes the constraint solving problem considerably more difficult. One reason is that propagating regular constraints backwards through a string relation seems to require performing a product automata construction over the regular constraints on the output variable before computing a recognisable relation for the pre-image. To make things worse, this product construction has to be done for practically every variable in the constraint, each of which causes an exponential blowup. We illustrate this with a concrete example in Example 0.5.2. We provide a strong piece of theoretical evidence that unfortunately this is unavoidable in the worst case. More precisely, we show (see Section 0.4) that the path feasibility problem with binary relations represented by one-way finite transducers (a.k.a. binary rational relations) and the replaceall function (allowing a variable in the replacement string) has a non-elementary complexity (i.e., time/space complexity cannot be bounded by a fixed tower of exponentials), with a single level of exponentials caused by a product automata construction for each variable in the constraint. This is especially surprising since allowing either binary rational relations or the aforementioned replaceall function alone results in a constraint language whose complexity is at most double exponential time and single exponential space (i.e. expspace); see [42, 18]. To provide further evidence of our hypothesis, we accompany this with another lower bound (also see Section 0.4): the path feasibility problem has a non-elementary complexity for relations that are represented by two-way finite transducers (without the replaceall function), which are possibly one of the most natural and well-studied classes of models of string relations (e.g. see [4, 26, 27] for the model).

We propose two remedies to the problem. The first one is to allow only string functions in our constraint language. This allows one to avoid the computationally expensive product automata construction for each variable in the constraint. In fact, we show (see Section 0.5.1) that the non-elementary complexity for the case of binary rational relations and the replaceall function can be substantially brought down to double exponential time and single exponential space (in fact, expspace-complete) if the binary rational relations are restricted to partial functions. Moreover, we prove that this complexity still holds if we additionally allow the string-reverse function and the concatenation operator. The expspace complexity might still sound prohibitive, but the highly competitive performance of our new solver OSTRICH (see below) shows that this is not the case.

Our second solution (see Section 0.5.2) is to still allow string relations, but find an appropriate syntactic fragment of our semantic conditions that yields better computational complexity. Our proposal for such a fragment is to restrict the use of replaceall to constant replacement strings, but allow the string-reverse function and binary rational relations. The complexity of this fragment is shown to be expspace-complete, building on the result of [42]. There are at least two advantages of the second solution. Firstly, while string relations are supported, our algorithm reduces the problem to constraints which can be handled by the existing solver SLOTH [30] that has a reasonable performance. Secondly, fully-fledged length constraints (including linear arithmetic expressions on the lengths of string variables) can be incorporated into this syntactic fragment without sacrificing decidability or increasing the expspace complexity. Our experimentation and the comparison of our tool with SLOTH (see below) suggest that our first proposed solution is to be strongly preferred when string relations are not used in the constraints.

We have implemented our first proposed decision procedure in a new fast string solver OSTRICH (Optimistic STRIng Constraint Handler). (As an aside, in contrast to an emu, an ostrich is known to be able to walk backwards, and hence the name of our solver, which propagates regular constraints in a backward direction.) Our solver provides built-in support for concatenation, reverse, functional transducers (FFT), and replaceall. Moreover, it is designed to be extensible and adding support for new string functions is a straightforward task. We compare OSTRICH with several state-of-the-art string solving tools, including SLOTH [30], CVC4 [40], and Z3 [9], on a wide range of challenging benchmarks, including SLOG's replace/replaceall [57], Stranger's [59], mutation XSS [42, 30], and the benchmarks of Kaluza that satisfy our semantic conditions [50]. It is the only tool that was able to return an answer on all of the benchmarks we used. Moreover, it significantly outperforms SLOTH, the only tool comparable with OSTRICH in terms of theoretical guarantees and closest in terms of expressibility. It also competes well with CVC4, a fast but incomplete solver, on the benchmarks for which CVC4 was able to return a conclusive response. We report details of OSTRICH and empirical results in Section 0.6.

0.2 Preliminaries

General Notation. Let ℤ and ℕ denote the set of integers and natural numbers respectively. For k ∈ ℕ, let [k] = {1, …, k}. For a vector x = (x1, …, xn), let |x| denote the length of x (i.e., n) and x(i) denote xi for each i ∈ [n]. Given a function f: A → B and X ⊆ B, we use f^{-1}(X) to define the pre-image of X under f, i.e., f^{-1}(X) = {a ∈ A | f(a) ∈ X}.

Regular Languages. Fix a finite alphabet Σ. Elements in Σ* are called strings. Let ε denote the empty string and Σ+ = Σ* \ {ε}. We will use a, b, … to denote letters from Σ and u, v, w, … to denote strings from Σ*. For a string w ∈ Σ*, let |w| denote the length of w (in particular, |ε| = 0); moreover, for a ∈ Σ, let |w|_a denote the number of occurrences of a in w. A position of a nonempty string w of length n is a number i ∈ [n] (note that the first position is 1, instead of 0). In addition, for i ∈ [|w|], let w(i) denote the i-th letter of w. For a string w ∈ Σ*, we use w^R to denote the reverse of w, that is, if w = a1 … an, then w^R = an … a1. For two strings w1, w2, we use w1 ∘ w2 to denote the concatenation of w1 and w2, that is, the string v such that |v| = |w1| + |w2| and for each i ∈ [|w1|], v(i) = w1(i), and for each i ∈ [|w2|], v(|w1| + i) = w2(i). Let u, v be two strings. If v = u ∘ v' for some string v', then u is said to be a prefix of v. In addition, if u ≠ v, then u is said to be a strict prefix of v. If u is a prefix of v, that is, v = u ∘ v' for some string v', then we use u^{-1}v to denote v'. In particular, ε^{-1}v = v.

A language over Σ is a subset of Σ*. We will use L1, L2, … to denote languages. For two languages L1, L2, we use L1 ∪ L2 to denote the union of L1 and L2, and L1 · L2 to denote the concatenation of L1 and L2, that is, the language {w1 ∘ w2 | w1 ∈ L1, w2 ∈ L2}. For a language L and n ∈ ℕ, we define L^n, the iteration of L for n times, inductively as follows: L^0 = {ε} and L^n = L · L^{n-1} for n > 0. We also use L* to denote an arbitrary number of iterations of L, that is, L* = ⋃_{n ∈ ℕ} L^n. Moreover, let L+ = ⋃_{n ∈ ℕ \ {0}} L^n.

Definition 0.2.1 (Regular expressions RegExp).
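 e ::= ∅ | ε | a | e + e | e ∘ e | e*, where a ∈ Σ.

Since + is associative and commutative, we also write (e1 + e2) + e3 as e1 + e2 + e3 for brevity. We use the abbreviation e+ ≡ e ∘ e*. Moreover, for Γ = {a1, …, an} ⊆ Σ, we use the abbreviations Γ ≡ a1 + ⋯ + an and Γ* ≡ (a1 + ⋯ + an)*.

We define L(e) to be the language defined by e, that is, the set of strings that match e, inductively as follows: L(∅) = ∅, L(ε) = {ε}, L(a) = {a}, L(e1 + e2) = L(e1) ∪ L(e2), L(e1 ∘ e2) = L(e1) · L(e2), L(e1*) = (L(e1))*. In addition, we use |e| to denote the number of symbols occurring in e.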

Automata models. We review some background from automata theory; for more, see [38, 32]. Let Σ be a finite set (called alphabet).

Definition 0.2.2 (Finite-state automata).

A (nondeterministic) finite-state automaton (FA) over a finite alphabet Σ is a tuple A = (Σ, Q, q0, F, δ) where Q is a finite set of states, q0 ∈ Q is the initial state, F ⊆ Q is a set of final states, and δ ⊆ Q × Σ × Q is the transition relation.

For an input string w = a1 … an, a run of A on w is a sequence of states q0, q1, …, qn such that (q_{j-1}, a_j, q_j) ∈ δ for every j ∈ [n]. The run is said to be accepting if qn ∈ F. A string w is accepted by A if there is an accepting run of A on w. In particular, the empty string ε is accepted by A iff q0 ∈ F. The set of strings accepted by A is denoted by L(A), a.k.a., the language recognised by A. The size of A is defined to be ; we will use this when we discuss computational complexity.

For convenience, we will also refer to an FA without initial and final states, that is, a pair (Q, δ), as a transition graph.

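Operations of FAs. For an FA A = (Σ, Q, q0, F, δ), q ∈ Q and P ⊆ Q, we use A(q, P) to denote the FA (Σ, Q, q, P, δ), that is, the FA obtained from A by changing the initial state and the set of final states to q and P respectively. We use q –w→_A P to denote that a string w is accepted by A(q, P).

Given two FAs A1 = (Σ, Q1, q0,1, F1, δ1) and A2 = (Σ, Q2, q0,2, F2, δ2), the product of A1 and A2, denoted by A1 × A2, is defined as (Σ, Q1 × Q2, (q0,1, q0,2), F1 × F2, δ1 × δ2), where δ1 × δ2 is the set of tuples ((q1, q2), a, (q1', q2')) such that (q1, a, q1') ∈ δ1 and (q2, a, q2') ∈ δ2. Evidently, we have L(A1 × A2) = L(A1) ∩ L(A2).

Moreover, let A = (Σ, Q, q0, F, δ); we define A^R as (Σ, Q ∪ {q0'}, q0', {q0}, δ'), where q0' is a newly introduced state not in Q and δ' comprises the transitions (q', a, q) such that (q, a, q') ∈ δ as well as the transitions (q0', a, q) such that (q, a, q') ∈ δ for some q' ∈ F. Intuitively, A^R is obtained from A by reversing the direction of each transition of A and swapping initial and final states. The new state q0' in A^R is introduced to meet the unique initial state requirement in the definition of FA. Evidently, A^R recognises the reverse language of L(A), namely, the language {w^R | w ∈ L(A)}.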
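
The two constructions above (product and reversal) can be sketched in a few lines. This is a minimal illustration under our own assumptions: the class name `FA`, the tuple encoding of transitions, and the name of the fresh state are hypothetical, and the reversal additionally marks the fresh initial state as final when the original initial state is final, so that the empty string is handled.

```python
from itertools import product as cartesian


class FA:
    """Nondeterministic finite automaton; delta is a set of (state, letter, state)."""

    def __init__(self, states, init, finals, delta):
        self.states = set(states)
        self.init = init
        self.finals = set(finals)
        self.delta = set(delta)

    def accepts(self, word):
        """Simulate all runs at once by tracking the set of reachable states."""
        current = {self.init}
        for letter in word:
            current = {q2 for (q1, a, q2) in self.delta
                       if q1 in current and a == letter}
        return bool(current & self.finals)


def fa_product(a1, a2):
    """Product automaton recognising the intersection of both languages."""
    delta = {((p1, p2), a, (q1, q2))
             for (p1, a, q1) in a1.delta
             for (p2, b, q2) in a2.delta if a == b}
    return FA(cartesian(a1.states, a2.states), (a1.init, a2.init),
              cartesian(a1.finals, a2.finals), delta)


def fa_reverse(a):
    """Automaton for the reverse language, with one fresh initial state."""
    fresh = ("fresh_init",)  # assumed not to clash with existing state names
    delta = {(q2, x, q1) for (q1, x, q2) in a.delta}
    # The fresh state imitates the first reversed step out of any old final state.
    delta |= {(fresh, x, q1) for (q1, x, q2) in a.delta if q2 in a.finals}
    finals = {a.init} | ({fresh} if a.init in a.finals else set())
    return FA(a.states | {fresh}, fresh, finals, delta)
```

For instance, if `ends_b` accepts the strings over {a, b} ending in b and `even` accepts the strings of even length, then `fa_product(ends_b, even)` accepts "ab" but neither "b" nor "aa", and `fa_reverse(ends_b)` accepts exactly the strings starting with b.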

It is well-known (e.g. see [32]) that regular expressions and FAs are expressively equivalent, and generate precisely all regular languages. In particular, from a regular expression, an equivalent FA can be constructed in linear time. Moreover, regular languages are closed under Boolean operations, i.e., union, intersection, and complementation.

Definition 0.2.3 (Finite-state transducers).

Let Σ be an alphabet. A (nondeterministic) finite transducer (FT) T over Σ is a tuple (Σ, Q, q0, F, δ), where δ is a finite subset of Q × Σ × Q × Σ*.

The notion of runs of FTs on an input string can be seen as a generalisation of FAs by adding outputs. More precisely, given a string w = a1 … an, a run of T on w is a sequence of pairs (q1, w1'), …, (qn, wn') such that for every j ∈ [n], (q_{j-1}, a_j, q_j, w_j') ∈ δ. The run is said to be accepting if qn ∈ F. When a run is accepting, w1' … wn' is said to be the output of the run. Note that some of these w_j' could be empty strings. A word w' is said to be an output of T on w if there is an accepting run of T on w with output w'. We use 𝒯(T) to denote the transduction defined by T, that is, the relation comprising the pairs (w, w') such that w' is an output of T on w.

We remark that an FT usually defines a relation. We shall speak of functional transducers, i.e., transducers that define functions instead of relations. (For instance, deterministic transducers are always functional.) We will use FFT to denote the class of functional transducers.

To take into consideration the outputs of transitions, we define the size of as the sum of the sizes of transitions in , where the size of a transition is defined as .

Example 0.2.4.

We give an example FT for the function escapeString, which backslash-escapes every occurrence of and ". The FT has a single state, i.e., and the transition relation comprises for each or ", , , and the final state . We remark that this FT is functional. ∎
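
The single-state transducer of this example can be simulated directly. The sketch below is a hedged illustration: the class `FT`, the state name "q0", and in particular the choice of escaped characters (single and double quote) are our assumptions, since the exact character set is not recoverable from the text above.

```python
class FT:
    """Finite transducer; delta is a set of (state, letter, next_state, output)."""

    def __init__(self, init, finals, delta):
        self.init = init
        self.finals = set(finals)
        self.delta = set(delta)

    def outputs(self, word):
        """Return the outputs of all accepting runs on `word`."""
        frontier = {(self.init, "")}
        for letter in word:
            frontier = {
                (target, produced + out)
                for (state, produced) in frontier
                for (source, a, target, out) in self.delta
                if source == state and a == letter
            }
        return {produced for (state, produced) in frontier
                if state in self.finals}


def escape_string_ft(alphabet, escaped=("'", '"')):
    """One-state FT that prefixes each escaped letter with a backslash.

    `escaped` is an assumption made for this sketch.
    """
    delta = set()
    for letter in alphabet:
        out = "\\" + letter if letter in escaped else letter
        delta.add(("q0", letter, "q0", out))
    return FT("q0", {"q0"}, delta)
```

Because every state/letter pair has exactly one transition, each input over the given alphabet has exactly one output, which matches the remark that the transducer is functional; inputs containing letters outside the alphabet have no run and hence no output.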

Computational Complexity. In this paper, we will use computational complexity theory to provide evidence that certain (automata) operations in our generic decision procedure are unavoidable. In particular, we shall deal with the following computational complexity classes (see [32] for more details): pspace (problems solvable in polynomial space and thus in exponential time), expspace (problems solvable in exponential space and thus in double exponential time), and non-elementary (problems not a member of the class elementary, where elementary comprises elementary recursive functions, which is the union of the complexity classes exptime, 2-exptime, 3-exptime, …, or alternatively, the union of the complexity classes expspace, 2-expspace, 3-expspace, …). Verification problems that have complexity pspace or beyond (see [5] for a few examples) have substantially benefited from techniques such as symbolic model checking [43].

0.3 Semantic conditions and a generic decision procedure

Recall that we consider symbolic executions of string-manipulating programs defined by the rules

 S ::= y := f(x1,…,xr) | assert(g(x1,…,xr)) | S; S (3)

where f: (Σ*)^r → 2^{Σ*} is a nondeterministic partial string function and g ⊆ (Σ*)^r is a string relation. Without loss of generality, we assume that symbolic executions are in Static Single Assignment (SSA) form. (Each symbolic execution can be turned into SSA form by using a new variable on the left-hand side of each assignment.)

In this section, we shall provide two general semantic conditions for symbolic executions. The main result is that, whenever the symbolic execution generated by (3) satisfies these two conditions, the path feasibility problem is decidable. We first define the concept of recognisable relations which, intuitively, are simply a finite union of Cartesian products of regular languages.

Definition 0.3.1 (Recognisable relations).

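An r-ary relation R ⊆ Σ* × ⋯ × Σ* is recognisable if R = ⋃_{i=1}^{n} L_1^(i) × ⋯ × L_r^(i), where L_j^(i) is regular for each j ∈ [r]. A representation of a recognisable relation R = ⋃_{i=1}^{n} L_1^(i) × ⋯ × L_r^(i) is (A_1^(i), …, A_r^(i))_{1 ≤ i ≤ n} such that each A_j^(i) is an FA with L(A_j^(i)) = L_j^(i). The tuples (A_1^(i), …, A_r^(i)) are called the disjuncts of the representation and the FAs A_j^(i) are called the atoms of the representation.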

We remark that the recognisable relation is more expressive than it appears to be. For instance, it can be used to encode some special length constraints, as demonstrated in Example 0.3.2.

Example 0.3.2.

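Let us consider the relation g(x1, x2) ≡ |x1| + |x2| ≥ 3, where x1 and x2 are strings over the alphabet {a, b}. Although syntactically g(x1, x2) is a length constraint, it indeed defines a recognisable relation. To see this, |x1| + |x2| ≥ 3 is equivalent to the disjunction of |x1| ≥ 3, |x1| ≥ 2 ∧ |x2| ≥ 1, |x1| ≥ 1 ∧ |x2| ≥ 2, and |x2| ≥ 3, where each disjunct describes a cartesian product of regular languages. For instance, in |x1| ≥ 2 ∧ |x2| ≥ 1, |x1| ≥ 2 requires that x1 belongs to the regular language (a + b) ∘ (a + b) ∘ (a + b)*, while |x2| ≥ 1 requires that x2 belongs to the regular language (a + b) ∘ (a + b)*. ∎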
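
To make the idea of a finite union of products concrete, the following sketch checks a decomposition by brute force. It assumes, for illustration only, that the relation in question is |x1| + |x2| ≥ 3 over the alphabet {a, b}; the regular expressions and function names are ours, written with Python's `re` module.

```python
import re
from itertools import product

# Each pair (L1, L2) is one disjunct: a Cartesian product of two regular languages.
DISJUNCTS = [
    (r"[ab]{3,}", r"[ab]*"),    # |x1| >= 3
    (r"[ab]{2,}", r"[ab]+"),    # |x1| >= 2 and |x2| >= 1
    (r"[ab]+", r"[ab]{2,}"),    # |x1| >= 1 and |x2| >= 2
    (r"[ab]*", r"[ab]{3,}"),    # |x2| >= 3
]


def in_relation(x1, x2):
    """Membership test that only ever inspects x1 and x2 separately."""
    return any(re.fullmatch(l1, x1) is not None and
               re.fullmatch(l2, x2) is not None
               for (l1, l2) in DISJUNCTS)


def strings_up_to(alphabet, max_len):
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            yield "".join(letters)


def decomposition_agrees(max_len=4):
    """Compare the decomposition with the length constraint on short strings."""
    return all(in_relation(x1, x2) == (len(x1) + len(x2) >= 3)
               for x1 in strings_up_to("ab", max_len)
               for x2 in strings_up_to("ab", max_len))
```

The point of the design is that `in_relation` never compares x1 with x2; each disjunct constrains the two components independently, which is exactly what distinguishes a recognisable relation from, say, string equality.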

The equality binary predicate x1 = x2 is a standard non-example of recognisable relations; in fact, expressing x1 = x2 as a union ⋃_{i ∈ I} L_i × L_i' of products requires us to have |L_i| = |L_i'| = 1, which in turn forces us to have an infinite index set I.

The first semantic condition, Regular Monadic Decomposition, is stated as follows.

RegMonDec: For each assertion assert(g(x1,…,xr)) in S, g is a recognisable relation, a representation of which, in terms of Definition 0.3.1, is effectively computable.

When r = 1, the RegMonDec condition requires that g is regular and may be given by an FA A, in which case g = L(A).

The second semantic condition concerns the pre-images of string operations. A string operation f(x1,…,xr) with r parameters (r ≥ 1) gives rise to a relation R_f ⊆ (Σ*)^r × Σ*. Let L ⊆ Σ*. The pre-image of L under R_f, denoted by Pre_{R_f}(L), is

 {(w1,…,wr) ∈ (Σ*)^r | ∃w. w ∈ f(w1,…,wr) and w ∈ L}.

For brevity, we use Pre_{R_f}(A) to denote Pre_{R_f}(L(A)) for an FA A. The second semantic condition, i.e. the inverse relation of f preserves regularity, is formally stated as follows.

RegInvRel: For each operation f in S and each FA A, the pre-image of L(A) under f is a recognisable relation, a representation of which (Definition 0.3.1) can be effectively computed from A and f.

When r = 1, this RegInvRel condition would state that the pre-image of a regular language under the operation f is effectively regular, i.e. an FA can be computed to represent the pre-image of the regular language under f.

Example 0.3.3.

Let . Consider the string function . (Recall that denotes the number of occurrences of in .) We can show that for each FA , is a recognisable relation. Let be an FA. W.l.o.g. we assume that . It is easy to observe that is a finite union of the languages , where are natural number constants. Therefore, to show that is a recognisable relation, it is sufficient to show that is a recognisable relation.

Let us consider the typical situation that and . Then is the disjunction of for with , and , where , . Evidently, and are regular languages. Therefore, is a finite union of cartesian products of regular languages, and thus a recognisable relation. ∎

Not every string operation satisfies the RegInvRel condition, as demonstrated by Example 0.3.4.

Example 0.3.4.

Let us consider the string function on the alphabet that transforms the unary representations of natural numbers into their binary representations, namely, such that and . For instance, . We claim that does not satisfy the RegInvRel condition. To see this, consider the regular language . Then comprises the strings with , which is evidently non-regular. Incidentally, this is an instance of the well-known Cobham’s theorem (cf. [47]) that the sets of numbers definable by finite automata in unary are strictly subsumed by the sets of numbers definable by finite automata in binary. ∎
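
A small experiment illustrates why such a pre-image fails to be regular. The sketch assumes, purely for illustration, that the function maps the unary numeral 1^n to the binary representation of n and that the regular language considered is 10*; the function names and the treatment of the empty string as zero are our assumptions.

```python
import re


def unary_to_binary(word):
    """Map the unary numeral 1^n to the binary numeral of n (n = 0 gives "0")."""
    if set(word) - {"1"}:
        raise ValueError("expected a unary numeral over the letter 1")
    return format(len(word), "b")


def preimage_lengths(pattern, max_len):
    """Lengths n <= max_len such that the image of 1^n matches `pattern`."""
    regex = re.compile(pattern)
    return [n for n in range(max_len + 1)
            if regex.fullmatch(unary_to_binary("1" * n))]
```

Under these assumptions, `preimage_lengths(r"10*", 64)` lists exactly the powers of two up to 64. The gaps between consecutive lengths keep growing, whereas the set of lengths of a regular language over a one-letter alphabet is ultimately periodic, so no FA can recognise this pre-image.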

We are ready to state the main result of this section.

Theorem 0.3.5.

The path feasibility problem is decidable for symbolic executions satisfying the RegMonDec and RegInvRel conditions.

Proof of Theorem 0.3.5.

We present a nondeterministic decision procedure from which the theorem follows.

Let be a symbolic execution, (where ) be the last assignment in , and be the set of all constraints in assertions of that involve (i.e. occurs in for all ). For each , let . Then by the RegMonDec assumption, is a recognisable relation and a representation of it, say with , can be effectively computed.

For each , we nondeterministically choose one tuple (where ), and for all , replace in with . Let denote the resulting program.

We use to denote the set of all the FAs such that , , and occurs in . We then compute the product FA from FAs such that is the intersection of the languages defined by FAs in . By the RegInvRel assumption, is a recognisable relation and a representation of it can be effectively computed.

Let be the symbolic execution obtained from by (1) removing along with all assertions involving (i.e. the assertions for ), and (2) adding the assertion .

It is straightforward to verify that is path-feasible iff there is a nondeterministic choice resulting in that is path-feasible; moreover, is path-feasible iff is path-feasible. Evidently, has one fewer assignment than . Repeating these steps, the procedure terminates when becomes a conjunction of assertions on input variables, the feasibility of which can be checked via language nonemptiness checking of FAs. To sum up, the correctness of the (nondeterministic) procedure follows since path feasibility is preserved by each step, and termination is guaranteed by the finite number of assignments. ∎
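The backward elimination of assignments can be sketched for the special case where every assignment is a concatenation z := x ∘ y and every assertion is a membership in an NFA. This is a simplified illustration under those assumptions, not the general procedure: the nondeterministic choice is simulated by exhaustive search over split states, and the data encoding and function names are hypothetical.

```python
from itertools import product

# An NFA is (transitions, initial_state, final_states); transitions is a set
# of (source, symbol, target) triples without epsilon moves.

def states_of(nfa):
    trans, init, finals = nfa
    return {init} | set(finals) | {s for s, _, _ in trans} | {t for _, _, t in trans}

def intersection_nonempty(nfas):
    """Explore the product automaton on the fly, looking for an accepting tuple."""
    start = tuple(init for _, init, _ in nfas)
    symbols = {a for trans, _, _ in nfas for _, a, _ in trans}
    seen, stack = {start}, [start]
    while stack:
        cur = stack.pop()
        if all(q in finals for q, (_, _, finals) in zip(cur, nfas)):
            return True
        for a in symbols:
            moves = [[t for s, b, t in trans if s == q and b == a]
                     for q, (trans, _, _) in zip(cur, nfas)]
            for nxt in product(*moves):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
    return False

def feasible(assignments, constraints):
    """assignments: list of (z, x, y), meaning z := x . y, in program order.
    constraints: dict mapping each variable to the list of NFAs asserted on it."""
    if not assignments:
        # Only assertions on input variables remain: check nonemptiness.
        return all(intersection_nonempty(nfas) for nfas in constraints.values())
    *rest, (z, x, y) = assignments
    on_z = constraints.get(z, [])
    remaining = {v: list(c) for v, c in constraints.items() if v != z}
    # Choose one tuple of the pre-image representation: a split state per NFA on z.
    for split in product(*[list(states_of(a)) for a in on_z]):
        new = {v: list(c) for v, c in remaining.items()}
        for (trans, init, finals), q in zip(on_z, split):
            new.setdefault(x, []).append((trans, init, {q}))   # prefix part
            new.setdefault(y, []).append((trans, q, finals))   # suffix part
        if feasible(rest, new):
            return True
    return False

a_plus = ({(0, "a", 1), (1, "a", 1)}, 0, {1})                # a+
b_plus = ({(0, "b", 1), (1, "b", 1)}, 0, {1})                # b+
ab_plus = ({(0, "a", 1), (1, "b", 2), (2, "a", 1)}, 0, {2})  # (ab)+
ba_plus = ({(0, "b", 1), (1, "a", 2), (2, "b", 1)}, 0, {2})  # (ba)+

prog = [("z", "x", "y")]
print(feasible(prog, {"x": [a_plus], "y": [b_plus], "z": [ab_plus]}))  # True
print(feasible(prog, {"x": [a_plus], "y": [b_plus], "z": [ba_plus]}))  # False
```

For concatenation, each NFA constraining z can be split independently, so the sketch skips the explicit product construction on z and defers all intersections to the final nonemptiness check.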

Let us use the following example to illustrate the generic decision procedure.

Example 0.3.6.

Consider the symbolic execution

 assert(x∈A0); y1:=f(x); z:=y1∘y2; assert(y1∈A1); assert(y2∈A2); assert(z∈A3)

where are FAs illustrated in Figure 2, and is the function mentioned in Section 0.1 that nondeterministically outputs a substring delimited by -. First, we remove the assignment as well as the assertion . Moreover, since the pre-image of under , denoted by , is a recognisable relation represented by , we add the assertion , and get the following program

 assert(x∈A0); y1:=f(x); assert(y1∈A1); assert(y2∈A2); assert(g(y1,y2)).

To continue, we nondeterministically choose one tuple, say , from the representation of , and replace with , and get the program

 assert(x∈A0); y1:=f(x); assert(y1∈A1); assert(y2∈A2); assert(y1∈A3(q0,{q1})); assert(y2∈A3(q1,{q0})).

Let be , the set of FAs occurring in the assertions for in the above program. Compute the product and (see Figure 2).

Then we remove , as well as the assertions that involve , namely, and , and add the assertion , resulting in the program

 assert(x∈A0); assert(y2∈A2); assert(y2∈A3(q1,{q0})); assert(x∈A′′).

It is not hard to see that and . Then the assignment , , , and witnesses the path feasibility of the original symbolic execution.∎

Remark 0.3.7.

Theorem 0.3.5 gives two semantic conditions which are sufficient to render the path feasibility problem decidable. A natural question, however, is how to check whether a given symbolic execution satisfies the two semantic conditions. The answer to this meta-question highly depends on the classes of string operations and relations under consideration. Various classes of relations which admit finite representations have been studied in the literature. They include, in an ascending order of expressiveness, recognisable relations, synchronous relations, deterministic rational relations, and rational relations, giving rise to a strict hierarchy. (We note that slightly different terminologies tend to be used in the literature, for instance, synchronous relations in [17] are called regular relations in [6] which are also known as automatic relations, synchronised rational relations, etc. One may consult the survey [21] and [17].) It is known [17] that determining whether a given deterministic rational relation is recognisable is decidable (for binary relations, this can be done in doubly exponential time), and deciding whether a synchronous relation is recognisable can be done in exponential time [17]. Similar results are also mentioned in [8, 41].

By these results, for a given symbolic execution in which the string relations in the assertions and the relations induced by the string operations are all deterministic rational relations, one can check whether it satisfies the two semantic conditions. Hence, one can check algorithmically whether Theorem 0.3.5 is applicable.

0.4 An Expressive Language Satisfying The Semantic Conditions

Section 0.3 has identified general semantic conditions under which the decidability of the path feasibility problem can be attained. Two questions naturally arise:

1. How general are these semantic conditions? In particular, do string functions commonly used in practice satisfy these semantic conditions?

2. What is the computational complexity of checking path feasibility?

The next two sections will be devoted to answering these questions.

For the first question, we shall introduce a syntactically defined string constraint language SL, which includes general string operations such as the replaceAll function and those definable by two-way transducers, as well as recognisable relations. [Here, SL stands for “straight-line” because our work generalises the straight-line logics of [42, 18].] We first recap the replaceAll function that allows a general (i.e. variable) replacement string [18]. Then we give the definition of two-way transducers, whose special case (i.e. one-way transducers) has been given in Section 0.2.

0.4.1 The replaceAll function and two-way transducers

The replaceAll function has three parameters: the first parameter is the subject string, the second parameter is a pattern that is a regular expression, and the third parameter is the replacement string. For the semantics of the replaceAll function, in particular when the pattern is a regular expression, we adopt the leftmost and longest matching. For instance, , since the leftmost and longest matching of in is . Here we require that the language defined by the pattern parameter does not contain the empty string, in order to avoid the troublesome definition of the semantics of the matching of the empty string. We refer the reader to [18] for the formal semantics of the replaceAll function. To be consistent with the notation in this paper, for each regular expression , we define the string function such that for , , and we write as .
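The leftmost and longest matching discipline can be illustrated with a minimal sketch. It assumes the pattern is a plain regular expression in Python `re` syntax and treats the replacement as a literal string; the function name `replace_all` and the sample inputs are chosen for illustration and are not taken from [18]. Note that Python's own `re.sub` is leftmost but not longest for alternations, which the last line demonstrates.

```python
import re

def replace_all(subject, pattern, replacement):
    """Replace every leftmost, longest match of `pattern` in `subject`."""
    regex = re.compile(pattern)
    if regex.fullmatch(""):
        # Mirrors the requirement that the pattern language excludes the empty string.
        raise ValueError("pattern must not match the empty string")
    out, i, n = [], 0, len(subject)
    while i < n:
        # Longest end position j > i with subject[i:j] in the pattern language.
        end = next((j for j in range(n, i, -1)
                    if regex.fullmatch(subject, i, j)), None)
        if end is None:
            out.append(subject[i])   # no match starts here: copy one letter
            i += 1
        else:
            out.append(replacement)  # replace the longest match starting at i
            i = end
    return "".join(out)

print(replace_all("aababaab", "(ab)+", "c"))  # acac
print(replace_all("ab", "a|ab", "c"))         # c
print(re.sub("a|ab", "c", "ab"))              # cb (leftmost, but not longest)
```

The quadratic scan over end positions keeps the sketch short; an automaton-based implementation would avoid the repeated matching.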

As in the one-way case, we start with a definition of two-way finite-state automata.

Definition 0.4.1 (Two-way finite-state automata).

A (nondeterministic) two-way finite-state automaton (2FA) over a finite alphabet is a tuple where are as in FAs, (resp. ) is a left (resp. right) input tape end marker, and the transition relation , where . Here, we assume that there are no transitions that take the head of the tape past the left/right end marker (i.e.  for every ).

Whenever they can be easily understood, we will not mention , , and in .

The notion of runs of 2FA on an input string is exactly the same as that of Turing machines on a read-only input tape. More precisely, for a string , a run of on is a sequence of pairs defined as follows. Let and . The following conditions then have to be satisfied: , and for every , we have and for some .
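Since a 2FA only reads its tape, the set of configurations (state, head position) reachable by some run is finite and can be computed by a graph search. The sketch below assumes that the head starts on the left end marker at position 0, that transitions are tuples (state, symbol, direction, next state) with direction −1 or +1, and it uses the ASCII characters `<` and `>` as stand-ins for the two end markers; these conventions and all names are assumptions of the sketch.

```python
from collections import deque

LEFT, RIGHT = "<", ">"  # stand-ins for the left and right end markers

def reachable_configs(delta, q0, word):
    """All (state, head position) pairs reachable on the tape LEFT word RIGHT."""
    tape = [LEFT, *word, RIGHT]
    start = (q0, 0)
    seen, queue = {start}, deque([start])
    while queue:
        q, i = queue.popleft()
        for (p, a, d, p2) in delta:
            j = i + d
            # Follow a transition that reads the current cell and stays on the tape.
            if p == q and a == tape[i] and 0 <= j < len(tape):
                if (p2, j) not in seen:
                    seen.add((p2, j))
                    queue.append((p2, j))
    return seen

# Hypothetical 2FA: scan right to the end marker, step back, and check that
# the last letter is 'a'; if so, return to the right end marker in state "acc".
delta = {
    ("s", LEFT, +1, "s"), ("s", "a", +1, "s"), ("s", "b", +1, "s"),
    ("s", RIGHT, -1, "back"), ("back", "a", +1, "acc"),
}
print(("acc", 3) in reachable_configs(delta, "s", "ba"))  # True
print(("acc", 3) in reachable_configs(delta, "s", "ab"))  # False
```

Deciding acceptance then amounts to checking whether a configuration satisfying the acceptance condition is in the returned set.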

The run is said to be accepting if and . A string is accepted by if there is an accepting run of on . The set of strings accepted by is denoted by , a.k.a., the language recognised by . The size