1. Introduction
The problem of automatically solving string constraints (aka satisfiability of logical theories over strings) has recently witnessed renewed interest (Saxena et al., 2010; Trinh et al., 2016; Lin and Barceló, 2016; Yu et al., 2014; Trinh et al., 2014; Abdulla et al., 2014, 2017; D’Antoni and Veanes, 2013; Veanes et al., 2012; Hooimeijer et al., 2011; Kiezun et al., 2012; Liang et al., 2014; Zheng et al., 2013; Wang et al., 2016; Bjørner et al., 2009) because of important applications in the analysis of string-manipulating programs. For example, program analysis techniques like symbolic execution (King, 1976; Godefroid et al., 2005; Cadar et al., 2006; Sen et al., 2013) systematically explore executions in a program and collect symbolic path constraints, which can then be solved using a constraint solver and used to determine which location in the program to continue exploring. To successfully apply a constraint solver in this instance, it is crucial that the constraint language precisely models the data types in the program, along with the data-type operations used. In the context of string-manipulating programs, this could include concatenation, regular constraints (i.e. pattern matching against a regular expression), string-length functions, and string-replace functions, among many others.
Perhaps the most well-known theory of strings for such applications as the analysis of string-manipulating programs is the theory of strings with concatenation (aka word equations), whose decidability was shown by Makanin (Makanin, 1977) in 1977 after it had been open for many years. More importantly, this theory remains decidable even when regular constraints are incorporated into the language (Schulz, 1990). However, whether adding the string-length function preserves decidability remains a long-standing open problem (Ganesh et al., 2012; Büchi and Senger, 1990).
Another important string operation, especially in popular scripting languages like Python, JavaScript, and PHP, is the string-replace function, which may be used to replace either the first occurrence or all occurrences of a string (a string constant/variable, or a regular expression) by another string (a string constant/variable). The replace function (especially the replace-all functionality) is omnipresent in HTML5 applications (Lin and Barceló, 2016; Trinh et al., 2016; Yu et al., 2014).
For example, a standard industry defense against cross-site scripting (XSS) vulnerabilities includes sanitising untrusted strings before adding them into the DOM (Document Object Model) or the HTML document. This is typically done by various metacharacter-escaping mechanisms (see, for instance, (Kern, 2014; Hooimeijer et al., 2011; Williams et al., 2017)). An example of such a mechanism is backslash-escape, which replaces every occurrence of quotes and double quotes (i.e. ' and ") in the string by \' and \".
In addition to sanitisers, common JavaScript functionalities like document.write() and innerHTML apply an implicit browser transduction, which decodes HTML codes (e.g. &#39; is replaced by ') in the input string, before inserting the input string into the DOM. Both of these examples can be expressed by (perhaps multiple) applications of the string-replace function.
Moreover, although these examples replace constants by constants, the popularity of template systems such as Mustache (Wanstrath, 2009) and Closure Templates (Google, 2015) demonstrates the need for replacements involving variables.
Using Mustache, a web developer, for example, may define an HTML fragment with placeholders that is instantiated with user data during the construction of the delivered page.
Example 1.1.
We give a simple example demonstrating a (naive) XSS vulnerability to illustrate the use of string-replace functions.
Consider the HTML fragment below.
<h1> User <span onMouseOver="popupText('{{bio}}')">{{userName}}</span> </h1>
This HTML fragment is a template as might be used with systems such as Mustache to display a user on a webpage.
For each user that is to be displayed (with their username and biography stored in variables user and bio respectively), the string {{userName}} will be replaced by user and the string {{bio}} will be replaced by bio. For example, a user Amelia with biography Amelia was born in 1979... would result in the HTML below.
<h1> User
<span onMouseOver="popupText('Amelia was born in 1979...')">
Amelia </span> </h1>
This HTML would display User Amelia, and, when the mouse is placed over Amelia, her biography would appear, thanks to the onMouseOver attribute in the span element.
Unfortunately, this template could be insecure if the user biography is not adequately sanitised: a user could enter a malicious biography, such as '); alert('Boo!'); alert(' which would cause the following instantiation of the span element (see footnote 1).
<span onMouseOver="popupText(''); alert('Boo!'); alert('')">
Now, when the mouse is placed over the user name, the malicious JavaScript alert('Boo!') is executed.
Footnote 1: Readers familiar with Mustache and Closure Templates may expect single quotes to be automatically escaped. However, we have tested our example with the latest versions of mustache.js (Lehnardt and contributors, 2015) and Closure Templates (Google, 2015) (as of July 2017) and observed that the exploit is not disarmed by their automatic escaping features.
The presence of such malicious injections of code can be detected using string constraint solving and XSS attack patterns given as regular expressions (Balzarotti et al., 2008; Saxena et al., 2010; Yu et al., 2014). For our example, given an attack pattern $P$ and template $\mathit{temp}$, we would generate the constraint
$$x_1 = \mathsf{replaceAll}(\mathit{temp}, \texttt{\{\{userName\}\}}, \mathit{user}) \wedge x_2 = \mathsf{replaceAll}(x_1, \texttt{\{\{bio\}\}}, \mathit{bio}) \wedge x_2 \in P$$
which would detect if the HTML generated by instantiating the template is susceptible to the attack identified by $P$. ∎
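The instantiation in Example 1.1 can be mimicked concretely. The following is a minimal Python sketch, not the constraint-solving approach itself: it instantiates the template with two plain string replacements and tests the result against an attack pattern. The regular expression `ATTACK` and the function names are assumptions chosen for illustration; the text does not fix a particular attack pattern.

```python
import re

# Template from Example 1.1; the placeholders are replaced as plain strings.
TEMPLATE = (
    "<h1> User <span onMouseOver=\"popupText('{{bio}}')\">"
    "{{userName}}</span> </h1>"
)

# Hypothetical attack pattern (an assumption for illustration): the event
# handler contains a further statement after the popupText(...) call.
ATTACK = re.compile(r'''onMouseOver="popupText\('[^']*'\);\s*\S''')


def instantiate(template: str, user: str, bio: str) -> str:
    """Mirror the two replace-all steps: first userName, then bio."""
    x1 = template.replace("{{userName}}", user)
    x2 = x1.replace("{{bio}}", bio)
    return x2


def is_susceptible(html: str) -> bool:
    """Check whether the instantiated HTML matches the attack pattern."""
    return ATTACK.search(html) is not None
```

A solver would search for values of user and bio that make the check succeed; here the two values are simply supplied by hand, for instance the benign and the malicious biography from the example.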
In general, the string-replace function has three parameters, and in current mainstream languages such as Python and JavaScript, all three parameters can be supplied as string variables. As a result, when performing program analysis, for instance to detect security vulnerabilities as described above, one often obtains string constraints of the form $z = \mathsf{replaceAll}(x, p, y)$, where $x, y$ are string constants/variables, and $p$ is either a string constant/variable or a regular expression. Such a constraint means that $z$ is obtained by replacing all occurrences of $p$ in $x$ with $y$. For convenience, we call $x$, $p$, and $y$ the subject, the pattern, and the replacement parameters respectively.
The $\mathsf{replaceAll}$ function is a powerful string operation that goes beyond the expressiveness of concatenation. (Conversely, as we will see later, concatenation can be expressed by the $\mathsf{replaceAll}$ function easily.) It was shown in a recent POPL paper (Lin and Barceló, 2016) that any theory of strings containing the string-replace function (even the most restricted version where pattern/replacement strings are both constant strings) becomes undecidable if we do not impose some kind of straight-line restriction on the formulas (see footnote 2). Nonetheless, as already noted in (Lin and Barceló, 2016), the straight-line restriction is reasonable since it is typically satisfied by constraints that are generated by symbolic execution, e.g., all constraints in the standard Kaluza benchmarks (Saxena et al., 2010) with 50,000+ test cases generated by symbolic execution on JavaScript applications were noted in (Ganesh et al., 2012) to satisfy this condition. Intuitively, as elegantly described in (Bjørner et al., 2009), constraints from symbolic execution on string-manipulating programs can be viewed as the problem of path feasibility over loopless string-manipulating programs with variable assignments and assertions, i.e., generated by the grammar
$$S ::= y := f(x_1, \ldots, x_n) \mid \mathsf{assert}(g(x_1, \ldots, x_n)) \mid S_1; S_2$$
where $f$ and $g$ are some string functions. Straight-line programs with assertions can be obtained by turning such programs into Static Single Assignment (SSA) form (i.e. introducing a new variable on the left-hand side of each assignment). A partial decidability result can be deduced from (Lin and Barceló, 2016) for the straight-line fragment of the theory of strings, where (1) $f$ in the above grammar is either a concatenation of string constants and variables, or the $\mathsf{replaceAll}$ function where the pattern and the replacement are both string constants, and (2) $g$ is a boolean combination of regular constraints. In fact, the decision procedure therein admits finite-state transducers, which subsume only the aforementioned simple form of the $\mathsf{replaceAll}$ function. The decidability boundary of the straight-line fragment involving the $\mathsf{replaceAll}$ function in its general form (e.g., when the replacement parameter is a variable) remains open.
Footnote 2: Similar notions that appear in the literature of string constraints (without replace) include acyclicity (Abdulla et al., 2014) and solved form (Ganesh et al., 2012).
Contribution.
We investigate the decidability boundary of the theory of strings involving the $\mathsf{replaceAll}$ function and regular constraints, with the straight-line restriction introduced in (Lin and Barceló, 2016). We provide a decidability result for a large fragment of $\mathsf{SL}[\mathsf{replaceAll}]$, which is sufficiently powerful to express the concatenation operator. We show that this decidability result is in a sense maximal by showing that several important natural extensions of the logic result in undecidability. We detail these results below:

If the pattern parameters of the $\mathsf{replaceAll}$ function are allowed to be variables, then the satisfiability of $\mathsf{SL}[\mathsf{replaceAll}]$ is undecidable (cf. Proposition 4.1).

If the pattern parameters of the $\mathsf{replaceAll}$ function are regular expressions, then the satisfiability of $\mathsf{SL}[\mathsf{replaceAll}]$ is decidable and in EXPSPACE (cf. Theorem 4.2). In addition, we show that the satisfiability problem is PSPACE-complete for several cases that are meaningful in practice (cf. Corollary 4.7). This strictly generalises the decidability result in (Lin and Barceló, 2016) of the straight-line fragment with concatenation, regular constraints, and the $\mathsf{replaceAll}$ function where pattern/replacement parameters are constant strings.

If $\mathsf{SL}[\mathsf{replaceAll}]$, where the pattern parameter of the $\mathsf{replaceAll}$ function is a constant letter, is extended with the string-length constraint, then satisfiability becomes undecidable again. In fact, this undecidability can be obtained with either integer constraints, character constraints, or constraints involving the $\mathsf{IndexOf}$ function (cf. Theorem 9.4 and Proposition 9.6).
Our decision procedure for $\mathsf{SL}[\mathsf{replaceAll}]$ where the pattern parameters of the $\mathsf{replaceAll}$ function are regular expressions follows an automata-theoretic approach. The key idea can be illustrated as follows. Let us consider the simple formula $C \equiv x = \mathsf{replaceAll}(y, e, z) \wedge x \in e_1 \wedge y \in e_2 \wedge z \in e_3$. Suppose that $\mathcal{A}_1, \mathcal{A}_2, \mathcal{A}_3$ are the nondeterministic finite-state automata corresponding to $e_1, e_2, e_3$ respectively. We effectively eliminate the use of $\mathsf{replaceAll}$ by nondeterministically generating from $\mathcal{A}_1$ a new regular constraint $\mathcal{A}'_2$ for $y$ as well as a new regular constraint $\mathcal{A}'_3$ for $z$. These constraints incorporate the effect of the $\mathsf{replaceAll}$ function (i.e. all regular constraints are on the “source” variables). Then, the satisfiability of $C$ is turned into testing the nonemptiness of the intersection of $\mathcal{A}_2$ and $\mathcal{A}'_2$, as well as the nonemptiness of the intersection of $\mathcal{A}_3$ and $\mathcal{A}'_3$. When there are multiple occurrences of the $\mathsf{replaceAll}$ function, this process can be iterated. Our decision procedure enjoys the following advantages:

It is automata-theoretic and built on clean automaton constructions; moreover, when the formula is satisfiable, a solution can be synthesised. For example, in the aforementioned XSS vulnerability detection example, one can synthesise the values of the variables $\mathit{user}$ and $\mathit{bio}$ for a potential attack.

The decision procedure is modular in that the $\mathsf{replaceAll}$ terms are removed one by one to generate more and more regular constraints (emptiness of the intersection of regular constraints could be efficiently handled by state-of-the-art solvers like (Wang et al., 2016)).

The decision procedure requires exponential space (thus double exponential time), but under assumptions that are reasonable in practice, the decision procedure uses only polynomial space, which is not worse than other string logics (which can encode the PSPACE-complete problem of checking emptiness of the intersection of regular constraints).
Organisation.
This paper is organised as follows: Preliminaries are given in Section 2. The core string language is defined in Section 3. The main results of this paper are summarised in Section 4. The decision procedure is presented in Sections 6–8, case by case. The extensions of the core string language are investigated in Section 9. Related work can be found in Section 10. The appendix contains missing proofs and additional examples.
2. Preliminaries
General Notation
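Let $\mathbb{Z}$ and $\mathbb{N}$ denote the set of integers and natural numbers respectively. For $k \in \mathbb{N}$, let $[k] = \{1, \ldots, k\}$. For a vector $\vec{x} = (x_1, \ldots, x_n)$, let $|\vec{x}|$ denote the length of $\vec{x}$ (i.e., $n$) and $\vec{x}(i)$ denote $x_i$ for each $i \in [n]$.
Regular Languages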
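Fix a finite alphabet $\Sigma$. Elements in $\Sigma^*$ are called strings. Let $\varepsilon$ denote the empty string and $\Sigma^+ = \Sigma^* \setminus \{\varepsilon\}$. We will use $a, b, \ldots$ to denote letters from $\Sigma$ and $u, v, w, \ldots$ to denote strings from $\Sigma^*$. For a string $u \in \Sigma^*$, let $|u|$ denote the length of $u$ (in particular, $|\varepsilon| = 0$). A position of a nonempty string $u$ of length $n$ is a number $i \in [n]$ (note that the first position is $1$, instead of $0$). In addition, for $i \in [|u|]$, let $u(i)$ denote the $i$-th letter of $u$. For two strings $u_1, u_2$, we use $u_1 \cdot u_2$ to denote the concatenation of $u_1$ and $u_2$, that is, the string $v$ such that $|v| = |u_1| + |u_2|$ and for each $i \in [|u_1|]$, $v(i) = u_1(i)$, and for each $i \in [|u_2|]$, $v(|u_1| + i) = u_2(i)$. Let $u, v$ be two strings. If $v = u \cdot v'$ for some string $v'$, then $u$ is said to be a prefix of $v$. In addition, if $u \neq v$, then $u$ is said to be a strict prefix of $v$. If $u$ is a prefix of $v$, that is, $v = u \cdot v'$ for some string $v'$, then we use $u^{-1} v$ to denote $v'$. In particular, $\varepsilon^{-1} v = v$.
A language over $\Sigma$ is a subset of $\Sigma^*$. We will use $L_1, L_2, \ldots$ to denote languages. For two languages $L_1, L_2$, we use $L_1 \cup L_2$ to denote the union of $L_1$ and $L_2$, and $L_1 \cdot L_2$ to denote the concatenation of $L_1$ and $L_2$, that is, the language $\{u_1 \cdot u_2 \mid u_1 \in L_1, u_2 \in L_2\}$. For a language $L$ and $n \in \mathbb{N}$, we define $L^n$, the iteration of $L$ for $n$ times, inductively as follows: $L^0 = \{\varepsilon\}$ and $L^n = L \cdot L^{n-1}$ for $n > 0$. We also use $L^*$ to denote the iteration of $L$ for arbitrarily many times, that is, $L^* = \bigcup_{n \in \mathbb{N}} L^n$. Moreover, let $L^+ = \bigcup_{n \in \mathbb{N} \setminus \{0\}} L^n$.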
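Definition 2.1 (Regular expressions).
$$e ::= \emptyset \mid \varepsilon \mid a \mid e + e \mid e \circ e \mid e^*, \quad \text{where } a \in \Sigma.$$
Since $+$ is associative and commutative, we also write $(e_1 + e_2) + e_3$ as $e_1 + e_2 + e_3$ for brevity. We use the abbreviation $e^+ \equiv e \circ e^*$. Moreover, for $\Gamma = \{a_1, \ldots, a_n\} \subseteq \Sigma$, we use the abbreviations $\Gamma \equiv a_1 + \cdots + a_n$ and $\Gamma^* \equiv (a_1 + \cdots + a_n)^*$.
We define $\mathcal{L}(e)$ to be the language defined by $e$, that is, the set of strings that match $e$, inductively as follows: $\mathcal{L}(\emptyset) = \emptyset$, $\mathcal{L}(\varepsilon) = \{\varepsilon\}$, $\mathcal{L}(a) = \{a\}$, $\mathcal{L}(e_1 + e_2) = \mathcal{L}(e_1) \cup \mathcal{L}(e_2)$, $\mathcal{L}(e_1 \circ e_2) = \mathcal{L}(e_1) \cdot \mathcal{L}(e_2)$, $\mathcal{L}(e_1^*) = (\mathcal{L}(e_1))^*$. In addition, we use $|e|$ to denote the number of symbols occurring in $e$.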
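A nondeterministic finite automaton (NFA) $\mathcal{A}$ on $\Sigma$ is a tuple $(Q, \delta, q_0, F)$, where $Q$ is a finite set of states, $q_0 \in Q$ is the initial state, $F \subseteq Q$ is the set of final states, and $\delta \subseteq Q \times \Sigma \times Q$ is the transition relation. For a string $w = a_1 \ldots a_n$, a run of $\mathcal{A}$ on $w$ is a state sequence $q_0 \ldots q_n$ such that for each $i \in [n]$, $(q_{i-1}, a_i, q_i) \in \delta$. A run $q_0 \ldots q_n$ is accepting if $q_n \in F$. A string $w$ is accepted by $\mathcal{A}$ if there is an accepting run of $\mathcal{A}$ on $w$. We use $\mathcal{L}(\mathcal{A})$ to denote the language defined by $\mathcal{A}$, that is, the set of strings accepted by $\mathcal{A}$. We will use $\mathcal{A}, \mathcal{B}, \ldots$ to denote NFAs. For a string $w = a_1 \ldots a_n$, we also use the notation $q_1 \xrightarrow{w}_{\mathcal{A}} q_{n+1}$ to denote the fact that there are $q_2, \ldots, q_n \in Q$ such that for each $i \in [n]$, $(q_i, a_i, q_{i+1}) \in \delta$. For an NFA $\mathcal{A} = (Q, \delta, q_0, F)$ and $q, q' \in Q$, we use $\mathcal{A}(q, q')$ to denote the NFA obtained from $\mathcal{A}$ by changing the initial state to $q$ and the set of final states to $\{q'\}$. The size of an NFA $\mathcal{A}$, denoted by $|\mathcal{A}|$, is defined as $|Q|$, the number of states. For convenience, we will also call an NFA without initial and final states, that is, a pair $(Q, \delta)$, a transition graph.
It is well-known (e.g. see (Hopcroft and Ullman, 1979)) that regular expressions and NFAs are expressively equivalent, and generate precisely all regular languages. In particular, from a regular expression, an equivalent NFA can be constructed in linear time. Moreover, regular languages are closed under Boolean operations, i.e., union, intersection, and complementation. In particular, given two NFAs $\mathcal{A}_1 = (Q_1, \delta_1, q_{0,1}, F_1)$ and $\mathcal{A}_2 = (Q_2, \delta_2, q_{0,2}, F_2)$ on $\Sigma$, the intersection $\mathcal{L}(\mathcal{A}_1) \cap \mathcal{L}(\mathcal{A}_2)$ is recognised by the product automaton $\mathcal{A}_1 \times \mathcal{A}_2$ of $\mathcal{A}_1$ and $\mathcal{A}_2$ defined as $(Q_1 \times Q_2, \delta, (q_{0,1}, q_{0,2}), F_1 \times F_2)$, where $\delta$ comprises the transitions $((q_1, q_2), a, (q'_1, q'_2))$ such that $(q_1, a, q'_1) \in \delta_1$ and $(q_2, a, q'_2) \in \delta_2$.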
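The product construction just described can be sketched in a few lines of Python. This is a minimal illustration under an assumed encoding (an NFA as a named tuple of states, a set of transition triples, an initial state, and a set of final states); the two example automata are hypothetical and not taken from the paper.

```python
from typing import NamedTuple


class NFA(NamedTuple):
    states: frozenset
    delta: frozenset   # set of triples (q, letter, q')
    init: object
    finals: frozenset


def accepts(nfa: NFA, word: str) -> bool:
    """Simulate all runs at once by tracking the set of reachable states."""
    current = {nfa.init}
    for letter in word:
        current = {q2 for (q1, a, q2) in nfa.delta
                   if q1 in current and a == letter}
    return bool(current & nfa.finals)


def product_nfa(a1: NFA, a2: NFA) -> NFA:
    """Product automaton: both components move on the same letter."""
    states = frozenset((p, q) for p in a1.states for q in a2.states)
    delta = frozenset(((p1, a, ), ) for (p1, a, _) in ()) or frozenset(
        ((p1, q1), a, (p2, q2))
        for (p1, a, p2) in a1.delta
        for (q1, b, q2) in a2.delta
        if a == b
    )
    finals = frozenset((p, q) for p in a1.finals for q in a2.finals)
    return NFA(states, delta, (a1.init, a2.init), finals)


# Hypothetical examples over {a, b}: an even number of a's, and "ends in b".
EVEN_A = NFA(frozenset({0, 1}),
             frozenset({(0, "a", 1), (1, "a", 0), (0, "b", 0), (1, "b", 1)}),
             0, frozenset({0}))
ENDS_B = NFA(frozenset({0, 1}),
             frozenset({(0, "a", 0), (0, "b", 0), (0, "b", 1)}),
             0, frozenset({1}))
BOTH = product_nfa(EVEN_A, ENDS_B)
```

The product has $|Q_1| \cdot |Q_2|$ states, which is why intersecting many regular constraints on the same variable is the source of the PSPACE lower bound mentioned later in the text.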
Graph-Theoretical Notation
A DAG (directed acyclic graph) $G$ is a finite directed graph $(V, E)$ with no directed cycles, where $V$ (resp. $E \subseteq V \times V$) is a set of vertices (resp. edges). Equivalently, a DAG is a directed graph that has a topological ordering, which is a sequence of the vertices such that every edge is directed from an earlier vertex to a later vertex in the sequence. An edge $(v, v')$ in $G$ is called an incoming edge of $v'$ and an outgoing edge of $v$. If $(v, v') \in E$, then $v'$ is called a successor of $v$ and $v$ is called a predecessor of $v'$. A path $\pi$ in $G$ is a sequence $v_0 e_1 v_1 \ldots v_{n-1} e_n v_n$ such that for each $i \in [n]$, we have $e_i = (v_{i-1}, v_i) \in E$. The length of the path $\pi$ is the number $n$ of edges in $\pi$. If there is a path from $v$ to $v'$ (resp. from $v'$ to $v$) in $G$, then $v'$ is said to be reachable (resp. co-reachable) from $v$ in $G$. If $v$ is reachable from $v'$ in $G$, then $v'$ is also called an ancestor of $v$ in $G$. In addition, an edge $(w, w')$ is said to be reachable (resp. co-reachable) from $v$ if $w$ is reachable from $v$ (resp. $w'$ is co-reachable from $v$). The in-degree (resp. out-degree) of a vertex $v$ is the number of incoming (resp. outgoing) edges of $v$. A subgraph $G'$ of $G = (V, E)$ is a directed graph $(V', E')$ with $V' \subseteq V$ and $E' \subseteq E$. Let $G'$ be a subgraph of $G$. Then $G \setminus G'$ is the graph obtained from $G$ by removing all the edges in $G'$.
Computational Complexity
In this paper, we study not only decidability but also the complexity of string logics. In particular, we shall deal with the following computational complexity classes (see (Hopcroft and Ullman, 1979) for more details): PSPACE (problems solvable in polynomial space and thus in exponential time), and EXPSPACE (problems solvable in exponential space and thus in double exponential time). Verification problems that have complexity PSPACE or beyond (see (Baier and Katoen, 2008) for a few examples) have substantially benefited from techniques such as symbolic model checking (McMillan, 1993).
3. The core constraint language
In this section, we define a general string constraint language that supports concatenation, the $\mathsf{replaceAll}$ function, and regular constraints. Throughout this section, we fix an alphabet $\Sigma$.
3.1. Semantics of the $\mathsf{replaceAll}$ Function
To define the semantics of the $\mathsf{replaceAll}$ function, we note that the function takes three parameters: the first parameter is the subject string, the second parameter is a pattern that is a string or a regular expression, and the third parameter is the replacement string. When the pattern parameter is a string, the semantics is largely self-explanatory. However, when it is a regular expression, there is no consensus on the semantics even among mainstream programming languages such as Python and JavaScript. This is particularly the case when interpreting the union (aka alternation) operator in regular expressions or performing a $\mathsf{replaceAll}$ with a pattern that matches $\varepsilon$. In this paper, we mainly focus on the semantics of leftmost and longest matching. Our handling of $\varepsilon$ matches is consistent with our testing of the implementation in Python and the sed command with the posix flag. We also assume union is commutative (e.g. $\mathsf{replaceAll}(aa, a + aa, b) = \mathsf{replaceAll}(aa, aa + a, b) = b$) as specified by POSIX, but often ignored in practice (where $bb$ is a common result in the former case).
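Definition 3.1.
Let $u, v$ be two strings such that $v = v_1 u v_2$ for some $v_1, v_2$ and $e$ be a regular expression. We say that $u$ is the leftmost and longest matching of $e$ in $v$ if one of the following two conditions holds,

case $\varepsilon \notin \mathcal{L}(e)$:

leftmost: $u \in \mathcal{L}(e)$, and $(v'_1)^{-1} v \notin \mathcal{L}(e \circ \Sigma^*)$ for every strict prefix $v'_1$ of $v_1$,

longest: for every nonempty prefix $v'_2$ of $v_2$, $u \cdot v'_2 \notin \mathcal{L}(e)$.


case $\varepsilon \in \mathcal{L}(e)$:

leftmost: $u \in \mathcal{L}(e)$, and $v_1 = \varepsilon$,

longest: for every nonempty prefix $v'_2$ of $v_2$, $u \cdot v'_2 \notin \mathcal{L}(e)$.
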
Example 3.2.
Let us first consider , , , , , and . Then , and the leftmost and longest matching of in is . This is because , (notice that has only one strict prefix, i.e. ), and none of , , and belong to (notice that has three nonempty prefixes, i.e. ). For another example, let us consider , , , , , and . Then and the leftmost and longest matching of in is . This is because , , and . On the other hand, similarly, one can verify that the leftmost and longest matching of in is .
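Definition 3.3.
The semantics of $\mathsf{replaceAll}(u, e, v)$, where $u, v$ are strings and $e$ is a regular expression, is defined inductively as follows:

if $u \notin \mathcal{L}(\Sigma^* \circ e \circ \Sigma^*)$, that is, $u$ does not contain any substring from $\mathcal{L}(e)$, then $\mathsf{replaceAll}(u, e, v) = u$,

otherwise,

if $\varepsilon \in \mathcal{L}(e)$ and $u$ is the leftmost and longest matching of $e$ in $u$, then $\mathsf{replaceAll}(u, e, v) = v$,

if $\varepsilon \in \mathcal{L}(e)$, $u = u_1 \cdot a \cdot u_2$, $u_1$ is the leftmost and longest matching of $e$ in $u$, and $a \in \Sigma$, then $\mathsf{replaceAll}(u, e, v) = v \cdot a \cdot \mathsf{replaceAll}(u_2, e, v)$,

if $\varepsilon \notin \mathcal{L}(e)$, $u = u_1 \cdot u_2 \cdot u_3$, and $u_2$ is the leftmost and longest matching of $e$ in $u$, then $\mathsf{replaceAll}(u, e, v) = u_1 \cdot v \cdot \mathsf{replaceAll}(u_3, e, v)$.
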
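Example 3.4.
At first, $\mathsf{replaceAll}(abab, ab, d) = dd$ and $\mathsf{replaceAll}(baab, a^+, b) = bbb$. In addition, $\mathsf{replaceAll}(aaaa, \varepsilon, d) = dadadadad$ and $\mathsf{replaceAll}(baac, a^*, b) = bbbcb$. The argument for $\mathsf{replaceAll}(baac, a^*, b) = bbbcb$ proceeds as follows: The leftmost and longest matching of $a^*$ in $baac$ is $u_1 = \varepsilon$, where $baac = u_1 \cdot b \cdot u_2$ and $u_2 = aac$. Then $\mathsf{replaceAll}(baac, a^*, b) = b \cdot b \cdot \mathsf{replaceAll}(aac, a^*, b)$. Since $aa$ is the leftmost and longest matching of $a^*$ in $aac$, we have $\mathsf{replaceAll}(aac, a^*, b) = b \cdot c \cdot \mathsf{replaceAll}(\varepsilon, a^*, b) = bcb$. Therefore, we get $\mathsf{replaceAll}(baac, a^*, b) = bbbcb$. (The readers are invited to test this in Python and sed.)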
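The leftmost-and-longest semantics can be prototyped directly. The sketch below is one reading of Definition 3.3, written as a brute-force search rather than an efficient matcher: it uses Python's `re.Pattern.fullmatch` only as a membership test for $\mathcal{L}(e)$ on candidate substrings and does the replacement itself. It deliberately avoids `re.sub`, whose alternation is ordered and whose treatment of empty matches has varied across Python versions. The function name `replace_all` and the inputs used to exercise it are illustrative assumptions.

```python
import re


def replace_all(u: str, e: str, v: str) -> str:
    """Replace every leftmost-and-longest match of regex e in u by v."""
    pat = re.compile(e)
    matches_empty = pat.fullmatch("") is not None
    out = []
    i, n = 0, len(u)
    while True:
        # Leftmost: smallest start s >= i at which some substring matches.
        found = None
        for s in range(i, n + 1):
            ends = [t for t in range(s, n + 1) if pat.fullmatch(u, s, t)]
            if ends:
                found = (s, max(ends))  # Longest: largest matching end.
                break
        if found is None:
            out.append(u[i:])  # No further match: copy the rest unchanged.
            break
        s, t = found
        out.append(u[i:s])
        out.append(v)
        if matches_empty:
            # Pattern accepts the empty string: the next letter is copied
            # verbatim and matching resumes after it.
            if t == n:
                break
            out.append(u[t])
            i = t + 1
        else:
            i = t
    return "".join(out)
```

Because every candidate substring is tested with a full match, the union operator behaves commutatively here: `replace_all("ab", "a|ab", "c")` and `replace_all("ab", "ab|a", "c")` agree, which is the POSIX-style behaviour the text assumes.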
3.2. Straight-Line String Constraints With the $\mathsf{replaceAll}$ Function
We consider the String data type $\mathsf{Str}$, and assume a countable set of variables $x, y, z, \ldots$ of $\mathsf{Str}$.
Definition 3.5 (Relational and regular constraints).
Relational constraints and regular constraints are defined by the following rules,
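where $x$ is a string variable, and $e$ is a regular expression over $\Sigma$.
For a formula $\varphi$ (resp. $\psi$), let $\mathsf{Vars}(\varphi)$ (resp. $\mathsf{Vars}(\psi)$) denote the set of variables occurring in $\varphi$ (resp. $\psi$). Given a relational constraint $\varphi$, a variable $x$ is called a source variable of $\varphi$ if $\varphi$ does not contain a conjunct of the form $x = s_1 \cdot s_2$ or $x = \mathsf{replaceAll}(\cdot, \cdot, \cdot)$.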
We then notice that, with the $\mathsf{replaceAll}$ function in its general form, the concatenation operation is in fact redundant.
Proposition 3.6.
The concatenation operation ($\cdot$) can be simulated by the $\mathsf{replaceAll}$ function.
Proof.
It is sufficient to observe that a relational constraint $x = y \cdot z$ can be rewritten as
$$x' = \mathsf{replaceAll}(ab, a, y) \wedge x = \mathsf{replaceAll}(x', b, z),$$
where $a, b$ are two fresh letters. ∎
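The idea behind Proposition 3.6 can be checked on concrete strings. The following Python sketch shows one way to realise the rewriting, under the assumption that the two fresh letters are control characters that do not occur in either operand; the function name and default letters are hypothetical choices for illustration.

```python
def concat_via_replace_all(y: str, z: str,
                           a: str = "\u0001", b: str = "\u0002") -> str:
    """Compute y . z using only replace-all on the constant string 'ab'."""
    # Freshness assumption: a and b are distinct single letters that occur
    # in neither operand, so the second replacement cannot touch y.
    if a == b or len(a) != 1 or len(b) != 1:
        raise ValueError("a and b must be distinct single letters")
    if a in y + z or b in y + z:
        raise ValueError("a and b must be fresh")
    x_prime = (a + b).replace(a, y)   # x' = replaceAll(ab, a, y)
    return x_prime.replace(b, z)      # x  = replaceAll(x', b, z)
```

Freshness matters for the second step: if $b$ occurred in $y$, the replacement of $b$ by $z$ would also rewrite part of $y$.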
In light of Proposition 3.6, in the sequel, we will mostly dispense with the concatenation operator and focus on string constraints that involve the $\mathsf{replaceAll}$ function only.
Another example showing the power of the $\mathsf{replaceAll}$ function is that it can simulate the extension of regular expressions with string variables, which is supported by mainstream scripting languages like Python, JavaScript, and PHP. For instance, $x \in y^*$ can be expressed by $x = \mathsf{replaceAll}(x', a, y) \wedge x' \in a^*$, where $x'$ is a fresh variable and $a$ is a fresh letter.
The generality of the constraint language makes it undecidable, even in very simple cases. To retain decidability, we follow (Lin and Barceló, 2016) and focus on the “straight-line fragment” of the language. This straight-line fragment captures the structure of straight-line string-manipulating programs with the $\mathsf{replaceAll}$ string operation.
Definition 3.7 (Straight-line relational constraints).
A relational constraint $\varphi$ with the $\mathsf{replaceAll}$ function is straight-line, if $\varphi := \bigwedge_{1 \le i \le m} x_i = P_i$ such that

$x_1, \ldots, x_m$ are mutually distinct,

for each $i \in [m]$, all the variables in $P_i$ are either source variables, or variables from $\{x_1, \ldots, x_{i-1}\}$.
Remark 3.8.
Checking whether a relational constraint is straight-line can be done in linear time.
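A check in the spirit of Remark 3.8 can be sketched as a single pass over the conjuncts. The encoding below is an assumption made for illustration: a relational constraint is a list of pairs, each giving the left-hand-side variable and the variables occurring on the right-hand side, in the order the conjuncts are written. It is not the paper's algorithm, only a minimal instance of the two conditions of Definition 3.7.

```python
def is_straight_line(assignments) -> bool:
    """assignments: list of (lhs_variable, [variables on the right-hand side])."""
    lhs_vars = [lhs for lhs, _ in assignments]
    # Condition 1: the left-hand-side variables are mutually distinct.
    if len(set(lhs_vars)) != len(lhs_vars):
        return False
    assigned = set(lhs_vars)   # variables that are not source variables
    defined = set()            # left-hand sides seen so far
    for lhs, rhs_vars in assignments:
        # Condition 2: each right-hand-side variable is a source variable
        # or was defined by an earlier conjunct.
        for var in rhs_vars:
            if var in assigned and var not in defined:
                return False
        defined.add(lhs)
    return True
```

With hash-based sets, each variable occurrence is inspected a constant number of times, so the pass is linear in the size of the constraint (in expectation).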
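Definition 3.9 (Straight-line string constraints).
A straight-line string constraint $C$ with the $\mathsf{replaceAll}$ function (denoted by $\mathsf{SL}[\mathsf{replaceAll}]$) is defined as $\varphi \wedge \psi$, where

$\varphi$ is a straight-line relational constraint with the $\mathsf{replaceAll}$ function, and

$\psi$ is a regular constraint.
Example 3.10.
The following string constraint belongs to $\mathsf{SL}[\mathsf{replaceAll}]$: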
4. The satisfiability problem
In this paper, we focus on the satisfiability problem of $\mathsf{SL}[\mathsf{replaceAll}]$, which is formalised as follows.
Given an $\mathsf{SL}[\mathsf{replaceAll}]$ constraint $C$, decide whether $C$ is satisfiable.
To approach this problem, we identify several fragments of $\mathsf{SL}[\mathsf{replaceAll}]$, depending on whether the pattern and the replacement parameters are constants or variables. We shall investigate extensively the satisfiability problem of the fragments of $\mathsf{SL}[\mathsf{replaceAll}]$.
We begin with the case where the pattern parameters of the $\mathsf{replaceAll}$ terms are variables. It turns out that in this case the satisfiability problem of $\mathsf{SL}[\mathsf{replaceAll}]$ is undecidable. The proof is by a reduction from Post’s Correspondence Problem. Due to space constraints we relegate the proof to Appendix A.
Proposition 4.1.
The satisfiability problem of $\mathsf{SL}[\mathsf{replaceAll}]$ is undecidable, if the pattern parameters of the $\mathsf{replaceAll}$ terms are allowed to be variables.
In light of Proposition 4.1, we shall focus on the case that the pattern parameters of the $\mathsf{replaceAll}$ terms are constants, being a single letter, a constant string, or a regular expression. The main result of the paper is summarised as the following Theorem 4.2.
Theorem 4.2.
The satisfiability problem of $\mathsf{SL}[\mathsf{replaceAll}]$ is decidable in EXPSPACE, if the pattern parameters of the $\mathsf{replaceAll}$ terms are regular expressions.
Sections 6 to 8 are devoted to the proof of Theorem 4.2.

We start with the single-letter case that the pattern parameters of the $\mathsf{replaceAll}$ terms are single letters (Section 6),

then consider the constant-string case that the pattern parameters of the $\mathsf{replaceAll}$ terms are constant strings (Section 7),

and finally the regular-expression case that the pattern parameters of the $\mathsf{replaceAll}$ terms are regular expressions (Section 8).
We first introduce a graphical representation of $\mathsf{SL}[\mathsf{replaceAll}]$ formulae as follows.
Definition 4.3 (Dependency graph).
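Suppose $C = \varphi \wedge \psi$ is an $\mathsf{SL}[\mathsf{replaceAll}]$ formula where the pattern parameters of the $\mathsf{replaceAll}$ terms are regular expressions. Define the dependency graph of $C$ as $G_C = (\mathsf{Vars}(\varphi), E_C)$, such that for each $i \in [m]$, if $x_i = \mathsf{replaceAll}(z, e_i, z')$, then $(x_i, (\mathsf{l}, e_i), z) \in E_C$ and $(x_i, (\mathsf{r}, e_i), z') \in E_C$. A final (resp. initial) vertex in $G_C$ is a vertex in $G_C$ without successors (resp. predecessors). The edges labelled by $(\mathsf{l}, e_i)$ and $(\mathsf{r}, e_i)$ are called the $\mathsf{l}$-edges and $\mathsf{r}$-edges respectively. The depth of $G_C$ is the maximum length of the paths in $G_C$. In particular, if $\varphi$ is empty, then the depth of $G_C$ is zero.
Note that $G_C$ is a DAG where the out-degree of each vertex is two or zero.
Definition 4.4 (Diamond index and $\mathsf{l}$-length).
Let $C$ be an $\mathsf{SL}[\mathsf{replaceAll}]$ formula and $G_C = (\mathsf{Vars}(\varphi), E_C)$ be its dependency graph. A diamond $\Delta$ in $G_C$ is a pair of vertex-disjoint simple paths from $z$ to $z'$ for some $z, z' \in \mathsf{Vars}(\varphi)$. The vertices $z$ and $z'$ are called the source and destination vertex of the diamond respectively. A diamond $\Delta_2$ with the source vertex $z_2$ and destination vertex $z'_2$ is said to be reachable from another diamond $\Delta_1$ with the source vertex $z_1$ and destination vertex $z'_1$ if $z_2$ is reachable from $z'_1$ (possibly $z_2 = z'_1$). The diamond index of $G_C$, denoted by $\mathsf{Idx_{dmd}}(G_C)$, is defined as the maximum length of the diamond sequences $\Delta_1 \cdots \Delta_n$ in $G_C$ such that for each $i \in [n-1]$, $\Delta_{i+1}$ is reachable from $\Delta_i$. The $\mathsf{l}$-length of a path in $G_C$ is the number of $\mathsf{l}$-edges in the path. The $\mathsf{l}$-length of $G_C$, denoted by $\mathsf{Len_{lft}}(G_C)$, is the maximum $\mathsf{l}$-length of paths in $G_C$.
For each dependency graph $G_C$, since each diamond uses at least one $\mathsf{l}$-edge, we know that $\mathsf{Idx_{dmd}}(G_C) \le \mathsf{Len_{lft}}(G_C)$.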
Proposition 4.5.
Let be an formula and be its dependency graph. For each pair of distinct vertices in , there are at most different paths from to .
It follows from Proposition 4.5 that for a class of formulae such that is bounded by a constant , there are polynomially many different paths between each pair of distinct vertices in .
Example 4.6.
Let be the dependency graph illustrated in Figure 1. It is easy to see that is . In addition, there are paths from to . If we generalise in Figure 1 to a dependency graph comprising diamonds from to , , from to , and from to respectively, then the diamond index of the resulting dependency graph is and there are paths from to in the graph.
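The growth in the number of paths as diamonds are chained can be observed with a small computation. The Python sketch below builds a simplified chain of diamonds (an assumed shape for illustration, not necessarily the graph in Figure 1, and without the edge labels of a real dependency graph) and counts paths in the resulting DAG by memoised recursion. In such a chain of $n$ diamonds, each diamond offers two independent routes, so the count is $2^n$.

```python
from functools import lru_cache


def diamond_chain(n: int):
    """Edges of a chain x0 => x1 => ... => xn, each step being a diamond."""
    edges = []
    for i in range(n):
        src, dst = f"x{i}", f"x{i + 1}"
        # Two vertex-disjoint routes from src to dst, via l_i and via r_i.
        edges += [(src, f"l{i}"), (f"l{i}", dst),
                  (src, f"r{i}"), (f"r{i}", dst)]
    return edges


def count_paths(edges, src: str, dst: str) -> int:
    """Count the distinct paths from src to dst in a DAG."""
    successors = {}
    for u, v in edges:
        successors.setdefault(u, []).append(v)

    @lru_cache(maxsize=None)
    def paths_from(u: str) -> int:
        if u == dst:
            return 1
        return sum(paths_from(v) for v in successors.get(u, []))

    return paths_from(src)
```

This is why bounding the diamond index by a constant keeps the number of paths between two vertices polynomial, as stated after Proposition 4.5.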
In Sections 6–8, we will apply a refined analysis of the complexity of the decision procedures for proving Theorem 4.2 and get the following results.
Corollary 4.7.
The satisfiability problem is PSPACE-complete for the following fragments of $\mathsf{SL}[\mathsf{replaceAll}]$:

the single-letter case, plus the condition that the diamond indices of the dependency graphs are bounded by a constant $c$,

the constant-string case, plus the condition that the $\mathsf{l}$-lengths of the dependency graphs are bounded by a constant $c$,

the regular-expression case, plus the condition that the $\mathsf{l}$-lengths of the dependency graphs are at most $1$.
Corollary 4.7 partially justifies our choice to present the decision procedures for the single-letter, constant-string, and regular-expression cases separately. Intuitively, when the pattern parameters of the $\mathsf{replaceAll}$ terms become less restrictive, the decision procedures become more involved, and more constraints should be imposed on the dependency graphs in order to achieve the PSPACE upper bound. The PSPACE lower bound follows from the observation that nonemptiness of the intersection of the regular expressions $e_1, \ldots, e_n$ over the alphabet $\{0, 1\}$, which is a PSPACE-complete problem, can be reduced to the satisfiability of the formula $x \in e_1 \wedge \cdots \wedge x \in e_n$, which falls into all fragments of $\mathsf{SL}[\mathsf{replaceAll}]$ specified in Corollary 4.7. We also remark that the restrictions in Corollary 4.7 are partially inspired by benchmarks from practice. Diamond indices (intuitively, the “nesting depth” of $\mathsf{replaceAll}(x, a, x)$) are likely to be small in practice because constraints like $\mathsf{replaceAll}(x, a, x)$ are rather artificial and rarely occur in practice. Moreover, the $\mathsf{l}$-length reflects the nesting depth of replaceAll w.r.t. the first parameter, which is also likely to be small. Finally, for string constraints with concatenation and $\mathsf{replaceAll}$ where pattern/replacement parameters are constants, the diamond index is no greater than the “dimension” defined in (Lin and Barceló, 2016), where it was shown that existing benchmarks mostly have “dimensions” at most three for such string constraints.
5. Outline of Decision Procedures
We describe our decision procedure across three sections (Section 6–Section 8). This means the ideas can be introduced in a step-by-step fashion, which we hope helps the reader. In addition, by presenting separate algorithms, we can give the fine-grained complexity analysis required to show Corollary 4.7. We first outline the main ideas needed by our approach.
We will use automata-theoretic techniques. That is, we make use of the fact that regular expressions can be represented as NFAs. We can then consider a very simple string expression, which is a single regular constraint $x \in e$. It is well-known that an NFA $\mathcal{A}$ can be constructed that is equivalent to $e$. We can also test in nondeterministic logarithmic space (NLOGSPACE) whether there is some word accepted by $\mathcal{A}$, since this amounts to reachability in a directed graph. If this is the case, then this word can be assigned to $x$, giving a satisfying assignment to the constraint. If this is not the case, then there is no satisfying assignment.
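The nonemptiness test with witness extraction described above can be sketched as a breadth-first search over the transition graph. This is a minimal illustration under an assumed encoding (transitions as a set of triples); it trades the space bound mentioned in the text for simplicity, since it stores the visited states explicitly.

```python
from collections import deque


def find_accepted_word(delta, init, finals):
    """Return a shortest word accepted by the NFA, or None if there is none.

    delta is a set of triples (q, letter, q'); finals is a set of states.
    """
    parent = {init: None}   # state -> (previous state, letter read)
    queue = deque([init])
    while queue:
        q = queue.popleft()
        if q in finals:
            # Walk back to the initial state to recover the word.
            letters = []
            while parent[q] is not None:
                q, letter = parent[q]
                letters.append(letter)
            return "".join(reversed(letters))
        for (p, a, p2) in delta:
            if p == q and p2 not in parent:
                parent[p2] = (q, a)
                queue.append(p2)
    return None
```

The returned word is exactly the value that can be assigned to the constrained variable; `None` corresponds to the unsatisfiable case.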
A more complex case is a conjunction of several constraints of the form $x \in e$. If the constraints apply to different variables, they can be treated independently to find satisfying assignments. If the constraints apply to the same variable, then they can be merged into a single NFA. Intuitively, take $x \in e_1$ and $x \in e_2$, and NFAs $\mathcal{A}_1$ and $\mathcal{A}_2$ equivalent to $e_1$ and $e_2$ respectively. We can use the fact that NFAs are closed under intersection and check if there is a word accepted by $\mathcal{A}_1 \times \mathcal{A}_2$. If this is the case, we can construct a satisfying assignment to $x$ from an accepting run of $\mathcal{A}_1 \times \mathcal{A}_2$.
In the general case, however, variables are not independent, but may be related by a use of $\mathsf{replaceAll}$. In this case, we perform a kind of $\mathsf{replaceAll}$ elimination. That is, we successively remove instances of $\mathsf{replaceAll}$ from the constraint, building up an expanded set of regular constraints (represented as automata). Once there are no more instances of $\mathsf{replaceAll}$ we can solve the regular constraints as above. Briefly, we identify some $x = \mathsf{replaceAll}(y, e, z)$ where $x$ does not appear as an argument to any other use of $\mathsf{replaceAll}$. We then transform any regular constraints on $x$ into additional constraints on $y$ and $z$. This allows us to remove the variable $x$ since the extended constraints on $y$ and $z$ are sufficient for determining satisfiability. Moreover, from a satisfying assignment to $y$ and $z$ we can construct a satisfying assignment to $x$ as well. This is the technical part of our decision procedure and is explained in detail in the following sections, for increasingly complex uses of $\mathsf{replaceAll}$.
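To make the elimination step more tangible, here is a small Python sketch for the simplest situation, where the pattern is a single letter. It is a simplification made for illustration: instead of guessing a set of state pairs as a decision procedure would, it fixes a concrete replacement string, computes the pairs of states that this string connects in the automaton for the result variable, and rewires the pattern letter's transitions accordingly. All names are hypothetical. The rewired automaton then accepts a subject string exactly when the original automaton accepts the string after replacement.

```python
from itertools import product


def run(delta, start_states, word):
    """States reachable from start_states by reading word."""
    current = set(start_states)
    for letter in word:
        current = {q2 for (q1, a, q2) in delta if q1 in current and a == letter}
    return current


def accepts(delta, init, finals, word):
    return bool(run(delta, {init}, word) & set(finals))


def eliminate(states, delta, a, z):
    """Transitions of the rewired automaton for a fixed replacement string z."""
    # Pairs (q, q') such that the original automaton can read z from q to q'.
    pairs = {(q, q2) for q in states for q2 in run(delta, {q}, z)}
    kept = {(q1, b, q2) for (q1, b, q2) in delta if b != a}
    return kept | {(q, a, q2) for (q, q2) in pairs}


def agrees_on_short_words(states, delta, init, finals, a, z,
                          alphabet="abc", max_len=3):
    """Compare both sides on every subject string up to max_len."""
    rewired = eliminate(states, delta, a, z)
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            y = "".join(letters)
            if accepts(rewired, init, finals, y) != \
                    accepts(delta, init, finals, y.replace(a, z)):
                return False
    return True
```

In the decision procedure itself the replacement string is not known in advance, so the set of state pairs has to be guessed, and the replacement variable is then constrained to connect exactly those pairs.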
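6. Decision procedure for $\mathsf{SL}[\mathsf{replaceAll}]$: The single-letter case
In this section, we consider the single-letter case, that is, for the $\mathsf{SL}[\mathsf{replaceAll}]$ formula $C = \varphi \wedge \psi$, every term of the form $\mathsf{replaceAll}(z, e, z')$ in $\varphi$ satisfies that $e = a$ for $a \in \Sigma$. We begin by explaining the idea of the decision procedure in the case where there is a single use of a $\mathsf{replaceAll}$ term. Then we describe the decision procedure in full detail.
6.1. A Single Use of $\mathsf{replaceAll}$
Let us start with the simple case that
$$C \equiv x = \mathsf{replaceAll}(y, a, z) \wedge x \in e_1 \wedge y \in e_2 \wedge z \in e_3,$$
where, for $i = 1, 2, 3$, we suppose $\mathcal{A}_i$ is the NFA corresponding to the regular expression $e_i$.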
From the semantics, is satisfiable if and only if can be assigned with strings so that: (1) is obtained from by replacing all the occurrences of in with , and (2) are accepted by respectively. Let be the strings satisfying these two constraints. As is accepted by , there must be an accepting run of on . Let such that for each , . Then and there are states such that
and . Let denote . Then . In addition, let be the NFA obtained from by removing all the transitions first and then adding the transitions for . Then
Therefore, . We deduce that there is such that and . In addition, it is not hard to see that this condition is also sufficient for the satisfiability of . The arguments proceed as follows: Let and . From , we know that there is an accepting run of on . Recall that is obtained from by first removing all the transitions, then adding all the transitions