A Survey on String Constraint Solving

01/31/2020 ∙ by Roberto Amadini, et al. ∙ University of Bologna

String constraint solving refers to solving combinatorial problems involving constraints over string variables. String solving approaches have become popular over the last years given the massive use of strings in different application domains like formal analysis, automated testing, database query processing, and cybersecurity. This paper reports a comprehensive survey on string constraint solving by exploring the large number of approaches that have been proposed over the last decades to solve string constraints.

1. Introduction

Strings are everywhere across and beyond Computer Science. They are a fundamental datatype in all the modern programming languages, and operations on strings frequently occur in disparate fields such as software analysis, model checking, database applications, web security, bioinformatics and so on (path_feas; dyn_test_db; mod_check; DBLP:conf/cav/AbdullaACHRRS14; waptec; js-string; aratha; str_mod_check; DBLP:journals/constraints/BarahonaK08).

Reasoning over strings requires solving arbitrarily complex string constraints, i.e., relations defined on a number of string variables. Typical examples of string constraints are string length, (dis-)equality, concatenation, substring, and regular expression matching.

With the term “string constraint solving” (shortly, string solving or SCS) we refer to the process of modelling, processing, and solving combinatorial problems involving string constraints. We may see SCS as a declarative paradigm which falls in the intersection between constraint solving and combinatorics on words: the user states a problem with string variables and constraints, and a suitable string solver seeks a solution for that problem.

Although works on the combinatorics of words were already published in the 1940s (DBLP:journals/jsyml/Quine46a), the dawn of SCS dates back to the late 1980s, in correspondence with the rise of the Constraint Programming (CP) (handbookCP) and Constraint Logic Programming (CLP) (clp) paradigms. Pioneers in this field were, for example, Trilogy (trilogy), a language providing string, integer and real constraints, and CLP($\Sigma^*$) (clpstar), an instance of the CLP scheme representing strings as regular sets. The latter in particular was the first known attempt to use string constraints like regular membership to denote regular sets.

Later, in the 1990s and 2000s, string solving attracted some interest (e.g., (genlang; cpstr; cpgram; hampi09; mona; lpstr; jsa; drple; php-str)) without, however, leaving a lasting mark. It was only from the 2010s that SCS finally took hold in application domains where string processing plays a central role, such as test-case generation, software verification, model checking and web security. This increased interest motivated the organisation of the first workshop on string constraints and applications (MOSCA) in 2019 (mosca).

Arguably, the widespread interest in cybersecurity has given new impulse to SCS because strings can be silent carriers of software vulnerabilities (e.g., SQL injections). Another plausible reason is the remarkable performance improvements that constraint solvers have achieved over the last years. Precise reasoning about strings is especially critical for the analysis of the JavaScript language, nowadays the de-facto standard for web applications, given the crucial role that strings have in this language (js-string; aratha).

Over the last decade a large number of different SCS approaches have emerged, roughly falling into three main categories:

  • Automata-based approaches: relying on finite state automata to represent the domain of string variables and to handle string operations.

  • Word-based approaches: based on systems of word equations. They mainly use Satisfiability Modulo Theories (SMT) (smt) solvers to tackle string constraints.

  • Unfolding-based approaches: they basically expand each string variable $x$ into a number of contiguous elements denoting the characters of $x$ (e.g., $x$ can be mapped into integer variables or bit-vectors).

As we shall see, CP and SMT are the state-of-the-art technologies for solving SCS problems.

The goal of this paper is to provide a comprehensive survey of the various string solving approaches proposed in the literature, ranging from the theoretical foundations to the more practical aspects. After formalising the notion of string constraint solving in Section 2, we provide a detailed review of SCS approaches grouped by category (Sections 3–5). In Section 6 we show the main theoretical results we are aware of, while in Section 7 we focus on the practical aspects by reporting the SCS tools and benchmarks that have been developed. In Section 8 we discuss the related literature before concluding in Section 9.

2. String Constraint Solving

In this section we define the fundamentals of string constraint solving, and in particular we show how SCS is treated from both the CP and the SMT perspectives. Before that, we recall some preliminary notions about strings and automata theory.

Let us fix a finite alphabet $\Sigma$, i.e., a set of symbols also called characters. A string (or a word) $w$ is a finite sequence of characters of $\Sigma$, and $|w|$ denotes the length of $w$ (in this work we do not consider infinite-length strings). The empty string is denoted with $\epsilon$. The countable set $\Sigma^*$ of all the strings of $\Sigma$ is inductively defined as follows: (i) $\epsilon \in \Sigma^*$; (ii) if $a \in \Sigma$ and $w \in \Sigma^*$, then $wa \in \Sigma^*$; (iii) nothing else belongs to $\Sigma^*$.

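The string concatenation of $v, w \in \Sigma^*$ is denoted by $v \cdot w$ (or simply with $vw$ when not ambiguous). We denote with $w^n$ the iterated concatenation of $w$ for $n$ times, i.e., $w^0 = \epsilon$ and $w^n = w w^{n-1}$ for $n > 0$. Analogously, we define the concatenation between sets of strings: given $L, L' \subseteq \Sigma^*$, we denote with $L \cdot L' = \{vw \mid v \in L, w \in L'\}$ (or simply with $LL'$) their concatenation and with $L^n$ the iterated concatenation, i.e., $L^0 = \{\epsilon\}$ and $L^n = L L^{n-1}$ for $n > 0$.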
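The definitions of concatenation and iterated concatenation can be made concrete with a minimal Python sketch. The function names (`concat`, `power`, `lang_concat`, `lang_power`) are hypothetical and chosen only for this illustration, and languages are restricted to finite sets so that they can be enumerated.

```python
from itertools import product


def concat(v: str, w: str) -> str:
    """String concatenation of v and w."""
    return v + w


def power(w: str, n: int) -> str:
    """Iterated concatenation: w^0 is the empty string, w^n = w . w^(n-1)."""
    return "" if n == 0 else concat(w, power(w, n - 1))


def lang_concat(l1: set, l2: set) -> set:
    """Concatenation of two finite languages: every v in l1 followed by every w in l2."""
    return {concat(v, w) for v, w in product(l1, l2)}


def lang_power(lang: set, n: int) -> set:
    """Iterated concatenation of a finite language: L^0 = {""}, L^n = L . L^(n-1)."""
    return {""} if n == 0 else lang_concat(lang, lang_power(lang, n - 1))
```

For instance, `lang_power({"a", "b"}, 2)` enumerates all the strings of length 2 over the alphabet {a, b}.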

A set of strings $L \subseteq \Sigma^*$ is called a formal language. Formal languages are countable sets; the regular ones are exactly those recognised by well-known models of computation called finite-state automata (FSA). Roughly, a FSA $M$ is a system with a finite number of states where state transitions occur according to the characters of the input string $w$. A number of final or accepting states determines if $w$ belongs to the language $L(M)$ denoted by $M$ or not.
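As an illustration of how a deterministic FSA decides membership, the following Python sketch simulates a run on an input string. The dictionary encoding of the transition function and the example automaton (strings over {a, b} with an even number of `a`) are assumptions made for this example only.

```python
def dfa_accepts(delta: dict, start, finals: set, w: str) -> bool:
    """Run a deterministic automaton on w; delta maps (state, char) to the next state."""
    state = start
    for ch in w:
        if (state, ch) not in delta:
            return False  # missing transition: the input is rejected
        state = delta[(state, ch)]
    return state in finals


# Hypothetical automaton: accepts the strings over {a, b} with an even number of 'a'.
EVEN_A = {
    ("even", "a"): "odd",
    ("even", "b"): "even",
    ("odd", "a"): "even",
    ("odd", "b"): "odd",
}
```

Calling `dfa_accepts(EVEN_A, "even", {"even"}, "abab")` follows the states even, odd, odd, even, even and therefore accepts.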

Different variants and extensions of FSA have been proposed, e.g., deterministic (DFA), non-deterministic (NFA), push-down automata (PDA, to recognise context-free languages), and finite-state transducers (FST, basically FSA with input/output tapes defining relations between sets of strings).

A string variable is basically a variable that can only take values in $\Sigma^*$ or, equivalently, whose domain is a formal language of $\Sigma^*$. We can classify string variables into three hierarchical classes:

  • unbounded-length variables: they can take any value in $\Sigma^*$

  • bounded-length variables: given an integer $\lambda \geq 0$, they can only take values in $\bigcup_{i=0}^{\lambda} \Sigma^i$

  • fixed-length variables: given an integer $\lambda \geq 0$, they can only take values in $\Sigma^\lambda$

A string constraint is a relation over at least one string variable. For example, concatenation is a ternary string constraint over three string variables $x$, $y$ and $z$. Clearly, instead of writing it in relational form we use a more convenient functional notation, $z = x \cdot y$. In this paper we will only consider constraints involving strings and (possibly) integers (e.g., string length or iterated concatenation).

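String constraint                            Description
$x = y$, $x \neq y$                          equality, inequality
$x < y$, $x \leq y$, $x \geq y$, $x > y$     lexicographic ordering
$n = |x|$                                    string length
$z = x \cdot y$, $y = x^n$                   concatenation, iterated concatenation $n$ times
$y = x^{-1}$                                 string reverse
$y = x[i..j]$                                substring from index $i$ to index $j$
$n = \mathit{find}(x, y)$                    $n$ is the index of the first occurrence of $x$ in $y$
$y = \mathit{replace}(x, q, q')$             $y$ is obtained by replacing the first occurrence of $q$ with $q'$ in $x$
$y = \mathit{replaceAll}(x, q, q')$          $y$ is obtained by replacing all the occurrences of $q$ with $q'$ in $x$
$n = \mathit{count}(a, x)$                   $n$ is the number of occurrences of character $a$ in $x$
$x \in L(R)$                                 membership of $x$ in the regular language $L(R)$ denoted by $R$
$x \in L(G)$                                 membership of $x$ in the context-free language $L(G)$ denoted by $G$
Table 1. Main string constraints.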

Table 1 summarises the main string constraints that one can find in the literature, from which it is possible to derive other constraints (e.g., to replace the last occurrence of a string one can combine $\mathit{replace}$ and string reverse). The constraint $\mathit{count}$ is also referred to as global cardinality count (GCC) (mzn-strings). A good SCS approach should be able to handle most of them. Note that in this work we consider quantifier-free constraints. In simple terms, the goal of string constraint solving is to determine whether or not a set of string constraints is feasible. As we shall see, this task can be tackled equivalently with both CP and SMT technologies.
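To make the semantics of some of these constraints concrete, here is a small Python sketch of reference implementations over ground strings. The function names and the indexing convention (0-based, with -1 when there is no occurrence, as in Python's `str.find`) are assumptions of this sketch and may differ from the conventions of individual solvers. The last function shows how replacing the last occurrence can be derived from replace-first and reverse.

```python
def find(x: str, y: str) -> int:
    """Index of the first occurrence of x in y (0-based, -1 if absent: sketch convention)."""
    return y.find(x)


def replace_first(x: str, q: str, r: str) -> str:
    """Replace the first occurrence of q with r in x."""
    return x.replace(q, r, 1)


def replace_all(x: str, q: str, r: str) -> str:
    """Replace all the occurrences of q with r in x."""
    return x.replace(q, r)


def count(a: str, x: str) -> int:
    """Number of occurrences of character a in x."""
    return x.count(a)


def reverse(x: str) -> str:
    """String reverse."""
    return x[::-1]


def replace_last(x: str, q: str, r: str) -> str:
    """Derived constraint: reverse everything, replace the first occurrence, reverse back."""
    return reverse(replace_first(reverse(x), reverse(q), reverse(r)))
```

These functions only evaluate constraints on fixed strings; a solver instead has to reason about them when the arguments are variables.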

2.1. SCS from a CP perspective

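From a constraint programming point of view, string constraint solving means solving a particular case of Constraint Satisfaction Problem (CSP). Formally, a CSP is a triple $\mathcal{P} = \langle \mathcal{X}, \mathcal{D}, \mathcal{C} \rangle$ where: $\mathcal{X} = \{x_1, \dots, x_n\}$ are the variables; $\mathcal{D} = \{D(x_1), \dots, D(x_n)\}$ are the domains, where for each $i = 1, \dots, n$, $D(x_i)$ is a set of values that $x_i$ can take; $\mathcal{C} = \{C_1, \dots, C_m\}$ are the constraints, i.e., relations over the variables of $\mathcal{X}$ defining the feasible values for the variables.

The goal is to find a solution of $\mathcal{P}$, which is basically an assignment $\xi$ of values to variables such that $\xi(x_i) \in D(x_i)$ for $i = 1, \dots, n$ and $(\xi(x_{i_1}), \dots, \xi(x_{i_k})) \in C$ for each constraint $C \in \mathcal{C}$ defined over variables $x_{i_1}, \dots, x_{i_k}$. The CSP notion can be naturally extended to optimisation problems: we just add an objective function that maps each solution to a numerical value to be minimised or maximised.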

To find a solution, CP solvers use two main combined techniques: propagation, which works on individual constraints trying to prune the domains of the variables involved until a fixpoint is reached, and branching, which aims to find a solution via heuristic search (propagation alone is not complete in general).

Given an alphabet $\Sigma$, we call a CSP with strings, or $\Sigma$-CSP, a CSP $\langle \mathcal{X}, \mathcal{D}, \mathcal{C} \rangle$ having $k > 0$ string variables $x_1, \dots, x_k \in \mathcal{X}$ such that $D(x_i) \subseteq \Sigma^*$ for $i = 1, \dots, k$, and a number of constraints in $\mathcal{C}$ over such variables. To find a solution, a CP solver can try to compile down a $\Sigma$-CSP into a CSP with only integer variables (mzn-strings), or it can define specialised string propagators and branchers (gecode_s; dashed-string; sweep-based). The latter approach has proved to be much more efficient.

For example, consider CSP where are string variables with associated alphabet and is an integer variable. Propagating will exclude string from because it has length 4, while the domain of is the interval . An optimal propagator for would narrow the domain of from to . Note that propagation is a compromise between effectiveness (how many values are pruned) and efficiency (the computational cost of pruning), so sometimes it makes sense to settle for efficient but sub-optimal propagators. Then, will narrow the domain of to singleton (which actually means assigning the value to ). At this stage, a fixpoint is reached, i.e., no more propagation is possible: we have to branch on to possibly find a solution. Let us suppose that the variable choice heuristics selects variable and the value choice heuristics assigns to it the value ; in this case the propagator of is able to conclude that so a feasible solution for (not the only one) is .
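The interplay between propagation and branching can be sketched in a few lines of Python. This is a toy illustration, not the algorithm of any actual CP solver: domains are small explicit sets, the two propagators (for a length constraint and a concatenation constraint) prune by plain enumeration, and the variables, domains and heuristics below are hypothetical values chosen for this sketch rather than those of the example above.

```python
def prop_length(dom, x, n):
    """Prune D(x) and D(n) so that n = |x| can still hold; return True if something changed."""
    new_x = {s for s in dom[x] if len(s) in dom[n]}
    new_n = {k for k in dom[n] if any(len(s) == k for s in dom[x])}
    changed = new_x != dom[x] or new_n != dom[n]
    dom[x], dom[n] = new_x, new_n
    return changed


def prop_concat(dom, x, y, z):
    """Prune D(x), D(y), D(z) so that z = x . y can still hold (by enumeration)."""
    pairs = [(a, b) for a in dom[x] for b in dom[y] if a + b in dom[z]]
    new = {x: {a for a, _ in pairs}, y: {b for _, b in pairs}, z: {a + b for a, b in pairs}}
    changed = any(new[v] != dom[v] for v in new)
    dom.update(new)
    return changed


def propagate(dom, props):
    """Run all propagators until a fixpoint; return False if some domain becomes empty."""
    changed = True
    while changed:
        changed = False
        for p in props:
            changed = p(dom) or changed
        if any(not d for d in dom.values()):
            return False
    return True


def solve(dom, props):
    """Propagate, then branch on the variable with the smallest non-singleton domain."""
    dom = {v: set(d) for v, d in dom.items()}  # work on a copy
    if not propagate(dom, props):
        return None
    if all(len(d) == 1 for d in dom.values()):
        return {v: next(iter(d)) for v, d in dom.items()}
    var = min((v for v in dom if len(dom[v]) > 1), key=lambda v: len(dom[v]))
    for val in sorted(dom[var]):  # value choice: smallest value first
        sol = solve({**dom, var: {val}}, props)
        if sol is not None:
            return sol
    return None


# Hypothetical problem: n = |x| and z = x . y over small explicit domains.
DOMAINS = {"x": {"a", "ab", "abc"}, "y": {"b", "bc"}, "z": {"ab", "abc", "abbc"}, "n": {1, 2}}
PROPS = [lambda d: prop_length(d, "x", "n"), lambda d: prop_concat(d, "x", "y", "z")]
solution = solve(DOMAINS, PROPS)
```

Here propagation first removes `"abc"` from the domain of `x` because its length is not in the domain of `n`, then reaches a fixpoint, and the search has to branch to complete the assignment. Real string propagators avoid enumerating domains explicitly, which is exactly where dedicated representations pay off.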

Note that virtually all the CSPs referred to in the literature have finite domains, i.e., the cardinality of each domain of $\mathcal{D}$ is bounded. Having finite domains guarantees the decidability of CSPs (which are in general NP-complete) by enumeration, but at the same time prevents the use of unbounded-length string variables for $\Sigma$-CSPs.

As we shall see in Section 5, all the effective CP approaches for string solving are unfolding-based and do not handle unbounded-length variables. In fact, although CP provides the $\mathit{regular}(x, M)$ constraint, stating that $x$ must belong to the language denoted by the finite-state machine $M$, it is also true that the string variable $x$ (or array of integer variables) must have a fixed (pesant04-regular) or bounded (regular) length. The only C(L)P proposals we are aware of that handle unbounded-length strings (via regular sets) are (clpstar) and (cpstr). These automata-based approaches are however outdated.

2.2. SCS from a SMT perspective

In a nutshell, Satisfiability Modulo Theories generalises the Boolean satisfiability problem to decide whether a formula in first-order logic is satisfiable with respect to some background theory that fixes the interpretations of predicates and functions (smt). Note that SMT theories can be arbitrarily enriched and combined together.

Over the last decades, several decision procedures have been developed to tackle the most disparate theories and sub-theories, including the theory of (non-)linear arithmetic, bit-vectors, floating points, arrays, difference logic, uninterpreted functions. In particular, well-known SMT solvers like, e.g., CVC4 (cvc4-str) and Z3 (z3) decided to implement the theory of strings (often in conjunction with related theories, such as linear arithmetic for length constraints and regular expressions).

For example, the quantifier-free theory of strings (or word equations) and linear arithmetic deals with integers and unbounded-length strings in $\Sigma^*$, where $\Sigma$ is a given alphabet. Its terms are string/integer variables/constants, concatenation and length. The formulas of this theory are (dis-)equalities between strings and linear arithmetic constraints. For example, a formula that combines (dis-)equalities between string terms built from string variables $x$ and $y$ with a linear constraint on their lengths is well-formed for this theory. Unfortunately, the decidability of this theory is still unknown (mosca).

As we shall see in Section 4, over the last years a growing number of modern SMT solvers has integrated the theory of strings. Most of them are based on the DPLL($T$) (DPLLT) procedure. DPLL($T$) is a general framework extending the original DPLL algorithm (tailored for SAT solving) to deal with an arbitrary theory $T$ through the interaction between a SAT solver and a solver specific for $T$. In a nutshell, DPLL($T$) lazily decomposes a SMT problem into a SAT formula, which is handled by a DPLL-based SAT solver which in turn interacts with a theory-specific solver for $T$, whose job is to check the feasibility of the formulas returned by the SAT solver.

As an example, let us consider the theory of strings and linear arithmetic with the above formula (which is unsatisfiable). The formula is translated into a Boolean formula, obtained by abstracting each theory atom with a propositional variable, and handled by a SAT solver, which can return “unsatisfiable” or a satisfying assignment (the latter does not imply that the overall formula is satisfiable). In the latter case, the constraints corresponding to such an assignment are distributed to the different theories.

For example, once an assignment is returned, the string constraints it selects are delivered to the string solver, while the arithmetic constraints will be solved by an arithmetic solver. Now, the theory solvers can either find that the constraints are $T$-satisfiable or return theory lemmas to the SAT solver. For example, the string solver might return a lemma to the SAT solver, which will add the corresponding clause to its knowledge base. The SAT solver will then produce a new assignment (or return “unsatisfiable”) and the process will be iteratively repeated until either (un-)satisfiability of the formula is proven or a resource limit is reached (if the theory is not decidable, termination is not guaranteed in general).
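The interaction loop between the SAT solver and the theory solver can be illustrated with a deliberately naive Python sketch. Everything here is an assumption made for illustration: the "SAT solver" simply enumerates assignments, the "theory solver" is a brute-force search over strings of bounded length (so a negative answer only means that no witness exists within the bound), and the only lemma ever learned is a clause blocking the failed assignment. Real DPLL($T$) solvers are far more refined.

```python
from itertools import product


def sat_models(atoms, clauses):
    """Stand-in for a SAT solver: enumerate assignments satisfying all the clauses.
    A clause is a list of literals (atom, polarity)."""
    for values in product([True, False], repeat=len(atoms)):
        model = dict(zip(atoms, values))
        if all(any(model[a] == pol for a, pol in clause) for clause in clauses):
            yield model


def theory_check(model, semantics, alphabet, max_len):
    """Stand-in for a theory solver: bounded search for strings x, y on which
    every atom evaluates exactly as prescribed by the Boolean model."""
    words = ["".join(p) for n in range(max_len + 1) for p in product(alphabet, repeat=n)]
    for x, y in product(words, repeat=2):
        if all(semantics[a](x, y) == val for a, val in model.items()):
            return (x, y)
    return None


def lazy_smt(semantics, clauses, alphabet="ab", max_len=3):
    """Alternate Boolean search and theory checks until a witness is found
    or the Boolean abstraction becomes unsatisfiable."""
    atoms = list(semantics)
    clauses = list(clauses)
    while True:
        model = next(sat_models(atoms, clauses), None)
        if model is None:
            return None  # no assignment left (within the length bound of this sketch)
        witness = theory_check(model, semantics, alphabet, max_len)
        if witness is not None:
            return witness
        # Naive theory lemma: forbid this exact assignment.
        clauses.append([(a, not v) for a, v in model.items()])


# Hypothetical atoms over two string variables x and y.
SEMANTICS = {
    "p": lambda x, y: x == y + "a",          # word equation x = y . "a"
    "q": lambda x, y: len(x) == len(y),      # |x| = |y|
    "r": lambda x, y: len(x) == len(y) + 1,  # |x| = |y| + 1
}
```

With the clauses `[[("p", True)], [("q", True), ("r", True)]]`, i.e., p ∧ (q ∨ r), the first two Boolean assignments are rejected by the theory check and blocked, and the third one yields the witness x = "a", y = "".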

3. Automata-based SCS approaches

As mentioned above, the domain of a string variable is a formal language, i.e., a potentially infinite (yet countable) set. A natural way to denote these sets is through (extensions of) finite-state automata. It is therefore unsurprising that the early string solving approaches were based on FSA, possibly enriched with other data structures.

We can say that a SCS approach is automata-based if the string variables are mainly represented by automata, and the string constraints are mainly mapped into corresponding automata operations.

As mentioned above, CLP($\Sigma^*$) was one of the first attempts to incorporate strings in the CLP framework to strengthen the standard string-handling features such as concatenation and sub-string (clpstar). This approach was further developed by Golden et al. (cpstr) about 15 years later. Their main contribution was to use FSA to represent regular sets. In (DBLP:conf/aaai/HansenA07) Hansen et al. use deterministic FSA (DFA) and binary decision diagrams (BDDs) to handle interactive configuration on string variables.

The global constraint $\mathit{regular}$ proposed in (pesant04-regular) makes it possible to treat a fixed-size array of integer variables as a fixed-length string belonging to the regular language denoted by a given (non-)deterministic FSA $M$. Note that $\mathit{regular}$ was introduced to solve finite-domain CP problems like rostering and car sequencing, and was not targeted at string solving; in fact, it is a useful support that has been used in different CP applications. Its natural extension is the $\mathit{grammar}$ constraint (cpgram; gramcons), where instead of a FSA we have a context-free grammar. However, $\mathit{grammar}$ never reached the popularity of $\mathit{regular}$ in the CP community.

An interesting paper about automata-based approaches is (auto-eval), where Hooimeijer et al. study a comprehensive set of algorithms and data structures for automata operations in order to give a fair comparison between different automata-based SCS frameworks (mona; drple; jsa; php-str; rex; stranger; pass). According to their experiments, the best results were achieved when using the BDDs in combination with lazy versions of automata intersection and difference.

MONA (mona) is a tool developed in the ’90s that acts as a decision procedure for Monadic Second-Order Logic (M2L) and as a translator to finite-state automata based on BDDs. FIDO (fido) is a domain-specific programming formalism translated first into pure M2L via suitable encodings, and finally into FSA through the MONA tool. Another M2L-based solver is PISA (pisa), a path- and index-sensitive string solver that is applicable for static analysis.

DPRLE (drple) is a SCS approach to solve equations over regular language string variables. The authors provide automata-based decision procedures to tackle the Regular Matching Assignments problem and a subclass, the Concatenation-Intersection problem.

StrSolve (strsolve) is a decision procedure supporting similar operations to those allowed by DPRLE, but efficiently produces single witnesses rather than atomically generating entire solution sets. Nevertheless, its worst-case performance corresponds to that of DPRLE.

JSA (jsa) is a string analysis framework that first transforms a Java source into a flow graph (front-end), and then derives FSA from such graph (back-end). In particular, JSA uses well-founded hierarchical directed acyclic graphs of non-deterministic FSA called multi-level automata (MLFA).

In (php-str) Minamide developed a string analyzer for the PHP scripting language to detect cross-site scripting vulnerabilities and to validate the pages that PHP programs generate dynamically. The analyzer has a library to manipulate formal languages including automata, transducers and context-free grammars.

Rex (rex) is a tool based on Z3 solver (z3) for symbolically expressing and analyzing regular expression constraints. It relies on symbolic finite-state automata (SFA) where moves are labeled by formulas instead of individual characters. SFAs are then translated into axioms describing the acceptance conditions.

SUSHI (sushi) is a string solver based on the Simple Linear String Equation (SISE) formalism (sise) to represent path conditions and attack patterns. To solve SISE constraints, the authors use an automata-based approach. Finite state transducers are used to model the semantics of regular substitution.

Stranger (stranger) is an automata-based tool for finding and eliminating string-related security vulnerabilities in PHP applications. It uses symbolic forward and backward reachability analyses to compute the possible values that string expressions can take during program execution.

PASS (pass) is a string solver using parameterized arrays as the main data structure to model strings, and converts string constraints into quantified expressions that are solved through quantifier elimination. In addition, PASS uses an automaton model to handle regular expressions and reason about string values faster.

SLOG (slog) is a string analysis tool based on a NFA manipulation engine with logic circuit representation. Automata manipulations can be performed implicitly using logic circuits while determinization is largely avoided. SLOG also supports symbolic automata and enables the generation of counterexamples.

Sloth (sloth) is based on the reduction of satisfiability of formulae in the straight-line fragment and in the acyclic fragment to the emptiness problem of alternating finite-state automata (AFAs). Sloth can handle string constraints with concatenation, finite-state transducers, and regular constraints.

OSTRICH (ostrich) is a string solver providing built-in support for concatenation, reverse, functional transducers (FFT), and $\mathit{replaceAll}$. It can be seen as an extension of Sloth in the sense that the decision algorithm of OSTRICH can reduce the problem to constraints handled by Sloth when it is not possible to avoid non-determinism.

In (DBLP:journals/jip/ZhuAM19) Zhu et al. proposed a SCS procedure where atomic string constraints are represented by streaming string transducers (SSTs) (sst). A straight-line constraint is satisfiable if and only if the domain of the composed streaming string transducer is not empty.

Pros and Cons

Automata make it possible to represent infinite sets of strings with finite machines, hence they are a natural and elegant way to represent unbounded-length strings. Also, the theory of automata (automata) is well defined and studied.

Unfortunately, the performance of automata-based approaches for string constraint solving has been hampered by two main factors: (i) the possible state explosion due to automata operations (e.g., the intersection of DFA is quadratic in the number of states); (ii) the integration with other domains and theories (integers in particular).
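The state growth mentioned in point (i) can be observed with the textbook product construction for the intersection of two DFA, sketched below in Python. The tuple-based encoding of automata and the two example automata are assumptions of this sketch; only the states reachable from the initial pair are built.

```python
def dfa_intersection(d1, d2, alphabet):
    """Product construction. Each DFA is (delta, start, finals), with a total
    transition function delta: dict (state, char) -> state.
    Returns (delta, start, finals, states) of the reachable product automaton."""
    (delta1, s1, f1), (delta2, s2, f2) = d1, d2
    start = (s1, s2)
    delta, finals, todo, states = {}, set(), [start], {start}
    while todo:
        state = todo.pop()
        p, q = state
        if p in f1 and q in f2:
            finals.add(state)  # accepting only if both components accept
        for ch in alphabet:
            nxt = (delta1[(p, ch)], delta2[(q, ch)])
            delta[(state, ch)] = nxt
            if nxt not in states:
                states.add(nxt)
                todo.append(nxt)
    return delta, start, finals, states


def accepts(delta, start, finals, w):
    """Run a DFA with total transition function on w."""
    state = start
    for ch in w:
        state = delta[(state, ch)]
    return state in finals


# Hypothetical DFA 1: even number of 'a' (2 states).
EVEN_A = ({(0, "a"): 1, (0, "b"): 0, (1, "a"): 0, (1, "b"): 1}, 0, {0})
# Hypothetical DFA 2: length multiple of 3 (3 states).
LEN_MOD3 = ({(i, ch): (i + 1) % 3 for i in range(3) for ch in "ab"}, 0, {0})

P_DELTA, P_START, P_FINALS, P_STATES = dfa_intersection(EVEN_A, LEN_MOD3, "ab")
```

In this example the product has 2 × 3 = 6 reachable states; chaining several intersections multiplies the sizes again, which is one source of the state explosion.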

For these reasons, SCS approaches purely based on automata have little success nowadays. However, for some particular classes of SCS problems they still might be the best option.

4. Word-based SCS approaches

As mentioned in Section 2, strings are also called words. What we call string constraints are, in a more algebraic terminology, sometimes called word equations (makanin1977). To be more precise, a word equation is a particular string constraint of the form $L = R$ with $L, R \in (\Sigma \cup V)^*$, where $\Sigma$ is an alphabet and $V$ is a set of variables.
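To give a feeling for what a solution of a word equation is, the following Python sketch searches for one by brute force over assignments of bounded length. This is only an illustration under explicit assumptions: variables are written as uppercase letters and constants as lowercase letters (a convention of this sketch), and a `None` result only means that no solution exists within the given bound. Word-based solvers use algebraic techniques instead of enumeration.

```python
from itertools import product


def solve_word_equation(lhs: str, rhs: str, alphabet: str, max_len: int):
    """Search for a substitution of the variables making both sides equal.
    Uppercase letters are variables, all other characters are constants."""
    variables = sorted({c for c in lhs + rhs if c.isupper()})
    words = ["".join(p) for n in range(max_len + 1) for p in product(alphabet, repeat=n)]
    for values in product(words, repeat=len(variables)):
        sub = dict(zip(variables, values))

        def instantiate(side: str) -> str:
            # Replace each variable with its value, keep constants as they are.
            return "".join(sub.get(c, c) for c in side)

        if instantiate(lhs) == instantiate(rhs):
            return sub
    return None
```

For example, `solve_word_equation("XY", "ab", "ab", 2)` returns a substitution whose values concatenate to `"ab"`, while the equation `aX = Xb` has no solution at all, because the left-hand side always contains one more `a` than the right-hand side.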

We can say that a SCS approach is word-based if it is based on the theory of word-equations, possibly enriched with other theories (e.g., integers or regular expressions). Word-based approaches rely on algebraic techniques for solving natively (quantifier-free) string constraints over the (extended) theory of unbounded strings, without reducing to other data types such as bit vectors or automata. The natural candidates for implementing word-based approaches are SMT solvers, which can incorporate and integrate the theory of strings in their frameworks.

In 2014, Liang et al. (cvc4-str) argued that: “Despite their power and success as back-end reasoning engines, general multitheory SMT solvers so far have provided minimal or no native support for reasoning over strings […] until very recently the available string solvers were standalone tools that […] imposed strong restrictions on the expressiveness […] Traditionally, these solvers were based on reductions to satisfiability problems over other data types”. However, in the following years the situation has changed, and more and more word-based string solving approaches have emerged.

In (cvc4-str) the authors integrated a word-based SCS approach into the well-known SMT solver CVC4 (cvc4). They used a DPLL($T$) approach for solving (quantifier-free) constraints natively over the theory of unbounded strings with length and regular language membership. The authors claimed that CVC4 was the first solver able to reason about a language of mixed constraints including strings together with integers, reals, arrays, and algebraic datatypes. This work has been revised and extended in the following years to handle extended string functions frequently occurring in security and verification applications (cvc4-ext1; cvc4-ext2).

Meanwhile, another well-established SMT solver, namely Z3 (z3), started to develop string solving capabilities. Z3-str, an extension of Z3 to solve string constraints, was introduced in (z3str). Z3-str was the progenitor of a number of different word-based string solvers built on top of Z3. Z3str2 (z3str2-jnl) extended Z3-str by including overlapping variables detection and new heuristics. Z3str3 (z3str3) added a technique called theory-aware branching to take into account the structure of theory literals for computing branching activities. Z3strBV (z3strbv) is instead a solver for the theory of string equations, string length represented as bit-vectors, and bit-vector arithmetic, aimed at the software verification, testing, and security analysis of C/C++ programs.

Norn is a SMT solver introduced in (norn) for an expressive constraint language including word equations, length constraints, and regular membership queries. Norn is based on a decision procedure under the assumption of a set of acyclicity conditions on word equations, where acyclicity is a syntactic condition ensuring that no variable appears more than once in word (dis-)equalities.

Trau (trau) is a SMT string solver based on the flattening technique introduced in (flat-trau), where flat automata are used to capture simple patterns of common constraints. It relies on a Counter-Example Guided Abstraction Refinement (CEGAR) framework (cegar) where an under- and an over-approximation module interact to increase the string solving precision. In addition, Trau implements string transduction by reduction to context-free membership constraints.

S3 is a symbolic string solver (s3) motivated by the analysis of web programs working on string inputs. S3 can be viewed as an extension of the aforementioned Z3-str (z3str) solver to handle regular expressions. Its successor, S3P (s3p), guides the search towards a “minimal solution” for satisfiable problems and enables conflict clause learning for string theory. The latest version, called S3# (s3hash), implements an algorithm for counting the models of string constraints.

Pros and Cons

State-of-the-art word-based approaches are more general, more flexible and often more efficient than automata-based approaches. They can be built on top of well-known SMT solvers and integrated with already defined theories. Furthermore, they can natively handle unbounded-length strings.

On the downside, most of these approaches are incomplete and suffer from performance issues due to the disjunctive reasoning of the underlying DPLL($T$) paradigm (DPLLT). In particular, some experimental evaluations (dashed-string; sweep-based) showed that they may encounter difficulties when dealing with big string constants.

5. Unfolding-based SCS approaches

A straightforward way of solving string constraints is to encode them into other well-known types—for which well-established constraint solving techniques already exist—such as Boolean, integers or bit-vectors.

A SCS approach is unfolding-based if each string variable is unfolded into a homogeneous sequence of variables of a different type, and each string constraint is accordingly mapped into a constraint over that type.

Note that an unfolding approach inherently needs an upper bound $\lambda$ on the string length. So, these approaches can handle fixed-length or bounded-length string variables, but cannot deal with unbounded-length variables.

A proper choice of $\lambda$ is crucial. If $\lambda$ is too small, one cannot capture solutions whose string length exceeds it. On the other hand, too large a value for $\lambda$ can significantly worsen the SCS performance even for trivial problems.

Another important choice is whether to unfold eagerly (i.e., statically, before the actual solving process) or lazily (i.e., dynamically, during the solving process).

Hampi (hampi09; hampi12) was probably the first SMT-based approach encoding string constraints into constraints over bit-vectors, solved by the underlying STP solver (stp). Its first version (hampi09) only allowed one fixed-length string variable. Its subsequent version (hampi12) added a number of optimisations and, in particular, provided support for word equations and bounded-length string variables.

Kaluza (kaluza) was the back-end solver used by Kudzu, a symbolic execution framework for JavaScript code analysis. Similarly to Hampi, Kaluza deals with string constraints over bounded-length variables by translating them into bit-vector constraints solved with the STP solver (stp). In fact, Kaluza can be seen as an extension of the first version of Hampi (hampi09).

Mapping strings into bit-vectors is suitable for software analysis applications, especially when it comes to precisely handling overflows via wrapped integer arithmetic. Plenty of SMT solvers for quantifier-free bit-vector formulas exist (often relying on bit blasting), while the CP support for bit-vectors appears limited (bit_vector).

As an alternative to bit-vectors, the simplest approach is to translate string variables into arrays of integer variables encoding the string characters (possibly using a special padding value for the “empty character”). Mapping to integers is probably the most common approach for C(L)P solvers (mzn-strings).
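The integer encoding with padding can be sketched as follows in Python. The concrete choices are assumptions of this illustration: characters are numbered from 1 in the order of the alphabet, 0 is the padding value, and padding is only allowed as a suffix of the array, so that each bounded-length string has exactly one encoding.

```python
def encode(w: str, alphabet: str, bound: int) -> list:
    """Encode w as an array of `bound` integers, padded with 0 on the right."""
    if len(w) > bound:
        raise ValueError("string longer than the chosen bound")
    code = {ch: i + 1 for i, ch in enumerate(alphabet)}
    return [code[ch] for ch in w] + [0] * (bound - len(w))


def is_well_formed(arr: list) -> bool:
    """Padding may only appear as a suffix: after a 0, every later cell must be 0."""
    return all(a != 0 or b == 0 for a, b in zip(arr, arr[1:]))


def decode(arr: list, alphabet: str) -> str:
    """Recover the string by dropping the padding cells."""
    return "".join(alphabet[v - 1] for v in arr if v != 0)


def length(arr: list) -> int:
    """The string length is the number of non-padding cells."""
    return sum(1 for v in arr if v != 0)
```

In a solver, each array cell would be an integer variable with domain {0, ..., |Σ|}, and the well-formedness condition would be posted as a constraint between adjacent cells.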

For example, in (ocl) the authors describe a lightweight solver relying on CLP and Constraint Handling Rules (CHR) paradigm (chr) in order to generate large solutions for tractable string constraints in model finding. This approach unfolds string variables by first labelling their lengths and domains, and then their characters.

As mentioned above, CP solvers can use fixed-length or bounded-length arrays of integer variables to deal with $\mathit{regular}$, $\mathit{grammar}$ and other string constraints (pesant04-regular; cpgram; KadiogluS10; HeFPZ13; Maher09; mzn-strings) without native support for string variables. However, as shown in (gecode_s; mzn-strings; dashed-string; sweep-based), having dedicated propagators for string variables can make a difference.

In (ScottFP13) Scott et al. presented a prototypical bounded-length approach based on the affix domain to natively handle string variables. This domain allows one to reason about the content of string suffixes even when the length is unknown by using a padding symbol at the end of the string.

The (ScottFP13) approach has been subsequently improved in (bound_str; gecode_s) with a new structured variable type for strings called Open-Sequence Representation, for which suitable propagators are defined. This approach has been implemented in the Gecode solver (gecode), and in (mzn-strings) it is referred to as Gecode+S.

To mitigate the dependency on the length bound $\lambda$ and enable lazy unfolding, a fairly recent CP approach based on dashed strings has been introduced (dashed-string; sweep-based). Dashed strings are concatenations of distinct sets of strings (called blocks) used to represent in a compact way the domain of string variables with potentially very big length. Dashed string propagators have been defined for a number of constraints (sweep-based; lexfind; regular) and implemented in the G-Strings solver (g-strings).
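A minimal Python sketch may help to picture how a dashed string denotes a set of concrete strings. It assumes the block form commonly used in the dashed-string literature, which is not spelled out above: a block is a triple (S, l, u) denoting all the strings of length between l and u whose characters belong to the set S, and a dashed string is a sequence of such blocks. The function below only checks whether a concrete string belongs to the denoted set; it is not a propagator.

```python
def in_dashed_string(w: str, blocks: list) -> bool:
    """Check whether w can be split into consecutive pieces, one per block (S, l, u),
    where the i-th piece uses only characters of S and has length in [l, u]."""
    positions = {0}  # prefixes of w that can be matched by the blocks seen so far
    for chars, lo, hi in blocks:
        nxt = set()
        for start in positions:
            end, k = start, 0  # k = number of characters consumed by this block
            while True:
                if k >= lo:
                    nxt.add(end)
                if k == hi or end == len(w) or w[end] not in chars:
                    break
                end += 1
                k += 1
        positions = nxt
    return len(w) in positions


# Hypothetical dashed string: 1 or 2 characters in {a, b}, then up to 1000 'c'.
EXAMPLE = [({"a", "b"}, 1, 2), ({"c"}, 0, 1000)]
```

Note that the second block of `EXAMPLE` compactly represents strings of length up to 1000 without unfolding them into 1000 separate variables.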

Pros and Cons

The unfolding-based approaches allow one to take advantage of already defined theories and propagators without explicitly implementing support for strings. Experimental results show that unfolding approaches, and in particular the CP-based dashed string approach, can be quite effective, especially for SCS problems involving long strings.

However, unfolding approaches also have limitations. The most obvious one is the impossibility of handling unbounded-length strings. This can be negligible, provided that a good value of the length bound $\lambda$ is chosen. Unfortunately, deciding a good value of $\lambda$ is not always trivial. CP solvers can be more efficient than SMT solvers for satisfiable SCS problems, but they may fail on unsatisfiable problems with large domains, and on problems with a lot of logical disjunctions.

6. Theoretical aspects

In this section we briefly explore the theoretical side of string constraint solving, focusing in particular on the results that have been applied to the SMT and CP fields. For more insights about the theory of automata and combinatorics on words we refer the reader to (automata; automata_1; automata_2; automata_3; automata_4; lothaire1997combinatorics; choffrut1997combinatorics; lothaire2002algebraic; lothaire2005applied; berstel2007origins).

Algebraically speaking, the structure $(\Sigma^*, \cdot, \epsilon)$ is a free monoid on $\Sigma$. The free semigroup on $\Sigma$ is instead the semigroup $(\Sigma^+, \cdot)$ where $\Sigma^+ = \Sigma^* \setminus \{\epsilon\}$. There is therefore a number of theoretical results for free semigroups that can be applied for string solving (makanin1977; Va_enin_1983; Durnev1995UndecidabilityOT).

Over the years the complexity and the decidability of combinatorial problems over finite-length strings have been deeply studied (DBLP:journals/jsyml/Quine46a; ferebee1972; makanin1977; Va_enin_1983; Buchi1990; Durnev1995UndecidabilityOT; DBLP:journals/jsyml/Lob53; DBLP:journals/jacm/KarhumakiMP00; DBLP:journals/jacm/Plandowski04). Much progress has been made, but many questions remain open, especially when the language is enriched with new predicates (DBLP:conf/hvc/GaneshMSR12). For example, the decidability of the theory of strings and linear arithmetic, mentioned in Section 2 and used by the CVC4 solver, is still an open question. In (open-words), Néraud provided a list of open problems and conjectures for combinatorics on words.

As also noted in (DBLP:conf/atva/AbdullaADHJ19), DPLL(T)-based string solvers like Z3str2 (z3str2), Z3str3 (z3str3), CVC4 (cvc4-str), S3 (s3; s3p; s3hash), Norn (norn), Trau (trau), Sloth (sloth), and OSTRICH (ostrich) handle a variety of string constraints but they are incomplete for the full combination of those constraints. So, they often decide to deal with a particular fragment of the individual string constraints.

In (DBLP:conf/hvc/GaneshMSR12) Ganesh et al. proved several (un-)decidability results for word equations with length constraints and membership in regular sets. They proved in particular three main theorems: (i) the undecidability of the validity problem for the set of sentences written as a quantifier alternation applied to positive word equations; (ii) the decidability of quantifier-free formulas over word equations in solved form and length constraints; (iii) the decidability of quantifier-free formulas over word equations in regular solved form, length constraints, and the membership predicate over regular expressions.

The work in (DBLP:conf/rp/DayGHMN18) extends (DBLP:conf/hvc/GaneshMSR12) by considering the (un-)decidability of many fragments of the first order theory of word equations and their extensions. The authors show that when extended with several natural predicates on words, the existential fragment becomes undecidable. Moreover, they prove that deciding whether solutions exist for a restricted class of equations, augmented with many of the predicates leading to undecidability in the general case, is possible in non-deterministic polynomial time.

A. W. Lin et al. (DBLP:conf/popl/LinB16) studied the decidability of string logics with concatenations and finite-state transducers as atomic operations. They show that the straight-line fragment of the logic is decidable (complexity ranges from PSPACE to EXPSPACE). This fragment can express constraints required for analysing mutation XSS in web applications and remains decidable in the presence of length, letter-counting, regular, indexOf, and disequality constraints.

In (DBLP:journals/pacmpl/ChenCHLW18) Chen et al. provide a systematic study of straight-line string constraints with the replace function and regular membership as basic operations. They show that a large class of such constraints (i.e., when only a constant string or a regular expression is permitted in the pattern) is decidable. They also show undecidability results when variables or length constraints are permitted in the pattern parameter of the replace function.

In (ostrich) Chen et al. define a first-order language based on the .NET string library functions and prove some decidability properties for (fragments of) that language. In particular, based on the work of (Buchi1990), the authors prove that the path feasibility problem for the library language is undecidable.

In (DBLP:conf/atva/AbdullaADHJ19) Abdulla et al. address string constraints combining transducers, word equations, and length constraints. Since this problem is undecidable in general, they propose a new fragment of string constraints, called weakly chaining, and prove that it is decidable.

Recompression (Jez16) is a technique based on local modification of variables and iterative replacement of pairs of letters that can be seen as a bottom-up compression of the solution of a given word equation. Recompression can be used for the proof of satisfiability of word equations (Jez16) or for grammar-based compression (Jez15).

Switching to the CP side, we cannot say anything deep about decidability because this paradigm essentially works on finite domains, and CSPs over finite domains are trivially decidable by enumerating the domain values. The theoretical results in CP mainly refer to the consistency levels (handbookCP) that a given constraint propagator can enforce. For example, the regular propagator introduced in (pesant04-regular) enforces generalized arc consistency (GAC).
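To illustrate what enforcing GAC means for a regular-language constraint over a fixed-length array of variables, here is a minimal, non-incremental Python sketch based on a forward and a backward pass over the unfolded automaton. It is written in the spirit of a layered-graph filtering scheme but is not the algorithm of (pesant04-regular); the function name `filter_regular` and the dictionary encoding of the transition function are assumptions of this example.

```python
def filter_regular(domains, delta, start, accepting):
    """Prune every value that appears in no accepted word compatible with domains.

    domains: list of sets of symbols, one per position
    delta: dict mapping (state, symbol) to the next state of a DFA
    """
    n = len(domains)
    # Forward pass: states reachable from the start state after i symbols.
    fwd = [set() for _ in range(n + 1)]
    fwd[0] = {start}
    for i in range(n):
        for q in fwd[i]:
            for a in domains[i]:
                if (q, a) in delta:
                    fwd[i + 1].add(delta[(q, a)])
    # Backward pass: keep only states from which an accepting state is reachable,
    # and record the symbols labelling the surviving transitions.
    bwd = [set() for _ in range(n + 1)]
    bwd[n] = fwd[n] & set(accepting)
    pruned = [set() for _ in range(n)]
    for i in range(n - 1, -1, -1):
        for q in fwd[i]:
            for a in domains[i]:
                if delta.get((q, a)) in bwd[i + 1]:
                    bwd[i].add(q)
                    pruned[i].add(a)
    return pruned


# DFA over {a, b} accepting the words that contain exactly one 'b'.
delta = {(0, "a"): 0, (0, "b"): 1, (1, "a"): 1}
print(filter_regular([{"b"}, {"a", "b"}, {"a", "b"}], delta, 0, {1}))
```

In the usage example, fixing the first variable to `b` removes `b` from the other two domains, since every remaining value must be extendable to an accepted word. An empty domain in the result signals that the constraint is unsatisfiable.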

In (DBLP:conf/cpaior/PesantQRS09) Pesant et al. performed a theoretical study demonstrating how the grammar constraint can be linearized efficiently. In particular, they proposed a lifted polytope having only integer extreme points.

In (KadiogluS10) Kadioglu et al. theoretically investigate filtering problems that arise from grammar constraints. The authors address questions such as: can we efficiently filter context-free grammar constraints? How can we achieve arc-consistency for conjunctions of regular grammar constraints? What languages are suited for filtering based on regular and context-free grammar constraints? Are there languages that are suited for context-free, but not for regular grammar filtering?

In (prefsuff) Scott et al. extended the open domain consistency notion (Maher09) with new definitions such as PS-consistency, PSL-consistency, PSU-consistency, and PSLU-consistency.

Given their non-lattice nature, dashed strings do not come with propagators that formally ensure a given level of consistency. The main theoretical results for dashed strings concern the sweep-based equation algorithm introduced in (dashed-string), on which most of the dashed string propagators rely. This algorithm has been proven to be sound, while its completeness is still an open problem.

7. Practical aspects

In this section we focus on the practical aspects of string solving, that is, on the SCS tools and benchmarks that have actually been developed and tested.

Among the automata-based approaches seen in Section 3, the tools that are currently available online are MONA (mona-tool), JSA (jsa-tool), the PHP string analyzer (php-str-tool) described in (php-str), Rex (rex-tool), and Sloth (sloth-tool). However, only the latter two are actively maintained. PISA (pisa) is a commercial security product not publicly available.

CVC4 and Z3 are two of the most famous SMT solvers. Their interface for string solving is maintained and well documented (strings-cvc; strings-z3). Note that Z3 provides two alternatives for string solving: the theory of strings (via Z3str3 solver (z3str3)) and the theory of sequences (Z3seq). Z3str2 and Z3-str are instead no longer maintained.

Norn is publicly available at (norn-tool), but its last update dates back to 2015. At (s3-tool) one can find the binaries of all three versions of the S3 string solver (S3, S3P, and S3#). However, the sources are available only for S3 and have not been updated since 2014.

Trau comes in two versions. The first one is available at (trau-tool-1), while the newest implementation (called Z3-Trau) can be found at (trau-tool-2). Both versions are built on top of the Z3 solver.

Concerning unfolding-based approaches, the Kaluza solver is still available at (kaluza-tool) but is no longer maintained. Similarly, Gecode+ is available online (gecode_s-tool) but is no longer developed. Conversely, the G-Strings solver (g-strings-tool) implementing the dashed string approach is actively maintained. Both Gecode+ and G-Strings are extensions of Gecode (gecode), a well-established CP solver over finite domains.

ACO-Solver (acosolver) is a tool using a hybrid constraint solving procedure based on the ant colony optimization meta-heuristic, which is executed as a fall-back mechanism when an underlying solver encounters an unsupported string operation. JOACO-CS is an extension of ACO-Solver publicly available at (joaco-tool).

All the above approaches, and many others, have been tested on several string benchmarks, especially ones coming from the software verification world. Although some of them are publicly available, a major issue for string solving is that there are no standard benchmarks for SCS applications. This makes it difficult to compare string solvers rigorously and to integrate them.

The first step towards the creation of standard benchmarks is the implementation of a common language to model and solve SCS problems. Unfortunately, at present, strings are not a standard feature of either SMT or CP modelling languages. However, there is some movement in this direction.

For example, given the increasing interest in string solving, the SMT-LIB initiative (smtlib) is developing an official theory of strings and related logics for SMT solvers (see http://smtlib.cs.uiowa.edu/theories-UnicodeStrings.shtml). On the CP side, the de facto standard modelling language for constraint solvers is MiniZinc (minizinc). Strings are not (yet) part of the official MiniZinc release, but in (mzn-strings) Amadini et al. introduced an extension to enable string solving. This also includes a library for compiling string variables and constraints into bounded-length arrays of integer variables and constraints.

Unfortunately, at the moment the interaction between the SMT and CP communities does not appear very strong. A possible way to bring these two communities closer together is to define compilers to translate SMT-LIB into MiniZinc and vice versa. For example, Bofill et al. (fzn2smt) defined a compiler from FlatZinc, the low-level language derived from MiniZinc, to SMT-LIB. More recently, G. Gange (smt2mzn) implemented a prototypical compiler from SMT-LIB to MiniZinc able to process string variables and constraints.

| Class | Description | Quantity |
| Concats-{Small,Big} | Right-heavy, deep tree of concats | 120 |
| Concats-Balanced | Balanced, deep tree of concats | 100 |
| Concats-Extracts-{Small,Big} | Single concat tree, with character extractions | 120 |
| Lengths-{Long,Short} | Single, large length constraint on a variable | 200 |
| Lengths-Concats | Tree of fixed-length concats of variables | 100 |
| Overlaps-{Small,Big} | Formula of the form | 80 |
| Regex-{Small,Big} | Complex regex membership test | 120 |
| Many-Regexes | Multiple random regex membership tests | 40 |
| Regex-Deep | Regex membership test with many nested operators | 45 |
| Regex-Pair | Test for membership in one regex, but not another | 40 |
| Regex-Lengths | Regex membership test, and a length constraint | 40 |
| Different-Prefix | Equality of two deep concats with different prefixes | 60 |

Table 2. StringFuzz benchmarks composition

A nice step forward for SCS benchmarks is represented by StringFuzz (strfuzz), a modular SMT-LIB problem instance transformer and generator for string solvers. It is a useful tool for string solver developers and testers, and can help expose bugs and performance issues.

A repository of several SMT-LIB 2.0/2.5 problem instances generated and transformed with StringFuzz is available online (http://stringfuzz.dmitryblotsky.com/benchmarks/). Table 2 summarizes the nature of these instances, grouped into twelve different classes. StringFuzz also reports the performance of Z3str3, CVC4, Z3seq, and Norn on such instances. Arguably, only these SMT solvers were used because other SMT approaches are either unstable or cannot properly process the proposed SMT-LIB syntax. Note that the SMT-LIB to MiniZinc compiler of (smt2mzn) can successfully translate all the generated StringFuzz instances, so the performance of CP solvers can also be evaluated.
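To give a feel for what a generated instance in the style of the Concats classes may look like, the following Python sketch prints a small SMT-LIB script whose single assertion equates a string variable with a right-heavy tree of concatenations. This is not StringFuzz's generator: the function names are hypothetical, and the script assumes the `str.++` operator and the `String` sort of the SMT-LIB theory of strings, while the exact syntax emitted by StringFuzz may differ.

```python
def right_heavy_concat(depth, leaves=("a", "b")):
    """Build the term (str.++ l0 (str.++ l1 (... ld))) with depth nested concats."""
    term = f'"{leaves[depth % len(leaves)]}"'  # innermost (rightmost) leaf
    for i in range(depth - 1, -1, -1):
        term = f'(str.++ "{leaves[i % len(leaves)]}" {term})'
    return term


def concat_instance(depth):
    """Return an SMT-LIB script asserting that x equals a right-heavy concat tree."""
    lines = [
        "(declare-const x String)",
        f"(assert (= x {right_heavy_concat(depth)}))",
        "(check-sat)",
    ]
    return "\n".join(lines)


print(concat_instance(3))
```

Increasing `depth` yields arbitrarily deep terms, which is the kind of structural parameter that makes generated benchmarks useful for exposing scalability issues.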

8. Related works

In this section we cover some of the related literature that has not been previously mentioned.

DReX (drex) is a declarative language that can express all the regular string-to-string transformations based on function combinators (regcomb). The focus is on the complexity of evaluating the output of a DReX program on a given input string. In particular, the main contribution of (drex) is the identification of a consistency restriction on the use of combinators in DReX programs, and a single-pass evaluation algorithm for consistent programs.

Luu et al. developed SMC (smc), a model counter for determining the number of solutions of combinatorial problems involving string constraints. Their work relies on generating functions, a mathematical tool for reasoning about infinite series that also provides a mechanism to handle the cardinality of string sets. SMC is expressive enough to model constraints arising in real-world JavaScript applications and UNIX C utilities.

An Automata-Based model Counter for string constraints (ABC) is implemented in (DBLP:conf/cav/AydinBB15). ABC uses an automata-based constraint representation for reducing model counting to path counting. In (DBLP:conf/cav/AydinBB15) the ABC model counter is extended with relational and numeric constraints.
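The reduction from model counting to path counting can be sketched in a few lines of Python for the simplest case: a single string variable constrained by a deterministic finite automaton, with the string length fixed. Since the automaton is deterministic, accepted strings and accepting paths are in one-to-one correspondence. The function `count_accepted` and the dictionary encoding of the automaton are assumptions of this example and do not reflect ABC's implementation.

```python
def count_accepted(delta, start, accepting, alphabet, length):
    """Count the strings of the given length accepted by a DFA.

    delta maps (state, symbol) to the next state; missing entries are dead ends.
    """
    counts = {start: 1}  # counts[q] = number of paths from start ending in q
    for _ in range(length):
        nxt = {}
        for q, c in counts.items():
            for a in alphabet:
                r = delta.get((q, a))
                if r is not None:
                    nxt[r] = nxt.get(r, 0) + c
        counts = nxt
    return sum(c for q, c in counts.items() if q in accepting)


# DFA over {a, b} accepting the words that contain exactly one 'b'.
delta = {(0, "a"): 0, (0, "b"): 1, (1, "a"): 1}
print(count_accepted(delta, 0, {1}, "ab", 3))
```

The dynamic program runs in time linear in the length and in the number of transitions, without enumerating the solutions themselves.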

As mentioned earlier, the main application field for string solving is the area of software verification and testing. For example, Kausler et al. (str-eval) performed an evaluation of string constraint solvers in the context of symbolic execution (king1976). They find that, as one might expect, one solver might be more appropriate than another depending on the input program. This is also pointed out in (aratha), where Amadini et al. presented a multi-solver tool for the dynamic symbolic execution of JavaScript.

In (PHPRepair), the PHPRepair tool is used for automatically repairing HTML generation errors in PHP via string constraint solving. The property that all tests of a suite should produce their expected output is encoded as a string constraint over variables representing constant prints. Solving this constraint describes how constant prints must be modified to make all tests pass. The string constraints are encoded into the language of Kodkod (kodkod), a SAT-based constraint solver.

Abstracting a set of strings with a finite formalism is not merely a string solving affair. For example, the well-known Abstract Interpretation (CousotC77) framework may require the sound approximation of sets of strings (i.e., all the possible “concrete” values that a string variable of the input program can take) with an abstract counterpart.

Several abstract domains have been proposed to approximate sets of strings. These domains vary according to the properties that one needs to capture (e.g., the string length, the prefix or suffix, the characters occurring in a string). Madsen et al. (DBLP:conf/cc/MadsenA14) proposed and evaluated a suite of twelve string domains for the static analysis of dynamic field access. Additional string domains are also discussed in (DBLP:journals/spe/CostantiniFC15; m-string).

Choi et al. (DBLP:conf/aplas/ChoiLKD06) used restricted regular expressions as an abstract domain for strings in the context of Java analysis. Park et al. (DBLP:conf/dls/ParkIR16) use a stricter variant of this idea, with a more clearly defined string abstract domain.

Amadini et al. (js-string) provided an evaluation of the combination via direct product of different string abstract domains in order to improve the precision of JavaScript static analysis. In (ref-domains) this combination is achieved via reduced product by using the set of regular languages as a reference domain for the other string domains.
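As a small illustration of combining string abstract domains, the Python sketch below pairs a prefix domain with a length-interval domain and applies every operation componentwise, which is the essence of a direct product. The two domains and all function names are assumptions chosen for brevity; they are not the specific domains evaluated in (js-string).

```python
def common_prefix(a, b):
    """Longest common prefix of two strings."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return a[:n]


def abstract(strings):
    """Abstract a non-empty set of strings as (prefix, (min_len, max_len))."""
    strings = list(strings)
    prefix = strings[0]
    for s in strings[1:]:
        prefix = common_prefix(prefix, s)
    lengths = [len(s) for s in strings]
    return prefix, (min(lengths), max(lengths))


def join(x, y):
    """Least upper bound, computed independently on each component."""
    (p1, (lo1, hi1)), (p2, (lo2, hi2)) = x, y
    return common_prefix(p1, p2), (min(lo1, lo2), max(hi1, hi2))


def may_contain(x, s):
    """Sound membership test: False means s is definitely not represented."""
    prefix, (lo, hi) = x
    return s.startswith(prefix) and lo <= len(s) <= hi


print(join(abstract({"foo", "foobar"}), abstract({"fab"})))
```

Because the components never exchange information, a pair such as the prefix `foo` with lengths in `(0, 2)` is accepted even though it represents no string at all. Detecting and exploiting such interactions is what a reduced product adds.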

Finally, we mention that string solving techniques might be useful in the context of Bioinformatics, where a number of CP techniques have been already applied (see, e.g., the works by Barahona et al. (DBLP:journals/constraints/BarahonaK08; DBLP:conf/aime/KrippahlMB13; DBLP:conf/cp/KrippahlB16)).

9. Conclusions

In this work we provided a comprehensive survey on the various aspects of string constraint solving (SCS), an important emerging field at the intersection of combinatorics on words and constraint solving. In particular, we focused on the three main categories of SCS approaches we are aware of (automata-based, word-based and unfolding-based), from the early proposals to the state-of-the-art approaches.

Among the future directions for string constraint solving we mention the four challenges reported in (ecai-str), namely:

  • Extending SCS capabilities to properly handle complex string operations, frequently occurring in web programming (expose-regex), such as back-references, lookaheads/lookbehinds or greedy matching.

  • Improving the efficiency of SCS solvers with new algorithms and search heuristics. At present, SMT solvers tend to fail with long strings, while CP solvers may struggle to prove unsatisfiability.

  • Combining SCS solvers with a portfolio approach (kotthoff2016algorithm) in order to exploit their different nature and uneven performance across different problem instances.

  • Using SCS solvers and related tools in different fields. The best candidates are probably software verification and testing, model checking and cybersecurity.

Finally, we hope that the research in SCS will encourage a closer and more fruitful collaboration between the CP and the SAT/SMT communities.

References