1. Introduction
Strings are everywhere across and beyond Computer Science. They are a fundamental datatype in all the modern programming languages, and operations on strings frequently occur in disparate fields such as software analysis, model checking, database applications, web security, bioinformatics and so on (path_feas; dyn_test_db; mod_check; DBLP:conf/cav/AbdullaACHRRS14; waptec; jsstring; aratha; str_mod_check; DBLP:journals/constraints/BarahonaK08).
Reasoning over strings requires solving arbitrarily complex string constraints, i.e., relations defined on a number of string variables. Typical examples of string constraints are string length, (dis)equality, concatenation, substring, regular expression matching.
With the term “string constraint solving” (in short, string solving or SCS) we refer to the process of modelling, processing, and solving combinatorial problems involving string constraints. We may see SCS as a declarative paradigm which lies at the intersection of constraint solving and combinatorics on words: the user states a problem with string variables and constraints, and a suitable string solver seeks a solution for that problem.
Although works on the combinatorics of words were already published in the 1940s (DBLP:journals/jsyml/Quine46a), the dawn of SCS dates back to the late 1980s, in correspondence with the rise of the Constraint Programming (CP) (handbookCP) and Constraint Logic Programming (CLP) (clp) paradigms. Pioneers in this field were, for example, Trilogy (trilogy), a language providing string, integer and real constraints, and CLP($\Sigma^*$) (clpstar), an instance of the CLP scheme representing strings as regular sets. The latter in particular was the first known attempt to use string constraints like regular membership to denote regular sets.
Later in the 1990s and 2000s, string solving sparked some interest (e.g., (genlang; cpstr; cpgram; hampi09; mona; lpstr; jsa; drple; phpstr)) without however leaving a mark. It was only from the 2010s that SCS finally took hold in application domains where string processing plays a central role, such as test-case generation, software verification, model checking and web security. This increased interest motivated the organisation of the first workshop on string constraints and applications (MOSCA) in 2019 (mosca).
Arguably, the widespread interest in cybersecurity has given new impulse to SCS because strings can be silent carriers of software vulnerabilities (e.g., SQL injections). Another plausible reason is the remarkable performance improvements that constraint solvers have achieved over the last years. Precise reasoning about strings is especially critical for the analysis of the JavaScript language, nowadays the de facto standard for web applications, given the crucial role that strings have in this language (jsstring; aratha).
Over the last decade a large number of different SCS approaches have emerged, roughly falling into three main categories:

Automata-based approaches: relying on finite-state automata to represent the domain of string variables and to handle string operations.

Word-based approaches: based on systems of word equations. They mainly use Satisfiability Modulo Theories (SMT) (smt) solvers to tackle string constraints.

Unfolding-based approaches: they basically expand each string variable $x$ into a number of contiguous elements denoting the characters of $x$ (e.g., $x$ can be mapped into integer variables or bit-vectors).
As we shall see, CP and SMT are the state-of-the-art technologies for solving SCS problems.
The goal of this paper is to provide a comprehensive survey of the various string solving approaches proposed in the literature, ranging from the theoretical foundations to the more practical aspects. After formalising the notion of string constraint solving in Section 2, we provide a detailed review of SCS approaches grouped by category (Sections 3–5). In Section 6 we show the main theoretical results we are aware of, while in Section 7 we focus on the practical aspects by reporting the SCS tools and benchmarks that have been developed. In Section 8 we discuss the related literature before concluding in Section 9.
2. String Constraint Solving
In this section we define the fundamentals of string constraint solving, and in particular we show how SCS is treated from both the CP and the SMT perspective. Before that, we recall some preliminary notions about strings and automata theory.
Let us fix a finite alphabet $\Sigma$, i.e., a set of symbols also called characters. A string (or a word) $w$ is a finite sequence of characters of $\Sigma$, and $|w|$ denotes the length of $w$ (in this work we do not consider infinite-length strings). The empty string is denoted with $\epsilon$. The countable set $\Sigma^*$ of all the strings of $\Sigma$ is inductively defined as follows: (i) $\epsilon \in \Sigma^*$; (ii) if $a \in \Sigma$ and $w \in \Sigma^*$, then $wa \in \Sigma^*$; (iii) nothing else belongs to $\Sigma^*$.
The string concatenation of $v, w \in \Sigma^*$ is denoted by $v \cdot w$ (or simply with $vw$ when not ambiguous). We denote with $w^n$ the iterated concatenation of $w$ for $n$ times, i.e., $w^0 = \epsilon$ and $w^n = w w^{n-1}$ for $n > 0$. Analogously, we define the concatenation between sets of strings: given $V, W \subseteq \Sigma^*$, we denote with $V \cdot W = \{vw \mid v \in V, w \in W\}$ (or simply with $VW$) their concatenation and with $W^n$ the iterated concatenation, i.e., $W^0 = \{\epsilon\}$ and $W^n = W W^{n-1}$ for $n > 0$.
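The two iterated concatenations can be mirrored almost literally in code. The following Python sketch is only an illustration of the definitions; the names `power`, `concat_sets` and `lang_power` are ours and do not come from any of the cited tools.

```python
from itertools import product

EPSILON = ""  # the empty string


def power(w: str, n: int) -> str:
    """Iterated concatenation w^n: w^0 is the empty string, w^n = w w^(n-1)."""
    return EPSILON if n == 0 else w + power(w, n - 1)


def concat_sets(V: set, W: set) -> set:
    """Concatenation of two sets of strings: every v in V followed by every w in W."""
    return {v + w for v, w in product(V, W)}


def lang_power(W: set, n: int) -> set:
    """Iterated concatenation W^n: W^0 = {epsilon}, W^n = W W^(n-1)."""
    return {EPSILON} if n == 0 else concat_sets(W, lang_power(W, n - 1))


print(power("ab", 3))                      # ababab
print(sorted(lang_power({"a", "b"}, 2)))   # ['aa', 'ab', 'ba', 'bb']
```

Taking the set of single characters as the argument of `lang_power` yields all the strings of a given length, which is how fixed-length domains can be enumerated for small alphabets.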
A set of strings $L \subseteq \Sigma^*$ is called a formal language. Formal languages are countable sets; the regular ones can be recognised by well-known models of computation called finite-state automata (FSA). Roughly, a FSA $M$ is a system with a finite number of states where a state transition occurs according to the input string $w$. A number of final or accepting states determines whether or not $w$ belongs to the language $\mathcal{L}(M)$ denoted by $M$.
Different variants and extensions of FSA have been proposed, e.g., deterministic (DFA), non-deterministic (NFA), pushdown automata (PDA, to recognize context-free languages), finite-state transducers (FST, basically FSA with input/output tapes defining relations between sets of strings).
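As a minimal sketch of how a deterministic automaton decides membership, the following Python function follows one transition per input character and checks whether the run ends in an accepting state. The dictionary encoding of the transition function and the example automaton for $(ab)^*$ are assumptions made for illustration only.

```python
def accepts(delta, start, finals, w):
    """Run a deterministic automaton on w; a missing transition rejects."""
    state = start
    for ch in w:
        if (state, ch) not in delta:
            return False
        state = delta[(state, ch)]
    return state in finals


# Hypothetical DFA over {a, b} for the regular language (ab)*:
# state 0 is both initial and accepting, state 1 waits for a 'b'.
DELTA = {(0, "a"): 1, (1, "b"): 0}

print(accepts(DELTA, 0, {0}, "abab"))  # True
print(accepts(DELTA, 0, {0}, "aba"))   # False
```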
A string variable is basically a variable that can only take values in $\Sigma^*$ or, equivalently, whose domain is a formal language of $\Sigma^*$. We can classify string variables into three hierarchical classes:

unbounded-length variables: they can take any value in $\Sigma^*$

bounded-length variables: fixed an integer $\lambda \geq 0$, they can only take values in $\{w \in \Sigma^* \mid |w| \leq \lambda\}$

fixed-length variables: fixed an integer $\lambda \geq 0$, they can only take values in $\{w \in \Sigma^* \mid |w| = \lambda\}$
A string constraint is a relation over at least one string variable. For example, concatenation is a ternary string constraint. Clearly, instead of writing $\cdot(x, y, z)$ we use the more convenient functional notation $z = x \cdot y$. In this paper we will only consider constraints involving strings and (possibly) integers (e.g., string length or iterated concatenation).
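String constraint                          Description
$x = y$, $x \neq y$                        equality, inequality
$x < y$, $x \leq y$, $x \geq y$, $x > y$   lexicographic ordering
$n = |x|$                                  string length
$z = x \cdot y$, $y = x^n$                 concatenation, iterated concatenation $n$ times
$y = x^{-1}$                               string reverse
$y = x[i..j]$                              substring from index $i$ to index $j$
$n = \mathrm{find}(x, y)$                  $n$ is the index of the first occurrence of $x$ in $y$
$y = \mathrm{replace}(x, q, q')$           $y$ is obtained by replacing the first occurrence of $q$ with $q'$ in $x$
$y = \mathrm{replaceAll}(x, q, q')$        $y$ is obtained by replacing all the occurrences of $q$ with $q'$ in $x$
$n = \mathrm{count}(a, x)$                 $n$ is the number of occurrences of character $a$ in $x$
$x \in \mathcal{L}(R)$                     membership of $x$ in the regular language denoted by $R$
$x \in \mathcal{L}(G)$                     membership of $x$ in the context-free language denoted by grammar $G$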
Table 1 summarises the main string constraints that one can find in the literature, from which it is possible to derive other constraints (e.g., to replace the last occurrence of a string one can combine first-occurrence replacement and string reverse). The character-counting constraint is also referred to as global cardinality count (GCC) (mznstrings). A good SCS approach should be able to handle most of them. Note that in this work we consider quantifier-free constraints. In simple terms, the goal of string constraint solving is to determine whether or not a set of string constraints is feasible. As we shall see, this task can be tackled equivalently with both CP and SMT technologies.
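The derivation of "replace last" mentioned above can be sketched in a few lines of Python: reversing the subject, the pattern and the replacement turns the last occurrence into the first one. The function names are ours, and Python's built-in `str.replace` with a count of 1 stands in for the first-occurrence replacement constraint.

```python
def reverse(x: str) -> str:
    """String reverse."""
    return x[::-1]


def replace_first(x: str, q: str, r: str) -> str:
    """Replace the first occurrence of q with r in x (x is unchanged if q does not occur)."""
    return x.replace(q, r, 1)


def replace_last(x: str, q: str, r: str) -> str:
    """Replace the last occurrence of q in x, using only replace_first and reverse."""
    return reverse(replace_first(reverse(x), reverse(q), reverse(r)))


print(replace_last("abcabc", "bc", "X"))  # abcaX
```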
2.1. SCS from a CP perspective
From a constraint programming point of view, string constraint solving means solving a particular case of Constraint Satisfaction Problem (CSP). Formally, a CSP is a triple $\langle X, D, C \rangle$ where: $X = \{x_1, \dots, x_n\}$ are the variables; $D = \{D(x_1), \dots, D(x_n)\}$ are the domains, where for $i = 1, \dots, n$ each $D(x_i)$ is a set of values that $x_i$ can take; $C$ are the constraints, i.e., relations over the variables of $X$ defining the feasible values for the variables.
The goal is to find a solution of the CSP, which is basically an assignment $\xi = \{x_1 \mapsto d_1, \dots, x_n \mapsto d_n\}$ such that $d_i \in D(x_i)$ for $i = 1, \dots, n$ and $(\xi(x_{i_1}), \dots, \xi(x_{i_k})) \in c$ for each constraint $c \in C$ defined over variables $x_{i_1}, \dots, x_{i_k}$. The CSP notion can be naturally extended to optimization problems: we just add an objective function that maps each solution to a numerical value to be minimised or maximised.
To find a solution, CP solvers use two main combined techniques: propagation, which works on individual constraints trying to prune the domains of the variables involved until a fixpoint is reached, and branching, which aims to find a solution via heuristic search (the propagation process is not complete in general).
Fixed an alphabet $\Sigma$, we call a CSP with strings, or $\Sigma$-CSP, a CSP $\langle X, D, C \rangle$ having $k > 0$ string variables $x_1, \dots, x_k \in X$ such that $D(x_i) \subseteq \Sigma^*$ for $i = 1, \dots, k$, and a number of constraints in $C$ over such variables. To find a solution, a CP solver can try to compile down a $\Sigma$-CSP into a CSP with only integer variables (mznstrings), or it can define specialised string propagators and branchers (gecode_s; dashedstring; sweepbased). The latter approach has proved to be much more efficient.
For example, consider CSP where are string variables with associated alphabet and is an integer variable. Propagating will exclude string from because it has length 4, while the domain of is the interval . An optimal propagator for would narrow the domain of from to . Note that propagation is a compromise between effectiveness (how many values are pruned) and efficiency (the computational cost of pruning), so sometimes it makes sense to settle for efficient but suboptimal propagators. Then, will narrow the domain of to singleton (which actually means assigning the value to ). At this stage, a fixpoint is reached, i.e., no more propagation is possible: we have to branch on to possibly find a solution. Let us suppose that the variable choice heuristics selects variable and the value choice heuristics assigns to it the value ; in this case the propagator of is able to conclude that so a feasible solution for (not the only one) is .
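The interplay between propagation and branching can be sketched with a toy solver over explicitly enumerated domains. The Python code below is a hypothetical instance in the same spirit as the example above, not a reconstruction of it: the variables `x`, `y`, `z`, `n`, their domains and the two constraints are our own choices, and propagation is done by brute-force support checking rather than by the specialised propagators used in real CP solvers.

```python
from itertools import product


def propagate(domains, constraints):
    """Remove unsupported values until no constraint can prune any further."""
    changed = True
    while changed:
        changed = False
        for scope, holds in constraints:
            for i, var in enumerate(scope):
                supported = {
                    combo[i]
                    for combo in product(*(domains[v] for v in scope))
                    if holds(*combo)
                }
                if supported != domains[var]:
                    domains[var] = supported
                    changed = True
    return all(domains.values())  # False as soon as one domain is empty


def solve(domains, constraints):
    """Propagate to a fixpoint, then branch on the smallest open domain."""
    domains = {v: set(d) for v, d in domains.items()}
    if not propagate(domains, constraints):
        return None
    open_vars = [v for v in domains if len(domains[v]) > 1]
    if not open_vars:
        return {v: next(iter(d)) for v, d in domains.items()}
    var = min(open_vars, key=lambda v: len(domains[v]))  # variable choice
    for value in sorted(domains[var]):                   # value choice
        solution = solve({**domains, var: {value}}, constraints)
        if solution is not None:
            return solution
    return None


# Hypothetical problem: n = |x| and z = x . y over small explicit domains.
DOMAINS = {
    "x": {"a", "ab", "abba"},
    "y": {"a", "b", "ba"},
    "n": {1, 2, 3},
    "z": {"aba", "abb", "abba"},
}
CONSTRAINTS = [
    (("x", "n"), lambda x, n: len(x) == n),
    (("x", "y", "z"), lambda x, y, z: x + y == z),
]

print(solve(DOMAINS, CONSTRAINTS))
# {'x': 'a', 'y': 'ba', 'n': 1, 'z': 'aba'}
```

In this run the initial propagation removes "abba" from the domain of `x` (no admissible length equals 4) and the value 3 from the domain of `n`, after which a fixpoint is reached and the search branches on `x`.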
Note that virtually all the CSPs referred to in the literature have finite domains, i.e., the cardinality of each domain is bounded. Having finite domains guarantees the decidability of CSPs (which are in general NP-complete) by enumeration, but at the same time prevents the use of unbounded-length string variables in CSPs with strings.
As we shall see in Section 5, all the effective CP approaches for string solving are unfolding-based and do not handle unbounded-length variables. In fact, although CP provides the $\mathrm{regular}(x, M)$ constraint, stating that $x$ must belong to the language denoted by the finite-state machine $M$, it is also true that the string variable (or array of integer variables) $x$ must have a fixed (pesant04regular) or bounded (regular) length. The only C(L)P proposals we are aware of that handle unbounded-length strings (via regular sets) are (clpstar) and (cpstr). These automata-based approaches are however outdated.
2.2. SCS from a SMT perspective
In a nutshell, Satisfiability Modulo Theories generalises the Boolean satisfiability problem to decide whether a formula in first-order logic is satisfiable with respect to some background theory that fixes the interpretations of predicates and functions (smt). Note that SMT theories can be arbitrarily enriched and combined together.
Over the last decades, several decision procedures have been developed to tackle the most disparate theories and sub-theories, including the theories of (non)linear arithmetic, bit-vectors, floating points, arrays, difference logic and uninterpreted functions. In particular, well-known SMT solvers such as CVC4 (cvc4str) and Z3 (z3) implement the theory of strings (often in conjunction with related theories, such as linear arithmetic for length constraints and regular expressions).
For example, the quantifier-free theory of strings (or word equations) and linear arithmetic deals with integers and unbounded-length strings in $\Sigma^*$, where $\Sigma$ is a given alphabet. Its terms are string/integer variables/constants, concatenation and length. The formulas of this theory are (dis)equalities between strings and linear arithmetic constraints. For example, the formula where and are string variables is well-formed for this theory. Unfortunately, the decidability of this theory is still unknown (mosca).
As we shall see in Section 4, over the last years a growing number of modern SMT solvers have integrated the theory of strings. Most of them are based on the DPLL(T) procedure (DPLLT). DPLL(T) is a general framework extending the original DPLL algorithm (tailored for SAT solving) to deal with an arbitrary theory T through the interaction between a SAT solver and a solver specific for T. In a nutshell, DPLL(T) lazily decomposes a SMT problem into a SAT formula, which is handled by a DPLL-based SAT solver which in turn interacts with a theory-specific solver for T, whose job is to check the feasibility of the formulas returned by the SAT solver.
As an example, let us consider the above formula (which is unsatisfiable). The formula is translated into a Boolean formula, which is handled by a SAT solver that can return “unsatisfiable” or a satisfying assignment (the latter does not imply that the overall formula is satisfiable). In the latter case, the constraints corresponding to such an assignment are distributed to the different theories.
For example, if the assignment is returned, the constraints and are delivered to the string solver, while will be solved by an arithmetic solver. Now, the theory solvers can either find that the constraints are satisfiable or return theory lemmas to the SAT solver. For example, the string solver might return to the SAT solver, which will add the corresponding clause to its knowledge base. The SAT solver will then produce a new assignment (or return “unsatisfiable”) and the process will be iteratively repeated until either (un)satisfiability of is proven or a resource limit is reached (if the theory is not decidable, termination is not guaranteed in general).
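The lazy interaction between the Boolean level and the theory level can be illustrated with a toy loop. The Python sketch below is a strong simplification made under our own assumptions: the atoms `p`, `q`, `r` and the formula are hypothetical, the "SAT solver" enumerates truth assignments, the "theory solver" enumerates strings up to `MAX_LEN` (so an "unsat" answer is only relative to that bound), and every theory lemma simply blocks the whole rejected assignment. Real DPLL(T) solvers are incremental and learn much stronger lemmas.

```python
from itertools import product

ALPHABET, MAX_LEN = "ab", 3

# Hypothetical theory atoms over two string variables x and y.
ATOMS = {
    "p": lambda x, y: x == y + "a",       # x = y . a
    "q": lambda x, y: len(x) <= len(y),   # |x| <= |y|
    "r": lambda x, y: x == "b" + y,       # x = b . y
}


def strings():
    """All strings over ALPHABET of length at most MAX_LEN."""
    for n in range(MAX_LEN + 1):
        for chars in product(ALPHABET, repeat=n):
            yield "".join(chars)


def theory_check(model):
    """Bounded search for strings agreeing with every literal of the model."""
    for x, y in product(list(strings()), repeat=2):
        if all(ATOMS[a](x, y) == value for a, value in model.items()):
            return {"x": x, "y": y}
    return None


def sat_models(clauses):
    """Naive SAT solver: enumerate the truth assignments satisfying all clauses."""
    names = sorted(ATOMS)
    for values in product([True, False], repeat=len(names)):
        model = dict(zip(names, values))
        if all(any(model[a] == pol for a, pol in clause) for clause in clauses):
            yield model


def lazy_smt(clauses):
    """Alternate Boolean search and theory checks, learning blocking clauses."""
    clauses, lemmas = list(clauses), 0
    while True:
        model = next(sat_models(clauses), None)
        if model is None:
            return "unsat", None, lemmas      # unsat within the length bound
        witness = theory_check(model)
        if witness is not None:
            return "sat", witness, lemmas
        clauses.append([(a, not value) for a, value in model.items()])
        lemmas += 1


# p and (q or not r), as a list of clauses of (atom, polarity) literals.
print(lazy_smt([[("p", True)], [("q", True), ("r", False)]]))
# ('sat', {'x': 'a', 'y': ''}, 2)
```

Here the first two Boolean assignments proposed by the enumerator make both `p` and `q` true, which no pair of strings can satisfy, so two lemmas are learnt before a theory-consistent assignment is found.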
3. Automata-based SCS approaches
As mentioned above, the domain of a string variable is a formal language, i.e., a potentially infinite (yet countable) set. A natural way to denote these sets is through (extensions of) finite-state automata. It is therefore unsurprising that the early string solving approaches were based on FSA, possibly enriched with other data structures.
We can say that a SCS approach is automata-based if the string variables are mainly represented by automata, and the string constraints are mainly mapped into corresponding automata operations.
As mentioned above, CLP($\Sigma^*$) was one of the first attempts to incorporate strings in the CLP framework to strengthen the standard string-handling features such as concatenation and substring (clpstar). This approach was further developed by Golden et al. (cpstr) about 15 years later. Their main contribution was to use FSA to represent regular sets. In (DBLP:conf/aaai/HansenA07) Hansen et al. use deterministic FSA (DFA) and binary decision diagrams (BDDs) to handle interactive configuration on string variables.
The $\mathrm{regular}$ global constraint proposed in (pesant04regular) makes it possible to treat a fixed-size array of integer variables as a fixed-length string belonging to the regular language denoted by a given (non)deterministic FSA. Note that $\mathrm{regular}$ was introduced to solve finite-domain CP problems like rostering and car sequencing, and was not targeted at string solving; in fact, it is a useful support that has been used in different CP applications. Its natural extension is the $\mathrm{grammar}$ constraint (cpgram; gramcons), where instead of a FSA we have a context-free grammar. However, $\mathrm{grammar}$ never reached the popularity of $\mathrm{regular}$ in the CP community.
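To give an idea of how a fixed-length array can be filtered against an automaton, the Python sketch below unrolls a deterministic automaton over the positions of the array, computes the states reachable from the initial state (forward pass) and those from which an accepting state can be reached (backward pass), and keeps a value only if it labels a transition connecting the two. This is a simplified illustration in the spirit of layered-graph filtering, not the algorithm of (pesant04regular); the function name, the dictionary encoding of the automaton and the example for $(ab)^*$ are our assumptions.

```python
def filter_regular(domains, delta, start, finals):
    """Prune per-position domains so that each value lies on an accepting path."""
    n = len(domains)
    # forward[i]: states reachable after reading the first i characters
    forward = [{start}] + [set() for _ in range(n)]
    for i in range(n):
        for q in forward[i]:
            for c in domains[i]:
                if (q, c) in delta:
                    forward[i + 1].add(delta[(q, c)])
    # backward[i]: reachable states at layer i that can still reach a final state
    backward = [set() for _ in range(n)] + [forward[n] & set(finals)]
    for i in range(n - 1, -1, -1):
        for q in forward[i]:
            for c in domains[i]:
                if delta.get((q, c)) in backward[i + 1]:
                    backward[i].add(q)
    # keep c at position i only if some transition on c joins the two layers
    return [
        {c for c in domains[i]
         if any(delta.get((q, c)) in backward[i + 1] for q in backward[i])}
        for i in range(n)
    ]


# Hypothetical DFA for (ab)*: state 0 is initial and accepting.
DELTA = {(0, "a"): 1, (1, "b"): 0}

print(filter_regular([{"a", "b"}] * 4, DELTA, 0, {0}))
# [{'a'}, {'b'}, {'a'}, {'b'}]
print(filter_regular([{"a", "b"}] * 3, DELTA, 0, {0}))
# [set(), set(), set()]
```

With four positions the only accepted string is "abab", so every domain collapses to a single character; with three positions no string of that length is accepted and all the domains become empty.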
An interesting paper about automata-based approaches is (autoeval), where Hooimeijer et al. study a comprehensive set of algorithms and data structures for automata operations in order to give a fair comparison between different automata-based SCS frameworks (mona; drple; jsa; phpstr; rex; stranger; pass). According to their experiments, the best results were achieved when using BDDs in combination with lazy versions of automata intersection and difference.
MONA (mona) is a tool developed in the ’90s that acts as a decision procedure for Monadic Second-Order Logic (M2L) and as a translator to finite-state automata based on BDDs. FIDO (fido) is a domain-specific programming formalism translated first into pure M2L via suitable encodings, and finally into FSA through the MONA tool. Another M2L-based solver is PISA (pisa), a path- and index-sensitive string solver that is applicable for static analysis.
DPRLE (drple) is a SCS approach to solve equations over regular language string variables. The authors provide automata-based decision procedures to tackle the Regular Matching Assignments problem and a subclass, the Concatenation-Intersection problem.
StrSolve (strsolve) is a decision procedure supporting similar operations to those allowed by DPRLE, but efficiently produces single witnesses rather than atomically generating entire solution sets. Nevertheless, its worst-case performance corresponds to that of DPRLE.
JSA (jsa) is a string analysis framework that first transforms a Java source into a flow graph (front-end), and then derives FSA from such a graph (back-end). In particular, JSA uses well-founded hierarchical directed acyclic graphs of non-deterministic FSA called multi-level automata (MLFA).
In (phpstr) Minamide developed a string analyzer for the PHP scripting language to detect cross-site software vulnerabilities and to validate pages they generate dynamically. The analyzer has a library to manipulate formal languages including automata, transducers and context-free grammars.
Rex (rex) is a tool based on the Z3 solver (z3) for symbolically expressing and analyzing regular expression constraints. It relies on symbolic finite-state automata (SFA) where moves are labeled by formulas instead of individual characters. SFAs are then translated into axioms describing the acceptance conditions.
SUSHI (sushi) is a string solver based on the Simple Linear String Equation (SISE) formalism (sise) to represent path conditions and attack patterns. To solve SISE constraints, the authors use an automata-based approach. Finite-state transducers are used to model the semantics of regular substitution.
Stranger (stranger) is an automata-based tool for finding and eliminating string-related security vulnerabilities in PHP applications. It uses symbolic forward and backward reachability analyses to compute the possible values that string expressions can take during program execution.
PASS (pass) is a string solver using parameterized arrays as the main data structure to model strings, and converts string constraints into quantified expressions that are solved through quantifier elimination. In addition, PASS uses an automaton model to handle regular expressions and reason about string values faster.
SLOG (slog) is a string analysis tool based on a NFA manipulation engine with logic circuit representation. Automata manipulations can be performed implicitly using logic circuits while determinization is largely avoided. SLOG also supports symbolic automata and enables the generation of counterexamples.
Sloth (sloth) is based on the reduction of satisfiability of formulae in the straight-line fragment and in the acyclic fragment to the emptiness problem of alternating finite-state automata (AFAs). Sloth can handle string constraints with concatenation, finite-state transducers, and regular constraints.
OSTRICH (ostrich) is a string solver providing built-in support for concatenation, reverse, functional transducers (FFT), and $\mathrm{replaceAll}$. It can be seen as an extension of Sloth in the sense that the decision algorithm of OSTRICH can reduce the problem to constraints handled by Sloth when it is not possible to avoid non-determinism.
In (DBLP:journals/jip/ZhuAM19) Zhu et al. proposed a SCS procedure where atomic string constraints are represented by streaming string transducers (SSTs) (sst). A straight-line constraint is satisfiable if and only if the domain of the composed streaming string transducer is not empty.
Pros and Cons
Automata make it possible to represent infinite sets of strings with finite machines, hence they are a natural and elegant way to represent unbounded-length strings. Also, the theory of automata (automata) is well defined and studied.
Unfortunately, the performance of automata-based approaches for string constraint solving has been hampered by two main factors: (i) the possible state explosion due to automata operations (e.g., the intersection of two DFA may require a number of states equal to the product of the numbers of states of the operands); (ii) the integration with other domains and theories (integers in particular).
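The growth caused by intersection can be observed on a small scale with the classical product construction. In the Python sketch below, which only explores the reachable pairs of states, the two operand automata are hypothetical examples of ours: one accepts the strings with an even number of a's (2 states), the other the strings whose length is a multiple of 3 (3 states).

```python
def intersect(d1, s1, f1, d2, s2, f2, alphabet):
    """Product construction restricted to the reachable pairs of states."""
    start = (s1, s2)
    delta, finals = {}, set()
    seen, todo = {start}, [start]
    while todo:
        p, q = todo.pop()
        if p in f1 and q in f2:
            finals.add((p, q))
        for c in alphabet:
            if (p, c) in d1 and (q, c) in d2:
                target = (d1[(p, c)], d2[(q, c)])
                delta[((p, q), c)] = target
                if target not in seen:
                    seen.add(target)
                    todo.append(target)
    return delta, start, finals, seen


# EVEN_A flips its state on 'a'; LEN3 counts the length modulo 3.
EVEN_A = {(s, c): (s + 1) % 2 if c == "a" else s for s in range(2) for c in "ab"}
LEN3 = {(s, c): (s + 1) % 3 for s in range(3) for c in "ab"}

delta, start, finals, states = intersect(EVEN_A, 0, {0}, LEN3, 0, {0}, "ab")
print(len(states))  # 6, i.e., 2 * 3 reachable product states
```

Chaining several intersections multiplies the sizes again at each step, which is one source of the state explosion mentioned above.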
For these reasons, SCS approaches purely based on automata have little success nowadays. However, for some particular classes of SCS problems they still might be the best option.
4. Word-based SCS approaches
As mentioned in Section 2, strings are also called words. What we call string constraints are, in a more algebraic terminology, sometimes called word equations (makanin1977). To be more precise, a word equation is a particular string constraint of the form $L = R$ with $L, R \in (\Sigma \cup V)^*$, where $\Sigma$ is an alphabet and $V$ is a set of variables.
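To make the definition concrete, the following Python sketch represents each side of a word equation as a sequence of symbols, where a symbol is either a character or a variable name, and looks for solutions by brute-force enumeration up to a length bound. The enumeration is only meant to illustrate what a solution is; it is not how word-based solvers work, since they rely on algebraic techniques over unbounded strings. The function name and the encoding are our assumptions, and variable names are assumed to be disjoint from the alphabet.

```python
from itertools import product


def solutions(lhs, rhs, variables, alphabet, max_len):
    """Enumerate assignments (strings up to max_len) that make both sides equal."""
    words = ["".join(t) for n in range(max_len + 1)
             for t in product(alphabet, repeat=n)]

    def substitute(side, env):
        # Variables are replaced by their value, characters are kept as they are.
        return "".join(env.get(symbol, symbol) for symbol in side)

    for values in product(words, repeat=len(variables)):
        env = dict(zip(variables, values))
        if substitute(lhs, env) == substitute(rhs, env):
            yield env


# The equation X a = a X holds exactly when X consists of a's only.
print([env["X"] for env in solutions(["X", "a"], ["a", "X"], ["X"], "ab", 3)])
# ['', 'a', 'aa', 'aaa']
```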
We can say that a SCS approach is word-based if it is based on the theory of word equations, possibly enriched with other theories (e.g., integers or regular expressions). Word-based approaches rely on algebraic techniques for natively solving (quantifier-free) string constraints over the (extended) theory of unbounded strings, without reducing to other data types such as bit-vectors or automata. The natural candidates for implementing word-based approaches are SMT solvers, which can incorporate and integrate the theory of strings in their frameworks.
In 2014, Liang et al. (cvc4str) argued that: “Despite their power and success as back-end reasoning engines, general multi-theory SMT solvers so far have provided minimal or no native support for reasoning over strings […] until very recently the available string solvers were stand-alone tools that […] imposed strong restrictions on the expressiveness […] Traditionally, these solvers were based on reductions to satisfiability problems over other data types”. However, in the following years the situation has changed, and more and more word-based string solving approaches have emerged.
In (cvc4str) the authors integrated a word-based SCS approach into the well-known SMT solver CVC4 (cvc4). They used a DPLL(T) approach for solving (quantifier-free) constraints natively over the theory of unbounded strings with length and regular language membership. The authors claimed that CVC4 was the first solver able to reason about a language of mixed constraints including strings together with integers, reals, arrays, and algebraic datatypes. This work has been revised and extended in the following years to handle extended string functions frequently occurring in security and verification applications (cvc4ext1; cvc4ext2).
Meanwhile, another well-established SMT solver, namely Z3 (z3), started to develop string solving capabilities. Z3str, an extension of Z3 to solve string constraints, was introduced in (z3str). Z3str was the progenitor of a number of different word-based string solvers built on top of Z3. Z3str2 (z3str2jnl) extended Z3str by including overlapping variables detection and new heuristics. Z3str3 (z3str3) added a technique called theory-aware branching to take into account the structure of theory literals for computing branching activities. Z3strBV (z3strbv) is instead a solver for the theory of string equations, string length represented as bit-vectors, and bit-vector arithmetic, aimed at the software verification, testing, and security analysis of C/C++ programs.
Norn is a SMT solver introduced in (norn) for an expressive constraint language including word equations, length constraints, and regular membership queries. Norn is based on a decision procedure under the assumption of a set of acyclicity conditions on word equations, where acyclicity is a syntactic condition ensuring that no variable appears more than once in word (dis)equalities.
Trau (trau) is a SMT string solver based on the flattening technique introduced in (flattrau), where flat automata are used to capture simple patterns of common constraints. It relies on a Counter-Example Guided Abstraction Refinement (CEGAR) framework (cegar) where an under- and an over-approximation module interact to increase the string solving precision. In addition, Trau implements string transduction by reduction to context-free membership constraints.
S3 is a symbolic string solver (s3) motivated by the analysis of web programs working on string inputs. S3 can be viewed as an extension of the aforementioned Z3str (z3str) solver to handle regular expressions. Its successor, S3P (s3p), guides the search towards a “minimal solution” for satisfiable problems and enables conflict clause learning for the string theory. The latest version, called S3# (s3hash), implements an algorithm for counting the models of string constraints.
Pros and Cons
State-of-the-art word-based approaches are more general, more flexible and often more efficient than automata-based approaches. They can be built on top of well-known SMT solvers and integrated with already defined theories. Furthermore, they can natively handle unbounded-length strings.
On the downside, most of these approaches are incomplete and suffer from performance issues due to the disjunctive reasoning of the underlying DPLL(T) paradigm (DPLLT). In particular, some experimental evaluations (dashedstring; sweepbased) showed that they may encounter difficulties when dealing with big string constants.
5. Unfolding-based SCS approaches
A straightforward way of solving string constraints is to encode them into other well-known types, for which well-established constraint solving techniques already exist, such as Booleans, integers or bit-vectors.
A SCS approach is unfolding-based if each string variable is unfolded into a homogeneous sequence of variables of a different type, and each string constraint is accordingly mapped into a constraint over that type.
Note that an unfolding approach inherently needs an upper bound $\lambda$ on the string length. So, these approaches can handle fixed-length or bounded-length string variables, but cannot deal with unbounded-length variables.
A proper choice of the upper bound $\lambda$ is crucial. If $\lambda$ is too small, one cannot capture solutions that require longer strings. On the other hand, too large a value for $\lambda$ can significantly worsen the SCS performance even for trivial problems.
Another important choice is whether to unfold eagerly (i.e., statically, before the actual solving process) or lazily (i.e., dynamically, during the solving process).
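The effect of the length bound in an eager unfolding can be sketched as follows. In this Python example a single string variable is unfolded, before solving, into a fixed number of integer cells, with 0 used as a padding code for the "empty character" and padding allowed only as a suffix. The encoding, the function names and the sample constraint are assumptions of ours, and plain enumeration stands in for the underlying integer solver.

```python
from itertools import product

PAD = 0  # hypothetical integer code for the "empty character"


def length(cells):
    """String length of an unfolded variable: number of non-padding cells."""
    return sum(1 for v in cells if v != PAD)


def well_formed(cells):
    """Padding may only occur as a suffix of the array."""
    k = length(cells)
    return all(v != PAD for v in cells[:k])


def fold(cells, alphabet):
    """Decode an integer array back into a string (character codes start at 1)."""
    return "".join(alphabet[v - 1] for v in cells if v != PAD)


def solve_unfolded(constraint, bound, alphabet):
    """Unfold one string variable into `bound` integer cells and enumerate them."""
    for cells in product(range(len(alphabet) + 1), repeat=bound):
        if well_formed(cells) and constraint(cells):
            return fold(cells, alphabet)
    return None


def palindrome_with_two_letters(cells):
    """Integer version of: |x| >= 4, x equals its reverse, x uses two distinct characters."""
    k = length(cells)
    chars = cells[:k]
    return k >= 4 and chars == chars[::-1] and len(set(chars)) == 2


print(solve_unfolded(palindrome_with_two_letters, 3, "ab"))  # None: the bound is too small
print(solve_unfolded(palindrome_with_two_letters, 4, "ab"))  # abba
```

With bound 3 the solution is missed, while every extra cell multiplies the number of candidate arrays by the alphabet size plus one, which is the trade-off behind the choice of the bound.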
Hampi (hampi09; hampi12) was probably the first SMT-based approach encoding string constraints into constraints over bit-vectors, solved by the underlying STP solver (stp). Its first version (hampi09) only allowed one fixed-length string variable. Its subsequent version (hampi12) added a number of optimisations and, in particular, provided support for word equations and bounded-length string variables.
Kaluza (kaluza) was the back-end solver used by Kudzu, a symbolic execution framework for JavaScript code analysis. Similarly to Hampi, Kaluza deals with string constraints over bounded-length variables by translating them into bit-vector constraints solved with the STP solver (stp). In fact, Kaluza can be seen as an extension of the first version of Hampi (hampi09).
Mapping strings into bit-vectors is suitable for software analysis applications, especially when it comes to precisely handling overflows via wrapped integer arithmetic. Plenty of SMT solvers for quantifier-free bit-vector formulas exist (often relying on bit blasting), while the CP support for bit-vectors appears limited (bit_vector).
As an alternative to bit-vectors, the simplest approach is to translate string variables into arrays of integer variables encoding the string characters (possibly using a special padding value for the “empty character”). Mapping to integers is probably the most common approach for C(L)P solvers (mznstrings).
For example, in (ocl) the authors describe a lightweight solver relying on CLP and the Constraint Handling Rules (CHR) paradigm (chr) in order to generate large solutions for tractable string constraints in model finding. This approach unfolds string variables by first labelling their lengths and domains, and then their characters.
As mentioned above, CP solvers can use fixed-length or bounded-length arrays of integer variables to deal with $\mathrm{regular}$, $\mathrm{grammar}$ and other string constraints (pesant04regular; cpgram; KadiogluS10; HeFPZ13; Maher09; mznstrings) without a native support for string variables. However, as shown in (gecode_s; mznstrings; dashedstring; sweepbased), having dedicated propagators for string variables can make a difference.
In (ScottFP13) Scott et al. presented a prototypical bounded-length approach based on the affix domain to natively handle string variables. This domain allows one to reason about the content of string suffixes even when the length is unknown by using a padding symbol at the end of the string.
The (ScottFP13) approach was subsequently improved in (bound_str; gecode_s) with a new structured variable type for strings called Open-Sequence Representation, for which suitable propagators are defined. This approach has been implemented in the Gecode solver (gecode) and is referred to as Gecode+ in (mznstrings).
To mitigate the dependency on the length bound $\lambda$ and enable lazy unfolding, a fairly recent CP approach based on dashed strings has been introduced (dashedstring; sweepbased). Dashed strings are concatenations of distinct sets of strings (called blocks) used to represent in a compact way the domain of string variables with potentially very big length. Dashed string propagators have been defined for a number of constraints (sweepbased; lexfind; regular) and implemented in the G-Strings solver (gstrings).
Pros and Cons
The unfolding-based approaches allow one to take advantage of already defined theories and propagators without explicitly implementing a support for strings. Experimental results show that unfolding approaches, and in particular the CP-based dashed string approach, can be quite effective, especially for SCS problems involving long strings.
However, unfolding approaches also have limitations. The most obvious one is the impossibility of handling unbounded-length strings. This can be negligible, provided that a good value of the length bound $\lambda$ is chosen. Unfortunately, deciding a good value of $\lambda$ is not always trivial. CP solvers can be more efficient than SMT solvers for satisfiable SCS problems, but they may fail on unsatisfiable problems with large domains, and on problems with a lot of logical disjunctions.
6. Theoretical aspects
In this section we briefly explore the theoretical side of string constraint solving, focusing in particular on the results that have been applied to the SMT and CP fields. For more insights about the theory of automata and combinatorics on words we refer the reader to (automata; automata_1; automata_2; automata_3; automata_4; lothaire1997combinatorics; choffrut1997combinatorics; lothaire2002algebraic; lothaire2005applied; berstel2007origins).
Algebraically speaking, the structure $(\Sigma^*, \cdot, \epsilon)$ is a free monoid on $\Sigma$. The free semigroup on $\Sigma$ is instead the semigroup $(\Sigma^+, \cdot)$ where $\Sigma^+ = \Sigma^* \setminus \{\epsilon\}$. There is therefore a number of theoretical results for free semigroups that can be applied to string solving (makanin1977; Va_enin_1983; Durnev1995UndecidabilityOT).
Over the years the complexity and the decidability of combinatorial problems over finite-length strings have been deeply studied (DBLP:journals/jsyml/Quine46a; ferebee1972; makanin1977; Va_enin_1983; Buchi1990; Durnev1995UndecidabilityOT; DBLP:journals/jsyml/Lob53; DBLP:journals/jacm/KarhumakiMP00; DBLP:journals/jacm/Plandowski04). Much progress has been made, but many questions remain open, especially when the language is enriched with new predicates (DBLP:conf/hvc/GaneshMSR12). For example, the decidability of the theory of strings and linear arithmetic, mentioned in Section 2 and used by the CVC4 solver, is still unknown. In (openwords), Néraud provided a list of open problems and conjectures for combinatorics on words.
As also noted in (DBLP:conf/atva/AbdullaADHJ19), DPLL(T)-based string solvers like Z3str2 (z3str2), Z3str3 (z3str3), CVC4 (cvc4str), S3 (s3; s3p; s3hash), Norn (norn), Trau (trau), Sloth (sloth), and OSTRICH (ostrich) handle a variety of string constraints, but they are incomplete for the full combination of those constraints. They therefore often restrict themselves to a particular fragment of the individual string constraints.
In (DBLP:conf/hvc/GaneshMSR12) Ganesh et al. proved several (un)decidability results for word equations with length constraints and membership in regular sets. They proved in particular three main theorems: (i) the undecidability of the validity problem for the set of sentences written as a $\forall\exists$ quantifier alternation applied to positive word equations; (ii) the decidability of quantifier-free formulas over word equations in solved form and length constraints; (iii) the decidability of quantifier-free formulas over word equations in regular solved form, length constraints, and the membership predicate over regular expressions.
The work in (DBLP:conf/rp/DayGHMN18) extends (DBLP:conf/hvc/GaneshMSR12) by considering the (un)decidability of many fragments of the first-order theory of word equations and their extensions. The authors show that when extended with several natural predicates on words, the existential fragment becomes undecidable. Moreover, they prove that deciding whether solutions exist for a restricted class of equations, augmented with many of the predicates leading to undecidability in the general case, is possible in nondeterministic polynomial time.
A. W. Lin et al. (DBLP:conf/popl/LinB16) studied the decidability of string logics with concatenations and finite-state transducers as atomic operations. They show that the straight-line fragment of the logic is decidable (complexity ranges from PSPACE to EXPSPACE). This fragment can express constraints required for analysing mutation XSS in web applications and remains decidable in the presence of length, letter-counting, regular, indexOf, and disequality constraints.
In (DBLP:journals/pacmpl/ChenCHLW18) Chen et al. provide a systematic study of straight-line string constraints with the replace function and regular membership as basic operations. They show that a large class of such constraints (i.e., when only a constant string or a regular expression is permitted in the pattern) is decidable. They also show undecidability results when variables or length constraints are permitted in the pattern parameter of the replace function.
In (ostrich) Chen et al. define a first-order language based on the .NET string library functions and prove some decidability properties for (fragments of) that language. In particular, based on the work of (Buchi1990), the authors prove that the path feasibility problem for the library language is undecidable.
In (DBLP:conf/atva/AbdullaADHJ19) Abdulla et al. address string constraints combining transducers, word equations, and length constraints. Since this problem is undecidable in general, they propose a new fragment of string constraints, called weakly chaining, and prove that it is decidable.
Recompression (Jez16) is a technique based on local modification of variables and iterative replacement of pairs of letters that can be seen as a bottom-up compression of the solution of a given word equation. Recompression can be used for the proof of satisfiability of word equations (Jez16) or for grammar-based compression (Jez15).
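The two local compression steps can be illustrated on a single concrete word. The Python sketch below is a strong simplification: the actual technique operates on both sides of a word equation containing variables, whereas this example only shows the replacement of blocks and of pairs of distinct letters by fresh symbols. The function names and the fresh-symbol naming scheme are hypothetical.

```python
from itertools import groupby

def compress_blocks(word, table):
    """Replace every maximal block a^k (k >= 2) by a single fresh symbol."""
    out = []
    for sym, grp in groupby(word):
        k = len(list(grp))
        if k == 1:
            out.append(sym)
        else:
            # Reuse the same fresh symbol for the same (letter, exponent).
            out.append(table.setdefault((sym, k), f"<{sym}^{k}>"))
    return out

def compress_pair(word, a, b, table):
    """Replace every occurrence of the pair ab (a != b) by a fresh symbol.

    Since a != b, two occurrences of ab cannot overlap, so a single
    left-to-right scan replaces all of them unambiguously."""
    assert a != b
    out, i = [], 0
    while i < len(word):
        if i + 1 < len(word) and word[i] == a and word[i + 1] == b:
            out.append(table.setdefault((a, b), f"<{a}{b}>"))
            i += 2
        else:
            out.append(word[i])
            i += 1
    return out

table = {}
w = compress_blocks(list("aabab"), table)   # ['<a^2>', 'b', 'a', 'b']
w = compress_pair(w, "a", "b", table)       # ['<a^2>', 'b', '<ab>']
print(w)
```

Each step shortens the word, and the table of fresh symbols records how to expand it again, which is the sense in which the process acts as a bottom-up compression.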
Switching to the CP side, we cannot say anything deep about decidability because this paradigm essentially works on finite domains, and CSPs over finite domains are trivially decidable by enumerating the domain values. The theoretical results in CP mainly refer to the consistency levels (handbookCP) that a given constraint propagator can enforce. For example, the propagator introduced in (pesant04regular) enforces Generalized Arc Consistency (GAC).
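To give an idea of what enforcing GAC means for a membership constraint in a regular language, the following Python sketch filters the domains of a fixed-length sequence of variables against a DFA with a forward and a backward pass over the unfolded automaton. It is a non-incremental illustration written under our own assumptions (a DFA given as a transition dictionary, domains given as sets of symbols, and the name `regular_gac`), not the algorithm of the cited paper.

```python
def regular_gac(domains, delta, start, accepting):
    """Remove from each domain the values that lie on no accepting DFA path.

    domains:   list of sets of symbols, one set per position
    delta:     dict mapping (state, symbol) -> state (partial DFA)
    accepting: collection of accepting states
    """
    n = len(domains)
    # Forward pass: states reachable after reading i symbols.
    fwd = [set() for _ in range(n + 1)]
    fwd[0] = {start}
    for i in range(n):
        for q in fwd[i]:
            for a in domains[i]:
                if (q, a) in delta:
                    fwd[i + 1].add(delta[(q, a)])
    # Backward pass: keep only states that can still reach acceptance.
    bwd = [set() for _ in range(n + 1)]
    bwd[n] = fwd[n] & set(accepting)
    for i in range(n - 1, -1, -1):
        for q in fwd[i]:
            if any(delta.get((q, a)) in bwd[i + 1] for a in domains[i]):
                bwd[i].add(q)
    # A value is supported iff some surviving transition is labelled with it.
    return [
        {a for a in domains[i]
         if any(delta.get((q, a)) in bwd[i + 1] for q in bwd[i])}
        for i in range(n)
    ]

# DFA for the language a*b*: state 0 reads a's, state 1 reads b's.
delta = {(0, "a"): 0, (0, "b"): 1, (1, "b"): 1}
doms = [{"a", "b"}, {"a"}, {"a", "b"}]
print(regular_gac(doms, delta, 0, {0, 1}))
```

In the example the second variable is fixed to `a`, so `b` is removed from the first domain, while both values remain supported for the third. If no accepting path exists, all returned domains are empty.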
In (DBLP:conf/cpaior/PesantQRS09) Pesant et al. performed a theoretical study demonstrating how the context-free grammar constraint can be linearized efficiently. In particular, they proposed a lifted polytope having only integer extreme points.
In (KadiogluS10) Kadioglu et al. theoretically investigate filtering problems that arise from grammar constraints. The authors address questions such as: can we efficiently filter context-free grammar constraints? How can we achieve arc-consistency for conjunctions of regular grammar constraints? What languages are suited for filtering based on regular and context-free grammar constraints? Are there languages that are suited for context-free, but not for regular grammar filtering?
In (prefsuff) Scott et al. extended the open domain consistency notion (Maher09) with new definitions such as PS-consistency, PSL-consistency, PSU-consistency, and PSLU-consistency.
Given their non-lattice nature, dashed strings do not come with propagators that formally ensure a given level of consistency. The main theoretical results for dashed strings concern the sweep-based equation algorithm introduced in (dashedstring), on which most of the dashed string propagators rely, which has been proven to be sound (while its completeness is still an open problem).
7. Practical aspects
In this section we focus on the practical aspects of string solving, that is, on the SCS tools and benchmarks that have actually been developed and tested.
Among the automata-based approaches seen in Section 3, the tools that are currently available online are MONA (monatool), JSA (jsatool), the PHP string analyzer (phpstrtool) described in (phpstr), Rex (rextool), and Sloth (slothtool). However, only the latter two are actively maintained. PISA (pisa) is a commercial security product not publicly available.
CVC4 and Z3 are two of the most famous SMT solvers. Their interface for string solving is maintained and well documented (stringscvc; stringsz3). Note that Z3 provides two alternatives for string solving: the theory of strings (via the Z3str3 solver (z3str3)) and the theory of sequences (Z3seq). Z3str2 and Z3str are instead no longer maintained.
Norn is publicly available at (norntool), but its last update dates back to 2015. At (s3tool) one can find the binaries of all three versions of the S3 string solver (S3, S3P, and S3#). However, the sources are available only for S3 and have not been updated since 2014.
Trau comes in two versions. The first one is available at (trautool1), while the newest implementation (called Z3Trau) can be found at (trautool2). Both versions are built on top of the Z3 solver.
Concerning unfolding-based approaches, the Kaluza solver is still available at (kaluzatool) but no longer maintained. Similarly, Gecode+ is available online (gecode_stool) but no longer developed. Conversely, the GStrings solver (gstringstool) implementing the dashed string approach is actively maintained. Both Gecode+ and GStrings are extensions of Gecode (gecode), a well-established CP solver over finite domains.
ACOSolver (acosolver) is a tool using a hybrid constraint solving procedure based on the ant colony optimization metaheuristic, which is executed as a fallback mechanism when an underlying solver encounters an unsupported string operation. JOACOCS is an extension of ACOSolver publicly available at (joacotool).
All the above approaches, and many others, have been tested on several string benchmarks, especially coming from the software verification world. Although some of them are publicly available, a major issue for string solving is that there are no standard benchmarks for SCS applications. This makes a rigorous comparison of string solvers, and their integration, difficult.
The first step towards the creation of standard benchmarks is the implementation of a common language to model and solve SCS problems. Unfortunately, at present, strings are not a standard feature of either SMT or CP modelling languages. However, something is currently moving in this direction.
For example, given the increasing interest in string solving, the SMT-LIB initiative (smtlib) is developing an official theory of strings and related logics for SMT solvers (see http://smtlib.cs.uiowa.edu/theoriesUnicodeStrings.shtml). On the CP side, the de facto modelling language for constraint solvers is called MiniZinc (minizinc). Strings are not (yet) part of the official MiniZinc release, but in (mznstrings) Amadini et al. introduced an extension to enable string solving. This also includes a library for compiling string variables and constraints into bounded-length arrays or integer variables and constraints.
Unfortunately, at the moment the interaction between the SMT and CP communities does not appear very strong. A possible way to bring these two communities closer together is to define compilers to translate SMT-LIB into MiniZinc and vice versa. For example, Bofill et al. (fzn2smt) defined a compiler from FlatZinc (the low-level language derived from MiniZinc) to SMT-LIB. More recently, G. Gange (smt2mzn) implemented a prototypical compiler from SMT-LIB to MiniZinc able to process string variables and constraints.

Table 2: Classes of StringFuzz-generated instances.

| Class | Description | Quantity |
|---|---|---|
| Concats{Small,Big} | Right-heavy, deep tree of concats | 120 |
| ConcatsBalanced | Balanced, deep tree of concats | 100 |
| ConcatsExtracts{Small,Big} | Single concat tree, with character extractions | 120 |
| Lengths{Long,Short} | Single, large length constraint on a variable | 200 |
| LengthsConcats | Tree of fixed-length concats of variables | 100 |
| Overlaps{Small,Big} | Formula of the form A.X = X.B | 80 |
| Regex{Small,Big} | Complex regex membership test | 120 |
| ManyRegexes | Multiple random regex membership tests | 40 |
| RegexDeep | Regex membership test with many nested operators | 45 |
| RegexPair | Test for membership in one regex, but not another | 40 |
| RegexLengths | Regex membership test, and a length constraint | 40 |
| DifferentPrefix | Equality of two deep concats with different prefixes | 60 |

A nice step forward for SCS benchmarks is represented by StringFuzz (strfuzz), a modular SMT-LIB problem instance transformer and generator for string solvers. It is a useful tool for string solver developers and testers, and can help expose bugs and performance issues.
A repository of several SMT-LIB 2.0/2.5 problem instances generated and transformed with StringFuzz is available online (http://stringfuzz.dmitryblotsky.com/benchmarks/). Table 2 summarizes the nature of these instances, grouped into twelve different classes. StringFuzz also reports the performance of Z3str3, CVC4, Z3seq, and Norn on such instances. Arguably, only these SMT solvers were used because other SMT approaches are either unstable or cannot properly process the proposed SMT-LIB syntax. Note that the SMT-LIB to MiniZinc compiler of (smt2mzn) can successfully translate all the generated StringFuzz instances, so the performance of CP solvers can also be evaluated.
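To give a flavour of what an SMT-LIB string instance looks like, the following Python sketch prints a small script stating a single word equation. It is a hand-written illustration and not output of StringFuzz; the helper names are hypothetical, and whether a particular solver accepts the logic name `QF_S` used here is an assumption that should be checked against that solver's documentation.

```python
def smtlib_string(s):
    """Quote a Python string as an SMT-LIB string literal.

    Sketch for printable ASCII only: a double quote inside the literal
    is written as two double quotes."""
    return '"' + s.replace('"', '""') + '"'

def equation_instance(a, b, var="X"):
    """Build an SMT-LIB script asserting a . X = X . b for constants a, b."""
    lines = [
        "(set-logic QF_S)",
        f"(declare-const {var} String)",
        f"(assert (= (str.++ {smtlib_string(a)} {var}) "
        f"(str.++ {var} {smtlib_string(b)})))",
        "(check-sat)",
    ]
    return "\n".join(lines)

print(equation_instance("ab", "ba"))
```

The resulting text can be saved to a file and passed to any solver that supports the SMT-LIB theory of strings, or translated to another modelling language by a compiler such as the one mentioned above.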
8. Related works
In this section we cover some of the related literature that has not been previously mentioned.
DReX (drex) is a declarative language that can express all the regular string-to-string transformations based on function combinators (regcomb). The focus is on the complexity of evaluating the output of a DReX program on a given input string. In particular, the main contribution of (drex) is the identification of a consistency restriction on the use of combinators in DReX programs, and a single-pass evaluation algorithm for consistent programs.
Luu et al. developed SMC (smc), a model counter for determining the number of solutions of combinatorial problems involving string constraints. Their work relies on generating functions, a mathematical tool for reasoning about infinite series that also provides a mechanism to handle the cardinality of string sets. SMC is expressive enough to model constraints arising in real-world JavaScript applications and UNIX C utilities.
An Automata-Based model Counter for string constraints (ABC) is implemented in (DBLP:conf/cav/AydinBB15). ABC uses an automata-based constraint representation for reducing model counting to path counting. In (DBLP:conf/cav/AydinBB15) the ABC model counter is extended with relational and numeric constraints.
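The reduction from model counting to path counting can be sketched in a few lines of Python. The example assumes that the solution set has already been represented as a deterministic automaton, so that accepted strings and accepting paths are in one-to-one correspondence. It only conveys the basic idea and is not the implementation used by ABC; the function name and the DFA encoding are our own.

```python
def count_accepted(delta, start, accepting, alphabet, length):
    """Count the strings of exactly `length` symbols accepted by a DFA.

    delta maps (state, symbol) -> state; missing entries are dead ends."""
    counts = {start: 1}      # counts[q] = number of paths ending in q
    for _ in range(length):
        nxt = {}
        for q, c in counts.items():
            for a in alphabet:
                r = delta.get((q, a))
                if r is not None:
                    nxt[r] = nxt.get(r, 0) + c
        counts = nxt
    return sum(c for q, c in counts.items() if q in accepting)

# DFA over {a, b} accepting the strings that do not contain "aa":
# state 0 = last symbol was not 'a', state 1 = last symbol was 'a'.
delta = {(0, "a"): 1, (0, "b"): 0, (1, "b"): 0}
print(count_accepted(delta, 0, {0, 1}, "ab", 3))   # 5
```

The running time is linear in the string length and in the size of the automaton, whereas enumerating the solutions themselves would be exponential in the length.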
As mentioned above, the main application field for string solving is the area of software verification and testing. For example, Kausler et al. (streval) performed an evaluation of string constraint solvers in the context of symbolic execution (king1976). They find that, as one might expect, one solver might be more appropriate than another depending on the input program. This is also pointed out in (aratha), where Amadini et al. presented a multi-solver tool for the dynamic symbolic execution of JavaScript.
In (PHPRepair), the PHPRepair tool is used for automatically repairing HTML generation errors in PHP via string constraint solving. The property that all tests of a suite should produce their expected output is encoded as a string constraint over variables representing constant prints. Solving this constraint describes how constant prints must be modified to make all tests pass. The string constraints are encoded into the language of Kodkod (kodkod), a SAT-based constraint solver.
Abstracting a set of strings with a finite formalism is not merely a string solving affair. For example, the wellknown Abstract Interpretation (CousotC77) framework may require the sound approximation of sets of strings (i.e., all the possible “concrete” values that a string variable of the input program can take) with an abstract counterpart.
Several abstract domains have been proposed to approximate sets of strings. These domains vary according to the properties that one needs to capture (e.g., the string length, the prefix or suffix, the characters occurring in a string). Madsen et al. (DBLP:conf/cc/MadsenA14) proposed and evaluated a suite of twelve string domains for the static analysis of dynamic field access. Additional string domains are also discussed in (DBLP:journals/spe/CostantiniFC15; mstring).
Choi et al. (DBLP:conf/aplas/ChoiLKD06) used restricted regular expressions as an abstract domain for strings in the context of Java analysis. Park et al. (DBLP:conf/dls/ParkIR16) use a stricter variant of this idea, with a more clearly defined string abstract domain.
Amadini et al. (jsstring) provided an evaluation on the combination via direct product of different string abstract domains in order to improve the precision of JavaScript static analysis. In (refdomains) this combination is achieved via reduced product by using the set of regular languages as a reference domain for the other string domains.
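As a concrete illustration of combining string abstract domains, the following Python sketch pairs a prefix domain with a length-interval domain in a direct product. The two domains and all function names are chosen for this example only and are not taken from the cited works; the bottom element (the empty set of strings) is deliberately not handled.

```python
import os

def alpha_prefix(strings):
    """Prefix domain: the longest common prefix of a non-empty set."""
    return os.path.commonprefix(list(strings))

def alpha_length(strings):
    """Length domain: the interval (min, max) of the string lengths."""
    lengths = [len(s) for s in strings]
    return (min(lengths), max(lengths))

def join_prefix(p, q):
    """Least upper bound of two prefixes: their longest common prefix."""
    return os.path.commonprefix([p, q])

def join_length(i, j):
    """Least upper bound of two intervals: their convex hull."""
    return (min(i[0], j[0]), max(i[1], j[1]))

def alpha_product(strings):
    """Direct product: abstract each component independently."""
    return (alpha_prefix(strings), alpha_length(strings))

def join_product(x, y):
    """Direct product join: join each component independently."""
    return (join_prefix(x[0], y[0]), join_length(x[1], y[1]))

x = alpha_product({"foo", "foobar"})   # ('foo', (3, 6))
y = alpha_product({"food"})            # ('food', (4, 4))
print(join_product(x, y))              # ('foo', (3, 6))
```

In a direct product the components never exchange information. A reduced product would additionally let one component refine the other; for instance, knowing the prefix `foo` implies that the length is at least 3, which could tighten a looser length interval.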
Finally, we mention that string solving techniques might be useful in the context of Bioinformatics, where a number of CP techniques have been already applied (see, e.g., the works by Barahona et al. (DBLP:journals/constraints/BarahonaK08; DBLP:conf/aime/KrippahlMB13; DBLP:conf/cp/KrippahlB16)).
9. Conclusions
In this work we provided a comprehensive survey on the various aspects of string constraint solving (SCS), an emerging important field orthogonal to combinatorics on words and constraint solving. In particular, we focused on the three main categories of SCS approaches we are aware of (automata-based, word-based, and unfolding-based), from the early proposals to the state-of-the-art approaches.
Among the future directions for string constraint solving we mention the four challenges reported in (ecaistr), namely:

Extending SCS capabilities to properly handle complex string operations, frequently occurring in web programming (exposeregex), such as backreferences, lookaheads/lookbehinds or greedy matching.

Improving the efficiency of SCS solvers with new algorithms and search heuristics. At present, SMT solvers tend to fail with long strings, while CP solvers may struggle to prove unsatisfiability.

Combining SCS solvers with a portfolio approach (kotthoff2016algorithm) in order to exploit their different nature and uneven performance across different problem instances.

Using SCS solvers and related tools in different fields. The best candidates are probably software verification and testing, model checking and cybersecurity.
Finally, we hope that the research in SCS will encourage a closer and more fruitful collaboration between the CP and the SAT/SMT communities.