Supercompiling String Programs Using Word Equations as Constraints

by Antonina Nepeivoda, et al.

We describe a general parameterized scheme of program and constraint analyses allowing us to specify both the program specialization method known as Turchin's supercompilation and Hmelevskii's algorithm solving the quadratic word equations. The scheme is specified for both sorts of the analysis and works in a joint algorithm in which these two sorts of the analysis are used together. The word equations and the inequalities on regular patterns are used as the string constraint language in the algorithm.




Sec. 1 Introduction

Program transformation techniques are usually used for optimization, but sometimes they are also used for verification. Given a program , a simple analysis of its transformed version might show some properties of which were not obvious in itself [5, 11]. This paper aims at verification of reachability properties of functional programs. Any program can be formally unfolded into a tree which includes all computation paths of [15]. Some unreachable paths in the tree are pruned by the transformation. The unfold/fold transformation then represents the resulting tree as a finite graph. If the graph does not contain some program state, then the state is unreachable. E.g., given returning either or , let the graph after the unfold/fold transformation [3, 15] contain no states with value . Then the value is unreachable from the input point of : either the value will be returned or will run forever.

Turchin’s supercompilation is one such method based on unfold/fold operations [16, 15, 5]. Turchin’s original works use the string-manipulating language Refal as the input language of a supercompiler [17]. When a string-manipulating program is treated by supercompilation, the method requires analysis of word equations.

Example 1

Let a program include an operator testing the equality on strings. Let take strings as its value, and be letters, and — the concatenation sign. Let , . In order to prove that the test never returns , we need to know that no value of can satisfy the equality . That means the word equation has no solutions, or, in other words, the relation on strings defined by the equation is empty.
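To make the claim of an empty solution set concrete, the sketch below checks one illustrative equation by bounded enumeration. The equation x·a = b·x, the two-letter alphabet, and the function name `bounded_solutions` are assumptions made for this illustration and are not taken from the example above. A bounded search cannot prove emptiness; it only shows that no short solution exists.

```python
from itertools import product

def bounded_solutions(max_len, alphabet="ab"):
    """Return all words x with len(x) <= max_len such that x + 'a' == 'b' + x."""
    found = []
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            x = "".join(letters)
            if x + "a" == "b" + x:
                found.append(x)
    return found

print(bounded_solutions(6))  # prints []
```

For this particular equation the result is empty for every bound, because the left-hand side always contains one more occurrence of the letter a than the right-hand side.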

As a rule, tools for the analysis of string programs are based on the class of regular languages as a language for string constraints [18, 12, 2]. This class is then enriched by some additional predicate symbols in such a way that decidability of the enriched systems is preserved. The class of the word relations defined by the word equations is neither a subset nor a superset of the set of rational relations defined by finite-state machines [9].

This work considers the set of quadratic word equations as a constraint language for supercompilation. Given a program , the word equations are used either to detect some unreachable computation paths of or to present some properties of the program structures.

Our contributions are the following:

  1. We present a general parameterized scheme of program and constraint analyses allowing us to specify both a program specialization method known as Turchin’s supercompilation and Hmelevskii’s algorithm solving the quadratic word equations.

  2. We specify this scheme for each of the two analyses and present a new joint algorithm in which these two analyses are used together. The joint algorithm verifies some safety properties of the programs to be analysed. A new type of string constraints, namely the quadratic word equations, together with the inequalities on regular patterns are used for the constraint analysis. The input string lengths of the programs are unknown, i.e. they are not bounded in advance.

  3. The presented algorithm has been implemented in a model supercompiler MSCP-A for the language Refal.

The paper is organized as follows. After introducing the syntax (Sec. 2), we describe the general formal scheme of the analysis (Sec. 3.1). Then we describe the set of configurations (Sec. 3.2) and specify the scheme first for solving quadratic word equations (Sec. 4) and then for supercompilation (Sec. 5). Section 6 informally describes how these schemes are used together in the supercompiler MSCP-A. We show some examples of reachability analysis done using the schemes in the Appendix (Sec. 7.3).

Sec. 2 Presentation Language

We present our program examples in a variant of pseudocode for functional programs. The programs given below are written as term rewriting systems based on pattern matching. The rules of a program are tried for matching in order, from top to bottom. The programs are over strings in a finite alphabet

and use variables of two types: strings, which take values in , and symbols, which take values in . We call a datum a parameter if it already has a value that is unknown to us, whereas the value of a variable is undefined and is yet to be assigned. We use the following syntax.

  • is the empty word, is the concatenation sign (both may be omitted);

  • , , etc. are elements of ;

  • , , , maybe subscripted, are string parameters; is a symbol parameter;

  • , , , maybe subscripted, are string variables; is a symbol variable;

  • function names are given in typographic style and start with capital letters, e.g., , , .

Let denote a set of all variables, and a set of string and symbol parameters respectively, and .

A rule of a program is , where is , is an -ary function name, , can contain symbols, function calls, and variables. may include only the variables present in .

Definition 1

Expression set and function returning the set of parameters of an expression are defined as follows.

  • If , then . . If , then , .

  • If , , then and .

  • If , — a function name of the arity , then , .

  • does not include any other elements.

Given a program, the function serves as its input point.

Function is called a substitution, if is a morphism on preserving constants. Thus, any substitution is completely defined by its values on . We write an application of substitution to as . We also assume that substitutions respect types, i.e. for every . If for and , then we write as . Substitutions for are defined similarly.
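The following minimal sketch illustrates a substitution as a morphism that preserves constants. The representation is an assumption made for illustration only: an expression is a list of tokens, and a substitution is a dictionary from parameter names to token lists; the name `apply_substitution` is hypothetical.

```python
def apply_substitution(expr, sigma):
    """Apply sigma to every token of expr; tokens outside sigma (constants) are kept."""
    result = []
    for token in expr:
        result.extend(sigma.get(token, [token]))
    return result

# A substitution is fully determined by its values on the parameters.
sigma = {"x": ["a", "y"]}
u, v = ["x", "b"], ["x"]

# Morphism property: the image of a concatenation is the concatenation of the images.
assert apply_substitution(u + v, sigma) == (
    apply_substitution(u, sigma) + apply_substitution(v, sigma)
)
print(apply_substitution(u + v, sigma))  # ['a', 'y', 'b', 'a', 'y']
```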

In this paper the notion of a substitution is also extended to predicates as follows. Given an -ary predicate and substitution , means a restriction of to the image of , i.e. , s.t. .

Sec. 3 Unfold/Fold Program Transformation Method

This section presents a variant of the unfold/fold technique used by supercompilation [16, 15], refined to the string data type with a class of word equations and inequalities used as a constraint language. First, we extend the unfold/fold scheme given in [11] to a wider set of configurations. The scheme described is applicable both to the program data and to word equations. Then we specify the relations controlling the unfold/fold process for both types of data.

3.1 General Unfold/Fold Scheme

Given a set of predicates , is a set of predicates equivalent to . The relation is the textual coincidence, and are logical connectives with the usual meaning. Given a tree and the edge in , we say is a child of . A node is an ancestor of (and is a successor of ) if there exists a sequence of edges such that .

Definition 2

A configuration is a tuple , where , is a set of predicates on . We denote the set of configurations by .

Given , and a substitution , .

Given a tuple , where , , and are binary relations on , and , we name the transition relation, — the reducing relation, — the similarity relation. We assume that every node has a unique mark: either fresh, open or closed with , where is either an ancestor of or itself. The unfolding of the process tree of follows the scheme below, which is an extension of the scheme described in [11].

START: Create a root of the tree, label it with (denoted ) and mark it as fresh.

UNFOLD: Choose a fresh vertex and generate configurations such that and for every substitution , if , then there exists a substitution and such that . For every such a create a fresh child vertex and label it with . Open the vertex . If the parent of is open, then mark as closed with .

CLOSE I: Choose an open vertex and check whether it has an ancestor vertex , such that . If yes, mark the vertex as closed with and delete the children of .

CLOSE II: Choose an open vertex and check whether all its children are closed. If yes, mark the vertex as closed with . If s.t. , mark as closed with .

GENERALIZE: Choose an open vertex and its ancestor vertex such that . Let . Generate configuration such that both and hold, and there is a substitution such that . Delete the subtree with the root , except the vertex itself. Replace the label with a special let-label and generate fresh children nodes of (see the footnote below). The last is labeled with and the others with , where , . Nodes with the let-label are never tested by CLOSE I and GENERALIZE.

Footnote 1: The construction of the let-branching differs from the branching produced by UNFOLD. The children of a let-node are generated by the split procedure rather than by transitions: they all must be computed for computing .

PRUNE: Choose a closed vertex and consider the subtree rooted in . If all leaves in are closed with their ancestors from , then delete , node itself and the ingoing edge to .
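The control flow of the scheme can be sketched in a strongly simplified form. The sketch below covers only START, UNFOLD, CLOSE I, and the closing of terminal nodes; GENERALIZE, PRUNE, and the upward propagation of CLOSE II are omitted, and the reducibility test is performed before a node is unfolded rather than afterwards. The names `Node`, `build_process_tree`, `transitions`, and `reduces` are hypothetical, and the toy transition system on integers is an assumption chosen only so that the tree stays finite without generalization.

```python
class Node:
    def __init__(self, config, parent=None):
        self.config = config
        self.parent = parent
        self.children = []
        self.mark = "fresh"
        self.closed_with = None

def ancestors(node):
    """Yield the proper ancestors of node, nearest first."""
    node = node.parent
    while node is not None:
        yield node
        node = node.parent

def build_process_tree(root_config, transitions, reduces):
    root = Node(root_config)                      # START
    fresh = [root]
    while fresh:
        v = fresh.pop()
        target = next((a for a in ancestors(v)
                       if reduces(v.config, a.config)), None)
        if target is not None:                    # CLOSE I (tested before unfolding)
            v.mark, v.closed_with = "closed", target
            continue
        v.mark = "open"                           # UNFOLD
        for config in transitions(v.config):
            child = Node(config, v)
            v.children.append(child)
            fresh.append(child)
        if not v.children:                        # terminal node, closed with itself
            v.mark, v.closed_with = "closed", v
    return root

def labels(node):
    """Yield the configurations labelling the subtree rooted at node."""
    yield node.config
    for child in node.children:
        yield from labels(child)

# Toy system: a counter that is increased by 2 modulo 6, starting from 0.
tree = build_process_tree(0, lambda n: [(n + 2) % 6], lambda c, a: c == a)
print(sorted(set(labels(tree))))  # [0, 2, 4]
```

Since the configuration 3 labels no node of the resulting tree, it is unreachable from the root configuration 0 in the toy system, which mirrors the reachability argument used throughout the paper.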

3.2 Equations and Inequalities as String Constraints

In this subsection we specify a set which is used in configuration set , both for the program and constraint analyses. The set consists of two subsets, namely word equations and word inequalities.

Definition 3

A word equation is an equality , where . is .

Given an equation , a solution of is s.t. . is quadratic, if no string parameter occurs in more than twice [8].
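The quadratic condition is easy to test mechanically. The sketch below assumes, for illustration only, that both sides of an equation are Python strings in which uppercase characters denote string parameters and lowercase characters denote letters; the name `is_quadratic` is hypothetical.

```python
from collections import Counter

def is_quadratic(lhs, rhs):
    """True if no parameter (uppercase character) occurs more than twice in lhs = rhs."""
    counts = Counter(c for c in lhs + rhs if c.isupper())
    return all(n <= 2 for n in counts.values())

print(is_quadratic("XaY", "YbX"))  # True: X and Y occur twice each
print(is_quadratic("XX", "aX"))    # False: X occurs three times
```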

Definition 4

Given , and , a linear word inequality is an inequality of the form , where for , is . Recall that are variables of the string type (Sec. 2).

For the sake of brevity, we use the simplified notation treating all as free variables.

In the model used in this paper, any configuration is of the form , where , is a set of the quadratic word equations, and set is a set of the linear inequalities.

Sec. 4 Scheme for Constraint Analysis

In this section we apply the scheme given in Sec. 3.1 to the analysis of word equations. As a result, we reconstruct the well-known algorithm of Hmelevskii for solving the quadratic word equations [6] in terms of the scheme given in Sec. 3.1. The algorithm is extended to parameters from and constraints in the form of the linear inequalities. In order to get the algorithm we have to specify some versions of the relations and and the configuration set .

Given a binary constructor , an eq-configuration is the configuration , where and the set includes exactly one quadratic word equation (and possibly some additional linear word equations).

Now we specify the and relations over . Consider eq-configurations , .

Definition 5

is reduced to (denoted ), if there is a substitution such that , , , , and if then , if then . Thus, is a renaming substitution.

Definition 6

is unfolded from (denoted ), if there is a substitution (which is called the narrowing substitution) satisfying , , and having the following properties.

Let , where , , , and are fresh parameters. If , where , then all the definitions should be applied to symmetrically.

  • If and , then .

  • If and , then .

  • If and , then either or .

  • If , , and , then either or .

  • If , , and , then either or .

  • If , and , then .

Let , , and is chosen to be of the maximal length. Then assign , .

Actually, expression presents the equation after deleting the common prefixes of and . The construction of from is shown in the Appendix (Sec. 7.1). The properties of and , together with the construction of , guarantee that the algorithm given in Sec. 3.1 terminates. In fact, this algorithm specified by and is a version of Hmelevskii’s algorithm solving the quadratic word equations [6] with some minor changes due to extension of the parameters set by .

In the unfolded process tree of an equation, some simple properties holding for every path generated in the tree may become explicit. If the properties are expressible as narrowings of the root parameters, then the narrowings are extracted from the tree and are used in the analysis of the program from which the constraints come. The unfolding also performs a test for satisfiability of the equation under the conditions . If the tree has no leaves marked by the expression (which is replaced by in our diagrams) then the equation has no solutions under the given conditions and the node with the general configuration where , can be pruned.
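The satisfiability test can be illustrated by a textbook-style search over narrowings for quadratic equations. The sketch below is not the algorithm of this section: it ignores symbol parameters and inequality constraints, and it closes nodes by plain textual coincidence of equations. Its conventions are assumptions: both sides are Python strings, uppercase characters are string parameters that may be empty, lowercase characters are letters, and all function names are hypothetical. The search terminates because, for a quadratic equation, each narrowing followed by prefix deletion never lengthens the equation, so only finitely many equations are reachable.

```python
from collections import Counter, deque

def strip_common_prefix(lhs, rhs):
    """Delete the longest common prefix of both sides."""
    i = 0
    while i < min(len(lhs), len(rhs)) and lhs[i] == rhs[i]:
        i += 1
    return lhs[i:], rhs[i:]

def narrowings(lhs, rhs):
    """Yield (parameter, image) pairs for an equation without a common prefix."""
    if not lhs or not rhs:
        head = (lhs or rhs)[0]
        if head.isupper():
            yield head, ""          # the remaining side can only vanish
        return
    x, y = lhs[0], rhs[0]
    if x.isupper():
        yield x, ""                 # x is empty
        yield x, y + x              # x starts with the head of the other side
    if y.isupper():
        yield y, ""
        yield y, x + y

def solvable(lhs, rhs):
    """Decide whether the quadratic word equation lhs = rhs has a solution."""
    counts = Counter(c for c in lhs + rhs if c.isupper())
    if any(n > 2 for n in counts.values()):
        raise ValueError("equation is not quadratic")
    start = strip_common_prefix(lhs, rhs)
    seen, queue = {start}, deque([start])
    while queue:
        l, r = queue.popleft()
        if not l and not r:         # trivial equation reached: a solution exists
            return True
        for param, image in narrowings(l, r):
            nxt = strip_common_prefix(l.replace(param, image),
                                      r.replace(param, image))
            if nxt not in seen:     # closing by textual coincidence
                seen.add(nxt)
                queue.append(nxt)
    return False

print(solvable("Xa", "bX"))    # False
print(solvable("Xab", "baX"))  # True, e.g. X = b
```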

An example of the constraint analysis following the scheme above is shown in the Appendix, see Example 5.

Sec. 5 Scheme for Program Analysis

Now we specify versions of the relations , , used by our program analysis. Consider a program , which is a finite sequence of rules (see Sec. 2).

Definition 7

The homeomorphic embedding is defined on as follows [15].

  • For every , , , . For every , .

  • Given , if , then (for any ), and , where for some .

  • Given , if then .
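For orientation, the classic homeomorphic embedding on first-order terms can be sketched as follows. This is the standard term version (variable, coupling, and diving cases), not necessarily the exact string-oriented variant of Definition 7. The term representation is an assumption: a variable is a Python string, and an application is a tuple whose first component is the function name; all function names are hypothetical.

```python
def is_var(term):
    return isinstance(term, str)

def embeds(s, t):
    """True if term s is homeomorphically embedded in term t."""
    if is_var(s) and is_var(t):
        return True                      # any variable embeds in any variable
    return couples(s, t) or dives(s, t)

def couples(s, t):
    """Same head symbol and arity, arguments embedded pairwise."""
    if is_var(s) or is_var(t):
        return False
    return (s[0] == t[0] and len(s) == len(t)
            and all(embeds(a, b) for a, b in zip(s[1:], t[1:])))

def dives(s, t):
    """s is embedded in some argument of t."""
    if is_var(t):
        return False
    return any(embeds(s, arg) for arg in t[1:])

small = ("f", "x")
large = ("g", ("f", "y"))
print(embeds(small, large))  # True: f(x) is embedded in g(f(y))
print(embeds(large, small))  # False
```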

Let two configurations be .

Definition 8

is reduced to configuration (denoted ), if there is a substitution such that , , .

is similar to (denoted ) if .

Definition 9

is unfolded to () if there exists a rule in such that there are a substitution and a set of equations over such that . We call the narrowing, and the elements of the narrowing equations. Moreover, the following properties are required.

  • such that , and, for every inequality , if , then there is a rule such that it precedes and .

  • such that , , and all equations in are quadratic.

Actually, makes sense only if contains multiple occurrences of some string variables. Because the rules of are ordered from top to bottom to be matched, the branches of the process tree generated by the UNFOLD rule are ordered. The set is constructed using this order. The order is not used in our analysis except in this case.

Unlike the scheme given in [11], the unfolding scheme in this paper only partially determines the transitions done by UNFOLD, for they may vary in the equation and inequality sets. A construction demonstrating the role of and the problem to make explicit is given in the Appendix (Example 6).

Example 2

The function below tests whether its argument is a palindrome.

Given configuration and the configurations

all the relations , , and hold. The corresponding rule is , the narrowing substitution is .
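The listing of the palindrome test is not reproduced in this text. For orientation only, one possible rendering of such a test as an ordered rule set is sketched below in Python; the shape of the rules and the name `pal` are assumptions and may differ from the original definition.

```python
def pal(word):
    """Ordered rules: the first rule whose pattern matches is applied."""
    if word == "":                # rule 1: the empty word is a palindrome
        return True
    if len(word) == 1:            # rule 2: a single symbol is a palindrome
        return True
    if word[0] == word[-1]:       # rule 3: equal first and last symbols, recurse
        return pal(word[1:-1])
    return False                  # rule 4: any other argument

print(pal("abcba"))  # True
print(pal("ab"))     # False
```

The third rule is the one in which the same symbol occurs twice in the pattern, which is the situation where narrowing equations arise during unfolding.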

Definition 10

A generalization of is an expression such that there are , named generalizing substitutions, such that ().

A generalization of a linear inequality is either or an inequality , where maps some of to constant strings (maybe empty) and preserves the others.

A generalization of a word equation is either or a quadratic word equation such that there is satisfying ().

Example 3

Given inequality , all the inequalities , , are generalizations of . Inequality is not a generalization of , because substitution is non-constant.

Given equation , all the equations , , and are generalizations of . The equation is not a generalization of , because substitution is forbidden.

Definition 11

A generalization of two configurations , is such that

  • , and are the generalizing substitutions, .

  • , and for all such that , there exists such that is a generalization of .

  • , and for all such that , there exists an equation such that is a generalization of .

Thus the set of the computation paths starting at the generalized configuration includes all computation paths both from and .

Example 4

Consider the function below generating a word in language . Function is defined in Example 2.

Given the input point , let us prove that the configuration is unreachable using the unfold/fold scheme above. If or consists of a single element, then we omit the set enclosing brackets.

The input configuration is . UNFOLD rule is applied to the input node. The first child corresponding to the application of is generated by narrowing and is labeled with . The second child corresponding to the application of is generated by narrowing and is labeled with . After applying UNFOLD to , no contradiction is generated. The similarity relation holds, therefore GENERALIZE is used.

The first steps (except unfolding ) are shown in the diagram below. For the sake of brevity, we omit the trivial equations and inequalities sets in the configurations.

The following generalization with generalizing substitutions , is built.


The equation appears in . Both and hold, and , therefore is a valid equation for (the rule generating the equation is given in Appendix, Sec. 7.2).

Then again, node generates two successors by applying UNFOLD. First, consider corresponding to the narrowing . Given substitution , it satisfies both and . Hence, and CLOSE I marks as closed with .

Rule unfolds the configuration to by means of the narrowing . with the substitution and CLOSE I marks as closed with .

Rule unfolds to by means of . The constraint analysis of this configuration shows the predicate is contradictory. Thus, the branch with the node labeled with is pruned. This transformation is crucial for solving the verification task given above.

Rules and unfold to the constant configurations and , which are closed by CLOSE II. The whole process tree we constructed for the computation is shown below. If is closed with its ancestor , it has the dotted reverse edge to marked "reducing". If is closed with itself, we do not write the dotted edge corresponding to its mark.

The process tree does not contain nodes labeled with expression. Hence, the output is unreachable from the input point . The corresponding safety property of the program has been proven.

Sec. 6 Combining the Two Schemes

Let a configuration be , the narrowing substitution be . To check whether is unreachable, we follow the algorithm below.

  • Successively select elements of and replace them by their corollaries, which are linear inequalities (following the table given in the Appendix, Sec. 7.1). If some implies a contradiction, then delete the node labeled with . Otherwise construct the inequality set for .

  • Successively select elements of and split them into a number of shorter equations using the length argument [7]. If the length argument for some implies a contradiction, delete the node labeled with . Otherwise construct the set of the simplified equations.

  • Successively select . If is not quadratic, then generalize by . Take all elements of such that , and all linear equations s.t. . Apply our constraint analysis scheme (Sec. 4) for . Proceed until is completely exhausted or a contradiction is found.
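One common reading of the length argument used in the second step is the following: whenever a prefix of the left-hand side and a prefix of the right-hand side have the same length under every assignment, the equation can be split at that point. The sketch below implements only this reading and may differ from the procedure of [7]. It assumes that sides are Python strings with uppercase characters as string parameters and lowercase characters as letters; the names `symbolic_length` and `split_by_length` are hypothetical.

```python
from collections import Counter

def symbolic_length(word):
    """Number of letters together with the multiset of parameters in word."""
    letters = sum(1 for c in word if not c.isupper())
    params = frozenset(Counter(c for c in word if c.isupper()).items())
    return letters, params

def split_by_length(lhs, rhs):
    """Split lhs = rhs wherever two prefixes are guaranteed to have equal length."""
    pieces = []
    i0 = j0 = 0
    for i in range(1, len(lhs) + 1):
        for j in range(j0 + 1, len(rhs) + 1):
            if symbolic_length(lhs[i0:i]) == symbolic_length(rhs[j0:j]):
                pieces.append((lhs[i0:i], rhs[j0:j]))
                i0, j0 = i, j
                break
    if i0 < len(lhs) or j0 < len(rhs):   # unmatched remainder, kept as one piece
        pieces.append((lhs[i0:], rhs[j0:]))
    return pieces

print(split_by_length("XaYb", "aXbY"))  # [('Xa', 'aX'), ('Yb', 'bY')]
```

A remainder piece with one empty side that still contains a letter, such as the second piece of `split_by_length("Xa", "aXb")`, corresponds to a contradiction found by the length argument.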

To check whether , it is enough to check whether . Because all elements , are linear and the cardinality of is more than 3, the implication can always be proved or refuted by determining whether for every there are such , that [14].

The most complex problem in the analysis is checking whether , even for sets of the quadratic equations. In general, the language inclusion problem for word equations is undecidable [4]. To check the reducing relation for nodes labeled with and , we must check whether , and can contain non-quadratic word equations. In practice, our algorithm simplifies the equations of and then verifies whether the simplified set is a subset of . Such a simple test leads to more applications of GENERALIZE instead of CLOSE I when constructing the process tree, which results in a less precise analysis.

Sec. 7 Conclusion

Unlike approaches shown in [2, 10, 13], our algorithm works for unbounded strings in the language of non-linear word equations. Attempts to replace the word equation languages by their regular approximations showed that all equation languages except languages described by very simple linear equations cannot be modelled by rational relations [18]. Hence, using word equations as a constraint language can make sense. While in Example 4, a regular condition is introduced, in Example 7 and Example 8 verification is done on the non-regular data set.

The two main weaknesses of our approach are the following. First, the information about the branch ordering is lost; hence, if some subtree of the process tree is cut by the PRUNE rule, the computation paths which end in may be embedded in some other subtree of . Second, we weaken the constraints on parameters of the input via generalization and reducing.


Without A. P. Nemytykh, who led the research and helped much to improve the paper, this paper would not exist.


  • [1] Abdulla, P.A., Atig, M.F., Chen, Y.F., Diep, B.P., Holik, L., Rezine, A., Rummer, P.: Flatten and conquer: A framework for efficient analysis of string constraints // SIGPLAN Not. 52(6), 602–617 (Jun 2017).
  • [2] Bjorner, N., Tillmann, N., Voronkov, A.: Path feasibility analysis for string manipulating programs // Kowalewski, S., Philippou, A. (eds.) Tools and Algorithms for the Construction and Analysis of Systems. pp. 307–321. Springer Berlin Heidelberg, Berlin, Heidelberg (2009).
  • [3] Burstall, R.M., Darlington, J.: A transformation system for developing recursive programs // J. ACM 24(1), 44–67 (1977)
  • [4] Freydenberger, D.D.: Inclusion of pattern languages and related problems // Ausgezeichnete Informatikdissertationen (2011)
  • [5] Hamilton, G.W.: Verifying temporal properties of reactive systems by transformation // Proceedings of the Third International Workshop on Verification and Program Transformation, VPT@ETAPS 2015, London, United Kingdom, 11th April 2015. pp. 33–49 (2015).
  • [6] Hmelevskij, J.I.: Equations in a free semigroup. (in Russian) // Trudy Mat. Inst. Steklov 107, 286 (1971).
  • [7] Huova, M.: Combinatorics of words. new aspects on avoidability, defect effect, equations and palindromes // Ph.D. Thesis (2014).
  • [8] Karhumaki, J., Maurer, H., Paun, G., Rozenberg, G.: Jewels are Forever, Contributions on Theoretical Computer Science in Honor of Arto Salomaa (01 1999).
  • [9] Karhumaki, J., Mignosi, F., Plandowski, W.: The expressibility of languages and relations by word equations // J. ACM 47(3), 483–505 (May 2000).
  • [10] Liang, T., Reynolds, A., Tsiskaridze, N., Tinelli, C., Barrett, C., Deters, M.: An efficient SMT solver for string constraints // Form. Methods Syst. Des. 48(3), 206–234 (Jun 2016).
  • [11] Lisitsa, A., Nemytykh, A.P.: Reachability analysis in verification via supercompilation // Int. J. Foundations of Computer Science 19(4), 953–970 (2008).
  • [12] M.T. Trinh, D.H. Chu, J.Jaffar: Progressive reasoning over recursively-defined strings // Proc. CAV 2016 (LNCS). vol. 9779, pp. 218–240 (2016).
  • [13] Saxena, P., Akhawe, D., Hanna, S., F. Mao, S., Song, D.: A symbolic execution framework for javascript // SP. pp. 513–528 (2010).
  • [14] S. Jain, Y. S. Ong, F.Stefan: Regular patterns, regular languages and context-free languages // Information Processing Letters 110, 1114–1119 (2010).
  • [15] Sorensen, M.H., Gluck, R., Jones, N.D.: A positive supercompiler // Journal of Functional Programming 6, 465–479 (1993).
  • [16] Turchin, V.F.: The concept of a supercompiler // ACM Transactions on Programming Languages and Systems 8(3), 292–325 (1986).
  • [17] Turchin, V.F.: Refal-5, Programming Guide and Reference Manual. New England Publishing Co., Holyoke, Massachusetts (1989), electronic version:
  • [18] Yu, F., Bultan, T., Ibarra, O.H.: Relational string verification using multi-track automata. // Domaratzki, M., Salomaa, K. (eds.) Implementation and Application of Automata. pp. 290–299. Springer Berlin Heidelberg, Berlin, Heidelberg (2011)


7.1 Transition Rules

In the table below, if then , . The disjunction operation means that a branching in the process tree is generated.

The implications above lose information, in particular, if the middle expression contains some free variables. If is long, splitting it into all possible parts produces too many branches in the process tree.

If and a corollary contains an inequality on , then the transition rules are applied successively until the first branching is generated. If an application of a transition rule produces the second branching, then the inequality on producing the branching is thrown away. In the constraint analysis (Sec. 4) such a situation never occurs, because of the restricted form of the narrowings.

Given an inequality , all its corollaries are of the form , where is a substring of (modulo renamings of ).

7.2 Special Generalizations

Here means the substitution of in the inequality results in the contradiction.

Rules for the inequalities are given below. For the third rule, besides the constraint , the special condition must be checked.

Rules for the equations are given below. is a special delimiter symbol, and we assume .

The second rule for the equations in the table above is used optionally, for it generates new parameters .

7.3 Examples

Example 5

For the sake of brevity, we write as and omit trivial and sets in configurations. We also omit the set enclosing brackets if the corresponding set is a singleton.

Let be the initial configuration. Then it may be unfolded either to by the narrowing or to