An Automata-based Abstract Semantics for String Manipulation Languages

08/17/2018 ∙ by Vincenzo Arceri, et al. ∙ University of Verona

In recent years, dynamic languages such as JavaScript and Python have seen a significant increase in usage across a wide range of fields and applications. Their tricky and often misunderstood behaviors pose a hard challenge for the static analysis of these programming languages. A key aspect of any dynamic language program is its pervasive use of strings, since they can be implicitly converted to values of another type, transformed by string-to-code primitives, or used to access object properties. Unfortunately, string analyses for dynamic languages still lack precision and do not take into account some important string features. Moreover, string obfuscation is very popular in malicious code written in dynamic languages, for example to hide code inside strings and then dynamically transform those strings into executable code. In this scenario, more precise string analyses become a necessity. This paper proposes a new semantics for string analysis, taking a first step towards handling the string features of dynamic languages.




1 Introduction

Dynamic languages, such as JavaScript or Python, have seen a significant increase in usage across a very wide range of fields and applications. Common features of dynamic languages are dynamic typing (typing occurs during program execution, at run-time) and implicit type conversion [pradel2015], which lighten the development phase and avoid blocking program execution in the presence of unexpected or unpredictable situations. Moreover, one important aspect of dynamic languages is the way strings may be used. In JavaScript, for example, strings can be used to access object properties or be transformed into executable code by the global function eval. In this way, dynamic languages provide multiple string features that simplify writing programs while allowing statically unpredictable executions, which may make programs harder to understand [pradel2015]. For this reason, string obfuscation (e.g., string splitting) is becoming one of the most common obfuscation techniques in JavaScript malware [xu2012], making code hard to analyze statically. Consider, for example, the JavaScript program fragment in Fig. 1, where strings are manipulated, de-obfuscated, combined together into the variable dec and finally transformed into executable code, namely the statement ws = new ActiveXObject(WScript.Shell) (see the ActiveXObject Microsoft documentation). This command, in Internet Explorer, opens a shell which may execute malicious commands. The command is not hard-coded in the fragment but is built at run-time, and the initial values of i, j and k, and therefore the number of iterations of the loops in the fragment, are unknown. All these observations suggest that, in order to statically understand statements that are dynamically generated and executed, it may be extremely useful to statically analyze the string value of dec.
Unfortunately, existing static analyzers for dynamic languages [tajs2009, jsai2014, safe2012, hauzar2015] may fail to precisely analyze strings in dynamic contexts. For instance, in the example above, existing static analyzers [tajs2009, jsai2014, safe2012] lose precision on the eval input value, returning any possible string value. Namely, the analysis of dynamic languages, even when tackled by sophisticated tools such as the cited ones, still lacks formal approaches for handling the more dynamic features of string manipulation, such as dynamic typing, implicit type conversion and dynamic code generation.


In this paper, we focus on the characterization of an abstract interpretation-based [cousot1977] formal framework for handling dynamic typing and implicit type conversion, by defining an abstract semantics able to capture these dynamic features (precisely, when possible). Even if we do not yet tackle the problem of analyzing dynamically generated code (by using statements such as eval), we strongly believe that such a semantics is a necessary step towards a sufficiently precise analysis of dynamically generated code. With this task in mind, we first discuss how to combine abstract domains of primitive types (strings, integers and booleans) in order to capture dynamic typing. Once we have such an abstract domain, we define on it an abstract semantics for a core language, augmented with implicit type conversion, dynamic typing and some interesting string operations, whose concrete semantics is inspired by the JavaScript one. In particular, for each of these operations we provide the algorithm computing its abstract semantics and we discuss its soundness and completeness.

  v = "wZsZ"; vd = "";
  while (i < v.length) {
    vd = vd + v.charAt(i);
    i = i + 2;
  }
  m = "AYcYtYiYvYeYXY"; ac = "";
  while (j < m.length) {
    ac = ac + m.charAt(j);
    j = j + 2;
  }
  ac = ac + "Object";  la = "";
  l = "WYSYcYrYiYpYtY.YSYhYeYlYlY";
  while (k < l.length) {
    la = la + l.charAt(k);
    k = k + 2;
  }
  dec = vd + "=new " + ac + "(" + la + ")";

Figure 1: A potentially malicious obfuscated JavaScript program.
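To make the effect of the fragment concrete, the following sketch replays its three loops for given start indexes. It is only an illustration: the function name `deobfuscate` and its parameters are hypothetical, each loop is assumed to index its string with its own counter, and the start values (0, 0, 0) used below are an assumption, since the text leaves i, j and k unknown.

```javascript
// Hypothetical helper: replays the three loops of Fig. 1 for given start indexes.
function deobfuscate(i, j, k) {
  const v = "wZsZ";
  let vd = "";
  while (i < v.length) { vd = vd + v.charAt(i); i = i + 2; }

  const m = "AYcYtYiYvYeYXY";
  let ac = "";
  while (j < m.length) { ac = ac + m.charAt(j); j = j + 2; }
  ac = ac + "Object";

  const l = "WYSYcYrYiYpYtY.YSYhYeYlYlY";
  let la = "";
  while (k < l.length) { la = la + l.charAt(k); k = k + 2; }

  // The string that the fragment would eventually turn into code.
  return vd + "=new " + ac + "(" + la + ")";
}

console.log(deobfuscate(0, 0, 0)); // ws=new ActiveXObject(WScript.Shell)
```

With other start values the loops skip different characters, which is why a static analysis has to reason about a set of possible strings for dec rather than a single one.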

Paper structure.

In Sect. 2 we recall relevant notions on finite state automata, the core language we adopt in this paper and the finite state automata domain, highlighting some important operations and theoretical results. In Sect. 3 we present two ways of combining abstract domains (for primitive types) suitable for dynamic languages. Then, in Sect. 4 we present the novel abstract semantics for string manipulation programs. Finally, in Sect. 5 we discuss related work and conclude the paper.

2 Background

2.1 Basic notations and concepts

String notation.

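We denote by $\Sigma$ a finite alphabet of symbols, its Kleene-closure by $\Sigma^*$ and a string element by $\sigma \in \Sigma^*$. If $\sigma = \sigma_0 \sigma_1 \cdots \sigma_n$, the length of $\sigma$ is $|\sigma| = n + 1$ and the element in the $i$-th position is $\sigma_i$. Given two strings $\sigma, \sigma' \in \Sigma^*$, $\sigma\sigma'$ is their concatenation. A language is a set of strings, i.e., $L \in \wp(\Sigma^*)$. We use the following notations: $\Sigma^{i} = \{\sigma \in \Sigma^* \mid |\sigma| = i\}$ and $\Sigma^{<i} = \bigcup_{j<i} \Sigma^{j}$. Given $\sigma \in \Sigma^*$ and $i, j \in \mathbb{N}$ ($i \le j \le |\sigma|$), the substring between $i$ and $j$ of $\sigma$ is the string $\sigma_i \cdots \sigma_{j-1}$. We also consider the set of numeric strings, i.e., strings corresponding to (signed) integers, together with a function that maps numeric strings to the corresponding integers. Dually, we define the function that maps each integer to its minimal numeric string representation (e.g., 1 is mapped to the string "1", and not "+1").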

Regular languages and finite state automata.

We follow [hopcroft1979] for automata notation. A finite state automaton (FA) is a tuple $A = (Q, \Sigma, \delta, q_0, F)$ where $Q$ is a finite set of states, $q_0 \in Q$ is the initial state, $\Sigma$ is a finite alphabet, $\delta \subseteq Q \times \Sigma \times Q$ is the transition relation and $F \subseteq Q$ is the set of final states. In particular, if $\delta : Q \times \Sigma \to Q$ is a function then $A$ is called a deterministic FA (DFA). (Footnote: we also regard as DFAs those FAs that are not complete, namely those lacking a transition for some pair $(q, a) \in Q \times \Sigma$; they can be easily transformed into a DFA by adding a sink state receiving all the missing transitions.) The class of languages recognized by FAs is the class of regular languages. We denote the set of all DFAs as Dfa. Given an automaton $A$, we denote the language accepted by $A$ as $\mathcal{L}(A)$. A language $L$ is regular iff there exists an FA $A$ such that $L = \mathcal{L}(A)$. From the Myhill-Nerode theorem [davis1994], for each regular language there exists a unique minimum automaton, i.e., one with the minimum number of states, recognizing the language. Given a regular language $L$, we denote by $\mathsf{Min}(L)$ the minimum DFA $A$ s.t. $L = \mathcal{L}(A)$.
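As a small illustration of these notions, the sketch below encodes a (possibly incomplete) DFA and checks whether it accepts a string. The encoding (a Map keyed by "state,symbol") and the names `makeDfa` and `accepts` are assumptions made for this example, not part of the formal development.

```javascript
// Builds a DFA from a list of [from, symbol, to] transitions.
// The Map-based encoding is an assumption made for this sketch.
function makeDfa(initial, finals, transitions) {
  const delta = new Map();
  for (const [from, symbol, to] of transitions) {
    delta.set(from + "," + symbol, to);
  }
  return { initial, finals: new Set(finals), delta };
}

// Returns true iff the automaton accepts the string. A missing transition
// behaves like a move to a non-accepting sink state.
function accepts(dfa, str) {
  let state = dfa.initial;
  for (const symbol of str) {
    const key = state + "," + symbol;
    if (!dfa.delta.has(key)) return false;
    state = dfa.delta.get(key);
  }
  return dfa.finals.has(state);
}

// Example: an automaton for the strings made of one "a" followed by any number of "b".
const abStar = makeDfa(0, [1], [[0, "a", 1], [1, "b", 1]]);
```

Treating a missing transition as rejection is what the sink-state completion achieves, without materializing the sink.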

The programming language.

We consider a core language (Fig. 2) that contains four representative string operations taken from the set of methods offered by the JavaScript built-in class String [w3school-string]. Other string operations, such as the JavaScript lastIndexOf or startsWith, can be modeled by composition of the given string operations or as particular cases of them.

<Exp> ::= Id  |  v  |  Exp + Exp  |  Exp - Exp  |  Exp * Exp  |  Exp / Exp
        |  Exp && Exp  |  Exp || Exp  |  ! Exp  |  Exp > Exp  |  Exp < Exp
        |  Exp == Exp  |  Exp.substring(Exp,Exp)  |  Exp.charAt(Exp)
        |  Exp.indexOf(Exp)  |  Exp.length

<Block> ::= { }  |  { Stmt }

<Stmt> ::= Id = Exp;  |  if (Exp) Block else Block  |  while (Exp) Block
         |  Block  |  Stmt Stmt  |  ;

Figure 2: Syntax of the core language.

Primitive values are integers, booleans, strings over the alphabet $\Sigma$, and NaN, a special value denoting not-a-number.

Implicit type conversion.

In order to properly capture the semantics of the language, inspired by the JavaScript semantics, we need to deal with implicit type conversion [arceri2017]. For each primitive type, we define an auxiliary function converting primitive values to values of that type (Fig. 3). Note that all the functions behave like the identity when applied to values not needing conversion, e.g., toInteger on integers. The string conversion maps any input value to its string representation. The integer conversion returns the integer corresponding to a value, when possible: for true and false it returns 1 and 0, respectively; for numeric strings it returns the corresponding integer; all the other values are converted to NaN. For instance, the numeric string "42" is converted to the integer 42, while the string "foo" is converted to NaN. Finally, the boolean conversion returns false on the input singled out in Fig. 3, and true for all the other non-boolean primitive values.



Figure 3: implicit type conversion functions.
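The string and integer conversions described above can be sketched as follows. This is an illustration under stated assumptions: the names `toStr` and `toInteger` and the regular expression used to recognize numeric strings are choices made for this example, and the boolean conversion is left out because its exact definition is given only in Fig. 3.

```javascript
// Numeric strings: optionally signed sequences of digits (assumed shape).
const NUMERIC = /^[+-]?[0-9]+$/;

// String conversion: any primitive value to its string representation.
function toStr(v) {
  return String(v);
}

// Integer conversion on primitive values (integers, booleans, strings, NaN).
function toInteger(v) {
  if (typeof v === "number") return v;            // identity on integers and NaN
  if (typeof v === "boolean") return v ? 1 : 0;   // true -> 1, false -> 0
  if (NUMERIC.test(v)) return parseInt(v, 10);    // numeric strings
  return NaN;                                     // any other string
}
```

For example, `toInteger("42")` yields 42 and `toInteger("foo")` yields NaN, while `toStr(1)` yields the minimal representation "1".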


Program states are partial maps from identifiers to primitive values. The concrete big-step semantics is quite standard, and it includes dynamic typing and implicit type conversion. The expression semantics is also standard; we only provide the formal and precise semantics of the four string operations of the language. In what follows, the value on which an operation is invoked must be a string (otherwise a run-time error occurs), while the argument of indexOf is a string and index arguments are integers (in both cases, values that are not strings or numbers, respectively, are converted by the implicit type conversion primitives).


substring. It extracts substrings from strings, i.e., all the characters between two indexes. Its semantics is the function Ss, defined for an initial index not greater than the final one (negative values are treated as zero); when the initial index is greater than the final one, the two indexes are swapped.

charAt. It returns the character at a specified index. Its semantics is the function Ca.

indexOf. It returns the position of the first occurrence of a given substring. Its semantics is the function Io.

length. It returns the length of a string $\sigma$. Its semantics is the function Le, trivially defined as $\mathsf{Le}(\sigma) = |\sigma|$.
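The four operations can be sketched directly in JavaScript. The definitions below are an illustration only: they follow the behaviour of the JavaScript built-ins that inspire the concrete semantics (clamping of indexes, swapping, empty string for an out-of-range position, -1 for a missing occurrence), which is an assumption wherever the formal definitions of Ss, Ca, Io and Le differ. The helper `clamp` is hypothetical.

```javascript
// Clamp an index into [0, len]; negative values are treated as zero.
function clamp(n, len) {
  return Math.min(Math.max(n, 0), len);
}

// Ss: substring between two indexes, swapping them when the first is greater.
function Ss(s, i, j) {
  let a = clamp(i, s.length);
  let b = clamp(j, s.length);
  if (a > b) [a, b] = [b, a];
  return s.slice(a, b);
}

// Ca: character at index i, or the empty string when i is out of range.
function Ca(s, i) {
  return i >= 0 && i < s.length ? s[i] : "";
}

// Io: position of the first occurrence of t in s, or -1 if there is none.
function Io(s, t) {
  for (let p = 0; p + t.length <= s.length; p++) {
    if (s.slice(p, p + t.length) === t) return p;
  }
  return -1;
}

// Le: length of s.
function Le(s) {
  return s.length;
}
```

On integer arguments these functions agree with `substring`, `charAt`, `indexOf` and `length` of the built-in String class, e.g. `Ss("hello", 3, 1)` and `"hello".substring(3, 1)` both yield "el".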

2.2 The finite state automata domain for strings

In this section, we describe the automata abstract domain for strings [park2016, wid-approach, yu2008], namely the domain of regular languages over $\Sigma$. In particular, our aim is to underline the well-known theoretical foundations of regular languages (and therefore of DFAs) that characterize automata as a domain for abstracting the computation of program semantics in the abstract interpretation framework. The idea exploited is that of approximating strings as regular languages represented by the minimum DFAs [davis1994] recognizing them. In general, we have more DFAs than regular languages, hence the domain of automata is indeed the quotient $\mathrm{Dfa}/_{\equiv}$ w.r.t. the equivalence relation induced by language equality: $A_1 \equiv A_2 \Leftrightarrow \mathcal{L}(A_1) = \mathcal{L}(A_2)$. We abuse notation by representing each equivalence class in $\mathrm{Dfa}/_{\equiv}$ by one of its automata (usually the minimum), i.e., when we write $A \in \mathrm{Dfa}/_{\equiv}$ we mean $[A]_{\equiv}$.
The partial order induced by language inclusion is $A_1 \sqsubseteq A_2 \Leftrightarrow \mathcal{L}(A_1) \subseteq \mathcal{L}(A_2)$, which is well defined since automata in the same $\equiv$-equivalence class recognize the same language.






Figure 4: Least upper bound of two automata: (a) and (b) the input automata, (c) their least upper bound.

The corresponding least upper bound on the domain $\mathrm{Dfa}/_{\equiv}$ corresponds to the standard union of automata: $A_1 \sqcup A_2$ is the minimum automaton recognizing the union of the languages $\mathcal{L}(A_1)$ and $\mathcal{L}(A_2)$. This is a well-defined notion since regular languages are closed under union. As an example, consider Fig. 4, where the automaton in Fig. 4(c) is the least upper bound of the automata given in Fig. 4(a) and Fig. 4(b).
The (finite) greatest lower bound corresponds to automata intersection (since regular languages are closed under finite intersection): $A_1 \sqcap A_2$ is the minimum automaton recognizing $\mathcal{L}(A_1) \cap \mathcal{L}(A_2)$.

Theorem 2.1

$\mathrm{Dfa}/_{\equiv}$ is a sub-lattice but not a complete meet-sub-semilattice of $\wp(\Sigma^*)$.

In other words, there exists no Galois connection between $\mathrm{Dfa}/_{\equiv}$ and $\wp(\Sigma^*)$, i.e., there may exist no minimal automaton abstracting a language. (Footnote: some works [campeanu2002, domaratzki2001, mohri2001] have studied automatic procedures to compute, given an input language, a regular cover of it [domaratzki2001], i.e., an automaton containing the language. Some of them [campeanu2002, domaratzki2001] studied regular covers guaranteeing that the automaton obtained is the best w.r.t. a minimal relation, but not the minimum.) However, this is not a concern, since the relation between concrete semantics and abstract semantics can be weakened while still ensuring soundness [cousot1992]. A well-known example is the convex polyhedra domain [cousot1978].


The domain $\mathrm{Dfa}/_{\equiv}$ is an infinite domain, and it is not ACC. (Footnote: a domain is ACC if it does not contain infinite ascending chains.) For instance, consider a set of regular languages forming an infinite ascending chain, such as the languages $\{a^j \mid j \le i\}$ for $i \ge 0$; then the set of the corresponding minimal automata also trivially forms an infinite ascending chain on $\mathrm{Dfa}/_{\equiv}$. This clearly implies that any computation on $\mathrm{Dfa}/_{\equiv}$ may fail to converge [cousot1992]. Most of the proposed abstract domains for strings [costantini2015, jsai2014, tajs2009, safe2012] trivially satisfy ACC, being finite, but they may lose precision during the abstract computation [cousot1992-2]. Domains that are not ACC must instead be equipped with a widening operator approximating the least upper bound in order to force convergence (by necessarily losing precision) for any increasing chain [cousot1992-2]. As far as automata are concerned, existing widenings are defined in terms of a state equivalence relation merging states that recognize the same language up to a fixed length (set as a parameter for tuning the widening precision) [silva2006, DBLP:conf/cav/BartzisB04].

3 An abstract domain for string manipulation

In this section, we discuss how to design an abstract domain for string manipulation that also deals with other primitive types, namely one able to combine different abstractions of different primitive types. In particular, since operations on strings also combine strings with other values (e.g., integers), an abstract domain for string analysis equipped with dynamic typing must include all the possible primitive values. The idea is to consider an abstract domain for each type of primitive value and to combine these abstract domains into a single abstract domain for the whole set of primitive values. Consider, for each primitive type, an abstract domain (we also consider the same domain without its bottom element), equipped with an abstraction and a concretization forming a Galois insertion [cousot1977].

Coalesced sum.

One way to merge domains is the coalesced sum [cousot1997]. The resulting domain contains all the non-bottom elements of the domains, together with a new top and a new bottom, respectively covering all the elements and covered by all the elements. In our case, if we consider an abstract domain for each of integers, booleans and strings, the coalesced sum is the abstraction of the set of primitive values depicted in Fig. 5.

Figure 5: Coalesced sum abstract domain for primitive values.

This is the simplest choice, but unfortunately it is not suitable for dynamic languages, and in particular for dealing with dynamic typing and implicit type conversion. The problem is that the type of a variable is inferred at run-time and may change during execution. For example, consider an if statement whose guard depends on a variable y and whose branches assign x a string and a boolean value, respectively. The value of the variable y is statically unknown hence, in order to guarantee soundness, we must take into account both branches, meaning that x may be either a string or a boolean value after the if statement. On the coalesced sum domain, the analysis would lose all precision w.r.t. the collecting semantics by returning the top element.

Lifted union.

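In order to capture union types without losing too much precision, we need to complete [GRS00, GQ01, GM16] the above domain so that it can observe collections of values of different types. In order to define this combination, let us consider a lifted union of sets: given two families of sets $\mathcal{X} \subseteq \wp(X)$ and $\mathcal{Y} \subseteq \wp(Y)$ ($X$ and $Y$ arbitrary sets), their lifted union is $\{x \cup y \mid x \in \mathcal{X},\ y \in \mathcal{Y}\}$. Hence, the complete abstract domain w.r.t. dynamic typing and implicit type conversion is the lifted union of the abstract domains of the primitive types, which is an abstraction of the powerset of primitive values. In this new lifted union domain, the value of x after the execution of the if statement is precisely the combination of the abstract string and the abstract boolean assigned in the two branches, which is now an element of the domain.
In the following, we consider the abstract domain for string analysis obtained as the lifted union of the following abstractions: the well-known abstract domain of intervals [cousot1977] for integers, an abstract domain for booleans, and the automata domain $\mathrm{Dfa}/_{\equiv}$ for strings.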
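The difference between the two combinations can be made concrete with a small sketch of the lifted union, where an abstract value carries one optional component per primitive type and the join works component-wise. Everything here is an assumption made for illustration: the names `absValue` and `join`, the use of `null` for an absent component, and in particular the use of finite sets of strings in place of automata.

```javascript
// An abstract value has one optional component per primitive type; null means
// "no value of this type". Strings are abstracted by a finite Set of strings,
// standing in for an automaton (assumption made to keep the sketch short).
function absValue({ int = null, bool = null, str = null } = {}) {
  return { int, bool, str };
}

// Component-wise join: intervals are [lo, hi] pairs, booleans and strings are Sets.
function join(x, y) {
  const joinInt = (a, b) =>
    a === null ? b : b === null ? a : [Math.min(a[0], b[0]), Math.max(a[1], b[1])];
  const joinSet = (a, b) =>
    a === null ? b : b === null ? a : new Set([...a, ...b]);
  return absValue({
    int: joinInt(x.int, y.int),
    bool: joinSet(x.bool, y.bool),
    str: joinSet(x.str, y.str),
  });
}

// x after an if statement whose branches assign a string and a boolean.
const thenBranch = absValue({ str: new Set(["hello"]) });
const elseBranch = absValue({ bool: new Set([true]) });
const afterIf = join(thenBranch, elseBranch);
```

Here `afterIf` records that x is either the string "hello" or the boolean true and carries no integer component, whereas a coalesced sum would have to answer with its top element.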

4 The abstract semantics

In this section, we define the abstract semantics of the language on the lifted union abstract domain of Sect. 3. In particular, we have to define the abstract semantics of expressions, which is standard except for the string operations, for which we explicitly describe the algorithms computing them. Let us first recall some important notions on regular languages, useful for the algorithms we will provide.

Definition 1 (Suffixes and prefixes[davis1994])

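Let $L \in \wp(\Sigma^*)$ be a regular language. The suffixes of $L$ are $\{y \in \Sigma^* \mid \exists x \in \Sigma^*.\ xy \in L\}$, and the prefixes of $L$ are $\{x \in \Sigma^* \mid \exists y \in \Sigma^*.\ xy \in L\}$.

We can also define the suffixes from a position: given $i \in \mathbb{N}$, the set of suffixes of $L$ from $i$ is $\{y \in \Sigma^* \mid \exists x \in \Sigma^*.\ |x| = i \wedge xy \in L\}$. For instance, let $L = \{abc, hello\}$, then the set of suffixes of $L$ from 2 is $\{c, llo\}$.

Definition 2 (Left quotient[davis1994])

Let $L_1, L_2 \in \wp(\Sigma^*)$ be regular languages. The left quotient of $L_1$ w.r.t. $L_2$ is $\{y \in \Sigma^* \mid \exists x \in L_2.\ xy \in L_1\}$.

Definition 3 (Right quotient[davis1994])

Let $L_1, L_2 \in \wp(\Sigma^*)$ be regular languages. The right quotient of $L_1$ w.r.t. $L_2$ is $\{x \in \Sigma^* \mid \exists y \in L_2.\ xy \in L_1\}$.

For example, let $L_1 = \{hello, help\}$ and $L_2 = \{he\}$. The left quotient of $L_1$ w.r.t. $L_2$ is $\{llo, lp\}$. Let $L_1 = \{hello, jello\}$ and $L_2 = \{llo\}$. The right quotient of $L_1$ w.r.t. $L_2$ is $\{he, je\}$.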

Definition 4 (Substrings/Factors[bordihn09])

Let $L \in \wp(\Sigma^*)$ be a regular language. The set of its substrings/factors is $\{y \in \Sigma^* \mid \exists x, z \in \Sigma^*.\ xyz \in L\}$.
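Definitions 1 to 4 can be tried out on finite languages. The sketch below models a language as a JavaScript Set of strings, which is an assumption made for illustration (the development itself works on regular languages represented by automata); all function names are hypothetical.

```javascript
// All prefixes of the strings in L (including the empty string and the strings themselves).
const prefixes = (L) =>
  new Set([...L].flatMap((w) => [...Array(w.length + 1).keys()].map((n) => w.slice(0, n))));

// All suffixes of the strings in L.
const suffixes = (L) =>
  new Set([...L].flatMap((w) => [...Array(w.length + 1).keys()].map((n) => w.slice(n))));

// Suffixes obtained by dropping exactly i leading symbols.
const suffixesFrom = (L, i) =>
  new Set([...L].filter((w) => w.length >= i).map((w) => w.slice(i)));

// Left quotient of L1 w.r.t. L2: strip a prefix belonging to L2.
const leftQuotient = (L1, L2) =>
  new Set([...L1].flatMap((w) =>
    [...L2].filter((x) => w.startsWith(x)).map((x) => w.slice(x.length))));

// Right quotient of L1 w.r.t. L2: strip a suffix belonging to L2.
const rightQuotient = (L1, L2) =>
  new Set([...L1].flatMap((w) =>
    [...L2].filter((y) => w.endsWith(y)).map((y) => w.slice(0, w.length - y.length))));

// Factors: every prefix of every suffix.
const factors = (L) => prefixes(suffixes(L));
```

Computing factors as prefixes of suffixes mirrors the set-theoretic definition: a string y is a factor of xyz exactly when it is a prefix of the suffix yz.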

These operations are all defined as transformations of regular languages. In [davis1994] the corresponding algorithms on FA are provided. In particular, each of the transformations above (suffixes, prefixes, left quotient, right quotient and factors) has a corresponding algorithm on automata. Namely, when applied to automata recognizing the input languages, each algorithm returns an automaton recognizing the transformed language.

As far as (state) complexity is concerned [YuZS94], prefix and right quotient operations have linear complexity, while suffix, left quotient and factor operations are, in general, exponential [YuZS94, pribavkina2010].

4.1 Abstract semantics of substring

In this section, we define the abstract semantics of substring, i.e., we define the operator SS, which takes an automaton, an interval of initial indexes and an interval of final indexes, and computes the automaton recognizing the set of all substrings of the language of the input automaton between the indexes in the two intervals. Since the abstract semantics has to take into account the swaps occurring when the initial index is greater than the final one, several cases arise in handling (potentially unbounded) intervals. Tab. 1 reports the abstract semantics of SS for one ordering of the interval bounds. The definition of this semantics is by recursion with four base cases (the other cases are recursive calls that split and rewrite the input intervals in order to match, or get closer to, the base cases), for which we describe the algorithmic characterization. For the sake of readability, we denote by $\sqcup$ the least upper bound and by $\sqcap$ the greatest lower bound on automata. The base cases are:

  1. If both intervals are bounded (first row, first column of Tab. 1), we have to compute the language of all the substrings between an initial index in the first interval and a final index in the second one. For example, let $L = \{hello\}$; the set of its substrings from 1 to 3 is $\{el\}$. A dedicated operator computes the automaton accepting this language;

  2. When both intervals are unconstrained, the result is the automaton of all possible factors of A (last row, last column);

  3. If the interval of initial indexes is bounded and the interval of final indexes is unbounded (first row, third column), we have to compute the automaton recognizing all the substrings between a finite interval of initial indexes and an unbounded final index.

    The abstract semantics returns the least upper bound of all the automata of substrings from an initial index in the first interval to an unbounded final index greater than or equal to the lower bound of the second interval;

  4. When both intervals are unbounded (third row, third column of Tab. 1), we split the language to accept. In particular, we compute the substrings whose initial index ranges over a bounded interval (falling into the previous case), and the automaton recognizing the language of all substrings whose initial and final indexes are both arbitrary values greater than a given bound. A dedicated algorithm computes this latter set.

We show here only the table for the main case. The few remaining cases are reported in Tab. 2 and Tab. 3 in the appendix.
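The language that the first and third base cases have to recognize can be illustrated on finite languages. The sketch below is not the automata-based operator: it enumerates substrings of a finite Set of strings (an assumption standing in for the language of an automaton), relies on the built-in `substring` for clamping and swapping, and requires the interval of initial indexes to be bounded. The name `substringsBetween` is hypothetical.

```javascript
// All substrings of strings in L with initial index in [i, j] and final index
// in [l, h]; h may be Infinity, i and j must be finite integers.
function substringsBetween(L, [i, j], [l, h]) {
  const result = new Set();
  for (const s of L) {
    // Final indexes beyond the string length all behave like s.length,
    // so an unbounded interval only needs to be explored up to that point.
    const lastFinal = Math.min(h, Math.max(l, s.length));
    for (let a = i; a <= j; a++) {
      for (let b = l; b <= lastFinal; b++) {
        result.add(s.substring(a, b)); // clamps and swaps like the concrete semantics
      }
    }
  }
  return result;
}
```

For instance, `substringsBetween(new Set(["abc"]), [0, 1], [2, Infinity])` collects "ab", "abc", "b" and "bc": the unbounded final interval contributes nothing new once the final index reaches the length of the string.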


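Table 1: Definition of SS.

Theorem 4.1 (Termination of SS)

For each input automaton and pair of input intervals, SS performs at most three recursive calls before reaching a base case.

Theorem 4.2 (Soundness and completeness of SS)

SS is sound and complete: given an automaton and two intervals of indexes, the automaton it returns recognizes exactly the set of substrings computed by Ss on the strings of the input language with indexes in the two intervals.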

4.2 Abstract semantics of charAt

The abstract semantics of charAt should return the automaton accepting the language of all the characters of the strings accepted by an automaton A, in a position inside a given interval.

We call SS (defined before) when the index interval is finite. In the remaining two cases, we use a function returning the set of characters read in any transition of an automaton. When only the lower bound of the interval is finite, we return the characters read from that position onwards, together with the empty string, while, when the interval is completely unbounded, we simply return the characters of the automaton together with the empty string.
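On a finite language, the set of characters that this abstract operation has to recognize can be enumerated directly. The sketch below is only an illustration: the Set-of-strings representation, the name `charsAt`, and the choice of the empty string for out-of-range positions (as in JavaScript's charAt) are assumptions, and the lower bound of the interval is required to be finite.

```javascript
// Characters of strings in L at a position inside [i, j]; j may be Infinity.
// Out-of-range positions yield the empty string, as in JavaScript's charAt.
function charsAt(L, [i, j]) {
  const result = new Set();
  for (const s of L) {
    // Positions at or beyond s.length all yield "", so one of them is enough.
    const last = Math.min(j, Math.max(i, s.length));
    for (let p = i; p <= last; p++) result.add(s.charAt(p));
  }
  return result;
}
```

For example, `charsAt(new Set(["abc"]), [1, Infinity])` contains "b", "c" and the empty string.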

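Theorem 4.3 (Soundness and completeness of the abstract semantics of charAt)

The abstract semantics of charAt is sound and complete w.r.t. Ca, for every automaton and every interval of indexes.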

4.3 Abstract semantics of length

The abstract semantics of length should return the interval of all the possible lengths of the strings recognized by an automaton. It is computed by Alg. 1, which relies on auxiliary functions returning the minimum and the maximum paths between two states of the input automaton [rivestbook], a function returning the size of a path, and a function checking whether the automaton contains cycles [rivestbook].


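Input: Deterministic finite state automaton A with initial state q0 and final states F
min ← +∞; max ← −∞;
if A contains cycles then
      foreach final state q in F do
            p ← minimum path from q0 to q;
            if size(p) < min then
                  min ← size(p);
            end if
      end foreach
      return [min, +∞];
else
      foreach final state q in F do
            p ← minimum path from q0 to q; P ← maximum path from q0 to q;
            if size(p) < min then
                  min ← size(p);
            end if
            if size(P) > max then
                  max ← size(P);
            end if
      end foreach
      return [min, max];
end if
Algorithm 1: Abstract semantics of length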

The idea is to compute the minimum and the maximum path reaching each final state in the automaton (see the automaton in Fig. 6(a)). Then, we abstract the set of lengths obtained so far into an interval. Problems arise when the automaton contains cycles. In this case, we simply return the unbounded interval starting from the length of the minimum path to a final state and extending to $+\infty$. For example, for the automaton in Fig. 6(b), the length interval is unbounded above.
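The idea behind Alg. 1 can be sketched in JavaScript as follows. This is a minimal sketch under stated assumptions rather than the algorithm itself: the automaton encoding (`initial`, `finals`, and an `edges` Map from a state to its successor states), the name `lengthInterval`, and the decision to look for cycles only among states lying on a path from the initial state to a final state are all choices made for this example.

```javascript
// Returns [min, max] (max may be Infinity), or null for the empty language.
function lengthInterval(dfa) {
  const succ = (q) => dfa.edges.get(q) || [];

  // Forward breadth-first search: shortest distance from the initial state.
  const dist = new Map([[dfa.initial, 0]]);
  const queue = [dfa.initial];
  while (queue.length > 0) {
    const q = queue.shift();
    for (const r of succ(q)) {
      if (!dist.has(r)) {
        dist.set(r, dist.get(q) + 1);
        queue.push(r);
      }
    }
  }
  const finals = [...dfa.finals].filter((q) => dist.has(q));
  if (finals.length === 0) return null; // no accepted string, hence no length
  const min = Math.min(...finals.map((q) => dist.get(q)));

  // Backward closure: reachable states from which a final state is reachable.
  const useful = new Set(finals);
  let changed = true;
  while (changed) {
    changed = false;
    for (const q of dist.keys()) {
      if (!useful.has(q) && succ(q).some((r) => useful.has(r))) {
        useful.add(q);
        changed = true;
      }
    }
  }

  // Depth-first search over useful states: longest path to a final state,
  // or Infinity as soon as a cycle is met.
  const longest = new Map();
  const onStack = new Set();
  function visit(q) {
    if (onStack.has(q)) return Infinity; // back edge: cycle among useful states
    if (longest.has(q)) return longest.get(q);
    onStack.add(q);
    let best = dfa.finals.has(q) ? 0 : -Infinity;
    for (const r of succ(q)) {
      if (useful.has(r)) best = Math.max(best, 1 + visit(r));
    }
    onStack.delete(q);
    longest.set(q, best);
    return best;
  }
  return [min, visit(dfa.initial)];
}
```

For an acyclic automaton accepting "a" and "abc" the sketch returns [1, 3], while adding a loop on a final state turns the upper bound into Infinity, matching the loss of precision discussed above.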




Figure 6: (a) An automaton without cycles. (b) An automaton with cycles.

Theorem 4.4

The abstract semantics of length is sound but not complete.

4.4 Abstract semantics of indexOf