This command, in Internet Explorer, opens a shell which may execute malicious commands. The command is not hard-coded in the fragment but is built at run-time, and the initial values of i, j and k, and therefore the number of iterations of the loops in the fragment, are unknown.
All these observations suggest that, in order to statically understand statements that are dynamically generated and executed, it may be extremely useful to statically analyze the string value of dec.
Unfortunately, existing static analyzers for dynamic languages [tajs2009, jsai2014, safe2012, hauzar2015] may fail to precisely analyze strings in dynamic contexts. For instance, in the example above, existing static analyzers [tajs2009, jsai2014, safe2012] lose precision on the eval input value, returning any possible string value. Namely, the analysis of dynamic languages, even when tackled by sophisticated tools such as those cited, still lacks formal approaches for handling the more dynamic features of string manipulation, such as dynamic typing, implicit type conversion and dynamic code generation.
In Sect. 2 we recall relevant notions on finite state automata, the core language we adapt for this paper, and the finite state automata domain, highlighting some important operations and theoretical results. In Sect. 3 we present and discuss two ways of combining abstract domains (for primitive types) suitable for dynamic languages. Then, in Sect. 4 we present the novel abstract semantics for string manipulation programs. Finally, in Sect. 5 we discuss related work and conclude the paper.
2.1 Basic notations and concepts
We denote by a finite alphabet of symbols, its Kleene-closure by and a string element by . If , the length of is and the element in the -th position is . Given two strings , is their concatenation. A language is a set of strings, i.e., . We use the following notations: and . Given , () the substring between and of is the string , and we denote it by . We denote by the set of numeric strings, i.e., strings corresponding to (signed) integers. maps numeric strings to the corresponding integers. Dually, we define the function that maps each integer to its minimal numeric string representation (e.g., 1 is mapped to the string "1", and not "+1").
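The correspondence between numeric strings and integers described above can be illustrated with a minimal sketch. The names `is_numeric_string`, `to_int` and `to_min_string` are hypothetical, and the regular expression is an assumption about what counts as a (signed) integer string:

```python
import re

# Assumption: a numeric string is an optional sign followed by decimal digits.
NUMERIC = re.compile(r"[+-]?[0-9]+")

def is_numeric_string(s: str) -> bool:
    """Check whether s corresponds to a (signed) integer."""
    return NUMERIC.fullmatch(s) is not None

def to_int(s: str) -> int:
    """Map a numeric string to the corresponding integer."""
    if not is_numeric_string(s):
        raise ValueError(f"not a numeric string: {s!r}")
    return int(s)

def to_min_string(n: int) -> str:
    """Map an integer to its minimal numeric string (no '+', no leading zeros)."""
    return str(n)
```

Several numeric strings may denote the same integer (for instance "1", "+1" and "001"), while the reverse mapping always picks the shortest representation.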
Regular languages and finite state automata.
We follow [hopcroft1979] for automata notation. A finite state automaton (FA) is a tuple where is a finite set of states, is the initial state, is a finite alphabet, is the transition relation and is the set of final states. In particular, if is a function then A is called a deterministic FA (DFA). (We also consider as DFAs those FAs which are not complete, namely such that a transition does not exist for every pair (, ); they can easily be transformed into a complete DFA by adding a sink state receiving all the missing transitions.) The class of languages recognized by FAs is the class of regular languages. We denote the set of all DFAs as Dfa. Given an automaton A, we denote the language accepted by A as . A language L is regular iff there exists a FA A such that . From the Myhill-Nerode theorem [davis1994], for each regular language there exists a unique minimum automaton, i.e., one with the minimum number of states, recognizing the language. Given a regular language L, we denote by the minimum DFA A s.t. .
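The completion of a partial DFA by means of a sink state can be sketched as follows. This is only an illustration: the dictionary encoding of the transition function, the function name `complete` and the default sink name are assumptions, and the sink name is assumed not to clash with an existing state.

```python
def complete(states, alphabet, delta, sink="sink"):
    """Return (states, delta) of a complete DFA.

    delta is a dict mapping (state, symbol) to the target state and may be
    partial; every missing transition is redirected to a fresh sink state.
    """
    total = dict(delta)
    missing = [(q, a) for q in states for a in alphabet if (q, a) not in delta]
    if not missing:
        return set(states), total  # already complete, no sink needed
    all_states = set(states) | {sink}
    for q in all_states:
        for a in alphabet:
            # setdefault keeps existing transitions and fills in the gaps,
            # including the self-loops on the sink itself
            total.setdefault((q, a), sink)
    return all_states, total
```

Since the sink is not final and only loops on itself, the recognized language is unchanged.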
The programming language.
Primitive values are with (strings on the alphabet ), and NaN a special value denoting not-a-number.
Implicit type conversion.
Program states are partial maps from identifiers to primitive values, i.e., . The concrete big-step semantics is quite standard, and it includes dynamic typing and implicit type conversion. Also the expression semantics, , is standard; we only provide the formal and precise semantics of the four string operations we have in : Let (otherwise a run-time error occurs), and (in both cases, values which are not strings or numbers respectively, are converted by the implicit type conversion primitives).
It extracts substrings from strings, i.e., all the characters between two indexes. The semantics is the function Ss defined as: Suppose (negative values are treated as zero),
It returns the character at a specified index. The semantics is the function Ca defined as follows:
It returns the position of the first occurrence of a given substring, namely . The semantics is the function Io defined as follows:
It returns the length of a string . Its semantics is the function Le trivially defined as .
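The four string operations above can be sketched on concrete strings as follows. This is a hedged illustration rather than the formal definitions: it assumes JavaScript-like behavior (negative indexes treated as zero, indexes swapped when the first exceeds the second, the empty string for an out-of-range position, -1 when no occurrence exists), and the Python function names are hypothetical.

```python
def substring(s: str, i: int, j: int) -> str:
    """Characters of s between two indexes (assumed JavaScript-like)."""
    i, j = max(i, 0), max(j, 0)   # negative values are treated as zero
    if i > j:
        i, j = j, i               # swap when the initial index is greater
    return s[i:j]                 # slicing clamps indexes beyond len(s)

def char_at(s: str, i: int) -> str:
    """Character at index i, or the empty string if i is out of range."""
    return s[i] if 0 <= i < len(s) else ""

def index_of(s: str, t: str) -> int:
    """Position of the first occurrence of t in s, or -1 if there is none."""
    return s.find(t)

def length(s: str) -> int:
    """Length of s."""
    return len(s)
```

For instance, `substring("hello", 3, 1)` and `substring("hello", 1, 3)` denote the same substring under the swap assumption.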
2.2 The finite state automata domain for strings
In this section, we describe the automata abstract domain for strings [park2016, wid-approach, yu2008], namely the domain of regular languages over . In particular, our aim is to underline the well-known theoretical foundations of regular languages (and therefore of DFAs) that characterize automata as a domain for abstracting the computation of program semantics in the abstract interpretation framework.
The exploited idea is that of approximating strings as regular languages represented by the minimum DFAs [davis1994] recognizing them. In general, we have more DFAs than regular languages, hence the domain of automata is indeed the quotient w.r.t. the equivalence relation induced by language equality: .
We abuse notation by representing equivalence classes in the domain w.r.t. by one of its automata (usually the minimum), i.e., when we write we mean .
The partial order induced by language inclusion is , which is well defined since automata in the same -equivalence class recognize the same language.
The corresponding least upper bound on the domain , corresponds to the standard union between automata:
It is the minimum automaton recognizing the union of the languages and . This is a well-defined notion since regular languages are closed under union.
As an example, consider Fig. 4, where the automaton in Fig. 3(c) is the least upper bound of and given in Fig. 3(a) and Fig. 3(b), respectively.
The (finite) greatest lower bound corresponds to automata intersection (since regular languages are closed under finite intersection):
is a sub-lattice but not a complete meet-sub-semilattice of .
In other words, there exists no Galois connection between and , i.e., there may exist no minimal automaton abstracting a language. (Note that some works [campeanu2002, domaratzki2001, mohri2001] have studied automatic procedures to compute, given an input language , the regular cover of [domaratzki2001], i.e., an automaton containing the language . Some of them [campeanu2002, domaratzki2001] studied regular covers guaranteeing that the automaton obtained is the best w.r.t. a minimal relation, but not minimum.) However, this is not a concern, since the relation between concrete semantics and abstract semantics can be weakened while still ensuring soundness [cousot1992]. A well-known example is the convex polyhedra domain [cousot1978].
The domain is an infinite domain, and it is not ACC (a domain is ACC if it does not contain infinite ascending chains). For instance, consider the set of languages forming an infinite ascending chain; then the set of the corresponding minimal automata trivially forms an infinite ascending chain on as well. This clearly implies that any computation on may fail to converge [cousot1992]. Most of the proposed abstract domains for strings [costantini2015, jsai2014, tajs2009, safe2012] trivially satisfy ACC, being finite, but they may lose precision during the abstract computation [cousot1992-2]. Domains that are not ACC must instead be equipped with a widening operator approximating the least upper bound, in order to force convergence (by necessarily losing precision) for any increasing chain [cousot1992-2]. As far as automata are concerned, existing widenings are defined in terms of a state equivalence relation merging states that recognize the same language up to a fixed length (set as a parameter for tuning the widening precision) [silva2006, DBLP:conf/cav/BartzisB04].
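The least upper bound and greatest lower bound on automata discussed above correspond to union and intersection of the recognized languages. A minimal sketch, assuming complete DFAs over a shared alphabet and using the textbook product construction, is given below. The tuple layout and the names `product_dfa` and `accepts` are illustrative, and the result is not minimized, so it is only one representative of its equivalence class.

```python
from itertools import product

def product_dfa(a, b, alphabet, mode):
    """Product of two complete DFAs given as (states, delta, start, finals).

    mode == "union" accepts when either component accepts;
    mode == "intersection" accepts when both components accept.
    """
    (qa, da, sa, fa), (qb, db, sb, fb) = a, b
    states = set(product(qa, qb))
    delta = {((p, q), c): (da[p, c], db[q, c])
             for (p, q) in states for c in alphabet}
    if mode == "union":
        finals = {(p, q) for (p, q) in states if p in fa or q in fb}
    elif mode == "intersection":
        finals = {(p, q) for (p, q) in states if p in fa and q in fb}
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return states, delta, (sa, sb), finals

def accepts(dfa, word):
    """Run a complete DFA on word and report acceptance."""
    _, delta, state, finals = dfa
    for c in word:
        state = delta[state, c]
    return state in finals
```

Both results have at most the product of the two state counts, which is why the operations stay within the domain of DFAs.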
3 An abstract domain for string manipulation
In this section, we discuss how to design an abstract domain for string manipulation that also deals with other primitive types, namely one able to combine different abstractions of different primitive types. In particular, since operations on strings also combine strings with other values (e.g., integers), an abstract domain for string analysis equipped with dynamic typing must include all the possible primitive values, i.e., the whole . The idea is to consider an abstract domain for each type of primitive value and to combine these abstract domains into a single abstract domain for . Consider, for each primitive value , an abstract domain (we denote the domain without bottom as ), equipped with an abstraction and a concretization forming a Galois insertion [cousot1977].
One way to merge domains is the coalesced sum [cousot1997]. The resulting domain contains all the non-bottom elements of the domains, together with a new top and a new bottom, respectively covering all the elements and covered by all the elements. In our case, if we consider the abstract domains , and , the coalesced sum is the abstraction of depicted in Fig. 5.
This is the simplest choice, but unfortunately this is not suitable for dynamic languages, and in particular for dealing with dynamic typing and implicit type conversion. The problem is that the type of variables is inferred at run-time and/or may change during execution. For example, consider the following fragment: . The value of the variable y is statically unknown hence, in order to guarantee soundness, we must take into account both the branches, meaning that x may be both a string and a boolean value, after the if statement. On the coalesced sum domain, the analysis would lose any precision w.r.t. collecting semantics by returning .
In order to catch union types, without losing too much precision, we need to complete [GRS00, GQ01, GM16] the above domain in order to observe collections of values of different value types.
In order to define this combination, let us consider a lifted union of sets, i.e., given and ( and arbitrary sets), we define the lifted union as . Hence, the complete abstract domain w.r.t. dynamic typing and implicit type conversion is:
, abstraction of .
In this new lifted union domain, the value of x after the if-execution is precisely , now an element of the domain.
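The difference between the two combinations can be illustrated with a small sketch. It makes strong simplifying assumptions: each per-type abstraction is a plain finite set of values instead of intervals or automata, the type tags `"str"` and `"bool"` and the function names are hypothetical, and the concrete values are made up for the example.

```python
TOP, BOTTOM = "TOP", "BOTTOM"

def join_coalesced(x, y):
    """Join in a coalesced sum: elements are BOTTOM, TOP or (type, value-set).

    Values of two different types have no common upper bound except TOP.
    """
    if x == BOTTOM:
        return y
    if y == BOTTOM:
        return x
    if x == TOP or y == TOP:
        return TOP
    (tx, vx), (ty, vy) = x, y
    return (tx, vx | vy) if tx == ty else TOP

def join_lifted(x, y):
    """Join in a lifted union: elements map each type to its own abstraction,
    so the join is computed independently for every type."""
    return {t: x.get(t, frozenset()) | y.get(t, frozenset())
            for t in x.keys() | y.keys()}
```

When one branch assigns a string and the other a boolean, the coalesced join collapses to the top element, while the lifted join keeps one component per type.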
In the following, we consider the abstract domain for string analysis obtained as lifted union of the following abstractions: (the well-known abstract domain of intervals [cousot1977]), , .
4 The abstract semantics
In this section, we define the abstract semantics of the language on the abstract domain . In particular, we have to define the expressions abstract semantics , which is standard except for the string operations that will be explicitly provided by describing the algorithm for computing them. Let us first recall some important notions on regular languages, useful for the algorithms we will provide.
Definition 1 (Suffixes and prefixes[davis1994])
Let be a regular language. The suffixes of L are , and the prefixes of L are .
We can define the suffixes from a position, namely given , the set of suffixes from is . For instance, let , then .
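On finite languages, suffixes, prefixes and suffixes from a position can be enumerated directly. The sketch below assumes the standard definitions (a suffix of L is any y such that xy is in L for some x, and dually for prefixes) and uses Python sets of strings in place of regular languages; the function names are illustrative.

```python
def suffixes(lang):
    """All suffixes of the strings in lang (including the empty string)."""
    return {w[i:] for w in lang for i in range(len(w) + 1)}

def prefixes(lang):
    """All prefixes of the strings in lang (including the empty string)."""
    return {w[:i] for w in lang for i in range(len(w) + 1)}

def suffixes_from(lang, i):
    """Suffixes obtained by dropping exactly the first i characters.

    Strings shorter than i contribute nothing.
    """
    return {w[i:] for w in lang if i <= len(w)}
```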
Definition 2 (Left quotient[davis1994])
Let be regular languages. The left quotient of w.r.t is .
Definition 3 (Right quotient[davis1994])
Let be regular languages. The right quotient of w.r.t is .
For example, let and . The left quotient of w.r.t is . Let and . The right quotient of w.r.t is .
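Left and right quotients can likewise be sketched on finite languages. The convention assumed here is the usual one: the left quotient of a first language w.r.t. a second one strips from strings of the first a prefix belonging to the second, and the right quotient strips a suffix. The function names and the set-based encoding are illustrative.

```python
def left_quotient(l1, l2):
    """Strings y such that x + y is in l1 for some x in l2."""
    return {w[len(x):] for w in l1 for x in l2 if w.startswith(x)}

def right_quotient(l1, l2):
    """Strings x such that x + y is in l1 for some y in l2."""
    # len(w) - len(y) is used instead of a negative index so that the
    # empty string in l2 leaves w unchanged
    return {w[:len(w) - len(y)] for w in l1 for y in l2 if w.endswith(y)}
```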
Definition 4 (Substrings/Factors[bordihn09])
Let be a regular language. The set of its substrings/factors is .
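A factor is any contiguous piece of a string of the language, so on a finite language the factors can be enumerated by brute force. The sketch below also checks, under the standard definitions, that factors coincide with prefixes of suffixes; all names are illustrative.

```python
def factors(lang):
    """All substrings (factors) of the strings in lang."""
    return {w[i:j] for w in lang
            for i in range(len(w) + 1)
            for j in range(i, len(w) + 1)}

def prefixes(lang):
    return {w[:i] for w in lang for i in range(len(w) + 1)}

def suffixes(lang):
    return {w[i:] for w in lang for i in range(len(w) + 1)}

def factors_via_composition(lang):
    """Same set, obtained as the prefixes of the suffixes."""
    return prefixes(suffixes(lang))
```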
These operations are all defined as transformations of regular languages. In [davis1994] the corresponding algorithms on FA are provided. In particular, let and , then , , , , and are the algorithms corresponding to the transformations , , , , and , respectively. Namely, , , the following facts hold:
As far as (state) complexity is concerned [YuZS94], prefix and right quotient operations have linear complexity, while suffix, left quotient and factor operations are, in general, exponential [YuZS94, pribavkina2010].
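To give an intuition for why the prefix operation is cheap on a DFA, one common construction keeps states and transitions unchanged and marks as final every state from which a final state is reachable. The sketch below is an assumption about how this can be done, not necessarily the algorithm of [davis1994]; the encoding and the name `prefix_dfa` are hypothetical.

```python
def prefix_dfa(states, delta, start, finals):
    """DFA recognizing the prefixes of the language of the input DFA.

    delta maps (state, symbol) to a state and may be partial. Only the set
    of final states changes, so the number of states does not grow.
    """
    # Reverse adjacency: predecessors of each state
    preds = {}
    for (p, _symbol), q in delta.items():
        preds.setdefault(q, set()).add(p)
    # Backward reachability from the final states
    live, stack = set(finals), list(finals)
    while stack:
        q = stack.pop()
        for p in preds.get(q, ()):
            if p not in live:
                live.add(p)
                stack.append(p)
    return states, delta, start, live
```

A word is a prefix of an accepted string exactly when the state it reaches can still reach a final state, which is what the backward search computes.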
4.1 Abstract semantics of substring.
In this section, we define the abstract semantics of substring, i.e., we define the operator SS, which takes an automaton, an interval of initial indexes and an interval of final indexes for substrings, and computes the automaton recognizing the set of all substrings of the input automaton's language between the indexes in the two intervals. Since the abstract semantics has to take into account the swaps that occur when the initial index is greater than the final one, several cases arise in handling (potentially unbounded) intervals. Tab. 1 reports the abstract semantics of SS when (hence ). The definition of this semantics is by recursion with four base cases (the other cases are recursive calls splitting and rewriting the input intervals in order to match, or to get closer to, the base cases), for which we describe the algorithmic characterization. Consider and , (for the sake of readability we denote by the automata least upper bound , and by the greatest lower bound ), the base cases are
If (first row, first column of Tab. 1) we have to compute the language of all the substrings between an initial index in and a final index in , i.e., . For example, let , the set of its substrings from 1 to 3 is . The automaton accepting this language is computed by the operator
When both intervals correspond to , the result is the automaton of all possible factors of A (last row, last column), i.e., ;
If is defined and the interval of final indexes is unbounded, i.e., (first row, third column), we have to compute the automaton recognizing , i.e., all the strings between a finite interval of initial indexes and an unbounded final index. The automaton accepting this language is computed by
The abstract semantics returns the least upper bound of all the automata of substrings from in to an unbounded index greater than or equal to ;
When both intervals are unbounded ( and , third row, third column of Tab. 1), we split the language to accept. In particular, we compute the substrings between and (falling down into the previous case), and the automaton recognizing the language of all substrings with both initial and final index any value greater than , i.e., the language . This latter set is computed by the algorithm
Theorem 4.1 (Termination of )
For each , performs at most three recursive calls, before reaching a base case.
Theorem 4.2 (Soundness and completeness of )
Given , then .
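As a point of reference for the statement above, the set of substrings that the abstract operator is expected to capture can be enumerated by brute force on a finite language and finite index intervals. This is only a specification-style sketch: it assumes the JavaScript-like concrete substring (negative indexes as zero, swap of indexes), works on sets of strings instead of automata, does not cover unbounded intervals, and uses hypothetical names.

```python
def substring(s, i, j):
    """Concrete substring, assumed JavaScript-like."""
    i, j = max(i, 0), max(j, 0)
    if i > j:
        i, j = j, i
    return s[i:j]

def substrings_between(lang, initial, final):
    """All substrings of strings in lang with an initial index in the closed
    interval initial = (i, j) and a final index in final = (l, k)."""
    (i, j), (l, k) = initial, final
    return {substring(w, a, b)
            for w in lang
            for a in range(i, j + 1)
            for b in range(l, k + 1)}
```

An automaton-based implementation can be tested against this enumeration on small inputs.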
4.2 Abstract semantics of charAt
The abstract semantics of charAt should return the automaton accepting the language of all the characters of strings accepted by an automaton A, in a position inside a given interval : This is computed by
We call (defined before) when the interval index is finite. In the last two cases, we use the function , returning the set of characters read in any transition of an automaton. When , we return the characters starting from together with while, when , we simply return the characters of the automaton together with .
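The two ingredients mentioned above can be sketched as follows: the characters found at positions within a finite interval, and the set of characters read on any transition of an automaton. The sketch works on a finite set of strings for the first function and on a dictionary-encoded transition function for the second; it assumes that an out-of-range position yields the empty string, and all names are hypothetical.

```python
def chars_at(lang, lo, hi):
    """Characters of strings in lang at a position in the interval [lo, hi].

    Positions outside a string contribute the empty string (assumption).
    """
    return {w[i] if 0 <= i < len(w) else ""
            for w in lang
            for i in range(lo, hi + 1)}

def chars(delta):
    """Characters read by any transition; delta maps (state, symbol) to a state."""
    return {symbol for (_state, symbol) in delta}
```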
Theorem 4.3 (Soundness and completeness of )
4.3 Abstract semantics of length
The abstract semantics of length should return the interval of all the possible string lengths in an automaton, i.e., it is computed by Alg. 1, where return the minimum and the maximum paths between two states of the input automaton, respectively [rivestbook]. returns the size of a path, and checks whether the automaton contains cycles [rivestbook].
The idea is to compute the minimum and the maximum path reaching each final state in the automaton (in Fig. 5(a), we obtain and ). Then, we abstract the set of lengths obtained so far into intervals (in the example, ). Problems arise when the automaton contains cycles. In this case, we simply return the unbounded interval ranging from the length of the minimum path to a final state up to . For example, in the automaton in Fig. 5(b), the length interval is .
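The idea described above can be sketched on a dictionary-encoded DFA. This is an illustrative reconstruction, not Alg. 1 itself: it first restricts the automaton to states that are both reachable and able to reach a final state (an assumption, so that cycles on useless states do not matter), then uses a breadth-first search for the shortest accepted length and a depth-first search with cycle detection for the longest one. The name `length_interval` is hypothetical.

```python
import math
from collections import deque

def length_interval(delta, start, finals):
    """Return (min, max) accepted length, max = math.inf if a cycle exists,
    or None when the language is empty. delta maps (state, symbol) to a state."""
    finals = set(finals)
    succ, pred = {}, {}
    for (p, _symbol), q in delta.items():
        succ.setdefault(p, set()).add(q)
        pred.setdefault(q, set()).add(p)

    def reachable(sources, edges):
        seen, todo = set(sources), list(sources)
        while todo:
            for nxt in edges.get(todo.pop(), ()):
                if nxt not in seen:
                    seen.add(nxt)
                    todo.append(nxt)
        return seen

    # States lying on some accepting path
    useful = reachable({start}, succ) & reachable(finals, pred)
    if start not in useful:
        return None

    # Shortest path from the start to each useful state (breadth-first)
    dist, queue = {start: 0}, deque([start])
    while queue:
        p = queue.popleft()
        for q in succ.get(p, set()) & useful:
            if q not in dist:
                dist[q] = dist[p] + 1
                queue.append(q)
    lo = min(dist[f] for f in finals if f in useful)

    # Longest path to a final state; revisiting an active state means a cycle
    longest, active = {}, set()

    def visit(p):
        if p in active:
            return math.inf
        if p not in longest:
            active.add(p)
            best = 0 if p in finals else -math.inf
            for q in succ.get(p, set()) & useful:
                best = max(best, 1 + visit(q))
            active.discard(p)
            longest[p] = best
        return longest[p]

    return lo, visit(start)
```

With a cycle on a useful state the upper bound is infinite even though only some lengths may actually occur, which is consistent with the interval being an over-approximation of the set of lengths.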
is sound but not complete: .