Improving Dynamic Code Analysis by Code Abstraction

09/07/2021 ∙ by Isabella Mastroeni, et al. ∙ University of Verona, Università Ca' Foscari Venezia

In this paper, our aim is to propose a model for code abstraction, based on abstract interpretation, allowing us to improve the precision of a recently proposed static analysis by abstract interpretation of dynamic languages. The problem we tackle here is that the analysis may add some spurious code to the string-to-execute abstract value and this code may need some abstract representations in order to make it analyzable. This is precisely what we propose here, where we drive the code abstraction by the analysis we have to perform.

1 Introduction

The possibility of dynamically building code instructions as the result of text manipulation is a key aspect in dynamic programming languages. In this scenario, programs can turn text, which can be built at run-time, into executable code [RichardsHBV11]. These features are often used in code protection and tamper-resistant applications, employing camouflage for escaping attack or detection [DMavrogiannopoulosKP11], in malware, in mobile code, in web servers, in code compression, and in code optimization, e.g., in Just-in-Time (JIT) compilers, employing optimized run-time code generation.
While the use of dynamic code generation may simplify considerably the art and performance of programming, this practice is also highly dangerous, making the code prone to unexpected behaviors and malicious exploits of its dynamic vulnerabilities, such as code/object-injection attacks for privilege escalation, database corruption, and malware propagation. It is clear that more advanced and secure functionalities based on string-to-code statements could be permitted if we better master how to safely generate, analyze, debug, and deploy programs that dynamically generate and manipulate code.

There are many good reasons to analyze programs that build strings later executed as code. An interesting example is code obfuscation. Recently, several techniques have been proposed for JavaScript code obfuscation (see, e.g., https://www.daftlogic.com/projects-online-javascript-obfuscator.htm, http://www.danstools.com/javascript-obfuscate/, http://javascript2img.com/, https://javascriptobfuscator.herokuapp.com/, https://javascriptobfuscator.com/), meaning that client-side code protection is also becoming an increasingly important problem for both the research community and practitioners. Hence, it is not always possible to simply ignore eval without giving up the analysis of the rest of the program [tops20].

The Context: Analyzing Dynamic Code.

A major problem in the presence of dynamic code generation is that static analysis becomes extremely hard, if not impossible. This happens because program data structures, such as the control-flow graph and the system of recursive equations associated with the program in question, are themselves dynamically mutating objects. Recently [tops20], the problem of analyzing dynamic code has been tackled by treating code as any other dynamic structure that can be statically analyzed by abstract interpretation, and by treating the abstract interpreter as any other program function that can be recursively called. In [tops20], we provide a static analyzer architecture for a core dynamic language containing non-removable eval statements; it still has some limitations in terms of precision, but it provides the necessary ground for studying more precise solutions to the problem. In particular,

  • We have designed an automata-based string abstract domain [mdpi] for analyzing string values during execution. Automata (FA) provide the perfect choice for abstracting strings that may be executed by eval since they allow us to over-approximate the set of possible values of string variables by keeping enough information for both analyzing properties of string variables that are never executed by an eval during computation and for extracting the potential executable sub-language.

  • In order to statically analyze the code potentially executed by an eval, we have designed a systematic process for extracting from the (abstract) argument of eval (i.e., from the FA collection of its potential arguments) an over-approximation of executable code that this collection contains. Clearly, this approximation must keep a form that the analyzer can interpret.

  • We designed a static analyzer for dynamic languages performing a recursive call of the interpreter on the (over-approximated) code that eval may execute.

The Problem: Improve Precision Analysis by Abstracting Code.

This analysis provides a first step towards the analysis of dynamic languages, but it still suffers from some important precision loss [tops20]. In particular, there are forms of FA (which occur when the string is dynamically generated by loops) that prevent the generation of a control flow graph (CFG) able to approximate the code executed by an eval. For instance, when the FA accepts a language such as $\{\texttt{x=}5^n\texttt{;} \mid n > 0\}$, the analysis in [tops20] cannot extract, from the FA, the CFG approximating the eval argument. In order to better explain the problem, consider the code in Fig. 1, where the value of i is statically unknown. In Fig. 1, we also draw the automaton representing the abstract value of str before the eval execution. The problem is that this automaton has a cycle not involving a whole statement [tops20]. This situation makes the analyzer unable to build a CFG over-approximating the code potentially executed since, intuitively, such a CFG should be infinite. Indeed, only an infinite CFG could capture all the possible assignments described by the FA, namely all the assignments to the variable x of any possible number formed only by the digit 5 (i.e., x=5;, x=55;, x=555;, and so on).

str = "x=5";
while (i < 3) {
 str = str + "5";
 i = i + 1;
}
str = str + ";"; eval(str);

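Figure 1: The code above and the automaton $A$ for str (it reads x, =, 5, then cycles on 5 before reading ;), s.t. $\mathcal{L}(A) = \{\texttt{x=}5^n\texttt{;} \mid n > 0\}$, where $5^n$ means 5 repeated $n$ times.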

To overcome this limitation, at least for a set of potential eval patterns, we propose to define a form of abstract CFG able to finitely represent a potentially infinite set of CFGs, e.g., we look for a single CFG representing every assignment x=5;, x=55;, x=555;, and so on.
Unfortunately, this is not as easy as it may seem, since this abstract code representation has to be built in such a way that the analyzer is still able to interpret it.
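To make the example of Fig. 1 concrete, the following minimal Python sketch enumerates the strings that the loop can build for a few iteration counts and checks them against a regular expression playing the role of the automaton. The function name `built_string` and the regular expression are illustrative assumptions, not part of the analysis in [tops20].

```python
import re

def built_string(iterations: int) -> str:
    """Mimic the program of Fig. 1 for a given number of loop iterations."""
    s = "x=5"
    for _ in range(iterations):
        s = s + "5"
    return s + ";"

# Regular expression standing in for the automaton of Fig. 1 (an assumption):
# it reads 'x', '=', one '5', then cycles on '5' before reading ';'.
AUTOMATON = re.compile(r"x=55*;")

# Strings built by 0, 1, 2 and 3 iterations of the loop.
samples = [built_string(k) for k in range(4)]

# Every concrete string is accepted by the over-approximating language.
all_accepted = all(AUTOMATON.fullmatch(s) is not None for s in samples)
```

Each iteration count yields a different assignment, which is why a CFG with one edge per concrete statement cannot be finite when the bound on the loop is unknown.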

Contribution.

The main contributions for tackling the problem above are:

  • We first define the notion of abstract CFG, based on the idea of making it possible to still perform a given analysis. The idea is to leave the control structure unchanged while approximating the edge labels (the statements to execute) by sets of labels, i.e., those sharing a fixed abstract property.

  • We show how completeness of code abstraction w.r.t. the semantic observation models the possibility, for the static analyzer, of interpreting also the abstract code, and we show how we can make any code abstraction complete.

  • We provide a systematic approach, based on the one proposed in [tops20], allowing us to analyze also the eval patterns described above, for which, instead, the analysis in [tops20] loses precision.

2 The Core Language: Imp

The language is quite standard (see Fig. 2), and each statement is annotated with a label (not part of the syntax) corresponding to the statement's program point. We assume that there exists a function that, given a well-formed program, labels it with a fresh label for each program point. In Fig. 2, the semantic value corresponding to the syntactic symbol n is written with a distinct notation.

Figure 2: Syntax of Imp

In order to analyze a program, we need to model it by building the corresponding control flow graph [compiler-design] (CFG for short), which embeds the control structure in the graph structure and leaves on the edges (or, equivalently, on the nodes) only the access to states, i.e., the manipulation of states (assignments) and guards. The approach is quite standard, and we follow [compiler-design] for the construction of the CFG. For technical details see [tops20]; here we show the construction on the example in Fig. 3, where each node corresponds to a program point.

 x := 0;
 while (x<5)
  {x := x + 1};
 x:=7

Figure 3: Example of CFG: (a) fragment of code and (b) corresponding CFG.

Note that, by construction [tops20], the language of the edge labels is an intermediate language slightly different from the Imp grammar. An edge label is either a primitive statement (i.e., an assignment or an eval) or a boolean guard; in the following, we refer to the set of such labels as the label language.

Concrete Semantics.

The concrete semantics of our language Imp is intuitive and is fully reported in [arceri2018]. Since our aim is to analyze Imp programs by analyzing their CFGs, we focus here only on the interpretation of the CFG's labels [compiler-design]. In particular, we have to specify the semantics associated with each possible edge of the CFG. In other words, we have to formalize how each statement transforms the current state, which is represented as a store, namely as an association between identifiers and values. It is well known that static program analysis works by computing an (abstract) collecting semantics: for each program point and for each variable x, it computes the set of values that x can have, in any computation, at that program point. Hence, we define (collecting) memories $m$, associating with each variable a set of values. The basic values of Imp are integers, booleans and strings, hence we define the set of memories as $\mathbb{M} = \mathit{Id} \to \wp(\mathbb{Z} \cup \mathit{Bool} \cup \Sigma^*)$, ranged over by the meta-variable $m$, where $\mathit{Bool} = \{\mathit{true}, \mathit{false}\}$. Let us denote by $\mathbb{V}$ this domain of collections of values, i.e., $\mathbb{V} = \wp(\mathbb{Z} \cup \mathit{Bool} \cup \Sigma^*)$. The update of a memory $m$ for a variable x with a set of values $v$ is denoted $m[\mathtt{x} \mapsto v]$. The partial order between memories is defined point-wise, i.e., $m_1 \sqsubseteq m_2$ iff $m_1(\mathtt{x}) \subseteq m_2(\mathtt{x})$ for every variable x. Finally, lub and glb of memories are computed point-wise, i.e., $(m_1 \sqcup m_2)(\mathtt{x}) = m_1(\mathtt{x}) \cup m_2(\mathtt{x})$ and $(m_1 \sqcap m_2)(\mathtt{x}) = m_1(\mathtt{x}) \cap m_2(\mathtt{x})$.
The collecting (input/output) semantics of statements maps a statement and an input memory to an output memory. The collecting semantics of expressions is defined as the additive lift, to sets of memories, of the standard expression semantics. Here, given a generic function $f : X \to Y$, by additive lift we mean its extension to sets of elements, i.e., $f(S) = \{f(x) \mid x \in S\}$ for $S \subseteq X$. We abuse notation by using the same symbol for the semantics of a statement and for its additive lift to sets of statements.

where is the intersection in the set of Imp programs. By computing the traces of application of this transfer function, starting from any possible input memory, we precisely compute the maximal trace semantics [mine2013].
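The point-wise structure of collecting memories described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: memories are dictionaries from variable names to frozen sets of values, a missing variable is read as the empty set, booleans are left out (in Python `True == 1`, so they would collapse with integers inside a set), and the names `update`, `leq`, `lub` and `glb` are hypothetical.

```python
from typing import Dict, FrozenSet, Iterable, Union

Value = Union[int, str]                  # booleans omitted on purpose (see lead-in)
Memory = Dict[str, FrozenSet[Value]]     # variable name -> set of possible values
EMPTY: FrozenSet[Value] = frozenset()

def update(m: Memory, x: str, values: Iterable[Value]) -> Memory:
    """Return a copy of m where variable x is bound to the given set of values."""
    new_m = dict(m)
    new_m[x] = frozenset(values)
    return new_m

def leq(m1: Memory, m2: Memory) -> bool:
    """Point-wise inclusion: m1 is below m2 if every set in m1 is included in m2's."""
    return all(vals <= m2.get(x, EMPTY) for x, vals in m1.items())

def lub(m1: Memory, m2: Memory) -> Memory:
    """Point-wise union of two memories."""
    return {x: m1.get(x, EMPTY) | m2.get(x, EMPTY) for x in m1.keys() | m2.keys()}

def glb(m1: Memory, m2: Memory) -> Memory:
    """Point-wise intersection of two memories."""
    return {x: m1.get(x, EMPTY) & m2.get(x, EMPTY) for x in m1.keys() | m2.keys()}

m1 = update({}, "x", {1, 2})
m2 = update(update({}, "x", {2, 3}), "s", {"a"})
```

With these two memories, `lub(m1, m2)` collects every value seen for each variable, while `glb(m1, m2)` keeps only the values on which both memories agree.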

Static Analysis on CFGs: Semantic Abstraction.

It is well known that, when we perform static analysis on a CFG, we interpret all the edges, and more specifically all their labels, on the corresponding abstract domain [compiler-design]. This is a quite standard approach, but we recall it here to fix the notation. We abstract values on the coalesced sum [arceri2018] of the abstract domain for integers, of the concrete domain for booleans and of the (deterministic) finite state automata abstract domain for strings [arceri2018] (a string static analyzer using the finite state automata abstract domain has been developed and is available in [arceri2018]). Let us consider an abstraction of the values manipulated by our language; for the sake of simplicity, we abuse notation by considering a unique abstraction, which is indeed the coalesced sum of three abstractions, one for integers, one for booleans and one for strings. Abstract memories are the (collecting) memories where sets of values are abstracted by this abstraction. In the following, we abuse notation by applying the abstraction to memories, point-wise on each variable. For the sake of simplicity of presentation and implementation, we consider here non-relational abstractions of data; we believe, however, that our work can be easily extended to relational abstractions. In this way, we can see abstract memories as sets of concrete memories, and therefore as particular collecting memories. Finally, we can define the abstract edge effect [compiler-design] telling us how to abstractly interpret each edge of the CFG:

The semantics of a path in the CFG is the composition of the interpretations of its edges, and the interpretation of an edge is the interpretation, given above, of its label [compiler-design].

This is clearly what happens when the CFG is not abstracted, namely when the edge labels are single statements. Finally, since we deal with potentially abstract CFGs, we have to say how we execute them, potentially on an abstract semantics. The idea is simple: since we move from executing single statements to executing sets of statements, we take as execution of the abstract CFG the additive lift of the executions of the single statements. Since the semantics is always additive (a function is said to be additive if it commutes with least upper bounds), the semantic abstraction must also be additive in order to guarantee that everything works. Hence, in the rest of the paper we always require the semantic abstraction to be additive.

3 Semantic-driven Code Abstraction

In this section, we study how to model a syntactic abstraction of the CFG and how it relates to the semantic abstraction, i.e., to the code analysis.

Modeling Code Abstraction.

Following the standard approach for abstracting objects, we should abstract each CFG to a set of CFGs sharing an invariant property, i.e., to an equivalence class of CFGs. In particular, since we aim at abstracting code (CFGs) without changing the analysis performed on the code, we choose to abstract a CFG by abstracting its edge labels, leaving its control structure unchanged. In other words, an abstract CFG is a pair of nodes and edges where the nodes are left unchanged, while the edge labels are abstracted to sets of labels. Formally, a code abstraction is an abstraction of sets of labels of the label language.

Given a code abstraction, the corresponding abstract CFG is built from a CFG by replacing each edge label with its abstraction.

Figure 4: CFG abstracted by signs.

As an example, consider the CFG in Fig. 3: in Fig. 4 we have the abstract CFG where numerical expressions are abstracted by the well-known sign abstraction. For instance, in x:=x+1 the constant 1 is abstracted to the set of positive integers, since 1 is positive.
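As a rough illustration of abstracting edge labels by signs, the sketch below rewrites every integer literal of a label into the name of its sign, so that all labels sharing the same abstract form can be grouped together. It rests on assumptions not stated in the paper: labels are plain strings, literals are unsigned decimal numbers, and the names `Pos`, `Zero`, `Neg`, `abstract_label` and `same_abstract_label` are hypothetical.

```python
import re

def sign(n: int) -> str:
    """Name of the sign of an integer (hypothetical abstract values)."""
    if n > 0:
        return "Pos"
    if n < 0:
        return "Neg"
    return "Zero"

# An integer literal: a run of digits not preceded by an identifier character,
# so that the digit inside a variable name such as x1 is left untouched.
_LITERAL = re.compile(r"(?<![A-Za-z_0-9])\d+")

def abstract_label(label: str) -> str:
    """Replace every integer literal in an edge label with the name of its sign."""
    return _LITERAL.sub(lambda m: sign(int(m.group())), label)

def same_abstract_label(l1: str, l2: str) -> bool:
    """Two labels are merged by the abstraction when their abstract forms coincide."""
    return abstract_label(l1) == abstract_label(l2)
```

Under these assumptions, `abstract_label("x:=x+1")` and `abstract_label("x:=x+42")` produce the same abstract label, which reflects the idea that an abstract edge stands for a set of concrete statements.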

Abstracting Code vs Abstracting Semantics.

As previously noted, we aim at characterizing code abstractions, for dynamically generated code, for which the given analysis works precisely. Formally, let us consider the following equation:

(1)

If this equality does not hold, the abstract semantic interpretation merges predicates that the code abstraction distinguishes. Namely, when the program is observed by means of its (abstract) semantics, the actual abstraction of predicates is not precisely the code abstraction, but is affected in some way by the semantic abstraction. Changing the point of view, in this case the analysis cannot precisely interpret the abstract code, since the code abstraction distinguishes information that the semantic abstraction cannot distinguish.
As an example, consider the sign domain above, when the equation does not hold since the concrete semantics of this set does not take any positive value for x. While, if , then Eq. 1 holds since its concrete semantics is precisely the set of non-negative values. It is worth noting that Eq. 1 is a forward completeness [GQ01] of the code abstraction w.r.t. the semantic interpretation, meaning that the semantic abstraction does not add imprecision to the code one.
In order to investigate the relation between the code abstraction and the semantic abstraction, we observe that, whenever we have a semantic abstraction, we have a natural code abstraction induced by it. Namely, by only observing (abstract) information about the computation, we cannot distinguish statements with the same (abstract) semantics, independently of what any possible code abstraction does. For instance, if we analyze the parity of program variables, we are unable to distinguish x:=2 from x:=4, independently of how a potential code abstraction is defined on x:=2. The first step consists in defining a code abstraction for expressions in terms of the semantic one; we define it inductively on the structure of expressions.

At this point, we can characterize the label abstraction as the additive lift of the function

where eval is treated as the implicit representation of all the statements that it can execute, namely it represents a (potentially infinite) set of statements.
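The parity example mentioned above can be sketched as follows: statements with the same abstract effect fall into the same class of the induced code abstraction. The sketch is restricted, by assumption, to constant assignments represented as pairs (variable, integer); the names `abstract_effect` and `induced_classes` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Assignment = Tuple[str, int]             # (variable, constant), e.g. ("x", 2) for x:=2

def parity(n: int) -> str:
    """Parity abstraction of an integer."""
    return "even" if n % 2 == 0 else "odd"

def abstract_effect(stmt: Assignment) -> Tuple[str, str]:
    """Abstract semantics of a constant assignment: the variable and the parity stored."""
    var, n = stmt
    return (var, parity(n))

def induced_classes(stmts: Iterable[Assignment]) -> Dict[Tuple[str, str], Set[Assignment]]:
    """Group statements that the parity analysis cannot tell apart."""
    classes: Dict[Tuple[str, str], Set[Assignment]] = defaultdict(set)
    for s in stmts:
        classes[abstract_effect(s)].add(s)
    return dict(classes)

classes = induced_classes([("x", 2), ("x", 4), ("x", 3)])
```

Here x:=2 and x:=4 end up in the same class because both store an even value in x, while x:=3 sits in a class of its own.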

The following result is immediate by construction.

Proposition 3.1

Given , then and it is additive.

Finally, in order to show that this code abstraction can be used to force satisfiability of Eq. 1, we have first to characterize the meaning of interpreting an edge label abstracted by :

Then we have the following results

Lemma 3.2

Given additive, then (trivially implying ) and .

Proof.  Let us prove first the property for expressions by induction on the syntactic structure of e.

  • : , while (where );

  • : , while (since );

  • : Suppose op any arithmetic or boolean operator.
    by inductive hypothesis. But this is precisely since op is computed on the semantics as additive lift to sets.

  • Analogously, we can prove all the other cases.

Now, let us prove the fact for single edge labels, again by induction on the syntactic structure. Note that, being additive then also is additive, being also the concrete semantics additive on sets of statements.

Finally, for each set of labels , we have that , since all the involved functions are additive by definition or by construction.

Then we have that:

Theorem 3.3

Let additive, and . Then satisfies Eq. 1.

Proof.  It is worth noting that, we trivially have by abstraction that . Let us prove the other implication:


This result tells us that, by taking a code abstraction more abstract than (or equal to) the one induced by the semantic abstraction, we guarantee that the abstract interpretation can be performed on the abstracted program (Eq. 1). We have so far proved that it is always possible to force Eq. 1, so that the analysis can continue (observing the same semantic abstraction) on the abstracted code as well. In the following, we show how this framework can be integrated with the existing analysis of dynamic code [tops20] in order to improve its precision.

4 An Improved Dynamic Code Analysis

In this section we show how the constructive code abstraction characterization, provided in the previous section, can be used for representing the code approximation which soundly captures the potential code executed by a string-to-code statement. As we will show, without abstracting code, we cannot capture situations where the collecting semantics on strings generates sets of statements that cannot be represented by using the concrete syntax. Nevertheless, we must also observe that the analyzer cannot change dynamically with the generated code, hence the abstraction must be driven by the semantic property analyzed. This means that, without using the proposed framework, the analysis would surely be less precise in those situations where code abstraction becomes a necessity.

Let us summarize how we propose to exploit the framework:

  • Consider a fixed semantic abstraction and a corresponding static analyzer, designed in such a way that it can also interpret code abstracted by the code abstraction induced by the semantic one.

  • Analyze the program and, when an eval is met, extract the language of its argument. If the language is infinite (under specific conditions that we will discuss), build the abstract CFG approximating it and extract the corresponding code abstraction. In general, this code abstraction is not more abstract than the code abstraction already embedded in the static analyzer, which depends only on the semantic abstraction.

  • Combine the two code abstractions in order to make the generated code (approximated by the generated abstract CFG) analyzable by the static analysis for the fixed semantic abstraction.

Analyzing Dynamic Code.

Let us consider a static analysis performing, in particular, a given abstraction on strings over a finite alphabet. Note that our analyzer has to work on any (abstract) CFG that can be dynamically generated, hence it has to be designed with this purpose in mind. In particular, as we will show, we generate only abstract CFGs whose code abstraction is complete w.r.t. the semantic abstraction. This means, by construction, that the code abstraction must be more abstract than the one induced by the semantic abstraction, i.e., each set of elements in the former corresponds to a subset of the elements (abstract predicates) of the latter. Hence, in order to interpret the predicates of any complete code abstraction, it is sufficient to design the analyzer so that it soundly interprets any abstract predicate of the induced code abstraction. For instance, for the sign abstraction, the induced code abstraction contains all the predicates, involving integers, of the form x:=S, x<S, etc., where S is a sign, and the analyzer for signs should be able to interpret such abstract predicates as well.
Let x be the input string parameter of an eval statement; the analysis computes an abstract value for x, which depends on the string abstraction in use: for instance, the bounded string set abstract domain [amadini18] and the prefix abstract domain [costantini15] generally yield different abstract values for the same collection of strings. When the abstracted string and the abstraction are clear from the context, we identify this abstract value with the set of strings it represents, and we assume (for the sake of simplicity) that any string in this set is an executable language statement. Note that this assumption corresponds to a decidable condition, hence it is possible to check it and to implement ad hoc solutions when it does not hold. In the following, we abuse notation by using the same symbol also for the automaton recognizing the language.
Consider, for example, the program reported in Fig. 5(a), which builds and manipulates the string str at run-time; str is afterwards interpreted as executable code, being the input parameter of the string-to-code statement eval. Since the value of N is unknown at compile-time, we cannot predict the precise number of iterations of the while-loop. In this case, a suitable string abstract analysis would approximate the value of str, before the eval execution, to an abstract value corresponding to an over-approximation of the possible values of str, which may also be, due to abstraction, an infinite set of strings, and therefore an infinite set of possible programs.

   str := "x:=5"; i := 0;
   while (i < N) {
    str := str + "5";
    i:=i+1;
   }
   str := "if(x<5){"  + str
     + "}else{x:=1};";
   eval(str)

Figure 5: (a) Sample of dynamically generated code. (b) CFG associated with str, labeled with abstract expressions.

For instance, in the example, if we abstract strings into the regular expression abstract domain [choi2006] (or, equivalently, into the finite state automata abstract domain [arceri2018]), the value of str after the while loop is the abstract value corresponding to an infinite set of programs, i.e., x:=5, x:=55, x:=555, and so on. In this case, the common practice for analyzing eval is simply to give up on the analysis, for example by halting it with an exception [jensen2012], or to forbid the usage of eval [jsai].

Let us consider the abstract domain for all the possible values (integers, strings and booleans) [tops20]. Note that it contains, for integers, predicates like the ones in the abstract CFG in Fig. 4.
Due to widening, the analysis abstracts the value of str after the while loop to an infinite language, namely x is assigned any value represented by a finite, non-empty sequence of 5s. Widening is a fix-point accelerator used in domains with infinite ascending chains, namely where the semantic fix-point computation may diverge; here we use the widening on automata defined in [choi2006], applied in the analysis of the while loop [arceri2018]. Hence, before the eval, the analysis abstracts str to a set of strings in which the true-branch of the code that may be executed by eval may be x:=5, or x:=55, or x:=555, and so on. The automaton corresponding to the abstract value of str is reported in Fig. 6, and it denotes an infinite language, i.e., an infinite set of possible statements. Unfortunately, this is a problem for the analysis provided in [tops20], which would return the language containing all the possible strings, losing all precision.

Figure 6: Finite state automaton corresponding to the abstract value of str.

Generating the Code: From Automata to CFGs.

At this point, we have the (potentially infinite) language of the eval argument (and hence an automaton recognizing it), and the goal is to generate a CFG modeling an over-approximation of the executable code contained in the language of the automaton. The idea is to generate a CFG from a language of strings, i.e., from an automaton, by parsing the paths of the automaton. Indeed, we have defined and implemented an algorithm, reported in Alg. 1, performing an abstract parsing of automata: given an automaton, it returns the CFG that over-approximates, for each (executable) string accepted by the automaton, the concrete execution of eval. For space limitations, in the following we only discuss the main parts of the algorithm.

The idea of Alg. 1 is to perform a depth-first search on the automaton and, when a language statement is recognized, to generate an edge in the CFG. This phase is handled by lines 3-13 of Alg. 1, which build the set of nodes Nodes and the set of edges Edges of the resulting CFG. A worklist contains the states of the finite state automaton for which we still have to generate edges in the CFG, and it is initialized, at line 2, with the initial state. At this point, Alg. 1 looks for the language statements readable on any path of the input automaton starting from a state q taken from the worklist, by means of a dedicated module (line 5). In particular, this module returns a set of triples, where each returned triple (q, c, q') means that a language statement c has been recognized from q to q'.

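Input: an automaton
Output: CFG over-approximating the executable strings of the automaton
1 remove the cycles of the input automaton;
2 initialize Nodes, Edges and the worklist (with the initial state);
3 while the worklist is not empty do
4       select and remove a state q from the worklist;
5       compute the triples (q, c, q') of statements readable from q;
6       foreach triple (q, c, q') do
7-11         map q and q' to their labels, add them to Nodes, add the edge labeled c to Edges, and add q' to the worklist;
12       end foreach
13 end while
14 return the CFG (Nodes, Edges);
Algorithm 1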

The set returned by this module corresponds to the set of statements readable from the state q, hence they are added to Edges, substituting the reached states with the corresponding labels (lines 7-8). At this point, we need to look for the statements that can be read from q', hence q' is added to the worklist in order to be processed at a later iteration of the while loop at lines 3-13. When there are no more states to be processed, namely when the worklist is empty, the CFG is returned (line 14), with the entry label associated with the initial state and the exit labels associated with the final states.
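The worklist phase described above can be sketched in Python as follows. This is a simplified illustration, not the implementation of Alg. 1: it assumes an acyclic automaton given as a dictionary from states to outgoing (symbol, state) pairs, it recognizes a statement simply as the text read up to the first `;`, and it uses automaton states directly as CFG nodes. The names `statements_from` and `build_cfg` are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Automaton = Dict[int, List[Tuple[str, int]]]   # state -> [(symbol, next state)]
Triple = Tuple[int, str, int]                  # (source state, statement, reached state)

def statements_from(aut: Automaton, q: int) -> Set[Triple]:
    """Statements readable from q, assuming each one ends at the first ';'."""
    found: Set[Triple] = set()
    stack = [(q, "")]                          # depth-first exploration of paths from q
    while stack:
        state, text = stack.pop()
        for symbol, nxt in aut.get(state, []):
            if symbol == ";":
                if text:                       # ignore empty statements
                    found.add((q, text, nxt))
            else:
                stack.append((nxt, text + symbol))
    return found

def build_cfg(aut: Automaton, initial: int) -> Tuple[Set[int], Set[Triple]]:
    """Worklist construction of nodes and labeled edges from an acyclic automaton."""
    nodes: Set[int] = {initial}
    edges: Set[Triple] = set()
    seen: Set[int] = {initial}
    worklist = [initial]
    while worklist:
        q = worklist.pop()
        for src, stmt, dst in statements_from(aut, q):
            nodes.add(dst)
            edges.add((src, stmt, dst))
            if dst not in seen:                # process each reached state once
                seen.add(dst)
                worklist.append(dst)
    return nodes, edges

# Automaton accepting "x:=1;y:=2;" and "x:=3;y:=2;".
example: Automaton = {
    0: [("x", 1)], 1: [(":", 2)], 2: [("=", 3)],
    3: [("1", 4), ("3", 4)], 4: [(";", 5)],
    5: [("y", 6)], 6: [(":", 7)], 7: [("=", 8)], 8: [("2", 9)], 9: [(";", 10)],
}
nodes, edges = build_cfg(example, 0)
```

On this example, the two alternative assignments to x become two parallel edges between the same pair of nodes, followed by a single edge for y:=2.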

Problems arise when the automaton contains cycles (namely, when the automaton denotes an infinite language). In this case, Alg. 1 first transforms, at line 1, the input automaton into an automaton without cycles, over an alphabet extended with regular expressions, by means of a dedicated module. Given an input automaton, we retrieve its cycles using the well-known Tarjan's algorithm [tarjan]. Then, for each detected cycle, we check whether the string read by the cycle is a whole statement or not. In the first case, we substitute the cycle in the automaton with the automaton reading the string corresponding to a while(true){ } statement whose body is the statement read by the cycle. Otherwise, if the cycle does not read a whole statement, the idea is to collapse the cycle into a single transition, labeled with the regular expression corresponding to what is read in the cycle, i.e., denoting a set of strings over the original alphabet. Hence, the resulting automaton is over the extended alphabet. In Fig. 7 we report an example of application of this transformation.

Figure 7: (a) Finite state automaton with a cycle. (b) Result of the cycle-removal transformation.
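The case distinction applied to each detected cycle can be sketched as below. The sketch only covers the decision taken once the string read by a cycle is known; cycle detection itself is not shown. It assumes that a whole statement is a `;`-terminated assignment, which is a simplification, and the names `is_whole_statement` and `collapse_cycle` are hypothetical.

```python
import re
from typing import Tuple

# Simplifying assumption: a whole statement is an assignment terminated by ';'.
_STATEMENT = re.compile(r"[A-Za-z_]\w*:=[^;]+;")

def is_whole_statement(text: str) -> bool:
    """Check whether the string read by a cycle is a complete statement."""
    return _STATEMENT.fullmatch(text) is not None

def collapse_cycle(cycle_text: str) -> Tuple[str, str]:
    """Return the kind of replacement and the label that replaces the cycle."""
    if is_whole_statement(cycle_text):
        # The repeated statement becomes the body of an unbounded loop.
        return ("statement", "while(true){" + cycle_text + "}")
    # Otherwise a single transition labeled with a regular expression is used.
    return ("regex", "(" + re.escape(cycle_text) + ")*")

kind, label = collapse_cycle("5")
```

For the running example, the cycle reading 5 is not a whole statement, so it is replaced by one transition carrying a regular expression, and the prefix x:=5 followed by that label still describes x:=5, x:=55, and so on.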

As an example, by applying Alg. 1 to the automaton for str in Fig. 6, we generate the CFG depicted in Fig. 5(b). It is worth noting that the CFG obtained so far may contain abstract expressions on its edges, hence edges may represent an infinite collection of statements. At this point, we need to approximate these edges in order to make the CFG analyzable.

Making the Code Analyzable: Abstracting the CFG.

Let us recall that we have to perform the analysis also on the resulting code, in order to continue the static analysis. Hence, as observed before, we have to combine the code abstraction corresponding to the generated (abstract) CFG with the code abstraction induced by the semantic abstraction, which models the analysis as a code abstraction.
First of all, we have to formally characterize the abstraction induced by the construction of the CFG given above, namely we characterize how the construction merges different predicates. Let us build a code abstraction starting from the CFG built by Alg. 1. In particular, considering the set of collections of predicates between any pair of states in the CFG, we define

(2)

Note that this abstraction, being characterized starting from the CFG, is defined only in terms of a finite subset of the label language, namely on the predicates in the given CFG.
In the running example, the collections of predicates in the CFG already form a partition. In Fig. 8(a) this abstraction is partially depicted.

Figure 8: (a) Code abstraction w.r.t. the CFG reported in Fig. 5(b). (b) Code abstraction induced by the semantic abstraction, restricted to the predicates in the CFG.

Finally, we need to satisfy Eq. 1 (completeness) between the code abstraction built so far and the static analysis, modeled as a semantic abstraction. Clearly, we have no guarantee that the code abstraction built from the CFG satisfies Eq. 1, hence we have to (further) abstract the CFG in order to guarantee completeness w.r.t. the performed static analysis, namely in order to make it possible to perform the given static analysis on the code in the generated CFG. As observed in the previous section, in order to force completeness, we have to combine the desired abstraction on predicates with the code abstraction induced by the semantic abstraction. Formally, in order to allow this operation, since the former is defined only on the predicates in the CFG, we have to restrict the latter to the same predicates. This restriction is obtained by intersecting the meaning of each one of its elements (i.e., its concretization) with the set of predicates in the CFG. In the running example, we have to compute the code abstraction induced by the sign abstraction on the predicates in the CFG. For instance, all the predicates assigning a sequence of 5s to x and the predicate x:=1 cannot be distinguished when integers are abstracted by observing only their signs; the resulting abstraction is depicted in Fig. 8(b), where each abstract predicate corresponds, in the concrete, to the set of predicates in the CFG having that sign.

Finally, we aim at building a code abstraction that can be interpreted by the initial abstract interpreter, namely one that satisfies Eq. 1. By Th. 3.3, such an abstraction is the combination of the two code abstractions above.

Corollary 4.1

Let be additive. Then the code abstraction is complete w.r.t. the semantic abstraction , i.e., it satisfies Eq. 1.

Figure 9: Abstract CFG generated by applying the resulting code abstraction to the CFG.

Hence, in our example, the combined code abstraction satisfies Eq. 1. Finally, we have to abstract the CFG previously generated by applying this code abstraction to each edge of the CFG. In our example, the resulting abstract CFG is reported in Fig. 9.
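One way to read the combination of the two code abstractions on the running example, under the assumption that both are partitions of the finite set of predicates in the CFG, is as the finest partition that is coarser than both. The sketch below computes it by repeatedly merging overlapping blocks; the function name `join_partitions` and the finite sample of predicates are illustrative assumptions.

```python
from typing import FrozenSet, Iterable, List, Set

def join_partitions(p1: Iterable[Set[str]], p2: Iterable[Set[str]]) -> Set[FrozenSet[str]]:
    """Finest partition coarser than both inputs: merge all overlapping blocks."""
    merged: List[Set[str]] = []
    for block in list(p1) + list(p2):
        current = set(block)
        disjoint: List[Set[str]] = []
        for other in merged:
            if other & current:
                current |= other           # absorb every block sharing a predicate
            else:
                disjoint.append(other)
        merged = disjoint + [current]
    return {frozenset(b) for b in merged}

# Finite sample of the predicates in the CFG of the running example (an assumption).
cfg_partition = [{"x:=5", "x:=55"}, {"x:=1"}, {"x<5"}]
sign_partition = [{"x:=5", "x:=55", "x:=1"}, {"x<5"}]
combined = join_partitions(cfg_partition, sign_partition)
```

On this sample the result coincides with the partition induced by signs, since the blocks built from the CFG are already contained in the sign blocks.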

A Taste of Implementation.

A static analyzer based on finite state automata is available at [arceri2018]. Moreover, we have implemented Alg. 1 in order to validate our approach (available at https://github.com/SPY-Lab/java-fsm-library/tree/abstract-parser). The implementation of a static analysis of abstract CFGs is at an early stage of development and is left as future work. Nevertheless, the current implementation is able to parse executable automata and to abstract them into abstract CFGs, as previously described. In order to make these abstract CFGs effectively analyzable, we are currently extending the static analyzer, and the underlying abstract interpreter, to parse, and thus analyze, abstract predicates as well.

5 Conclusion

We conclude by highlighting the value, in the context of static analysis, of the framework presented in this paper. What we propose here is a precision improvement of [tops20], an analysis that attacks an extremely hard problem in static program analysis by abstract interpretation, since the standard static analysis assumption (i.e., the program code we want to analyze must be static) is broken when we have to deal with string-to-code statements. In [tops20], we have shown that, even without this assumption, it is still possible for static analysis to semantically analyze dynamically mutating code in a meaningful and sound way. It was the very first proof of concept for a sound static analysis for self-modifying code based on bounded reflection for a high-level script-like programming language. In this paper, we improve this approach by characterizing code transformations that do not lose precision w.r.t. a fixed abstract semantics/analysis of the code. The idea we develop consists of embedding the property to analyze in the code transformation, in order to make the property analysis work also on the transformed code (as happens in dynamic code analysis). Hence, the main contribution is to make the first truly dynamic static analyzer even more precise, where its distinguishing feature is that the analysis keeps going even when code is dynamically built.
Clearly, the framework improved here is still at an early stage and there is much work to do, not only on the presented algorithm and its implementation, which clearly has to be further developed, but also on making the approach more precise and general. As far as the algorithm is concerned, we have not explicitly provided soundness and completeness proofs or discussions. In particular, completeness holds under decidable hypotheses (the input automaton has to recognize only executable strings), which are only briefly treated here; therefore, these aspects need further formal development.
On the other hand, one direction for improving precision is to integrate the proposed static analysis into a hybrid solution, for instance by using taint analysis (or other dynamic analyses) to decide when to apply static analysis, or by considering more advanced forms of automata-based domains for abstracting strings, such as the one reported in [negrini21]. Finally, we have considered only eval as a string-to-code statement, while there are other ways of dynamically executing code built out of strings that should be investigated. However, we strongly believe that the same approach used for eval could be easily applied to any other string-to-code statement. Moreover, we believe that this framework could be instantiated in order to deal with other forms of code transformations, possibly by considering more general abstractions.

From a more theoretical point of view, interesting future work consists of exploiting the proposed approach for analyzing code in order to investigate, on dynamic languages, several application contexts where static analysis by abstract interpretation has been exploited. First of all, we could trace (abstract) flows of information during execution [GiacobazziM18, MastroeniZ17, Mastroeni13, MastroeniN10, GiacobazziM10, GiacobazziM10bis, GM04CSL] in order to tackle different security issues, such as the detection of (abstract) code injections [BuroM18, MB10] or the formal characterization of dynamic code obfuscators and of their potency [JGM12, GiacobazziM12]. Moreover, the ability to analyze malware code could be exploited for extracting code properties which could be used for analyzing code similarity [PredaGLM15bis], a technique useful, for instance, to identify or at least classify malicious code.

References