Binary code presents several complex aspects that cannot be encountered in source code. One of these aspects is self-modifying code, i.e., code that can modify its own instructions during the execution of the program. Self-modifying code makes reverse engineering harder. Thus, it is extensively used to protect software intellectual property. It is also heavily used by malware writers to make their malware hard to analyse and detect by static analysers and antiviruses. Thus, it is crucial to be able to analyse self-modifying code.
There are several kinds of self-modifying code. In this work, we consider self-modifying code caused by self-modifying instructions. This kind of instruction treats code as data, which allows it to read and write into code and thereby modify the program's own instructions. These self-modifying instructions are usually mov instructions, since mov can access memory and both read from and write to it.
Let us consider the example shown in Figure 1. For simplicity, the addresses’ length is assumed to be 1 byte. In the right box, we give, respectively, the binary code, the addresses of the different instructions, and the corresponding assembly code, obtained by syntactically translating the binary code at each address. For example, 0c is the binary code of the jump instruction jmp. Thus, 0c 02 is translated to jmp 0x2 (jump to address 0x2). The second line is translated to push 0x9, since ff is the binary code of the instruction push. The third instruction mov 0x2 0xc replaces the first byte at address 0x2 by 0xc. Thus, at address 0x2, ff 09 is replaced by 0c 09. This means that the instruction push 0x9 is replaced by the jump instruction jmp 0x9 (jump to address 0x9), etc. Therefore, this code is self-modifying: the mov instruction was able to modify the instructions of the program via its ability to read and write the memory. If we study this code without looking at the semantics of the self-modifying instructions, we will extract from it the Control Flow Graph CFG a shown on the left of the figure, and we will reach the conclusion that the call to the API function CopyFileA at address 0x9 cannot be made. However, the correct CFG is CFG b on the right-hand side, where the call to the API function CopyFileA at address 0x9 can be reached. Thus, it is very important to take into account the semantics of the self-modifying instructions in binary code.
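To make the effect of the mov instruction concrete, the following minimal sketch decodes the bytes at address 0x2 before and after the write. It is a toy model rather than a real instruction set: the opcodes 0c (jmp) and ff (push) are taken from the example, while the opcode c6 for mov, the three-byte layout of the mov instruction, and the helper name decode are assumptions made purely for illustration.

```python
# Toy opcode table: opcode byte -> (mnemonic, number of one-byte operands).
# 0x0c (jmp) and 0xff (push) come from the example in the text;
# 0xc6 for mov is a hypothetical value chosen for this sketch.
OPCODES = {0x0C: ("jmp", 1), 0xFF: ("push", 1), 0xC6: ("mov", 2)}


def decode(memory, addr):
    """Syntactically translate the instruction stored at addr."""
    mnemonic, n_operands = OPCODES[memory[addr]]
    operands = memory[addr + 1 : addr + 1 + n_operands]
    return " ".join([mnemonic] + [f"0x{b:x}" for b in operands])


# 0x0: jmp 0x2 | 0x2: push 0x9 | 0x4: mov 0x2 0xc (assumed encoding)
memory = bytearray([0x0C, 0x02, 0xFF, 0x09, 0xC6, 0x02, 0x0C])

before = decode(memory, 0x2)

# Executing "mov 0x2 0xc" overwrites the byte at address 0x2 with 0x0c.
_, target, value = memory[0x4:0x7]
memory[target] = value

after = decode(memory, 0x2)
print(before, "->", after)
```

A purely syntactic disassembly only ever sees the first decoding, which is why it misses the jump to address 0x9.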
In this paper, we consider the LTL model-checking problem of self-modifying code. To this aim, we use Self-Modifying Pushdown Systems (SM-PDSs)  to model self-modifying code. Indeed, SM-PDSs were shown in  to be an adequate model for self-modifying code, since they mimic the program’s stack while taking into account the self-modifying semantics of the transitions. This is very important for binary code analysis and malware detection, since malware relies on calls to API functions of the operating system. Thus, antiviruses check the API calls to determine whether a program is malicious or not. Therefore, to evade these antiviruses, malware writers try to hide the API calls they make by replacing calls with push and jump instructions. Thus, to be able to analyse such malware, it is crucial to be able to analyse the program’s stack. Hence the need for a model like pushdown systems and self-modifying pushdown systems, since they mimic the program’s stack.
Intuitively, a SM-PDS is a pushdown system (PDS) with self-modifying rules, i.e., with rules that can modify the current set of transitions during execution. This model was introduced in  in order to represent self-modifying code. In , the authors proposed algorithms to compute finite automata that accept the forward and backward reachability sets of SM-PDSs. In this work, we tackle the problem of LTL model-checking of SM-PDSs. Since SM-PDSs are equivalent to PDSs , one possible approach for LTL model checking of SM-PDSs is to translate the SM-PDS to a standard PDS and then run the LTL model checking algorithm on the equivalent PDS [2, 10]. But the translation from a SM-PDS to a standard PDS is exponential. Thus, performing the LTL model checking on the equivalent PDS is not efficient.
To overcome this limitation, we propose a direct LTL model checking algorithm for SM-PDSs. Our algorithm is based on reducing the LTL model checking problem to the emptiness problem of Self Modifying Büchi Pushdown Systems (SM-BPDS). Intuitively, we obtain this SM-BPDS by taking the product of the SM-PDS with a Büchi automaton accepting an LTL formula . Then, we solve the emptiness problem of an SM-BPDS by computing its repeating heads. This computation is based on computing labelled configurations by applying a saturation procedure on labelled finite automata.
We implemented our algorithm in a tool. Our experiments show that our direct techniques are much more efficient than translating the SM-PDS to an equivalent PDS and then applying the standard LTL model checking for PDSs [2, 10]. Moreover, we successfully applied our tool to the analysis of 892 self-modifying malware samples. Our tool was also able to detect several self-modifying malware samples that well-known antiviruses like BitDefender, Kinsoft, Avira, eScan, Kaspersky, Qihoo-360, Baidu, Avast, and Symantec were not able to detect.
Related Work. Model checking and static analysis approaches have been widely used to analyze binary programs, for instance, in [9, 5, 23, 11, 3]. Temporal Logics were chosen to describe malicious behaviors in [20, 11, 3, 4, 8]. However, these works cannot deal with self-modifying code.
POMMADE is a malware detector based on model checking of PDSs, and STAMAD is a malware detector based on PDSs and machine learning. However, POMMADE and STAMAD cannot deal with self-modifying code.
Cai et al.  use local reasoning and separation logic to describe self-modifying code and treat program code uniformly as a regular data structure. However,  requires programs to be manually annotated with invariants. In , the authors propose a formal semantics for self-modifying code and use it to represent self-unpacking code. This work only deals with packing and unpacking behaviours. Bonfante et al.  provide an operational semantics for self-modifying programs and show that they can be constructively rewritten to a non-modifying program. However, all these specifications [6, 7, 26] are too abstract to be used in practice.
In , the authors propose a new representation of self-modifying code named State Enhanced-Control Flow Graph (SE-CFG). SE-CFG extends standard control flow graphs with a new data structure that keeps track of the possible states programs can reach, and with edges that can be conditional on the state of the target memory location. It is not easy to analyse a binary program using only its SE-CFG, especially since this representation does not take the stack of the program into account.
 propose abstract interpretation techniques to compute an over-approximation of the set of reachable states of a self-modifying program, where for each control point of the program, an over-approximation of the memory state at this control point is provided.  combine static and dynamic analysis techniques to analyse self-modifying programs. Unlike our approach, these techniques [24, 18] cannot handle the program’s stack.
Outline. The rest of the paper is structured as follows: Section 2 recalls the definition of Self Modifying Pushdown Systems. LTL model checking and SM-BPDSs are defined in Section 3. Section 4 solves the emptiness problem of SM-BPDSs. Finally, the experiments are reported in Section 5.
2 Self Modifying Pushdown Systems
We recall in this section the definition of Self-modifying Pushdown Systems .
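A Self-modifying Pushdown System (SM-PDS) is a tuple $\mathcal{P} = (P, \Gamma, \Delta, \Delta_c)$, where $P$ is a finite set of control points, $\Gamma$ is a finite set of stack symbols, $\Delta \subseteq (P \times \Gamma) \times (P \times \Gamma^*)$ is a finite set of transition rules, and $\Delta_c \subseteq P \times (\Delta \cup \Delta_c) \times (\Delta \cup \Delta_c) \times P$ is a finite set of modifying transition rules. If $((p, \gamma), (p', w)) \in \Delta$, we also write $\langle p, \gamma \rangle \hookrightarrow \langle p', w \rangle \in \Delta$. If $(p, r_1, r_2, p') \in \Delta_c$, we also write $p \xhookrightarrow{(r_1, r_2)} p' \in \Delta_c$. A Pushdown System (PDS) is a SM-PDS where $\Delta_c = \emptyset$.
Intuitively, a Self-modifying Pushdown System is a Pushdown System that can dynamically modify its set of rules during execution. Rules in $\Delta$ are standard PDS transition rules, while rules in $\Delta_c$ modify the current set of transition rules: $\langle p, \gamma \rangle \hookrightarrow \langle p', w \rangle \in \Delta$ expresses that if the SM-PDS is in control point $p$ and has $\gamma$ on top of its stack, then it can move to control point $p'$, pop $\gamma$ and push $w$ onto the stack, while $p \xhookrightarrow{(r_1, r_2)} p' \in \Delta_c$ expresses that when the SM-PDS is in control point $p$, then it can move to control point $p'$, remove the rule $r_1$ from its current set of transition rules, and add the rule $r_2$.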
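Formally, a configuration of a SM-PDS is a tuple $c = (\langle p, w \rangle, \theta)$ where $p \in P$ is the control point, $w \in \Gamma^*$ is the stack content, and $\theta \subseteq \Delta \cup \Delta_c$ is the current set of transition rules of the SM-PDS. $\theta$ is called the current phase of the SM-PDS. When the SM-PDS is a PDS, i.e., when $\Delta_c = \emptyset$, a configuration is a tuple $(\langle p, w \rangle, \Delta)$, since there is no changing rule, so there is only one possible phase. In this case, we can also write $c = \langle p, w \rangle$. Let $\mathcal{C}$ be the set of configurations of a SM-PDS. A SM-PDS defines a transition relation $\Rightarrow_{\mathcal{P}}$ between configurations as follows: Let $c = (\langle p, w \rangle, \theta)$ be a configuration, and let $r$ be a rule in $\theta$, then:
if $r \in \Delta_c$ is of the form $r = p \xhookrightarrow{(r_1, r_2)} p'$, such that $r_1 \in \theta$, then $(\langle p, w \rangle, \theta) \Rightarrow_{\mathcal{P}} (\langle p', w \rangle, \theta')$, where $\theta' = (\theta \setminus \{r_1\}) \cup \{r_2\}$. In other words, the transition rule $r$ updates the current set of transition rules $\theta$ by removing $r_1$ from it and adding $r_2$ to it.
if $r \in \Delta$ is of the form $r = \langle p, \gamma \rangle \hookrightarrow \langle p', w' \rangle$, then $(\langle p, \gamma w'' \rangle, \theta) \Rightarrow_{\mathcal{P}} (\langle p', w' w'' \rangle, \theta)$ for every $w'' \in \Gamma^*$. In other words, the transition rule $r$ moves the control point from $p$ to $p'$, pops $\gamma$ from the stack and pushes $w'$ onto the stack. This transition keeps the current set of transition rules $\theta$ unchanged.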
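The two cases of the transition relation can be sketched as a one-step successor function. This is a minimal illustration under stated assumptions, not the authors' implementation: the tuple encodings of rules and configurations, the tags "std" and "mod", and the function name successors are all hypothetical.

```python
# Assumed encodings (hypothetical, for illustration only):
#   standard rule:   ("std", p, gamma, p2, pushed)  with pushed a tuple of symbols
#   modifying rule:  ("mod", p, r1, r2, p2)
#   configuration:   (control point, stack as a tuple with the top first, phase)


def successors(config):
    """Return all configurations reachable in one step from config."""
    p, stack, theta = config
    result = []
    for rule in theta:
        if rule[0] == "mod":
            _, src, r1, r2, dst = rule
            # Modifying rule: requires r1 in the current phase; stack unchanged.
            if src == p and r1 in theta:
                result.append((dst, stack, (theta - {r1}) | {r2}))
        else:
            _, src, gamma, dst, pushed = rule
            # Standard rule: requires gamma on top of the stack; phase unchanged.
            if src == p and stack and stack[0] == gamma:
                result.append((dst, pushed + stack[1:], theta))
    return result


r_push = ("std", "p", "a", "p", ("b", "a"))
r_pop = ("std", "p", "a", "q", ())
r_mod = ("mod", "p", r_push, r_pop, "p")  # replaces r_push by r_pop

theta0 = frozenset({r_push, r_mod})
theta1 = frozenset({r_pop, r_mod})
c0 = ("p", ("a",), theta0)
```

In this example, c0 has two successors: one obtained by applying r_push within phase theta0, and one obtained by applying r_mod, which keeps the stack and switches to phase theta1, where only r_pop can fire.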
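Let $\Rightarrow^*_{\mathcal{P}}$ be the transitive, reflexive closure of $\Rightarrow_{\mathcal{P}}$ and $\Rightarrow^+_{\mathcal{P}}$ be its transitive closure. An execution (a run) of $\mathcal{P}$ is a sequence of configurations $\pi = c_0 c_1 \ldots$ s.t. $c_i \Rightarrow_{\mathcal{P}} c_{i+1}$ for every $i \geq 0$. Given a configuration $c$, the set of immediate predecessors (resp. successors) of $c$ is $pre_{\mathcal{P}}(c) = \{c' \in \mathcal{C} : c' \Rightarrow_{\mathcal{P}} c\}$ (resp. $post_{\mathcal{P}}(c) = \{c' \in \mathcal{C} : c \Rightarrow_{\mathcal{P}} c'\}$). These notations can be generalized straightforwardly to sets of configurations. Let $pre^*_{\mathcal{P}}$ (resp. $post^*_{\mathcal{P}}$) denote the reflexive-transitive closure of $pre_{\mathcal{P}}$ (resp. $post_{\mathcal{P}}$). We remove the subscript $\mathcal{P}$ when it is clear from the context.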
We suppose w.l.o.g. that rules in are of the form such that , and that the self-modifying rules in are such that . Note that this is not a restriction, since for a given SM-PDS, one can compute an equivalent SM-PDS that satisfies these conditions  .
2.2 SM-PDS vs. PDS
Let be a SM-PDS. It was shown in  that:
can be described by an equivalent pushdown system (PDS). Indeed, since the number of phases is finite, we can encode phases in the control point of the PDS. However, this translation is not efficient since the number of control points of the equivalent PDS is .
can also be described by an equivalent Symbolic pushdown system , where each SM-PDS rule is represented by a single, symbolic transition, where the different values of the phases are encoded in a symbolic way using relations between phases. This translation is not efficient either, since the size of the relations used in the symbolic transitions is .
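The idea behind the first translation, encoding phases in the control points, can be sketched as follows. This is only an illustration under simplifying assumptions: phases are explored by applying modifying rules while ignoring control points and stack contents (so the result over-approximates the phases that are actually reachable), modifying rules are encoded as hypothetical tuples (p, r1, r2, p'), and the function names are made up for this sketch.

```python
def phases(initial_phase, modifying_rules):
    """Collect every phase obtainable by repeatedly applying modifying rules."""
    start = frozenset(initial_phase)
    seen, todo = {start}, [start]
    while todo:
        theta = todo.pop()
        for rule in modifying_rules:
            _, r1, r2, _ = rule
            # A modifying rule is usable if it is in the phase and so is r1.
            if rule in theta and r1 in theta:
                nxt = (theta - {r1}) | {r2}
                if nxt not in seen:
                    seen.add(nxt)
                    todo.append(nxt)
    return seen


def encoded_control_points(control_points, initial_phase, modifying_rules):
    """Control points of the encoded PDS are pairs (control point, phase)."""
    all_phases = phases(initial_phase, modifying_rules)
    return {(p, theta) for p in control_points for theta in all_phases}


# Standard rules are abbreviated to the names "a" and "b" in this example.
m1 = ("p", "a", "b", "q")  # replace rule a by rule b
m2 = ("q", "b", "a", "p")  # replace rule b by rule a
encoded = encoded_control_points({"p", "q"}, {"a", m1, m2}, [m1, m2])
```

Here two phases arise, so the two original control points become four encoded ones. Since every phase multiplies the number of control points, the encoding grows quickly with the number of rules that can be swapped in and out.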
2.3 From Self-modifying Code to SM-PDS
It is shown in  how to describe self-modifying binary code using a SM-PDS. The basic idea is that the control locations of the SM-PDS store the control points of the binary program and the stack mimics the program’s stack. Our translation relies on the disassembler Jakstab  to disassemble binary code, construct the control flow graph (CFG), determine indirect jumps, and compute the possible values of the variables, registers, and memory locations used at each control point of the program. After obtaining the control flow graph, whose edges are equipped with disassembled instructions, we translate the CFG into a SM-PDS as described in . The non self-modifying instructions of the program define the rules of the SM-PDS (which are standard PDS rules), and can be obtained following the translation of  that models non self-modifying instructions of the program by a PDS. Self-modifying instructions are represented using self-modifying transitions of the SM-PDS. For more details, we refer the reader to .
3 LTL Model-Checking of SM-PDSs
3.1 The linear-time temporal logic LTL
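Let $At$ be a finite set of atomic propositions. LTL formulas are defined as follows (where $a \in At$):
$$\varphi ::= a \mid \neg \varphi \mid \varphi_1 \wedge \varphi_2 \mid X \varphi \mid \varphi_1 U \varphi_2$$
Formulae are interpreted on infinite words over $2^{At}$. Let $\omega = \omega_0 \omega_1 \ldots$ be an infinite word over $2^{At}$. We write $\omega^i$ for the suffix of $\omega$ starting at $\omega_i$. We denote $\omega \models \varphi$ to express that $\omega$ satisfies a formula $\varphi$:
$\omega \models a$ iff $a \in \omega_0$;
$\omega \models \neg \varphi$ iff $\omega \not\models \varphi$;
$\omega \models \varphi_1 \wedge \varphi_2$ iff $\omega \models \varphi_1$ and $\omega \models \varphi_2$;
$\omega \models X \varphi$ iff $\omega^1 \models \varphi$;
$\omega \models \varphi_1 U \varphi_2$ iff there exists $i \geq 0$ such that $\omega^i \models \varphi_2$ and $\omega^j \models \varphi_1$ for every $0 \leq j < i$.
The temporal operators G (globally) and F (eventually) are defined as follows: $F \varphi = true \; U \; \varphi$ and $G \varphi = \neg F \neg \varphi$. Let $W(\varphi)$ be the set of infinite words that satisfy an LTL formula $\varphi$. It is well known that $W(\varphi)$ can be accepted by Büchi automata: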
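A Büchi automaton $\mathcal{B}$ is a quintuple $(Q, \Sigma, \eta, q_0, F)$ where $Q$ is a finite set of states, $\Sigma$ is a finite input alphabet, $\eta \subseteq Q \times \Sigma \times Q$ is a set of transitions, $q_0 \in Q$ is the initial state and $F \subseteq Q$ is the set of accepting states. A run of $\mathcal{B}$ on a word $\omega = \omega_0 \omega_1 \ldots \in \Sigma^{\omega}$ is a sequence of states $q_0 q_1 q_2 \ldots$ s.t. $(q_i, \omega_i, q_{i+1}) \in \eta$ for every $i \geq 0$. An infinite word $\omega$ is accepted by $\mathcal{B}$ if $\mathcal{B}$ has a run on $\omega$ that starts at $q_0$ and visits accepting states from $F$ infinitely often.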
 Given an LTL formula , one can effectively construct a Büchi automaton which accepts .
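Büchi acceptance can be illustrated on ultimately periodic words, i.e., a finite prefix followed by a finite loop repeated forever. The sketch below is a minimal illustration only: the function name accepts_lasso, the encoding of transitions as triples, and the hand-written example automaton for "infinitely often a" are assumptions of this example and are not taken from the paper.

```python
def accepts_lasso(transitions, initial, accepting, prefix, loop):
    """Decide whether the automaton accepts prefix . loop^omega (loop non-empty)."""
    word = list(prefix) + list(loop)

    def step(node):
        # A node is (state, position in word); after the last position the
        # word wraps around to the start of the loop.
        q, i = node
        nxt = i + 1 if i + 1 < len(word) else len(prefix)
        return {(q2, nxt) for (q1, a, q2) in transitions if q1 == q and a == word[i]}

    def reachable(starts):
        seen, todo = set(), list(starts)
        while todo:
            node = todo.pop()
            if node not in seen:
                seen.add(node)
                todo.extend(step(node))
        return seen

    # Accept iff some reachable accepting node lies on a cycle.
    for node in reachable([(initial, 0)]):
        if node[0] in accepting and node in reachable(step(node)):
            return True
    return False


A, E = frozenset({"a"}), frozenset()  # letters are sets of atomic propositions
# Hand-written automaton for "infinitely often a": q1 is entered on every a.
gf_a = {("q0", E, "q0"), ("q0", A, "q1"), ("q1", A, "q1"), ("q1", E, "q0")}

holds = accepts_lasso(gf_a, "q0", {"q1"}, [], [E, A])    # (E A)^omega
fails = accepts_lasso(gf_a, "q0", {"q1"}, [A], [E])      # A E^omega
```

The first word contains a infinitely often, so the run visits q1 infinitely often; the second contains a only once, so q1 is visited once and the word is rejected.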
3.2 Self Modifying Büchi Pushdown Systems
A Self Modifying Büchi Pushdown System (SM-BPDS) is a tuple , where is a set of control locations, is a set of accepting control locations, is a finite set of transition rules, and is a finite set of modifying transition rules in the form where .
Let be the transition relation between configurations as follows: Let , and , then
If and , then .
If , and , then where .
A run of is a sequence of configurations s.t. for every . is accepting iff it infinitely often visits configurations having control locations in .
Let and be two configurations of the SM-BPDS . The relation is defined as follows: iff there exists a configuration , s.t. . We remove the subscript when it is clear from the context. We define as follows: iff there exists a sequence of configurations s.t. and .
A head of SM-BPDS is a tuple where , and . A head is repeating if there exists such that . The set of repeating heads of SM-BPDS is called .
We assume w.l.o.g. that for every rule in of the form ,
3.3 From LTL Model-Checking of SM-PDSs to the emptiness problem of SM-BPDSs
Let be a self modifying pushdown system. Let be a set of atomic propositions. Let be a labelling function. Let be an execution of the SM-PDS . Let be an LTL formula over the set of atomic propositions . We say that
Let be a configuration of . We say that iff has a path starting at such that .
Our goal in this paper is to perform LTL model-checking for self-modifying pushdown systems. Since SM-PDSs can be translated to standard (symbolic) pushdown systems, one way to solve this LTL model-checking problem is to compute the (symbolic) pushdown system that is equivalent to the SM-PDS (see section 2.2), and then apply the standard LTL model-checking algorithms on standard PDSs . However, this approach is not efficient (as will be witnessed later in the experiments). Thus, we need a direct approach that performs LTL model-checking on the SM-PDS, without translating it to an equivalent PDS. Let be a Büchi automaton that accepts . We compute the SM-BPDS by performing a kind of product between the SM-PDS and the Büchi automaton as follows:
if and , then . Let be the set of rules of obtained from the rule , i.e., rules of of the form .
if a rule and , then where . Let be the set of rules of obtained from the rule , i.e., rules of of the form .
We can show that:
Let be a configuration of the SM-PDS . iff has an accepting run from where is the set of rules of obtained from the rules of as described above.
Thus, LTL model-checking for SM-PDSs can be reduced to checking whether a SM-BPDS has an accepting run. The rest of the paper is devoted to this problem.
4 The Emptiness Problem of SM-BPDSs
From now on, we fix a SM-BPDS . We can show that has an accepting run starting from a configuration if and only if from , it can reach a configuration with a repeating head:
A SM-BPDS has an accepting run starting from a configuration if and only if there exists a repeating head such that for some .
Proof: : Let be an accepting run starting at configuration where and . We construct an increasing sequence of indices with a property that once any of the configurations is reached, the rest of the run never changes the bottom elements of the stack anymore. This property can be written as follows:
Because has only finitely many different heads, there must be a head which occurs infinitely often as a head in the sequence . Moreover, as some becomes a control location infinitely often, we can find a subsequence of indices with the following property: for every there exist
Because is never looked at or changed in this path, we can have . This proves this direction of the proposition.
: Because is a repeating head, we can construct the following run for some and :
Since occurs infinitely often, the run is accepting.
Thus, since there exists an efficient algorithm to compute the of SM-PDSs , the emptiness problem of a SM-BPDS can be reduced to computing its repeating heads.
4.1 The Head Reachability Graph
Our goal is to compute the set of repeating heads , i.e., the set of heads such that there exists , . I.e., s.t. this path goes through an accepting location in . To this aim, we will compute a finite graph whose nodes are the heads of of the form , where , and ; and whose edges encode the reachability relation between these heads. More precisely, given two heads and , is an edge of the graph means that the configuration can reach a configuration having as head, i.e., it means that there exists s.t. . Moreover, we need to keep the information whether this path visits an accepting location in or not. This information is recorded in the label of the edge : means that the path visits an accepting location in , i.e. that . Otherwise, . Therefore, if the graph contains a loop from a head to itself such that this loop goes through an edge labelled by , then is a repeating head. Thus, computing can be reduced to computing the graph and finding 1-labelled loops in this graph.
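The last step, finding heads that lie on a loop containing a 1-labelled edge, can be sketched as follows. This is a minimal illustration assuming the head reachability graph has already been built and is given as a list of hypothetical triples (head, label, head'); the function name repeating_heads and the string names used for heads are made up for this example.

```python
def repeating_heads(edges):
    """Return the heads that have a loop going through a 1-labelled edge."""
    edges = list(edges)
    succ, nodes = {}, set()
    for u, _, v in edges:
        succ.setdefault(u, set()).add(v)
        nodes.update((u, v))

    def reach(start):
        # Reflexive-transitive reachability, ignoring labels.
        seen, todo = {start}, [start]
        while todo:
            for v in succ.get(todo.pop(), ()):
                if v not in seen:
                    seen.add(v)
                    todo.append(v)
        return seen

    reach_from = {n: reach(n) for n in nodes}
    # h is repeating iff h ->* u, (u, 1, v) is an edge, and v ->* h.
    return {
        h
        for h in nodes
        if any(b == 1 and u in reach_from[h] and h in reach_from[v] for u, b, v in edges)
    }


graph = [("h1", 0, "h2"), ("h2", 1, "h1"), ("h2", 0, "h3"), ("h3", 0, "h3")]
result = repeating_heads(graph)
```

In this example h1 and h2 lie on a loop through the 1-labelled edge from h2 to h1, whereas the only loop on h3 is 0-labelled, so h3 is not repeating. A strongly-connected-component computation would give the same answer more efficiently; plain reachability is used here only to keep the sketch short.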
More precisely, we define the head reachability graph as follows:
The head reachability graph is a tuple such that is an edge of iff:
there exists a transition , , , and iff ;
there exists a transition and iff ;
there exists a transition , for , , s.t. , and iff or
Let be the head reachability graph. We define as follows: let and be two heads of . We write iff booleans , heads s.t. contains the following path where and .
Let be the reflexive transitive closure of the graph relation , and let be defined as follows: Given two heads and , iff there is in a path between and that goes through a 1-labelled edge, i.e., iff there exist heads and s.t.
We can show that:
Let be a self-modifying Büchi pushdown system, and let be its corresponding head reachability graph. A head of is repeating iff has a loop on the node that goes through a 1-labeled edge.
To prove this theorem, we first need to prove the following lemma:
The relations and have the following properties: For any heads and :
iff for some .
iff for some .
Proof: “”: Assume . We proceed by induction on .
Basis. . In this case, , then we can get
Step. . Then there exist and such that . From the induction hypothesis, there exists such that
Since , we have for , hence .
The property holds.
cannot hold for the case .
Basis. In this case, , then we can get and . The property holds.
Step. . As done in the proof of part (a) of this lemma, there exists s.t. . Then if , either or holds. In the first case i.e. , by the induction hypothesis, we can have , hence, holds
The second case depends on the rule applied to get according to Definition 4.
If this edge corresponds to a transition , then and . Since we can obtain from part and , then . This implies that for some
If this edge corresponds to a transition , then and . Since we can obtain from part and , then . This implies that for some .
If this edge corresponds to a transition , then either or holds. If , then we have . Otherwise, . Since we can obtain from part . Therefore, . This implies that for some .
‘”: Assume . We proceed by induction on .
Basis. . In this case, and , then holds.
Step. . Then there exist and such that . There are 2 cases:
Case There must exist a rule such that and . Let denote the minimal length of the stack on the path from to . Then can be written as where (that means will remain on the stack for the path). Furthermore, there exists such that for some . We have for . By the induction on , we have