1. Introduction
Web applications are being used to implement the interfaces of a myriad of services. They are often the first target of attacks, and despite considerable efforts to improve security, there are still many examples of high-impact compromises. In the 2017 OWASP Top 10 list, vulnerabilities like SQL injection (SQLI) and cross-site scripting (XSS) continue to raise significant concerns, but other classes are also listed as being commonly exploited (Williams and Wichers, 2017). Millions of websites have been compromised since Oct. 2014 due to vulnerabilities in plugins of Drupal (BBC Technology, 2014) and WordPress (The Hacker News, 2017b; threatpost, 2017), and the data of more than a billion users has been stolen using SQLI attacks against various kinds of services (governmental, financial, education, mail, etc.) (The Hacker News, 2017a; HELPNETSECURITY, 2017). In addition, the next wave of XSS attacks has been predicted for the past two years, with significant growth of the problem expected (Sink, 2017; Imperva, 2017).
Many of these vulnerabilities are related to malformed inputs that reach some relevant asset (e.g., the database or the user’s browser) by traveling through a code slice (a series of instructions) of the web application. Therefore, a good practice to enhance security is to pass inputs through sanitization functions that invalidate dangerous metacharacters and/or validation functions that check their content. In addition, programmers commonly use static analysis tools to search automatically for bugs in the source code, facilitating their removal. The development of these tools, however, requires coding explicitly the knowledge on how each vulnerability can be detected (Dahse and Holz, 2014; Fonseca and Vieira, 2014; Jovanovic et al., 2006; Medeiros et al., 2016b), which is a complex task. Moreover, this knowledge might be incomplete or partially wrong, making the tools inaccurate (Dahse and Holz, 2015). For example, if the tools do not understand that a certain function sanitizes inputs, they could raise an alert about a vulnerability that does not exist.
This paper presents a new approach for static analysis that is based on learning to recognize vulnerabilities. It leverages artificial intelligence (AI) concepts, more precisely classification models for sequences of observations that are commonly used in the field of natural language processing (NLP). NLP is a confluence of AI and linguistics, which involves the intelligent analysis of written language, i.e., natural languages; in this sense, NLP is considered a sub-area of AI. It can be viewed as a way of giving machines insight into how humans understand natural languages. NLP tasks, such as parts-of-speech (PoS) tagging or named entity recognition (NER), are typically modelled as sequence classification problems, in which a class (e.g., a given morpho-syntactic category) is assigned to each word in a given sentence, according to the estimates given by a structured prediction model that takes word order into consideration. The model’s parameters are normally inferred using supervised machine learning techniques, taking advantage of annotated corpora.
We propose applying a similar approach to web programming languages, i.e., to analyse source code in a manner similar to what is done with natural language text. Even though these languages are artificial, they have many characteristics in common with natural languages, such as words, syntactic rules, sentences, and a grammar. NLP usually employs machine learning to extract rules (knowledge) automatically from a corpus. Then, with this knowledge, other sequences of observations can be processed and classified. NLP has to take into account the order of the observations, as the meaning of sentences depends on it. Therefore, NLP involves forms of classification more sophisticated than approaches based on standard classifiers (e.g., naive Bayes, decision trees, support vector machines), which simply check the presence of certain observations without considering any relation between them.
Our approach for static analysis resorts to machine learning techniques that take the order of source code instructions into account – sequence models – to allow accurate detection and identification of the vulnerabilities in the code. Previous applications of machine learning in the context of static analysis neither produced tools that learn to detect vulnerabilities nor employed sequence models. For example, PhpMinerII resorts to machine learning to train standard classifiers, which then verify if certain constructs (associated with flaws) exist in the code. However, it does not provide the exact location of the vulnerabilities (Shar and Tan, 2012a, b). WAP and WAPe use a taint analyser to search for vulnerabilities and a standard classifier to confirm that the found bugs (in the software security context, we consider a vulnerability to be a bug or a flaw that can be exploited) can actually create security problems (Medeiros et al., 2016b). None of these tools considers the order of code elements or the relation among them, leading to bugs being missed (false negatives, FN) and alarms being raised on correct code (false positives, FP).
Our sequence model is a Hidden Markov Model (HMM) (Rabiner, 1989). A HMM is a Bayesian network composed of nodes corresponding to the states and edges associated with the probabilities of transitioning between states. States are hidden, i.e., they are not observed. Given a sequence of observations, the hidden states (one per observation) are discovered following the model and taking into account the order of the observations. Therefore, the HMM can be used to find the series of states that best explains the sequence of observations.

The paper also presents the hidDEn marKov model diAgNosing vulnerabiliTies (DEKANT) tool that implements our approach for applications written in PHP. The tool was evaluated experimentally with a diverse set of 23 open source web applications with bugs disclosed in the past. These applications are substantial, with an aggregated size of around 8,000 files and 2.5 million lines of code (LoC). All flaws that we are aware of having been previously reported were found by DEKANT. More than one thousand slices were analyzed, 714 of which were classified as having vulnerabilities and 305 as not. The false positives were on the order of two dozen. In addition, the tool checked 23 plugins of WordPress and found 62 zero-day vulnerabilities. These flaws were reported to the developers, and some of them have already confirmed their existence and fixed the plugins. DEKANT was also compared with several other vulnerability detection tools, and the results give evidence that our approach leads to better accuracy and precision.

The main contributions of the paper are: (1) a novel approach for improving the security of web applications by letting static analysis tools learn to discover vulnerabilities through an annotated corpus; (2) an intermediate language representation capturing the relevant features of PHP, and a sequence model that takes into consideration the place where code elements appear in the slices and how they alter the spreading of the input data; (3) a static analysis tool that implements the approach; (4) an experimental evaluation that demonstrates the ability of this tool to find known and zero-day vulnerabilities with a residual number of mistakes.
2. Related Work
Static analysis tools usually search for vulnerabilities in applications by processing the source code (e.g., (Fonseca and Vieira, 2014; Jovanovic et al., 2006; Shankar et al., 2001; Son and Shmatikov, 2011; Dahse and Holz, 2014; Medeiros et al., 2016b; Backes et al., 2017)). Many of these tools perform taint analysis, tracking user inputs to determine if they reach a sensitive sink (i.e., a function that could be exploited). Pixy (Jovanovic et al., 2006) was one of the first tools to automate this kind of analysis for PHP applications. Later on, RIPS (Dahse and Holz, 2014) extended this technique with the ability to process more advanced constructs of PHP (e.g., objects). phpSAFE (Fonseca and Vieira, 2014) is a recent solution that does taint analysis to look for flaws in CMS plugins (e.g., WordPress plugins). WAP (Medeiros et al., 2016b, c) also does taint analysis, but aims at reducing the number of false positives by resorting to data mining, and it also automatically corrects the located bugs. Other works (Yamaguchi et al., 2014, 2015) detect vulnerabilities by processing source code properties represented as graphs. In this paper, we propose a novel approach which, unlike these works, does not involve explicitly coding information about bugs, but instead extracts this knowledge from annotated code samples and thus learns to find the vulnerabilities.
Machine learning has been used in a few works to measure the quality of software by collecting a series of attributes that reveal the presence of software defects (Arisholm et al., 2010; Lessmann et al., 2008). Other approaches resort to machine learning to predict if there are vulnerabilities in a program (Neuhaus et al., 2007; Walden et al., 2009; Perl et al., 2015), which is different from identifying precisely the bugs, something that we do in this paper. To support the predictions, they employ various features, such as past vulnerabilities and function calls (Neuhaus et al., 2007), or a combination of code-metric analysis with metadata gathered from application repositories (Perl et al., 2015). In particular, PhpMinerI and PhpMinerII predict the presence of vulnerabilities in PHP programs (Shar and Tan, 2012a, b; Shar et al., 2013). The tools are first trained with a set of annotated slices that end at a sensitive sink (but do not necessarily start at an entry point), and then they are ready to identify slices with errors. WAP and WAPe are different because they use machine learning and data mining to predict if a vulnerability detected by taint analysis is actually a real bug or a false alarm (Medeiros et al., 2016b, c). In any case, the PhpMiner and WAP tools employ standard classifiers (e.g., logistic regression or a multi-layer perceptron) instead of structured prediction models (i.e., sequence classifiers) as we propose here.
There are a few static analysis tools that implement machine learning techniques. Chucky (Yamaguchi et al., 2013) discovers vulnerabilities by identifying missing checks in C software. VulDeePecker (Li et al., 2018) resorts to code gadgets to represent parts of C programs and then transforms them into vectors. A neural network system then determines if the target program is vulnerable due to buffer or resource management errors. Russell et al. (2018) developed a vulnerability detection tool for C and C++ based on features learned from a dataset and an artificial neural network. Scandariato et al. (2014) perform text mining to predict vulnerable software components in Android applications. SuSi (Rasthofer et al., 2014) employs machine learning to classify sources and sinks in the code of the Android API.

This paper extends our previous work (Medeiros et al., 2016a). Our approach extracts PHP slices but, contrary to the others, translates them into a tokenized language to be processed by a HMM. While tools in the literature collect attributes from a slice and classify them without considering ordering relations among statements, which is simplistic, DEKANT takes into account the place in which code elements appear in the slice. This form of classification allows a more accurate and precise detection of bugs.
| PHP code | slice-isl | variable map | tainted list | slice-isl classification |
|---|---|---|---|---|
| 1 $u = $_POST['username']; | input var | 1 - u | TL = {u} | input,Taint var_vv_u,Taint |
| 2 $q = "SELECT pass FROM users WHERE user='".$u."'"; | var var | 1 u q | TL = {u, q} | var_vv_u,Taint var_vv_q,Taint |
| 3 $r = mysqli_query($con, $q); | ss var var | 1 - q r | TL = {u, q, r} | ss,N-Taint var_vv_q,Taint var_vv_r,Taint |

Figure 1. (a) code with SQLI vulnerability; (b) slice-isl; (c) final classification.
| PHP code | slice-isl | variable map | lists |
|---|---|---|---|
| 1 $u = (isset($_POST['name']) ? $_POST['name'] : ''); | input var | 1 - u | TL = {u}; CTL = {} |
| 2 $a = $_POST['age']; | input var | 1 - a | TL = {u, a}; CTL = {} |
| 3 if (isset($a) && preg_match('/[a-zA-Z]+/', $u) && is_int($a)) | cond fillchk var contentchk var typechk var cond | 0 - - a - u - a - | TL = {u, a}; CTL = {u, a} |
| 4 echo 'input type="hidden" name="user" value="'.$u.'"'; | cond ss var | 0 - - u | TL = {u, a}; CTL = {u, a} |
| 5 else | cond | 0 - | TL = {u, a}; CTL = {} |
| 6 echo $u . "is an invalid user"; | ss var | 0 - u | TL = {u, a}; CTL = {} |

Figure 2. (a) code with XSS vulnerability and validation; (b) slice-isl and variable map; (c) artefact lists.
3. Surface Vulnerabilities
Many classes of security flaws in web applications are caused by improper handling of user inputs; therefore, they are denominated surface vulnerabilities or input validation vulnerabilities. In PHP programs, the malicious input arrives at the application (e.g., $_POST), may then suffer various modifications and be copied to variables, and eventually reaches a security-sensitive function (e.g., mysqli_query or echo), inducing an erroneous action. Below, we introduce the 12 classes of surface vulnerabilities that will be considered in the rest of the paper.
SQLI is the class of vulnerabilities with the highest risk in the OWASP Top 10 list (Williams and Wichers, 2017). Normally, the malicious input is used to change the behavior of a query to a database in order to provoke the disclosure of private data or to corrupt the tables.
Example 3.1 ().
The PHP script of Fig. 1 (a) has a simple SQLI vulnerability. $u receives the username provided by the user (line 1), and then it is inserted in a query (lines 2-3). An attacker can inject a malicious username like ’ OR 1 = 1 - - , modifying the structure of the query and getting the passwords of all users.
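To make the payload's effect concrete, here is a minimal Python sketch mirroring the string concatenation of Fig. 1 (a); the `build_query` helper is ours, not part of the application:

```python
# Hypothetical sketch of Example 3.1: the query is built by string
# concatenation, so a crafted username rewrites its WHERE clause.

def build_query(username):
    # mirrors line 2 of Fig. 1 (a): the input is embedded without sanitization
    return "SELECT pass FROM users WHERE user='" + username + "'"

print(build_query("alice"))
# SELECT pass FROM users WHERE user='alice'
print(build_query("' OR 1 = 1 -- "))
# SELECT pass FROM users WHERE user='' OR 1 = 1 -- '
```

In the second query the condition is always true and the trailing comment discards the closing quote, so the query returns the passwords of all users.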
XSS vulnerabilities allow attackers to execute scripts in the users’ browsers. Below we give an example:
Example 3.2 ().
The code snippet of Fig. 2 (a) has an XSS vulnerability. If the user provides a name, it gets saved in $u (line 1). Then, if the validation in the conditional fails (line 3), the value is returned to the user by echo (line 6). A script provided as input would be executed in the browser, possibly carrying out some malicious action.
The other classes are presented briefly. Remote and local file inclusion (RFI/LFI) flaws also allow attackers to insert code in the vulnerable web application. While in RFI the code can be located in another web site, in LFI it has to be in the local file system (but there are also several strategies to put it there). OS command injection (OSCI) lets an attacker provide commands to be run in a shell of the OS of the web server. Attackers can supply code that is executed by an eval function by exploiting PHP command injection (PHPCI) bugs. LDAP injection (LDAPI), like SQLI, is associated with the construction and execution of queries, in this case for the LDAP service. An attacker can read files from the local file system by exploiting directory traversal / path traversal (DT/PT) and source code disclosure (SCD) vulnerabilities. A comment spamming (CS) bug is related to the manipulation of the ranking of spammers’ web sites. Header injection or HTTP response splitting (HI) allows an attacker to manipulate the HTTP response. An attacker can force a web client to use a session ID that he has defined by exploiting a session fixation (SF) flaw.
4. Overview of the Approach
Our approach for vulnerability detection examines program slices to determine if they contain a bug. The slices are collected from the source code of the target application, and then their instructions are represented in an intermediate language developed to express features that are relevant to surface vulnerabilities. Bugs are found by classifying the translated instructions with an HMM sequence model. Since the model has an understanding of how data flows are affected by operations related to sanitization, validation and modification, it becomes feasible to make an accurate analysis. In order to set up the model, there is a learning phase where an annotated corpus is employed to derive the knowledge about the different classes of vulnerabilities. Afterwards, the model is used to detect vulnerabilities. Fig. 3 illustrates this procedure.

In more detail, the following steps are carried out. The learning phase is composed mainly of steps (1)-(3) while the detection phase encompasses (1) and (4):
(1) Slice collection and translation: get the slices from the application source code (either for learning or detection). Since we are focusing on surface vulnerabilities, the only slices that have to be considered are those that start at some point in the program where user input is received (i.e., at an entry point) and end at a security-sensitive instruction (i.e., a sensitive sink). The resulting slice is a series of tracked instructions between the two points. Then, each instruction of the slice is translated into the Intermediate Slice Language (ISL) (Section 5). ISL is a categorized language with grammar rules that aggregate code elements into classes by functionality. A slice in ISL format is called a slice-isl;
(2) Create the corpus: build a corpus with a group of instructions represented in the intermediate language, labeled either as vulnerable or non-vulnerable. The instructions are provided individually or gathered from slices of training programs. Overall, the corpus includes representative pieces of programs that have various kinds of flaws as well as pieces that handle inputs adequately;
(3) Knowledge extraction: acquire knowledge from the corpus to configure the HMM sequence model, namely compute the probability matrices;
(4) Search for vulnerabilities: use the model to find the best sequence of states that explains a slice in the intermediate language. Each instruction in the slice corresponds to a sequence of observations. These observations are classified by the model, which tracks the variables from previous instructions to determine which emission probabilities are selected. The state computed for the last observation of the last instruction determines the overall classification, either vulnerable or not. If a flaw is found, an alert is reported that includes its location in the source code.
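As an illustration of step (4), the following Python sketch shows how the overall classification could be read from the decoded state sequence. `ToyModel.decode` is a stand-in with hard-coded toy rules, not the real HMM decoder, and all names here are ours rather than DEKANT's:

```python
# Illustrative sketch of step (4): the state of the last observation gives
# the overall classification of the slice.

class ToyModel:
    def decode(self, tokens):
        # stand-in for HMM decoding: in this toy, a sanitization function
        # clears the taint carried by inputs and tainted variables
        sanitized = False
        state, states = "N-Taint", []
        for tok in tokens:
            if tok == "sanit_f":
                sanitized = True
                state = "San"
            elif tok in ("input", "var_vv"):
                state = "San" if sanitized else "Taint"
            elif tok == "var":
                state = "N-Taint" if sanitized else state
            states.append(state)
        return states

def search_for_vulnerabilities(slice_isl, model):
    states = model.decode(slice_isl)
    return states[-1] == "Taint"   # report an alert iff the slice ends tainted

print(search_for_vulnerabilities(["input", "var_vv"], ToyModel()))       # True
print(search_for_vulnerabilities(["sanit_f", "input", "var"], ToyModel()))  # False
```

The second call mirrors the sanitized sequence of Fig. 5, which ends in the N-Taint state and therefore raises no alert.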
5. Intermediate Slice Language
All slices commence with an entry point and finish with a sensitive sink; between them there can be an arbitrary number of statements, such as assignments that transmit data to intermediate variables and various kinds of expressions that validate or modify the data. In other words, a slice contains all instructions (lines of code) that manipulate and propagate an input arriving at an entry point until a sensitive sink is reached, but no other statements.
ISL expresses an instruction as a few tokens. Instructions are composed of code elements that are categorized into classes of related items (e.g., class input comprises PHP entry points like $_GET and $_POST). Therefore, classes are the tokens of the ISL language, and they are organized according to a grammar. Next we explain ISL in more detail, assuming that the source code is written in PHP. However, the approach is generic and other languages could be considered.
5.1. Tokens
ISL abstracts away aspects of the PHP language that are irrelevant to the discovery of surface vulnerabilities. Therefore, as a starting point to specify ISL, it was necessary to identify the essential tokens. To achieve this, we followed an iterative approach in which we began with an initial group of tokens that was gradually refined. In every iteration, we examined various slices (vulnerable and not) to recognize the important code elements. We also looked at the PHP instructions that could manipulate entry points and be associated with bugs or prevent them (e.g., functions that replace characters in strings). In addition, for PHP functions, we carefully studied their parameters to determine which of them are crucial for our analysis. In the end, we defined around twenty tokens that are sufficient to describe the instructions of a PHP program.
Example 5.1 ().
Function mysqli_query and its parameters correspond to two tokens: ss for sensitive sink, and var for a variable (or input if the parameter receives data originating from an entry point). Although this function has three parameters (the last of them optional), just one of them (the second) is essential to represent.
| Token | Description | PHP Function | Taint |
|---|---|---|---|
| input | entry point | $_GET, $_POST, $_COOKIE, $_REQUEST, $_HTTP_GET_VARS, $_HTTP_POST_VARS, $_HTTP_COOKIE_VARS, $_HTTP_REQUEST_VARS, $_FILES, $_SERVERS | Yes |
| var | variable | – | No |
| sanit_f | sanitization function | mysql_escape_string, mysql_real_escape_string, mysqli_escape_string, mysqli_real_escape_string, mysqli_stmt_bind_param, mysqli::escape_string, mysqli::real_escape_string, mysqli_stmt::bind_param, htmlentities, htmlspecialchars, strip_tags, urlencode | No |
| ss | sensitive sink | mysql_query, mysql_unbuffered_query, mysql_db_query, mysqli_query, mysqli_real_query, mysqli_master_query, mysqli_multi_query, mysqli_stmt_execute, mysqli_execute, mysqli::query, mysqli::multi_query, mysqli::real_query, mysqli_stmt::execute, fopen, file_get_contents, file, copy, unlink, move_uploaded_file, imagecreatefromgd2, imagecreatefromgd2part, imagecreatefromgd, imagecreatefromgif, imagecreatefromjpeg, imagecreatefrompng, imagecreatefromstring, imagecreatefromwbmp, imagecreatefromxbm, imagecreatefromxpm, require, require_once, include, include_once, readfile, passthru, system, shell_exec, exec, pcntl_exec, popen, echo, print, printf, die, error, exit, file_put_contents, eval | Yes |
| typechk_str | type checking string function | is_string, ctype_alpha, ctype_alnum | Yes |
| typechk_num | type checking numeric function | is_int, is_double, is_float, is_integer, is_long, is_numeric, is_real, is_scalar, ctype_digit | No |
| contentchk | content checking function | preg_match, preg_match_all, ereg, eregi, strnatcmp, strcmp, strncmp, strncasecmp, strcasecmp | No |
| fillchk | fill checking function | isset, empty, is_null | Yes |
| cond | if instruction presence | if | No |
| join_str | join string function | implode, join | No |
| erase_str | erase string function | trim, ltrim, rtrim | Yes |
| replace_str | replace string function | preg_replace, preg_filter, str_ireplace, str_replace, ereg_replace, eregi_replace, str_shuffle, chunk_split | No |
| split_str | split string function | str_split, preg_split, explode, split, spliti | Yes |
| add_str | add string function | str_pad | Yes/No |
| sub_str | substring function | substr | Yes/No |
| sub_str_replace | replace substring function | substr_replace | Yes/No |
| char5 | substring with less than 6 chars | – | No |
| char6 | substring with more than 5 chars | – | Yes |
| start_where | where the substring starts | – | Yes/No |
| conc | concatenation operator | – | Yes/No |
| var_vv | tainted variable | – | Yes |
| miss | miss value | – | Yes/No |
Table 1 summarizes the currently defined ISL tokens. The first column shows the tokens: the first twenty stand for PHP code elements, whereas the last two (var_vv and miss) are necessary only for the description of the corpus and the implementation of the model. The next two columns explain succinctly the purpose of each token and give a few examples. Column four defines the taintedness status of each token, which is used when building the corpus or performing the analysis.
A more careful inspection of the tokens shows that they enable many relevant behaviors to be expressed. For example: since the manipulation of strings plays a fundamental role in the exploitation of surface vulnerabilities, there are various tokens that enable a precise modeling of these operations (e.g., erase_str or sub_str); tokens char5 and char6 capture the number of characters that are manipulated by functions that extract or replace the contents of a user input; the place in a string where modifications are applied (begin, middle or end) is described by start_where; token cond corresponds to an if statement that might have validation functions over variables (e.g., user inputs) as part of its conditional expression, allowing the correlation between the validated variables and the variables that appear inside the if branches.
There are a few tokens that are context-sensitive, i.e., whose selection depends not only on the code elements being translated but also on how they are utilized in the program. Tokens char5 and char6 are two examples, as they depend on the substring length. If this length is only defined at runtime, it is impossible to know precisely which token should be assigned. This ambiguity may cause errors in the analysis, leading either to false positives or to false negatives. However, since we prefer to be conservative (i.e., to report false positives rather than miss vulnerabilities), in the situation where the length is undefined ISL uses the char6 token, because it allows larger payloads to be manipulated. Something similar occurs with the contentchk token, which depends on the verification pattern.
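The conservative token choice just described can be sketched as follows; the function name and the `None` convention for a statically unknown length are our assumptions:

```python
# Sketch of the conservative char5/char6 choice: when the substring length
# is only known at runtime, pick char6, which models larger payloads and
# thus favors false positives over missed vulnerabilities.

def substring_token(length):
    if length is None:          # length only defined at runtime
        return "char6"          # conservative default
    return "char5" if length < 6 else "char6"

print(substring_token(3))     # char5
print(substring_token(10))    # char6
print(substring_token(None))  # char6
```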
ISL must be able to represent PHP instructions in all steps of the two phases of the approach. When slices are extracted for analysis, ISL sets all variables to the default token value var. However, when instructions are placed in the corpus or are processed by the detection procedure, it is necessary to keep information about taintedness. In this case, tainted and untainted variables are depicted respectively by the tokens var_vv and var. The miss token is also used with the corpus, serving to normalize the length of sequences (see the section on the implementation and evaluation of DEKANT).
5.2. Grammar
The ISL grammar is specified by the rules in Listing 1. It maps the code elements included in the instructions into tokens (entries in column 3 of Table 1 are transformed into the column 1 tokens). A slice translated into ISL consists of a set of statements (line 2), each one defined by either: a rule that covers various operations like string concatenation (lines 4-11); a conditional (line 12); or an assignment (line 13). The rules take into consideration the syntax of the functions (in column 3 of the table) in order to convey: a sensitive sink (line 4), sanitization (line 5), validation (line 6), extraction and modification (lines 7-10), and concatenation (line 11).
As we will see in Section 6, tokens will correspond to the observations of the HMM. However, while a PHP assignment sets the left-hand side to the value of the right-hand-side expression, the tokens will be processed from left to right by the model; therefore, the assignment rule in ISL follows the HMM scheme.
Example 5.2 ().
PHP instruction $u = $_GET[’user’]; is translated to input var. The assignment and parameter rules (lines 13, 22 and 23) derive the input token, while the attribution rule produces the var token (line 24).
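The translation of Example 5.2 can be sketched in Python as follows; this is a simplification of the grammar-driven translation, and the entry-point tuple is only a subset of Table 1:

```python
# Simplified sketch of translating a PHP assignment into ISL tokens.
# The right-hand side is emitted first (input or var), then the assigned
# variable (var), matching the left-to-right order of the HMM.

ENTRY_POINTS = ("$_GET", "$_POST", "$_COOKIE", "$_REQUEST")

def translate_assignment(php_line):
    _lhs, rhs = php_line.rstrip(";").split("=", 1)
    rhs_token = "input" if rhs.strip().startswith(ENTRY_POINTS) else "var"
    return [rhs_token, "var"]

print(translate_assignment("$u = $_GET['user'];"))  # ['input', 'var']
```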
6. The Sequence Model
This section presents the sequence model that supports vulnerability detection. It explains the graph that represents the model, identifying the states and the observations that can be emitted.
6.1. Hidden Markov Model
A Hidden Markov Model (HMM) is a statistical generative model that represents a process as a Markov chain with unobserved (hidden) states. It is a dynamic Bayesian network with nodes that stand for random variables and edges that denote probabilistic dependencies between these variables (Baum and Petrie, 1966; Jurafsky and Martin, 2008; Smith, 2011). The variables are divided into two groups: observed variables (observations) and hidden variables (states). A state transitions to other states with some probability and emits observations (see the example in Fig. 5).

A HMM is specified by the following: (1) a vocabulary, the set of words, symbols or tokens that make up the sequences of observations; (2) the states, a group of states that classify the observations of a sequence; (3) the parameters, a set of probabilities, where (i) the initial probabilities indicate the probability that a sequence of observations begins at each start-state; (ii) the transition probabilities characterize the changes between states; and (iii) the emission probabilities specify the probability of a state emitting a given observation.
In the context of NLP, sequence models are used to classify a series of observations, which correspond to the succession of words observed in a sentence. In particular, a HMM is used in PoS tagging tasks, allowing the discovery of the series of states that best explains a new sequence of observations. This is known as the decoding problem, which can be solved by the Viterbi algorithm (Viterbi, 1967). This algorithm resorts to dynamic programming to pick the best hidden state sequence. Although the Viterbi algorithm employs bigrams to generate the i-th state, it implicitly takes into account all previously generated states. In a nutshell, the algorithm iteratively obtains the probability distribution for the i-th state based on the probabilities computed for the (i-1)-th state, taking into consideration the parameters of the model.

The parameters of the HMM are learned by processing a corpus that is created for training. Observations and state transitions are counted, and afterwards the counts are normalized in order to obtain probability distributions; a smoothing procedure may also be applied to deal with rare events in the training data (e.g., add-one smoothing).
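The counting-and-normalization step can be sketched as below. The two-sentence corpus, the reduced vocabulary, and all names are illustrative assumptions, not the paper's actual corpus:

```python
# Sketch of parameter learning: count state transitions and emissions in an
# annotated corpus, then normalize with add-one smoothing.

from collections import Counter, defaultdict

corpus = [  # each sentence: list of (token, state) pairs
    [("input", "Taint"), ("var", "Taint")],
    [("sanit_f", "San"), ("input", "San"), ("var", "N-Taint")],
]

states = ["Taint", "N-Taint", "San", "Val", "Chg_str"]
vocab = ["input", "var", "var_vv", "sanit_f", "conc"]

trans = defaultdict(Counter)
emit = defaultdict(Counter)
for sent in corpus:
    for i, (tok, st) in enumerate(sent):
        emit[st][tok] += 1
        if i > 0:
            trans[sent[i - 1][1]][st] += 1

def transition_prob(s1, s2):
    # add-one smoothing keeps a small probability for unseen transitions
    return (trans[s1][s2] + 1) / (sum(trans[s1].values()) + len(states))

def emission_prob(s, tok):
    return (emit[s][tok] + 1) / (sum(emit[s].values()) + len(vocab))
```

With this toy corpus, the observed transition San to N-Taint gets a higher probability than the unseen transition San to Val, and each smoothed distribution still sums to one over its support.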
| State | Description | Emitted observations |
|---|---|---|
| Taint | Tainted | conc, input, var, var_vv |
| N-Taint | Not tainted | conc, cond, input, var, var_vv, ss |
| San | Sanitization | input, sanit_f, var, var_vv |
| Val | Validation | contentchk, fillchk, input, typechk_num, typechk_str, var, var_vv |
| Chg_str | Change string | add_str, char5, char6, erase_str, input, join_str, replace_str, split_str, start_where, sub_str, sub_str_replace, var, var_vv |
6.2. Vocabulary and States
As our HMM operates over the program instructions translated into ISL, the vocabulary is composed of the previously described ISL tokens. The states are selected to represent the fundamental operations that can be performed on the input data as it flows through a slice. Five states were defined, as displayed in Table 2. The final state of an instruction in ISL is either vulnerable (Taint) or not-vulnerable (N-Taint). However, in order to attain an accurate detection, it is necessary to take into account the sanitization (San), validation (Val) and modification (Chg_str) of the user inputs and of the variables that may depend on them. Therefore, these three factors are represented as intermediate states in the model. As strings are at the base of web surface vulnerabilities, these three states allow the model to determine the intermediate state while an application manipulates them.

6.3. Graph of the Model
Our HMM consists of the graph in Fig. 4, where the nodes constitute the states and the edges the transitions between them. The dashed squares next to the nodes hold the observations that can be emitted in each state.
An ISL instruction corresponds to a sequence of observations. The sequence can start in any state except Val. However, it can reach the Val state, for example due to conditionals that check the input data. In the example of Fig. 2 (b), line 3 shows a sequence that begins with a cond observation, which could be emitted by the N-Taint initial state. Then, the model would transit to the Val state due to the check carried out in the if conditional. When the processing of the sequence completes, the model is always in either the Taint or the N-Taint state. Therefore, the final state determines the overall classification of the statement, i.e., whether the instruction is vulnerable or not.
Example 6.1.
Fig. 5 shows an instantiation of the model for one sequence. The sanitization instruction is translated to the ISL sequence sanit_f input var. The sequence starts in the San state and emits the sanit_f observation; next it remains in the same state and emits the input observation; then, it transitions to the N-Taint state, emitting the var observation (untainted variable).

(a) PHP instruction: $p = mysqli_real_escape_string($con, $_GET[’user’])
ISL instruction: sanit_f input var
Sequence: sanit_f,San input,San var,N-Taint
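The decoding that the example illustrates — finding the most likely state path for a sequence of ISL observations — can be sketched with the standard Viterbi algorithm. This is our own minimal sketch, not the authors' implementation, and the probabilities at the bottom are illustrative placeholders rather than trained parameters.

```python
import math

NEG = -math.inf  # log-probability of an impossible event
STATES = ["Taint", "N-Taint", "San", "Val", "Chg_str"]

def viterbi(observations, start_p, trans_p, emit_p):
    """Most likely state path for a sequence of ISL tokens.
    start_p, trans_p and emit_p hold log-probabilities; missing
    entries are treated as impossible."""
    best = {s: start_p.get(s, NEG) + emit_p.get(s, {}).get(observations[0], NEG)
            for s in STATES}
    path = {s: [s] for s in STATES}
    for obs in observations[1:]:
        new_best, new_path = {}, {}
        for s in STATES:
            prev = max(STATES,
                       key=lambda p: best[p] + trans_p.get(p, {}).get(s, NEG))
            score = best[prev] + trans_p.get(prev, {}).get(s, NEG)
            new_best[s] = score + emit_p.get(s, {}).get(obs, NEG)
            new_path[s] = path[prev] + [s]
        best, path = new_best, new_path
    # the sequence always ends in Taint or N-Taint
    final = max(("Taint", "N-Taint"), key=lambda s: best[s])
    return path[final], final

# Toy parameters (illustrative only): decode the sequence of Example 6.1.
lp = math.log
start_p = {"San": lp(0.5), "N-Taint": lp(0.5)}
trans_p = {"San": {"San": lp(0.5), "N-Taint": lp(0.5)},
           "N-Taint": {"N-Taint": lp(1.0)}}
emit_p = {"San": {"sanit_f": lp(0.5), "input": lp(0.5)},
          "N-Taint": {"var": lp(0.6), "cond": lp(0.4)},
          "Taint": {"var_vv": lp(1.0)}}
path, final = viterbi(["sanit_f", "input", "var"], start_p, trans_p, emit_p)
print(path, final)  # → ['San', 'San', 'N-Taint'] N-Taint
```

With these placeholder parameters the decoder reproduces the annotated sequence of Fig. 5: the instruction ends in N-Taint, i.e., not vulnerable.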
7. Learning and Vulnerability Detection
This section explains the main activities related to our approach. The learning phase encompasses a number of activities that culminate in the computation of the parameters of the HMM. Afterwards, vulnerabilities are found in the detection phase by processing the slices of the target application through the model. Fig. 3 illustrates the fundamental steps.
7.1. Slice Extraction and Translation Process
The slice extractor analyses files with the source code, gathering the slices that start at an entry point and eventually reach a security-sensitive sink. The instructions between these points are those that implement the application logic based on the user input data. The slice extractor performs intra- and inter-procedural analysis, as it tracks the inputs and their dependencies along the program, walking through the invoked functions. The analysis is context-sensitive, as it takes into account the results of function calls.
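A much-simplified, intra-procedural flavor of this step can be sketched as follows. The entry-point and sink sets and the substring-based matching are our own assumptions for illustration; the actual slice extractor parses the code and is inter- procedural and context-sensitive.

```python
import re

# Assumed (simplified) sets; the real tool knows many more of each.
ENTRY_POINTS = ("$_GET", "$_POST", "$_COOKIE")
SINKS = ("mysqli_query", "echo")

def extract_slices(lines):
    """Crude intra-procedural sketch: collect the line numbers of the
    instructions between an entry point and a security-sensitive sink.
    Substring matching stands in for real parsing."""
    tainted, slice_lines, slices = set(), [], []
    for i, line in enumerate(lines, 1):
        uses_taint = any(v in line for v in tainted)
        if uses_taint and any(s in line for s in SINKS):
            slices.append(slice_lines + [i])   # slice reaches a sink
            continue
        assign = re.match(r"\s*(\$\w+)\s*=", line)
        if assign and (any(e in line for e in ENTRY_POINTS) or uses_taint):
            tainted.add(assign.group(1))       # taint propagates to the lhs
            slice_lines.append(i)
    return slices

# Code snippet in the spirit of Fig. 1(a):
code = [
    "$u = $_POST['username'];",
    "$q = \"SELECT * FROM users WHERE username='\" . $u . \"'\";",
    "$result = mysqli_query($con, $q);",
]
print(extract_slices(code))  # → [[1, 2, 3]]
```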
A translation process occurs when the instructions are collected, and consists of representing them as ISL tokens. However, ISL does not maintain much information about the variables portrayed by the var token. This knowledge is nevertheless crucial for a more accurate vulnerability detection, as variables are related to the inputs in distinct manners and their contents can suffer all sorts of modifications. Therefore, to address this issue, we update a data structure called variable map while the slice is translated. The map associates each occurrence of var in the slice-isl with the name of the variable that appears in the source code. This lets us track how input data propagates to different variables when the slice code elements are processed.
There is an entry in the variable map per instruction. Each entry starts with a flag, 1 or 0, indicating whether the statement is an assignment. The rest of the entry includes one value per token of the instruction, which is either the name of the variable (without the $) or the - character (standing for a token position that is not occupied by a variable).
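The entry layout just described can be sketched as a small helper. The textual format comes from the text; the Python representation is our own assumption.

```python
def variable_map_entry(variables, is_assignment):
    """Build one variable-map entry. `variables` has one element per ISL
    token of the instruction: the source-level variable name for a token
    occupied by a variable, or None otherwise. The entry starts with the
    1/0 assignment flag, followed by the name without '$' or '-'."""
    entry = ["1" if is_assignment else "0"]
    for var in variables:
        entry.append(var.lstrip("$") if var else "-")
    return " ".join(entry)

# $u = $_POST['username'];  ISL: input var  ->  entry "1 - u"
print(variable_map_entry([None, "$u"], True))  # → 1 - u
# $q = "..." . $u . "...";  ISL: var var    ->  entry "1 u q"
print(variable_map_entry(["$u", "$q"], True))  # → 1 u q
```

The two calls reproduce the entries discussed in Example 7.1.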
Example 7.1.
Fig. 1(a) displays a PHP code snippet that is vulnerable to SQLI and Fig. 1(b) shows the translation into ISL and the variable map (ignore the right-hand side for now). The first line is the assignment of an input to a variable, $u = $_POST[’username’];. As explained above, it becomes input var in ISL. The corresponding variable map entry is 1 - u, where the leading 1 denotes that the instruction is an assignment and u names the variable represented by the var token in the second position. The next line is an assignment of a SQL query composed by concatenating constant substrings with a variable. It is represented in ISL by var var and in the variable map by 1 u q. The last line corresponds to a sensitive sink (ss) and two variables.
Example 7.2.
Fig. 2 has a slightly more complex code snippet. The slice extractor takes from the code two slices: lines {1, 2, 3, 4} and {1, 3, 5, 6}. The first prevents an attack with a form of input validation, but the second is vulnerable to XSS. The corresponding ISL and variable map are shown in the middle columns. The interesting cases are in lines 3 and 4, which are the if statement and its true branch. Both are prefixed with the cond token, and the former also ends with the same token. This cond termination distinguishes the two types of instructions. In addition, the sequence model will understand that variables from the former may influence those that appear in later instructions.
7.2. Process of Creating the Corpus
The corpus plays an important role as it incorporates the knowledge that will be learned by the model, namely which instructions may lead to a flaw. In our case, the corpus is a group of instructions (not slices) converted to ISL, where tokens are tagged with information related to taint propagation. The model sees the tokens of an instruction in ISL as a sequence of observations. The tags correspond to the states of the model. Therefore, an alternative way to look at the corpus is as a group of sequences of observations annotated with states.
The corpus is built in four steps: (1) collection of a group of instructions that are vulnerable and not-vulnerable, which are placed in a bag; (2) representation of each instruction in the bag in ISL; (3) annotation of the tokens of every instruction (e.g., as tainted or sanitized), i.e., associate a state to each observation of the sequence; and (4) removal of duplicated entries in the bag. In the end, an instruction becomes a list of pairs of token,state.
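Step (4), the removal of duplicated entries, can be sketched as follows. The data layout — each instruction as a list of (token, state) pairs — follows the text; the Python code is our own minimal sketch. Steps (1) and (2) happen beforehand, and step (3) is manual.

```python
def build_corpus(annotated_instructions):
    """Step (4) of the corpus construction: drop duplicated entries while
    preserving the order of the remaining ones. Each instruction is a
    list of (token, state) pairs, i.e., an annotated observation sequence."""
    seen, corpus = set(), []
    for pairs in annotated_instructions:
        key = tuple(pairs)          # hashable form for duplicate detection
        if key not in seen:
            seen.add(key)
            corpus.append(pairs)
    return corpus

entry = [("input", "Taint"), ("var_vv", "Taint")]
corpus = build_corpus([entry, list(entry)])  # the duplicate appears once
print(len(corpus))  # → 1
```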
In the first step, it is necessary to get representative instructions of all classes of bugs that one wants to catch, various forms of validation, diverse forms of manipulating (changing) strings, and different combinations of code elements. To achieve this in practice, we can gather individual instructions and/or select a large number of slices captured from open source training applications. Therefore, both the collection and representation can be performed in an automatic manner (with the slice collector module), but the annotation of the tokens is done manually (as in all supervised machine learning approaches).
Example 7.3.
Instruction $var = $_POST[’parameter’] becomes input var in ISL, and is annotated as input,Taint var_vv,Taint. Both states are Taint (compromised) because the input can be the source of malicious data, and therefore is always Taint, and then the taint propagates to the variable.
As mentioned in the previous section, the token var_vv is not produced when slices are translated into ISL, but used in the corpus to represent variables with state Taint (tainted variables). In fact, during translation into ISL variables are not known to be tainted or not, so they are represented by the var token. In the corpus, if the state of the variable is annotated as Taint, the variable is portrayed by var_vv, forming the pair var_vv,Taint.
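This convention can be sketched as a small annotation helper — our own illustration of the rule, not the authors' code:

```python
def annotate(tokens, states):
    """Pair each ISL token with its annotated state; a var token whose
    state is Taint is rewritten to var_vv, since the corpus represents
    tainted variables that way (sketch of the convention above)."""
    pairs = []
    for token, state in zip(tokens, states):
        if token == "var" and state == "Taint":
            token = "var_vv"
        pairs.append((token, state))
    return pairs

# Example 7.3: $var = $_POST['parameter'] -> input var, both tainted
print(annotate(["input", "var"], ["Taint", "Taint"]))
# → [('input', 'Taint'), ('var_vv', 'Taint')]
```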
The state of the last observation of a sequence corresponds to a final state, and therefore it can only be Taint (vulnerable) or N-Taint (not-vulnerable). If this state is tainted then it means that a malicious input is able to propagate and potentially compromise the execution. Therefore, in this case, the instruction is perceived as vulnerable. Otherwise, the instruction is deemed correct (non-vulnerable).
Example 7.4.
Instruction $v = htmlentities ($_GET[’user’]) is translated to sanit_f input var and placed in the corpus as the succession of pairs sanit_f,San input,San var,N-Taint. The first two tokens are annotated with the San state because function htmlentities sanitizes its parameter; the last token is labeled with the N-Taint state, meaning that the ultimate state of the sequence is not tainted.