Statically Detecting Vulnerabilities by Processing Programming Languages as Natural Languages

10/12/2019
by Ibéria Medeiros, et al.
Web applications continue to be a favorite target for hackers due to a combination of wide adoption and rapid deployment cycles, which often lead to the introduction of high impact vulnerabilities. Static analysis tools are important to search for bugs automatically in the program source code, supporting developers on their removal. However, building these tools requires programming the knowledge on how to discover the vulnerabilities. This paper presents an alternative approach in which tools learn to detect flaws automatically by resorting to artificial intelligence concepts, more concretely to natural language processing. The approach employs a sequence model to learn to characterize vulnerabilities based on an annotated corpus. Afterwards, the model is utilized to discover and identify vulnerabilities in the source code. It was implemented in the DEKANT tool and evaluated experimentally with a large set of PHP applications and WordPress plugins. Overall, we found several hundred vulnerabilities belonging to 12 classes of input validation vulnerabilities, where 62 of them were zero-day.


1. Introduction

Web applications are being used to implement interfaces of a myriad of services. They are often the first target of attacks, and despite considerable efforts to improve security, there are still many examples of high impact compromises. In the 2017 OWASP Top 10 list, vulnerabilities like SQL injection (SQLI) and cross-site scripting (XSS) continue to raise significant concerns, but other classes are also listed as being commonly exploited (Williams and Wichers, 2017). Millions of websites have been compromised since Oct. 2014 due to vulnerabilities in plugins of Drupal (BBC Technology, 2014) and WordPress (The Hacker News, 2017b; threatpost, 2017), and the data of more than a billion users has been stolen using SQLI attacks against various kinds of services (governmental, financial, education, mail, etc) (The Hacker News, 2017a; HELPNETSECURITY, 2017). In addition, the next wave of XSS attacks has been predicted for the past two years, with an important expected growth of the problem (Sink, 2017; Imperva, 2017).

Many of these vulnerabilities are related to malformed inputs that reach some relevant asset (e.g., the database or the user’s browser) by traveling through a code slice (a series of instructions) of the web application. Therefore, a good practice to enhance security is to pass inputs through sanitization functions that invalidate dangerous metacharacters and/or validation functions that check their content. In addition, programmers commonly use static analysis tools to search automatically for bugs in the source code, facilitating their removal. The development of these tools, however, requires coding explicitly the knowledge on how each vulnerability can be detected (Dahse and Holz, 2014; Fonseca and Vieira, 2014; Jovanovic et al., 2006; Medeiros et al., 2016b), which is a complex task. Moreover, this knowledge might be incomplete or partially wrong, making the tools inaccurate (Dahse and Holz, 2015). For example, if the tools do not understand that a certain function sanitizes inputs, they could raise an alert about a vulnerability that does not exist.

This paper presents a new approach for static analysis that is based on learning to recognize vulnerabilities. It leverages artificial intelligence (AI) concepts, more precisely classification models for sequences of observations that are commonly used in the field of natural language processing (NLP). NLP is a confluence of AI and linguistics that involves the intelligent analysis of written language, i.e., natural languages, and in this sense is considered a sub-area of AI. It can be viewed as a way for machines to gain insight into how humans understand natural languages. NLP tasks, such as parts-of-speech (PoS) tagging or named entity recognition (NER), are typically modelled as sequence classification problems, in which a class (e.g., a given morpho-syntactic category) is assigned to each word in a given sentence, according to estimates given by a structured prediction model that takes word order into consideration. The model’s parameters are normally inferred using supervised machine learning techniques, taking advantage of annotated corpora.

We propose applying a similar approach to web programming languages, i.e., to analyse source code in a manner similar to what is done with natural language text. Even though these languages are artificial, they have many characteristics in common with natural languages, such as words, syntactic rules, sentences, and a grammar. NLP usually employs machine learning to extract rules (knowledge) automatically from a corpus. Then, with this knowledge, other sequences of observations can be processed and classified. NLP has to take into account the order of the observations, as the meaning of sentences depends on it. Therefore, NLP involves forms of classification more sophisticated than approaches based on standard classifiers (e.g., naive Bayes, decision trees, support vector machines), which simply check the presence of certain observations without considering any relation between them.

Our approach for static analysis resorts to machine learning techniques that take the order of source code instructions into account – sequence models – to allow accurate detection and identification of the vulnerabilities in the code. Previous applications of machine learning in the context of static analysis neither produced tools that learn to perform detection nor employed sequence models. For example, PhpMinerII resorts to machine learning to train standard classifiers, which then verify if certain constructs (associated with flaws) exist in the code. However, it does not provide the exact location of the vulnerabilities (Shar and Tan, 2012a, b). WAP and WAPe use a taint analyser to search for vulnerabilities and a standard classifier to confirm that the found bugs (in a software security context, we consider a vulnerability to be a bug or flaw that can be exploited) can actually create security problems (Medeiros et al., 2016b). None of these tools considers the order of code elements or the relation among them, leading to bugs being missed (false negatives, FN) and alarms being raised on correct code (false positives, FP).

Our sequence model is a Hidden Markov Model (HMM) (Rabiner, 1989). A HMM is a Bayesian network composed of nodes corresponding to the states and edges associated with the probabilities of transitioning between states. States are hidden, i.e., they are not observed. Given a sequence of observations, the hidden states (one per observation) are discovered by following the model and taking into account the order of the observations. Therefore, the HMM can be used to find the series of states that best explains the sequence of observations.

The paper also presents the hidDEn marKov model diAgNosing vulnerabiliTies (DEKANT) tool that implements our approach for applications written in PHP. The tool was evaluated experimentally with a diverse set of 23 open-source web applications with bugs disclosed in the past. These applications are substantial, with an aggregate size of around 8,000 files and 2.5 million lines of code (LoC). DEKANT found all flaws that we are aware of having been previously reported. More than one thousand slices were analyzed, with 714 classified as having vulnerabilities and 305 as not; the false positives were on the order of two dozen. In addition, the tool checked 23 WordPress plugins and found 62 zero-day vulnerabilities. These flaws were reported to the developers, and some of them have already confirmed their existence and fixed the plugins. DEKANT was also compared with several other vulnerability detection tools, and the results give evidence that our approach leads to better accuracy and precision.

The main contributions of the paper are: (1) a novel approach for improving the security of web applications by letting static analysis tools learn to discover vulnerabilities through an annotated corpus; (2) an intermediate language representation capturing the relevant features of PHP, and a sequence model that takes into consideration the place where code elements appear in the slices and how they alter the spreading of the input data; (3) a static analysis tool that implements the approach; (4) an experimental evaluation that demonstrates the ability of this tool to find known and zero-day vulnerabilities with a residual number of mistakes.

2. Related Work

Static analysis tools usually search for vulnerabilities by processing the application source code (e.g., (Fonseca and Vieira, 2014; Jovanovic et al., 2006; Shankar et al., 2001; Son and Shmatikov, 2011; Dahse and Holz, 2014; Medeiros et al., 2016b; Backes et al., 2017)). Many of these tools perform taint analysis, tracking user inputs to determine if they reach a sensitive sink (i.e., a function that could be exploited). Pixy (Jovanovic et al., 2006) was one of the first tools to automate this kind of analysis on PHP applications. Later on, RIPS (Dahse and Holz, 2014) extended this technique with the ability to process more advanced constructs of PHP (e.g., objects). phpSAFE (Fonseca and Vieira, 2014) is a recent solution that does taint analysis to look for flaws in CMS plugins (e.g., WordPress plugins). WAP (Medeiros et al., 2016b, c) also does taint analysis, but aims at reducing the number of false positives by resorting to data mining, besides also automatically correcting the located bugs. Other works (Yamaguchi et al., 2014, 2015) detect vulnerabilities by processing source code properties represented as graphs. In this paper, we propose a novel approach which, unlike these works, does not involve programming information about bugs, but instead extracts this knowledge from annotated code samples and thus learns to find the vulnerabilities.

Machine learning has been used in a few works to measure the quality of software by collecting a series of attributes that reveal the presence of software defects (Arisholm et al., 2010; Lessmann et al., 2008). Other approaches resort to machine learning to predict if there are vulnerabilities in a program (Neuhaus et al., 2007; Walden et al., 2009; Perl et al., 2015), which is different from identifying the bugs precisely, something that we do in this paper. To support the predictions they employ various features, such as past vulnerabilities and function calls (Neuhaus et al., 2007), or a combination of code-metric analysis with metadata gathered from application repositories (Perl et al., 2015). In particular, PhpMinerI and PhpMinerII predict the presence of vulnerabilities in PHP programs (Shar and Tan, 2012a, b; Shar et al., 2013). The tools are first trained with a set of annotated slices that end at a sensitive sink (but do not necessarily start at an entry point), and then they are ready to identify slices with errors. WAP and WAPe are different because they use machine learning and data mining to predict if a vulnerability detected by taint analysis is actually a real bug or a false alarm (Medeiros et al., 2016b, c). In any case, the PhpMiner and WAP tools employ standard classifiers (e.g., Logistic Regression or a Multi-Layer Perceptron) instead of structured prediction models (i.e., sequence classifiers) as we propose here.

There are a few static analysis tools that implement machine learning techniques. Chucky (Yamaguchi et al., 2013) discovers vulnerabilities by identifying missing checks in C language software. VulDeePecker (Li et al., 2018) resorts to code gadgets to represent parts of C programs and then transforms them into vectors; a neural network system then determines if the target program is vulnerable due to buffer or resource management errors. Russell et al. (Russell et al., 2018) developed a vulnerability detection tool for C and C++ based on features learned from a dataset and an artificial neural network. Scandariato et al. (Scandariato et al., 2014) perform text mining to predict vulnerable software components in Android applications. SuSi (Rasthofer et al., 2014) employs machine learning to classify sources and sinks in the code of the Android API.

This paper extends our previous work (Medeiros et al., 2016a). Our approach extracts PHP slices, but contrary to the others it translates them into a tokenized language to be processed by a HMM. While tools in the literature collect attributes from a slice and classify them without considering ordering relations among statements, which is simplistic, DEKANT also performs classification but takes into account the place in which code elements appear in the slice. This form of classification supports a more accurate and precise detection of bugs.

(a) code with SQLI vulnerability:
  1 $u = $_POST['username'];
  2 $q = "SELECT pass FROM users WHERE user='".$u."'";
  3 $r = mysqli_query($con, $q);
(b) slice-isl, variable map, and tainted list (TL):
  1: input var      map: 1 - u      TL = {u}
  2: var var        map: 1 u q      TL = {u, q}
  3: ss var var     map: 1 - q r    TL = {u, q, r}
(c) outputting the final classification:
  1: input,Taint var_vv_u,Taint
  2: var_vv_u,Taint var_vv_q,Taint
  3: ss,N-Taint var_vv_q,Taint var_vv_r,Taint
Figure 1. Code vulnerable to SQLI, translation into ISL, and detection of the vulnerability.
(a) code with XSS vulnerability and validation:
  1 $u = isset($_POST['name']) ? $_POST['name'] : '';
  2 $a = $_POST['age'];
  3 if (isset($a) && preg_match('/[a-zA-Z]+/', $u) && is_int($a))
  4    echo 'input type="hidden" name="user" value="'.$u.'"';
  5 else
  6    echo $u . "is an invalid user";
(b) slice-isl and variable map:
  1: input var                                          map: 1 - u
  2: input var                                          map: 1 - a
  3: cond fillchk var contentchk var typechk var cond   map: 0 - - a - u - a -
  4: cond ss var                                        map: 0 - - u
  5: cond                                               map: 0 -
  6: ss var                                             map: 0 - u
(c) artefacts lists:
  1: TL = {u}; CTL = {}
  2: TL = {u, a}; CTL = {}
  3: TL = {u, a}; CTL = {u, a}
  4: TL = {u, a}; CTL = {u, a}
  5: TL = {u, a}; CTL = {}
  6: TL = {u, a}; CTL = {}
Figure 2. Code with a slice vulnerable to XSS (lines {1, 3, 5, 6}) and a slice not vulnerable (lines {1, 2, 3, 4}), with ISL translation.

3. Surface Vulnerabilities

Many classes of security flaws in web applications are caused by improper handling of user inputs, so they are called surface vulnerabilities or input validation vulnerabilities. In PHP programs the malicious input arrives at the application (e.g., via $_POST), may then suffer various modifications and be copied to variables, and eventually reaches a security-sensitive function (e.g., mysqli_query or echo), inducing an erroneous action. Below, we introduce the 12 classes of surface vulnerabilities considered in the rest of the paper.

SQLI is the class of vulnerabilities with the highest risk in the OWASP Top 10 list (Williams and Wichers, 2017). Normally, the malicious input is used to change the behavior of a database query to provoke the disclosure of private data or to corrupt the tables.

Example 3.1 ().

The PHP script of Fig. 1 (a) has a simple SQLI vulnerability. $u receives the username provided by the user (line 1), and is then inserted in a query (lines 2-3). An attacker can inject a malicious username like ' OR 1 = 1 -- , modifying the structure of the query and obtaining the passwords of all users.
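To make the effect of the injection concrete, the string concatenation of Fig. 1 (a) can be replayed outside PHP. The sketch below (in Python, purely for illustration) builds the same query with the malicious username and shows the resulting SQL, where the WHERE clause becomes always true and the trailing quote is commented out:

```python
# Replaying the concatenation of Fig. 1(a), line 2, with the injected username.
u = "' OR 1 = 1 -- "
q = "SELECT pass FROM users WHERE user='" + u + "'"
print(q)
# SELECT pass FROM users WHERE user='' OR 1 = 1 -- '
```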

XSS vulnerabilities allow attackers to execute scripts in the users’ browsers. Below we give an example:

Example 3.2 ().

The code snippet of Fig. 2 (a) has an XSS vulnerability. If the user provides a name, it gets saved in $u (line 1). If the conditional validation fails (line 3), the value is returned to the user by echo (line 6). A script provided as input would be executed in the browser, possibly carrying out some malicious deed.

The other classes are presented briefly. Remote and local file inclusion (RFI/LFI) flaws also allow attackers to insert code into the vulnerable web application. While in RFI the code can be located on another web site, in LFI it has to be in the local file system (but there are also several strategies to put it there). OS command injection (OSCI) lets an attacker provide commands to be run in a shell of the web server's OS. Attackers can supply code that is executed by an eval function by exploiting PHP command injection (PHPCI) bugs. LDAP injection (LDAPI), like SQLI, is associated with the construction and execution of queries, in this case for the LDAP service. An attacker can read files from the local file system by exploiting directory traversal / path traversal (DT/PT) and source code disclosure (SCD) vulnerabilities. A comment spamming (CS) bug is related to ranking manipulation of spammers' web sites. Header injection or HTTP response splitting (HI) allows an attacker to manipulate the HTTP response. An attacker can force a web client to use a session ID of his choosing by exploiting a session fixation (SF) flaw.

4. Overview of the Approach

Our approach for vulnerability detection examines program slices to determine if they contain a bug. The slices are collected from the source code of the target application, and their instructions are then represented in an intermediate language developed to express features that are relevant to surface vulnerabilities. Bugs are found by classifying the translated instructions with an HMM sequence model. Since the model has an understanding of how data flows are affected by operations related to sanitization, validation and modification, it becomes feasible to make an accurate analysis. To set up the model, there is a learning phase where an annotated corpus is employed to derive knowledge about the different classes of vulnerabilities. Afterwards, the model is used to detect vulnerabilities. Fig. 3 illustrates this procedure.

Figure 3. Overview on the proposed approach.

In more detail, the following steps are carried out. The learning phase is composed mainly of steps (1)-(3) while the detection phase encompasses (1) and (4):

(1) Slice collection and translation: get the slices from the application source code (either for learning or detection). Since we focus on surface vulnerabilities, the only slices that have to be considered start at some point in the program where a user input is received (i.e., at an entry point) and end at a security-sensitive instruction (i.e., a sensitive sink). The resulting slice is a series of tracked instructions between the two points. Then, each instruction of a slice is translated into the Intermediate Slice Language (ISL) (Section 5). ISL is a categorized language with grammar rules that aggregate code elements into classes by functionality. A slice in the ISL format is named a slice-isl;

(2) Create the corpus: build a corpus with a group of instructions represented in the intermediate language, which are labeled either as vulnerable or non-vulnerable. The instructions are provided individually or gathered from slices of training programs. Overall, the corpus includes representative pieces of programs that have various kinds of flaws, as well as pieces that handle inputs adequately;

(3) Knowledge extraction: acquire knowledge from the corpus to configure the HMM sequence model, namely compute the probability matrices;

(4) Search for vulnerabilities: use the model to find the best sequence of states that explains a slice in the intermediate language. Each instruction in the slice corresponds to a sequence of observations. These observations are classified by the model, which tracks the variables from previous instructions to determine which emission probabilities are selected. The state computed for the last observation of the last instruction determines the overall classification, either vulnerable or not. If a flaw is found, an alert is reported including the location in the source code.
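The pipeline formed by steps (1) and (4) can be sketched end-to-end on the slice of Fig. 1 (a). The sketch below is illustrative only: the helper names, the tiny token map, and the stand-in classifier are our own inventions, and a real detector runs Viterbi over the trained HMM instead of the simple check shown here.

```python
# Toy sketch of the detection phase on one hard-coded slice (Fig. 1(a)).
# TOKEN_MAP is a tiny excerpt of the ISL classes in Table 1.
TOKEN_MAP = {"$_POST": "input", "mysqli_query": "ss"}

def to_isl(instruction):
    """Step 1 (translation): map known code elements to ISL tokens,
    defaulting every variable to the token 'var'."""
    return [TOKEN_MAP.get(elem, "var") for elem in instruction]

def classify(slice_isl):
    """Stand-in for step 4: a real implementation decodes the slice with an
    HMM; here we merely flag a slice where an input reaches a sensitive sink."""
    flat = [t for instr in slice_isl for t in instr]
    return "Taint" if "input" in flat and "ss" in flat else "N-Taint"

# The slice from Fig. 1(a), already broken into code elements per instruction
# (right-hand side first, matching the ISL assignment rule).
slice_ = [["$_POST", "$u"], ["$u", "$q"], ["mysqli_query", "$q", "$r"]]
slice_isl = [to_isl(i) for i in slice_]
print(slice_isl)            # [['input', 'var'], ['var', 'var'], ['ss', 'var', 'var']]
print(classify(slice_isl))  # Taint
```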

The next two sections explain the ISL language and the sequence model (Sections 5 and 6). Then, the four steps above are elaborated in Section 7. An overview of the tool that implements our approach is given in the section on the implementation and evaluation of DEKANT.

5. Intermediate Slice Language

All slices commence with an entry point and finish with a sensitive sink; between them there can be an arbitrary number of statements, such as assignments that transmit data to intermediate variables and various kinds of expressions that validate or modify the data. In other words, a slice contains all instructions (lines of code) that manipulate and propagate an input arriving at an entry point and until a sensitive sink is reached, but no other statements.

ISL expresses an instruction as a few tokens. Instructions are composed of code elements that are categorized into classes of related items (e.g., class input covers PHP entry points like $_GET and $_POST). Therefore, classes are the tokens of the ISL language, and they are organized according to a grammar. Next we give a more careful explanation of ISL, assuming that the source code is programmed in PHP. However, the approach is generic and other languages could be considered.

5.1. Tokens

ISL abstracts away aspects of the PHP language that are irrelevant to the discovery of surface vulnerabilities. Therefore, as a starting point to specify ISL, it was necessary to identify the essential tokens. To achieve this, we followed an iterative approach, beginning with an initial group of tokens that was gradually refined. In every iteration, we examined various slices (vulnerable and not) to recognize the important code elements. We also looked at the PHP instructions that could manipulate entry points and be associated with bugs or prevent them (e.g., functions that replace characters in strings). In addition, for PHP functions, we carefully studied their parameters to determine which of them are crucial for our analysis. In the end, we defined around twenty tokens that are sufficient to describe the instructions of a PHP program.

Example 5.1 ().

Function mysqli_query and its parameters correspond to two tokens: ss for the sensitive sink, and var for a variable (or input if the parameter receives data originating from an entry point). Although this function has three parameters (the last of them optional), just one of them (the second) is essential to represent.

Token Description PHP Function Taint
input entry point $_GET, $_POST, $_COOKIE, $_REQUEST Yes
$_HTTP_GET_VARS, $_HTTP_POST_VARS
$_HTTP_COOKIE_VARS, $_HTTP_REQUEST_VARS
$_FILES, $_SERVERS
var variable No
sanit_f sanitization function mysql_escape_string, mysql_real_escape_string No
mysqli_escape_string, mysqli_real_escape_string
mysqli_stmt_bind_param, mysqli::escape_string
mysqli::real_escape_string, mysqli_stmt::bind_param
htmlentities, htmlspecialchars, strip_tags, urlencode
ss sensitive sink mysql_query, mysql_unbuffered_query, mysql_db_query Yes
mysqli_query, mysqli_real_query, mysqli_master_query
mysqli_multi_query, mysqli_stmt_execute, mysqli_execute
mysqli::query, mysqli::multi_query, mysqli::real_query
mysqli_stmt::execute
fopen, file_get_contents, file, copy, unlink, move_uploaded_file
imagecreatefromgd2, imagecreatefromgd2part, imagecreatefromgd
imagecreatefromgif, imagecreatefromjpeg, imagecreatefrompng
imagecreatefromstring, imagecreatefromwbmp
imagecreatefromxbm, imagecreatefromxpm
require, require_once, include, include_once
readfile
passthru, system, shell_exec, exec, pcntl_exec, popen
echo, print, printf, die, error, exit
file_put_contents, file_get_contents
eval
typechk_str type checking string function is_string, ctype_alpha, ctype_alnum Yes
typechk_num type checking numeric function is_int, is_double, is_float, is_integer No
is_long, is_numeric, is_real, is_scalar, ctype_digit
contentchk content checking function preg_match, preg_match_all, ereg, eregi No
strnatcmp, strcmp, strncmp, strncasecmp, strcasecmp
fillchk fill checking function isset, empty, is_null Yes
cond if instruction presence if No
join_str join string function implode, join No
erase_str erase string function trim, ltrim, rtrim Yes
replace_str replace string function preg_replace, preg_filter, str_ireplace, str_replace No
ereg_replace, eregi_replace, str_shuffle, chunk_split
split_str split string function str_split, preg_split, explode, split, spliti Yes
add_str add string function str_pad Yes/No
sub_str substring function substr Yes/No
sub_str_replace replace substring function substr_replace Yes/No
char5 substring with less than 6 chars No
char6 substring with more than 5 chars Yes
start_where where the substring starts Yes/No
conc concatenation operator Yes/No
var_vv variable tainted Yes
miss miss value Yes/No
Table 1. Intermediate Slice Language tokens.

Table 1 summarizes the currently defined ISL tokens. The first column shows the tokens: the first twenty stand for PHP code elements, whereas the last two (var_vv and miss) are necessary only for the description of the corpus and the implementation of the model. The next two columns succinctly explain the purpose of each token and give a few examples. Column four defines the taintedness status of each token, which is used when building the corpus or performing the analysis.

A more careful inspection of the tokens shows that they enable many relevant behaviors to be expressed. For example: since the manipulation of strings plays a fundamental role in the exploitation of surface vulnerabilities, there are various tokens that enable a precise modeling of these operations (e.g., erase_str or sub_str); tokens char5 and char6 encode the number of characters manipulated by functions that extract or replace the contents of a user input; the place in a string where modifications are applied (begin, middle or end) is described by start_where; and token cond corresponds to an if statement that might have validation functions over variables (e.g., user inputs) as part of its conditional expression, allowing the correlation between the validated variables and the variables that appear inside the if branches.

There are a few tokens that are context-sensitive, i.e., whose selection depends not only on the code elements being translated but also on how they are used in the program. Tokens char5 and char6 are two examples, as they depend on the substring length. If this length is only defined at runtime, it is impossible to know precisely which token should be assigned. This ambiguity may cause errors in the analysis, either false positives or false negatives. However, since we prefer to be conservative (i.e., to report false positives rather than miss vulnerabilities), when the length is undefined ISL uses the char6 token, because it allows larger payloads to be manipulated. Something similar occurs with the contentchk token, which depends on the verification pattern.
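The conservative choice between char5 and char6 can be written down directly. The sketch below is illustrative (the function name is ours); it follows Table 1, where char5 covers substrings of fewer than 6 characters and char6 the rest, and defaults to char6 when the length is unknown at analysis time:

```python
# Conservative token selection for substring lengths (char5 vs. char6).
# length_token is an illustrative helper name, not part of DEKANT.

def length_token(length):
    """Return 'char5' for substrings of fewer than 6 chars, 'char6' otherwise.
    If the length is unknown at analysis time (None), be conservative and
    pick 'char6', since it admits larger attack payloads."""
    if length is None:
        return "char6"
    return "char5" if length < 6 else "char6"

print(length_token(3))     # char5
print(length_token(10))    # char6
print(length_token(None))  # char6 (conservative default)
```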

ISL must be able to represent PHP instructions in all steps of the two phases of the approach. When slices are extracted for analysis, ISL sets all variables to the default token var. However, when instructions are placed in the corpus or are processed by the detection procedure, it is necessary to keep information about taintedness; in this case, tainted and untainted variables are represented respectively by the tokens var_vv and var. The miss token is also used with the corpus and serves to normalize the length of sequences (see the implementation section).
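The var/var_vv distinction can be sketched for a single sensitive-sink call, such as the mysqli_query call of Example 5.1. The helper name and the tiny sink set below are our own illustrations, not DEKANT's actual code:

```python
# Sketch of translating one sensitive-sink call into ISL, keeping the
# var vs. var_vv taintedness distinction described above.
SENSITIVE_SINKS = {"mysqli_query", "echo", "eval"}  # excerpt of the ss class

def translate_call(func, args, tainted_vars):
    """Emit 'ss' for a known sensitive sink, then one token per argument:
    'var_vv' if the variable is known to carry input data, else 'var'."""
    tokens = ["ss"] if func in SENSITIVE_SINKS else [func]
    for a in args:
        tokens.append("var_vv" if a in tainted_vars else "var")
    return tokens

# Only the second parameter of mysqli_query matters for the analysis
# (Example 5.1), so the connection handle is dropped before translation.
print(translate_call("mysqli_query", ["$q"], tainted_vars={"$u", "$q"}))
# ['ss', 'var_vv']
```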

5.2. Grammar

The ISL grammar is specified by the rules in Listing 1. It maps the code elements included in the instructions to tokens (entries in column 3 of Table 1 are transformed into the column-1 tokens). A slice translated into ISL consists of a set of statements (line 2), each one defined by either: a rule that covers various operations like string concatenation (lines 4-11); a conditional (line 12); or an assignment (line 13). The rules take into consideration the syntax of the functions (in column 3 of the table) in order to convey: a sensitive sink (line 4), sanitization (line 5), validation (line 6), extraction and modification (lines 7-10), and concatenation (line 11).

 1 grammar isl {
 2   slice-isl : statement+
 3   statement :
 4       sensitive_sink
 5     | sanitization
 6     | validation
 7     | mod_all
 8     | mod_add
 9     | mod_sub
10     | mod_rep
11     | concat
12     | cond statement+ cond?
13     | assignment
14   sensitive_sink : ss (param | concat)
15   sanitization : sanit_f param
16   validation : (typechk_str | typechk_num | fillchk | contentchk) param
17   mod_all : (join_str | erase_str | replace_str | split_str) param
18   mod_add : add_str param num_chars param
19   mod_sub : sub_str param num_chars start_where?
20   mod_rep : sub_str_replace param num_chars param start_where?
21   concat : (statement | param) (conc concat)?
22   assignment : (statement | param) attrib_var
23   param : input | var
24   attrib_var : var
25   num_chars : char5 | char6
26 }
Listing 1: Grammar rules of ISL.

As we will see in Section 6, tokens correspond to the observations of the HMM. However, while a PHP assignment sets the left-hand side to the value of the right-hand-side expression, the tokens are processed from left to right by the model; therefore, the assignment rule in ISL places the right-hand side first, following the HMM scheme.

Example 5.2 ().

The PHP instruction $u = $_GET['user']; is translated to input var. The assignment and parameter rules (lines 13, 22 and 23) derive the input token, while the attribution rule produces the var token (line 24).

6. The Sequence Model

This section presents the sequence model that supports vulnerability detection. It explains the graph that represents the model, identifying the states and the observations that can be emitted.

6.1. Hidden Markov Model

A Hidden Markov Model (HMM) is a statistical generative model that represents a process as a Markov chain with unobserved (hidden) states. It is a dynamic Bayesian network with nodes that stand for random variables and edges that denote probabilistic dependencies between these variables (Baum and Petrie, 1966; Jurafsky and Martin, 2008; Smith, 2011). The variables are divided into two groups: observed variables – observations – and hidden variables – states. A state transitions to other states with some probability and emits observations (see example in Fig. 5).

A HMM is specified by the following: (1) a vocabulary, i.e., the set of words, symbols or tokens that can make up a sequence of observations; (2) the states, the set of states used to classify the observations of a sequence; (3) the parameters, a set of probabilities where (i) the initial probabilities indicate, for each state, the probability that a sequence of observations begins in that state; (ii) the transition probabilities give the probability of moving from one state to another; and (iii) the emission probabilities specify the probability of a state emitting a given observation.

In the context of NLP, sequence models are used to classify a series of observations, which correspond to the succession of words observed in a sentence. In particular, a HMM is used in PoS-tagging tasks, allowing the discovery of the series of states that best explains a new sequence of observations. This is known as the decoding problem, which can be solved by the Viterbi algorithm (Viterbi, 1967). The algorithm resorts to dynamic programming to pick the best hidden state sequence. Although the Viterbi algorithm employs bigrams to generate the i-th state, it implicitly takes into account all previously generated states. In a nutshell, the algorithm iteratively obtains the probability distribution for the i-th state based on the probabilities computed for the (i-1)-th state, taking into consideration the parameters of the model.
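To make the decoding step concrete, here is a minimal Viterbi sketch over toy parameters. The probabilities are invented for illustration (they are not the trained values of the model), and the state and token names follow the paper.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely hidden-state sequence for a list of observations."""
    # score[s]: probability of the best path ending in state s
    score = {s: start_p[s] * emit_p[s].get(obs[0], 0.0) for s in states}
    back = []
    for o in obs[1:]:
        prev, score, choice = score, {}, {}
        for s in states:
            # best predecessor of s under the transition model
            p, b = max((prev[q] * trans_p[q][s], q) for q in states)
            score[s] = p * emit_p[s].get(o, 0.0)
            choice[s] = b
        back.append(choice)
    # follow back-pointers from the best final state
    last = max(score, key=score.get)
    path = [last]
    for choice in reversed(back):
        path.append(choice[path[-1]])
    return list(reversed(path))

# Toy model with three of the paper's states and invented probabilities.
states = ["Taint", "N-Taint", "San"]
start = {"Taint": 0.5, "N-Taint": 0.3, "San": 0.2}
trans = {
    "Taint":   {"Taint": 0.7, "N-Taint": 0.2, "San": 0.1},
    "N-Taint": {"Taint": 0.2, "N-Taint": 0.7, "San": 0.1},
    "San":     {"Taint": 0.1, "N-Taint": 0.5, "San": 0.4},
}
emit = {
    "Taint":   {"input": 0.5, "var_vv": 0.5},
    "N-Taint": {"var": 0.6, "ss": 0.2, "cond": 0.2},
    "San":     {"sanit_f": 0.5, "input": 0.3, "var": 0.2},
}
viterbi(["sanit_f", "input", "var"], states, start, trans, emit)
# → ['San', 'San', 'N-Taint']
```

The decoded sequence mirrors the sanitization example of Fig. 5: the final state is N-Taint, so the instruction is classified as not vulnerable.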

The parameters of the HMM are learned by processing a corpus that is created for training. Observations and state transitions are counted, and afterwards the counts are normalized in order to obtain probability distributions; a smoothing procedure may also be applied to deal with rare events in the training data (e.g., add-one smoothing).

State     Description     Emitted observations
Taint     Tainted         conc, input, var, var_vv
N-Taint   Not tainted     conc, cond, input, var, var_vv, ss
San       Sanitization    input, sanit_f, var, var_vv
Val       Validation      contentchk, fillchk, input, typechk_num, typechk_str, var, var_vv
Chg_str   Change string   add_str, char5, char6, erase_str, input, join_str, replace_str, split_str, start_where, sub_str, sub_str_replace, var, var_vv

Table 2. HMM states and the observations they emit.

6.2. Vocabulary and States

As our HMM operates over program instructions translated into ISL, the vocabulary is composed of the previously described ISL tokens. The states are selected to represent the fundamental operations that can be performed on the input data as it flows through a slice. Five states were defined, as displayed in Table 2. The final state of an instruction in ISL is either vulnerable (Taint) or not-vulnerable (N-Taint). However, in order to attain an accurate detection, it is necessary to take into account the sanitization (San), validation (Val) and modification (Chg_str) of the user inputs and of the variables that may depend on them. Therefore, these three factors are represented as intermediate states in the model. As strings are at the root of web application vulnerabilities, these three states allow the model to determine the intermediate state while an application manipulates them.

Figure 4. Model graph of the proposed HMM.

6.3. Graph of the Model

Our HMM consists of the graph in Fig. 4, where the nodes constitute the states and the edges the transitions between them. The dashed squares next to the nodes hold the observations that can be emitted in each state.

An ISL instruction corresponds to a sequence of observations. The sequence can start in any state except Val. The Val state can nevertheless be reached, for example due to conditionals that check the input data. In the example of Fig. 2(b), line 3 contains a sequence that initiates with a cond observation, which could be emitted by the N-Taint initial state. The model would then transition to the Val state due to the check carried out in the if conditional. When the processing of the sequence completes, the model is always in either the Taint or the N-Taint state. Therefore, the final state determines the overall classification of the statement, i.e., whether the instruction is vulnerable or not.

Example 6.1 ().

Fig. 5 shows an instantiation of the model for one sequence. The sanitization instruction is translated to the ISL sequence sanit_f input var. The sequence starts in the San state and emits the sanit_f observation; next, it remains in the same state and emits the input observation; then, it transitions to the N-Taint state, emitting the var observation (untainted variable).

(a) PHP instruction: $p = mysqli_real_escape_string($con, $_GET['user'])

ISL instruction: sanit_f input var

Sequence: <sanit_f,San> <input,San> <var,N-Taint>

Figure 5. Graph instantiation for an example sequence.

7. Learning and Vulnerability Detection

This section explains the main activities related to our approach. The learning phase encompasses a number of activities that culminate in the computation of the parameters of the HMM. Afterwards, in the detection phase, vulnerabilities are found by processing the slices of the target application through the model. Fig. 3 illustrates the fundamental steps.

7.1. Slice Extraction and Translation Process

The slice extractor analyses files with the source code, gathering the slices that start with an entry point and eventually reach a security-sensitive sink. The instructions between these points are those that implement the application logic based on the user input data. The slice extractor performs intra- and inter-procedural analysis, as it tracks the inputs and their dependencies along the program, walking through the invoked functions. The analysis is context-sensitive, as it takes into account the results of function calls.

A translation process occurs as the instructions are collected, and consists of representing them as ISL tokens. However, ISL does not keep much information about the variables portrayed by the var token. This knowledge is nevertheless crucial for a more accurate vulnerability detection, as variables are related to the inputs in distinct manners and their contents can suffer all sorts of modifications. Therefore, to address this issue, we update a data structure called the variable map while the slice is translated. The map associates each occurrence of var in the slice-isl with the name of the variable that appears in the source code. This lets us track how input data propagates to different variables when the slice code elements are processed.

There is one entry in the variable map per instruction. Each entry starts with a flag, 1 or 0, indicating whether the statement is an assignment or not. The rest of the entry includes one value per token of the instruction, which is either the name of the variable (without the $) or the - character (standing for a token position that is not occupied by a variable).
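The construction of one such entry can be sketched as follows; the helper name is ours, not DEKANT's.

```python
# Sketch of building one variable-map entry: a leading 0/1 assignment flag,
# then one value per ISL token -- the variable name (without '$') or '-'
# when the token position holds no variable.
def map_entry(names, is_assignment):
    """names: variable name per ISL token, or None for a non-variable token."""
    return (["1" if is_assignment else "0"]
            + [n if n is not None else "-" for n in names])

# $u = $_POST['username'];  ->  ISL: input var
entry = map_entry([None, "u"], is_assignment=True)
# → ['1', '-', 'u']
```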

Example 7.1 ().

Fig. 1(a) displays a PHP code snippet that is vulnerable to SQLI and Fig. 1(b) shows the translation into ISL and the variable map (ignore the right-hand side for now). The first line is the assignment of an input to a variable, $u = $_POST['username'];. As explained above, it becomes input var in ISL. The corresponding variable map entry is 1 - u: the flag 1 denotes that the instruction is an assignment, and u names the variable matching the var token in the second position. The next line is an assignment of a SQL query composed by concatenating constant substrings with a variable. It is represented in ISL by var var and in the variable map by 1 u q. The last line corresponds to a sensitive sink (ss) and two variables.

Example 7.2 ().

Fig. 2 has a slightly more complex code snippet. The slice extractor takes two slices from the code: lines {1, 2, 3, 4} and {1, 3, 5, 6}. The first prevents an attack with a form of input validation, but the second is vulnerable to XSS. The corresponding ISL and variable map are shown in the middle columns. The interesting cases are in lines 3 and 4, which are the if statement and its true branch. Both are prefixed with the cond token, and the former also ends with the same token. This cond termination distinguishes between the two types of instructions. In addition, the sequence model will understand that variables from the former may influence those that appear in later instructions.

7.2. Process of Creating the Corpus

The corpus plays an important role as it incorporates the knowledge that will be learned by the model, namely which instructions may lead to a flaw. In our case, the corpus is a group of instructions (not slices) converted to ISL, where tokens are tagged with information related to taint propagation. The model sees the tokens of an instruction in ISL as a sequence of observations. The tags correspond to the states of the model. Therefore, an alternative way to look at the corpus is as a group of sequences of observations annotated with states.

The corpus is built in four steps: (1) collection of a group of instructions that are vulnerable and not-vulnerable, which are placed in a bag; (2) representation of each instruction in the bag in ISL; (3) annotation of the tokens of every instruction (e.g., as tainted or sanitized), i.e., association of a state to each observation of the sequence; and (4) removal of duplicated entries in the bag. In the end, an instruction becomes a list of <token,state> pairs.

In the first step, it is necessary to get instructions representative of all classes of bugs that one wants to catch, of various forms of validation, of diverse forms of manipulating (changing) strings, and of different combinations of code elements. To achieve this in practice, we can gather individual instructions and/or select a large number of slices captured from open-source training applications. Therefore, both the collection and the representation can be performed automatically (with the slice collector module), but the annotation of the tokens is done manually (as in all supervised machine learning approaches).
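The last three steps can be sketched as follows; this assumes the instructions are already translated to ISL and hand-annotated with states, and the tainted-variable rewrite to var_vv follows the convention explained below.

```python
# Sketch of corpus construction: rewrite tainted variables as var_vv and
# drop duplicated annotated sequences (step 4).
def build_corpus(annotated):
    """annotated: one [(token, state), ...] list per instruction."""
    seen, corpus = set(), []
    for seq in annotated:
        # a variable annotated as tainted is written var_vv in the corpus
        seq = tuple(("var_vv", s) if t == "var" and s == "Taint" else (t, s)
                    for t, s in seq)
        if seq not in seen:            # step (4): drop duplicated entries
            seen.add(seq)
            corpus.append(list(seq))
    return corpus

corpus = build_corpus([
    [("input", "Taint"), ("var", "Taint")],   # $var = $_POST['parameter']
    [("input", "Taint"), ("var", "Taint")],   # $var = $_GET['parameter'] (duplicate)
    [("sanit_f", "San"), ("input", "San"), ("var", "N-Taint")],
])
# two unique sequences remain; the tainted var becomes var_vv
```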

Example 7.3 ().

The instruction $var = $_POST['parameter'] becomes input var in ISL and is annotated as <input,Taint> <var_vv,Taint>. Both states are Taint (compromised) because an input can be the source of malicious data, and therefore is always tainted, and the taint propagates to the variable.

As mentioned in the previous section, the token var_vv is not produced when slices are translated into ISL, but it is used in the corpus to represent variables with state Taint (tainted variables). In fact, during translation into ISL it is not known whether a variable is tainted, so it is represented by the var token. In the corpus, if the state of the variable is annotated as Taint, the variable is portrayed by var_vv, forming the pair <var_vv,Taint>.

The state of the last observation of a sequence corresponds to a final state, and therefore it can only be Taint (vulnerable) or N-Taint (not-vulnerable). If this state is tainted then it means that a malicious input is able to propagate and potentially compromise the execution. Therefore, in this case, the instruction is perceived as vulnerable. Otherwise, the instruction is deemed correct (non-vulnerable).

Example 7.4 ().

The instruction $v = htmlentities($_GET['user']) is translated to sanit_f input var and placed in the corpus as the succession of pairs <sanit_f,San> <input,San> <var,N-Taint>. The first two tokens are annotated with the San state because the function htmlentities sanitizes its parameter; the last token is labeled with the N-Taint state, meaning that the final state of the sequence is not tainted.

\begin{figure}
\begin{lstlisting}[breaklines=true, firstnumber=1, numbersep=5pt]
1  $var = $_POST['parameter']
2  $var = $_GET['parameter']
3  $var = htmlentities($_POST['parameter'])
4  $var = mysqli_real_escape_string($con, $_GET['parameter'])
5  $var = htmlentities($var)
6  $var = "SELECT field FROM table WHERE field = $var"
7  $var = mysqli_query($con, $var)
8  $var = mysql_query($var)
9  echo $var
10 include($var)
11 $var = (isset($var)) ? $var : ''
12 if (isset($var) && $var > number)
13 if (is_string($var) && preg_match('pattern', $var))
14 if (isset($var) && preg_match('pattern', $var) && is_int($var))
\end{lstlisting}
\end{figure}


\begin{figure}%[b]{1\columnwidth}
\begin{lstlisting}[breaklines=true, firstnumber=1, frame={top}, numbersep=5pt]
1  $var = $_POST['parameter']
   input var_vv
2  $var = $_GET['parameter']
   input var_vv
3  $var = htmlentities($_POST['parameter'])
   sanit_f input var
4  $var = mysqli_real_escape_string($con, $_GET['parameter'])
   sanit_f input var
5  $var = htmlentities($var)
   sanit_f var var
   sanit_f var_vv var
6  $var = "SELECT field FROM table WHERE field = $var"
   var var
   var_vv var_vv
7  $var = mysqli_query($con, $var)
   ss var var
   ss var_vv var_vv
8  $var = mysql_query($var)
   ss var var
   ss var_vv var_vv
9  echo $var
   ss var_vv
   ss var
10 include($var)
   ss var_vv
   ss var
11 $var = (isset($var)) ? $var : ''
   var var
   var_vv var_vv
12 if (isset($var) && $var > number)
   cond fillchk var_vv cond
   cond fillchk var cond
13 if (is_string($var) && preg_match('pattern', $var))
   cond typechk_str var_vv contentchk var_vv cond
   cond typechk_str var_vv contentchk var cond
   cond typechk_str var contentchk var_vv cond
   cond typechk_str var contentchk var cond
14 if (isset($var) && preg_match('pattern', $var) && is_int($var))
   cond typechk_str var_vv contentchk var_vv typechk_int var_vv cond
   cond typechk_str var_vv contentchk var_vv typechk_int var cond
   cond typechk_str var_vv contentchk var typechk_int var_vv cond
   cond typechk_str var_vv contentchk var typechk_int var cond
   cond typechk_str var contentchk var_vv typechk_int var_vv cond
   cond typechk_str var contentchk var_vv typechk_int var cond
   cond typechk_str var contentchk var typechk_int var_vv cond
   cond typechk_str var contentchk var typechk_int var cond
\end{lstlisting}
\end{figure}


\begin{figure}[htb]
\begin{lstlisting}[breaklines=true, numbersep=5pt]
<input,Taint> <var_vv,Taint>
<sanit_f,San> <input,San> <var,N-Taint>
<sanit_f,San> <var,San> <var,N-Taint>
<sanit_f,San> <var_vv,San> <var,N-Taint>
<var,N-Taint> <var,N-Taint>
<var_vv,Taint> <var_vv,Taint>
<ss,N-Taint> <var,N-Taint> <var,N-Taint>
<ss,N-Taint> <var_vv,Taint> <var_vv,Taint>
<ss,N-Taint> <var_vv,Taint>
<ss,N-Taint> <var,N-Taint>
<cond,N-Taint> <fillchk,Val> <var_vv,Val> <cond,N-Taint>
<cond,N-Taint> <fillchk,Val> <var,Val> <cond,N-Taint>
<cond,N-Taint> <typechk_str,Val> <var_vv,Val> <contentchk,Val> <var_vv,Val> <cond,N-Taint>
<cond,N-Taint> <typechk_str,Val> <var_vv,Val> <contentchk,Val> <var,Val> <cond,N-Taint>
<cond,N-Taint> <typechk_str,Val> <var,Val> <contentchk,Val> <var_vv,Val> <cond,N-Taint>
<cond,N-Taint> <typechk_str,Val> <var,Val> <contentchk,Val> <var,Val> <cond,N-Taint>
<cond,N-Taint> <typechk_str,Val> <var_vv,Val> <contentchk,Val> <var_vv,Val> <typechk_int,Val> <var_vv,Val> <cond,N-Taint>
<cond,N-Taint> <typechk_str,Val> <var_vv,Val> <contentchk,Val> <var_vv,Val> <typechk_int,Val> <var,Val> <cond,N-Taint>
<cond,N-Taint> <typechk_str,Val> <var_vv,Val> <contentchk,Val> <var,Val> <typechk_int,Val> <var_vv,Val> <cond,N-Taint>
<cond,N-Taint> <typechk_str,Val> <var_vv,Val> <contentchk,Val> <var,Val> <typechk_int,Val> <var,Val> <cond,N-Taint>
<cond,N-Taint> <typechk_str,Val> <var,Val> <contentchk,Val> <var_vv,Val> <typechk_int,Val> <var_vv,Val> <cond,N-Taint>
<cond,N-Taint> <typechk_str,Val> <var,Val> <contentchk,Val> <var_vv,Val> <typechk_int,Val> <var,Val> <cond,N-Taint>
<cond,N-Taint> <typechk_str,Val> <var,Val> <contentchk,Val> <var,Val> <typechk_int,Val> <var_vv,Val> <cond,N-Taint>
<cond,N-Taint> <typechk_str,Val> <var,Val> <contentchk,Val> <var,Val> <typechk_int,Val> <var,Val> <cond,N-Taint>
\end{lstlisting}
\end{figure}
\begin{example}
	Listing \ref{f_app:build_corpus_1} displays fourteen PHP instructions collected from vulnerable and non-vulnerable slices. The representation of the instructions in ISL is illustrated in Listing \ref{f_app:build_corpus_2}.
	Some instructions have more than one representation, depending on whether the extracted slice is vulnerable or not. For example, the instruction in the fifth position of Listing \ref{f_app:build_corpus_2} appears as two series (the two lines immediately below it), corresponding to the sanitization of an untainted and a tainted variable, respectively.
	The listing also makes visible the difference between the \textcode{var} and \textcode{var\_vv} tokens.
	Listing \ref{f_app:build_corpus_3} has the final corpus that is produced after applying the last two steps. Each sequence of observations is annotated with states as explained above. Duplicated sequences are eliminated, as several PHP instructions can result in the same sequence. For example, the PHP instructions in lines 1 and 2 of Listing \ref{f_app:build_corpus_1} become the same sequence (line 1 of Listing \ref{f_app:build_corpus_3}).
\end{example}


%---------------------------------------------
\subsection{Configuring the HMM}
\label{s:hmm_ta-param}

The sequence model was mostly defined in Section~\ref{s:hmm_ta}. The only missing piece of information is the \emph{parameters}, i.e., the probabilities of starting in each state, of transitioning between states, and of emitting the observations. These probabilities are computed from the corpus by counting the occurrences of observations and/or states. The result is three matrices of probabilities with dimensions $(1\times s)$, $(s\times s)$ and $(t\times s)$, where $s$ and $t$ are the number of states and tokens of the model.
The matrices are calculated as follows:

\noindent
\emph{Start-state probabilities:} count how many sequences begin in each state. Then, obtain the probability for each state by dividing these counts by the number of sequences in the corpus. This produces a matrix with dimension $(1\times 5)$.
\begin{example}
	To obtain the start-state probability of the \textcodeit{San} state, we count how many sequences begin in the \textcodeit{San} state and divide by the size of the \emph{corpus}.
\end{example}

\noindent
\emph{Transition probabilities:} count how many times in the corpus a given state $i$ transits to a state $k$ (including itself). The transition probability is obtained by dividing this count by the number of pairs of consecutive states in the corpus that begin with state $i$.
The resulting matrix has dimension $(5\times 5)$, keeping the probabilities for all possible transitions between the five states.

\begin{example}
	The transition probability from the \textcodeit{N-Taint} state to the \textcodeit{Taint} state is the number of occurrences of this transition in the corpus divided by the number of pairs of states that begin in the \textcodeit{N-Taint} state.
\end{example}

\noindent
\emph{Emission probabilities:} count how many times a given observation is emitted by a particular state, i.e., how many times a given pair \textcode{$\langle$token,state$\rangle$} appears in the corpus. Then, calculate the emission probability by dividing this count by the total number of pairs \textcode{$\langle$token,state$\rangle$} that occur for that specific state.
The resulting matrix -- called the \emph{global emission probabilities matrix} -- has dimension $(22\times 5)$, holding a probability for each of the 22 tokens that could be emitted by each of the 5 states.

\begin{example}
	To obtain the probability that the \textcodeit{Taint} state emits the \textcodeit{var\_vv} token (\textcode{$\langle$var\_vv,Taint$\rangle$}), first get the number of occurrences of this pair in the corpus, and next divide it by the total number of pairs of the \textcodeit{Taint} state.
\end{example}

Zero probabilities should be avoided because the Viterbi algorithm multiplies probabilities when computing the likelihood of moving to the next state, so one needs to ensure that this product is never zero.
The \emph{add-one smoothing} technique \cite{Jurafsky:08} addresses this issue: it simply adds one unit to all counts, making zero counts equal to one and the associated probabilities different from zero.
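The computation of the three matrices can be sketched as follows; the tiny corpus, states and vocabulary are illustrative, and add-one smoothing is applied uniformly to all counts.

```python
from collections import Counter

# Sketch of estimating the HMM parameters from an annotated corpus by
# counting and normalizing, with add-one smoothing (illustrative, not
# DEKANT's implementation).
def train(corpus, states, vocab):
    start = Counter(seq[0][1] for seq in corpus)
    trans = Counter((a[1], b[1]) for seq in corpus for a, b in zip(seq, seq[1:]))
    emit = Counter((t, s) for seq in corpus for t, s in seq)
    # add-one smoothing: every count starts at 1, so no probability is zero
    start_p = {s: (start[s] + 1) / (len(corpus) + len(states)) for s in states}
    trans_p = {q: {s: (trans[q, s] + 1) /
                      (sum(trans[q, r] for r in states) + len(states))
                   for s in states} for q in states}
    emit_p = {s: {t: (emit[t, s] + 1) /
                     (sum(emit[u, s] for u in vocab) + len(vocab))
                  for t in vocab} for s in states}
    return start_p, trans_p, emit_p

corpus = [
    [("input", "Taint"), ("var_vv", "Taint")],
    [("sanit_f", "San"), ("input", "San"), ("var", "N-Taint")],
]
states = ["Taint", "N-Taint", "San"]
vocab = ["input", "var", "var_vv", "sanit_f"]
start_p, trans_p, emit_p = train(corpus, states, vocab)
# each row of each matrix is a probability distribution (sums to 1)
```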


%---------------------------------------------
\subsection{Detecting Vulnerabilities}
\label{s:hmm_ta-classify}

Given the source code of an application, the collector gathers the slices that should be examined, and then every slice is inspected separately. First, the instructions of the slice are translated to ISL. This means that the slice becomes a list of sequences of observations, each one corresponding to a PHP instruction. The discovery of flaws is accomplished by processing the sequences in their order of appearance, starting with the first and concluding with the last.

The HMM is applied to each sequence of observations to find out the associated states.
We resort to an extension of the Viterbi algorithm to perform this task. The algorithm employs dynamic programming to compute the most likely succession of states that explains a sequence of observations. When the algorithm finishes with a sequence, a final state comes out, either \textcodeit{Taint} or \textcodeit{N-Taint}. This information is then propagated to the next sequence. The process is repeated for all sequences, and the final state of the last sequence defines the outcome for the slice --- either vulnerable (if it is tainted) or non-vulnerable (if it is untainted).

For the classification to be carried out effectively, it is necessary to faithfully propagate taintedness among the sequences under analysis, which means keeping information about the variables that are tainted. For this purpose, we use three artifacts that are updated as the execution evolves:

\begin{itemize}
	\item \emph{Tainted List} (TL): as sequences are processed, it keeps the identifiers of the variables that are perceived as tainted;

	\item \emph{Conditional Tainted List} (CTL): contains the inputs (token \textcodeit{input}) and tainted variables (belonging to TL) that have been validated (e.g., by tokens \textcodeit{typechk\_num} and \textcodeit{contentchk});

	\item \emph{Sanitized List} (SL): has essentially the same aim as CTL, except that it maintains the variables that are sanitized or modified (e.g., with functions that manipulate strings).
\end{itemize}

\begin{example}
	Fig.~\ref{t:xss} has the PHP code for the two slices composed of lines \{1, 2, 3, 4\} and \{1, 3, 5, 6\}, respectively.
	After processing the first slice, TL = \{u, a\} and CTL = \{u, a\}, as variable $u$ is the parameter of the \textcodeit{contentchk} token and variable $a$ is the parameter of the \textcodeit{typechk\_int} token. The final state is \textcodeit{N-Taint} because variable $u$ is included in CTL. In the other slice, TL = \{u, a\} and CTL = \{\} since there is no validation, and the final state is \textcodeit{Taint}.
\end{example}

In our implementation, the Viterbi algorithm was extended to explore the information kept in the variable map and in these artifacts (further details in Section \ref{ss:mod_viterbi}). Handling a sequence of observations becomes a three-step procedure: (1) a preprocessing step is carried out -- \textcodeit{beforeVit}; (2) then, the decoding step of the Viterbi algorithm is applied -- \textcodeit{decodeVit}; (3) lastly, a post-processing step is executed -- \textcodeit{afterVit}.
The steps work as follows:

\begin{description}
	\item [\textcodeit{beforeVit}:] the variable map is visited to get the name of the variable associated with each \textcodeit{var} observation. The TL and SL are checked to determine whether they hold that name. In case the sequence starts with the token \textcodeit{cond}, the CTL is also accessed. If a variable only belongs to TL, then the \textcodeit{var} observation is changed to \textcodeit{var\_vv}, thus capturing the effect of the variable being tainted. Finally, an emission probability sub-matrix for the observations of the sequence is retrieved from the global emission probabilities matrix;

	\item [\textcodeit{decodeVit}:] for each observation, the Viterbi algorithm calculates the probability of each state emitting it, considering the probabilities of emission, of transition, and of the states already discovered. The multiplication of these three probabilities results in a probability called the \emph{score of the state}. The state assigned to an observation is the one with the highest score. The process is repeated for all observations, and the state of the last observation classifies the sequence as \textcodeit{Taint} or \textcodeit{N-Taint}.

	In more detail, the three probabilities are obtained as follows: the emission probability comes from the sub-matrix of emission probabilities, for the observations being processed; the transition probability comes from the matrix of transition probabilities; the previous-state probability is determined by picking the highest score computed for the previous observation. This last probability brings into the calculation the order in which the observations appear in the sequence and the knowledge already discovered about the previous observations. Since this knowledge does not exist for the first observation of the sequence, in that case the start-state probabilities are used;

	\item [\textcodeit{afterVit}:] if the sequence \emph{is an assignment} (i.e., the last observation of the sequence is a \textcodeit{var} token and the entry in the variable map starts with 1), then the corresponding variable name is obtained from the variable map. Next, the TL is updated by: (i) inserting the variable name if the final state is \textcodeit{Taint}; or (ii) removing it if the state is \textcodeit{N-Taint} and the variable is in TL; in the presence of a sanitization sequence, the variable name is also added to SL. In case the sequence \emph{is an \textcodeit{if} condition} (i.e., the first and last observations are \textcodeit{cond} tokens), then the variable map is searched for each \textcodeit{var} and \textcodeit{var\_vv} observation. Next, the TL is searched to discover whether it includes the name, and in that situation the CTL is updated by inserting that name.
\end{description}

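These three steps and the artifact updates can be sketched as follows. This is a deliberately simplified illustration, not DEKANT's implementation: it tracks only TL and SL (CTL and the handling of conditionals are omitted), and a stand-in decoder replaces the extended Viterbi algorithm.

```python
# Sketch of the beforeVit / decodeVit / afterVit handling of one slice.
def classify_slice(slice_isl, var_map, decode):
    """slice_isl: ISL token lists; var_map: entries [flag, name-or-'-', ...]."""
    TL, SL = set(), set()          # tainted / sanitized variable names
    final = "N-Taint"
    for seq, entry in zip(slice_isl, var_map):
        is_assign, names = entry[0] == "1", entry[1:]
        # beforeVit: a variable already known to be tainted becomes var_vv
        obs = ["var_vv" if t == "var" and n in TL and n not in SL else t
               for t, n in zip(seq, names)]
        # decodeVit: the state of the last observation classifies the sequence
        final = decode(obs)[-1]
        # afterVit: propagate taintedness through assignments
        if is_assign:
            target = names[-1]
            if final == "Taint":
                TL.add(target)
            else:
                TL.discard(target)
                if "sanit_f" in obs:
                    SL.add(target)
    return final

def toy_decode(obs):
    """Stand-in decoder: tainted unless a sanitizer appears."""
    tainted = ("input" in obs or "var_vv" in obs) and "sanit_f" not in obs
    return ["Taint" if tainted else "N-Taint"] * len(obs)

# Fig. 1-style slice: $u = $_POST[...]; $q = "SELECT ... $u"; mysqli_query($con, $q)
result = classify_slice(
    [["input", "var"], ["var", "var"], ["ss", "var", "var"]],
    [["1", "-", "u"], ["1", "u", "q"], ["0", "-", "con", "q"]],
    toy_decode)
# → "Taint": the slice is reported as vulnerable
```

Taintedness flows from the input to u, from u to q, and finally into the sensitive sink, which is why the last sequence ends in the Taint state.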
The end result of these actions is the ability to keep the relevant knowledge about the propagation of inputs through the slice, and thus to determine how they can influence the sensitive sinks.

\begin{example}
	Fig.~\ref{t:classif_sqli_1}(a) and (b) show an example of the detection of a bug. From left to right, they comprise: the PHP code, the representation in ISL, the variable map, and the TL after the observations are classified.
	In line 1, the Viterbi algorithm is applied and, as a result, the \textcodeit{var} observation is tainted because an \textcodeit{input} observation is tainted by default; the model classifies it correctly and variable \textcodeit{u} is inserted in TL.
	In line 2, the first \textcodeit{var} observation is updated to \textcodeit{var\_vv} because it corresponds to variable \textcodeit{u}, which belongs to TL, and then the Viterbi algorithm is applied; the \textcodeit{var\_vv var} sequence is classified by the model and the final state is \textcodeit{Taint}; therefore, variable \textcodeit{q} is inserted in TL. The process is repeated for the next line, allowing the discovery of the flaw.
	%
	Fig.~\ref{t:classif_sqli_1}(c) presents the decoding of the slice as the processing progresses. Here, it is possible to see the places where \textcodeit{var} is replaced by \textcodeit{var\_vv}, with the relevant variable name as a suffix. In addition, the states of each observation are also shown. By following the generated states, one can understand the effects of the code execution (without actually running it), which variables are tainted, and why the code is vulnerable. The state of the last observation indicates the final classification --- a vulnerability.
\end{example}


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Implementation and Initial Assessment}
\label{s:impl_eval_dekant}

Our approach is implemented in the DEKANT tool. A corpus was also created to train the model. This corpus can be extended in the future with additional annotated sequences, allowing the tool to evolve its knowledge and detection capabilities.


%---------------------------------------------
\subsection{Implementation of DEKANT}
\label{ss:impl_dekant}
\label{subsub:vul_det}

DEKANT is programmed in Java and its architecture is divided into four major modules, which are explained below in more detail:

\noindent
\emph{Knowledge extractor:} operates separately from the other modules and is executed when the corpus is built or later modified. It runs in three steps:
(i) the sequences, composed of series of annotated tokens, are loaded from a plain text file. Each sequence is separated into pairs \textcode{$\langle$token,state$\rangle$} and the elements of each pair are inserted in the matrices called \emph{observations} and \emph{states}. Since sequences normally have different numbers of pairs, it becomes necessary to \emph{normalize the length of all sequences} in the corpus.
This is accomplished by first determining the length of the largest sequence, and then padding shorter sequences with the \textcodeit{miss} token together with the state of the last observation (i.e., with pairs \textcode{$\langle$miss,Taint$\rangle$} or \textcode{$\langle$miss,N-Taint$\rangle$}), so that all sequences have the same length;
(ii) then, the various probabilities of the model are computed as explained in the previous section;
(iii) lastly, all relevant information about the model is saved in a plain text file to be loaded by the vulnerability detector module.
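The length normalization of step (i) can be sketched as follows; the corpus fragment is illustrative.

```python
# Sketch of padding shorter corpus sequences with the miss token paired
# with the sequence's final state, up to the length of the largest sequence.
def pad_corpus(corpus):
    longest = max(len(seq) for seq in corpus)
    return [seq + [("miss", seq[-1][1])] * (longest - len(seq))
            for seq in corpus]

padded = pad_corpus([
    [("input", "Taint"), ("var_vv", "Taint")],
    [("sanit_f", "San"), ("input", "San"), ("var", "N-Taint")],
])
# the first sequence gains one <miss,Taint> pair
```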

\noindent
\emph{Slice collector:} uses a lexer and a parser to process PHP code (based on ANTLR\footnote{https://www.antlr.org/}). It searches the application files for places where inputs arrive from the user and then tracks the data flows until either a security-critical instruction is reached or the program exits.
Slices that have both an entry point and a sensitive sink are passed to the translator (the others are discarded).
The information about which entry points and sensitive sinks should be considered is provided in a configuration file.

\noindent
\emph{Slice translator:} reads configuration files describing the classes of tokens, e.g., listing the PHP functions that each token represents. Some token classes are common to every class of vulnerability, whereas others are specific to a particular bug. For example, the \textcode{input} file contains the \textcode{\$\_GET} and \textcode{\$\_POST} global arrays, and the \textcode{ss\_xss} file has the security-sensitive functions associated with XSS (e.g., \textcode{echo}). The module first parses the slice and then determines which tokens should be assigned to each PHP instruction, following the ISL grammar rules. Simultaneously, it also generates the variable map.

\noindent
\emph{Vulnerability detector:} works in three steps to find the bugs:
(i) the probabilities are loaded from a file and the model is set up internally; (ii) the slice translated into the intermediate language is processed with the modified Viterbi algorithm. Occasionally, a sequence has more observations than the largest sequence seen in the corpus. When this happens, the sequence is divided into sub-sequences of at most the maximum corpus sequence length. Each sub-sequence is then classified separately, but the algorithm ensures that the initial probability of the following sub-sequence equals the probability resulting from the previous one; (iii) lastly, the probabilities of particular sequences of states explaining the sequence of observations are estimated, and the most probable one is chosen. An alert message is issued if a vulnerability is found.

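The sub-sequence splitting of step (ii) can be sketched as follows; the carry-over of the final probability into the next sub-sequence is omitted from this sketch.

```python
# Sketch of splitting an over-long sequence of observations into
# sub-sequences of at most the maximum corpus sequence length.
def split_sequence(obs, max_len):
    return [obs[i:i + max_len] for i in range(0, len(obs), max_len)]

split_sequence(["cond", "fillchk", "var_vv", "cond", "ss", "var_vv"], 4)
# → [['cond', 'fillchk', 'var_vv', 'cond'], ['ss', 'var_vv']]
```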
%---------------------------------------------
\subsubsection{Extensions to the Viterbi algorithm}
\label{ss:mod_viterbi}

We extended the Viterbi algorithm with the two procedures of Section \ref{s:hmm_ta-classify} (\textcodeit{beforeVit} and \textcodeit{afterVit}) to track the propagation of inputs while processing a slice and to explore the data structures that keep relevant knowledge about variables (i.e., the three artifacts TL, CTL and SL).

143        Listing \ref{f:viterbi_modified_1} presents the \textcodeit{beforeVit} preprocessing procedure that is run before the Viterbi algorithm.
        \textcode{beforeVit} performs a few tests that set flags and update the data structures. For each observation (\textcode{obs}) in the sequence (\textcode{inst\_slice\_isl}), there are checks to find out:
        (i) the presence of a sanitization (\textcode{sanit\_f}) or a \textcode{cond} token. In the latter case, the position of \textcode{obs} in the sequence is verified to discover whether the instruction is an \textcode{if} statement, an instruction inside a conditional statement, or an \textcode{else} statement (lines 19 to 30);
        (ii) whether an \textcode{if} statement contains validation functions whose parameters are a variable or an input (i.e., \textcode{var} or \textcode{input}). In that case, the \textcode{var} or \textcode{input} token is inserted in CTL (lines 32 to 48);
        (iii) for an instruction inside an \textcode{if} statement, whether its \textcode{var} and \textcode{input} tokens belong to CTL and/or SL. VM (the variable map) is accessed to get the name of the variable associated with a \textcode{var} token. If an \textcode{input} token belongs to SL or CTL, it is replaced by the \textcode{var} token because it has to lose its taintedness (we recall that by default the \textcode{input} token is tainted and the \textcode{var} token is untainted, so this replacement is required) (lines 50 to 61);
        (iv) in the presence of any other instruction, i.e., when \textcode{inst\_slice\_isl} is outside the validation scope, if the observation is a \textcode{var} token, the name of the variable is taken from VM and checked for membership in TL but not in SL. In that case, the variable is tainted and the observation is replaced by the \textcode{var\_vv} token (lines 63 to 71).
        For all four verifications, the emission probability of the observation under analysis is retrieved from the global emission probabilities matrix (\textcode{GEP}) and inserted into the emission probabilities matrix (\textcode{EP}) of \textcode{inst\_slice\_isl} (line 72).

        Afterwards, the traditional Viterbi algorithm is executed (the \textcodeit{decodeVit} step explained in Section \ref{s:hmm_ta-classify}), and then the post-processing \textcode{afterVit} procedure runs.
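The effect of checks (iii) and (iv) can be illustrated with a small, hedged sketch: before decoding, observations are retagged using the taint-tracking lists (TL, CTL, SL) and the variable map VM. The token names follow the paper, but the control flow below is a simplification of Listing \ref{f:viterbi_modified_1}, and \textcode{before\_vit} is a hypothetical helper, not the actual implementation.

```python
# Simplified illustration of beforeVit's retagging of observations.
# TL - tainted list, CTL - conditional tainted list, SL - sanitized list,
# vm - maps each observation position to the variable name it refers to.

def before_vit(obs_seq, vm, TL, CTL, SL, inside_validation_if=False):
    """Return a copy of obs_seq with taint-dependent token replacements."""
    out = []
    for i, obs in enumerate(obs_seq):
        name = vm.get(i)                      # variable name at this position
        if inside_validation_if and obs == "input":
            # check (iii): inputs validated by the enclosing if statement
            # lose their taintedness, so the token becomes an untainted var
            out.append("var" if (name in CTL or name in SL) else obs)
        elif not inside_validation_if and obs == "var":
            # check (iv): outside the validation scope, a variable that is
            # tainted and not sanitized is marked with the var_vv token
            out.append("var_vv" if (name in TL and name not in SL) else obs)
        else:
            out.append(obs)
    return out

# outside any validation: $x is tainted and unsanitized, so var -> var_vv
retagged = before_vit(["var", "input"], {0: "$x"},
                      TL={"$x"}, CTL=set(), SL=set())
# → ["var_vv", "input"]
```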
\begin{lstlisting}[numbers=left, numbersep=5pt, breaklines=true, language=pascal, caption={\small  {\emph{beforeVit}} extension to the Viterbi algorithm.}, captionpos=b, label=f:viterbi_modified_1]
/* >>> Data structures and variables <<<
** VM - variable map
** TL - tainted list
** CTL - conditional tainted list
** SL - sanitized list
** obs_index - index of obs in the instruction_slice_isl
** var_name - variable name of the obs from inst_slice_isl
** condition - variable for controlling if statements
** val - variable for controlling validation functions
** san - variable for controlling sanitization functions
** EP - emission probability matrix of instruction_slice_isl
** GEP - global emission probabilities matrix
** obs_ep - emission probability of the obs in analysis
*/

val = 0
san = 0
for each obs in inst_slice_isl do
   if obs = sanit_f then san = 1 end_if

   if obs = cond then
      if obs_index = 1 then
         if size(inst_slice_isl) = 1 then condition = 0 else condition = 1 end_if
      else
         condition = 2
      end_if
      get obs_ep from GEP
   end_if