Intel ISEF Project, Yale Science and Engineering Award
As computing systems become more advanced and users engage more deeply with technology, security has never been a greater concern. In malware detection, static analysis has been the prominent approach. This approach, however, quickly falls short as malicious programs become more advanced and gain the ability to obfuscate their binaries while executing the same malicious functions, making static analysis virtually inapplicable to newer variants. The approach assessed in this paper uses dynamic analysis of malware, which may generalize better than static analysis to variants. Widely used document classification techniques were assessed for detecting malware by applying them to system call traces, a form of dynamic analysis. Features are extracted from system call traces of benign and malicious programs, and the task of classifying these traces is treated as a binary document classification task using sparse features. The system call traces were processed to remove the parameters, leaving only the system call function names. The features were grouped into various n-grams and weighted with Term Frequency-Inverse Document Frequency. Support Vector Machines were used and optimized using a Stochastic Gradient Descent algorithm that implemented L1, L2, and Elastic-Net regularization terms, the best of which achieved a highest of 98 [...] identification of significant system call sequences that could be avenues for further research.
Static malware analysis has been the prominent approach in malware detection. Static analysis examines the binaries of programs without executing them. Although many valid approaches have been proposed, research suggests that the binary code obfuscation techniques available today are incredibly difficult to overcome moser2007limits . For the same reason, although static analysis can often accurately detect known malware, it struggles against new variants and zero-day threats canzanese2015detection . One approach that may resolve this issue is to analyze a given program dynamically, examining its behavior rather than its code to detect whether it is malicious. The idea is that even if a malicious file changes, its behavior should remain the same. Dynamic analysis aims to find patterns in program execution. A classifier trained to recognize these patterns can detect malicious behavior in the future, regardless of the program's code structure or whether it has been seen before.
Some approaches to dynamic analysis of malware include looking for files that have been added or modified, newly installed services, newly running processes, registry modifications, and more distler2007malware . One such method is the analysis of system calls. System calls are routines that user programs call to use services of the operating system hubballi2011sequencegram . Any program running in an operating system issues a definite set of system calls. By analyzing the sequences of system calls made by programs under normal operating conditions, one can measure how far the processes executed by a malicious program diverge from usual behavior. FIG. 1 shows an example Windows Native API system call trace.
The task of classifying traces of system calls as belonging to either a malicious or a benign program is treated as analogous to a binary document classification task. Document classification is a machine learning task within Natural Language Processing (NLP) that aims to assign a category or a class to a document by analyzing its content ghaffari . Given recent improvements and successes in document classification, it was deemed appropriate to apply its methods and evaluate their effectiveness in this setting metz_2017 . The purpose of this research is to evaluate the effectiveness of a machine learning approach to malware detection, using document classification techniques to conduct dynamic analysis of malicious programs through their system call traces. A trace of the system calls of a given program is the equivalent of a document in a document classification task, and by training classifiers on the dataset, we aim to classify previously undetected malicious programs by making predictions based on their system calls.
The ease with which static data about malware can be obfuscated, and the extent to which that limits static analysis approaches to malware detection, have been extensively evaluated moser2007limits . Intrusion detection using approaches ranging from variants of SVMs to neural networks, applied to network traffic data, has been attempted with significant success mukkamala2002intrusion thaseen2017intrusion cannady1998artificial ibrahim2010anomaly . Specifically in the analysis of system calls, there have been implementations on Android phones malik2016system chaba2017malware . For Windows operating systems, prior work includes an approach using n-grams of Windows API calls with an SVM, a further discussion of n-grams of system calls and a variant dubbed Sequencegrams, and a paper assessing the use of one-, two-, three-, and four-grams of system calls weighted by TF-IDF in malware detection veeramani2012windows hubballi2011sequencegram canzanese2015detection .
This study aims to build upon the related work by trying and comparing different ways of applying document classification techniques to system call analysis for malware detection. It seeks to show that certain choices are noticeably useful in this approach. It also proposes a system that integrates this use of system calls for malware detection into a deployable form.
Note: The feature extraction process and the detection algorithms were implemented using the Scikit-learn library scikit-learn .
Various malware corpora were collected from online sources such as VirusShare and The Zoo virusshare thezoo . VirusTotal, a website that displays the results of testing a program against various antivirus products, was used to validate whether a given program was malicious or benign virustotal . A program was deemed malicious if more than 80% of the antivirus products shown in VirusTotal flagged it as malicious, and a program was deemed benign if all of them agreed that it was harmless.
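The labeling rule above can be written as a small function. This is only a sketch of the rule as stated in the text: the function name and its inputs (a count of products that flagged the sample and the total number of products) are hypothetical, and it does not query VirusTotal.

```python
def label_from_detections(flagged, total, threshold=0.80):
    """Apply the labeling rule described in the text.

    flagged: number of antivirus products that reported the sample as malicious
    total:   number of antivirus products that scanned the sample
    Returns "malicious", "benign", or None when the sample satisfies neither
    condition and would be left out of the dataset.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if flagged / total > threshold:
        return "malicious"   # more than 80% of products flagged it
    if flagged == 0:
        return "benign"      # every product agreed it was harmless
    return None              # ambiguous sample


labels = [label_from_detections(60, 70),
          label_from_detections(0, 70),
          label_from_detections(10, 70)]
```

Samples that fall between the two conditions receive no label here, which mirrors the fact that the rule as written only defines the two classes.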
NtTrace, a native API tracer for Windows, was used to collect system call traces orr . This program is run from the command prompt and is given the path to the target program as well as options on how it should be traced (filters, logging only errors, etc.). Because running malware on a personal computer is unsafe, a virtual machine was created to serve as the host. VirtualBox was used for this purpose virtualbox . Malware samples were executed on a Windows operating system hosted in VirtualBox. Batch scripts were used to automate the process of collecting the system calls. A batch script is a file that executes a series of commands laurie . Batch scripts can run these commands in loops, automating the tracing of system calls executed by hundreds of programs.
Because the features considered are sequences of system call functions, the parameters were not used as features in this research. A script processed the system call logs generated by NtTrace to remove the parameters, leaving only the function names. Sections of the logs that were not system calls, such as messages reporting the unloading of DLLs, were removed as well. FIG. 2 demonstrates this pre-processing step.
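A minimal sketch of such a pre-processing script is given below. It is not the script used in this research. It assumes that each system call appears on its own line in the form `Name( parameters ) => status` and that every call name begins with `Nt`; the sample log is made up for illustration. Names are lowercased to match the form in which sequences are reported later in this paper.

```python
import re

# Assumed line shape: "NtClose( Handle=0x1c ) => 0".
# Lines that do not start with an Nt* name followed by "(" are discarded.
CALL_RE = re.compile(r"^\s*(Nt[A-Za-z0-9_]+)\s*\(")


def strip_parameters(log_text):
    """Return the ordered list of system call names found in a trace log."""
    names = []
    for line in log_text.splitlines():
        match = CALL_RE.match(line)
        if match:
            names.append(match.group(1).lower())
    return names


# Hypothetical log excerpt, including one non-call line that must be dropped.
sample_log = """\
NtOpenFile( FileHandle=0x12f8a0, DesiredAccess=0x100020 ) => 0
Unloaded DLL at 0x74a00000
NtClose( Handle=0x1c ) => 0
"""
calls = strip_parameters(sample_log)
```

The result is the trace reduced to a sequence of function names, which is the "document" used in the rest of the pipeline.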
We define the terms and variables referred to below as follows:
$f_{w,d}$ is the frequency of the word $w$ in $d$.
$D$ is the document corpus and $d \in D$ represents a particular document.
$V$ is the vocabulary and $w \in V$ represents each word that appears in the corpus.
1. Bags of Words Model
Bag of words is the way in which features are extracted from text for use by the machine learning algorithms. The idea behind "bag" is that order is not accounted for; this model only takes into account how often certain words occur in a document, not where they occur in the document brownlee_2017 . In this research, the model is applied in the sense that the "words" are system calls and the "documents" are logs of system calls. A set $V$ is built as the vocabulary of all unique system calls (represented by $w_i$, where $i$ is the index of the system call in the vocabulary). Each document is represented by how many times every $w_i$ occurs in the document, written $f_{w_i,d}$, the frequency of the system call of index $i$ in the log $d$.
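As an illustration of the bag-of-words representation, the sketch below builds the vocabulary and the per-log count vectors with the Python standard library. The function name and the two toy traces are hypothetical; the research itself used Scikit-learn for feature extraction.

```python
from collections import Counter


def bag_of_words(traces):
    """Build the vocabulary and one count vector per trace.

    traces: list of traces, each a list of system call names.
    Returns (vocabulary, vectors), where vectors[k][i] is the number of
    times vocabulary[i] occurs in traces[k].
    """
    vocabulary = sorted({call for trace in traces for call in trace})
    vectors = []
    for trace in traces:
        counts = Counter(trace)
        vectors.append([counts[call] for call in vocabulary])
    return vocabulary, vectors


# Two made-up traces, already reduced to function names.
traces = [
    ["ntopenfile", "ntreadfile", "ntreadfile", "ntclose"],
    ["ntopenfile", "ntclose"],
]
vocabulary, vectors = bag_of_words(traces)
```

Note that the two vectors carry no information about the order of the calls, which is the limitation the next model addresses.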
2. $n$-Grams Model
Using single system calls as features is a unigram approach, as each feature consists of just one "gram". This falls short in many text classification tasks because the bag-of-words model does not take the sequence of the text into account. The $n$-gram model addresses this issue: the features are sequences of system calls instead of single system calls, so the model analyzes frequencies of sequences rather than frequencies of individual calls. In an $n$-gram model, each feature is a sequence of $n$ consecutive system calls. FIG. 3 demonstrates the conversion of unigrams to bigrams.
In this research, the implemented $n$-grams were eight-grams, nine-grams, and ten-grams. This combination was selected by the grid search described later, which showed that it provided the best accuracy. This was expected, considering that the operations of any program would typically require reasonably long sequences of system calls.
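The conversion from a trace to its $n$-grams can be sketched as follows. The helper names are hypothetical, and each $n$-gram is written as a space-separated string, which is the form in which sequences are reported later in this paper.

```python
def ngrams(trace, n):
    """Return all runs of n consecutive system calls in the trace."""
    return [" ".join(trace[i:i + n]) for i in range(len(trace) - n + 1)]


def ngram_range(trace, n_min, n_max):
    """Concatenate the n-grams for every n from n_min to n_max inclusive."""
    features = []
    for n in range(n_min, n_max + 1):
        features.extend(ngrams(trace, n))
    return features


trace = ["ntopenfile", "ntreadfile", "ntclose"]
bigrams = ngrams(trace, 2)

# A trace of ten calls yields three 8-grams, two 9-grams and one 10-gram.
long_trace = ["call%d" % i for i in range(10)]
features = ngram_range(long_trace, 8, 10)
```

A trace shorter than $n$ simply produces no $n$-grams.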
3. Term Frequency - Inverse Document Frequency (TF-IDF) Weighting
In any document corpus, certain words are more common than others, such as "the", "a", and "of". An assumption was made that the same idea applies to logs of system calls: certain sequences belong to operations that every program performs in an operating system, while other sequences belong to specific operations that may or may not be malicious. In document classification, one of the most popular models for assigning weights to the terms that occur in a document is the TF-IDF model.
Term Frequency (TF) refers to the number of times a certain word occurs in a document. Inverse Document Frequency (IDF) is a logarithmically scaled inverse of the number of documents in the corpus that contain the word.
The TF-IDF weight of a term is computed as follows:
$$\mathrm{tfidf}(w, d, D) = f_{w,d} \cdot \mathrm{idf}(w, D)$$
$f_{w,d}$: frequency of $w$ in document $d$
$$\mathrm{idf}(w, D) = \log \frac{|D|}{n_w}$$
where $n_w$ is the number of documents the word $w$ appears in.
This is a logarithmically scaled value of the number of documents in the corpus divided by the number of documents in which word $w$ appears.
The value increases in proportion to the frequency of $w$ in a document and decreases with the logarithm of the number of documents in the corpus that contain $w$. The assumption is that a word that is more prevalent throughout the corpus is likely to be less significant.
The resulting vectors of raw TF-IDF values, one per document, are normalized using the Euclidean norm:
$$v_{\mathrm{norm}} = \frac{v}{\|v\|_2} = \frac{v}{\sqrt{v_1^2 + v_2^2 + \dots + v_n^2}}$$
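The weighting and normalization can be sketched in plain Python as below. This sketch uses the unsmoothed definition, the logarithm of the number of documents divided by the number of documents containing the term. Scikit-learn's default TF-IDF applies a smoothed variant of the IDF, so its values will differ from the ones produced here. The function name and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return (vocabulary, vectors) with Euclidean-normalized TF-IDF weights.

    documents: list of documents, each a list of terms (here, n-grams).
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))          # count each term once per document
    vocabulary = sorted(doc_freq)
    idf = {term: math.log(n_docs / doc_freq[term]) for term in vocabulary}

    vectors = []
    for doc in documents:
        tf = Counter(doc)
        raw = [tf[term] * idf[term] for term in vocabulary]
        norm = math.sqrt(sum(v * v for v in raw))
        # A document made only of terms shared by all documents has norm 0.
        vectors.append([v / norm if norm > 0 else 0.0 for v in raw])
    return vocabulary, vectors


docs = [["a", "b", "b"], ["a", "c"]]
vocab, vecs = tfidf_vectors(docs)
```

In this toy corpus the term `a` occurs in every document, so its IDF is zero and it contributes nothing to either vector, which is the intended effect of the weighting.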
1. Support Vector Machines (SVM)
The objective of an SVM classifier is to learn a decision boundary hyperplane that optimally separates the dataset. The optimized decision boundary is then used to decide whether a new data point belongs to malware or not.
Given training data $(\mathbf{x}_i, y_i)$ for $i = 1, \dots, N$, where $\mathbf{x}_i$ is the collection of features (TF-IDF values of n-grams of system calls) of a document and $y_i \in \{-1, +1\}$, the objective of SVM is to learn a classifier $f(\mathbf{x})$ so that:
$$f(\mathbf{x}_i) \ge 0 \ \text{if}\ y_i = +1, \qquad f(\mathbf{x}_i) < 0 \ \text{if}\ y_i = -1$$
where $f(\mathbf{x})$ is defined by:
$$f(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + b$$
The best decision boundary is determined by the margins between the decision boundary and the support vectors, the data points closest to the decision boundary. It is the boundary that has the largest margin from the support vectors.
2. Stochastic Gradient Descent (SGD) Optimization Using Various Regularization Terms
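The Stochastic Gradient Descent algorithm is used to find this optimal decision boundary. With the goal of learning $f(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + b$, the best model parameters are computed by minimizing the regularized training error, which is given by:
$$E(\mathbf{w}, b) = \frac{1}{N} \sum_{i=1}^{N} L(y_i, f(\mathbf{x}_i)) + \lambda R(\mathbf{w})$$
where $L$ is the loss (cost) function, which measures the error of the model, $R$ is a regularization term, reducing the likelihood that the model overfits, and $\lambda > 0$ sets the strength of the regularization. The hinge loss function ($L$) is defined by the following:
$$L(y_i, f(\mathbf{x}_i)) = \max(0,\ 1 - y_i f(\mathbf{x}_i))$$
where $f(\mathbf{x}_i)$ is the prediction of the classifier and $y_i$ is the intended output.
The regularization terms implemented were L1, L2, and Elastic Net penalties.
L2 penalty: $R(\mathbf{w}) = \frac{1}{2} \sum_j w_j^2$
L1 penalty: $R(\mathbf{w}) = \sum_j |w_j|$
Elastic Net penalty: $R(\mathbf{w}) = \frac{\rho}{2} \sum_j w_j^2 + (1 - \rho) \sum_j |w_j|$, a convex combination of the L1 and L2 penalties with $0 \le \rho \le 1$.
Given this error function, the parameter $\mathbf{w}$ is updated iteratively by the following operation:
$$\mathbf{w} \leftarrow \mathbf{w} - \eta \left( \lambda \frac{\partial R(\mathbf{w})}{\partial \mathbf{w}} + \frac{\partial L(y_i, f(\mathbf{x}_i))}{\partial \mathbf{w}} \right)$$
where $\eta$ is the learning rate that controls the step size:
$$\eta^{(t)} = \frac{1}{\lambda (t_0 + t)}$$
where $t$ is the time step and $t_0$ is a constant offset.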
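The update loop can be sketched in plain Python as follows. This is an illustrative sketch, not the Scikit-learn implementation used in this research. It assumes a hinge loss, a penalty that mixes L2 and L1 with a weight `rho` (`rho = 1` is pure L2, `rho = 0` is pure L1, values in between give Elastic Net), a regularization strength `lam`, and a step size of `1 / (lam * (t0 + t))`. Choosing `t0` so that the first step size equals `eta0` is an assumption made here for stability; the toy data are made up.

```python
import random


def sign(v):
    return (v > 0) - (v < 0)


def sgd_linear_svm(X, y, lam=0.01, rho=1.0, eta0=0.1, epochs=50, seed=0):
    """Train a linear classifier with hinge loss by stochastic gradient descent.

    X: list of feature vectors, y: list of labels in {-1, +1}.
    Returns the weight vector w and the bias b.
    """
    rng = random.Random(seed)
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    t0 = 1.0 / (lam * eta0)        # assumption: makes the first step equal eta0
    t = 0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * (t0 + t))
            margin = y[i] * (sum(wj * xj for wj, xj in zip(w, X[i])) + b)
            violated = margin < 1.0            # hinge loss is active
            for j in range(n_features):
                grad_reg = rho * w[j] + (1.0 - rho) * sign(w[j])
                grad_loss = -y[i] * X[i][j] if violated else 0.0
                w[j] -= eta * (lam * grad_reg + grad_loss)
            if violated:
                b += eta * y[i]                # bias is not regularized
    return w, b


def predict(w, b, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


# Linearly separable toy data.
X = [[2, 2], [3, 3], [2, 3], [-2, -2], [-3, -3], [-2, -3]]
y = [1, 1, 1, -1, -1, -1]
w_l2, b_l2 = sgd_linear_svm(X, y, rho=1.0)
w_en, b_en = sgd_linear_svm(X, y, rho=0.5)
```

Each update touches only one training example, which is what keeps the method cheap on the sparse, high-dimensional feature vectors produced by n-gram extraction.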
3. Coordinate Descent Method Using LibLinear's Linear Support Vector Machine
Another SVM was optimized using a different framework: a linear SVM implemented with LibLinear, a library developed by the machine learning group at National Taiwan University. LibLinear uses the Coordinate Descent method to optimize linear support vector machines.
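The width of the margin of an SVM can be represented by $\frac{2}{\|\mathbf{w}\|}$, and the objective is $\max_{\mathbf{w}, b} \frac{2}{\|\mathbf{w}\|}$. This poses a constrained optimization problem:
$$\min_{\mathbf{w}, b}\ \frac{1}{2} \|\mathbf{w}\|^2 \quad \text{subject to} \quad y_i (\mathbf{w}^T \mathbf{x}_i + b) \ge 1,\ i = 1, \dots, N$$
The Lagrange form of which is:
$$\mathcal{L}(\mathbf{w}, b, \alpha) = \frac{1}{2} \|\mathbf{w}\|^2 - \sum_{i=1}^{N} \alpha_i \left[ y_i (\mathbf{w}^T \mathbf{x}_i + b) - 1 \right], \quad \alpha_i \ge 0$$
The Lagrangian can be simplified (by setting its partial derivatives with respect to $\mathbf{w}$ and $b$ to zero) to the Wolfe dual form:
$$W(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j$$
Maximizing $W(\alpha)$ is equivalent to the minimization:
$$\min_{\alpha}\ f(\alpha) = \frac{1}{2} \alpha^T Q \alpha - e^T \alpha \quad \text{subject to} \quad 0 \le \alpha_i \le U$$
where $Q$ is a matrix with $Q_{ij} = y_i y_j \mathbf{x}_i^T \mathbf{x}_j$, $e$ is a vector of all ones, and $U$ is the upper bound on each $\alpha_i$ (infinite for the hard-margin problem above, finite when margin violations are penalized). Here the bias $b$ is absorbed into $\mathbf{w}$ through a constant feature, as LibLinear does, so no equality constraint on $\alpha$ remains.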
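The Coordinate Descent algorithm finds the minimum of a multivariate function $F(x)$ by solving univariate optimization problems iteratively through all its variables (inner iteration), and repeating this pass several times (outer iteration). With the following definition, $k$ being the outer iteration and $i$ being the inner iteration:
$$\alpha^{k,i} = [\alpha_1^{k+1}, \dots, \alpha_{i-1}^{k+1}, \alpha_i^{k}, \dots, \alpha_N^{k}]^T$$
each inner step solves the following univariate problem:
$$\min_{d}\ f(\alpha^{k,i} + d e_i) \quad \text{subject to} \quad 0 \le \alpha_i^{k} + d \le U$$
where $e_i = [0, \dots, 0, 1, 0, \dots, 0]^T$ is the $i$-th unit vector and $d$ is the scalar applied to it, representing a step in that direction. It becomes apparent that the minimum is found when $d = 0$, where there is nowhere to move to further minimize the function $f$, which is when
$$\nabla_i^{P} f(\alpha^{k,i}) = 0$$
where $\nabla^{P} f$ refers to the projected gradient.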
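The inner and outer iterations can be sketched as below for a hinge-loss linear SVM. This is a simplified illustration of dual coordinate descent and not LibLinear's code. It assumes the upper bound on each dual variable is a penalty parameter `C`, handles the bias by appending a constant feature, and stops when the largest projected gradient in a pass falls below `tol`. All names and the toy data are hypothetical.

```python
import random


def dual_cd_linear_svm(X, y, C=1.0, max_outer=100, tol=1e-6, seed=0):
    """Dual coordinate descent for a linear SVM with hinge loss.

    Returns (w, alpha); the last entry of w is the bias, because a constant
    feature 1.0 is appended to every sample.
    """
    rng = random.Random(seed)
    Xb = [list(x) + [1.0] for x in X]
    n, m = len(Xb), len(Xb[0])
    alpha = [0.0] * n
    w = [0.0] * m                      # maintained as sum_i alpha_i y_i x_i
    q_diag = [sum(v * v for v in x) for x in Xb]   # Q_ii, at least 1.0
    order = list(range(n))
    for _ in range(max_outer):         # outer iteration
        rng.shuffle(order)
        max_pg = 0.0
        for i in order:                # inner iteration over coordinates
            g = y[i] * sum(wj * xj for wj, xj in zip(w, Xb[i])) - 1.0
            # Projected gradient: zero when no feasible move can lower f.
            if alpha[i] <= 0.0:
                pg = min(g, 0.0)
            elif alpha[i] >= C:
                pg = max(g, 0.0)
            else:
                pg = g
            max_pg = max(max_pg, abs(pg))
            if pg != 0.0:
                new_alpha = min(max(alpha[i] - g / q_diag[i], 0.0), C)
                delta = (new_alpha - alpha[i]) * y[i]
                for j in range(m):
                    w[j] += delta * Xb[i][j]
                alpha[i] = new_alpha
        if max_pg < tol:
            break
    return w, alpha


def predict(w, x):
    score = sum(wj * xj for wj, xj in zip(w[:-1], x)) + w[-1]
    return 1 if score >= 0 else -1


X = [[2, 2], [3, 3], [2, 3], [-2, -2], [-3, -3], [-2, -3]]
y = [1, 1, 1, -1, -1, -1]
w, alpha = dual_cd_linear_svm(X, y, C=1.0)
```

Keeping `w` up to date after every coordinate step means each univariate problem costs only one pass over a single feature vector, which suits sparse data.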
4. Hyperparameter Optimization
Machine learning models depend on settings such as constraints, the learning rate, and more that can affect the performance of the model. These settings are called hyperparameters, and they must be tuned for a model to produce optimal performance. The method we implemented was Grid Search, which is demonstrated by Algorithm 1.
The benefits of this process are visualized in Figure 4. Darker shades of red indicate a higher F1 score for the model with the corresponding combination.
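A generic grid search can be sketched as follows. The function evaluates every combination in the grid and keeps the one with the best score. The `evaluate` callback is a hypothetical stand-in for "train a classifier with these hyperparameters and return its F1 score on held-out data", and the toy scoring function below is synthetic, used only to make the example runnable.

```python
from itertools import product


def grid_search(param_grid, evaluate):
    """Try every combination of hyperparameter values.

    param_grid: dict mapping a hyperparameter name to a list of candidates.
    evaluate:   function taking a dict of hyperparameters, returning a score.
    Returns (best_params, best_score).
    """
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    # Synthetic score that peaks at n = 9 and rho = 0.5; not a real result.
    return -(params["n"] - 9) ** 2 - (params["rho"] - 0.5) ** 2


grid = {"n": [8, 9, 10], "rho": [0.0, 0.5, 1.0]}
best_params, best_score = grid_search(grid, toy_score)
```

The cost grows with the product of the list lengths, so the grid is usually kept coarse.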
1. Splitting Training & Testing Set
To create a realistic scenario for testing, the corpus was split into a training set, on which the classifiers were trained, and a testing set, which the classifiers had not seen before. This way the results of the classifiers can be trusted: no system call log in the testing data was encountered during training, so the classifiers cannot classify it simply by having seen it before.
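A minimal sketch of such a split is shown below. The function name is hypothetical, a fixed seed is used so that the split is repeatable, and the default of holding out 20% follows the proportion reported in the results.

```python
import random


def train_test_split(samples, labels, test_fraction=0.2, seed=0):
    """Shuffle the corpus and hold out a fraction of it for testing."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(samples) * test_fraction))
    test_idx = set(indices[:n_test])
    train_x = [samples[i] for i in indices if i not in test_idx]
    train_y = [labels[i] for i in indices if i not in test_idx]
    test_x = [samples[i] for i in indices if i in test_idx]
    test_y = [labels[i] for i in indices if i in test_idx]
    return train_x, train_y, test_x, test_y


samples = ["trace%d" % i for i in range(10)]   # placeholder trace identifiers
labels = [i % 2 for i in range(10)]
train_x, train_y, test_x, test_y = train_test_split(samples, labels)
```

Every sample lands in exactly one of the two sets, so no trace used for testing has been seen during training.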
2. Precision Score
The precision score is the number of correctly identified malicious programs (true positives, $TP$) divided by the number of true positives plus false positives ($FP$):
$$\text{Precision} = \frac{TP}{TP + FP}$$
This provides a practical score for judging the performance of each classifier, as false positives can be as troublesome as false negatives in certain cases and are more noticeable to users during operation. Because a lopsided proportion of malicious traces may skew a score that only accounts for true positives, precision gives a better idea of the accuracy of the model.
3. Recall Score
The recall score is the number of true positives divided by the number of true positives plus false negatives ($FN$):
$$\text{Recall} = \frac{TP}{TP + FN}$$
This gives an intuitive rate of how many malicious programs were detected.
4. F1 Score
The F1 score is the harmonic mean of the precision score and the recall score:
$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
It offers a different view of the accuracy of each classifier, in which false positives and false negatives are both integrated. This acts as a more holistic means of comparison.
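The three scores can be computed from predicted and true labels as in the sketch below, where the malicious class is taken as the positive class. The function name and the five-sample example are hypothetical.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Return (precision, recall, f1) for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# 1 = malicious, 0 = benign: two true positives, one false positive,
# one false negative, one true negative.
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

In this example all three scores equal 2/3, since precision and recall coincide.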
The dataset was divided in a 4:1 ratio; 80% of the traces were used to train the classifiers and 20% were used to test them. The proportion of malicious traces in the testing set was 63.7%, and the scores provided below need to be judged with this proportion in mind. Table 1 shows the results of the primal SVM optimized using Stochastic Gradient Descent and Table 2 shows the results of using LibLinear to optimize the dual form of SVM using the Coordinate Descent method.
It is crucial for a malware detection program not only to detect malware but also to have a very low false positive rate. The classifiers were tuned accordingly, favoring a better precision score for benign data than for malware.
This TPR (True Positive Rate) vs. FPR (False Positive Rate) tradeoff can be represented through the ROC (Receiver Operating Characteristic) curve, which plots the values of TPR and FPR at different decision thresholds, shown in FIG. 5 for the SGD Classifier. FIG. 6 is the ROC curve for the SVM implemented using LibLinear.
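The points of an ROC curve can be computed from classifier scores as sketched below. The sketch assumes each trace has a real-valued score, such as the value of the decision function, where a higher score means "more likely malicious". The function name and the example scores are made up.

```python
def roc_points(y_true, scores):
    """Return the (FPR, TPR) pairs obtained by sweeping the decision threshold.

    y_true: labels, 1 for malicious and anything else for benign.
    scores: classifier scores, higher meaning more likely malicious.
    """
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    points = [(0.0, 0.0)]              # threshold above every score
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t != 1)
        points.append((fp / n_neg, tp / n_pos))
    return points


y_true = [1, 1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.2]
points = roc_points(y_true, scores)
```

Lowering the threshold moves along the curve from (0, 0) toward (1, 1), trading a higher detection rate for more false alarms.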
Furthermore, refer to Table 2 to compare the effectiveness of the options in the feature extraction process, such as TF-IDF weighting and the use of n-grams. The average values were computed by calculating the mean of the respective scores of the classifiers listed previously in Figure 4.
| Feature extraction | Avg. Precision | Avg. Recall |
|---|---|---|
| TFIDF & 10-Grams | 0.93 | 0.92 |
| TFIDF & Unigrams | 0.90 | 0.89 |
| Term Frequency & 10-grams | 0.78 | 0.78 |
This supports the effectiveness of these options not only in document classification tasks but also when document classification is applied to system call traces.
Furthermore, it is also observed that SVM classifiers were able to classify the traces in the testing set in the shortest time, which is one of the reasons why SVMs are often preferred in document classification tasks. Table 3 shows their training and testing time in comparison to other classifiers.
| Classifier | Training Time (s) | Testing Time (s) |
|---|---|---|
| L2 Penalty SVM SGD | 0.098 | 0.000 |
| Bernoulli Naive Bayes | | |
Although these are small absolute differences, as the classification task scales, these differences may amplify to produce a considerable difference.
This paper demonstrates the effectiveness of applying document classification techniques to system call traces for the purpose of detecting malicious programs. The effectiveness and running times of certain algorithms, as well as certain options popular in document classification, were compared to judge the validity of these options.
The classifiers trained in this research were also able to provide features (TF-IDF values of 10-grams of system calls) that they deemed to be the most significant. These system call sequences could be avenues for further research to determine the nature of these sequences.
The sequences in Table 4 were produced by the SVM classifier optimized by SGD using the L1 regularization term:
| Coefficient | System call sequence |
|---|---|
| 7.20687749597 | ntqueryinformationthread ntqueryinformationthread ntqueryinformationthread ntqueryinformationthread ntqueryinformationthread ntqueryinformationthread ntqueryinformationthread ntqueryinformationthread ntqueryinformationthread ntqueryinformationthread |
| 6.41295595185 | ntmapviewofsection ntunmapviewofsection ntmapviewofsection ntunmapviewofsection ntmapviewofsection ntunmapviewofsection ntmapviewofsection ntunmapviewofsection ntmapviewofsection ntunmapviewofsection |
| 4.75759889433 | ntsetinformationfile ntreadfile ntsetinformationfile ntreadfile ntsetinformationfile ntreadfile ntsetinformationfile ntreadfile ntsetinformationfile ntreadfile |
| 4.75759889433 | ntreadfile ntsetinformationfile ntreadfile ntsetinformationfile ntreadfile ntsetinformationfile ntreadfile ntsetinformationfile ntreadfile ntsetinformationfile |
Greater coefficient values indicate greater relevance for the specific classifier. These features were a few among the 237,588 sequences that the classifiers took into consideration in classifying these logs of system calls. Table 5 shows the sequences produced by the same classifier as that of Table 4, but using the L2 penalty.
| Coefficient | System call sequence |
|---|---|
| 3.0649184842 | ntdelayexecution ntdelayexecution ntdelayexecution ntdelayexecution ntdelayexecution ntdelayexecution ntdelayexecution ntdelayexecution ntdelayexecution ntdelayexecution |
| 2.07007999086 | ntgetcurrentprocessornumber ntgetcurrentprocessornumber ntalpcsendwaitreceiveport ntgetcurrentprocessornumber ntgetcurrentprocessornumber ntalpcsendwaitreceiveport ntgetcurrentprocessornumber ntgetcurrentprocessornumber ntalpcsendwaitreceiveport ntgetcurrentprocessornumber |
| 1.55237596748 | ntdeviceiocontrolfile ntclose ntcreateevent ntdeviceiocontrolfile ntclose ntcreateevent ntdeviceiocontrolfile ntclose ntcreateevent ntdeviceiocontrolfile |
| 1.52963083335 | ntclose ntcreateevent ntdeviceiocontrolfile ntclose ntcreateevent ntdeviceiocontrolfile ntclose ntcreateevent ntdeviceiocontrolfile ntclose |
/NtMalDetect) is an open source project that packages the classifiers trained in this research into an executable form. It uses boosted classifiers, which combine the outputs of several classifiers into one output, to detect malicious programs before and after execution.
Thank you to the staff members and colleagues at Shanghai American School Puxi Campus and the members of Coderbunker for providing the resources and guidance needed to conduct this research and enter this project in the Intel International Science and Engineering Fair. This project has been recognized with the Yale Science and Engineering Award and was named a finalist project at the Intel International Science and Engineering Fair (ISEF). At Intel ISEF, this project received special awards from the King Abdulaziz & His Companions Foundation for Giftedness and Creativity, "MAWHIBA", and from the China Association for Science and Technology, which awarded it $1200 and $1000, respectively. At the fair, this project also received the 4th place Grand Award in the Systems Software category, with a prize of $500.
Sequencegram: n-gram modeling of system calls for program based anomaly detection. In Communication Systems and Networks (COMSNETS), 2011 Third International Conference on, pages 1–10. IEEE, 2011.
Intrusion detection model using fusion of chi-square feature selection and multi class svm. Journal of King Saud University-Computer and Information Sciences, 29(4):462–472, 2017.