Dynamic Malware Analysis with Feature Engineering and Feature Learning

07/17/2019, by Zhaoqi Zhang et al., National University of Singapore

Dynamic malware analysis executes the program in an isolated environment and monitors its run-time behaviour (e.g., system API calls) for malware detection. This technique has been proven to be effective against various code obfuscation techniques and newly released ("zero-day") malware. However, existing works typically only consider the API name while ignoring the arguments, or require complex feature engineering operations and expert knowledge to process the arguments. In this paper, we propose a novel and low-cost feature extraction approach, and an effective deep neural network architecture for accurate and fast malware detection. Specifically, the feature representation approach utilizes a feature hashing trick to encode the API call arguments associated with the API name. The deep neural network architecture applies multiple Gated-CNNs (convolutional neural networks) to transform the extracted features of each API call. The outputs are further processed through an LSTM (long short-term memory network) to learn the sequential correlation among API calls. Experiments show that our solution outperforms baselines significantly on a large real dataset. Valuable insights about feature engineering and architecture design are derived from an ablation study.


I Introduction

Cybersecurity imposes substantial economic costs all over the world. A report [1] from the United States estimates that the cost of malicious cyber activity to the U.S. economy was between $57 billion and $109 billion in 2016. Malware is one of the major cybersecurity threats, and it evolves rapidly. It is reported that more than 120 million new malware samples are discovered every year [2]. Therefore, the development of malware detection techniques is urgent and necessary.

Researchers have been working on malware detection for decades. The mainstream solutions include static analysis and dynamic analysis. Static analysis algorithms scan the binary code (or scripts) of the malware to create signatures [3, 4, 5] (e.g., printable strings, n-grams, instructions) for malware matching, or extract features for training malware recognition models. For the signature-matching approaches, the detection performance heavily depends on the size of the signature database [6, 7]. The training-based approaches might be vulnerable to code obfuscation [8] or inadequate for detecting new ("zero-day") malware with different features [9]. In contrast, dynamic analysis algorithms execute each sample in an isolated environment to collect its run-time behaviour for detection. Despite the extra time cost of executing the samples, dynamic analysis typically achieves a higher detection rate and is more robust than static analysis [10, 11, 12]. In this paper, we focus on dynamic analysis.

Among the collected run-time information, the system API call sequence is the most popular data source for dynamic analysis, as it captures all the operations (including network access, file manipulation, etc.) executed by the program. Each API call in the sequence consists of the API name, the argument names, and the argument values. Various machine learning models have been trained to recognize malware based on the API call sequence, such as Naive Bayes [13], Support Vector Machines (SVM) [14], and Random Forests [15]. To apply these models, feature engineering is conducted to extract features from the API sequence. For example, if we consider each API name as a gram, the N most frequent n-grams can be extracted from the sequence as features. However, it is non-trivial to extract features from arguments of heterogeneous types, including strings, integers, addresses, etc. Recently, researchers have applied deep learning models [16, 17] for dynamic analysis. Deep learning models like convolutional neural networks (CNN) and recurrent neural networks (RNN) can learn features from sequential data directly without feature engineering. Nonetheless, the data of traditional deep learning applications like computer vision and natural language processing is homogeneous, e.g., images or text. It is still challenging to process the heterogeneous API arguments using deep learning models. Therefore, most existing approaches ignore the arguments, leaving a lot of information unexploited.

There are a few approaches [15, 18, 19] that leverage API arguments. [15, 19] simply treat all argument names and values as text strings, and then extract the N most frequent n-grams as the features. Consequently, the heterogeneous information from different types of arguments is not fully exploited. [18] maintains a list of strings that stores the API names and the strings in the arguments, and uses the bits of a feature vector to indicate whether a particular string is present. This method only covers the most frequent strings, and other heterogeneous information is ignored.

In this paper, towards more effective dynamic malware analysis, we propose novel feature engineering methods for system API arguments and a new deep learning architecture for malware detection. In particular, for each type of argument, we propose a corresponding hashing approach to extract the features. The argument features and the features extracted (also by hashing) from the API name and category are concatenated and fed into the deep learning model. Deep learning models have the useful property that their capacity can be easily tuned by changing the architecture to match the complexity of the data. In addition, the model parameters can be trained to transform the input features into a better representation for the task of interest. We use multiple gated CNN models [20] to transform the hashed high-dimensional features of each API call. The output of the CNN models, i.e., a sequence of transformed features, is processed by an RNN to extract the sequential correlation for the final binary prediction, i.e., malware or benign program.

We conduct experiments over a large dataset of portable executable (PE) files provided by a security company. A distributed dynamic analysis framework is implemented to speed up the process, which uses Cuckoo (https://cuckoosandbox.org/) to collect the dynamic trace of each program in parallel. Our solution outperforms all baselines by a large margin. Through an extensive ablation study, we find that both feature engineering and model architecture design (e.g., using batch normalization [21] and LSTM) are crucial for achieving high generalization performance.

The main contributions of this paper are:

  1. We propose a novel feature representation for system API arguments. The extracted features will be released for public access.

  2. We devise a deep neural network architecture to process the extracted features, which combines multiple gated CNNs and an RNN. It outperforms all existing solutions by a large margin.

  3. We conduct extensive experiments over a large real dataset. Valuable insights about the feature and model architecture are found through ablation study.

The rest of this paper is organized as follows. Section 2 introduces related work on malware detection. Section 3 describes the proposed system. The methodology, including feature representation and model architecture, is covered in Section 4. The experiments are presented in Section 5 and conclusions follow in Section 6.

II Related Work

Fig. 1: DMDS Architecture

In this section, we review dynamic malware analysis from two perspectives, feature engineering methods and deep learning models, in accordance with our contributions of a new feature extraction method and a new deep learning architecture respectively.

II-A Feature Engineering for System API Calls

We review various feature representations of system API calls below. They are typically fed into traditional machine learning models for dynamic analysis, such as clustering algorithms, SVM, random forests, etc. In our method, we use hashing tricks to extract the features, which requires less domain knowledge than some existing methods.

Bayer et al. [22] extend a controlled virtual environment called ANUBIS to collect a sample's execution trace. An 8-tuple is constructed as the representation, which consists of the system call's name, corresponding objects such as files, and the dependencies between these system calls and objects.

Trinius et al. [23] introduce a feature representation called Malware Instruction Set (MIST). MIST uses several levels of features to represent a system call. The first level represents the category and name of the API call. The following levels are specified manually for each API call to represent its arguments. Therefore, features from the same level but for different APIs could have different semantics. This inconsistency makes it challenging to learn patterns with machine learning models. Qiao et al. [24] extend MIST and propose a representation called Byte-based Behavior Instruction Set (BBIS). They claim that only the first level of MIST is efficient, so BBIS uses only the category and name of API calls. Besides, they propose an algorithm (CARL) to process consecutively repeated API calls.

Statistical features are popular for training machine learning models. API call names and their arguments are treated as separate strings in [15, 25, 26]; string frequencies and the distribution of API name lengths are extracted as features. Ahmed et al. [13] also use statistical features that capture both spatial and temporal information. Spatial information is extracted from the arguments, such as the mean, variance, and entropy of some arguments. Temporal information is extracted from the n-grams of the API call sequence, including the correlation and transformation possibility between two n-grams.

Salehi et al. [27] propose a feature representation that associates the API call sequence with the arguments. Each argument is bound to its API call to form a new sequence. However, this approach leads to an extremely long feature vector and might lose the pattern of the API call sequence. Hansen et al. [28] propose another two feature representations, which consist of the first 200 API calls as well as their "argument". However, this "argument" only indicates whether an API call is connected with the later one, so it might not retain sufficient information from the arguments.

II-B Deep Learning Based Approaches

David and Netanyahu [29] treat the sandbox report as one long text string, split it by any special character, count the frequency of each resulting string, and keep only the 20,000 most frequent ones, represented as a 20,000-bit vector. Their model is a deep belief network (DBN) of eight layers, from 20,000-dimensional vectors down to 30-dimensional vectors, with a softmax layer appended on top. Trained with the cross-entropy loss, the model attains 98.6% accuracy on a small dataset with 600 test samples.

Pascanu et al. [16] propose a two-stage approach with a feature learning stage and a classification stage. In the first stage, they use recurrent neural networks (RNNs) to predict the next possible API call based on the previous API call sequence. In the classification stage, they freeze the RNNs and feed the outputs into a max-pooling layer to aggregate the features for classification. They attain a 71.71% recall rate at a false positive rate of 0.1% on a dataset with 75,000 samples. Tobiyama et al. [30] use vanilla RNNs to extract features, which are arranged into a matrix like a grey-scale image; a CNN is then applied to classify the matrix.

Kolosnjaji et al. [17] propose an approach that combines convolutional neural networks (CNN) with RNNs. It stacks two CNN layers, each using a filter of size 3 to simulate the 3-gram approach. A long short-term memory (LSTM) layer with a 100-dimensional hidden vector is appended to handle the time-series sequence.

The previous three papers only use the API call sequence and ignore the arguments. Huang and Stokes [31] use a feature representation consisting of three parts: the presence of unpacked code fragments in the arguments, the combination of the API call name with one of its arguments (selected manually), and the 3-grams of the API call sequence. This representation has about 50,000 features, which is reduced to 4,000 dimensions by random projection. They claim that, for the first time, a deep learning model (i.e., an RNN) outperforms the shallow architecture proposed by Dahl et al. [32]. Agrawal et al. [19] also use the API call sequence and the arguments. Their feature representation consists of a one-hot vector for the API call name and the N most frequent n-grams of the argument strings. The model uses several stacked LSTMs and shows better performance than [17]. They also report that stacking more LSTMs does not improve the performance.

In our solution, we transform the input features extracted from the API name, category and arguments using multiple gated CNNs (with batch normalization). The transformed features are processed by an LSTM to learn the sequential pattern.

III DMDS Framework

Fig. 1 depicts the overview of DMDS (Dynamic Malware Detection System). Our system consists of three main parts: data collection, dynamic analysis, and classifier learning. The workflow starts from data collection with multiple sources. Once new portable executable (PE) samples are collected, an execution queue manages their execution. The dynamic analysis engine of our system is Cuckoo [33], a leading open source automated malware analysis system that can manage two types of machine clusters, a virtual machine cluster and a physical machine cluster. Each execution task is assigned to an instance of a cluster, and an execution log is generated after execution. These logs are then processed by a multi-threaded service to form the feature representation. Based on the feature vectors, a deep neural network model is trained and updated accordingly.

III-A Data Collection

The data collection part of DMDS has been implemented and deployed by a local anti-virus company (SecureAge Technology). SecureAge maintains a platform with 12 anti-virus engines to collect and label the PE files. The collected data are then fed into DMDS to optimize its model. In the future, once DMDS is stable, it will be deployed as a new engine. DMDS is an online model: as the dataset grows, it updates itself iteratively.

An execution queue is maintained to organize all submitted tasks. It monitors the storage usage and decides whether to submit more tasks. A task queue with atomic push and pop operations provides a lock-free solution for the distributed system. With a moderate priority design, the execution queue reduces the waiting time of user-submitted tasks.

III-B Dynamic Analysis

In this system, a dynamic malware analysis engine, Cuckoo, is used to gather the execution logs of malware. Cuckoo is an open source, self-hosted, automated sandbox. It executes each PE sample inside a sandbox (a virtual machine or a physical machine) and applies API hooks to monitor which APIs are called by the sample. Therefore, the actions of the entry process, as well as those of any child processes it spawns, are captured. Furthermore, Cuckoo provides a REST API for distributed deployment (Distributed Cuckoo), which we use in our system together with the execution queue.

In the Cuckoo sandbox, we deploy two types of clusters, a virtual machine cluster and a physical machine cluster. The virtual machine cluster is cheap and easy to deploy: each physical host can run dozens of virtual machines. All virtual machines are installed with a 64-bit Windows 7 system and several daily-use software packages, and we leverage the snapshot feature to roll each virtual machine back to its original state after execution. However, several papers report that some malware can detect that it is running in a sandbox and suppress its malicious actions [34, 35]. We therefore also deploy a physical machine cluster, which is more complicated and time-consuming to operate: each sandbox occupies a physical machine exclusively, and an image cloning server is required to restore the machine after execution, which takes considerable time. After execution in the sandbox, all generated execution logs are stored locally on the Cuckoo server.

III-C Classifier Learning

The execution logs generated by the sandbox contain a great deal of detailed information and can be several times larger than the original file. Because the logs are stored on the Cuckoo server, unprocessed logs consume considerable space and delay the execution of later tasks. Therefore, we design a multi-threaded feature extraction service to process these logs.

After the feature vectors are extracted, they are split into training, validation, and test datasets. The trained models are stored on a model server. When the system is deployed, a user-submitted program is processed by a model from the model server. The details of the feature processing and the model architecture are presented in Section 4.

IV Methodology

In this section, we describe the methodology of DMDS, including the novel feature extraction process and the new deep learning architecture for malware classification.

IV-A Feature Representation

Feature Type    | Details                                               | Dim
----------------+-------------------------------------------------------+----
API Events      | API name: internal words, hashing trick               |  8
                | API category: hashing trick                           |  4
API Arguments   | Integers: hashing trick                               | 16
                | Strings, paths: hashing trick with hierarchy          | 16
                | Strings, DLLs: hashing trick with hierarchy           |  8
                | Strings, registry keys: hashing trick with hierarchy  | 12
                | Strings, URLs: hashing trick with hierarchy           | 16
                | Strings, IPs: hashing trick with hierarchy            | 12
                | String statistics: numStrings, avLength, numChars,    |
                | entropy, numPaths, numDlls, numUrls, numIPs,          |
                | numRegistryKeys, numMZ                                | 10
TABLE I: Feature representation overview

As discussed earlier, most previous works [24, 16, 17] only encode the API call sequence and ignore the arguments. This incurs the problem that an API event (the name and category of an API call) carries the same meaning wherever it occurs in the sequence. Because the arguments of the API call have been discarded, a single API event loses its unique profile [19]. For example, a write-file operation might be benign when the target file was created by the PE sample itself, but malicious when the target file was created by the system or other software. The few works that consider both arguments and API events either cannot provide a homogeneous, aligned representation [23] or rely heavily on Natural Language Processing (NLP) prior knowledge [19, 31].

We propose a feature representation that leverages the feature hashing method [36] to encode each API event and its arguments into a homogeneous feature vector. As shown in Table I, our feature representation consists of different types of information. The API events part has 12 bins, 8 for the API name and 4 for the API category. The API arguments part has 90 bins, 16 for the integer arguments and 74 for the string arguments: several specific types of strings (file paths, DLLs, etc.) are hashed into 64 bins, and 10 statistical features are extracted in addition. All these features are concatenated to form a 102-dimensional feature vector.

IV-A1 API Events

The Cuckoo sandbox supports monitoring 312 API calls in total, which belong to 17 categories [33]. Each API name consists of multiple words with the first letter of each word capitalized. We first split the API name into its words, which are then processed using the feature hashing trick with 8 bins. For the API category, all the characters of the category string are processed using the hashing trick with 4 bins. In addition, we use the MD5 value of each API event and its arguments to remove repeated API events.

The main technique we use to encode a sequence of strings into a fixed-length vector is feature hashing (the hashing trick). Let $x$ denote an input vector consisting of a sequence of elements $x_1, \dots, x_n$ (either strings or characters), and let $M$ denote the number of bins, i.e., 8 for the API name and 4 for the API category. Then the value of the $i$-th bin is calculated by:

$$\phi_i(x) = \sum_{j:\, h(x_j)=i} \xi(x_j) \tag{1}$$

where $h$ is a hash function that maps an element, e.g., $x_j$, to a natural number in $\{1, \dots, M\}$ serving as the bin index, and $\xi$ is a hash function that maps an element to $\{\pm 1\}$. That is, for each element $x_j$ whose bin index $h(x_j)$ is $i$, we calculate $\xi(x_j)$ and add it into the $i$-th bin.
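To make the encoding concrete, below is a minimal Python sketch of Equation 1 applied to an API event. The choice of MD5 as the underlying hash, the sign trick of hashing a prefixed copy of the element, and the CamelCase splitter are our own illustrative choices, not prescribed by the paper.

```python
import hashlib
import re

def _hash(s: str, mod: int) -> int:
    """Map a string to an integer in [0, mod) via MD5 (illustrative choice)."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % mod

def feature_hash(elements, num_bins):
    """Equation 1: phi_i(x) = sum over j with h(x_j) = i of xi(x_j)."""
    vec = [0.0] * num_bins
    for e in elements:
        i = _hash(e, num_bins)                         # h: bin index
        sign = 1.0 if _hash("#" + e, 2) else -1.0      # xi: maps to {+1, -1}
        vec[i] += sign
    return vec

def split_api_name(name: str):
    """Split a CamelCase API name such as 'NtCreateFile' into its words."""
    return re.findall(r"[A-Z][a-z0-9]*", name)

# Example: encode the API name into 8 bins and the category into 4 bins.
name_vec = feature_hash(split_api_name("NtCreateFile"), 8)
category_vec = feature_hash(list("file"), 4)  # category hashed character-wise
```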

IV-A2 API Arguments

As for the API arguments, there are only two types of values: integers and strings. The individual value of an integer is meaningless; it requires the argument name to indicate its meaning, and the same integer may mean something totally different under a different argument name. For example, the number 22 with the name "port" is totally different from the same number with the name "size". We apply feature hashing to form a 16-dimensional vector that encodes the name of each integer argument together with its value: the argument name is hashed to select the bins, the logarithm of the value is multiplied into those bins, and the contributions of multiple integer arguments are added together. Specifically, the $i$-th element of the feature vector is:

$$\phi_i(x) = \sum_{j:\, h(name_j)=i} \xi(name_j)\,\log(value_j + 1) \tag{2}$$

where $h$ and $\xi$ are the same hash functions as in Equation 1, $name_j$ is the argument name, and $value_j$ is the argument value, i.e., an integer.
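A corresponding sketch of Equation 2 follows. Reading "the logarithm of its value" as log(value + 1), to keep zero values finite, is our assumption, and the example argument dictionary is hypothetical.

```python
import hashlib
import math

def _hash(s: str, mod: int) -> int:
    """Same illustrative MD5-based hash as in the previous sketch."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % mod

def hash_integer_args(args, num_bins=16):
    """Equation 2: hash each argument name, weighted by log(value + 1)."""
    vec = [0.0] * num_bins
    for name, value in args.items():
        i = _hash(name, num_bins)                      # h(name): bin index
        sign = 1.0 if _hash("#" + name, 2) else -1.0   # xi(name) in {+1, -1}
        vec[i] += sign * math.log(abs(value) + 1)      # log-scale the value
    return vec

# Hypothetical integer arguments of a single API call.
int_vec = hash_integer_args({"port": 22, "size": 4096})
```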

String values are more complicated than integers. Some strings starting with "0x" contain the addresses of objects; others may contain file paths, IP addresses, URLs, or even plain text. Moreover, some API calls allow the user to write a new file from a string argument, so a string can even contain an entire file. This variety makes strings more complicated to handle. Based on previous work [15, 25, 26, 13], the most important strings are the values of file paths, DLLs, registry keys, URLs, and IP addresses, so we extract these types of strings and apply feature hashing to extract features from them.

To capture the hierarchy information contained in these strings, we parse the whole string into several substrings and process them individually. For example, we use "C:
"
to identify a file path. For a path like "C:
a
b
c"
, four substrings would be parsed, "C:", "C:
a"
, "C:
a
b"
, and "C:
a
b
c"
. All these substrings use the hashing trick of Equation 1 to hash into the corresponding bins. The same parsing method is applied at DLLs, registry keys and IPs. The DLLs are strings always ending with ”.dll”. The registry keys often start with ”HKEY_”. IPs are those strings with four numbers (range from 0 to 255) separated by dots. The parsing method is slightly different for URLs. The word in a URL that appears later often represents a lower level or broader category, so we treat the URLs in the reverse order. For example, "https://cikm2016.cs.iupui.edu/" will be parsed as ”edu”, "iupui.edu", "cs.iupui.edu" and "cikm2016.cs.iupui.edu".
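The hierarchy-aware parsing can be sketched as follows; the helper names are ours, and each returned substring list would then be hashed with Equation 1 into its own block of bins (16 for paths, 16 for URLs; see Table I).

```python
def path_prefixes(path: str):
    """'C:\\a\\b\\c' -> ['C:', 'C:\\a', 'C:\\a\\b', 'C:\\a\\b\\c']."""
    parts = path.split("\\")
    return ["\\".join(parts[:k]) for k in range(1, len(parts) + 1)]

def url_suffixes(url: str):
    """Reverse-order expansion: later URL parts are broader categories."""
    host = url.split("//")[-1].strip("/")
    parts = host.split(".")
    return [".".join(parts[-k:]) for k in range(1, len(parts) + 1)]

# Matches the example in the text.
assert url_suffixes("https://cikm2016.cs.iupui.edu/") == [
    "edu", "iupui.edu", "cs.iupui.edu", "cikm2016.cs.iupui.edu"]
```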

Apart from the above specific strings, there are many other types of strings. Based on previous work [15, 25, 26, 13], we extract statistical information from all the printable strings. A printable string consists of characters ranging from 0x20 to 0x7f; therefore, all the paths, registry keys, URLs, IPs, and other printable strings are included. One other notable type of string starts with "MZ" and is often a buffer that contains a whole executable file; such strings usually occur in malicious behaviours such as thread injection [37]. Thus, a 10-dimensional vector is used to record the number of strings, their average length, the number of characters, the entropy of characters across all printable strings, and the numbers of paths, DLLs, URLs, registry keys, IPs, and "MZ" strings.
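A sketch of how the 10-dimensional statistics vector could be assembled, assuming the typed string lists (paths, DLLs, etc.) come from the extraction rules above; the character-level entropy formula is our reading of "the entropy of characters across all printable strings".

```python
import math
from collections import Counter

def string_stats(strings, paths, dlls, urls, ips, reg_keys):
    """10-dim vector: counts, average length, characters, entropy, typed counts."""
    num_chars = sum(len(s) for s in strings)
    counts = Counter(c for s in strings for c in s)
    entropy = -sum((n / num_chars) * math.log2(n / num_chars)
                   for n in counts.values()) if num_chars else 0.0
    return [
        len(strings),                                  # numStrings
        num_chars / len(strings) if strings else 0.0,  # avLength
        num_chars,                                     # numChars
        entropy,                                       # entropy
        len(paths), len(dlls), len(urls),              # numPaths/Dlls/Urls
        len(ips), len(reg_keys),                       # numIPs/numRegistryKeys
        sum(s.startswith("MZ") for s in strings),      # numMZ
    ]
```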

IV-B Model Architecture

Fig. 2: An illustration of the proposed model

Based on our feature representation method, we present a deep neural network architecture that leverages both the API events and their arguments. Fig. 2 gives an overview of the proposed model.

The input shape of a sample (program) is $(N, d)$, where $N$ is the number of API calls in the sample and $d$ is the dimension of each extracted API feature vector, i.e., $d = 102$. We first normalize the input with a batch normalization layer [21], which normalizes the input values by subtracting the batch mean and dividing by the batch standard deviation. This ensures that no dimension of the feature vector is so large that it dominates training; it also has a slight regularization effect, which is validated in the experiments.

Then, several gated CNNs [20] are applied. Gated CNNs have been shown to be competitive with recurrent models on language tasks while consuming fewer resources and less time. In each gated CNN, the input is fed into two convolution layers, each with 128 filters. Let $A$ denote the output of the first convolution layer and $B$ denote the output of the second one; they are combined by $A \otimes \sigma(B)$, where $\otimes$ is element-wise multiplication and $\sigma$ is the sigmoid function, so $\sigma(B)$ is the gate that controls what information from $A$ is passed to the next layer in the model. Conceptually, the gating mechanism is important because it allows the selection of important and relevant information [38]. Following the idea in [39], 1-D convolutional filters are used as n-gram detectors. To capture 2-gram and 3-gram information, the kernel sizes in the two gated CNN components are configured as 2 and 3 respectively, with stride 1.
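A minimal PyTorch sketch (our choice of framework) of one gated CNN component as described: two parallel 1-D convolutions with 128 filters, combined as A ⊗ σ(B). The "same" padding is our assumption, made to keep the sequence length unchanged.

```python
import torch
import torch.nn as nn

class GatedCNN(nn.Module):
    """Gated linear unit over an API-call sequence: A * sigmoid(B)."""
    def __init__(self, in_dim: int, filters: int = 128, kernel: int = 2):
        super().__init__()
        # Two parallel 1-D convolutions over the time (API call) axis.
        self.conv_a = nn.Conv1d(in_dim, filters, kernel, stride=1, padding="same")
        self.conv_b = nn.Conv1d(in_dim, filters, kernel, stride=1, padding="same")

    def forward(self, x):            # x: (batch, in_dim, seq_len)
        a = self.conv_a(x)           # candidate features A
        b = self.conv_b(x)           # gate logits B
        return a * torch.sigmoid(b)  # element-wise gating

# Example: a batch of 4 samples, 102-dim features, 1000 API calls each.
out = GatedCNN(102, kernel=2)(torch.randn(4, 102, 1000))  # (4, 128, 1000)
```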

After the n-gram representations are concatenated, another batch normalization layer is applied to reduce overfitting before the result is fed into the LSTM layer to learn sequential patterns. LSTM is a recurrent neural network architecture in which several gates control the information flow, enabling it to capture long-term context [40, 41, 42, 43]. Similar to the neural models discussed in [19], one bidirectional LSTM with 100 hidden units is used to capture sequential behaviour patterns from the API events. Since the detection of malware relies on signals observed throughout the sequence, performing max pooling over the full sequence, instead of using only the final activation of the LSTM, helps retain the relevant activations learned throughout the sequence. Hence, a temporal global max pooling layer is applied to extract abstract features from the hidden vectors.

After the LSTM layer, a dense layer (64 units) with ReLU activation, a dropout layer with probability 0.5 (to reduce overfitting [44]), and a dense layer (1 unit) with sigmoid activation are appended to reduce the dimensionality and output the probability of malware.

Our model is supervised with the label associated with each file, which is used to measure the loss for training. The binary cross-entropy loss is used:

$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log \hat{y}_i + (1-y_i)\log(1-\hat{y}_i)\right] \tag{3}$$

where $y_i \in \{0,1\}$ is the label of the $i$-th sample and $\hat{y}_i$ is the predicted probability of malware. The optimization method we use is Adam.
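Putting the pieces together, below is a compact sketch of the full architecture, reusing the GatedCNN module above: input batch normalization, 2- and 3-gram gated CNNs, a second batch normalization, a bidirectional LSTM with 100 hidden units, temporal global max pooling, and the dense head. Hyperparameters follow the text where stated; everything else is illustrative.

```python
import torch
import torch.nn as nn

class MalwareNet(nn.Module):
    def __init__(self, d: int = 102, hidden: int = 100):
        super().__init__()
        self.bn_in = nn.BatchNorm1d(d)
        self.gcnn2 = GatedCNN(d, 128, kernel=2)   # 2-gram detector
        self.gcnn3 = GatedCNN(d, 128, kernel=3)   # 3-gram detector
        self.bn_mid = nn.BatchNorm1d(256)
        self.lstm = nn.LSTM(256, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Sequential(
            nn.Linear(2 * hidden, 64), nn.ReLU(),
            nn.Dropout(0.5), nn.Linear(64, 1), nn.Sigmoid(),
        )

    def forward(self, x):                  # x: (batch, seq_len, d)
        x = self.bn_in(x.transpose(1, 2))  # normalize each feature dimension
        x = torch.cat([self.gcnn2(x), self.gcnn3(x)], dim=1)
        x = self.bn_mid(x).transpose(1, 2)
        h, _ = self.lstm(x)                # (batch, seq_len, 2 * hidden)
        h = h.max(dim=1).values            # temporal global max pooling
        return self.head(h).squeeze(-1)    # probability of malware

model = MalwareNet()
loss_fn = nn.BCELoss()                     # Equation 3
optim = torch.optim.Adam(model.parameters())
```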

V Experiments

V-A Dataset

As described in Section 3.1, 12 commercial anti-virus engines are used to label whether a sample is malicious. If 4 or more engines agree that a sample is malicious, it is labelled positive. If none of the engines detects it as malicious, it is labelled negative. Samples flagged by only 1-3 engines are deemed inconclusive and are therefore discarded (not used for training or testing).
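The labelling rule can be stated precisely as a small function; the thresholds come directly from the text.

```python
def label(num_engines_flagging: int):
    """12 engines vote: >=4 -> malware, 0 -> benign, 1-3 -> discarded."""
    if num_engines_flagging >= 4:
        return 1      # positive (malware)
    if num_engines_flagging == 0:
        return 0      # negative (benign)
    return None       # inconclusive: excluded from training and testing
```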

The collected data are archived by date, and we use two months of data (April and May) in our experiments. All these PE samples are pushed into the execution queue of DMDS to collect the API call sequences. Since the physical machines take much longer, we only use the virtual machines to execute the PE samples for dataset collection. Table II summarizes the data, where each row reports the statistics of one month.

Dataset | Positive samples | Negative samples
April   | 15931            | 11417
May     | 11856            | 21983
TABLE II: Summary of the data

Because our dataset is collected in the wild, we evaluate the models as follows: we use cross-validation (CV) to train the models on the April dataset, and we evaluate them on the May dataset.

Type             | Approach               | Arguments | 4-fold CV: AUC / ACC / Recall   | Test: AUC / ACC / Recall        | Inference Time (ms/sample)
Machine Learning | Uppal et al. [45]      | No        | 96.1808% / 90.8922% /  1.3745%  | 94.0211% / 86.0071% /  1.6092%  |  98.5575
                 | Tian et al. [15]       | Yes       | 99.0966% / 95.8322% / 82.4162%  | 97.7732% / 93.1798% / 67.4472%  | 123.0334
                 | Fang et al. [18]       | Yes       | 98.6166% / 94.5826% / 53.5148%  | 97.0195% / 90.8757% / 41.7190%  | 116.0063
Deep Learning    | Pascanu et al. [16]    | No        | 95.3459% / 89.0551% /  9.1624%  | 50.6858% / 32.2058% /  0.6737%  |  94.2259
                 | Kolosnjaji et al. [17] | No        | 98.8247% / 95.3359% / 59.4759%  | 97.5785% / 93.3156% / 42.4899%  |  92.1873
                 | Agrawal et al. [19]    | Yes       | 99.0688% / 95.8684% / 77.7816%  | 98.1853% / 94.8553% / 60.1126%  | 257.5729
                 | Proposed Model         | Yes       | 99.4640% / 96.7637% / 88.7535%  | 98.7123% / 95.3328% / 71.4831%  | 129.2115
TABLE III: The experimental results
Fig. 3: Comparison of the ROC curves of different models: (a) validation ROC curve, (b) test ROC curve.

V-B Model Evaluation

To investigate the performance improvement of our proposed framework, we compare the proposed model with three machine learning based baselines and three deep learning based baselines. All experiments are conducted with the same dataset and configuration.

  • Uppal et al. [45] In this model, 3-grams of API call names are extracted and the most important ones are then selected by the odds ratio of each gram. An SVM is applied for the final classification.

  • Tian et al. [15] A hash table is used to indicate the presence of strings as features. These strings come from both API names and arguments. Their model is a random forest.

  • Fang et al. [18] It also uses the hashing trick to hash the API call names, return values, and module names (a part of the arguments) into fixed-size bins. The top n most important features are then selected by their proposed approach and fed into XGBoost.

  • Pascanu et al. [16] RNNs are first used to predict the next possible API call; the RNNs are then frozen and features are extracted from the hidden vectors for classification. The model's input is the one-hot vector of the API call name.

  • Kolosnjaji et al. [17] They propose a model that combines stacked CNNs with RNNs. The model's input is the one-hot vector of the API call name.

  • Agrawal et al. [19] Their feature representation consists of a one-hot vector for the API call name and the N most frequent n-grams of the argument strings. The model uses several stacked LSTMs.

We use the 4-fold cross-validation method to train and evaluate models on the April dataset to illustrate their ability to detect known malware. We also evaluate their performance on the May dataset to indicate their capability for detecting unknown malware.

Three metrics are considered: the ROC (receiver operating characteristic) AUC (area under the curve) score, ACC (accuracy), and Recall when the FP (false positive) rate is 0.1% or below. Recall is defined as the ratio of malware correctly identified as malware, while the FP rate is the ratio of benign software incorrectly identified as malware. Anti-virus products are required to keep a low false alarm rate to avoid disturbing users frequently [46, 47], so the models are expected to achieve a high recall at a fixed low false positive rate. In addition, the inference time per sample, which includes the time for feature processing and model prediction, is also taken into account.
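The recall-at-fixed-FPR metric can be computed from the ROC curve; a sketch using scikit-learn (our choice of library):

```python
from sklearn.metrics import roc_curve

def recall_at_fpr(y_true, y_score, max_fpr: float = 0.001):
    """Highest TPR (recall) achievable with FPR <= max_fpr (here 0.1%)."""
    fpr, tpr, _ = roc_curve(y_true, y_score)
    ok = fpr <= max_fpr
    return float(tpr[ok].max()) if ok.any() else 0.0
```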

From the experimental results in Table III, for both the cross-validation performance and the test performance, our proposed model achieves the best AUC score, accuracy, and recall among all baselines, which is also evident in Fig. 3, where the ROC curves of all models are displayed. Our efforts on extracting features from the API call sequence and on the design of the deep learning model have therefore made a difference.

In Fig. 3, the dashed curves are the ROCs of the traditional machine learning models, while the solid curves are those of the deep learning models. The results illustrate that the traditional machine learning approaches and the deep learning approaches are comparable. Although it is generally accepted that recurrent neural networks are the most advanced architecture for processing and classifying sequences, the features fed into them matter as well. Notably, the model of Tian et al. [15] adopts only a basic method to extract string information yet achieves quite good results, which indicates the importance of strings in feature processing; our model emphasizes them accordingly.

The results also show that models that take argument features into consideration generally outperform those without them. Adding argument features increases the test AUC score of the traditional machine learning methods by 3% and that of the deep learning methods by nearly 1%. Therefore, including API arguments is necessary and significant.

Since the training dataset is collected before the testing dataset, new malware variants emerged over time, which means that the distribution of software populations differs between the training and testing datasets. The proposed model achieves the best performance, which confirms its ability to detect new and constantly evolving malware.

As for the inference time, models with argument features take slightly longer than those without. The inference time taken by the models themselves is far less than the time taken by feature processing. Moreover, the proposed model uses the hashing method to avoid the double scan of the log file required by [19], roughly halving the feature processing time.

V-C Ablation Study

Apart from the data collection part, which has been implemented and deployed by SecureAge, the other parts of our framework have not yet been deployed in a production environment. In this section, we share the insights gained while investigating the effectiveness of each important component of our proposed model.

The proposed model consists of several components that can be flexibly adjusted, e.g., the gated CNNs, the LSTM, and the batch normalization layers. To explore the effects of different configurations of these components, we run several sets of comparison experiments, fixing the other components and varying only the component under test. The results of these experiments serve as the basis for the final model structure.

  • Gated CNNs Three sets of experiments are conducted: a single gated CNN with kernel size 2 (2-GatedCNN), two gated CNNs with kernel sizes 2 and 3 (2,3-GatedCNN), and three gated CNNs with kernel sizes 2, 3 and 4 (2,3,4-GatedCNN).

  • LSTM Three sets of experiments are conducted: the model without any LSTM, the model with one LSTM, and the model with two stacked LSTMs.

  • Batch normalization layers Four sets of experiments are conducted: the model without any batch normalization (BN) layers, the model with only the BN layer after the input (first BN), the model with only the BN layer after the gated CNNs (second BN), and the model with both BN layers.

Fig. 4: Comparison of AUC with different numbers of gated CNNs: (a) validation AUC, (b) test AUC.
Fig. 5: Comparison of AUC with different numbers of batch normalization layers: (a) validation AUC, (b) test AUC.
Fig. 6: Comparison of AUC with different numbers of LSTM layers: (a) validation AUC, (b) test AUC.

Fig. 4 depicts the comparison between different numbers of gated CNNs. 2-GatedCNN converges more slowly, while the other two configurations are hard to distinguish. This indicates that 2-GatedCNN might not provide sufficiently sophisticated features and thus requires more epochs to converge, while increasing the number of gated CNNs from two to three might not extract additional information. The best AUC scores of 2-GatedCNN and 2,3-GatedCNN are 98.7964% and 98.8559% respectively. We choose 2,3-GatedCNN as the final configuration of the proposed model.

The performance of the model with different numbers of batch normalization layers is displayed in Fig. 5. In the validation subfigure, during the first several epochs the model with both BN layers is slightly better than those with one BN layer, and 0-BatchNorm is the worst. Although the four curves come closer in later epochs, in the test subfigure the curve with both BN layers shows slightly superior performance, with a highest AUC score of 98.7964%.

As for the number of LSTM layers, Fig. 6 shows the validation and test performance for each configuration. In the validation AUC subfigure, the curve of 0-LSTM lies below the other two by a large margin, which indicates that the LSTM is necessary for processing the API sequence data. The other two curves are continuously interleaved, and it is hard to tell which is better. However, in the test AUC subfigure, although none of the three curves is entirely stable, 1-LSTM is slightly better than the others, with its highest point reaching 98.7964%. In addition, training the model with 1-LSTM is twice as fast as with 2-LSTM. Thus, we choose 1-LSTM as the final configuration of the proposed model.

VI Conclusion

In this work, we propose a novel feature engineering method for the API call sequence and its arguments, together with a new deep learning architecture for malware detection. We use hashing tricks to extract a homogeneous and low-cost feature representation across different types of arguments. In the proposed architecture, multiple gated CNNs transform the hashed high-dimensional features of each API call, and an LSTM captures the sequential patterns of the API call sequence. The experiments show that our approach outperforms all baselines. An ablation study over multiple architecture variations yields valuable insights on architecture design.

References

  • [1] The Council of Economic Advisers in Executive Office of the President of the United States, “The Cost of Malicious Cyber Activity to the U.S. Economy,” https://www.whitehouse.gov/wp-content/uploads/2018/03/The-Cost-of-Malicious-Cyber-Activity-to-the-U.S.-Economy.pdf, 2018.
  • [2] AV-TEST The Independend IT-Security Institute, “Security Report 2017/18,” https://www.av-test.org/fileadmin/pdf/security_report/AV-TEST_Security_Report_2017-2018.pdf, 2017.
  • [3] D. Wagner and R. Dean, “Intrusion detection via static analysis,” in Proceedings 2001 IEEE Symposium on Security and Privacy. S&P 2001.   IEEE, 2000, pp. 156–168.
  • [4] C. Kruegel, E. Kirda, D. Mutz, W. Robertson, and G. Vigna, “Polymorphic worm detection using structural information of executables,” in International Workshop on Recent Advances in Intrusion Detection.   Springer, 2005, pp. 207–226.
  • [5] Q. Zhang and D. S. Reeves, “Metaaware: Identifying metamorphic malware,” in Twenty-Third Annual Computer Security Applications Conference (ACSAC 2007).   IEEE, 2007, pp. 411–420.
  • [6] B. Birrer, R. A. Raines, R. O. Baldwin, M. E. Oxley, and S. K. Rogers, “Using qualia and novel representations in malware detection,” in Intelligent Sensing, Situation Management, Impact Assessment, and Cyber-Sensing, vol. 7352.   International Society for Optics and Photonics, 2009, p. 73520W.
  • [7] D. Venugopal and G. Hu, “Efficient signature based malware detection on mobile devices,” Mobile Information Systems, vol. 4, no. 1, pp. 33–49, 2008.
  • [8] M. Rhode, P. Burnap, and K. Jones, “Early-stage malware prediction using recurrent neural networks,” computers & security, vol. 77, pp. 578–594, 2018.
  • [9] P. Vinod, R. Jaipur, V. Laxmi, and M. Gaur, “Survey on malware detection methods,” in Proceedings of the 3rd Hackers’ Workshop on computer and internet security (IITKHACK’09), 2009, pp. 74–79.
  • [10] A. Damodaran, F. Di Troia, C. A. Visaggio, T. H. Austin, and M. Stamp, “A comparison of static, dynamic, and hybrid analysis for malware detection,” Journal of Computer Virology and Hacking Techniques, vol. 13, no. 1, pp. 1–12, 2017.
  • [11] K. Grosse, N. Papernot, P. Manoharan, M. Backes, and P. McDaniel, “Adversarial examples for malware detection,” in European Symposium on Research in Computer Security.   Springer, 2017, pp. 62–79.
  • [12] L. Nataraj, V. Yegneswaran, P. Porras, and J. Zhang, “A comparative assessment of malware classification using binary texture analysis and dynamic analysis,” in Proceedings of the 4th ACM Workshop on Security and Artificial Intelligence.   ACM, 2011, pp. 21–30.
  • [13] F. Ahmed, H. Hameed, M. Z. Shafiq, and M. Farooq, “Using spatio-temporal information in api calls with machine learning algorithms for malware detection,” in Proceedings of the 2nd ACM workshop on Security and artificial intelligence.   ACM, 2009, pp. 55–62.
  • [14] J. Pfoh, C. Schneider, and C. Eckert, “Leveraging string kernels for malware detection,” in International Conference on Network and System Security.   Springer, 2013, pp. 206–219.
  • [15] R. Tian, R. Islam, L. Batten, and S. Versteeg, “Differentiating malware from cleanware using behavioural analysis,” in 2010 5th international conference on malicious and unwanted software.   IEEE, 2010, pp. 23–30.
  • [16] R. Pascanu, J. W. Stokes, H. Sanossian, M. Marinescu, and A. Thomas, “Malware classification with recurrent networks,” in 2015 IEEE ICASSP.   IEEE, 2015, pp. 1916–1920.
  • [17] B. Kolosnjaji, A. Zarras, G. Webster, and C. Eckert, “Deep learning for classification of malware system call sequences,” in Australasian Joint Conference on Artificial Intelligence.   Springer, 2016, pp. 137–149.
  • [18] Y. Fang, B. Yu, Y. Tang, L. Liu, Z. Lu, Y. Wang, and Q. Yang, “A new malware classification approach based on malware dynamic analysis,” in Australasian Conference on Information Security and Privacy.   Springer, 2017, pp. 173–189.
  • [19] R. Agrawal, J. W. Stokes, M. Marinescu, and K. Selvaraj, “Neural sequential malware detection with parameters,” in 2018 IEEE ICASSP.   IEEE, 2018, pp. 2656–2660.
  • [20] Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier, “Language modeling with gated convolutional networks,” in Proceedings of the 34th International Conference on Machine Learning-Volume 70.   JMLR. org, 2017, pp. 933–941.
  • [21] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.
  • [22] U. Bayer, P. M. Comparetti, C. Hlauschek, C. Kruegel, and E. Kirda, “Scalable, behavior-based malware clustering.” in NDSS, vol. 9.   Citeseer, 2009, pp. 8–11.
  • [23] P. Trinius, C. Willems, T. Holz, and K. Rieck, “A malware instruction set for behavior-based analysis,” 2009.
  • [24] Y. Qiao, Y. Yang, L. Ji, and J. He, “Analyzing malware by abstracting the frequent itemsets in api call sequences,” in 2013 12th IEEE International Conference on Trust, Security and Privacy in Computing and Communications.   IEEE, 2013, pp. 265–270.
  • [25] R. Islam, R. Tian, L. Batten, and S. Versteeg, “Classification of malware based on string and function feature selection,” in 2010 Second Cybercrime and Trustworthy Computing Workshop.   IEEE, 2010, pp. 9–17.
  • [26] R. Islam, R. Tian, L. M. Batten, and S. Versteeg, “Classification of malware based on integrated static and dynamic features,” Journal of Network and Computer Applications, vol. 36, no. 2, pp. 646–656, 2013.
  • [27] Z. Salehi, M. Ghiasi, and A. Sami, “A miner for malware detection based on api function calls and their arguments,” in The 16th CSI International Symposium on Artificial Intelligence and Signal Processing (AISP 2012).   IEEE, 2012, pp. 563–568.
  • [28] S. S. Hansen, T. M. T. Larsen, M. Stevanovic, and J. M. Pedersen, “An approach for detection and family classification of malware based on behavioral analysis,” in 2016 ICNC.   IEEE, 2016, pp. 1–5.
  • [29] O. E. David and N. S. Netanyahu, “Deepsign: Deep learning for automatic malware signature generation and classification,” in 2015 IJCNN.   IEEE, 2015, pp. 1–8.
  • [30] S. Tobiyama, Y. Yamaguchi, H. Shimada, T. Ikuse, and T. Yagi, “Malware detection with deep neural network using process behavior,” in 2016 IEEE 40th Annual COMPSAC, vol. 2.   IEEE, 2016, pp. 577–582.
  • [31] W. Huang and J. W. Stokes, “Mtnet: a multi-task neural network for dynamic malware classification,” in International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment.   Springer, 2016, pp. 399–418.
  • [32] G. E. Dahl, J. W. Stokes, L. Deng, and D. Yu, “Large-scale malware classification using random projections and neural networks,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.   IEEE, 2013, pp. 3422–3426.
  • [33] Cuckoo, “Cuckoo Sandbox - Automated Malware Analysis,” 2019. [Online]. Available: https://cuckoosandbox.org/
  • [34] N. Miramirkhani, M. P. Appini, N. Nikiforakis, and M. Polychronakis, “Spotless sandboxes: Evading malware analysis systems using wear-and-tear artifacts,” in 2017 IEEE Symposium on Security and Privacy (SP).   IEEE, 2017, pp. 1009–1024.
  • [35] K. Berlin, D. Slater, and J. Saxe, “Malicious behavior detection using windows audit logs,” in Proceedings of the 8th ACM Workshop on Artificial Intelligence and Security.   ACM, 2015, pp. 35–44.
  • [36] K. Weinberger, A. Dasgupta, J. Attenberg, J. Langford, and A. Smola, “Feature hashing for large scale multitask learning,” arXiv preprint arXiv:0902.2206, 2009.
  • [37] W. Liu, P. Ren, K. Liu, and H.-x. Duan, “Behavior-based malware analysis and detection,” in 2011 First International Workshop on Complexity and Data Mining.   IEEE, 2011, pp. 39–42.
  • [38] L. Yang, Q. Ai, J. Guo, and W. B. Croft, “anmm: Ranking short answer texts with attention-based neural matching model,” in Proceedings of the 25th ACM CIKM.   ACM, 2016, pp. 287–296.
  • [39] Y. Shen, X. He, J. Gao, L. Deng, and G. Mesnil, “A latent semantic model with convolutional-pooling structure for information retrieval,” in Proceedings of the 23rd ACM CIKM.   ACM, 2014, pp. 101–110.
  • [40] J. Tan, X. Wan, and J. Xiao, “A neural network approach to quote recommendation in writings,” in Proceedings of the 25th ACM CIKM.   ACM, 2016, pp. 65–74.
  • [41] P. Khurana, P. Agarwal, G. Shroff, L. Vig, and A. Srinivasan, “Hybrid bilstm-siamese network for faq assistance,” in Proceedings of the 2017 ACM CIKM.   ACM, 2017, pp. 537–545.
  • [42] R. Mehrotra, A. H. Awadallah, M. Shokouhi, E. Yilmaz, I. Zitouni, A. El Kholy, and M. Khabsa, “Deep sequential models for task satisfaction prediction,” in Proceedings of the 2017 ACM CIKM.   ACM, 2017, pp. 737–746.
  • [43] C. Wu and M. Yan, “Session-aware information embedding for e-commerce product recommendation,” in Proceedings of the 2017 ACM CIKM.   ACM, 2017, pp. 2379–2382.
  • [44] J.-Y. Jiang and C.-T. Li, “Forecasting geo-sensor data with participatory sensing based on dropout neural network,” in Proceedings of the 25th ACM CIKM.   ACM, 2016, pp. 2033–2036.
  • [45] D. Uppal, R. Sinha, V. Mehra, and V. Jain, “Malware detection and classification based on extraction of api sequences,” in 2014 ICACCI.   IEEE, 2014, pp. 2337–2342.
  • [46] M. G. Schultz, E. Eskin, F. Zadok, and S. J. Stolfo, “Data mining methods for detection of new malicious executables,” in Proceedings 2001 IEEE Symposium on Security and Privacy. S&P 2001.   IEEE, 2000, pp. 38–49.
  • [47] C. Nicholas, “Malware Analysis in the Large vs. Malware Analysis in the Small,” 2017. [Online]. Available: https://www.csee.umbc.edu/courses/undergraduate/CMSC491malware/cikm2017.html