Towards Building an Intelligent Anti-Malware System: A Deep Learning Approach using Support Vector Machine (SVM) for Malware Classification

12/31/2017
by Abien Fred Agarap, et al.
Adamson University

Effective and efficient mitigation of malware is a long-time endeavor in the information security community. The development of an anti-malware system that can counteract an unknown malware is a prolific activity that may benefit several sectors. We envision an intelligent anti-malware system that utilizes the power of deep learning (DL) models. Using such models would enable the detection of newly-released malware through mathematical generalization. That is, finding the relationship between a given malware x and its corresponding malware family y, f: x → y. To accomplish this feat, we used the Malimg dataset (Nataraj et al., 2011), which consists of malware images that were processed from malware binaries, and then trained the following DL models to classify each malware family: CNN-SVM (Tang, 2013), GRU-SVM (Agarap, 2017), and MLP-SVM. Empirical evidence has shown that the GRU-SVM stands out among the DL models with a predictive accuracy of ≈84.92%. This stands to reason, for the mentioned model had the relatively most sophisticated architecture design among the presented models. The exploration of an even more optimal DL-SVM model is the next stage towards the engineering of an intelligent anti-malware system.


1. Introduction

Effective and efficient mitigation of malware is a long-time endeavor in the information security community. The development of an anti-malware system that can counteract an unknown malware is a prolific activity that may benefit several sectors.
To intercept an unknown malware or even just an unknown variant is a laborious task to undertake, and may only be accomplished by constantly updating the anti-malware signature database. The mentioned database contains the information on all known malware (Shelly and Vermaat, 2011), which is then used for malware detection. Consequently, newly-released malware which are not yet included in the database will go undetected.
We envision an intelligent anti-malware system that employs a deep learning (DL) approach which would enable the detection of newly-released malware through its capability to generalize on data. Furthermore, we amend the conventional DL models to use the support vector machine (SVM) as their classification function. We take advantage of the Malimg dataset (Nataraj et al., 2011), which consists of visualized malware binaries, and use it to train the DL-SVM models to classify each malware family.

2. Methodology

2.1. Machine Intelligence Library

Google TensorFlow (Abadi et al., 2015) was used to implement the deep learning algorithms in this study, with the aid of other scientific computing libraries: matplotlib (Hunter, 2007), numpy (Walt et al., 2011), and scikit-learn (Pedregosa et al., 2011).

2.2. The Dataset

The deep learning (DL) models in this study were evaluated on the Malimg dataset (Nataraj et al., 2011), which consists of 9,339 malware samples from 25 different malware families. Table 1 shows the frequency distribution of the malware families and their variants in the Malimg dataset (Nataraj et al., 2011).

Figure 1. Image from (Nataraj et al., 2011). Visualizing malware as a grayscale image.
No. Family Family Name No. of Variants
01 Dialer Adialer.C 122
02 Backdoor Agent.FYI 116
03 Worm Allaple.A 2949
04 Worm Allaple.L 1591
05 Trojan Alueron.gen!J 198
06 Worm:AutoIT Autorun.K 106
07 Trojan C2Lop.P 146
08 Trojan C2Lop.gen!G 200
09 Dialer Dialplatform.B 177
10 Trojan Downloader Dontovo.A 162
11 Rogue Fakerean 381
12 Dialer Instantaccess 431
13 PWS Lolyda.AA 1 213
14 PWS Lolyda.AA 2 184
15 PWS Lolyda.AA 3 123
16 PWS Lolyda.AT 159
17 Trojan Malex.gen!J 136
18 Trojan Downloader Obfuscator.AD 142
19 Backdoor Rbot!gen 158
20 Trojan Skintrim.N 80
21 Trojan Downloader Swizzor.gen!E 128
22 Trojan Downloader Swizzor.gen!I 132
23 Worm VB.AT 408
24 Trojan Downloader Wintrim.BX 97
25 Worm Yuner.A 800

Table 1. Malware families found in the Malimg dataset (Nataraj et al., 2011).

Nataraj et al. (2011) created the Malimg dataset by reading malware binaries as 8-bit unsigned integers, composing a matrix $M \in \mathbb{R}^{m \times n}$. The said matrix may be visualized as a grayscale image having values in the range $[0, 1]$, with 0 representing black and 1 representing white.
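For illustration, the following is a minimal sketch of how a malware binary may be read as 8-bit unsigned integers and arranged into a grayscale matrix in the manner described above. The fixed image width and the helper name binary_to_grayscale are illustrative assumptions, not the exact procedure of Nataraj et al. (2011), who chose the width according to the file size.

```python
import numpy as np
import matplotlib.pyplot as plt

def binary_to_grayscale(path, width=256):
    """Read a binary file as 8-bit unsigned integers and arrange it into a
    2-D matrix that can be rendered as a grayscale image. The width is a
    free parameter in this sketch."""
    data = np.fromfile(path, dtype=np.uint8)
    height = len(data) // width              # drop any trailing partial row
    return data[: height * width].reshape(height, width)

# Hypothetical usage:
# image = binary_to_grayscale("sample_malware.bin")
# plt.imshow(image, cmap="gray")
# plt.show()
```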

2.3. Dataset Preprocessing

Similar to what (Garcia and Muga II, 2016) did, the malware images were resized to a 2-dimensional matrix of $32 \times 32$, and were flattened into a $1 \times 1024$-size array, resulting in a 1024-dimensional feature vector. Each feature vector was then labelled with its corresponding indexed malware family name (i.e. 0-24). Then, the features were standardized using Eq. 1.

$z = \dfrac{X - \mu}{\sigma}$ (1)

where $X$ is the feature to be standardized, $\mu$ is its mean value, and $\sigma$ is its standard deviation. The standardization was implemented using StandardScaler().fit_transform() of scikit-learn (Pedregosa et al., 2011). Granted, the dataset consists of images, and standardization may not be suitable for such data; but note that the images originate from malware binary files, so the features are not natural images to begin with.
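The following sketch illustrates the preprocessing described above, assuming the malware images are read from disk with Pillow. The function name, the use of Pillow for resizing, and the loading details are illustrative rather than the study's exact implementation; only the 32 × 32 target size, the flattening, and the use of StandardScaler follow the description in this section.

```python
import numpy as np
from PIL import Image
from sklearn.preprocessing import StandardScaler

def preprocess(image_paths, labels):
    """Resize each malware image to 32x32, flatten it into a
    1024-dimensional feature vector, and standardize the features
    column-wise as in Eq. 1."""
    features = []
    for path in image_paths:
        img = Image.open(path).convert("L")              # grayscale
        img = img.resize((32, 32))
        features.append(np.asarray(img, dtype=np.float32).flatten())
    features = np.stack(features)
    features = StandardScaler().fit_transform(features)  # z = (X - mu) / sigma
    return features, np.asarray(labels)
```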

2.4. Computational Models

This section presents the deep learning (DL) models, and the support vector machine (SVM) classifier used in the study.

2.4.1. Support Vector Machine (SVM)

The support vector machine (SVM) was developed by Vapnik (Cortes and Vapnik, 1995) for binary classification. Its objective is to find the optimal hyperplane $f(w, x) = w \cdot x + b$ that separates two classes in a given dataset with features $x \in \mathbb{R}^{p}$.

SVM learns the parameters $w$ and $b$ by solving the following constrained optimization problem:

$\min \|w\|_{1} + C\sum_{i=1}^{n}\xi_{i}$ (2)
$\text{s.t.} \quad y_{i}'(w \cdot x_{i} + b) \geq 1 - \xi_{i}$ (3)
$\xi_{i} \geq 0, \quad i = 1, \ldots, n$ (4)

where $\|w\|_{1}$ is the Manhattan norm (also known as the L1 norm), $C$ is the penalty parameter (which may be an arbitrary value or a value selected through hyper-parameter tuning), and $\xi$ is a cost function.

The corresponding unconstrained optimization problem of Eq. 2 is given by Eq. 5.

$\min \|w\|_{1} + C\sum_{i=1}^{n}\max\big(0,\ 1 - y_{i}'(w^{\top}x_{i} + b)\big)$ (5)

where $y_{i}'$ is the actual label, and $w^{\top}x_{i} + b$ is the predictor function. This formulation is known as L1-SVM, with the standard hinge loss. Its differentiable counterpart, L2-SVM (given by Eq. 6), provides more stable results (Tang, 2013).

$\min \|w\|_{2}^{2} + C\sum_{i=1}^{n}\max\big(0,\ 1 - y_{i}'(w^{\top}x_{i} + b)\big)^{2}$ (6)

where $\|w\|_{2}$ is the Euclidean norm (also known as the L2 norm), with the squared hinge loss.

Although intended for binary classification, SVM may be used for multinomial classification as well. One approach is the kernel trick, which converts a linear model to a non-linear model by applying kernel functions such as the radial basis function (RBF). For this study, however, we utilized the linear L2-SVM for the multinomial classification problem, employing the one-versus-all (OvA) approach, which treats a given class as the positive class and all others as the negative class.
Take, for example, the classes airplane, boat, and car: if a given image belongs to the airplane class, that class is taken as the positive class, which leaves the other two classes as the negative class.
With the OvA approach, the L2-SVM serves as the classifier of each deep learning model in this study (CNN, GRU, and MLP). That is, the learning parameters weight and bias of each model are learned through the SVM objective.
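As a concrete illustration of Eq. 6, the following is a minimal sketch of a squared hinge (L2-SVM) loss with one-vs-all targets, written with TensorFlow 2 for readability. The function and parameter names are assumptions of this sketch, and the study's original TensorFlow implementation may differ in its details.

```python
import tensorflow as tf

def l2_svm_loss(labels_onehot, logits, weights, C=0.5, penalty=1.0):
    """Squared hinge (L2-SVM) loss of Eq. 6 for one-vs-all multiclass
    classification. labels_onehot holds one-hot labels, mapped to +/-1
    targets; logits are the raw scores w^T x + b of the output layer;
    weights is that layer's weight matrix, used for the norm term."""
    targets = 2.0 * labels_onehot - 1.0                    # {0, 1} -> {-1, +1}
    hinge = tf.maximum(0.0, 1.0 - targets * logits)
    norm_term = penalty * tf.reduce_sum(tf.square(weights))
    # Mean over the mini-batch of the summed squared hinge over classes.
    return norm_term + C * tf.reduce_mean(tf.reduce_sum(tf.square(hinge), axis=1))
```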

2.4.2. Convolutional Neural Network

Convolutional Neural Networks (CNNs) are similar to feedforward neural networks in that they also consist of hidden layers of neurons with “learnable” parameters. These neurons receive inputs, perform a dot product, and then follow it with a non-linearity such as sigmoid or tanh. The whole network expresses the mapping between raw image pixels $x$ and class scores $y$. For this study, the CNN architecture used resembles the one laid down in (ten, 2017):

  1. INPUT: $32 \times 32 \times 1$

  2. CONV5: $5 \times 5$ size, 36 filters, 1 stride

  3. LeakyReLU: $\max(\alpha x, x)$

  4. POOL: $2 \times 2$ size, 1 stride

  5. CONV5: $5 \times 5$ size, 72 filters, 1 stride

  6. LeakyReLU: $\max(\alpha x, x)$

  7. POOL: $2 \times 2$ size, 1 stride

  8. FC: 1024 Hidden Neurons

  9. LeakyReLU: $\max(\alpha x, x)$

  10. DROPOUT: $p = 0.85$

  11. FC: 25 Output Classes

The modifications introduced to the architecture design were the size of the layer inputs and outputs (e.g. an input of $32 \times 32 \times 1$ and an output of 25 classes, instead of the dimensions used in the original architecture), the use of LeakyReLU instead of ReLU, and, of course, the introduction of L2-SVM as the network classifier in place of the conventional Softmax function. This paradigm of combining CNN and SVM was originally proposed by Tang (2013).
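Below is a sketch of the CNN-SVM architecture listed above, expressed with tf.keras for brevity: the final dense layer outputs raw class scores and the model is trained with a squared hinge loss, corresponding to the L2-SVM objective of Eq. 6 (the explicit weight-norm term is omitted here and would normally be added via kernel regularization). The padding choice and the interpretation of the Table 2 dropout value of 0.85 as a keep probability are assumptions of this sketch.

```python
import tensorflow as tf

def build_cnn_svm(num_classes=25, alpha=0.01, dropout_keep=0.85):
    """Sketch of the CNN-SVM: the listed convolutional stack with a linear
    (no softmax) output layer trained with a squared hinge loss."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(32, 32, 1)),
        tf.keras.layers.Conv2D(36, 5, strides=1, padding="same"),
        tf.keras.layers.LeakyReLU(alpha),
        tf.keras.layers.MaxPooling2D(pool_size=2, strides=1),
        tf.keras.layers.Conv2D(72, 5, strides=1, padding="same"),
        tf.keras.layers.LeakyReLU(alpha),
        tf.keras.layers.MaxPooling2D(pool_size=2, strides=1),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(1024),
        tf.keras.layers.LeakyReLU(alpha),
        tf.keras.layers.Dropout(1.0 - dropout_keep),
        tf.keras.layers.Dense(num_classes),   # raw class scores, no softmax
    ])
    # SquaredHinge expects +/-1 targets; one-hot labels must be converted.
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
                  loss=tf.keras.losses.SquaredHinge(),
                  metrics=["accuracy"])
    return model
```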

2.4.3. Gated Recurrent Unit

Agarap (2017) proposed a neural network architecture combining the gated recurrent unit (GRU) (Cho et al., 2014) variant of a recurrent neural network (RNN) and the support vector machine (SVM) (Cortes and Vapnik, 1995) for the purpose of binary classification.

$z_{t} = \sigma\big(W_{z} \cdot [h_{t-1}, x_{t}]\big)$ (7)
$r_{t} = \sigma\big(W_{r} \cdot [h_{t-1}, x_{t}]\big)$ (8)
$\tilde{h}_{t} = \tanh\big(W \cdot [r_{t} \ast h_{t-1}, x_{t}]\big)$ (9)
$h_{t} = (1 - z_{t}) \ast h_{t-1} + z_{t} \ast \tilde{h}_{t}$ (10)

where $z_{t}$ and $r_{t}$ are the update gate and reset gate of a GRU-RNN respectively, $\tilde{h}_{t}$ is the candidate value, and $h_{t}$ is the new RNN cell state value (Cho et al., 2014). In turn, $h_{t}$ is used as the predictor variable $x$ in the L2-SVM predictor function $y' = \operatorname{sign}(wx + b)$ of the network, instead of the conventional Softmax classifier.
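To make Eqs. 7-10 concrete, here is a minimal NumPy sketch of a single GRU step operating on the concatenation [h_{t-1}, x_t]. Bias terms are omitted for brevity and the weight shapes are assumed, so this illustrates the gating equations rather than the study's GRU-SVM implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W_z, W_r, W_h):
    """One GRU step implementing Eqs. 7-10 (biases omitted)."""
    concat = np.concatenate([h_prev, x_t])
    z_t = sigmoid(W_z @ concat)                                   # Eq. 7: update gate
    r_t = sigmoid(W_r @ concat)                                   # Eq. 8: reset gate
    h_tilde = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]))  # Eq. 9: candidate
    h_t = (1.0 - z_t) * h_prev + z_t * h_tilde                    # Eq. 10: new state
    return h_t
```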

2.4.4. Multilayer Perceptron

The perceptron model was developed by Rosenblatt (1958) based on the neuron model of McCulloch & Pitts (1943). A perceptron may be represented by a linear function (given by Eq. 11), which is then passed to an activation function such as sigmoid, sign, or tanh. These activation functions introduce non-linearity (except for the sign function) to represent complex functions.
As the term itself implies, a multilayer perceptron (MLP) is a neural network that consists of hidden layers of perceptrons. In this study, the activation function used was the LeakyReLU (Maas et al., 2013) function (given by Eq. 12).

$y = \sum_{i=1}^{p} w_{i}x_{i} + b$ (11)
$f(x) = \max(\alpha x, x)$ (12)

where $\alpha$ is a small constant (e.g. 0.01).
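A sketch of the MLP-SVM follows, using the hidden layer sizes of 512, 256, and 128 units reported in Table 2 with LeakyReLU activations, and a linear output layer trained via a squared hinge loss. The layer sizes come from Table 2, while the framework-level details are assumptions of this sketch.

```python
import tensorflow as tf

def build_mlp_svm(input_dim=1024, num_classes=25, alpha=0.01):
    """Sketch of the MLP-SVM: three fully-connected hidden layers with
    LeakyReLU (Eq. 12), and a linear output layer whose raw scores are
    trained with a squared hinge (L2-SVM) loss."""
    model = tf.keras.Sequential([tf.keras.layers.Input(shape=(input_dim,))])
    for units in (512, 256, 128):          # hidden layer sizes from Table 2
        model.add(tf.keras.layers.Dense(units))
        model.add(tf.keras.layers.LeakyReLU(alpha))
    model.add(tf.keras.layers.Dense(num_classes))   # raw class scores
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
                  loss=tf.keras.losses.SquaredHinge(),
                  metrics=["accuracy"])
    return model
```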

The learning parameters weight $w$ and bias $b$ for each DL model were learned by the L2-SVM using the loss function given by Eq. 6. The computed loss is then minimized through Adam (Kingma and Ba, 2014) optimization. The decision function $f(x) = \operatorname{sign}(wx + b)$ then produces a vector of scores for each malware family. In order to get the predicted label for a given data point $x$, the argmax function is used (see Eq. 13).

$y' = \operatorname{argmax}\big(\operatorname{sign}(wx + b)\big)$ (13)

The argmax function shall return the index of the highest score across the vector of predicted classes $y'$.
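The prediction step of Eq. 13 can be sketched as follows. Taking the argmax directly over the raw one-vs-all scores (rather than over their signs) is an implementation assumption of this sketch that yields a well-defined label even when several classes score positively.

```python
import numpy as np

def svm_predict(features, weights, bias):
    """Sketch of Eq. 13: compute the one-vs-all scores w^T x + b for every
    class and return the index of the highest score as the predicted label."""
    scores = features @ weights + bias       # shape: (n_samples, n_classes)
    return np.argmax(scores, axis=1)
```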

2.5. Data Analysis

There were two phases of experiment for this study: (1) a training phase, and (2) a test phase. All the deep learning algorithms described in Section 2.4 were trained and tested on the Malimg dataset (Nataraj et al., 2011). The dataset was partitioned in the following fashion: 70% for the training phase, and 30% for the testing phase.
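A minimal sketch of this 70/30 partition using scikit-learn's train_test_split is given below; the random seed is arbitrary, and stratification by malware family is deliberately omitted to mirror the splitting described in Section 3.

```python
from sklearn.model_selection import train_test_split

def split_dataset(features, labels, test_size=0.30, seed=42):
    """70/30 partition of the preprocessed dataset, without stratification
    by malware family."""
    return train_test_split(features, labels,
                            test_size=test_size, random_state=seed)
```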
The variables considered in the experiments were the following:

  1. Test Accuracy (the predictive accuracy on unseen data)

  2. Epochs (number of passes through the entire dataset)

  3. F1 score (harmonic mean of precision and recall, see Eq. 14)

  4. Number of data points

  5. Precision (Positive Predictive Value, see Eq. 15)

  6. Recall (True Positive Rate, see Eq. 16)

$F1 = 2 \times \dfrac{precision \times recall}{precision + recall}$ (14)
$precision = \dfrac{TP}{TP + FP}$ (15)
$recall = \dfrac{TP}{TP + FN}$ (16)

where TP, FP, and FN denote the counts of true positives, false positives, and false negatives, respectively.

The classification measures F1 score, precision, and recall were all computed using the classification_report() function of sklearn.metrics (Pedregosa et al., 2011).
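For reference, a short sketch of how the per-family metrics of Eqs. 14-16 may be obtained from scikit-learn; the wrapper function and argument names here are illustrative.

```python
from sklearn.metrics import classification_report

def report_metrics(true_labels, predicted_labels, family_names):
    """Per-family precision, recall, and F1 score (Eqs. 14-16), as produced
    by scikit-learn's classification_report()."""
    print(classification_report(true_labels, predicted_labels,
                                target_names=family_names))
```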

3. Results

All experiments in this study were conducted on a laptop computer with Intel Core(TM) i5-6300HQ CPU @ 2.30GHz x 4, 16GB of DDR3 RAM, and NVIDIA GeForce GTX 960M 4GB DDR5 GPU. Table 2 shows the hyper-parameters used by the DL-SVM models in the conducted experiments. Table 3 summarizes the experiment results for the presented DL-SVM models.

Hyper-parameters CNN-SVM GRU-SVM MLP-SVM
Batch Size 256 256 256
Cell Size N/A [256, 256, 256, 256, 256] [512, 256, 128]
No. of Hidden Layers 2 5 3
Dropout Rate 0.85 0.85 None
Epochs 100 100 100
Learning Rate 1e-3 1e-3 1e-3
SVM C 10 10 0.5

Table 2. Hyper-parameters used in the DL-SVM models.
Variables CNN-SVM GRU-SVM MLP-SVM
Accuracy 77.2265625% 84.921875% 80.46875%
Data points 256000 256000 256000
Epochs 100 100 100
F1 0.79 0.85 0.81
Precision 0.84 0.85 0.83
Recall 0.77 0.85 0.80

Table 3. Summary of experiment results on the DL-SVM models.

As opposed to what (Garcia and Muga II, 2016) did on dataset partitioning, the relative populations of each malware family were not considered in the splitting process. All the DL-SVM models were trained on 70% of the preprocessed Malimg dataset (Nataraj et al., 2011), i.e. 6400 malware family variants, for 100 epochs. On the other hand, the models were tested on 30% of the preprocessed Malimg dataset (Nataraj et al., 2011), i.e. 2560 malware family variants, for 100 epochs.

Figure 2. Plotted using matplotlib (Hunter, 2007). Training accuracy of the DL-SVM models on malware classification using the Malimg dataset (Nataraj et al., 2011).

Figure 2 summarizes the training accuracy of the DL-SVM models for 100 epochs (equivalent to 2500 steps, since $\frac{6400}{256} \times 100 = 2500$). First, the CNN-SVM model accomplished its training in 3 minutes and 41 seconds, with an average training accuracy of 80.96875%. Meanwhile, the GRU-SVM model accomplished its training in 11 minutes and 32 seconds, with an average training accuracy of 90.9375%. Lastly, the MLP-SVM model accomplished its training in 12 seconds, with an average training accuracy of 99.5768229%.

Figure 3. Plotted using matplotlib (Hunter, 2007). Confusion Matrix for CNN-SVM testing results, showing its predictive accuracy for each malware family described in Table 1.

Figure 3 shows the testing performance of the CNN-SVM model in multinomial classification on malware families. The mentioned model had a precision of 0.84, a recall of 0.77, and an F1 score of 0.79.

Figure 4. Plotted using matplotlib (Hunter, 2007). Confusion Matrix for GRU-SVM testing results, showing its predictive accuracy for each malware family described in Table 1.

Figure 4 shows the testing performance of the GRU-SVM model in multinomial classification on malware families. The mentioned model had a precision of 0.85, a recall of 0.85, and an F1 score of 0.85.

Figure 5. Plotted using matplotlib (Hunter, 2007). Confusion Matrix for MLP-SVM testing results, showing its predictive accuracy for each malware family described in Table 1.

Figure 5 shows the testing performance of the MLP-SVM model in multinomial classification on malware families. The mentioned model had a precision of 0.83, a recall of 0.80, and an F1 score of 0.81.

As shown in the confusion matrices, the DL-SVM models had better scores for the malware families with a high number of variants, most notably Allaple.A and Allaple.L. This may be attributed to the omission of the relative populations of each malware family during the partitioning of the dataset into training and testing data. However, unlike the results of (Garcia and Muga II, 2016), only Allaple.A and Allaple.L had some misclassifications between them.

4. Discussion

It is palpable that the GRU-SVM model stands out among the DL-SVM models presented in this study. This finding comes as no surprise, as the GRU-SVM model did have the relatively most sophisticated architecture design among the presented models, most notably its 5-layer design. As explained in (Goodfellow et al., 2016), the number of layers of a neural network is directly proportional to the complexity of the functions it can represent. In other words, the performance or accuracy of a neural network is directly proportional to the number of its hidden layers; by the same logic, the fewer hidden layers a neural network has, the lower its performance or accuracy. Hence, the findings in this study corroborate this explanation from the literature, as the MLP-SVM came second (with 80.47% test accuracy) to the GRU-SVM with its 3-layer design, and the CNN-SVM came last (with 77.23% test accuracy) with its 2-layer design.
The reported test accuracy of 84.92% clearly shows that the GRU-SVM model has the strongest predictive performance among the DL-SVM models in this study. This is attributed to the fact that the GRU-SVM model has the relatively most complex design among the presented models. First, its 5-layer design allows it to represent increasingly complex mappings between features and labels, i.e. function mappings f: x → y. Second, it has the capability to learn from data of a sequential nature, a category to which image data belongs. This nature of the GRU-RNN comes from its gating mechanisms, given by the equations in Section 2.4.3. Through the mentioned mechanisms, the GRU-RNN avoids the problems of vanishing and exploding gradients (Cho et al., 2014), and is thus able to connect information across a considerable gap. However, as indicated by the training summary given in Figure 2, the GRU-SVM has the caveat of a relatively longer computing time. Having finished its training in 11 minutes and 32 seconds, it was the slowest among the DL-SVM models. From a high-level inspection of the presented equations of each DL-SVM model (CNN-SVM in Section 2.4.2, GRU-SVM in Section 2.4.3, and MLP-SVM in Section 2.4.4), it was theoretically implied that the GRU-SVM would have the longest computing time, as it has the most non-linearities introduced in its computation. Conversely, with the fewest non-linearities (having only used LeakyReLU), it was also theoretically implied that the MLP-SVM model would have the shortest computing time.
From the literature explanation (Goodfellow et al., 2016) and the empirical evidence, it can be inferred that increasing the complexity of the architectural design (e.g. more hidden layers, better non-linearities) of the CNN-SVM and MLP-SVM models may catapult their predictive performance to be more on par with the GRU-SVM model. In turn, this implication warrants further study and exploration that may prove beneficial to the information security community.

5. Conclusion and Recommendation

We used the Malimg dataset prepared by (Nataraj et al., 2011), which consists of malware images for the purpose of malware family classification. We employed deep learning models with the L2-SVM as their final output layer in a multinomial classification task. The empirical data shows that the GRU-SVM model by (Agarap, 2017) had the highest predictive accuracy among the presented DL-SVM models, having a test accuracy of 84.92%.
Improving the architecture design of the CNN-SVM and MLP-SVM models by adding more hidden layers, adding better non-linearities, and/or using an optimized dropout may provide better insights on their application to malware classification. Such insights may reveal information as to which architecture may serve best in the engineering of an intelligent anti-malware system.

6. Acknowledgment

We extend our statement of gratitude to the open-source community, especially to TensorFlow. We also express our appreciation to Lakshmanan Nataraj, S. Karthikeyan, Gregoire Jacob, and B.S. Manjunath for the Malimg dataset (Nataraj et al., 2011).
We would also like to express our appreciation to friends who took time reading parts of this paper during its development, and who gave their support to this endeavor. F.J.H. Pepito gives his appreciation to Hyacinth Gasmin, and A.F. Agarap gives his appreciation to Ma. Pauline de Ocampo, Abqary Alon, Rhea Jude Ferrer, and Julius Luis H. Diaz.

References