Effective and efficient mitigation of malware has long been a goal of the information security community. Developing an anti-malware system that can counteract previously unknown malware is a worthwhile pursuit that may benefit several sectors.
Intercepting unknown malware, or even just an unknown variant, is a laborious task, and may only be accomplished by constantly updating the anti-malware signature database. This database contains the information on all malware known to the particular system (Shelly and Vermaat, 2011), which is then used for malware detection. Consequently, newly released malware that is not yet included in the database will go undetected.
We envision an intelligent anti-malware system that employs a deep learning (DL) approach, which would enable the detection of newly released malware through its capability to generalize on data. Furthermore, we amend the conventional DL models to use the support vector machine (SVM) as their classification function. We take advantage of the Malimg dataset (Nataraj et al., 2011), which consists of visualized malware binaries, and use it to train the DL-SVM models to classify each malware family.
2.1. Machine Intelligence Library
2.2. The Dataset
The deep learning (DL) models in this study were evaluated on the Malimg dataset (Nataraj et al., 2011), which consists of 9,339 malware samples from 25 different malware families. Table 1 shows the frequency distribution of malware families and their variants in the dataset.
| No. | Family | Family Name | No. of Variants |
Nataraj et al. (2011) created the Malimg dataset by reading malware binaries as 8-bit unsigned integers composing a matrix $M \in \mathbb{R}^{m \times n}$. The said matrix may be visualized as a grayscale image having values in the range of $[0, 255]$, with 0 representing black and 255 representing white.
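The byte-to-image step described above can be sketched as follows. This is only an illustration and not the procedure of Nataraj et al. (2011): the function name `bytes_to_grayscale`, the fixed image width, and the synthetic byte string are assumptions made for the example.

```python
import numpy as np


def bytes_to_grayscale(binary: bytes, width: int) -> np.ndarray:
    """Interpret raw bytes as rows of an 8-bit grayscale image."""
    pixels = np.frombuffer(binary, dtype=np.uint8)
    height = len(pixels) // width          # a trailing partial row is dropped
    return pixels[: height * width].reshape(height, width)


# Hypothetical stand-in for the contents of a malware binary (1024 bytes).
sample = bytes(range(256)) * 4
image = bytes_to_grayscale(sample, width=64)   # width chosen arbitrarily
print(image.shape, image.dtype, image.min(), image.max())
```

Each byte becomes one pixel intensity, so the resulting matrix can be saved or plotted directly as a grayscale image.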
2.3. Dataset Preprocessing
Similar to Garcia and Muga II (2016), the malware images were resized to a 2-dimensional matrix of $n \times n$ pixels, and were flattened, resulting in a $1 \times n^{2}$-size array. Each feature array was then labelled with the index of its corresponding malware family name. Then, the features were standardized using Eq. 1.
$$z = \frac{X - \mu}{\sigma} \tag{1}$$
where $X$ is the feature to be standardized, $\mu$ is its mean value, and $\sigma$ is its standard deviation. The standardization was implemented using StandardScaler().fit_transform() of scikit-learn (Pedregosa et al., 2011). Admittedly, the dataset consists of images, for which standardization may not be suitable. Note, however, that the images originate from malware binary files. Hence, the features are not technically images to begin with.
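As a minimal sketch of the standardization in Eq. 1, the following NumPy function subtracts the per-feature mean and divides by the per-feature standard deviation. It is written from scratch for illustration and is not the scikit-learn implementation; the function name `standardize`, the toy matrix, and the handling of constant features are assumptions.

```python
import numpy as np


def standardize(features: np.ndarray) -> np.ndarray:
    """Column-wise z-score: subtract the mean, divide by the standard deviation."""
    mean = features.mean(axis=0)
    std = features.std(axis=0)             # population standard deviation
    std = np.where(std == 0.0, 1.0, std)   # assumption: constant features map to zero
    return (features - mean) / std


# Toy feature matrix: 4 samples, 3 flattened "pixels" each.
X = np.array([[0.0, 10.0, 5.0],
              [2.0, 10.0, 7.0],
              [4.0, 10.0, 9.0],
              [6.0, 10.0, 11.0]])
Z = standardize(X)
print(Z.mean(axis=0), Z.std(axis=0))
```

After the transform, every non-constant feature has zero mean and unit standard deviation across the samples.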
2.4. Computational Models
This section presents the deep learning (DL) models, and the support vector machine (SVM) classifier used in the study.
2.4.1. Support Vector Machine (SVM)
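The support vector machine (SVM) was developed by Vapnik (Cortes and Vapnik, 1995) for binary classification. Its objective is to find the optimal hyperplane $f(w, x) = w^{T}x + b$ to separate two classes in a given dataset, with features $x \in \mathbb{R}^{m}$.
SVM learns the parameters $w$ and $b$ by solving the following constrained optimization problem:
$$\min \frac{1}{p}\lVert w \rVert_{1} + C\sum_{i=1}^{p}\xi_{i} \tag{2}$$
$$\text{s.t.}\quad y'_{i}\,(w^{T}x_{i} + b) \geq 1 - \xi_{i} \tag{3}$$
$$\xi_{i} \geq 0,\quad i = 1, \ldots, p \tag{4}$$
where $\lVert w \rVert_{1}$ is the Manhattan norm (also known as L1 norm), $C$ is the penalty parameter (may be an arbitrary value or a value selected through hyper-parameter tuning), $\xi$ is a cost function, and $p$ is the number of training examples. The corresponding unconstrained optimization problem is the following:
$$\min \frac{1}{p}\lVert w \rVert_{1} + C\sum_{i=1}^{p}\max\left(0,\ 1 - y'_{i}\,(w^{T}x_{i} + b)\right) \tag{5}$$
where $y'$ is the actual label, and $w^{T}x + b$ is the predictor function. This equation is known as L1-SVM, with the standard hinge loss. Its differentiable counterpart, L2-SVM (given by Eq. 6), provides more stable results (Tang, 2013).
$$\min \frac{1}{p}\lVert w \rVert_{2}^{2} + C\sum_{i=1}^{p}\max\left(0,\ 1 - y'_{i}\,(w^{T}x_{i} + b)\right)^{2} \tag{6}$$
where $\lVert w \rVert_{2}$ is the Euclidean norm (also known as L2 norm), with the squared hinge loss.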
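To make the difference between the standard and the squared hinge loss concrete, the sketch below evaluates both objectives on a toy problem. The scaling of the norm term by the number of samples and all numeric values are assumptions made for the example; the function names are hypothetical.

```python
import numpy as np


def hinge(w, b, X, y):
    """Per-sample hinge terms max(0, 1 - y * (w.x + b)) for labels in {-1, +1}."""
    return np.maximum(0.0, 1.0 - y * (X @ w + b))


def l1_svm_objective(w, b, X, y, C=1.0):
    # Assumed form: L1 norm of w scaled by the sample count, plus C * hinge loss.
    return np.abs(w).sum() / len(y) + C * hinge(w, b, X, y).sum()


def l2_svm_objective(w, b, X, y, C=1.0):
    # Assumed form: squared L2 norm scaled by the sample count, plus C * squared hinge.
    return np.dot(w, w) / len(y) + C * (hinge(w, b, X, y) ** 2).sum()


X = np.array([[2.0, 0.0], [0.0, 2.0], [0.5, 0.0]])
y = np.array([1.0, -1.0, -1.0])
w = np.array([1.0, -1.0])   # hypothetical weights
b = 0.0                     # hypothetical bias

l1_value = l1_svm_objective(w, b, X, y)
l2_value = l2_svm_objective(w, b, X, y)
print(l1_value, l2_value)
```

Only the third sample violates the margin here, so it alone contributes to either loss; squaring the hinge term penalizes that violation more heavily and makes the objective differentiable at the margin.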
Despite being intended for binary classification, SVM may be used for multinomial classification as well. One approach to achieve this is the use of the kernel trick, which converts a linear model to a non-linear model by applying kernel functions such as the radial basis function (RBF). However, for this study, we utilized the linear L2-SVM for the multinomial classification problem. We then employed the one-versus-all (OvA) approach, which treats a given class as the positive class, and all others as the negative class.
Take for example the following classes: airplane, boat, car. If a given image belongs to the airplane class, airplane is taken as the positive class, which leaves the other two classes as the negative class.
With the OvA approach, the L2-SVM serves as the classifier of each deep learning model in this study (CNN, GRU, and MLP). That is, the learning parameters, weight and bias, of each model are learned by the SVM.
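The one-versus-all labelling for the airplane, boat, car example can be sketched as below. The helper name `one_versus_all_targets` and the choice of +1 and -1 as the positive and negative targets are assumptions made for illustration.

```python
import numpy as np

classes = ["airplane", "boat", "car"]


def one_versus_all_targets(labels, num_classes):
    """Return a (samples, classes) matrix with +1 for the true class, -1 elsewhere."""
    targets = -np.ones((len(labels), num_classes))
    targets[np.arange(len(labels)), labels] = 1.0
    return targets


labels = np.array([0, 2, 1])   # airplane, car, boat
targets = one_versus_all_targets(labels, len(classes))
print(targets)
```

Each column then defines one binary problem, so a separate hinge-type loss can be computed per class and summed.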
2.4.2. Convolutional Neural Network
Convolutional Neural Networks (CNNs) are similar to feedforward neural networks in that they also consist of hidden layers of neurons with “learnable” parameters. These neurons receive inputs, perform a dot product, and then follow it with a non-linearity such as sigmoid or tanh. The whole network expresses the mapping between raw image pixels $x$ and class scores $y$, $f : x \mapsto y$. For this study, the CNN architecture used resembles the one laid down in (ten, 2017):
CONV5: $5 \times 5$ size, 36 filters, 1 stride
POOL: size, 1 stride
CONV5: $5 \times 5$ size, 72 filters, 1 stride
POOL: size, 1 stride
FC: 1024 Hidden Neurons
FC: 25 Output Classes
The modifications introduced in the architecture design were the sizes of the layer inputs and outputs (e.g. the input image dimensions, and an output of 25 classes), the use of LeakyReLU instead of ReLU, and the introduction of L2-SVM as the network classifier instead of the conventional Softmax function. This paradigm of combining CNN and SVM was proposed by Tang (2013).
2.4.3. Gated Recurrent Unit
Agarap (2017) proposed a neural network architecture combining the gated recurrent unit (GRU) (Cho et al., 2014) variant of a recurrent neural network (RNN) and the support vector machine (SVM) (Cortes and Vapnik, 1995) for the purpose of binary classification.
$$z = \sigma(W_{z} \cdot [h_{t-1}, x_{t}]) \tag{7}$$
$$r = \sigma(W_{r} \cdot [h_{t-1}, x_{t}]) \tag{8}$$
$$\tilde{h}_{t} = \tanh(W \cdot [r * h_{t-1}, x_{t}]) \tag{9}$$
$$h_{t} = (1 - z) * h_{t-1} + z * \tilde{h}_{t} \tag{10}$$
where $z$ and $r$ are the update gate and reset gate of a GRU-RNN respectively, $\tilde{h}_{t}$ is the candidate value, and $h_{t}$ is the new RNN cell state value (Cho et al., 2014). In turn, $h_{t}$ is used as the predictor variable $x$ in the L2-SVM predictor function (given by $w^{T}x + b$) of the network instead of the conventional Softmax classifier.
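The gating described above can be sketched as a single GRU update in NumPy. This follows a common textbook formulation of the GRU and is not taken from the implementation used in this study: bias terms are omitted, the weights are random, and the names `gru_step`, `W_z`, `W_r`, and `W_h` are assumptions.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def gru_step(x_t, h_prev, W_z, W_r, W_h):
    """One GRU update; each weight matrix acts on the concatenation [h, x]."""
    hx = np.concatenate([h_prev, x_t])
    z = sigmoid(W_z @ hx)                                        # update gate
    r = sigmoid(W_r @ hx)                                        # reset gate
    h_cand = np.tanh(W_h @ np.concatenate([r * h_prev, x_t]))    # candidate value
    return (1.0 - z) * h_prev + z * h_cand                       # new cell state


rng = np.random.default_rng(0)
hidden, inputs = 4, 3          # toy sizes, chosen arbitrarily
W_z, W_r, W_h = (rng.normal(size=(hidden, hidden + inputs)) for _ in range(3))

h = np.zeros(hidden)
for x_t in rng.normal(size=(5, inputs)):   # a 5-step toy sequence
    h = gru_step(x_t, h, W_z, W_r, W_h)
print(h)
```

The final state `h` is the vector that would be passed to the linear SVM predictor in place of a Softmax layer.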
2.4.4. Multilayer Perceptron
The perceptron model was developed by Rosenblatt (1958) based on the neuron model by McCulloch & Pitts (1943). A perceptron may be represented by a linear function (given by Eq. 11), which is then passed to an activation function such as sigmoid, sign, or tanh. These activation functions introduce non-linearity (except for the sign function) to represent complex functions.
$$h(x) = \sum_{i=1}^{n} w_{i}x_{i} + b \tag{11}$$
As the term itself implies, a multilayer perceptron (MLP) is a neural network that consists of hidden layers of perceptrons. In this study, the activation function used was the LeakyReLU (Maas et al., 2013) function (given by Eq. 12).
$$f(h(x)) = \max(\alpha\,h(x),\ h(x)) \tag{12}$$
where $w_{i}$ and $b$ are the weights and bias of the perceptron, and $\alpha$ is a small positive slope applied to negative inputs.
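A perceptron followed by a LeakyReLU activation can be sketched as follows. The negative slope of 0.01 is an assumed value for illustration, as are the weights, bias, and function names.

```python
import numpy as np


def leaky_relu(h, alpha=0.01):
    """Pass positive values through; scale negative values by alpha (assumed 0.01)."""
    return np.maximum(alpha * h, h)


def perceptron(x, w, b):
    """Linear function of the inputs, before any activation."""
    return np.dot(w, x) + b


x = np.array([1.0, -2.0, 0.5])
w = np.array([0.4, 0.3, -0.2])   # hypothetical weights
b = 0.1                          # hypothetical bias

pre_activation = perceptron(x, w, b)      # 0.4 - 0.6 - 0.1 + 0.1
activation = leaky_relu(pre_activation)
print(pre_activation, activation)
```

Unlike ReLU, a negative pre-activation is scaled down rather than set to zero, so a small gradient still flows through the unit.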
The learning parameters, weight and bias, for each DL model were learned by the L2-SVM using the loss function given by Eq. 6. The computed loss is then minimized through Adam (Kingma and Ba, 2014) optimization. Then, the decision function $f(x) = w^{T}x + b$ produces a vector of scores, one for each malware family. In order to get the predicted label $y$ for a given data $x$, the $\arg\max$ function is used (see Eq. 13).
$$y = \arg\max\,(w^{T}x + b) \tag{13}$$
The $\arg\max$ function returns the index of the highest score across the vector of predicted class scores $w^{T}x + b$.
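The prediction step can be illustrated with a small example in which a linear layer produces one score per class and the highest-scoring index is taken as the predicted label. The weight matrix, bias, and inputs below are made-up values with three classes instead of 25.

```python
import numpy as np

# Hypothetical linear output layer: 2 features, 3 classes.
W = np.array([[0.3, -1.2, 2.1],
              [-0.5, 0.9, -0.1]])
b = np.zeros(3)

X = np.array([[1.0, 0.0],    # first sample
              [0.0, 1.0]])   # second sample

scores = X @ W + b                     # one row of class scores per sample
predicted = np.argmax(scores, axis=1)  # index of the highest score per row
print(predicted)
```

With these values the first sample is assigned class index 2 and the second class index 1.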
2.5. Data Analysis
There were two phases of experimentation in this study: (1) the training phase, and (2) the test phase. All the deep learning algorithms described in Section 2.4 were trained and tested on the Malimg dataset (Nataraj et al., 2011). The dataset was partitioned in the following fashion: 70% for the training phase, and 30% for the testing phase.
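A plain random 70/30 partition of sample indices can be sketched as below. This is an illustration only: the seed, the function name `split_indices`, and the absence of any stratification or rounding to a batch-size multiple are assumptions, so the resulting counts need not match the sample counts reported with the results.

```python
import numpy as np


def split_indices(num_samples, train_fraction=0.7, seed=0):
    """Shuffle sample indices and cut them into training and testing parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(num_samples)
    cut = int(num_samples * train_fraction)
    return order[:cut], order[cut:]


train_idx, test_idx = split_indices(9339)   # 9,339 samples in the dataset
print(len(train_idx), len(test_idx))
```

Because the shuffle ignores class labels, small malware families may end up under-represented in either part.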
The variables considered in the experiments were the following:
The classification measures F1 score, precision, and recall were all computed using the classification_report() function of sklearn.metrics(Pedregosa et al., 2011).
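For reference, the per-class quantities behind such a report can be computed by hand. The sketch below is a plain-Python illustration and not the scikit-learn implementation; the function name and the toy labels are assumptions.

```python
def per_class_scores(y_true, y_pred, label):
    """Precision, recall, and F1 score for one class, from raw label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)   # true positives
    fp = sum(t != label and p == label for t, p in pairs)   # false positives
    fn = sum(t == label and p != label for t, p in pairs)   # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


y_true = [0, 0, 1, 1, 1, 2]   # toy ground-truth family indices
y_pred = [0, 1, 1, 1, 2, 2]   # toy predictions
precision, recall, f1 = per_class_scores(y_true, y_pred, label=1)
print(precision, recall, f1)
```

Averaging these per-class values over all families gives summary figures of the kind reported for each model.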
All experiments in this study were conducted on a laptop computer with Intel Core(TM) i5-6300HQ CPU @ 2.30GHz x 4, 16GB of DDR3 RAM, and NVIDIA GeForce GTX 960M 4GB DDR5 GPU. Table 2 shows the hyper-parameters used by the DL-SVM models in the conducted experiments. Table 3 summarizes the experiment results for the presented DL-SVM models.
| Hyper-parameter | CNN-SVM | GRU-SVM | MLP-SVM |
|---|---|---|---|
| Cell Size | N/A | [256 × 5] | [512, 256, 128] |
| No. of Hidden Layers | 2 | 5 | 3 |
In contrast to Garcia and Muga II (2016), the relative populations of each malware family were not considered when partitioning the dataset. All the DL-SVM models were trained on 70% of the preprocessed Malimg dataset (Nataraj et al., 2011), i.e. 6400 malware family variants, for 100 epochs. On the other hand, the models were tested on 30% of the preprocessed Malimg dataset, i.e. 2560 malware family variants, for 100 epochs.
Figure 2 summarizes the training accuracy of the DL-SVM models for 100 epochs (equivalent to 2500 steps, since $6400 \times 100 \div 256 = 2500$). First, the CNN-SVM model completed its training in 3 minutes and 41 seconds with an average training accuracy of 80.96875%. Meanwhile, the GRU-SVM model completed its training in 11 minutes and 32 seconds with an average training accuracy of 90.9375%. Lastly, the MLP-SVM model completed its training in 12 seconds with an average training accuracy of 99.5768229%.
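The step count can be checked with a few lines of arithmetic. The batch size of 256 used here is the value implied by the reported sample, epoch, and step counts rather than a figure quoted from the hyper-parameter table.

```python
train_size = 6400     # training samples reported above
test_size = 2560      # testing samples reported above
epochs = 100
total_steps = 2500    # training steps reported above

steps_per_epoch = total_steps // epochs        # 25 steps per epoch
batch_size = train_size // steps_per_epoch     # implied batch size

print(steps_per_epoch, batch_size)
print(train_size % batch_size, test_size % batch_size)
```

Both partition sizes divide evenly by the implied batch size, so every step processes a full batch.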
Figure 3 shows the testing performance of the CNN-SVM model in multinomial classification on malware families. The model had a precision of 0.84, a recall of 0.77, and an F1 score of 0.79.
Figure 4 shows the testing performance of the GRU-SVM model in multinomial classification on malware families. The model had a precision of 0.85, a recall of 0.85, and an F1 score of 0.85.
Figure 5 shows the testing performance of the MLP-SVM model in multinomial classification on malware families. The model had a precision of 0.83, a recall of 0.80, and an F1 score of 0.81.
As shown in the confusion matrices, the DL-SVM models had better scores for the malware families with a high number of variants, most notably, Allaple.A and Allaple.L. This may be attributed to the omission of the relative populations of each malware family during the partitioning of the dataset into training data and testing data. However, unlike the results of Garcia and Muga II (2016), only Allaple.A and Allaple.L had some misclassifications between them.
It is evident that the GRU-SVM model stands out among the DL-SVM models presented in this study. This finding comes as no surprise, as the GRU-SVM model had the most sophisticated architecture design among the presented models, most notably, its 5-layer design. As explained in (Goodfellow et al., 2016), the number of layers of a neural network is related to the complexity of the function it can represent. Following this reasoning, a neural network with more hidden layers may be expected to perform better, and one with fewer hidden layers to perform worse. The findings in this study are consistent with this explanation: the MLP-SVM, with a 3-layer design, came second to the GRU-SVM (having 80.47% test accuracy), and the CNN-SVM, with a 2-layer design, came last (having 77.23% test accuracy).
The reported test accuracy of 84.92% indicates that the GRU-SVM model has the strongest predictive performance among the DL-SVM models in this study. This is attributed to the GRU-SVM model having the most complex design among the presented models. First, its 5-layer design allows it to represent increasingly complex mappings between features and labels, i.e. function mappings $x \mapsto y$. Second, it is capable of learning from data of a sequential nature, to which image data belongs. This capability of the GRU-RNN comes from its gating mechanisms, given by the equations in Section 2.4.3. Through these mechanisms, the GRU-RNN addresses the problem of vanishing gradients and exploding gradients (Cho et al., 2014), and is thus able to connect information across a considerable gap. However, as indicated by the training summary given by Figure 2, the GRU-SVM has the caveat of a relatively longer computing time. Having finished its training in 11 minutes and 32 seconds, it was the slowest among the DL-SVM models. From a high-level inspection of the presented equations of each DL-SVM model (CNN-SVM in Section 2.4.2, GRU-SVM in Section 2.4.3, and MLP-SVM in Section 2.4.4), it was theoretically implied that the GRU-SVM would have the longest computing time, as it has more non-linearities in its computation. On the other hand, with the fewest non-linearities (having only used LeakyReLU), the MLP-SVM model was expected to have the shortest computing time.
From the literature explanation (Goodfellow et al., 2016) and the empirical evidence, it can be inferred that increasing the complexity of the architectural design (e.g. more hidden layers, better non-linearities) of the CNN-SVM and MLP-SVM models may improve their predictive performance and bring them closer to par with the GRU-SVM model. In turn, this implication warrants further study and exploration that may be beneficial to the information security community.
5. Conclusion and Recommendation
We used the Malimg dataset prepared by Nataraj et al. (2011), which consists of malware images, for the purpose of malware family classification. We employed deep learning models with the L2-SVM as their final output layer in a multinomial classification task. The empirical data show that the GRU-SVM model by Agarap (2017) had the highest predictive accuracy among the presented DL-SVM models, with a test accuracy of 84.92%.
Improving the architecture design of the CNN-SVM and MLP-SVM models by adding more hidden layers, adding better non-linearities, and/or using an optimized dropout may provide better insights on their application to malware classification. Such insights may reveal which architecture would serve best in the engineering of an intelligent anti-malware system.
We extend our gratitude to the open-source community, especially to TensorFlow. We also thank Lakshmanan Nataraj, S. Karthikeyan, Gregoire Jacob, and B.S. Manjunath for the Malimg dataset (Nataraj et al., 2011).
We would also like to express our appreciation to friends who took time reading parts of this paper during its development, and who gave their support to this endeavor. F.J.H. Pepito gives his appreciation to Hyacinth Gasmin, and A.F. Agarap gives his appreciation to Ma. Pauline de Ocampo, Abqary Alon, Rhea Jude Ferrer, and Julius Luis H. Diaz.
- ten (2017) 2017. Deep MNIST for Experts. (Nov 2017). https://www.tensorflow.org/get_started/mnist/pros
- Abadi et al. (2015) Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. (2015). http://tensorflow.org/ Software available from tensorflow.org.
- Agarap (2017) Abien Fred Agarap. 2017. A Neural Network Architecture Combining Gated Recurrent Unit (GRU) and Support Vector Machine (SVM) for Intrusion Detection in Network Traffic Data. arXiv preprint arXiv:1709.03082 (2017).
- Cho et al. (2014) Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078 (2014).
- Cortes and Vapnik (1995) C. Cortes and V. Vapnik. 1995. Support-vector Networks. Machine Learning 20.3 (1995), 273–297. https://doi.org/10.1007/BF00994018
- Garcia and Muga II (2016) Felan Carlo C. Garcia and Felix P. Muga II. 2016. Random Forest for Malware Classification. arXiv preprint arXiv:1609.07770 (2016).
- Goodfellow et al. (2016) Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org.
- Hunter (2007) J. D. Hunter. 2007. Matplotlib: A 2D graphics environment. Computing In Science & Engineering 9, 3 (2007), 90–95. https://doi.org/10.1109/MCSE.2007.55
- Kingma and Ba (2014) Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
- Maas et al. (2013) Andrew L Maas, Awni Y Hannun, and Andrew Y Ng. 2013. Rectifier nonlinearities improve neural network acoustic models. In Proc. ICML, Vol. 30.
- McCulloch and Pitts (1943) Warren S McCulloch and Walter Pitts. 1943. A logical calculus of the ideas immanent in nervous activity. The bulletin of mathematical biophysics 5, 4 (1943), 115–133.
- Nataraj et al. (2011) Lakshmanan Nataraj, S Karthikeyan, Gregoire Jacob, and BS Manjunath. 2011. Malware images: visualization and automatic classification. In Proceedings of the 8th international symposium on visualization for cyber security. ACM, 4.
- Pedregosa et al. (2011) F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830.
- Rosenblatt (1958) Frank Rosenblatt. 1958. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological review 65, 6 (1958), 386.
- Shelly and Vermaat (2011) Gary B Shelly and Misty E Vermaat. 2011. Discovering Computers, Complete: Your Interactive Guide to the Digital World. Cengage Learning.
- Tang (2013) Yichuan Tang. 2013. Deep learning using linear support vector machines. arXiv preprint arXiv:1306.0239 (2013).
- Walt et al. (2011) Stéfan van der Walt, S Chris Colbert, and Gael Varoquaux. 2011. The NumPy array: a structure for efficient numerical computation. Computing in Science & Engineering 13, 2 (2011), 22–30.