1. Introduction
Effective and efficient mitigation of malware is a long-standing endeavor in the information security community. The development of an anti-malware system that can counteract unknown malware is a prolific activity that may benefit several sectors.
Intercepting unknown malware, or even just an unknown variant, is a laborious task, and may only be accomplished by constantly updating the anti-malware signature database. This database contains the information on all malware known to the particular system (Shelly and Vermaat, 2011), and is used for malware detection. Consequently, newly released malware not yet included in the database will go undetected.
We envision an intelligent anti-malware system that employs a deep learning (DL) approach, which would enable the detection of newly released malware through its capability to generalize on data. Furthermore, we amend the conventional DL models to use the support vector machine (SVM) as their classification function.
We take advantage of the Malimg dataset (Nataraj et al., 2011), which consists of visualized malware binaries, and use it to train the DL-SVM models to classify each malware family.
2. Methodology
2.1. Machine Intelligence Library
Google TensorFlow
(Abadi et al., 2015) was used to implement the deep learning algorithms in this study, with the aid of other scientific computing libraries: matplotlib (Hunter, 2007), numpy (Walt et al., 2011), and scikit-learn (Pedregosa et al., 2011).

2.2. The Dataset
The deep learning (DL) models in this study were evaluated on the Malimg dataset (Nataraj et al., 2011), which consists of 9,339 malware samples from 25 different malware families. Table 1 shows the frequency distribution of malware families and their variants in the Malimg dataset.
No.  Family  Family Name  No. of Variants 

01  Dialer  Adialer.C  122 
02  Backdoor  Agent.FYI  116 
03  Worm  Allaple.A  2949 
04  Worm  Allaple.L  1591 
05  Trojan  Alueron.gen!J  198 
06  Worm:AutoIT  Autorun.K  106 
07  Trojan  C2Lop.P  146 
08  Trojan  C2Lop.gen!G  200 
09  Dialer  Dialplatform.B  177 
10  Trojan Downloader  Dontovo.A  162 
11  Rogue  Fakerean  381 
12  Dialer  Instantaccess  431 
13  PWS  Lolyda.AA 1  213 
14  PWS  Lolyda.AA 2  184 
15  PWS  Lolyda.AA 3  123 
16  PWS  Lolyda.AT  159 
17  Trojan  Malex.gen!J  136 
18  Trojan Downloader  Obfuscator.AD  142 
19  Backdoor  Rbot!gen  158 
20  Trojan  Skintrim.N  80 
21  Trojan Downloader  Swizzor.gen!E  128 
22  Trojan Downloader  Swizzor.gen!I  132 
23  Worm  VB.AT  408 
24  Trojan Downloader  Wintrim.BX  97 
25  Worm  Yuner.A  800 
Nataraj et al. (2011) created the Malimg dataset by reading malware binaries as 8-bit unsigned integers composing a matrix $M \in \mathbb{R}^{m \times n}$. The said matrix may be visualized as a grayscale image with values in the range $[0, 255]$, where 0 represents black and 255 represents white.
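As a rough illustration of this visualization step, the byte-to-matrix conversion can be sketched in numpy. The fixed width, zero padding, and function name below are our own assumptions for the sketch; Nataraj et al. chose image widths based on file size.

```python
import numpy as np

def binary_to_grayscale(payload: bytes, width: int = 32) -> np.ndarray:
    """Read raw malware bytes as 8-bit unsigned integers and reshape
    them into a 2-D matrix, padding the tail with zeros so the byte
    count divides evenly by the chosen width."""
    buffer = np.frombuffer(payload, dtype=np.uint8)
    pad = (-len(buffer)) % width  # zeros needed to complete the last row
    buffer = np.concatenate([buffer, np.zeros(pad, dtype=np.uint8)])
    return buffer.reshape(-1, width)  # each row is one scanline of the image

# Example: 70 bytes become a 3-row "image" of width 32,
# with the last 26 cells zero-padded.
image = binary_to_grayscale(bytes(range(70)), width=32)
```

Each matrix entry is already in $[0, 255]$, so it can be rendered directly as a grayscale pixel.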
2.3. Dataset Preprocessing
Similar to the approach of Garcia and Muga II (2016), the malware images were resized to a 2-dimensional matrix of $32 \times 32$, and were flattened into a $1 \times 1024$ array, resulting in a $9339 \times 1024$ dataset array. Each feature array was then labelled with its corresponding indexed malware family name (i.e. 0–24). The features were then standardized using Eq. 1.

$z = \dfrac{X - \mu}{\sigma}$  (1)

where $X$ is the feature to be standardized, $\mu$ is its mean value, and $\sigma$ is its standard deviation. The standardization was implemented using StandardScaler().fit_transform() of scikit-learn (Pedregosa et al., 2011). Granted, the dataset consists of images, and standardization may not be suitable for such data; note, however, that the images originate from malware binary files, so the features are not technically images to begin with.

2.4. Computational Models
This section presents the deep learning (DL) models, and the support vector machine (SVM) classifier used in the study.
2.4.1. Support Vector Machine (SVM)
The support vector machine (SVM) was developed by Cortes and Vapnik (1995)
for binary classification. Its objective is to find the optimal hyperplane
$f(\mathbf{w}, \mathbf{x}) = \mathbf{w} \cdot \mathbf{x} + b$ that separates the two classes in a given dataset with features $\mathbf{x} \in \mathbb{R}^p$. SVM learns the parameters $\mathbf{w}$ and $b$ by solving the following constrained optimization problem:
$\min \dfrac{1}{p}\|\mathbf{w}\|_1 + C\sum_{i=1}^{p}\xi_i$  (2)

s.t. $y'_i(\mathbf{w}^\top \mathbf{x}_i + b) \geq 1 - \xi_i$  (3)

$\xi_i \geq 0, \; i = 1, \ldots, p$  (4)

where $\|\mathbf{w}\|_1$ is the Manhattan norm (also known as the L1 norm), $C$ is the penalty parameter (which may be an arbitrary value or a value selected through hyperparameter tuning), and $\xi$ is a cost function. The corresponding unconstrained optimization problem is given by Eq. 5.

$\min \dfrac{1}{p}\|\mathbf{w}\|_1 + C\sum_{i=1}^{p}\max\big(0, 1 - y'_i(\mathbf{w}^\top \mathbf{x}_i + b)\big)$  (5)

where $y'_i$ is the actual label, and $\mathbf{w}^\top \mathbf{x}_i + b$ is the predictor function. This equation is known as L1-SVM, with the standard hinge loss. Its differentiable counterpart, L2-SVM (given by Eq. 6), provides more stable results (Tang, 2013).
$\min \dfrac{1}{p}\|\mathbf{w}\|_2^2 + C\sum_{i=1}^{p}\max\big(0, 1 - y'_i(\mathbf{w}^\top \mathbf{x}_i + b)\big)^2$  (6)

where $\|\mathbf{w}\|_2$ is the Euclidean norm (also known as the L2 norm), with the squared hinge loss.
Although intended for binary classification, SVM may be used for multinomial classification as well. One approach to achieve this is the use of kernel tricks, which convert a linear model into a non-linear model by applying kernel functions such as the radial basis function (RBF). However, for this study, we utilized the linear L2-SVM for the multinomial classification problem, and employed the one-versus-all (OvA) approach, which treats a given class as the positive class and the others as the negative class. Take, for example, the following classes: airplane, boat, car. If a given image belongs to the airplane class, it is taken as the positive class, which leaves the other two classes as the negative class.
With the OvA approach, the L2-SVM serves as the classifier of each deep learning model in this study (CNN, GRU, and MLP). That is, the learning parameters, weight and bias, of each model are learned by the SVM.
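To make the OvA setup concrete, a minimal numpy sketch of the L2-SVM squared hinge loss with one-versus-all targets follows. The function name, example scores, and $C$ value are illustrative assumptions, not the study's implementation.

```python
import numpy as np

def l2_svm_loss(scores, labels, weights, C=0.5):
    """Squared hinge (L2-SVM) loss over one-vs-all targets.

    scores  : (n_samples, n_classes) raw outputs w^T x + b
    labels  : (n_samples,) integer class indices
    weights : weight matrix, regularized via its squared L2 norm
    """
    n, k = scores.shape
    y = -np.ones((n, k))
    y[np.arange(n), labels] = 1.0  # OvA: +1 for own class, -1 elsewhere
    margins = np.maximum(0.0, 1.0 - y * scores)  # hinge margins
    regularizer = 0.5 * np.sum(weights ** 2)     # L2 norm of the weights
    return regularizer + C * np.sum(margins ** 2)

# Two samples, two classes; zero weights so only the hinge term remains.
scores = np.array([[2.0, -1.0], [-0.5, 1.5]])
labels = np.array([0, 1])
loss = l2_svm_loss(scores, labels, weights=np.zeros((4, 2)), C=0.5)
```

Each deep model in the study minimizes a loss of this form in place of the usual softmax cross-entropy.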
2.4.2. Convolutional Neural Network
Convolutional Neural Networks (CNNs) are similar to feedforward neural networks in that they also consist of hidden layers of neurons with "learnable" parameters. These neurons receive inputs, perform a dot product, and follow it with a nonlinearity such as sigmoid or tanh. The whole network expresses the mapping between raw image pixels $\mathbf{x}$ and class scores $\mathbf{y}$. For this study, the CNN architecture used resembles the one laid down in (ten, 2017):
INPUT: 32 × 32 × 1
CONV5: 5 × 5 size, 36 filters, 1 stride
LeakyReLU
POOL: 2 × 2 size, 1 stride
CONV5: 5 × 5 size, 72 filters, 1 stride
LeakyReLU
POOL: 2 × 2 size, 1 stride
FC: 1024 Hidden Neurons
LeakyReLU
DROPOUT: p = 0.85
FC: 25 Output Classes
The modifications introduced in the architecture design were the sizes of layer inputs and outputs (e.g. an input of 32 × 32 instead of 28 × 28, and an output of 25 classes), the use of LeakyReLU instead of ReLU, and the introduction of L2-SVM as the network classifier instead of the conventional Softmax function. This paradigm of combining CNN and SVM was proposed by Tang (2013).
2.4.3. Gated Recurrent Unit
Agarap (2017) proposed a neural network architecture combining the gated recurrent unit (GRU) (Cho et al., 2014) variant of a recurrent neural network (RNN) and the support vector machine (SVM) (Cortes and Vapnik, 1995) for the purpose of binary classification.

$z_t = \sigma(\mathbf{W}_z \cdot [h_{t-1}, x_t])$  (7)

$r_t = \sigma(\mathbf{W}_r \cdot [h_{t-1}, x_t])$  (8)

$\tilde{h}_t = \tanh(\mathbf{W} \cdot [r_t * h_{t-1}, x_t])$  (9)

$h_t = (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t$  (10)

where $z_t$ and $r_t$ are the update gate and reset gate of a GRU-RNN respectively, $\tilde{h}_t$ is the candidate value, and $h_t$ is the new RNN cell state value (Cho et al., 2014). In turn, $h_t$ is used as the predictor variable $\mathbf{x}$ in the L2-SVM predictor function (given by $\mathrm{sign}(\mathbf{w}^\top \mathbf{x} + b)$) of the network, instead of the conventional Softmax classifier.

2.4.4. Multilayer Perceptron
The perceptron model was developed by Rosenblatt (1958) based on the neuron model of McCulloch and Pitts (1943). A perceptron may be represented by a linear function (given by Eq. 11), which is then passed to an activation function such as sigmoid, sign, or tanh. These activation functions introduce nonlinearity (except for the sign function), enabling the representation of complex functions. As the term itself implies, a multilayer perceptron (MLP) is a neural network that consists of hidden layers of perceptrons. In this study, the activation function used was the LeakyReLU (Maas et al., 2013) function (given by Eq. 12).

$\hat{y} = \mathbf{w}^\top \mathbf{x} + b$  (11)

$f(x) = \max(\alpha x, x)$  (12)
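A minimal numpy sketch of the LeakyReLU in Eq. 12; the slope $\alpha = 0.01$ is a common default and an assumption here, not a value stated in the text.

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    """Eq. 12: pass positive inputs through, scale negatives by alpha."""
    return np.where(x > 0, x, alpha * x)

out = leaky_relu(np.array([-2.0, 0.0, 3.0]))  # -> [-0.02, 0.0, 3.0]
```

Unlike plain ReLU, the small negative slope keeps a non-zero gradient for negative inputs, which avoids "dead" neurons during training.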
The learning parameters, weight $\mathbf{w}$ and bias $b$, for each DL model were learned by the L2-SVM using the loss function given by Eq. 6. The computed loss is then minimized through Adam (Kingma and Ba, 2014) optimization. The decision function $f(\mathbf{x}) = \mathrm{sign}(\mathbf{w}^\top \mathbf{x} + b)$ produces a vector of scores for each malware family. To get the predicted label $\hat{y}$ for given data $\mathbf{x}$, the argmax function is used (see Eq. 13).

$\hat{y} = \arg\max\big(\mathrm{sign}(\mathbf{w}^\top \mathbf{x} + b)\big)$  (13)

The argmax function returns the index of the highest score across the vector of predicted classes.
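Concretely, Eq. 13 reduces to taking the index of the largest per-class score. The scores below are made up purely for illustration.

```python
import numpy as np

# Per-class decision scores w^T x + b for three samples over four
# hypothetical malware families (values are illustrative only).
scores = np.array([
    [ 0.3, -1.2,  2.1, -0.4],
    [-0.8,  1.7, -0.2,  0.5],
    [ 1.1,  0.9, -2.0, -0.1],
])

predicted = np.argmax(scores, axis=1)  # Eq. 13: highest score wins
```

Each entry of `predicted` is the index of the malware family whose one-versus-all score was largest for that sample.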
2.5. Data Analysis
There were two phases of experiments in this study: (1) a training phase, and (2) a test phase. All the deep learning algorithms described in Section 2.4 were trained and tested on the Malimg dataset (Nataraj et al., 2011). The dataset was partitioned as follows: 70% for the training phase, and 30% for the testing phase.
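A 70/30 split of this kind can be sketched with a shuffled index permutation (scikit-learn's train_test_split serves the same purpose); the seed is arbitrary. Note that Section 3 reports 6,400/2,560 samples, suggesting the actual split was further truncated to batch-size multiples.

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # fixed seed for reproducibility

n_samples = 9339                      # Malimg sample count
indices = rng.permutation(n_samples)  # shuffle before splitting
split = int(0.7 * n_samples)          # 70% train / 30% test

train_idx, test_idx = indices[:split], indices[split:]
```

The shuffle ignores family frequencies, matching the study's choice not to stratify by malware family.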
The classification measures considered in the experiments were the following:

$\mathrm{Precision} = \dfrac{TP}{TP + FP}$  (14)

$\mathrm{Recall} = \dfrac{TP}{TP + FN}$  (15)

$F1 = 2 \times \dfrac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$  (16)

where $TP$, $FP$, and $FN$ denote the number of true positives, false positives, and false negatives respectively.
The F1 score, precision, and recall were all computed using the classification_report() function of sklearn.metrics (Pedregosa et al., 2011).
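For reference, the per-class quantities that classification_report() reports can be written out directly. This helper is a hypothetical sketch under a one-versus-all view of each class, not the study's code.

```python
import numpy as np

def precision_recall_f1(y_true, y_pred, positive):
    """Per-class precision, recall, and F1, computed one-vs-all
    by treating `positive` as the positive class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == positive) & (y_true == positive))
    fp = np.sum((y_pred == positive) & (y_true != positive))
    fn = np.sum((y_pred != positive) & (y_true == positive))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Tiny illustrative example: two of three positive predictions are correct.
p, r, f1 = precision_recall_f1([0, 0, 1, 1, 1], [0, 1, 1, 1, 0], positive=1)
```

Averaging these per-class values over the 25 families yields the summary figures shown in Table 3.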
3. Results
All experiments in this study were conducted on a laptop computer with an Intel Core(TM) i5-6300HQ CPU @ 2.30GHz (4 cores), 16GB of DDR3 RAM, and an NVIDIA GeForce GTX 960M GPU with 4GB of GDDR5 memory. Table 2 shows the hyperparameters used by the DL-SVM models in the conducted experiments. Table 3 summarizes the experiment results for the presented DL-SVM models.
Hyperparameters  CNN-SVM  GRU-SVM  MLP-SVM 

Batch Size  256  256  256 
Cell Size  N/A  [256] × 5  [512, 256, 128] 
No. of Hidden Layers  2  5  3 
Dropout Rate  0.85  0.85  None 
Epochs  100  100  100 
Learning Rate  1e-3  1e-3  1e-3 
SVM C  10  10  0.5 
Variables  CNN-SVM  GRU-SVM  MLP-SVM 

Accuracy  77.2265625%  84.921875%  80.46875% 
Data points  256000  256000  256000 
Epochs  100  100  100 
F1  0.79  0.85  0.81 
Precision  0.84  0.85  0.83 
Recall  0.77  0.85  0.80 
As opposed to the dataset partitioning done by Garcia and Muga II (2016), the relative populations of each malware family were not considered in the splitting process. All the DL-SVM models were trained on 70% of the preprocessed Malimg dataset (Nataraj et al., 2011), i.e. 6,400 malware family variants, for 100 epochs. The models were then tested on the remaining 30% of the preprocessed dataset, i.e. 2,560 malware family variants, for 100 epochs.
Figure 2 summarizes the training accuracy of the DL-SVM models for 100 epochs (equivalent to 2,500 steps, since (6,400 ÷ 256) × 100 = 2,500). First, the CNN-SVM model accomplished its training in 3 minutes and 41 seconds with an average training accuracy of 80.96875%. Meanwhile, the GRU-SVM model accomplished its training in 11 minutes and 32 seconds with an average training accuracy of 90.9375%. Lastly, the MLP-SVM model accomplished its training in 12 seconds with an average training accuracy of 99.5768229%.
Figure 3 shows the testing performance of the CNN-SVM model in multinomial classification on malware families. The mentioned model had a precision of 0.84, a recall of 0.77, and an F1 score of 0.79.
Figure 4 shows the testing performance of the GRU-SVM model in multinomial classification on malware families. The mentioned model had a precision of 0.85, a recall of 0.85, and an F1 score of 0.85.
Figure 5 shows the testing performance of the MLP-SVM model in multinomial classification on malware families. The mentioned model had a precision of 0.83, a recall of 0.80, and an F1 score of 0.81.
As shown in the confusion matrices, the DL-SVM models had better scores for the malware families with higher numbers of variants, most notably Allaple.A and Allaple.L. This may be attributed to the omission of the relative populations of each malware family during the partitioning of the dataset into training data and testing data. However, unlike the results of Garcia and Muga II (2016), only Allaple.A and Allaple.L had some misclassifications between them.
4. Discussion
It is evident that the GRU-SVM model stands out among the DL-SVM models presented in this study. This finding comes as no surprise, as the GRU-SVM model had the relatively most sophisticated architecture design among the presented models, most notably its 5-layer design. As explained by Goodfellow et al. (2016), the number of layers of a neural network is directly proportional to the complexity of the functions it can represent. In other words, the performance or accuracy of a neural network is directly proportional to the number of its hidden layers; conversely, the fewer hidden layers a neural network has, the lower its performance or accuracy. Hence, the findings in this study corroborate the literature, as the MLP-SVM, with its 3-layer design, came second (with 80.47% test accuracy) to the GRU-SVM, and the CNN-SVM, with its 2-layer design, came last (with 77.23% test accuracy).
The reported test accuracy of 84.92% clearly shows that the GRU-SVM model has the strongest predictive performance among the DL-SVM models in this study. This is attributed to the fact that the GRU-SVM model has the relatively most complex design among the presented models. First, its 5-layer design allows it to represent increasingly complex mappings between features and labels. Second, it has the capability to learn from data of a sequential nature, to which image data belongs. This nature of the GRU-RNN comes from its gating mechanisms, given by the equations in Section 2.4.3. Through the mentioned mechanisms, the GRU-RNN solves the problem of vanishing and exploding gradients (Cho et al., 2014), and is thus able to connect information across a considerable gap. However, as indicated by the training summary in Figure 2, the GRU-SVM has the caveat of a relatively longer computing time. Having finished its training in 11 minutes and 32 seconds, it was the slowest among the DL-SVM models. From a high-level inspection of the presented equations of each DL-SVM model (CNN-SVM in Section 2.4.2, GRU-SVM in Section 2.4.3, and MLP-SVM in Section 2.4.4), it was a theoretical implication that the GRU-SVM would have the longest computing time, as it has the most nonlinearities introduced in its computation. Conversely, with the fewest nonlinearities (having only used LeakyReLU), it was theoretically implied that the MLP-SVM model would have the shortest computing time.
From the literature (Goodfellow et al., 2016) and the empirical evidence, it can be inferred that increasing the complexity of the architectural designs (e.g. more hidden layers, better nonlinearities) of the CNN-SVM and MLP-SVM models may catapult their predictive performance to be more on par with the GRU-SVM model. In turn, this implication warrants further study and exploration that may prove beneficial to the information security community.
5. Conclusion and Recommendation
We used the Malimg dataset prepared by Nataraj et al. (2011), which consists of visualized malware binaries, for the purpose of malware family classification. We employed deep learning models with the L2-SVM as their final output layer in a multinomial classification task. The empirical data shows that the GRU-SVM model by Agarap (2017) had the highest predictive accuracy among the presented DL-SVM models, with a test accuracy of 84.92%.
Improving the architecture designs of the CNN-SVM and MLP-SVM models by adding more hidden layers, adding better nonlinearities, and/or using an optimized dropout may provide better insights into their application to malware classification. Such insights may reveal which architecture serves best in the engineering of an intelligent anti-malware system.
6. Acknowledgment
We extend our statement of gratitude to the open-source community, especially to TensorFlow. An appreciation as well to Lakshmanan Nataraj, S. Karthikeyan, Gregoire Jacob, and B.S. Manjunath for the Malimg dataset (Nataraj et al., 2011).
We would also like to express our appreciation to friends who took time reading parts of this paper during its development, and who gave their support to this endeavor. F.J.H. Pepito gives his appreciation to Hyacinth Gasmin, and A.F. Agarap gives his appreciation to Ma. Pauline de Ocampo, Abqary Alon, Rhea Jude Ferrer, and Julius Luis H. Diaz.
References
 ten (2017) 2017. Deep MNIST for Experts. (Nov 2017). https://www.tensorflow.org/get_started/mnist/pros
 Abadi et al. (2015) Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: LargeScale Machine Learning on Heterogeneous Systems. (2015). http://tensorflow.org/ Software available from tensorflow.org.
 Agarap (2017) Abien Fred Agarap. 2017. A Neural Network Architecture Combining Gated Recurrent Unit (GRU) and Support Vector Machine (SVM) for Intrusion Detection in Network Traffic Data. arXiv preprint arXiv:1709.03082 (2017).
 Cho et al. (2014) Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078 (2014).
 Cortes and Vapnik (1995) C. Cortes and V. Vapnik. 1995. Support-vector Networks. Machine Learning 20.3 (1995), 273–297. https://doi.org/10.1007/BF00994018
 Garcia and Muga II (2016) Felan Carlo C. Garcia and Felix P. Muga II. 2016. Random Forest for Malware Classification. arXiv preprint arXiv:1609.07770 (2016).
 Goodfellow et al. (2016) Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org.
 Hunter (2007) J. D. Hunter. 2007. Matplotlib: A 2D graphics environment. Computing In Science & Engineering 9, 3 (2007), 90–95. https://doi.org/10.1109/MCSE.2007.55
 Kingma and Ba (2014) Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
 Maas et al. (2013) Andrew L Maas, Awni Y Hannun, and Andrew Y Ng. 2013. Rectifier nonlinearities improve neural network acoustic models. In Proc. ICML, Vol. 30.
 McCulloch and Pitts (1943) Warren S McCulloch and Walter Pitts. 1943. A logical calculus of the ideas immanent in nervous activity. The bulletin of mathematical biophysics 5, 4 (1943), 115–133.
 Nataraj et al. (2011) Lakshmanan Nataraj, S Karthikeyan, Gregoire Jacob, and BS Manjunath. 2011. Malware images: visualization and automatic classification. In Proceedings of the 8th international symposium on visualization for cyber security. ACM, 4.
 Pedregosa et al. (2011) F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830.
 Rosenblatt (1958) Frank Rosenblatt. 1958. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological review 65, 6 (1958), 386.
 Shelly and Vermaat (2011) Gary B Shelly and Misty E Vermaat. 2011. Discovering Computers, Complete: Your Interactive Guide to the Digital World. Cengage Learning.
 Tang (2013) Yichuan Tang. 2013. Deep learning using linear support vector machines. arXiv preprint arXiv:1306.0239 (2013).
 Walt et al. (2011) Stéfan van der Walt, S Chris Colbert, and Gael Varoquaux. 2011. The NumPy array: a structure for efficient numerical computation. Computing in Science & Engineering 13, 2 (2011), 22–30.