Malware threats are a serious problem for computer security, and the ability to detect and classify malware is critical for maintaining the security of a computer. Malware analysis takes two broad approaches: static analysis and dynamic analysis. Static analysis disassembles malware code, reveals its execution logic, and finds patterns that trigger attack behavior. Dynamic analysis, on the other hand, runs malware in a virtual environment and obtains a trace report to identify the characteristics of its attack behavior. The static approach provides a complete picture of the program structure by reading the malware code from start to finish. In general, however, malware authors obfuscate program code by packing, which hinders this approach. Although the dynamic approach is unaffected by code obfuscation, it typically examines only a single execution path, which can lead to an incomplete understanding of malware activity.
In addition to the approaches described above, many researchers are investigating techniques for classifying malware families using malware visualization images. This technique converts the structure of a malware binary sample into a two-dimensional grayscale image and uses the image features for classification. As a result, images can be generated even if the malware code is obfuscated. Moreover, there is no need to execute the malware.
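The conversion described above can be sketched in a few lines: each byte of the binary is read as a pixel intensity (0-255), and the byte stream is folded into rows of a fixed width. The width-by-file-size heuristic below follows the common convention from the visualization literature; the exact thresholds are an illustrative assumption, not values taken from this paper.

```python
def width_for_size(n_bytes):
    """Pick an image width from the file size (common heuristic;
    thresholds here are illustrative assumptions)."""
    if n_bytes < 10 * 1024:
        return 32
    if n_bytes < 100 * 1024:
        return 64
    if n_bytes < 500 * 1024:
        return 256
    return 512

def binary_to_grayscale(data):
    """Interpret each byte as a grayscale pixel (0-255) and fold the
    byte stream into rows of a fixed width, truncating any remainder."""
    width = width_for_size(len(data))
    height = len(data) // width
    return [list(data[r * width:(r + 1) * width]) for r in range(height)]
```

No disassembly or execution is involved, which is why packing does not obstruct the conversion.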
In the field of image recognition, the Convolutional Neural Network (CNN), recognized as one of the representative methods of deep learning, is widely applied. For example, CNNs have been applied to the classification of fruit images, medical images, criminal investigation images, tread pattern images, and face images.
In addition, many reports have applied CNN models to malware visualization image classification. Unfortunately, it has not been revealed how to select a CNN model that fits a given malware dataset and achieves higher classification accuracy.
We propose a strategy to select a deep learning model that fits malware visualization images. Our strategy uses the fine-tuning method on pre-trained CNN models and a dataset that solves the imbalance problem. Based on the proposed strategy, we selected the VGG19 model to classify the Malimg dataset. Experimental results show a classification accuracy of 99.72%, which is higher than other previously proposed malware classification methods.
The contributions of this research are as follows:
We proposed a strategy to select a CNN model that fits malware visualization images using the fine-tuning method.
We solved the problem of sample imbalance between malware families using the undersampling technique.
We selected the VGG16 fine-tuning model and achieved high accuracy in classifying malware families.
This paper is organized as follows: Section 2 presents related work on malware detection and classification; Section 3 introduces the design of our strategy for comparing the performance of CNN models using fine-tuning; Section 4 describes the experimental settings; Section 5 provides the experimental results; Section 6 concludes the paper.
2 Related Work
In this section, we present related research on malware detection and classification, including the static analysis approach, the dynamic analysis approach, and the malware visualization image approach.
2.1 Static Analysis Approach
In relation to the static analysis approach, several methods have been proposed for analyzing malware code. For example, Lo et al. developed a static analysis tool called Malicious Code Filter (MCF). The tool used program property features to determine whether a program is malicious. Masud et al. combined three types of features, binary n-grams, assembly instruction sequences, and dynamic link libraries (DLLs), to detect malicious executables; a classifier for malware detection was then built based on SVM and Boosted J48. Yakura et al. used CNN attention maps to extract byte sequences that characterize malware families from malware binary data. Jung et al. disassembled the malware binary file and extracted the bytecode; the bytecode was then converted to an image, and malware was classified with a convolutional neural network. Sewak et al. investigated a deep learning-based system for malware detection. They prepared Auto-Encoders with one and three layers and Deep Neural Networks with two, four and seven layers, and evaluated the results of each combination. Dam and Touili proposed a tool called STAMAD. The tool automatically classifies malware and benign programs either by extracting API graphs that represent malicious behavior or by SVM-based machine learning.
2.2 Dynamic Analysis Approach
Regarding the dynamic analysis approach, a number of methods have been proposed to execute malware and analyze its behavior. For example, Rhode et al. proposed a model for detecting malicious files based on the first few seconds of the action sequence executed by the malware. The set of API calls obtained by executing the malware PE file in the Cuckoo Sandbox was fed to a recurrent neural network for analysis. Xiaofeng et al. proposed an architecture for detecting malware by combining machine learning (random forest) and deep learning (LSTM) using an API call sequence. S.L and CD proposed a CNN-based Windows malware detector that detects and classifies malware using the behavior of Portable Executable (PE) files. The proposed method executed malware PE files in a sandbox and obtained N-grams of API calls. Liu and Wang implemented a malware detection system based on deep learning and API calls. They used the Cuckoo Sandbox to extract the API call sequence of a malicious program and evaluated it with BLSTM. Li et al. proposed a technique for detecting kernel-resident malware using the location of the page global directory and the instruction set of the processor. They implemented the technique in a tool called Fluorescence and showed that 200 virtual machines could be inspected in about an hour.
2.3 Adversarial Attacks Against Malware Detection
In recent years, research has been reported not only on malware detection methods but also on attack methods that invalidate malware detection. Grosse et al. extended existing adversarial example crafting algorithms to build effective attacks against malware detection models. This approach works in the discrete binary input domain. Rosenberg et al. generated adversarial examples against an RNN malware classifier based on API calls. They showed that the proposed attack is feasible by implementing a black-box attack.
2.4 Malware Visualization Approach
As for malware classification using malware visualization images, many researchers have reported results using the Malimg dataset. Nataraj et al. selected GIST to extract image features and classified them using k-nearest neighbors, obtaining a classification accuracy of 97.18%. Nonetheless, the confusion matrix showed some confusion between "C2Lop.P" and "C2Lop.gen!G", and similar confusion between "Swizzor.gen!I" and "Swizzor.gen!E". As the names imply, these malware families have similar characteristics, so a new approach was needed to classify them. Kosmidis and Kalloniatis classified malware images using machine learning techniques such as Decision Tree, Nearest Centroid, Stochastic Gradient, Perceptron, Multilayer Perceptron, and Random Forest. Of these techniques, the best result was Random Forest with 91.6% accuracy. Cui et al. used a CNN to extract image features and classify malware families. In that paper, the number of samples belonging to each family was equalized by data augmentation before training. The experimental classification accuracy reached 94.5%, with Precision and Recall almost identical. This indicates that not only malware families with many samples but also those with few samples were classified with a certain degree of accuracy.
In addition, various methods using deep learning models have been reported. Rezende et al. prepared a ResNet50 model in which all convolutional layer parameters were transferred from a model previously trained on the ImageNet dataset. Experimental results indicate that malware families could be classified with an accuracy of 98.62%. Mourtaji et al. report that the VGG16 model was used for classification with an accuracy of 97.02%. Kalash et al. state that malware families were classified with an accuracy of 98.52% using M-CNN based on VGG16. Lo et al. indicate that the Xception model achieved a classification accuracy of 99.03%. These four reports, nevertheless, do not sufficiently reveal how to pick a model that fits a given malware dataset. Moreover, since they show neither confusion matrices nor evaluation metrics such as Precision and Recall, it is not clear how accurately the proposed models classified malware families with only a few samples.
3 Strategy Design
This section reveals our strategy design.
Our strategy uses the fine-tuning method for the pre-trained CNN model and a dataset that solves the imbalance problem.
3.1 CNN fine-tuning model
We classify malware visualization images using convolutional neural networks pre-trained on ImageNet. ImageNet is a database that stores over 14 million color images and is intended for use by researchers. As shown in Figure 1, the ImageNet classification accuracy of each CNN model has been reported.
As one possible strategy for classifying malware visualization images, one could simply select the model with the highest reported classification accuracy. However, this metric does not indicate general performance on all datasets. We believe that the compatibility of a model with the dataset is an important factor when choosing a model. Therefore, we check the compatibility between the dataset and each deep learning model by evaluating its performance on a dataset in which the imbalance problem has been solved.
Fine-tuning is a widely used technique for model reuse in which several layers of a pre-trained model are unfrozen. Our strategy applies fine-tuning under unified conditions to compare performance across CNN models. The condition is to freeze 80% of the convolutional layers and train the remaining 20% on image features. The main reason for this condition is that Tajbakhsh et al. evaluated the adaptation between fine-tuning and datasets, and found that neither shallow tuning nor deep tuning was the optimal choice for a particular image dataset.
Subsequently, we discard the original fully-connected layers, prepared for classifying 1000 classes, from each fine-tuned model. We then designed new fully-connected layers to classify the 25 classes and added them to each model. The parameters of these new fully-connected layers are also trained on image features. Our proposed model architecture is shown in Figure 1. The fully-connected layers shown in Figure 1 are used in all our experiments.
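The fine-tuning setup above can be summarized as a small bookkeeping sketch: split an ordered list of convolutional layers at the 80% mark, keep the first part frozen, retrain the rest, and attach a new classification head sized for the malware families. This is not the authors' code; the layer and head names are hypothetical, and in Keras the "frozen" part would correspond to setting `layer.trainable = False`.

```python
def fine_tune_plan(conv_layers, freeze_ratio=0.8, n_classes=25):
    """Partition pre-trained convolutional layers into a frozen part
    (weights kept from ImageNet) and a trainable part, and name the
    new fully-connected head that replaces the 1000-class original."""
    cut = int(len(conv_layers) * freeze_ratio)
    frozen = conv_layers[:cut]      # kept fixed during fine-tuning
    trainable = conv_layers[cut:]   # re-trained on malware images
    # New head: hypothetical layer names for a 25-class classifier.
    head = ["flatten", "dense", f"softmax_{n_classes}"]
    return frozen, trainable, head
```

Applying the same `freeze_ratio` to every candidate model is what makes the per-model comparison in the strategy fair.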
3.2 Approach to imbalanced dataset
In general, models generated by directly learning imbalanced data have poor classification accuracy for classes with a small number of samples. Therefore, there are many research reports on the problem of data imbalance. Class imbalance approaches can be divided into two main categories: 1) data-level and 2) algorithm-level approaches. Data-level approaches improve the distribution of the data, for example by oversampling the minority classes or undersampling the majority classes. Algorithm-level approaches, on the other hand, adjust the classifier itself based on cost-oriented learning.
We use the Malimg dataset created by Nataraj et al. to classify malware visualization images. Malimg is a dataset of 9339 images covering 25 malware families. According to the authors, the malware underlying this dataset is wild malware submitted to the Anubis analysis system. Each sample was labeled by Microsoft Security Essentials and assigned to a malware family. As shown in Figure 2, the number of image samples per family is imbalanced: Allaple.A has 2949 images, while Skintrim.N, WinTrim.BX and Autorun.K have only 80, 97 and 106 images, respectively. Among the several approaches to imbalanced data, we adopt the undersampling method. The main reason is that most families in the Malimg dataset have roughly similar sample sizes, except for the three families Allaple.A, Allaple.L, and Yuner.A.
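Undersampling here simply caps each family at a maximum sample count. A minimal sketch, assuming the dataset is held as a mapping from family name to a list of image samples (the data layout is an assumption for illustration):

```python
import random

def undersample(samples_by_family, max_per_family, seed=0):
    """Cap every family at `max_per_family` images by random sampling
    without replacement; smaller families are kept unchanged."""
    rng = random.Random(seed)
    balanced = {}
    for family, images in samples_by_family.items():
        if len(images) > max_per_family:
            balanced[family] = rng.sample(images, max_per_family)
        else:
            balanced[family] = list(images)
    return balanced
```

With a cap of 80, a majority family such as Allaple.A (2949 images) is reduced to 80, while the smallest family already has exactly 80.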
4 Experimental Settings
This section describes our experimental settings. The contents are composed of Dataset, Evaluation metrics, and Evaluation environment.
4.1 Dataset
We undersample the Malimg dataset and cap the number of samples belonging to each malware family at 80, so that all classes have the same number of samples. However, undersampling may discard important features, so we also prepare groups with the maximum number of samples set to 160, 240, and 320. An example of the number of image samples belonging to each class is shown in Table 2. In all experiments, we kept 90% of the dataset as training data and the rest as test data. Note that we did not use data augmentation in the training process in any experiment.
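The 90/10 split can be done per family so that every class is represented in both partitions. A stratified-split sketch under the same assumed family-to-samples mapping as before (the split details beyond the 90% figure are our assumption):

```python
import random

def stratified_split(samples_by_family, train_frac=0.9, seed=0):
    """Shuffle each family independently and put `train_frac` of its
    samples into the training set and the rest into the test set."""
    rng = random.Random(seed)
    train, test = [], []
    for family, images in samples_by_family.items():
        shuffled = list(images)
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_frac)
        train += [(family, img) for img in shuffled[:cut]]
        test += [(family, img) for img in shuffled[cut:]]
    return train, test
```

For the Max80 group this yields 72 training and 8 test images per family.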
4.2 Evaluation metrics
We use accuracy, precision, and recall as evaluation metrics. These metrics are frequently adopted in the research community to provide comprehensive assessments of imbalanced learning problems. They are defined as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)

where true positive (TP) and false positive (FP) are the numbers of samples correctly and wrongly classified as malware. Likewise, true negative (TN) and false negative (FN) are the numbers of samples correctly and wrongly classified as benign.
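The three metrics follow directly from the four counts; a minimal sketch:

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all samples that are classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

def precision(tp, fp):
    """Of the samples predicted as malware, the fraction that are malware."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Of the actual malware samples, the fraction that are detected."""
    return tp / (tp + fn)
```

For imbalanced data, precision and recall reveal per-class weaknesses that overall accuracy can hide.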
4.3 Evaluation environment
| Item | Value |
|---|---|
| CPU | Intel Core i9-9900K |
| GPU | Nvidia GeForce RTX 2080Ti |
| OS | Ubuntu 18.04 LTS 64bit |
| Code | Keras on top of TensorFlow |
| Loss Function | Categorical Cross Entropy |
5 Experimental Result
In this section, we show our experimental results. We first evaluated fine-tuning models pre-trained on ImageNet using a dataset that solves the data imbalance problem. Next, we investigated the optimal fine-tuning and undersampling for the selected models and report the results of classifying the entire Malimg dataset. Finally, we compared our experimental results with previous reports.
5.1 Selecting models that fit the dataset
We prepared models pre-trained on ImageNet: NasNet, DenseNet201, Xception, ResNet50, VGG19, and VGG16. We fine-tuned each model with 80% of the convolutional layers frozen. The training and validation results are shown in Figure 3 and Figure 4.
Figure 3 shows that ACC and LOSS followed almost identical curves for all models. On the other hand, as shown in Figure 4, the Val_acc value increased and the Val_loss value decreased only for the VGG16 and VGG19 models. Of course, depending on the design of some parameters, other models might improve their validation performance. In our strategy, however, these models did not obtain sufficient validation performance using the Max80 dataset. The table shows the classification accuracy of the validation. We selected the VGG16 model because it was slightly more accurate than the VGG19 model.
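The selection step itself reduces to picking the model with the best validation accuracy on the balanced dataset. A trivial sketch (the model names and accuracy values below are hypothetical, not the paper's measured results):

```python
def select_best_model(val_accuracy):
    """Given a mapping of model name -> validation accuracy on the
    balanced dataset, return the best-fitting model's name."""
    return max(val_accuracy, key=val_accuracy.get)
```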
Table 5 shows the classification accuracy using the 9339 images of the whole Malimg dataset. Because the difference in classification accuracy between VGG16 and VGG19 was very small, we kept both models at this stage.
5.2 Investigating optimal undersampling and fine-tuning
We investigated the optimal fine-tuning of the selected VGG16 and VGG19 models. We prepared the Max80, Max160, Max240 and Max320 datasets for this investigation. First, we trained and validated on Max80, Max160, Max240 and Max320 using VGG19 with 80% of the layers frozen. The validation performance results are shown in Figure 5 and Figure 6. Comparing the two figures, the datasets with a larger number of samples yielded higher accuracy and lower loss. This result shows that reducing the sample data too aggressively by undersampling lowers the classification accuracy. Referring to Figure 6 (a), since Max240 and Max320 drew almost the same curve, we conjecture that these numbers are close to the optimal value for undersampling. In both Figure 5 (b) and Figure 6 (b), the loss drops steadily, which means there is no tendency toward over-fitting.
Next, we trained and validated on Max240 and Max320 with 80%, 60%, 40% and 20% of the layers frozen; the top results are shown in Table 6. Figure 7 and Figure 8 show the confusion matrices generated by the VGG19 and VGG16 models. The models are observed to classify all malware families without major errors.
| VGG19 Frozen 60% trained by Max320 | 99.72 | 99.72 | 99.47 | 99.72 | 99.76 |
| VGG16 Frozen 40% trained by Max320 | 99.72 | 99.72 | 99.44 | 99.72 | 99.74 |
| VGG16 Frozen 40% trained by Max240 | 99.68 | 99.68 | 99.31 | 99.66 | 99.56 |
| VGG19 Frozen 60% trained by Max240 | 99.65 | 99.65 | 99.21 | 99.65 | 99.58 |
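The per-family error analysis relies on confusion matrices like those in Figures 7 and 8. A minimal sketch of how such a matrix is tallied from true and predicted labels (class names below are hypothetical placeholders for the 25 Malimg families):

```python
def confusion_matrix(y_true, y_pred, classes):
    """Return an NxN matrix where row i, column j counts samples of
    true class classes[i] predicted as classes[j]."""
    index = {c: i for i, c in enumerate(classes)}
    m = [[0] * len(classes) for _ in classes]
    for t, p in zip(y_true, y_pred):
        m[index[t]][index[p]] += 1
    return m
```

Off-diagonal cells expose exactly which family pairs a model confuses, which overall accuracy alone cannot show.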
5.3 Comparison with other reports
We compared the classification accuracy of our chosen model with that of other reports on malware visualization image classification. Table 7 shows that the classification accuracy of our selected model is higher than that of any previously reported method using the Malimg dataset, and that our strategy classifies it effectively.
6 Conclusion
We proposed a strategy to select a deep learning model that fits malware visualization images. First, we solved the problem of sample imbalance between malware families using the undersampling technique. Second, we selected a CNN model that fits malware visualization images using the fine-tuning method. Finally, we selected the VGG16 fine-tuning model and achieved high accuracy in classifying malware families. In future work, the proposed strategy could be improved by preparing a dataset with a larger number of malware samples.
References
- (2019) EmBench: Quantifying Performance Variations of Deep Neural Networks across Modern Commodity Devices. In The 3rd International Workshop on Deep Learning for Mobile Systems and Applications, EMDL '19, New York, NY, USA, pp. 1–6.
- (2014) Malimg Dataset. https://sarvamblog.blogspot.com/2014/08/
- (2017) Xception: Deep Learning With Depthwise Separable Convolutions.
- (2018) Detection of Malicious Code Variants Based on Deep Learning. IEEE Transactions on Industrial Informatics 14 (7), pp. 3187–3196.
- (2019) STAMAD: A STAtic MAlware Detector. In Proceedings of the 14th International Conference on Availability, Reliability and Security, ARES '19, New York, NY, USA.
- (2009) ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255.
- (2017) Deep Learning with Python. Manning Publications Company.
- (2017) Adversarial examples for malware detection. In European Symposium on Research in Computer Security, pp. 62–79.
- (2019) Max-margin Class Imbalanced Learning with Gaussian Affinity. CoRR abs/1901.07711.
- (2016) Deep Residual Learning for Image Recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2017) Densely Connected Convolutional Networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2018) Malware Classification Using Byte Sequence Information. In Proceedings of the 2018 Conference on Research in Adaptive and Convergent Systems, RACS '18, New York, NY, USA, pp. 143–148.
- (2018) Malware Classification with Deep Convolutional Neural Networks. In 2018 9th IFIP International Conference on New Technologies, Mobility and Security (NTMS), pp. 1–5.
- (2018) Stanford University CS231n: Convolutional Neural Networks for Visual Recognition. http://cs231n.stanford.edu/syllabus.html
- Fine-tuning Approach to NIR Face Recognition. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2337–2341.
- (2017) Machine Learning and Images for Malware Detection and Classification. In Proceedings of the 21st Pan-Hellenic Conference on Informatics, PCI 2017, New York, NY, USA.
- (2015) Deep learning. Nature 521, pp. 436–444.
- (2019) Fluorescence: Detecting Kernel-Resident Malware in Clouds. In 22nd International Symposium on Research in Attacks, Intrusions and Defenses (RAID 2019), Chaoyang District, Beijing, pp. 367–382.
- (2018) . In Proceedings of the 3rd International Conference on Multimedia and Image Processing, ICMIP 2018, New York, NY, USA, pp. 68–72.
- (2018) An Effective Tread Pattern Image Classification Algorithm Based on Transfer Learning. In Proceedings of the 3rd International Conference on Multimedia Systems and Signal Processing, ICMSSP '18, New York, NY, USA, pp. 51–55.
- (2019) A Robust Malware Detection System Using Deep Learning on API Calls. In 2019 IEEE 3rd Information Technology, Networking, Electronic and Automation Control Conference (ITNEC), pp. 1456–1460.
- (1995) MCF: A malicious code filter. Vol. 14, Elsevier.
- (2019) An Xception Convolutional Neural Network for Malware Classification with Transfer Learning. In 2019 10th IFIP International Conference on New Technologies, Mobility and Security (NTMS), pp. 1–5.
- A scalable multi-level feature extraction technique to detect malicious executables. Vol. 10, Springer.
- (2019) Intelligent Framework for Malware Detection with Convolutional Neural Network. In Proceedings of the 2nd International Conference on Networking, Information Systems & Security, NISS19, New York, NY, USA.
- (2011) Malware Images: Visualization and Automatic Classification. In Proceedings of the 8th International Symposium on Visualization for Cyber Security, VizSec '11, New York, NY, USA.
- (2017) Malicious Software Classification Using Transfer Learning of ResNet-50 Deep Neural Network. In 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 1011–1014.
- (2018) Early-stage malware prediction using recurrent neural networks. Computers & Security 77, pp. 578–594.
- (2011) Automatic analysis of malware behavior using machine learning. Vol. 19, IOS Press.
- (2018) Generic Black-Box End-to-End Attack Against State of the Art API Call Based Malware Classifiers. In Research in Attacks, Intrusions, and Defenses, M. Bailey, T. Holz, M. Stamatogiannakis, and S. Ioannidis (Eds.), Cham, pp. 490–510.
- (2019) Windows malware detector using convolutional neural network based on visualization images. IEEE Transactions on Emerging Topics in Computing, pp. 1–1.
- (2018) An Investigation of a Deep Learning Based Malware Detection System. In Proceedings of the 13th International Conference on Availability, Reliability and Security, ARES 2018, New York, NY, USA.
- (2019) Effectiveness of Transfer Learning and Fine Tuning in Automated Fruit Image Classification. In Proceedings of the 2019 3rd International Conference on Deep Learning Technologies, ICDLT 2019, New York, NY, USA, pp. 91–100.
- (2014) Very Deep Convolutional Networks for Large-Scale Image Recognition.
- (2016) Convolutional Neural Networks for Medical Image Analysis: Full Training or Fine Tuning?. IEEE Transactions on Medical Imaging 35 (5), pp. 1299–1312.
- (2018) Classification of Focal Liver Lesions Using Deep Learning with Fine-Tuning. In Proceedings of the 2018 International Conference on Digital Medicine and Image Processing, DMIP '18, New York, NY, USA, pp. 56–60.
- (2018) ASSCA: API based sequence and statistics features combined malware detection architecture. Vol. 129, Elsevier.
- (2018) Malware Analysis of Imaged Binary Samples by Convolutional Neural Network with Attention Mechanism. In Proceedings of the Eighth ACM Conference on Data and Application Security and Privacy, CODASPY '18, New York, NY, USA, pp. 127–134.
- (2018) Learning Transferable Architectures for Scalable Image Recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).