Malicious software (malware) is a computer program that compromises a host for various reasons. It may take control of the host for ransom (ransomware), or it may be used for launching attacks, typically Denial of Service (DoS), against other hosts and networks. The latter type depends on a large number of hosts being compromised, and the increase in the popularity of IoT devices has made that task much easier.
Traditional malware detection and anti-virus systems use signature-based methods. These methods fail in the presence of polymorphic or mutable code. Furthermore, a large number of malware samples seen on the internet are small variations of a few known malware families, where the difference between the "new" and "old" malware can be as low as 2%, and signature-based methods fail to detect those as well.
Recently, machine learning methods have been used to classify malware because of their accuracy in detecting and classifying similar patterns, as is the case with malware variants. One such approach converts binary files into grayscale images and uses Deep Convolutional Neural Networks (DCNNs), which are known to be very successful in image classification and object detection.
However, the training of a DCNN can take weeks. Therefore, it is important to find ways to speed up the training. One such solution is transfer learning: by starting from known pre-trained networks, one can cut the training time drastically. In this paper we investigate the efficacy of transfer learning for malware classification. We do this by performing experiments with four DCNN architectures widely used for image classification, ResNet50, ResNet152, MobileNet, and VGG16, on the same dataset.
2 Related Work
Malware classification methods can be divided into dynamic and static. Dynamic methods execute the malware and observe its behavior, while static methods extract features from the malware without actually executing it. During the last decade many machine learning methods for malware classification were proposed. The work of Nataraj et al. was the first to propose image classification techniques based on a grayscale image representation of malware. A similar approach, but using transfer learning with ResNet50, was applied by Rezende et al. to the Microsoft Malware Classification Challenge. A variation of VGG16 has also been used for malware classification.
All the above-mentioned methods obtained a classification accuracy between 90% and 99%. Rather than optimize for very high accuracy, our interest is in the training time of the networks. In particular, we investigate which of the well-known network architectures can be trained for a few epochs, as opposed to hundreds, and yet obtain an accuracy above 95%.
3 Dataset and Methodology
In this paper we use the Microsoft Malware Classification Challenge dataset. The dataset contains 10868 labeled samples from 9 malware families, with two files associated with each sample: a binary file ('.byte') and an assembly file ('.asm'). The distribution of the samples over the 9 families is shown in Table 1.
We convert the binary files to grayscale images that serve as input to the neural network classifiers. Toward that end, each binary sample ('.byte') is converted into a grayscale image as follows. Each file contains 16 hex numbers per line, and each byte is treated as a grayscale pixel value in the range 0-255. The resulting images are then resized to dimension (256, 256). In Fig. 1 we show three such samples from the Gatak family.
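The byte-to-pixel conversion can be sketched as follows. This is a minimal sketch, not our exact preprocessing code: it assumes each line of the '.byte' file begins with an address column that should be dropped, and that unreadable bytes (marked '??' in the dataset) map to pixel value 0.

```python
import numpy as np

def bytes_to_image(byte_file_text, width=256):
    """Parse a '.byte' dump (16 hex values per line after an address
    column) into a 2-D grayscale array of uint8 pixel values 0-255."""
    pixels = []
    for line in byte_file_text.splitlines():
        tokens = line.split()[1:]  # drop the leading address column
        for tok in tokens:
            # '??' marks an unreadable byte; mapping it to 0 is an assumption
            pixels.append(0 if tok == "??" else int(tok, 16))
    # pad the tail so the pixel stream fills complete rows
    pixels.extend([0] * ((-len(pixels)) % width))
    return np.array(pixels, dtype=np.uint8).reshape(-1, width)
```

The resulting array can then be resized to (256, 256), e.g. with `tf.image.resize`, before being fed to the classifier.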
The preprocessed images are used as input to a neural network composed of two parts connected in sequence. The first part is one of the well-known architectures pre-trained on the ImageNet dataset, with the last (classification) layer removed. The second part consists of a fully connected layer with 1024 neurons followed by a 9-node softmax classification layer.
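In Keras this two-part network can be assembled as below. This is a sketch under stated assumptions: the global average pooling on the backbone output and the ReLU activation on the 1024-unit layer are our choices, not details given above.

```python
import tensorflow as tf

def build_model(base_name="ResNet50", input_shape=(256, 256, 3),
                n_classes=9, weights="imagenet"):
    """Pre-trained backbone (classification head removed) followed by
    a 1024-unit dense layer and a 9-way softmax classifier."""
    base_cls = getattr(tf.keras.applications, base_name)
    base = base_cls(include_top=False, weights=weights,
                    input_shape=input_shape, pooling="avg")
    base.trainable = False  # frozen during the first training phase
    return tf.keras.Sequential([
        base,
        tf.keras.layers.Dense(1024, activation="relu"),
        tf.keras.layers.Dense(n_classes, activation="softmax"),
    ])
```

The same function covers all four backbones, since `ResNet50`, `ResNet152`, `MobileNet`, and `VGG16` are all exposed under `tf.keras.applications`.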
Training is also done in two phases. In the first phase, which lasts 15 epochs, the pre-trained network is frozen and only the two added layers are optimized. The second, fine-tuning phase of 10 epochs continues where the first left off: the pre-trained network is unfrozen and the complete network is trained with a very small learning rate (10^-5).
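The two-phase schedule can be expressed as a small helper. This is a sketch, assuming the learning rates stated in this paper (0.01 for the first phase, 10^-5 for fine-tuning) and integer class labels; note that in Keras the model must be recompiled after changing the `trainable` flag for the change to take effect.

```python
import tensorflow as tf

def two_phase_train(model, x, y, head_epochs=15, tune_epochs=10):
    """Phase 1: backbone frozen, train only the new head.
    Phase 2: unfreeze everything and fine-tune at a tiny learning rate."""
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(x, y, batch_size=32, epochs=head_epochs, verbose=0)

    model.layers[0].trainable = True  # unfreeze the pre-trained backbone
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model.fit(x, y, batch_size=32, epochs=tune_epochs, verbose=0)
```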
Since our goal is to assess the viability of transfer learning rather than any single architecture, we have performed the above training with the four pre-trained networks listed earlier: ResNet50, ResNet152, MobileNet, and VGG16.
All the above architectures were pre-trained on the ImageNet dataset. Furthermore, to gauge the time-saving component of transfer learning, we ran the same experiments using the deep convolutional neural network (CNN), trained from scratch, shown in Fig. 2.
The CNN contains 13 convolution layers and 5 max-pooling layers, and the last two layers are fully connected. The activation function for all layers is ReLU, except for the last, which uses a softmax activation. In Fig. 2 the yellow boxes are convolution layers and the orange ones are max-pooling layers.
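A layer-count-faithful sketch of such a network is given below. The 13-conv/5-pool structure matches VGG16's convolutional stack, so the filter widths here follow VGG16; the paper's exact widths are not stated above and should be treated as assumptions.

```python
import tensorflow as tf

def build_scratch_cnn(input_shape=(256, 256, 3), n_classes=9):
    """13 conv + 5 max-pool layers in five blocks, then two fully
    connected layers (the last with a softmax activation)."""
    model = tf.keras.Sequential()
    model.add(tf.keras.Input(shape=input_shape))
    for n_convs, filters in [(2, 64), (2, 128), (3, 256),
                             (3, 512), (3, 512)]:
        for _ in range(n_convs):
            model.add(tf.keras.layers.Conv2D(filters, 3, padding="same",
                                             activation="relu"))
        model.add(tf.keras.layers.MaxPooling2D(2))
    model.add(tf.keras.layers.Flatten())
    model.add(tf.keras.layers.Dense(1024, activation="relu"))
    model.add(tf.keras.layers.Dense(n_classes, activation="softmax"))
    return model
```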
The experiments were performed on Kaggle, which provides an NVIDIA Tesla P100 GPU with 16GB of RAM. The code was written in Python using TensorFlow/Keras. The 10868 samples were randomly divided into 9000 for training and 1868 for testing. The training was performed using the Adam optimizer, with a learning rate of 0.01 and a batch size of 32 samples. The results for the two-phase transfer learning described above are shown in Fig. 3.
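The 9000/1868 partition can be reproduced with a simple random index split; the fixed seed below is our illustrative choice, not one stated in the paper.

```python
import numpy as np

def split_indices(n_samples=10868, n_train=9000, seed=0):
    """Shuffle sample indices and split them into train/test sets,
    matching the paper's 9000/1868 partition of the 10868 samples."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    return idx[:n_train], idx[n_train:]
```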
The accuracy on the test data set for the different networks is shown in Fig. 3. As can be seen from the figure, all the architectures gave similar results. Furthermore, the results are excellent considering the short training duration and the fact that no hyper-parameter optimization was performed. Also, after just a single epoch all of them give an accuracy above 94%.
To illustrate the time-saving feature of transfer learning, we conducted the same experiment using the convolutional network shown in Fig. 2. To compare with the pre-trained networks, the training was done for 25 epochs using the same hyper-parameters. A comparison of the test accuracy with VGG16 is shown in Fig. 4. Note that this network is very similar to VGG16, yet it needs to be run for more than 75 epochs to reach an accuracy comparable to what the pre-trained model obtains in 25 epochs.
The overall accuracy does not reflect the prediction accuracy for individual malware families. Toward that end we have computed the confusion matrix, shown in Fig. 5 for ResNet152; all the others are similar. In particular, the low prediction accuracy for Simda is mostly due to the very small number of samples.
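The confusion matrix and the resulting per-family accuracy can be computed as follows; this is a generic sketch rather than our exact evaluation code.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes=9):
    """Row i, column j counts samples of true family i predicted as j."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

def per_family_accuracy(cm):
    # Diagonal over row sums gives each family's recall; families with
    # few samples (such as Simda) get noisy estimates.
    return np.diag(cm) / cm.sum(axis=1)
```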
In this work we investigated the efficacy of transfer learning for malware classification. Toward that end, we performed experiments with four pre-trained networks for the purpose of classifying malware. In particular, the classification was performed on the Microsoft Malware Classification Challenge dataset, whose samples were converted to grayscale images. All network architectures gave more than 95% accuracy using very few training epochs. This is very promising since they were pre-trained on ImageNet. It also shows that transfer learning is reliable, since all the different networks exhibited more or less the same behavior.
The question of loss of information due to the resizing of the images is worth investigating. Another aspect worth investigating, which we leave to future work, is the inference time of the studied network architectures on small computer-on-chip devices such as the NVIDIA Jetson Nano.
-  M. Antonakakis et al. Understanding the Mirai botnet. In 26th USENIX Security Symposium, pages 1093–1110, 2017.
-  M. Bat-Erdene, H. Park, H. Li, H. Lee, and M.-S. Choi. Entropy analysis to classify unknown packing algorithms for malware detection. International Journal of Information Security, 16(3):227–248, 2017.
-  S. Greengard. Cybersecurity gets smart. Communications of the ACM, 59(5):29–31, Apr. 2016.
-  K. S. Han, J. H. Lim, B. Kang, and E. G. Im. Malware analysis using visualized images and entropy graphs. International Journal of Information Security, 14(1):1–14, 2015.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
-  A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv:1704.04861 [cs], Apr. 2017. arXiv: 1704.04861.
-  M. Kalash, M. Rochan, N. Mohammed, N. D. B. Bruce, Y. Wang, and F. Iqbal. Malware Classification with Deep Convolutional Neural Networks. In 2018 9th IFIP International Conference on New Technologies, Mobility and Security (NTMS), pages 1–5, Feb. 2018. ISSN: 2157-4960.
-  D. P. Kingma and J. Ba. Adam: A Method for Stochastic Optimization. arXiv:1412.6980 [cs], Jan. 2017. arXiv: 1412.6980.
-  R. Lyda and J. Hamrock. Using entropy analysis to find encrypted and packed malware. IEEE Security & Privacy, 5(2):40–45, 2007.
-  B. N. Narayanan, O. Djaneye-Boundjou, and T. M. Kebede. Performance analysis of machine learning and pattern recognition algorithms for malware classification. In 2016 IEEE National Aerospace and Electronics Conference (NAECON) and Ohio Innovation Summit (OIS), pages 338–342. IEEE, 2016.
-  L. Nataraj, S. Karthikeyan, G. Jacob, and B. S. Manjunath. Malware images: visualization and automatic classification. In Proceedings of the 8th International Symposium on Visualization for Cyber Security, VizSec ’11, pages 1–7, New York, NY, USA, July 2011. Association for Computing Machinery.
-  E. Rezende, G. Ruppert, T. Carvalho, F. Ramos, and P. de Geus. Malicious Software Classification Using Transfer Learning of ResNet-50 Deep Neural Network. In 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA), pages 1011–1014, Dec. 2017.
-  R. Ronen, M. Radu, C. Feuerstein, E. Yom-Tov, and M. Ahmadi. Microsoft Malware Classification Challenge. arXiv:1802.10135 [cs], Feb. 2018. arXiv: 1802.10135.
-  O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
-  K. Simonyan and A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv:1409.1556 [cs], Apr. 2015. arXiv: 1409.1556 version: 4.
-  B. Yuan, J. Wang, D. Liu, W. Guo, P. Wu, and X. Bao. Byte-level malware classification based on markov images and deep learning. Computers and Security, 92:101740, May 2020. Publisher: Elsevier Ltd.
-  D. Yuxin and Z. Siyi. Malware detection based on deep learning algorithm. Neural Computing and Applications, 31(2):461–472, Feb. 2019.