Imbalanced Malware Images Classification: a CNN based Approach

08/27/2017 ∙ by Songqing Yue, et al. ∙ University of Wisconsin-Platteville

Deep convolutional neural networks (CNNs) can be applied to malware binary detection through images classification. The performance, however, is degraded due to the imbalance of malware families (classes). To mitigate this issue, we propose a simple yet effective weighted softmax loss which can be employed as the final layer of deep CNNs. The original softmax loss is weighted, and the weight value can be determined according to class size. A scaling parameter is also included in computing the weight. Proper selection of this parameter has been studied and an empirical option is given. The weighted loss aims at alleviating the impact of data imbalance in an end-to-end learning fashion. To validate the efficacy, we deploy the proposed weighted loss in a pre-trained deep CNN model and fine-tune it to achieve promising results on malware images classification. Extensive experiments also indicate that the new loss function can fit other typical CNNs with an improved classification performance.




I Introduction

A malware binary, usually with a file name extension of “.exe” or “.bin”, is a malicious program that can harm computer operating systems. A given piece of malware often has many variations that heavily reuse a basic pattern. This implies that malware binaries can be categorized into multiple families (classes), with each variation inheriting the characteristics of its family. It is therefore important to detect malware binaries effectively and to recognize possible variations [1, 2].

However, this task is challenging. A malware binary file can be visualized as a digital gray image [3]. After visualization, malware binary detection turns into a multi-class image classification problem, which has been well studied in deep learning. One can manually extract features from malware images and feed them into classifiers such as SVM (support vector machine) or KNN (k-nearest neighbors) to detect malware binaries through classification. To obtain more discriminative features, one can use a CNN to extract features automatically, as Razavian et al. did in [4], and perform classification in an end-to-end fashion. However, most deep CNNs are trained on carefully designed balanced data [5, 6], while malware image datasets may be highly imbalanced: some malware has many variations while other malware has only a few. For instance, the dataset [3] used in our paper contains 25 classes; some classes contain more than 2000 images while others have only 80 images or so. As a result, even reputable pre-trained CNN models [7, 8, 9, 10] may perform poorly in our scenario. Furthermore, pre-trained CNN models are originally designed for specific vision tasks, and they cannot be applied to malware binary detection directly.
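As a rough illustration of the visualization step, the sketch below arranges the raw bytes of a file row by row into a grid of 8-bit gray levels. The function name `bytes_to_gray_image`, the fixed default width of 256, and the zero-padding of the last row are assumptions made for this example; the procedure in [3] may choose the image width differently.

```python
import math


def bytes_to_gray_image(data: bytes, width: int = 256) -> list[list[int]]:
    """Arrange raw bytes row by row into a 2-D grid of gray levels (0-255)."""
    if width <= 0:
        raise ValueError("width must be positive")
    height = math.ceil(len(data) / width)
    # Assumption of this sketch: zero-pad the tail so the last row is complete.
    padded = data + bytes(height * width - len(data))
    return [list(padded[r * width:(r + 1) * width]) for r in range(height)]


# A tiny synthetic "binary" of 10 bytes, laid out 4 pixels per row.
image = bytes_to_gray_image(bytes(range(10)), width=4)
```

Each byte maps to exactly one pixel, so the resulting image preserves the byte order of the file and structural patterns shared by variations of one family remain visible.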

One may argue that data augmentation could be used to balance the data, for example by oversampling the minority classes and/or down-sampling the majority classes. It is, however, not suitable for our problem for two reasons. First, down-sampling may miss many representative malware variations. Second, simply jittering the data cannot generate images corresponding to real malware binaries. Therefore, we aim to investigate how to train a CNN model with the imbalanced data in hand.

To address these challenges, and inspired by the work in [11, 12], which designed new loss functions for CNNs to improve training performance, we propose a weighted softmax loss for deep CNNs on malware image classification. Based on the error given by the softmax loss, we weight misclassifications by different values corresponding to class size. Intuitively, the misclassification of a minority class should be amplified, and that of a majority class should be suppressed. Our weighted loss achieves this goal and guides the CNN to update its filters in a proper direction. We adopt a pre-trained verydeep-19 model [8] from the VGG family (Visual Geometry Group at Oxford) and retrain it to achieve a promising result on malware image classification. Once the proposed loss has been shown to be feasible with VGG models, it can be extended to other models, such as GoogLeNet [9] and ResNet [10]. In short, the contributions of this work are two-fold. First, we propose a weighted softmax loss to address the data imbalance issue in CNN training. Second, we apply the proposed loss to a pre-trained CNN model and fine-tune it to solve the malware image classification problem.

The rest of the paper is organized as follows. Related work is discussed in Section II. Section III introduces the proposed weighted softmax loss. In Section IV, we describe how to deploy the proposed loss and fine-tune a deep CNN for malware image classification. We evaluate our method in Section V and conclude the paper in Section VI.

II Related Work

Malware Images: Nataraj et al. [3] proposed a way of converting malware binaries into digital gray images. They created 25 classes of malware images that are highly imbalanced. Two sample malware images, from the ‘Adialer.C’ class and the ‘Skintrim.N’ class, are shown in Fig. 1. The Microsoft Malware Classification Challenge also provides imbalanced training and testing data, but it includes only 9 classes. That task is therefore less challenging than ours in terms of malware diversity (the dataset used in our work contains 25 classes).

(a) Malware image from family: Adialer.C
(b) Malware image from family: Skintrim.N
Fig. 1: Two sample malware images.

Deep Learning: Deep learning has recently achieved remarkable success in computer vision tasks. The work of Krizhevsky et al. [7] opened a new era for deep CNNs and their application to image classification, and has inspired many later works. Simonyan and Zisserman proposed a very deep CNN in [8], and they significantly improved the performance by increasing the depth and using small convolutional filters of size $3 \times 3$. Szegedy et al. [9] designed the 22-layer GoogLeNet, which increased the width of the network and achieved a new state-of-the-art performance for image classification. He et al. [10] eased the training of a 152-layer deep CNN by presenting a residual learning framework. Besides image classification, deep learning has also been applied to fundamental image processing tasks, such as denoising [13] and contour detection [14]. Saxe and Berlin [15] discussed the feasibility of using deep neural networks for malware detection. In their approach, hand-crafted features such as ‘PE import’ are extracted prior to model training, which is not the typical manner of an end-to-end learning process. Huang et al. [12] proposed a quintuplet-based triple-header hinge loss for extracting discriminative features from imbalanced data. Nonetheless, their work requires features to be extracted in advance, either in a hand-crafted manner or by another pre-trained CNN model. In addition, a clustering algorithm such as k-means is needed prior to training. These steps are not integrated in an end-to-end fashion as in typical CNNs. Recently, Xu et al. [16] applied deep learning to medical CT physics and addressed scatter correction, also by designing a new loss function.

III Weighted Softmax Loss

In this section, we introduce how to weight softmax loss according to class size.

III-A Softmax Loss

Softmax loss is a combination of softmax regression and entropy loss, used in multi-class classification problems. Given a $K$-class training set containing $m$ images $\{x^{(i)}, y^{(i)}\}_{i=1}^{m}$, where $x^{(i)}$ is an image and $y^{(i)} \in \{1, \dots, K\}$ is its ground truth label, let $a_j^{(i)}$ ($j = 1, \dots, K$) be the output units of the last fully connected layer of the CNN. The probability that the label of $x^{(i)}$ is $j$ is then given by

$$p_j^{(i)} = \frac{\exp\left(a_j^{(i)}\right)}{\sum_{l=1}^{K} \exp\left(a_l^{(i)}\right)}. \tag{1}$$

Typical deep CNNs aim to minimize the entropy loss function

$$L = -\frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{K} 1\left(y^{(i)} = j\right) \log p_j^{(i)}, \tag{2}$$

where $m$ is the batch size and $K$ is the total number of classes in the dataset. $1(\cdot)$ is an indicator function: $1(\text{true})$ gives 1, and $1(\text{false})$ gives 0. In typical CNNs, convolutional filters are updated by the stochastic gradient descent (SGD) algorithm. The traditional softmax loss treats the misclassification of every class equally, which is reasonable for balanced data but leads to poor performance when classifying imbalanced data such as malware images. The weighted softmax loss is proposed to address this issue.
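The softmax probability and the entropy loss described above can be sketched in a few lines of plain Python. The function names are hypothetical, labels are 0-based here (unlike the 1-based class labels in the text), and subtracting the maximum score is a standard numerical-stability step that does not change the result.

```python
import math


def softmax(scores):
    """Turn the last-layer outputs of one image into class probabilities."""
    shift = max(scores)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(a - shift) for a in scores]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_loss(batch_scores, labels):
    """Mean negative log-probability of the true class over a mini-batch."""
    losses = [-math.log(softmax(scores)[y])
              for scores, y in zip(batch_scores, labels)]
    return sum(losses) / len(losses)


# With identical scores for 4 classes, every class gets probability 1/4.
uniform_loss = softmax_loss([[0.0, 0.0, 0.0, 0.0]], [2])
```

Because the indicator function selects only the true class, the double sum over images and classes reduces to one log-probability per image, which is what the list comprehension computes.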

III-B Weighted Softmax Loss

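Our approach can be formulated as follows,

$$L = -\frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{K} \omega \cdot 1\left(y^{(i)} = j\right) \log p_j^{(i)}, \tag{3}$$

where $\omega$ is a weight value determined by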


where $k$ is the ground truth class label of the image, $S_{\max}$ is the size of the largest class in the data set, $S_k$ is the size of class $k$, and $\beta$ is a parameter that controls the scaling of the weighted loss. Our empirical preference for $\beta$ is 20. Extensive experiments on this parameter are shown in Section V. It is worth noting that the minority classes are assigned a larger weight value whereas the majority classes are lightly weighted. The weight value does not dramatically affect the loss, so it can be regarded as a subtle fine-tuning that boosts the classification performance while avoiding overfitting. In practice, the training procedure only needs to choose the weight $\omega$ according to the ground truth label of each sample in the mini-batch.
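A minimal sketch of a class-size-weighted softmax loss is given below. The weight expression used here, $1 + (S_{\max} - S_k) / (\beta \, S_{\max})$, is an assumption chosen only to match the description above (larger weights for smaller classes, a mild effect on the loss); it is not necessarily the paper's Eq. (4). The function names and 0-based labels are likewise hypothetical.

```python
import math


def class_weight(class_size, max_size, beta=20.0):
    """ASSUMED weight form: 1 for the largest class, slightly more for smaller ones."""
    return 1.0 + (max_size - class_size) / (beta * max_size)


def weighted_softmax_loss(batch_scores, labels, class_sizes, beta=20.0):
    """Softmax loss where each sample is scaled by the weight of its true class."""
    max_size = max(class_sizes)
    total = 0.0
    for scores, y in zip(batch_scores, labels):
        shift = max(scores)  # numerical stability
        log_norm = shift + math.log(sum(math.exp(a - shift) for a in scores))
        log_p = scores[y] - log_norm  # log-probability of the true class
        total += -class_weight(class_sizes[y], max_size, beta) * log_p
    return total / len(labels)


# Two classes with the largest and smallest sizes from the dataset (2949 and 80).
sizes = [2949, 80]
loss_majority = weighted_softmax_loss([[0.0, 0.0]], [0], sizes)
loss_minority = weighted_softmax_loss([[0.0, 0.0]], [1], sizes)
```

Under this assumed form, the same prediction error costs slightly more when the true class is small, and the weight is looked up once per sample from its ground truth label, as the text describes.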

IV Classification by CNN

In this section, we discuss how to fine-tune the vgg-verydeep-19 model for malware image classification.

IV-A Network Architecture

Simonyan and Zisserman [8] showed that classification accuracy can be improved by increasing the depth of the network. The vgg-f model [17] contains 21 layers, whereas the vgg-verydeep-19 model contains 43 layers. Convolutional filters of size $3 \times 3$ are used in the vgg-verydeep-19 model. They correspond to smaller receptive fields, but are still able to extract discriminative features. We fine-tune the vgg-verydeep-19 model since it outperforms the other models in the VGG family on the classification task. Other pre-trained models, such as ‘vgg-face’, ‘fcn’, and ‘fast-rcnn’, were designed for other tasks, including face recognition, semantic segmentation, and object detection; their applicability to our problem is therefore not explored in this work. In addition, Simonyan and Zisserman [8] argued that local response normalization (LRN) [18] can be omitted from a CNN. To boost the classification performance, we add a Batch Normalization (BN) layer between each convolutional layer and the following ReLU (rectified linear unit) layer. However, the very first convolutional layer is followed by a ReLU layer directly, and the fully connected layer is directly followed by a ReLU layer as well.

A major potential threat for deep learning is overfitting, especially when the training set is not large. Srivastava et al. [20] introduced ‘dropout’, a simple yet powerful approach to avoid overfitting in CNNs. Units in a layer are randomly dropped and their connections are temporarily removed, which amounts to training a different randomly thinned network in each round. We place two dropout layers (with probability ) between the three fully connected layers to prevent overfitting (by default, the downloaded vgg-verydeep-19 model does not contain dropout layers). In the testing phase, the dropout layers are removed.
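The behavior of a dropout layer can be sketched as follows. This example uses the common "inverted dropout" variant, which rescales the surviving units during training so that the layer can simply be removed at test time; whether the framework used in this work scales in exactly this way is an assumption, and the rate of 0.5 in the usage line is purely illustrative.

```python
import random


def dropout(activations, rate, training, rng=random):
    """Zero each unit with probability `rate` during training (inverted dropout)."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        # At test time the layer is an identity, i.e. it can be removed.
        return list(activations)
    keep = 1.0 - rate
    # Kept units are divided by `keep` so the expected activation is unchanged.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


# Illustrative rate only; a seeded generator makes the run repeatable.
dropped = dropout([1.0] * 1000, rate=0.5, training=True, rng=random.Random(0))
```

Each call draws a fresh random mask, so successive mini-batches effectively train different thinned versions of the same network.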

IV-B Weighted Softmax Loss Layer

We append the proposed loss as the last layer of our model. The final structure contains 60 layers, including the added dropout layers and BN layers. For simplicity, only the fully connected layers, the added dropout layers, and the weighted loss layer are shown in Fig. 2.

Fig. 2: Fully connected, dropout and the loss layers of our fine-tuned CNN.

V Experiments

This section describes experiments showing the effectiveness of our proposed loss on malware images classification. Vgg-f,m,s models [17] are also used to validate the general fit of the new loss function. We also analyze the value selection of the scaling parameter and recommend an empirical option. Top-1 validation error is utilized to evaluate the classification performance.
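For reference, the top-1 error used as the evaluation metric can be computed as in the sketch below: a sample counts as an error when the class with the highest score differs from the ground truth label. The function name and the 0-based labels are assumptions of this example.

```python
def top1_error(batch_scores, labels):
    """Fraction of samples whose highest-scoring class is not the true class."""
    wrong = 0
    for scores, y in zip(batch_scores, labels):
        predicted = max(range(len(scores)), key=scores.__getitem__)
        if predicted != y:
            wrong += 1
    return wrong / len(labels)


# Three samples with true class 1; only the second one is misclassified.
error = top1_error([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]], [1, 1, 1])
```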

Experiments are conducted on the MatConvNet framework [18], an open source library for deep learning in MATLAB. One Nvidia GeForce TITAN X GPU is used to accelerate the mini-batch processing.

V-A Dataset and General Settings

The dataset [3] used in our work contains 25 highly imbalanced classes. The name and size of each class are listed in Table I. We partition the data as follows: the first 60% of the images in each class are used for training, the following 20% for validation, and the last 20% for testing. No data augmentation other than mean value subtraction is applied to the images.

| No. | Type | Family Name | # of Img |
|---|---|---|---|
| 1 | Worm | Allaple.L | 1591 |
| 2 | Worm | Allaple.A | 2949 |
| 3 | Worm | Yuner.A | 800 |
| 4 | PWS | lolyda.AA 1 | 213 |
| 5 | PWS | lolyda.AA 2 | 184 |
| 6 | PWS | lolyda.AA 3 | 123 |
| 7 | Trojan | C2Lop.P | 146 |
| 8 | Trojan | C2Lop.gen!G | 200 |
| 9 | Dialer | Instantaccess | 431 |
| 10 | Trojan Downloader | Swizzor.gen!I | 132 |
| 11 | Trojan Downloader | Swizzor.gen!E | 128 |
| 12 | Worm | VB.AT | 408 |
| 13 | Rogue | Fakerean | 381 |
| 14 | Trojan | Alueron.gen!J | 198 |
| 15 | Trojan | Malex.gen!J | 136 |
| 16 | PWS | Lolyda.AT | 159 |
| 17 | Dialer | Adialer.C | 125 |
| 18 | Trojan Downloader | Wintrim.BX | 97 |
| 19 | Dialer | Dialplatform.B | 177 |
| 20 | Trojan Downloader | Dontovo.A | 162 |
| 21 | Trojan Downloader | Obfuscator.AD | 142 |
| 22 | Backdoor | Agent.FYI | 116 |
| 23 | Worm:AutoIT | Autorun.K | 106 |
| 24 | Backdoor | Rbot!gen | 158 |
| 25 | Trojan | Skintrim.N | 80 |

TABLE I: Names and sizes of the 25 imbalanced classes of malware images (Img stands for Images).
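The per-class 60%/20%/20% partition described in this subsection can be sketched as follows. The function name is hypothetical, and the rounding rule (integer floor for the training and validation parts, remainder to the test part) is an assumption, since the text does not say how fractional counts are handled.

```python
def split_class(items, train_pct=60, val_pct=20):
    """Split one class, in its stored order, into train / validation / test parts."""
    n = len(items)
    n_train = n * train_pct // 100  # integer arithmetic avoids float rounding
    n_val = n * val_pct // 100
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # the remainder goes to the test part
    return train, val, test


# The smallest class (80 images) and the largest class (2949 images).
small_parts = split_class(list(range(80)))
large_parts = split_class(list(range(2949)))
```

Splitting each class separately keeps the class proportions roughly the same in all three subsets, so the validation and test sets are as imbalanced as the training set.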

V-B Effects of the Weighted Loss

To validate the general fit, we deploy the proposed loss in 3 pre-trained CNN models: vgg-f,m,s. The default structures are entirely preserved and the number of filters 1000 in the last fully connected layer is changed to 25, which corresponds to the 25 classes in the malware dataset. We retrain the three networks, and the top-1 validation errors with and without the weighted loss are shown in Fig. 3. It can be seen that our method effectively decreases the top-1 validation error, and keeps the curves stable. The test performances are illustrated in Table II.

(a) Vgg-f model
(b) Vgg-m model
(c) Vgg-s model
Fig. 3: Top-1 validation error of vgg-f,m,s models with and without weighted softmax loss.

| Model | Original loss | Weighted loss |
|---|---|---|
| Vgg-verydeep-19 | 97.32% | 98.63% |
| Vgg-f | 94.48% | 95.36% |
| Vgg-m | 95.90% | 96.39% |
| Vgg-s | 96.23% | 96.89% |

TABLE II: Classification accuracy on the test images.

V-C Fine-tune Vgg-verydeep-19 Model with the Weighted Loss

In order to achieve a better result, we fine-tune the vgg-verydeep-19 model as discussed in Section IV. We initialize the filter weights using MSRA [10]. The momentum is set to 0.9, and the training is regularized by weight decay with the L2 penalty multiplier set to 0.0005, as suggested by Simonyan and Zisserman [8]. The learning rate is set to 0.0001, which is our empirical best option. Dynamic adjustment of the learning rate is not considered since our data size does not require many epochs to converge. The batch size is set to 80. We compare the top-1 validation error of the models with and without the weighted loss in Fig. 4. As can be seen, the weighted loss effectively decreases the top-1 validation error compared to the original loss. In testing, we remove the two added dropout layers, and the results are listed in Table II. In the dataset, the ‘Autorun.K’, ‘Malex.gen!J’, ‘Rbot!gen’, ‘VB.AT’, and ‘Yuner.A’ classes are all packed with the same packer (UPX, the Ultimate Packer for eXecutables). We nevertheless treat them as individual families; treating them as a single family would further decrease the test error.
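To make the role of the hyperparameters above concrete, the sketch below shows one common formulation of an SGD step with momentum and L2 weight decay, using the stated values (learning rate 0.0001, momentum 0.9, weight decay 0.0005) as defaults. The exact update rule applied by the training framework is not stated in the text, so this formulation and the function name are assumptions.

```python
def sgd_momentum_step(weights, grads, velocity,
                      lr=1e-4, momentum=0.9, weight_decay=5e-4):
    """One SGD update with momentum; weight decay adds an L2 term to the gradient."""
    new_velocity = [momentum * v - lr * (g + weight_decay * w)
                    for w, g, v in zip(weights, grads, velocity)]
    new_weights = [w + v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity


# One step on a single parameter with the stated hyperparameters.
w, v = sgd_momentum_step([1.0], [2.0], [0.0])
```

The velocity accumulates past gradients, so with a momentum of 0.9 each step continues to move in the previous direction even when the current gradient is small.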

Fig. 4: Top-1 validation error of our fine-tuned vgg-verydeep-19 model with and without weighted softmax loss.

Moreover, a CNN can extract image features automatically. In our fine-tuned model, a 4096-dimensional feature vector for an arbitrary image can be captured at layer 41 (the second fully connected layer). We generate a feature map for each malware class by extracting features for all images in that class, combining the feature vectors into a matrix, and visualizing it. The dimension of a feature map is therefore $n \times 4096$, where $n$ is the class size. Such a feature map reflects the characteristics of the corresponding malware class. We give the feature maps of 5 malware classes in Fig. 5. Since the classes are highly imbalanced, $n$ is different for each class. To attain the best display effect, we use the ‘imagesc’ command in MATLAB to visualize a feature map, which automatically scales the map.
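The construction of a feature map can be sketched as below: the feature vectors of one class are stacked as rows of a matrix, and the values are rescaled linearly to the 0-255 gray range for display. The linear min-max scaling is meant to approximate the automatic scaling that ‘imagesc’ performs; the function name and the toy 2-dimensional vectors (in place of 4096-dimensional ones) are assumptions of this example.

```python
def scale_feature_map(feature_vectors):
    """Stack per-image feature vectors as rows and rescale values to 0-255."""
    lo = min(min(row) for row in feature_vectors)
    hi = max(max(row) for row in feature_vectors)
    span = hi - lo
    if span == 0:
        # A constant map carries no contrast; show it as all zeros.
        return [[0 for _ in row] for row in feature_vectors]
    return [[round(255 * (v - lo) / span) for v in row]
            for row in feature_vectors]


# Two images with toy 2-dimensional features give a 2 x 2 map.
feature_map = scale_feature_map([[0.0, 1.0], [3.0, 5.0]])
```

Scaling by the minimum and maximum of the whole matrix keeps the relative magnitudes within one class comparable across images, while maps of different classes are scaled independently.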

(a) Allaple.L
(b) Adialer.C
(c) Yuner.A
(d) lolyda.AA 3
(e) Autorun.K
Fig. 5: Feature maps of 5 classes in the malware dataset.
Fig. 6: Impact of the scaling parameter.

V-D Scaling Parameter

In Eq. (4), the scaling parameter $\beta$ is needed to compute the weight. Empirically, 20 is the best option for $\beta$. To show the influence of this parameter, we further investigate three values on the fine-tuned vgg-verydeep-19 model and report the top-1 validation error in Fig. 6. As can be seen, the curve with $\beta = 20$ is smoother than the others and also converges to a lower error rate. This indicates that an appropriate selection of $\beta$ is also critical to the network training.

VI Conclusion

We proposed a weighted softmax loss for convolutional neural networks on imbalanced malware image classification. By imposing a weight, the classification errors of different classes can be treated unequally. The principle of weighting the loss has a very clear intuition, and our experiments have shown that it works with existing CNN models. We also fine-tuned the vgg-verydeep-19 model with the proposed loss to achieve a satisfactory classification result. In addition, we have experimentally determined a suitable value of the scaling parameter used in computing the weighted loss. The results indicate that an appropriate selection of this parameter is indispensable for successful training.