1 Introduction

Machine learning is becoming one of the most prestigious fields in computer science due to its rapid growth and high visibility. In the past ten years alone it has been integrated into several areas of study, such as physics [11, 4, 12], chemistry [9, 7, 19], cloud computing [2, 3], network congestion control [13, 14, 18, 6], and ocean engineering [23, 22], as a central piece, or one of many, in the process of decision-making. The field of cybersecurity is no different. Machine learning has produced a major paradigm shift in the field, being at the center of some innovative techniques for detecting and preventing infections by malicious software (malware). Most malware intends to cause damage by executing malicious code on an infected machine, and it often does so by disguising itself as a benign file. One of the most common such files is the Portable Executable (PE) file. The PE format is a standard file format for Windows executables, DLLs, object code, and more.
PE files contain several sections with information on how to map the file into memory. The files usually also come with one or more associated icons embedded within them, and while many approaches ignore icons when performing malware analysis, we believe that this information can be used to aid in the detection of malware. There are many ways this task can be achieved. One could, for example, simply train a model on pixel values. Experience and practical results show that this approach does not yield the best results: most of the icons used in malware have slight blurriness or color shifting, introduced with the exact purpose of defeating such a naive approach. So, we need an approach that is robust against those subtle changes and can still provide useful information to the classifiers.
The objective of this work is to ultimately use that information to classify a PE file as good or bad. Using solely the icon to perform this task is also a bad idea. It is not uncommon for malware to use the same icon as known good files, such as a Microsoft Word document icon or an Adobe Reader PDF file icon. If we used the icon information only, we would have a dataset with conflicting labels, which would directly interfere with the performance of our classifiers. On the other hand, when combined with other features, such as size, source, content, among others, the icon information increases the accuracy of classifiers built to separate good and bad files. In the following sections, we describe in detail how we performed feature extraction and how we used that information on the classifiers.
2 Feature Extraction
The simplest way to use features from icons, or images in general, is to use raw pixel values from the three color channels: red, green, and blue (RGB). The main problem with this approach is that it is very susceptible to noise in the images and does not provide sufficient information about the file. This is one of the reasons why so many icons used by malware have some perturbation in the color channels, ranging from occluded parts to blurriness, and sometimes even slight increases or decreases in the RGB channel values throughout the image, which would be imperceptible to the naked eye but can cause a classifier to incorrectly classify malware as benign. These perturbations are introduced mainly to defeat the direct matching used by some systems. In order to harvest the knowledge in these icons while circumventing these issues, we decided to build three sets of features that carry a great deal of information about the icon file while staying resilient against these kinds of perturbations.
2.1 Manually Created (MC) Features
Despite the issues that including RGB values may cause, there is definitely valuable information present in these values, and we must harvest their potential somehow. To do so, we used a different approach: rather than keeping only pixel values, we use means and standard deviations over different sections of an image to preserve that information without being too affected by slight color variations.
To achieve this, we use the mean and standard deviation (std) of the pixel values for the whole image across all the channels (2 features), then we take the mean and std of the different RGB channels (6 features). Lastly, we split the original image into nine different sections, as in Figure 1, and compute the mean and std of the pixel values in each of the different sections across all the channels (18 features). This method produces a total of 26 MC features.
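As a concrete illustration, the 26 MC features can be computed with a few lines of NumPy. This is a sketch of the description above, not the authors' code; the function name and the use of `np.array_split` for the 3x3 grid are our choices.

```python
import numpy as np

def mc_features(img):
    """Sketch of the 26 MC features described above. img is an H x W x 3
    RGB array; function name and layout are our assumptions."""
    feats = [img.mean(), img.std()]              # whole image (2 features)
    for c in range(3):                           # per RGB channel (6 features)
        feats += [img[:, :, c].mean(), img[:, :, c].std()]
    # 3x3 grid of regions, mean/std per region across channels (18 features)
    for rows in np.array_split(img, 3, axis=0):
        for region in np.array_split(rows, 3, axis=1):
            feats += [region.mean(), region.std()]
    return np.array(feats)                       # 2 + 6 + 18 = 26
```

`np.array_split` tolerates image sides that are not divisible by three, which matters for odd icon sizes.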
The particular choice of splitting the image into nine regions is an empirical one for the problem we are trying to solve. These images are icons of files, so most of them are very small pixel images. Increasing the grid resolution reintroduces the problem these features are meant to avoid, namely sensitivity to small variations in the image, since the regions would become too small. Decreasing the grid, on the other hand, makes little sense, since we would lose too much information about the image and would, in fact, create only eight features, which would also potentially be very similar to our first two MC features.
2.2 Histogram of Oriented Gradients (HOG) Features
HOG features were originally created to detect object shapes, such as a hand or a person, regardless of color. The idea of using these features is to keep information despite the color fuzziness that can be seen in malware icons. HOG features capture contour, silhouette, and some texture information while providing further resistance to illumination and color variations. A small window slides over the image and computes the gradient of the image within the window.
One further step was to ensure that the number of features returned by the HOG was always the same; otherwise, this would affect the models during the next phase of the process, since we could not guarantee that all icons would produce the same number of features. After experimentation, we decided on a fixed size for the HOG-parsed image (icons come in a handful of standard sizes; defaulting to the smallest would waste too much information from the larger icons, while defaulting to the largest would double the size of the small ones, leading to the opposite problem of fabricating too much information during rescaling), which means we had a total of 576 features at the end. To get this, we used cell sizes that are a fixed fraction of the image in both axes and resized the image whenever necessary.
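For illustration, a minimal HOG can be sketched in plain NumPy. This is not the implementation used in the paper (a library such as scikit-image provides a production-quality `hog`); the 8x8 cell grid and 9 orientation bins below are our assumptions, chosen only because they reproduce the 576-feature count mentioned above.

```python
import numpy as np

def simple_hog(img, n_cells=8, n_orient=9):
    """Minimal HOG sketch: grayscale image -> n_cells x n_cells grid of
    gradient-orientation histograms, flattened (8 * 8 * 9 = 576 features).
    Because the cell grid is a fixed fraction of the image, the feature
    count is identical for every (resized) icon."""
    img = img.astype(float)
    gy, gx = np.gradient(img)                     # per-pixel gradients
    mag = np.hypot(gx, gy)                        # gradient magnitude
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0  # unsigned orientation
    h, w = img.shape
    feats = np.zeros((n_cells, n_cells, n_orient))
    cell_h, cell_w, bin_w = h / n_cells, w / n_cells, 180.0 / n_orient
    for i in range(h):
        for j in range(w):
            ci = min(int(i / cell_h), n_cells - 1)
            cj = min(int(j / cell_w), n_cells - 1)
            b = min(int(ang[i, j] / bin_w), n_orient - 1)
            feats[ci, cj, b] += mag[i, j]         # magnitude-weighted vote
    return feats.ravel()                          # 576 features by default
```

A production HOG adds block normalization for illumination invariance; the sketch keeps only the cell histograms for clarity.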
2.3 Autoencoder (AE) Features
Up until now, both feature sets we are using involve feature engineering: thinking of ways we can analyze an image and extract meaningful information from it. Another approach to this task is, rather than creating the features yourself, letting a neural network create them. We incorporated this behavior into our model by using features generated by a convolutional autoencoder, a neural network that models its input to itself, compressing and then decompressing it in the process. In order to decompress the information, the network has to learn which features are the most important to keep for each image so that it can recreate the image with a certain accuracy. We trained the AE with a large set of icons to make it robust and a good generalizer.
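The compress/decompress objective can be illustrated with a toy autoencoder. The sketch below is a dense, single-hidden-layer stand-in for the convolutional architecture described above; the 512-dimensional code matches the paper, but the input size, learning rate, and random data are illustrative assumptions.

```python
import numpy as np

# Toy illustration of the autoencoder objective: a dense autoencoder that
# compresses flattened inputs into a 512-dimensional code (the paper uses a
# convolutional architecture; only the training objective is the same).
rng = np.random.default_rng(0)
n_in, n_code = 3072, 512                      # e.g. a flattened 32x32x3 icon
W1 = rng.normal(0.0, 0.01, (n_in, n_code))    # encoder weights
W2 = rng.normal(0.0, 0.01, (n_code, n_in))    # decoder weights

X = rng.random((64, n_in))                    # a small batch of fake icons
lr, losses = 0.005, []
for _ in range(200):
    code = np.tanh(X @ W1)                    # encode (compress)
    recon = code @ W2                         # decode (decompress)
    err = recon - X
    losses.append(float((err ** 2).mean()))   # reconstruction error
    # plain gradient descent on the reconstruction error
    gW2 = code.T @ err / len(X)
    gcode = (err @ W2.T) * (1.0 - code ** 2)  # back through tanh
    gW1 = X.T @ gcode / len(X)
    W2 -= lr * gW2
    W1 -= lr * gW1
# after training, each row of `code` is the 512-feature AE representation
```

The falling reconstruction loss is what forces the code to keep the most important information about each input.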
At the end of our compression step, we have a total of 512 AE features for each image. That brings us to a total of 26 (MC features) + 576 (HOG features) + 512 (AE features) = 1114 features. We could stop right here, use those features in the classifier, and let it figure out what is most useful, but we can still gather more information and reduce the number of features we introduce to the model. One could, for example, use dimensionality reduction methods, such as random projections, principal component analysis, or t-SNE, to reduce the number of features from 1114 to 100 while still retaining valuable information. We wanted to take this to an extreme, so we decided to approach the problem from a different angle.
We clustered the icons and used the cluster ids as the variable the classifier would consume. Similar icons tend to be close to each other even in a high-dimensional space, so a cluster id is informative of the type of icons that fall within the same region, and thus within the same cluster. One could also perform a dimensionality reduction before the clustering, but we have not done that in this work. So, in the end, rather than using 1114 features, we can use the cluster id in the classifier, as explained in detail in sections 3 and 4.
3 Icon Clustering

We experimented with several clustering algorithms for this task, namely k-means, mean shift, affinity propagation, density-based spatial clustering of applications with noise (DBSCAN), hierarchical DBSCAN (HDBSCAN), and different hierarchical clustering techniques, such as average, complete, and single linkage. The ones with the most promising results were the two density-based methods: DBSCAN and HDBSCAN.
They also have an attractive property for our purposes: they can detect outliers in the dataset. HDBSCAN outperformed DBSCAN in the quality of the clusters it produced, which were tight and well separated; we used the silhouette score to measure their quality. A further advantage of HDBSCAN is that we do not have to provide an expected number of clusters or a search radius, as we do for k-means or DBSCAN.
While HDBSCAN did an excellent job of finding the densest clusters, it still labeled a significant portion of our dataset as outliers. So, to keep HDBSCAN's good properties, we decided to use it first and then run another clustering algorithm on the outlier set. In essence, we first carve out the super dense areas where extremely similar icons fall together, and then cluster the remaining icons with another algorithm; we used k-means to cluster the outliers.
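A sketch of this two-stage scheme might look as follows. We use scikit-learn's DBSCAN as a stand-in for HDBSCAN (which lives in the separate `hdbscan` package); the function name and all parameter values are our illustrative choices.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

def two_stage_cluster(X, eps=0.5, min_samples=5, n_outlier_clusters=8, seed=0):
    """Two-stage clustering sketch: a density-based pass first (DBSCAN here,
    standing in for HDBSCAN), then k-means on the points the first pass
    marks as outliers (label -1). Returns (labels, outlier_flags), with
    outlier-cluster ids offset past the density-based ids."""
    dense = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    labels = dense.copy()
    outliers = dense == -1
    if outliers.any():
        k = min(n_outlier_clusters, int(outliers.sum()))
        km = KMeans(n_clusters=k, n_init=10, random_state=seed)
        sub = km.fit_predict(X[outliers])
        labels[outliers] = dense.max() + 1 + sub   # keep ids disjoint
    return labels, outliers
```

Offsetting the k-means ids keeps the dense-cluster ids and outlier-cluster ids in one label space, which is convenient for one-hot encoding later.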
4 New Sample Classification
Given a new icon, the goal is to turn it into two features: cluster id and outlier flag. The whole process is as follows:
Transform the icon into features. This is the process described in section 2.
Get a cluster prediction. This is a non-trivial problem with HDBSCAN, since the algorithm does not provide a prediction function. To perform this step, we therefore make use of a classifier. A feed-forward neural network could be used here, but we decided to use a k-nearest neighbors (KNN) model for its simplicity and accuracy. We fit the KNN model using the HDBSCAN labels only, so when a new sample comes in, the model looks at its k nearest neighbors' labels and decides by majority vote which cluster id to assign to the new sample. If a label other than -1 is assigned to the new sample, we return that cluster id and false for the outlier flag. Getting a label of -1 means that HDBSCAN would likely have called this sample an outlier, so we set the outlier flag to true and run a subsequent prediction with the k-means model that we trained on the outlier data to get the cluster id.
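Assuming feature vectors and HDBSCAN-style labels (with -1 marking outliers), this prediction step could be sketched with scikit-learn as follows; the function names, the choice of k, and the toy 2-D data are ours.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

def fit_cluster_predictor(X, labels, k=5):
    """KNN fit on the clustering labels (including -1 for outliers), used
    as the prediction function HDBSCAN lacks."""
    return KNeighborsClassifier(n_neighbors=k).fit(X, labels)

def predict_cluster(knn, outlier_kmeans, x):
    """Return (cluster_id, is_outlier) for one new feature vector x."""
    label = int(knn.predict(x.reshape(1, -1))[0])
    if label != -1:
        return label, False       # lands in a dense cluster
    # -1: HDBSCAN would likely have called this an outlier,
    # so defer to the k-means model trained on the outlier set
    return int(outlier_kmeans.predict(x.reshape(1, -1))[0]), True

# tiny worked example on hypothetical 2-D features
X = np.array([[0, 0], [0.1, 0], [0, 0.1],
              [10, 10], [10.1, 10], [10, 10.1],
              [5, 5], [5.1, 5], [5, 5.1]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1, -1, -1, -1])   # -1 = outliers
knn = fit_cluster_predictor(X, y, k=3)
outlier_km = KMeans(n_clusters=1, n_init=10, random_state=0).fit(X[y == -1])
```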
5 Experiments

In order to test the efficacy of our proposed method in terms of enhancement in malware prediction, we use a balanced sample of publicly available PE files obtained from virustotal.com, with 1,138 benign and 1,138 malware files. In order to visualize the icons we use in the experiment (Figure 2), we use t-SNE on the raw icon pixels. Although Figure 2 shows that malware and benign icons are well mixed, we will show in this section that our approach is capable of using the information in the icons to better detect malware.
Using our proposed method, we initially generate icon features (the MC, HOG, and AE features) and then cluster the icons. Further, for each PE file, using the publicly available Python package PEfile (a multi-platform Python module to parse and work with PE files, available at https://github.com/erocarrera/pefile), we generate "entropy", "Misc_VirtualSize", and "SizeOfRawData" features from the three sections ".text", ".data", and ".rsrc"; we shall refer to these as the PEfile features. In order to test the effectiveness of the icon clusters generated using our proposed method in better detecting malware, we build three prediction models: 1) lasso logistic regression (L1), 2) ridge logistic regression (L2), and 3) linear support vector machine (SVM). Each model is then fit once using only the PEfile features and once using both the PEfile features and the one-hot encoded icon cluster feature (code for reproducing the experiment results is available at https://github.com/CylanceSPEAR/improving-malware-detection-accuracy-by-extracting-icon-information).
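For reference, the "entropy" feature is the Shannon entropy of a section's raw bytes, which can be computed directly; to our understanding, pefile's section-level `get_entropy` method returns the same 0-to-8 bits-per-byte quantity, but the standalone version below makes the math explicit.

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy, in bits per byte, of a PE section's raw contents.
    Ranges from 0.0 (constant bytes) to 8.0 (uniformly random bytes);
    packed or encrypted sections tend toward the high end."""
    if not data:
        return 0.0
    n = len(data)
    counts = Counter(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```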
In order to better estimate the out-of-sample accuracy of the models, the original data are randomly split into train data (80% of the data) and test data (20% of the data). The division into train and test is done using stratified sampling, which guarantees balanced labels in the resulting train and test sets. The test data remain untouched during model fitting and are used solely for the final out-of-sample accuracy evaluation. To avoid overfitting, all of our proposed models are regularized (with either L1 or L2 penalties). Regularization parameters are tuned using stratified 4-fold cross-validation. As an example, Figure 3 shows results from the optimization of the regularization parameter of the lasso logistic regression model using stratified 4-fold cross-validation.
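This split-and-tune procedure can be sketched with scikit-learn. The synthetic data and parameter grid below are placeholders, not the paper's actual PEfile/icon feature matrix or tuning range.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Synthetic stand-in for the PEfile + icon-cluster feature matrix;
# the real experiment uses the 2,276 labeled PE files described above.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# 80/20 stratified split preserves the benign/malware label balance
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# lasso logistic regression: tune the regularization strength
# with stratified 4-fold cross-validation
grid = GridSearchCV(
    LogisticRegression(penalty="l1", solver="liblinear"),
    param_grid={"C": np.logspace(-3, 3, 7)},
    cv=StratifiedKFold(n_splits=4, shuffle=True, random_state=0),
    scoring="accuracy",
).fit(X_tr, y_tr)

test_acc = grid.score(X_te, y_te)   # out-of-sample accuracy of the tuned model
```

The held-out test score is computed exactly once, after tuning, which is what keeps it an honest estimate of out-of-sample accuracy.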
Table 1 shows the results of fitting the three models both with and without the icon cluster feature. The table shows the regularization parameter optimized using stratified 4-fold cross-validation, along with measures of accuracy at the optimized value: cross-validation accuracy and its standard error, cross-validation true positive rate and its standard error, and cross-validation true negative rate and its standard error. Finally, the fitted models are tested using the test data, and we report test accuracy, test true positive rate, test true negative rate, and the area under the curve.
As one can see, adding the icon cluster feature to the feature set has consistently boosted accuracy and area under the curve across all three models. This is a clear indication that our proposed icon clustering technique can help improve malware detection in PE files. Figure 4 also supports this conclusion: it shows the ROC curves for our three candidate models, where the black curve corresponds to the model with the icon cluster feature and the red curve to the model without it.
6 Conclusion

In this paper, we proposed a new approach for incorporating information from the icons of PE files into prediction models in order to better detect malware. Rather than using raw icon pixel values, we proposed extracting features using a combination of manually created features, a histogram of oriented gradients, and autoencoder-generated features, which led to 1,114 features. Using the extracted features, we cluster the icons. This process is tantamount to reducing the extracted 1,114 features to a single feature (alternatively, one may also use the boolean outlier flag discussed in section 4 as a feature, but we decided not to use it in this paper), and yet it still retains meaningful information about the image.
Using publicly available data, we ran experiments testing the effectiveness of our proposed icon clusters in better predicting malware. Our experiments showed a significantly higher area under the curve of the ROC plot (Figure 4), as well as an average increase of 10% in malware prediction accuracy, when our proposed icon clusters are used inside the prediction model. Table 1 adds a compelling argument in favor of our proposed method, showing that not only the accuracy but also the true positive and true negative rates increased in the models with the icon cluster feature. This work has shown that PE icons contain useful knowledge for malware detection. This paper, along with many others in the field, once again shows that malware leaves traces in unexpected places, and discovering those hidden traces can improve the accuracy of malware prediction models.
-  Y. Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009.
-  J. Bhimani, N. Mi, M. Leeser, and Z. Yang. Fim: Performance prediction for parallel computation in iterative data processing applications. In Cloud Computing (CLOUD), 2017 IEEE 10th International Conference on, pages 359–366. IEEE, 2017.
-  J. Bhimani, Z. Yang, M. Leeser, and N. Mi. Accelerating big data applications using lightweight virtualization framework on enterprise cloud. In 21st IEEE High Performance Extreme Computing Conference (HPEC 2017), 2017.
-  F. Brockherde, L. Vogt, L. Li, M. E. Tuckerman, K. Burke, and K.-R. Müller. By-passing the kohn-sham equations with machine learning. Nature Communications (accepted), 2016.
-  R. J. G. B. Campello, D. Moulavi, and J. Sander. Density-Based Clustering Based on Hierarchical Density Estimates. In Knowledge-Based Intelligent Information and Engineering Systems, pages 160–172. Springer Berlin Heidelberg, Berlin, Heidelberg, 2013.
-  B. Chen, S. Escalera, I. Guyon, V. Ponce-López, N. Shah, and M. O. Simón. Overcoming calibration problems in pattern labeling with pairwise ratings: application to personality traits. In Computer Vision–ECCV 2016 Workshops, pages 419–432. Springer, 2016.
-  F. Faber, L. Hutchison, B. Huang, J. Gilmer, S. Schoenholz, G. Dahl, O. Vinyals, S. Kearnes, P. Riley, and A. von Lilienfeld. Prediction errors of molecular machine learning models lower than hybrid dft error. Journal of Chemical Theory and Computation, 2017.
-  W. T. Freeman and M. Roth. Orientation histograms for hand gesture recognition. Technical Report TR94-03, MERL - Mitsubishi Electric Research Laboratories, Cambridge, MA 02139, Dec. 1994.
-  J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl. Neural message passing for quantum chemistry. In International Conference on Machine Learning, 2017.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. arXiv.org, Dec. 2015.
-  L. Li, T. E. Baker, S. R. White, and K. Burke. Pure density functional for strong correlations and the thermodynamic limit from machine learning. Phys. Rev. B, 94(24):245129, 2016.
-  L. Li, J. C. Snyder, I. M. Pelaschier, J. Huang, U.-N. Niranjan, P. Duncan, M. Rupp, K.-R. Müller, and K. Burke. Understanding machine-learned density functionals. International Journal of Quantum Chemistry, 116(11):819–833, 2016.
-  W. Li, F. Zhou, W. Meleis, and K. Chowdhury. Learning-based and data-driven tcp design for memory-constrained iot. In Distributed Computing in Sensor Systems (DCOSS), 2016 International Conference on, pages 199–205. IEEE, 2016.
-  W. Li, F. Zhou, W. Meleis, and K. Chowdhury. Dynamic generalization kanerva coding in reinforcement learning for tcp congestion control design. In Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems, pages 1598–1600. International Foundation for Autonomous Agents and Multiagent Systems, 2017.
-  L. v. d. Maaten and G. E. Hinton. Visualizing Data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.
-  L. McInnes, J. Healy, and S. Astels. hdbscan: Hierarchical density based clustering. The Journal of Open Source Software, 2(11), mar 2017.
-  F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
-  V. Ponce-López, B. Chen, M. Oliu, C. Corneanu, A. Clapés, I. Guyon, X. Baró, H. J. Escalante, and S. Escalera. Chalearn lap 2016: First round challenge on first impressions-dataset and results. In Computer Vision–ECCV 2016 Workshops, pages 400–418. Springer, 2016.
-  H. Shao, S. Chen, J.-y. Zhao, W.-c. Cui, and T.-s. Yu. Face recognition based on subset selection via metric learning on manifold. Frontiers of Information Technology & Electronic Engineering, 16(12):1046–1058, 2015.
-  C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. J. Goodfellow, and R. Fergus. Intriguing properties of neural networks. arXiv.org, Dec. 2013.
-  S. van der Walt, J. L. Schönberger, J. Nunez-Iglesias, F. Boulogne, J. D. Warner, N. Yager, E. Gouillart, T. Yu, and the scikit-image contributors. scikit-image: image processing in Python. PeerJ, 2:e453, 6 2014.
-  L.-p. Wang, B. Chen, J.-f. Zhang, and Z. Chen. A new model for calculating the design wave height in typhoon-affected sea areas. Natural hazards, 67(2):129–143, 2013.
-  L.-p. Wang, B.-y. Chen, C. Chen, Z.-s. Chen, and G.-l. Liu. Application of linear mean-square estimation in ocean engineering. China Ocean Engineering, 30(1):149–160, 2016.
-  M. Wojnowicz, G. Chisholm, and M. Wolff. Suspiciously structured entropy: Wavelet decomposition of software entropy reveals symptoms of malware in the energy spectrum. In FLAIRS Conference, pages 294–298, 2016.