The increasing sophistication and diversity of malware has long ago surpassed the capabilities of malware analysts. Vast amounts of data need to be analyzed every day in search of potential threats, but the current techniques are simply not enough to comply with the high demand . Malware writers use different methods, such as polymorphism and obfuscation, to outperform detection systems. Typical approaches often depend on signature-based detection, where a file’s signature is compared to a list of known malicious ones. Though such approaches are computationally less demanding, they are susceptible to simple evasive techniques and zero-day exploits. Common detection techniques rely on static and dynamic analysis. Static analysis is computationally efficient and safer than other methods as it does not require the code’s execution to allow being analyzed; however, it is also easily affected by evasive techniques. On the other hand, dynamic analysis has higher chances of detecting unseen samples but heavily affects the system performance since the file is analyzed in a virtual environment.
To overcome the above-mentioned problems, in this paper we investigate and evaluate an efficient detection mechanism relying on visual representations of malware. The proposed approach utilises binary visualisation, converting a file’s binary data into an image, and self-organizing incremental neural networks (SOINN) for the analysis and detection of malicious payloads. By doing so, the likelihood of detecting obfuscated code in the file is increased . The algorithm chosen for the binary representation uses Hilbert space-filling curves to cluster the data, ensuring that the data is grouped in an optimal way. Moreover, the use of SOINN in the detection mechanism, which is an unsupervised machine learning algorithm known for its incremental abilities, provides the ability of learning fast by exploiting only the information needed for building the neural network (NN) and removing redundant nodes. These properties of SOINN are inherited from self-organizing maps (SOM) and growing neural gas (GNG) . Our contribution is the novel combination of binary visualization with SOINN whose robustness to noise makes it lightweight and faster than other approaches, and hence ideal for very large datasets allowing the provisioning of efficient detection mechanisms as a service. A prototype was developed for analyzing detection performance, where malware samples from various categories and file types were included to test the model in different scenarios. The experimental results showed a high overall detection accuracy, which attained its maximum value % for particular malware types.
The paper is organized as follows: in Section II we discuss related work in the area of binary visualization and machine learning, as well as other approaches combining both methods. Sections III, IV describe the proposed methodology and experimental results. Finally, a discussion of the results obtained is presented in Section V before concluding in the last section.
Ii Related work
Analysis and detection of malware have been a persistent challenge for security professionals. Traditional attempts are focused on static and dynamic analysis, but the rapid growth and evolution of malicious code have compelled researchers in order to derive novel analysis and detection solutions. Machine learning (ML) is among the innovative technologies that have been employed towards that direction. Many works have focused on building frameworks for analysis [7, 20, 26], acquiring static features [15, 38]
and classifying malware families[21, 28, 30]. A system relying on ML to detect an unknown malicious payload, without the need for removing obfuscation, was presented in support vector machine (SVM), on detecting malicious application programming interface (API) call sequences was perfomed in , where the superiority of the SVM classifier was demonstrated. The main weakness of ML-based detection systems is that they rely on a virtual environment to analyze samples; this not only affects their run-time performance, but also endangers the whole system since the samples need to be executed. In addition, the ability of malware to adapt its behavior to the execution environment, the excessive number of features that need to be extracted per sample, and the high volume of malware instances being reported, reduce the chances of accurate detection [33, 34].
Alternative approaches exploring the use of samples’ visual representation have been proposed; most efforts consider the clustering of malware by families or similarity [8, 11, 23, 31], whereas other works focus on evaluating the sample-to-image transformation with the majority utilizing grayscale images [12, 13]. Examples of solutions proposed include the visualization of executables for reversing & analysis (VERA) framework , which represents on a 3-dimensional space a program’s flow to identify obfuscating parts of its code, and the similarity evidence explorer for malware (SEEM) tool , which provides the ability to visually compare a given binary with classified samples to identify the similar ones. Although this line of research is quite promising, certain challenges need to be tackled. More precisely, most approaches are based on malware classification and not on its detection that is still performed by existing tools. In addition, the proposed approaches are often not robust to commonly used techniques that attackers could use, like data relocation and data redundancy, to circumvent the detection mechanism .
Techniques investigating the combination of both aforementioned categories for malware analysis have not received much attention. Computer vision techniques and SVMs were used in for malware detection achieving an accuracy of % on a dataset with 37K samples. A combination of ML, visualization and steganalysis was proposed in  to detect packed malware; both SVMs and a variation of the -nearest neighbour
(KNN) were used, with the later achieving a classification accuracy of%. A similarity detection framework with 1.2M samples was used in  for malware classification; an accuracy of
% was achieved (for about half of the samples), but the use of similarity considerably lowers the probability of detecting unknown malware with dissimilar structure. Unfortunately, the above approaches are also susceptible to common attackers’ techniques. Thus, it seems that binary representation and ML still pose challenges, and resemble more than a modern addition to current systems rather than a replacement.
Iii The proposed method
This section presents the methodology used for developing the proposed malware detection system illustrated in Fig. 1
; it relies on visualizing a file’s binary onto a two-dimensional space and then using SOINN, after having performed feature extraction, to distinguish between benign and malicious files. A dataset of binary images was first generated from the sample files (SectionIII-A). The images are next pre-processed, where prominent features containing valuable information about the image are automatically extracted to form a vector of length (Section III-B). The vectors are then provided to SOINN (Section III-C), which performs clustering and classification; the training data are stored in a database for subsequent use during analysis and detection. A similar process is followed during the testing phase where, after having pre-processed the input file and generated the feature vector, the outcome is evaluated against the training data. The selection of the nearest vector (to the input given) and the class type is based on the Euclidean distance.
Iii-a Binary visualisation
Binary visualization transforms the binary contents of a file to another domain that can be visually represented (normally a 2-dimensional space). Our visual representation algorithm is based on binvis.io , an online tool that uses colour schemes to represent different binary or ASCII values. The clustering algorithm that is employed in Fig. 1 is based on the Hilbert space-filling curve, which outperforms other curves in preserving the locality between objects in multi-dimensional spaces [5, 17], thus creating a much more appropriate imprint of the image. The above combination is useful for detecting obfuscated code in malicious files, which could go unnoticed otherwise, since they tend to be grouped together. To convert a file into an image, its data are seen as a byte string, where each byte’s value is compared against the ASCII table and is attributed to a color according to the division it belongs:
blue color, if the character is printable;
green color, if the character is control; and
red color, if the character is extended.
Besides the above divisions in the ASCII table, two more classes were added, namely 0x00 (colored black) and 0xFF (colored white), to represent null and (non-breaking) spaces respectively —an example visualization is given in Fig. 2.
Iii-B Feature extraction
During the pre-processing stage important features, capable of indicating the presence of malicious payload, are extracted from the image so as to be used in the classification process. The images’ size is an important factor for the algorithm’s performance. Indeed, by having images as small as pixels, a vector of length is obtained that might just not be large enough to include meaningful information for the distinction of malicious and benign files. On the other hand, this might still be the case even when the size is increased to pixels. It is thus reasonable to assume that not all the values are relevant to distinguish the classes, since specific locations and patterns carry more information than other image areas. To have a valid analysis and increase the probability of detecting peculiarities, it is necessary to first detect the regions of interest (RoI) on the image where specific malware patterns may appear.
Similar to our methodology,  is clustering images using SOINN, but the work concerned the identification of persons over non-overlapping cameras. In order to obtain a reduced number of features, while minimizing the impact on system’s performance, images were divided into stripes to account for regions having different characteristics, since this was found to provide improved results during identification ; different colour spaces and texture features were utilized on each stripe prior to conversion into histograms [10, 40], thus significantly reducing the feature vector’s length.
The same approach is adopted in our case for obtaining the feature vector from the images, with the difference of having the images divided into parts —top, bottom, upper and lower middle— and using only the RGB color space to compose the histogram. These regions were chosen due to the differences occurring in the majority of patterns obtained from each. The overall histogram, which serves as the feature vector of length , is the result of concatenating the individual regions’ histograms, each having length (the RGB color space). It was shown during the experimentation that the number of features did not have a noticeable impact on the classification accuracy or the system performance.
Iii-C Self-organizing incremental neural network
The feature vectors extracted from the images are given to SOINN, which has two layers; the first layer aims at learning the topological structure of the NN, whereas the second layer determines the number of clusters based on the input received from the first layer. Among other things, SOINN maintains a set of nodes, the set of connections between the nodes, and the set of weights (i.e. feature vectors) associated with each node. The algorithm used for training the NN is employed at both layers, with the difference that the similarity threshold determining if a new node should be added in is adaptive (resp. constant) at the first (resp. second) layer. The algorithm is briefly described below; the details can be found in .
The algorithm is initialized by setting , with weights chosen uniformly at random from the feature vectors, and to be the empty set. For each input vector , the algorithm computes the closest nodes
referred to as the winner and second winner respectively, based on the Euclidean distance metric . If no similarity exists between the input vector and either or , the new node is inserted into the network ; otherwise, a connection between the winners is added to , if not present, and its age is set to zero, i.e. . Next, the age of all connections with as a starting point is increased by . Since this process would create regions of highly connected nodes (corresponding to the formed classes), the algorithm is using an age threshold in order to determine if a connection must be dropped. More precisely, if we have that a connection is such that , then its importance is low compared to others and can be omitted. The similarity threshold used (at the first layer) above is node-specific, and is determined by the following function
where is either the set of neighbors of node , if this set is nonempty, or it is equal to otherwise.
Noise removal is one of the important aspects of SOINN that separates it from other algorithms and essentially allows the NN to maintain only the valuable pieces of information. To eliminate noisy nodes the algorithm is first provided with a value that determines the number of iterations the algorithm has to go through before searching for nodes that do not give important data. If a node has no neighbors, this means there is a possibility of distorting the pattern; nodes with two or less neighbors are removed only if considered insignificant. After the NN has been completely trained, the Euclidean distance is applied again to classify a file; the argument corresponding to the shortest distance is recognized as the winning node and its class is used to classify the input vector.
Iv Experimental results
In this section, we present the performance analysis results of the prototype that was implemented based on the above methodology. The dataset that was used contained a total of K benign files, obtained from trusted, common and portable applications, and a total of K malicious files collected from the VirusShare website and included classes like viruses, worms, backdoors, rootkits, and trojans. Both the benign and malicious samples admitted the same distribution into various file types (with a predominance of executable files), which is illustrated in Table I.
The performance of SOINN was assessed for various values of the parameters and in terms of accuracy rate. For each setup, Monte Carlo simulations were performed, with the same dataset, in order to get the average accuracy rate and the number of false positives (FP) and false negatives (FN); the results obtained are shown in Fig. 3. The algorithm behaves well in distinguishing malicious from benign files in all cases with the exception of text files (.txt and .htm). The fact that benign .doc and .pdf files have less number of clusters in the images than the malicious ones (this was marked by the algorithm as a feature for malware), along with the fact that they do not just contain printable ASCII characters, has improved detection performance. A general remark is that the strong differences in the format of files do not allow obtaining a single solution fitting all cases, therefore having an impact on system’s performance. A solution to this problem is to take a file’s type into account, allowing the algorithm to properly adjust its behavior in each case. The impact that and have on accuracy performance is shown in Fig. 4. It is seen that increasing excessively has an adverse effect in removing noise from the NN, and the time that the system needs for processing information increases with . The best accuracy rate during testing was achieved for and . The training of the system was extremely fast, taking about seconds to compute a dataset with K feature vectors.
Moving to binary visualization, the images acquired reveal that it is possible to identify malware files through images. By representing a file’s binary contents as an image, the potential patterns and discrepancies in data start to emerge. In Fig. 2, the binary information composing examples of malicious and benign files is shown. Malicious files have a tendency for often including ASCII characters of various categories, presenting a colorful image, while benign files have a cleaner picture and distribution of values. Another technique to identify malicious files is to search for green and black areas since malware has a high predominance of control and null characters. In contrast, benign files are mostly made of printable and ASCII extended characters. A further aspect to note is that most benign files have a few clusters at the top and bottom of the image whereas malicious files have a higher variety of clusters. An insight on the color distribution of each class (i.e. benign vs. malware) is given in Fig. 5. Although the green color is observed in about the same percentage of pixels, their spreading in the images differs with malware admitting regions of high concentration. The difference in black pixels is evident, since the use of null characters to transport and hide data or to exploit a system is quite common [3, 39].
An experiment was first performed, based on the dataset of
K samples, to evaluate SOINN’s detection accuracy and to obtain a quick estimate of its performance. To have more accurate information about its behavior when presented with high data volumes, another dataset ofK samples was also created. The algorithm was found to be highly efficient and the time needed to complete the training phase of the last dataset was under minutes in a personal workstation equipped with an Intel Core i5 processor.
Using visual representation techniques allows to obtain an insight into the structural differences between malicious and benign files. Such differences are harder to identify in some file types; to be more precise, although .txt files are mostly composed of printable characters (blue), this is not the case in a (large) number of samples due to the existence of control characters, something that also holds for infected .txt files thus resulting in increased difficulty during analysis. To solve this problem, more representative (of a malware’s features) samples could be used having a mixture of colors and clusters in the image. Executable files exhibit a more diverse color distribution as they include different categories from the ASCII table. Hence, in contrast to text files, it is unlikely that a high percentage of blue pixels will appear in a benign executable file’s image representation, something that in turn increases the chances of being malware. Due to the above, it is evident why training the same NN for all file types drastically decreases the detection accuracy rate by %. This however could be overcome if the files’ extension is also taken as input during the analysis and training of the neural network. By analyzing Fig. 3, it is readily seen that the proposed algorithm is better in identifying text-based files than other file types. This means that our approach can accurately detect ransomware that has recently become one of the preferred and most consequential forms of attack. To the best of our knowledge, no other work has tested the proposed methodology for different file types; instead, solely executables are being used for measuring the performance of a detection mechanism.
Based on our experiments and the colour distribution, it was observed that malicious samples had a larger number of black colored regions than benign files. These regions represent null characters and various motives exist behind their use when the malware is written. Such characters, apart from representing the end of a string or being used for padding[14, 35], are also used by cyber-criminals to hide data or instructions in the registry [22, 36], a practice sought after by malware writers; however, it equally likely that these characters indicate (possibly malicious) unknown files that are being created at a host . Several attacks are possible just by using null characters, e.g. to circumvent security mechanisms and load an infected file in order to allow remotely controling a system . More details about the use of null characters for launching cyber-attacks can be found at the website of common attack pattern enumeration and classification (CAPEC) community. Regarding the patterns created, malware images tend to have more compact clusters and a variety of colors, whereas benign files are generally represented by almost static images with only a few and small-scale clusters. It was also noticed that the top and bottom half of the images carry information that is more crucial to the analysis and detection of malware than the rest of the areas, since this is where different clusters, from picture to picture, are often perceived.
Regarding the algorithm’s performance, SOINN achieved an accuracy of % for just over K samples. As Fig. 7 shows, the accuracy rate (resp. false positives and false negatives) increases (resp. decrease) with the number of samples, and thus it would be possible to get much better results provided that more samples are added during the training phase. SOINN is capable of reducing its internal structure without having to abdicate from new or old knowledge (stability-plasticity dilemma). Its noise removal properties ensure that the network is not limited in capacity, a problem that is found in nearest neighbours algorithm . SVMs have similar drawbacks; the major problem are the highly complex and extensive memory requirements, as well as, the slow response during the testing. Ont the other hand, SOINN is lightweight and extremely fast, the whole network was trained in under seconds for about K iterations.
In this paper, an image-based malware detection technique was proposed that uses unsupervised learning. The study tested if malicious files could be differentiated from benign files by focusing on features extracted from their visual representation. The proposed approach utilizes feature vectors of lengthand does not employ any computationally intensive technique or algorithm, which allowed fast processing and training times. The experimental tests were conducted for different file types and malware categories and revealed that the algorithm, whose accuracy is improved with the available number of samples, can be successfully used for malware detection. In particular, the algorithm achieved an overall average detection rate of about % with % false positives and % false negatives. Ongoing work, aims at extending the approach presented to mobile platforms as a lightweight cloud-based antivirus. Since there is great potential in improving the results obtained by enhancing feature extraction, a number of options is currently being considered, including Gabor texture filters and multi-resolution analysis techniques among others.
-  C. Burgess, F. Kurugollu, S. Sezer, and K. McLaughlin, “Detecting packed executables using steganalysis,” in proc. 5th European Workshop on Visual Information Processing (EUVIP), pp. 1–5, 2014.
-  A. Cortesi, “binvis.io: Visual Analysis of Binary Files,” 2016. Available at: http://binvis.io/#/
-  M. Dadkhah, A. Dadkhah, and J. Deng, “Malware Contaminated Website Detection by Scanning Page Links,” Int’l J. Inform. Technology & Electrical Engineering (ITEE), vol. 3, no. 6, pp. 1–7, 2014.
-  G. Dahl and J. D. L. Stokes, “Large-scale malware classification using random projections and neural networks,” in proc. 2013 IEEE Int’l Conf. Acoustics, Speech and Signal Process. (ICASSP), pp. 3422–3426, 2013.
-  C. Faloutsos and Y. Rong, “DOT: A Spatial Access Method Using Fractals,” in proc. 7th Int’l Conf. Data Engineering (ICDE), pp. 152–159, 1991.
-  J. Fonseca, M. Vieira, and H. Madeira, “The Web Attacker Perspective: A Field Study,” in proc. 21st IEEE Int’l Symp. Software Reliability Engineering, pp. 299–308, 2010.
D. Gavrilut, M. Cimpoesu, D. Anton, and L. Ciortuz, “Malware Detection Using Perceptrons and Support Vector Machines,” in proc.2009 Computation World: Future Computing, Service Computation, Cognitive, Adaptive, Content, Patterns, pp. 283–288, 2009.
-  R. Gove, et al., “SEEM: a scalable visualization for comparing multiple large sets of attributes for malware analysis,” in proc. 11th Workshop on Visualization for Cyber Security (VizSec), pp. 72–79, 2014.
-  K.-P. Grammatikakis, A. Ioannou, S. Shiaeles, and N. Kolokotronis, “Are cracked applications really free? An empirical analysis on Android devices,” in proc. 16th IEEE Int’l Conf. Dependable, Autonomic and Secure Computing (DASC), pp. 730–735, 2018.
-  D. Gray and H. Tao, “Viewpoint Invariant Pedestrian Recognition with an Ensemble of Localized Features,” in proc. European Conference on Computer Vision (ECCV), LNCS vol. 5302, pp. 262–275, 2008.
-  K. S. Han, B. J. Kang, and E. G. Im, “Malware Analysis Using Visualized Image Matrices,” The Scientific World Journal, vol. 2014, Art. ID 132713, pp. 1–15, 2014.
-  K. S. Han, J. H. Lim, B. Kang, and E. G. Im, “Malware analysis using visualized images and entropy graphs,” Int’l J. Inform. Security, vol. 14, no. 1, pp. 1–14, 2015.
-  X. Han, J. Sun, W. Qu, and X. Yao, “Distributed malware detection based on binary file features in cloud computing environment,” in proc. 26th Chinese Control and Decision Conf. (CCDC), pp. 4083–4088, 2014.
-  S. H. Huseby, “Common Security Problems in the Code of Dynamic Web Applications,” Web Appl. Security Consortium, pp. 1–5, 2005.
R. Islam, R. Tian, L. Batten, and S. Versteeg, “Classification of Malware Based on String and Function Feature Selection,” in proc.2nd Cybercrime and Trustworthy Computing Workshop (CTC), pp. 9–17, 2010.
-  H. V. Jagadish, “Linear clustering of objects with multiple attributes,” in proc. 1990 ACM SIGMOD Int’l Conf. Management of Data, pp. 332–342, 1990.
-  H. V. Jagadish, “Analysis of the Hilbert curve for representing two-dimensional space,” Inform. Processing Letters, vol. 62, no. 1, pp. 17–22, 1997.
-  K. Kancherla and S. Mukkamala, “Image visualization based malware detection,” in proc. 2013 IEEE Symp. Computational Intelligence in Cyber Security (CICS), pp. 40–44, 2013.
-  D. Kirat, L. Nataraj, G. Vigna, and B. S. Manjunath, “SigMal: a static signal processing based malware triage,” in proc. 29th Annual Computer Security Applications Conf. (ACSAC), pp. 89–98, 2013.
-  J. Z. Kolter and M. A. Maloof, “Learning to detect malicious executables in the wild,” in proc. 10th ACM SIGKDD Int’l Conf. Knowledge Discovery and Data Mining (KDD), pp. 470–478, 2004.
-  A. Lakhotia, A. Walenstein, C. Miles, and A. Singh, “VILO: a rapid learning nearest-neighbor classifier for malware triage,” J. Computer Virology and Hacking Techniques, vol. 9, no. 3, pp. 109–123, 2013.
-  M. Ligh, S. Adair, B. Hartstein, and M. Richard, Malware Analyst’s Cookbook and DVD, Wiley Publishing, Inc., 2011.
-  A. Long, J. Saxe, and R. Gove, “Detecting malware samples with similar image sets,” in proc. 11th Workshop on Visualization for Cyber Security (VizSec), pp. 88–95, 2014.
-  C. Malin, E. Casey, and J. Aquilina, Malware Forensics Field Guide For Windows Systems: Digital Forensics Guides, Elsevier, 2012.
-  L. Nataraj, S. Karthikeyan, G. Jacob, and B. S. Manjunath, “Malware images: visualization and automatic classification,” in proc. 8th Int’l Symp. Visualization for Cyber Security (VizSec), Art. no. 4, pp. 1–7, 2011.
N. Nissim, R. Moskovitch, L. Rokach, and Y. Elovici, “Novel active learning methods for enhanced PC malware detection in windows OS,”Expert Systems with Applications, vol. 41, no. 13, pp. 5843–5857, 2014.
U. Park, et al., “ViSE: Visual Search Engine Using Multiple Networked Cameras,” in proc.
18th Int’l Conf. Pattern Recognition (ICPR), pp. 1204–1207, 2006.
-  R. S. Pirscoveanu, et al., “Analysis of Malware behavior: Type classification using machine learning,” in proc. 2015 Int’l Conf. Cyber Situational Awareness, Data Analytics and Assessment (CyberSA), pp. 1–7, 2015.
-  D. A. Quist and L. M. Liebrock, “Visualizing compiled executables for malware analysis,” in proc. 6th Int’l Workshop on Visualization for Cyber Security, pp. 27–32, 2009.
-  K. Rieck, P. Trinius, C. Willems, and T. Holz, “Automatic Analysis of Malware Behavior using Machine Learning,” J. Computer Security, vol. 19, no. 4, pp. 639–668, 2011.
-  J. Saxe, D. Mentis, and C. Greamo, “Visualization of shared system call sequence relationships in large malware corpora,” in proc. 9th Int’l Symp. Visualization for Cyber Security (VizSec), pp. 33–40, 2012.
-  F. Shen and O. Hasegawa, “An Incremental Network for On-line Unsupervised Classification and Topology Learning,” Neural Networks, vol. 19, no. 1, pp. 90–106, 2006.
-  S. Shiaeles, V. Katos, A. Karakos, and B. Papadopoulos, “Real time DDoS detection using fuzzy estimators,” Computers & Security, vol. 31, no. 6, pp. 782–790, 2012.
-  S. Shiaeles, and M. Papadaki, “FHSD: An Improved IP Spoof Detection Method for Web DDoS Attacks,” The Computer Journal, vol. 58, no. 4, pp. 892–903, 2014.
-  M. Sikorski and A. Honig, Practical Malware Analysis: The Hands-On Guide to Dissecting Malicious Software, No starch press, 2012.
-  A. Sovarel, D. Evans, and N. Paul, “Where’s the FEEB? The Effectiveness of Instruction Set Randomization,” in proc. 14th USENIX Security Symposium, pp. 145–160, 2005.
-  Y. Sun, H. Liu, and Q. Sun, “Online learning on incremental distance metric for person re-identification,” in proc. 2014 IEEE Int’l Conf. Robotics and Biomimetics (ROBIO), pp. 1421–1426, 2014.
-  D. Uppal, R. Sinha, V. Mehra, and V. Jain, “Malware Detection and Classification Based on Extraction of API Sequences,” in proc. 2014 Int’l Conf. Advances in Computing, Communications and Informatics (ICACCI), pp. 2337–2342, 2014.
-  C. Wressnegger, K. Freeman, F. Yamaguchi, and K. Rieck, “From Malware Signatures to Anti-Virus Assisted Attacks,” Computer Science Report, Report No. 2016-03, Technische Univ. Braunschweig, 2016.
-  W. Zheng, S. Gong, and T. Xiang, “Reidentification by Relative Distance Comparison,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 35, no. 3, pp. 653–668, 2013.