A Large-Scale Database for Graph Representation Learning

11/16/2020 ∙ by Scott Freitas, et al. ∙ Microsoft ∙ Georgia Institute of Technology

With the rapid emergence of graph representation learning, the construction of new large-scale datasets is necessary to distinguish model capabilities and accurately assess the strengths and weaknesses of each technique. By carefully analyzing existing graph databases, we identify 3 critical components important for advancing the field of graph representation learning: (1) large graphs, (2) many graphs, and (3) class diversity. To date, no single graph database offers all of these desired properties. We introduce MalNet, the largest public graph database ever constructed, representing a large-scale ontology of software function call graphs. MalNet contains over 1.2 million graphs, averaging over 17k nodes and 39k edges per graph, across a hierarchy of 47 types and 696 families. Compared to the popular REDDIT-12K database, MalNet offers 105× more graphs, 44× larger graphs on average, and 63× the classes. We provide a detailed analysis of MalNet, discussing its properties and provenance. The unprecedented scale and diversity of MalNet offer exciting opportunities to advance the frontiers of graph representation learning, enabling new discoveries and research into imbalanced classification, explainability and the impact of class hardness. The database is publicly available at www.mal-net.org.


1 Introduction

Figure 1: MalNet: Advancing State-of-the-Art Graph Databases. MalNet contains 1.2 million function call graphs averaging over 17k nodes and 39k edges per graph, across a hierarchy of 47 types and 696 families.
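
| Type | # graphs | # families | Nodes min | Nodes mean | Nodes max | Nodes std | Edges min | Edges mean | Edges max | Edges std | Avg. degree min | Avg. degree mean | Avg. degree max | Avg. degree std |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Adware | 884K | 250 | 7 | 14K | 211K | 16K | 4 | 31K | 605K | 38K | 0.50 | 2.21 | 6.24 | 0.36 |
| Trojan | 179K | 441 | 5 | 15K | 228K | 18K | 4 | 34K | 530K | 42K | 0.58 | 2.05 | 6.74 | 0.52 |
| Benign | 79K | 1 | 5 | 35K | 552K | 30K | 3 | 79K | 2M | 74K | 0.58 | 2.13 | 5.30 | 0.31 |
| Riskware | 32K | 107 | 5 | 12K | 173K | 16K | 4 | 30K | 334K | 39K | 0.58 | 2.16 | 5.42 | 0.56 |
| Addisplay | 17K | 38 | 37 | 13K | 98K | 15K | 37 | 28K | 246K | 34K | 0.92 | 1.97 | 4.38 | 0.37 |
| Spr | 14K | 46 | 12 | 28K | 169K | 21K | 7 | 67K | 369K | 52K | 0.58 | 2.27 | 4.70 | 0.44 |
| Spyware | 7K | 19 | 12 | 5K | 55K | 6K | 7 | 11K | 121K | 14K | 0.58 | 1.95 | 4.27 | 0.46 |
| Exploit | 6K | 13 | 19 | 24K | 102K | 14K | 14 | 45K | 250K | 30K | 0.74 | 1.88 | 3.34 | 0.33 |
| Downloader | 5K | 7 | 37 | 20K | 107K | 28K | 37 | 46K | 321K | 63K | 0.96 | 1.68 | 3.53 | 0.66 |
| Smssend++Trojan | 4K | 25 | 16 | 34K | 147K | 19K | 13 | 82K | 387K | 48K | 0.81 | 2.39 | 3.78 | 0.23 |

Table 1: Descriptive statistics for the 10 largest graph types in MalNet. See Appendix Table 5 for all graph statistics.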

The emergence of graph data across many scientific fields has led to intense interest in the development of representation learning techniques that encode structured information into low-dimensional space for a variety of important downstream tasks (e.g., toxic molecule detection, community clustering, malware detection). However, recent research focusing on developing graph kernels, neural networks and spectral methods to capture graph topology has revealed a number of shortcomings of existing benchmark datasets [cai2018simple, errica2019fair, schulz2019necessity, shchur2018pitfalls], which often contain graphs that are relatively: (1) limited in number; (2) smaller in scale in terms of nodes and edges; and (3) restricted in class diversity. The state of graph representation benchmarks (e.g., PROTEINS [borgwardt2005protein], IMDB [yanardag2015deep], REDDIT [yanardag2015deep]) is analogous to MNIST [lecun1998mnist] at its height: a staple of the computer vision community, and often the first dataset researchers would evaluate their methods on. The graph representation community is at a similar inflection point, as it is increasingly difficult for current databases to characterize and differentiate modern graph representation techniques [cai2018simple, errica2019fair].

To address these issues, we introduce a new graph database called MalNet, a large-scale ontology of software function call graphs (FCGs). Each FCG represents calling relationships between functions in a program, where nodes are functions and edges indicate inter-procedural calls. Through MalNet, we make three major contributions:

  • MalNet: Largest Database for Graph Representation Learning. MalNet contains 1.2 million function call graphs, averaging over 17k nodes and 39k edges per graph, across a hierarchy of 47 types and 696 families (Figure 1). This makes MalNet the largest public graph database constructed to date, offering 105× more graphs, 44× larger graphs on average, and 63× more classes compared to the popular REDDIT-12K database.

  • Revealing New Discoveries. The unprecedented scale of MalNet enables new and important discoveries that were previously not possible. Leveraging the function call graphs in MalNet, we study popular graph representation learning techniques in depth, and reveal: (1) the significant challenges they face in terms of scalability and their ability to handle large class imbalance; (2) that simple baselines can be surprisingly effective at the scale of MalNet; and (3) the algorithms’ critical dependencies on training data.

  • Enabling New Research Directions. MalNet offers unique opportunities to advance the frontiers of graph representation learning by enabling research into imbalanced classification, explainability and the impact of class hardness. We believe the diversity, scale and natural imbalance of MalNet will enable it to become a benchmark dataset to meet the future research needs of the graph representation community. By open-sourcing MalNet, we hope to inspire and invite more researchers to contribute to this exciting new resource.

2 Properties of MalNet

We begin by analyzing 5 key properties of the MalNet database: (1) scale (number of graphs, average graph size), (2) class hierarchy, (3) class diversity, (4) class imbalance and (5) cybersecurity applications. In Section 2.1 we compare MalNet against common graph classification datasets, summarizing the differences in Table 2.

Figure 2: Example of the graph type “worm” and its 7 families. Each graph type can share multiple families.

Scale. MalNet contains 1,262,024 function call graphs across 47 types and 696 families of malware. When stored on disk, MalNet takes over 443 GB of space in edge list format, with each graph containing 17,242 nodes and 39,043 edges, on average. This makes MalNet the largest public graph dataset constructed to date in terms of number of graphs, average graph size and number of classes. In Table 1, we provide descriptive statistics on the number of nodes, edges, and average degree of ten of the largest graph types (see Appendix Table 5 for a full comparison). We believe that this scale of data is crucial to the future development of graph representation techniques as current databases are too small to effectively differentiate and benchmark techniques on non-attributed graphs [cai2018simple, errica2019fair, schulz2019necessity, shchur2018pitfalls].

Hierarchy. Function call graphs are assigned a general type (e.g., Worm) and specialized family label (e.g., Spybot) using the Euphony [hurier2017euphony] classification structure (see Figure 2). To generate these labels, Euphony takes a VirusTotal [total2012virustotal] report containing up to 70 labels across a variety of antivirus vendors and unifies the labeling process by learning the patterns, structure and lexicon of vendors over time. While Euphony provides state-of-the-art performance, this task is considered an open challenge due to both naming disagreements [hurier2016lack, kantchelian2015better] and a lack of adopted naming standards [hurier2017euphony] across vendors. To help address this issue, we collect and release the raw VirusTotal reports containing up to 70 antivirus labels for each graph.

Diversity & Imbalance. MalNet offers 47 types and 696 families of function call graphs following a long-tailed distribution with imbalance ratios of 7,827 and 16,901, respectively. To put this in perspective, MalNet’s smallest type, Click, contains only 113 graphs, while the largest, Adware, contains 884,455. Models learning from long-tailed distributions tend to favor the majority class, leading to poor generalization performance on rare classes. While class imbalance is traditionally addressed by resampling the data (undersampling, oversampling) [chawla2002smote, peng2019trainable], reshaping the loss function (loss reweighting, regularization) [cao2019learning, cui2019class] or accounting for input-hardness [duggal2020elf], it is largely unexplored in the graph domain. We hope that MalNet can serve as a source of data to spark novel research in this critical area.
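As a quick check on the type-level figure quoted above, the imbalance ratio can be read as the size of the largest class divided by the size of the smallest. The sketch below assumes that definition, which the text does not spell out; the helper name `imbalance_ratio` is ours, and only the two class counts given above are used.

```python
def imbalance_ratio(class_counts):
    """Ratio of the largest class size to the smallest (assumed definition)."""
    sizes = list(class_counts.values())
    if not sizes or min(sizes) <= 0:
        raise ValueError("need at least one class, all with positive counts")
    return max(sizes) / min(sizes)


# Largest and smallest graph types quoted in the text;
# the remaining 45 types fall somewhere in between.
type_counts = {"Adware": 884_455, "Click": 113}

ratio = imbalance_ratio(type_counts)
print(round(ratio))  # prints 7827
```

Under this reading, the two counts reproduce the type-level ratio of 7,827 reported above.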

Cybersecurity Applications. A majority of malware samples are polymorphic in nature, meaning that subtle source code changes in the original malware variant can result in significantly different compiled code (e.g., instruction reordering, branch inversion, register allocation) [you2010malware, dullien2005graph]. Cybercriminals frequently take advantage of this to evade signature-based detection, a predominant form of malware detection [sathyanarayan2008signature]. Fortunately, these subtle source code changes have minimal effect on the control flow of the executable, which can be represented with a function call graph (see Figure 3). Research has demonstrated that function call graphs (FCGs) can effectively defeat the polymorphic nature of malware through techniques like graph matching [gallagher2006matching, hu2009large, kinable2011malware, kostakis2011improved, kong2013discriminant] and representation learning [gascon2013structural, jiang2018dlgraph]. Unfortunately, prior to the release of MalNet, no large-scale FCG datasets had been made publicly available, largely due to the proprietary nature of the data.

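| Application | Dataset | Graphs | Classes | Ratio | Avg. Node | Avg. Edge | Hierarchy |
|---|---|---|---|---|---|---|---|
| Cybersecurity | MalNet | 1,262,024 | 696 | 16,901 | 17,242 | 39,043 | ✓ |
| Cybersecurity | CGD [ranveer2015comparative] | 1,361 | 2 | 1.49 | | | - |
| Small molecule | PCBA [ramsundar2015massively] | 437,929 | 2 | - | | | - |
| Small molecule | MUV [rohrer2009maximum] | 93,087 | 2 | - | | | - |
| Small molecule | YEAST [yan2008mining] | 79,601 | 2 | 1.26 | | | - |
| Small molecule | HIV [wu2018moleculenet] | 41,127 | 2 | - | | | - |
| Small molecule | NCI1 [wale2008comparison] | 4,110 | 2 | 1 | | | - |
| Small molecule | PTC-MR [toivonen2003statistical] | 344 | 2 | 1.26 | | | - |
| Small molecule | MUTAG [debnath1991structure] | 188 | 2 | 1.98 | | | - |
| Computer vision | Fingerprint [riesen2008iam] | 2,800 | 4 | 276.5 | | | - |
| Computer vision | Letter-low [riesen2008iam] | 2,250 | 15 | 1 | | | - |
| Computer vision | Letter-med [riesen2008iam] | 2,250 | 15 | 1 | | | - |
| Computer vision | Letter-high [riesen2008iam] | 2,250 | 15 | 1 | | | - |
| Computer vision | FIRSTMM-DB [neumann2013graph] | 41 | 11 | 3 | | | - |
| Bioinformatic | DD [dobson2003distinguishing] | 1,178 | 2 | 1.42 | | | - |
| Bioinformatic | PROTEINS [borgwardt2005protein] | 1,113 | 2 | 1.47 | | | - |
| Bioinformatic | ENZYMES [borgwardt2005protein] | 600 | 6 | 1 | | | - |
| Social network | Reddit-T [rozemberczki2020api] | 203,088 | 2 | 15 | | | - |
| Social network | Twitch-E [rozemberczki2020api] | 127,094 | 2 | 1.16 | | | - |
| Social network | Github-S [rozemberczki2020api] | 12,725 | 2 | 1.15 | | | - |
| Social network | REDDIT-12K [yanardag2015deep] | 11,929 | 11 | 5.05 | | | - |
| Social network | Deezer-E [rozemberczki2020api] | 9,629 | 2 | 1.32 | | | - |
| Social network | COLLAB [leskovec2005graphs] | 5,000 | 3 | 3.35 | | | - |
| Social network | REDDIT-5K [yanardag2015deep] | 4,999 | 5 | 1 | | | - |
| Social network | REDDIT-B [yanardag2015deep] | 2,000 | 2 | 1 | | | - |
| Social network | IMDB-M [yanardag2015deep] | 1,500 | 3 | 1 | | | - |
| Social network | IMDB-B [yanardag2015deep] | 1,000 | 2 | 1 | | | - |
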
Table 2: Comparison of MalNet properties with common graph classification datasets. MalNet offers over 1.2 million graphs averaging over 17k nodes and 39k edges with a hierarchical class structure containing 47 types and 696 families. This makes MalNet the largest public graph database constructed to date, offering 105× more graphs, 44× larger graphs on average, and 63× the classes compared to the popular REDDIT-12K database.

2.1 MalNet: Advancing the State-of-the-Art

A number of well-labeled small datasets have served as training and evaluation benchmarks for most of today’s graph representation learning techniques. As the field advances, larger and more challenging datasets are needed for the next generation of algorithms. MalNet offers 105× more graphs, 44× larger graphs on average, and 63× the classes, compared to the popular REDDIT-12K database. We compare MalNet with other graph datasets and summarize the differences in Table 2.

Figure 3: Software function call graph from the Banker++Trojan type, and Acecard family. Each FCG represents calling relationships between functions in a program, where nodes represent functions and edges indicate inter-procedural calls. Highlighted in blue is one potential execution path.

Cybersecurity datasets. Aside from MalNet, CGD [ranveer2015comparative] is the only publicly available cybersecurity dataset we could identify for the task of graph classification, containing 1,361 function call graphs and two classes. In surveying the extensive FCG malware detection literature [gallagher2006matching, gascon2013structural, hu2009large, jiang2018dlgraph, kinable2011malware, kostakis2011improved, kong2013discriminant], we observed that almost all data is closed-source, likely due to a combination of security concerns and issues regarding private company data.

Small molecule datasets. There are numerous small molecule datasets, including: HIV [wu2018moleculenet], MUTAG [debnath1991structure], MUV [rohrer2009maximum], PCBA [ramsundar2015massively], NCI1 [wale2008comparison], PTC-MR [toivonen2003statistical], and YEAST [yan2008mining]. The HIV dataset, introduced by the Drug Therapeutics Program AIDS Antiviral Screen [dtpaids], tests the ability of 41,127 chemical compounds to inhibit HIV replication, placing each compound into one of three classes. MUTAG contains 188 chemical compound graphs divided into two classes according to their mutagenic effect on a bacterium. MUV and PCBA are constructed from PubChem BioAssay [wang2012pubchem], and contain 93,087 compounds across 17 tasks and 437,929 compounds across 128 tasks, respectively, where each task is a binary classification problem. NCI1 contains 4,110 chemical compounds, screened for their ability to inhibit the growth of a panel of human tumor cell lines. PTC-MR contains 344 graphs across 2 classes, reporting the effects of chemical compound carcinogenicity on rats. YEAST contains 79,601 molecule graphs screened for anti-cancer tests, with the binary classification of active or inactive.

Bioinformatic datasets. Three popular bioinformatic datasets are: DD [dobson2003distinguishing], ENZYMES [borgwardt2005protein] and PROTEINS [borgwardt2005protein]. DD is a dataset containing 1,178 protein structures grouped into 2 categories (enzyme and non-enzyme). ENZYMES contains 600 graphs of protein tertiary enzyme structures with the task of assigning each enzyme to one of 6 levels. Similarly, PROTEINS contains 1,113 protein graphs classified into either enzyme or non-enzyme.

Computer vision datasets. Three common computer vision datasets are: Fingerprint [riesen2008iam], FIRSTMM-DB [neumann2013graph] and Letter (low, med, high) [riesen2008iam]. Fingerprint contains 2,800 fingerprint graphs across four classes: arch, left, right, and whorl. FIRSTMM-DB contains 41 object point clouds belonging to an object ontology of 11 categories. Letter contains 3 datasets and 15 character classes with varying levels of distortion (low, med, high) added to 2,250 letter graphs.

Social network datasets. Common social network datasets include: COLLAB [leskovec2005graphs], Deezer Ego-Nets [rozemberczki2020api], Github Stargazers [rozemberczki2020api], IMDB (BINARY, MULTI) [wei2017deep], REDDIT (BINARY, 5K, 12K, Threads) [wei2017deep, rozemberczki2020api] and Twitch Ego-Nets [rozemberczki2020api]. COLLAB is a collaboration dataset of 5,000 ego-networks across 3 domains of physics. Deezer Ego-Nets contains 9,629 user ego-nets across 2 genders from the Deezer music service. Github Stargazers contains 12,725 graphs of developers who starred either machine learning or web development repositories. IMDB BINARY contains 1,000 ego-network graphs representing actors and their collaborations across 2 movie genres. IMDB MULTI extends IMDB BINARY with 1,500 graphs and 3 movie genres. REDDIT-BINARY contains 2,000 thread graphs across two content classes (discussion and QA based). REDDIT MULTI-5K contains 4,999 thread graphs across 5 Reddit thread types. REDDIT MULTI-12K extends REDDIT-5K, containing 11,929 online discussion thread graphs across 11 classes. REDDIT Threads contains 203,088 thread graphs across 2 graph classes (discussion, non-discussion). Twitch Ego-Nets contains 127,094 ego graphs across 2 classes of Twitch users.

3 Constructing MalNet

MalNet is an ambitious project to collect and process over 1.2 million function call graphs. Below, we describe the data provenance and construction of MalNet.

3.1 Collecting Candidate Graphs

The first step in constructing MalNet was to identify a source of graphs containing the desired properties outlined in Section 2. We determined that the natural abundance, large graph size, and class diversity provided by function call graphs (FCGs) make them an ideal source of graphs. While FCGs, which represent the control flow of programs (see Figure 3), can be statically extracted from many types of software (e.g., EXE, PE, APK), we use the Android ecosystem due to its large market share [popper2017], easy accessibility [li2017androzoo++] and diversity of malicious software [nokia2019]. With the generous permission of the AndroZoo repository [allix2016androzoo, li2017androzoo++], we collected 1,262,024 Android APK files, specifically selecting APKs containing both a family and type label obtained from the Euphony classification structure [hurier2017euphony]. This process took about a week to download and 10 TB in storage space when using the maximum allowed concurrent downloads. In addition, we spent about a month collecting raw VirusTotal (VT) reports to release with MalNet, through VT’s academic access, which allows 20k queries per day. Each VT report contains up to 70 antivirus labels per graph.

3.2 Processing the Graphs

Once the APK files and labels were collected, we extracted the function call graphs by running the files through Androguard [desnos2011android], which statically analyzes the APK’s DEX file. Distributed across Google Cloud’s general-purpose (N2) machines with 16 cores running 24 hours a day, the process took about 1 week to extract the graphs. We leave each graph in its original state, retaining its edge directionality, disconnected components and node isolates (i.e., single nodes with no incident edges). On average, each graph has 17,242 nodes and 39,043 edges, and typically contains a single giant connected component, many small disconnected components, and numerous node isolates. Table 1 describes the 10 graph types (out of 47) that have the highest number of graphs. Appendix Table 5 provides a full analysis of all graph types. Each graph is stored in a standard edge list format for its wide support, readability, and ease of use. In total, the graphs’ edge list files consume over 443 GB of hard disk space. Since we are dealing with highly malicious software, our goal is to mitigate the risk of releasing information that could potentially be used to reverse engineer malware. Thus, we numerically relabel the nodes of each graph, removing any associated attribute information.
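To make the storage format and the relabeling step concrete, the following is a minimal sketch of reading a directed edge list, replacing node names with integers, and computing basic size statistics. It assumes one whitespace-separated `source target` pair per line, which may differ from the exact layout of the released files; the function names and the toy call graph are hypothetical. The average degree is computed as edges per node, an assumption that appears consistent with the means in Table 1.

```python
def read_edge_list(lines):
    """Parse 'source target' pairs into a list of directed edges.

    Assumes whitespace-separated pairs, one per line; blank lines and
    lines starting with '#' are skipped.
    """
    edges = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        if len(parts) < 2:
            raise ValueError(f"expected two node names, got: {line!r}")
        edges.append((parts[0], parts[1]))
    return edges


def relabel_numerically(edges):
    """Replace node names with consecutive integers in order of first appearance."""
    ids = {}
    relabeled = []
    for source, target in edges:
        for name in (source, target):
            if name not in ids:
                ids[name] = len(ids)
        relabeled.append((ids[source], ids[target]))
    return relabeled, ids


def summarize(edges):
    """Node count, edge count and edges-per-node for a directed edge list."""
    nodes = {node for edge in edges for node in edge}
    n_nodes, n_edges = len(nodes), len(edges)
    return {
        "nodes": n_nodes,
        "edges": n_edges,
        "avg_degree": n_edges / n_nodes if n_nodes else 0.0,
    }


# Hypothetical function names standing in for a real call graph.
raw = ["main onCreate", "onCreate loadAds", "onCreate sendSms", "loadAds sendSms"]
edges, mapping = relabel_numerically(read_edge_list(raw))
print(edges)             # [(0, 1), (1, 2), (1, 3), (2, 3)]
print(summarize(edges))  # {'nodes': 4, 'edges': 4, 'avg_degree': 1.0}
```

Because a plain list of node pairs only mentions nodes that take part in at least one edge, this sketch does not recover node isolates.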

Figure 4: MalNet Explorer. An exploration panel on the left allows users to select from the available graph types and families. Users can then visually explore each function call graph on the right. Our goal is to enable users to easily study the data without installation or download.

3.3 Online Exploration of the Data

To assist researchers and practitioners in exploring MalNet, we have designed and developed MalNet Explorer, an interactive graph exploration and visualization tool. It runs on most modern web browsers (Chrome, Firefox, Safari, and Edge), platforms (Windows, Mac OS, Linux), and devices (Android and iOS). Our goal is to enable users to easily explore the data before downloading. Figure 4 shows MalNet Explorer’s desktop web interface and its main components—(1) a hierarchical exploration panel on the left that allows the user to select from the available graph types and families; and (2) the function call graph visualization on the right. MalNet Explorer’s user interface uses a responsive design that automatically adjusts its component layout, based on the users’ device types and screen resolutions. MalNet Explorer is available online at: www.mal-net.org.

4 MalNet for New Research & Discoveries

MalNet is substantially larger than any existing graph database used for graph representation learning research, with many more graphs, much larger graphs, and many more classes of graphs. Such unprecedented advancements provide exciting opportunities to make new discoveries and explore new research directions previously not possible. In this section, we present our findings thus far, to demonstrate such possibilities. We begin by discussing the experimental setup below, followed by an overview of the graph representation techniques in Section 4.1. Section 4.2 discusses the new discoveries we found by studying MalNet; and Section 4.3 highlights new research directions enabled by MalNet.

Experimental Setup. We divide MalNet into three stratified sets of data (training, validation and test), with the training set holding 70% of the graphs (883,416; see Table 3); the split is repeated for both graph type and family labels. Each embedding technique uses a random forest model for the task of graph classification, where we run a grid search across the validation set to identify the number of estimators and tree depth. Every model is evaluated on its macro-F1 score; however, we report three performance metrics (macro-F1, precision and recall), as is typical for highly imbalanced datasets [duggal2020elf, duggal2020rest].
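To illustrate why macro-averaged metrics are informative on imbalanced data, the sketch below computes macro precision, recall and F1 from scratch by scoring each class separately and then averaging with equal weight. The function name and the toy labels are ours, and it is assumed here that the reported precision and recall are also macro-averaged.

```python
def macro_scores(y_true, y_pred):
    """Macro-averaged (precision, recall, F1) over the classes present in y_true."""
    classes = sorted(set(y_true))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(classes)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# A classifier that always predicts the majority class is 90% accurate here,
# yet its macro-F1 is below 0.5 because the rare class scores zero.
y_true = ["Adware"] * 9 + ["Click"]
y_pred = ["Adware"] * 10
precision, recall, f1 = macro_scores(y_true, y_pred)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.45 0.5 0.474
```

Each class contributes equally to the average regardless of its size, so poor performance on rare classes is not hidden by the majority class.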

4.1 Graph Representation Techniques

We present results for 6 strong, recent, scalable, and readily available graph representation techniques [shervashidze2011weisfeiler, cai2018simple, schulz2019necessity, rozemberczki2020feather, tsitsulin2020just]. We perform our experiments in Python3 using an Amazon AWS c5d.metal server equipped with 96 CPU cores and 192 GB of memory. We briefly summarize each method and its configuration below:

  1. Weisfeiler-Lehman Subtree Kernel (WL) [shervashidze2011weisfeiler] is based on the 1-dimensional WL graph isomorphism heuristic. By iteratively aggregating over local node neighborhoods the kernel summarizes the neighborhood substructures present in a graph. We use 5 iterations of hashing and degree centrality as the label.

  2. LDP [cai2018simple] is a simple representation scheme that summarizes each node and its 1-hop neighborhood using 5 degree statistics. These node features are then aggregated into a histogram where they are concatenated into feature vectors. We use the parameters suggested in [cai2018simple].

  3. NoG [schulz2019necessity] ignores the topological graph structure, viewing the graph as a two-dimensional feature vector of the node and edge count.

  4. Feather [rozemberczki2020feather] is a more complex representation scheme that uses characteristic functions of node features with random walk weights to describe node neighborhoods. We use the default parameters suggested in [rozemberczki2020feather].

  5. Slaq-VNGE [tsitsulin2020just] approximates the spectral distances between graphs based on the Von Neumann Graph Entropy (VNGE), which measures information divergence and distance between graphs [chen2019fast]. We use the default Slaq approximation parameters suggested in [tsitsulin2020just].

  6. Slaq-LSD [tsitsulin2020just] efficiently approximates NetLSD, which measures the spectral distance between graphs based on the heat kernel [tsitsulin2018netlsd]. We use the default Slaq and NetLSD approximation parameters suggested in [tsitsulin2020just].
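The two simplest methods in the list above can be sketched in a few lines. The code below is an illustration, not the configuration used in our experiments: it treats edges as undirected, takes the five LDP statistics to be a node's degree plus the minimum, maximum, mean and standard deviation of its neighbors' degrees, and uses a hypothetical fixed-width binning (`bins`, `max_degree`) in place of the parameters suggested in [cai2018simple].

```python
from statistics import mean, pstdev


def nog_features(n_nodes, edges):
    """NoG-style baseline: only the node count and the edge count."""
    return [n_nodes, len(edges)]


def ldp_node_features(n_nodes, edges):
    """Five degree statistics per node: own degree plus min, max, mean and
    standard deviation of the neighbors' degrees (edges treated as undirected)."""
    neighbors = [set() for _ in range(n_nodes)]
    for u, v in edges:
        if u != v:
            neighbors[u].add(v)
            neighbors[v].add(u)
    degree = [len(nbrs) for nbrs in neighbors]
    features = []
    for node in range(n_nodes):
        # Isolated nodes have no neighbors; use a single zero as a placeholder.
        nbr_degrees = [degree[nbr] for nbr in neighbors[node]] or [0]
        features.append([
            degree[node],
            min(nbr_degrees),
            max(nbr_degrees),
            mean(nbr_degrees),
            pstdev(nbr_degrees),
        ])
    return features


def histogram(values, bins, low, high):
    """Fixed-width histogram over [low, high]; larger values go in the last bin."""
    counts = [0] * bins
    width = (high - low) / bins or 1.0
    for value in values:
        index = min(int((value - low) / width), bins - 1)
        counts[index] += 1
    return counts


def ldp_graph_vector(n_nodes, edges, bins=4, max_degree=8):
    """Concatenate one histogram per degree statistic into a graph-level vector."""
    node_features = ldp_node_features(n_nodes, edges)
    vector = []
    for column in range(5):
        values = [row[column] for row in node_features]
        vector.extend(histogram(values, bins, 0, max_degree))
    return vector


# Star graph: node 0 is connected to nodes 1, 2 and 3.
star = [(0, 1), (0, 2), (0, 3)]
print(nog_features(4, star))          # [4, 3]
print(ldp_node_features(4, star)[0])  # [3, 1, 1, 1, 0.0]
print(len(ldp_graph_vector(4, star))) # 20
```

Both representations have a fixed length regardless of graph size, which is what allows them to be fed directly to a random forest at the scale of MalNet.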

We tested a number of alternative graph representation techniques and decided to exclude them, as they were computationally prohibitive at the scale of MalNet, making it infeasible to run the techniques over the full dataset or perform parameter selection. The excluded techniques include methods based on kernels [vishwanathan2010graph, borgwardt2005shortest, johansson2014global], spectral methods [gao2019geometric, tsitsulin2018netlsd, de2018simple, galland2019invariant, verma2017hunt], document embedding [narayanan2017graph2vec, chen2019gl2vec] and neural networks [kipf2016semi, wu2019simplifying, xu2018powerful]. For example, [gao2019geometric] would take weeks to process all 1.2 million graphs when using 96 cores of the AWS c5d.metal server (which costs $1.6k USD every 2 weeks), while [narayanan2017graph2vec] and [borgwardt2005shortest] quickly exceed the 192 GB memory capacity of the machine.


Train: 100% (883,416 graphs), 10% (88,341), 1% (8,834), 0.1% (883)

| Data | Method | 100% Macro-F1 | 100% Precision | 100% Recall | 100% Time (h) | 10% F1 | 10% P | 10% R | 1% F1 | 1% P | 1% R | 0.1% F1 | 0.1% P | 0.1% R |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Type | WL [shervashidze2011weisfeiler] | - | - | - | - | - | - | - | .18 | .36 | .16 | .11 | .06 | .07 |
| Type | LDP [cai2018simple] | .35 | .67 | .29 | 11.8 | .30 | .59 | .25 | .21 | .42 | .17 | .09 | .24 | .07 |
| Type | NoG [schulz2019necessity] | .29 | .64 | .24 | 1.2 | .26 | .47 | .22 | .17 | .25 | .15 | .06 | .08 | .06 |
| Type | Feather [rozemberczki2020feather] | .41 | .68 | .34 | 11.2 | .33 | .59 | .28 | .22 | .37 | .19 | .10 | .22 | .08 |
| Type | Slaq-VNGE [tsitsulin2020just] | .04 | .10 | .03 | 3.6 | .04 | .05 | .03 | .04 | .04 | .03 | .03 | .04 | .03 |
| Type | Slaq-LSD [tsitsulin2020just] | .31 | .62 | .24 | 3.7 | .24 | .45 | .20 | .16 | .32 | .13 | .07 | .10 | .07 |
| Family | WL [shervashidze2011weisfeiler] | - | - | - | - | - | - | - | .09 | .16 | .08 | .03 | .03 | .03 |
| Family | LDP [cai2018simple] | .34 | .56 | .28 | 11.9 | .24 | .43 | .19 | .11 | .21 | .10 | .03 | .05 | .03 |
| Family | NoG [schulz2019necessity] | .26 | .43 | .22 | 1.3 | .17 | .24 | .15 | .08 | .08 | .08 | .02 | .02 | .02 |
| Family | Feather [rozemberczki2020feather] | .34 | .54 | .29 | 11.3 | .24 | .43 | .20 | .11 | .20 | .10 | .03 | .05 | .03 |
| Family | Slaq-VNGE [tsitsulin2020just] | .01 | .01 | .01 | 3.8 | .01 | .01 | .01 | .01 | .01 | .01 | .01 | .01 | .01 |
| Family | Slaq-LSD [tsitsulin2020just] | .23 | .40 | .20 | 3.7 | .17 | .26 | .14 | .08 | .10 | .08 | .02 | .02 | .03 |

Table 3: Meta Analysis: Comparison of macro-F1, precision and recall scores achieved by each method on 4 different levels of training data (100%, 10%, 1%, and 0.1%) at both the type (low diversity, with 47 classes) and family (high diversity, with 696 classes) levels. Comparing methods across malware type and family, the classification task becomes increasingly difficult as diversity and data imbalance increase. We bold the best performing method at each level of training data.
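
| Data | Method | F1 | P | R |
|---|---|---|---|---|
| Type (2,199 graphs) | WL [shervashidze2011weisfeiler] | .82 | .84 | .81 |
| Type (2,199 graphs) | LDP [cai2018simple] | .77 | .79 | .76 |
| Type (2,199 graphs) | NoG [schulz2019necessity] | .74 | .74 | .74 |
| Type (2,199 graphs) | Feather [rozemberczki2020feather] | .81 | .85 | .80 |
| Type (2,199 graphs) | Slaq-VNGE [tsitsulin2020just] | .39 | .41 | .39 |
| Type (2,199 graphs) | Slaq-LSD [tsitsulin2020just] | .77 | .79 | .76 |

Table 4: Small-scale analysis of methods on 2,199 graphs across 10 types. Strong method performance indicates the underlying data is well-defined and labeled.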

4.2 Enabling New Discoveries

Current graph representation research uses datasets that are significantly smaller in scale, and much less diverse compared to MalNet. In light of this, we want to study what new discoveries can be made, that were previously not possible due to dataset limitations. For example, what is the impact of class imbalance and diversity in the classification process? How does dataset scale (i.e., number of graphs) impact performance? We synthesized our findings into the following 3 discoveries (D1-D3).

  D1. Less Diversity, Better Performance. Comparing methods in Table 3 across malware type (low diversity, with 47 classes) and family (high diversity, with 696 classes), the classification task becomes increasingly difficult as diversity and data imbalance increase. This trend is visible across all graph representation methods. For the best performing method, Feather, the macro-F1 score drops from 0.41 (type) to 0.34 (family). This matches our intuition from the experiments in Table 4, which also shows strong method performance when evaluating on a small subset of MalNet, containing 10 well-balanced types.

  D2. Simple Baselines Surprisingly Effective. Both NoG and LDP use basic graph statistics. Given the simplicity of these methods, they perform remarkably well, often outperforming or matching the performance of more complex methods. For example, in Table 3 we can see that LDP ties for the best performing family classification method, achieving a macro-F1 score of 0.34, while beating significantly more complex methods, e.g., Slaq-LSD. This trend holds across all levels of family training data. Using small graph databases, earlier work [cai2018simple] suggested the potential merits of considering simpler approaches. For the first time, using the largest graph database to date, our result confirms that current techniques in the literature do not capture non-attributed graph topology well.

  D3. Training Data Governs Performance. In Table 3, we observe that for every order of magnitude increase in training data, the macro-F1 scores increase by approximately 0.1. This can be seen across methods at both the type and family classification levels. In addition, certain graph types have a large jump in predictive performance (e.g., Virus, Malware++Trj), while others increase linearly (e.g., Trojan) or not at all (e.g., Backdoor). The most dramatic macro-F1 performance jump occurs at the Malware++Trj graph type, where the class is nearly indistinguishable at a lower level of training data, but easily distinguished at a higher one. It is important to keep in mind that even at the reduced level of MalNet’s training data, the Malware++Trj class contains 43 graphs, which is in fact bigger than some graph databases currently used for evaluation (e.g., FIRSTMM-DB [neumann2013graph]). Finally, we note that while the results in Table 4 are strong compared to Table 3 (0.1%-1%), this is a result of the small number of well-balanced classes used in Table 4’s experiment.
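The reduced training levels in Table 3 can be approximated by subsampling each class separately, so that the long-tailed label distribution is preserved at every level. The sketch below is one way to do this and is not necessarily the procedure used in our experiments; the function name, the fixed seed, and the choice to keep at least one graph per class are all assumptions.

```python
import random
from collections import defaultdict


def stratified_subsample(labels, fraction, seed=0):
    """Pick about `fraction` of the indices from every class,
    keeping at least one index per class (an assumption of this sketch)."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)
    chosen = []
    for label in sorted(by_class):
        members = by_class[label]
        keep = max(1, int(len(members) * fraction))
        chosen.extend(rng.sample(members, keep))
    return sorted(chosen)


# Toy label set with a 9:1 imbalance between two types.
labels = ["Adware"] * 900 + ["Click"] * 100
subset = stratified_subsample(labels, 0.1)
print(len(subset))                                     # 100
print(sum(1 for i in subset if labels[i] == "Click"))  # 10

# Training-set sizes at each order of magnitude, as listed in Table 3.
print([883_416 // d for d in (10, 100, 1000)])  # [88341, 8834, 883]
```

Keeping at least one example per class prevents rare classes from vanishing at the 0.1% level, at the cost of slightly over-representing them.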

4.3 Enabling New Research Directions

The unprecedented scale and diversity of MalNet opens up new exciting research opportunities for the graph representation community. Below, we present four promising directions (R1-R4).

Figure 5: Class-wise comparison of model predictions on 4 levels of training data (0.1%, 1%, 10%, and 100%), where each darker cell represents a higher F1 score. We observe that certain classes are more challenging to classify than others. For example, we surprisingly see that Malware++Trj (right of middle) significantly outperforms both Troj and Adsware (left of middle), which contain many more examples.

  R1. Class Hardness Exploration. Because of MalNet's large diversity, researchers can now explore why certain classes are more challenging to classify than others. For example, Figure 5 shows Malware++Trj significantly outperforming both Troj and Adsware, which contain many more examples. This result is surprising and provides strong impetus for additional research into class hardness, such as: (a) investigating whether existing methods are flexible enough to represent the diverse graph structures; and (b) studying the similarities across class types (e.g., whether Spr and Spyware should be merged). To support further development in this challenging area, we release the raw VirusTotal reports containing up to 70 labels per graph.

  R2. Imbalanced Classification Research. The natural world often follows a long-tailed data distribution where only a few classes account for most of the examples [duggal2020elf]. As evidenced in discovery D1, the long tail often causes classifiers to perform well on the majority class but poorly on rare ones. Unfortunately, imbalanced classification in the graph domain has yet to receive much attention, largely because no datasets existed to support the research. By releasing MalNet, the largest naturally imbalanced database to date, we hope to foster new interest in this important area.

  R3. Reconsidering Merits of Simpler Graph Classification Approaches. Our discovery in D2 indicates that simpler methods can match or outperform more recent and sophisticated techniques. This suggests that current techniques aiming to capture graph topology are not yet effective for non-attributed graphs, echoing results from [cai2018simple]. More broadly, our discovery demonstrates this phenomenon, for the first time, at the unprecedented scale and diversity offered by MalNet. We believe our results will inspire researchers to reconsider the merits of simpler approaches and classic techniques, and to build on them to reap their benefits.

  R4. Enabling Explainable Research. In Figure 5, we observe that certain representation techniques better capture particular graph types. For example, Feather significantly outperforms all other methods on Clicker++Trojan. This is a highly interesting result, as it could provide insight into when one technique is preferred over another (e.g., local neighborhood structure, global graph structure, graph motifs). We believe that the wide range of graph topology and substructures contained in MalNet's nearly 700 classes will enable new explainability research.
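To make the imbalance behind the imbalanced classification direction concrete, the following minimal sketch computes an imbalance ratio and inverse-frequency class weights for a handful of MalNet types. The counts are the approximate per-type graph counts from Table 5 (the larger ones are rounded to the nearest thousand there). The weighting rule, total / (number of classes × class count), is a common heuristic chosen here purely for illustration. It is an assumption, not a method used in this work, and the function names are hypothetical.

```python
# Approximate graph counts for a few MalNet types, taken from Table 5.
# Larger counts are rounded to the nearest thousand in that table.
type_counts = {
    "Adware": 884_000,
    "Trojan": 179_000,
    "Benign": 79_000,
    "Malware++Trj": 609,
    "Click": 113,
}


def imbalance_ratio(counts):
    """Ratio of the largest class size to the smallest class size."""
    return max(counts.values()) / min(counts.values())


def balanced_class_weights(counts):
    """Inverse-frequency weights: total / (num_classes * class_count).

    Illustrative heuristic only (an assumption, not the paper's method).
    Rare classes receive large weights, frequent classes small ones.
    """
    total = sum(counts.values())
    num_classes = len(counts)
    return {name: total / (num_classes * n) for name, n in counts.items()}


ratio = imbalance_ratio(type_counts)
weights = balanced_class_weights(type_counts)

# Print classes from the most frequent (smallest weight) to the rarest.
for name in sorted(weights, key=weights.get):
    print(f"{name:>14s}  weight = {weights[name]:.3f}")
print(f"imbalance ratio = {ratio:.0f}")
```

Weights of this kind could be passed to a weighted loss or used for resampling. Whether they actually improve performance on the rare types is precisely the open question this direction raises.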

5 Conclusion

The study of graph representation learning is a critical tool in the characterization and understanding of complex interconnected systems. Currently, no large-scale database exists to accurately assess the strengths and weaknesses of these techniques. To address this, we contribute a new large-scale database, MalNet, containing over 1.2 million graphs across a hierarchy of 47 types and 696 families. In the future we plan to: (1) expand the number of graphs in the database, (2) explore alternative graph representations (e.g., dependency graphs, network activity graphs), and (3) examine the effects of file packing on the classification process. We hope MalNet will become a central resource for a broad range of graph-related research. The database is available at www.mal-net.org.

6 Acknowledgements

We thank Kevin Allix and AndroZoo colleagues for generously allowing us to use their data in this research. This work was supported in part by NSF grants IIS-1563816 and CNS-1704701, GRFP (DGE-1650044), and a Raytheon research fellowship.

References

Appendix

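| Type | # graphs | # families | Nodes min | Nodes mean | Nodes max | Nodes std | Edges min | Edges mean | Edges max | Edges std | Avg. degree min | Avg. degree mean | Avg. degree max | Avg. degree std |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Adware | 884K | 250 | 7 | 14K | 211K | 16K | 4 | 31K | 605K | 38K | 0.50 | 2.21 | 6.24 | 0.36 |
| Trojan | 179K | 441 | 5 | 15K | 228K | 18K | 4 | 34K | 530K | 42K | 0.58 | 2.05 | 6.74 | 0.52 |
| Benign | 79K | 1 | 5 | 35K | 552K | 30K | 3 | 79K | 2M | 74K | 0.58 | 2.13 | 5.30 | 0.31 |
| Riskware | 32K | 107 | 5 | 12K | 173K | 16K | 4 | 30K | 334K | 39K | 0.58 | 2.16 | 5.42 | 0.56 |
| Addisplay | 17K | 38 | 37 | 13K | 98K | 15K | 37 | 28K | 246K | 34K | 0.92 | 1.97 | 4.38 | 0.37 |
| Spr | 14K | 46 | 12 | 28K | 169K | 21K | 7 | 67K | 369K | 52K | 0.58 | 2.27 | 4.70 | 0.44 |
| Spyware | 7K | 19 | 12 | 5K | 55K | 6K | 7 | 11K | 121K | 14K | 0.58 | 1.95 | 4.27 | 0.46 |
| Exploit | 6K | 13 | 19 | 24K | 102K | 14K | 14 | 45K | 250K | 30K | 0.74 | 1.88 | 3.34 | 0.33 |
| Downloader | 5K | 7 | 37 | 20K | 107K | 28K | 37 | 46K | 321K | 63K | 0.96 | 1.68 | 3.53 | 0.66 |
| Smssend++Trojan | 4K | 25 | 16 | 34K | 147K | 19K | 13 | 82K | 387K | 48K | 0.81 | 2.39 | 3.78 | 0.23 |
| Troj | 3K | 36 | 14 | 6K | 64K | 8K | 11 | 15K | 115K | 18K | 0.79 | 1.98 | 5.60 | 0.52 |
| Smssend | 3K | 12 | 15 | 20K | 111K | 14K | 12 | 49K | 337K | 38K | 0.80 | 2.34 | 4.61 | 0.47 |
| Clicker++Trojan | 3K | 3 | 220 | 6K | 29K | 3K | 471 | 14K | 72K | 7K | 1.52 | 2.33 | 2.92 | 0.18 |
| Adsware | 3K | 16 | 368 | 11K | 53K | 13K | 564 | 26K | 143K | 28K | 1.02 | 2.19 | 4.27 | 0.26 |
| Malware | 3K | 19 | 6 | 8K | 119K | 13K | 5 | 16K | 286K | 29K | 0.83 | 1.90 | 3.97 | 0.67 |
| Adware++Adware | 3K | 2 | 192 | 9K | 55K | 6K | 289 | 20K | 138K | 16K | 1.49 | 2.16 | 3.17 | 0.27 |
| Rog | 2K | 22 | 26 | 15K | 102K | 19K | 31 | 35K | 232K | 46K | 0.91 | 2.05 | 4.79 | 0.49 |
| Spy | 2K | 7 | 48 | 22K | 107K | 15K | 44 | 49K | 271K | 40K | 0.92 | 2.17 | 3.07 | 0.25 |
| Monitor | 1K | 5 | 329 | 4K | 41K | 5K | 580 | 7K | 102K | 12K | 1.53 | 1.83 | 3.09 | 0.21 |
| Ransom++Trojan | 1K | 7 | 556 | 51K | 139K | 22K | 965 | 115K | 319K | 48K | 1.59 | 2.26 | 2.59 | 0.21 |
| Banker++Trojan | 1K | 6 | 29 | 33K | 103K | 16K | 36 | 72K | 237K | 38K | 1.22 | 2.15 | 2.99 | 0.24 |
| Trj | 940 | 18 | 29 | 13K | 171K | 16K | 36 | 30K | 402K | 39K | 1.15 | 2.20 | 4.44 | 0.49 |
| Gray | 922 | 10 | 51 | 16K | 66K | 13K | 56 | 39K | 153K | 31K | 0.88 | 2.09 | 4.33 | 0.58 |
| Adware++Grayware++Virus | 835 | 4 | 22 | 6K | 84K | 13K | 20 | 14K | 193K | 29K | 0.86 | 2.79 | 3.17 | 0.34 |
| Fakeinst++Trojan | 718 | 10 | 51 | 15K | 94K | 17K | 58 | 37K | 229K | 44K | 0.99 | 2.12 | 2.84 | 0.48 |
| Malware++Trj | 609 | 1 | 52K | 52K | 56K | 596 | 118K | 119K | 128K | 1K | 2.28 | 2.28 | 2.29 | 0 |
| Backdoor | 602 | 10 | 25 | 13K | 146K | 22K | 21 | 33K | 427K | 57K | 0.84 | 2.19 | 3.55 | 0.37 |
| Dropper++Trojan | 592 | 8 | 47 | 5K | 67K | 7K | 50 | 11K | 175K | 18K | 1.06 | 1.98 | 3.92 | 0.70 |
| Trojandownloader | 568 | 7 | 1K | 38K | 102K | 19K | 2K | 86K | 258K | 45K | 1.34 | 2.19 | 2.54 | 0.21 |
| Hacktool | 542 | 7 | 668 | 17K | 41K | 9K | 2K | 37K | 92K | 20K | 1.63 | 2.21 | 3.64 | 0.25 |
| Fakeapp | 425 | 5 | 24 | 4K | 50K | 7K | 21 | 8K | 107K | 16K | 0.88 | 1.67 | 2.79 | 0.37 |
| Clickfraud++Riskware | 369 | 5 | 2K | 18K | 20K | 2K | 4K | 38K | 43K | 5K | 1.95 | 2.13 | 2.25 | 0.04 |
| Adload | 333 | 4 | 2K | 19K | 53K | 18K | 4K | 48K | 149K | 48K | 1.46 | 2.29 | 3.13 | 0.40 |
| Addisplay++Adware | 294 | 1 | 3K | 20K | 50K | 9K | 6K | 41K | 108K | 20K | 1.65 | 2.03 | 2.45 | 0.21 |
| Adware++Virus | 274 | 9 | 38 | 15K | 59K | 15K | 38 | 33K | 138K | 35K | 1 | 2.22 | 3.17 | 0.54 |
| Clicker | 265 | 5 | 47 | 3K | 75K | 7K | 43 | 6K | 190K | 17K | 0.91 | 1.62 | 3.32 | 0.51 |
| Fakeapp++Trojan | 256 | 1 | 44 | 21K | 72K | 15K | 39 | 41K | 162K | 34K | 0.88 | 1.74 | 2.30 | 0.27 |
| Riskware++Smssend | 247 | 7 | 12 | 2K | 60K | 6K | 7 | 5K | 154K | 14K | 0.58 | 1.68 | 3 | 0.45 |
| Rootnik++Trojan | 223 | 5 | 210 | 16K | 84K | 21K | 395 | 39K | 197K | 50K | 1.15 | 2.59 | 3.21 | 0.47 |
| Worm | 220 | 7 | 64 | 14K | 94K | 15K | 78 | 31K | 204K | 34K | 0.99 | 1.99 | 3.42 | 0.40 |
| Fakeangry | 211 | 2 | 516 | 6K | 98K | 11K | 946 | 15K | 279K | 29K | 1.70 | 2.35 | 3.29 | 0.27 |
| Virus | 191 | 3 | 681 | 15K | 80K | 19K | 1K | 35K | 177K | 46K | 1.32 | 2.12 | 3.18 | 0.33 |
| Trojandropper | 178 | 4 | 220 | 20K | 78K | 18K | 236 | 39K | 185K | 39K | 1.03 | 1.83 | 4.36 | 0.32 |
| Adwareare | 152 | 3 | 893 | 26K | 57K | 14K | 2K | 60K | 144K | 32K | 1.88 | 2.25 | 2.60 | 0.20 |
| Risktool++Riskware++Virus | 152 | 3 | 37 | 16K | 65K | 16K | 37 | 36K | 158K | 37K | 1 | 1.92 | 3.17 | 0.48 |
| Spy++Trojan | 119 | 5 | 54 | 31K | 118K | 25K | 66 | 75K | 293K | 61K | 1.22 | 2.31 | 3.26 | 0.37 |
| Click | 113 | 1 | 2K | 4K | 12K | 2K | 4K | 8K | 26K | 4K | 1.80 | 2.04 | 2.74 | 0.21 |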

Table 5: Descriptive statistics for graph types in MalNet.
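As a rough illustration of how statistics like those in Table 5 could be derived, the sketch below computes per-graph node count, edge count, and average degree from directed edge lists, then summarizes them with min, mean, max, and standard deviation. Two points are assumptions rather than facts from the paper: average degree is taken to be edges divided by nodes (which is consistent with the mean columns in the table), and the population standard deviation is used. The toy graphs and function names are hypothetical.

```python
from statistics import mean, pstdev


def graph_stats(edges):
    """Return (num_nodes, num_edges, avg_degree) for a directed edge list.

    Assumption: average degree = num_edges / num_nodes. Nodes are inferred
    from edge endpoints, so isolated nodes are not counted, and the edge
    list must be non-empty.
    """
    nodes = {u for u, _ in edges} | {v for _, v in edges}
    num_nodes, num_edges = len(nodes), len(edges)
    return num_nodes, num_edges, num_edges / num_nodes


def summarize(values):
    """Min, mean, max and population standard deviation of a sequence."""
    return {
        "min": min(values),
        "mean": mean(values),
        "max": max(values),
        "std": pstdev(values),
    }


# Two hypothetical function call graphs standing in for one graph type.
toy_graphs = [
    [("main", "init"), ("main", "run"), ("run", "log")],  # 4 nodes, 3 edges
    [("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")],     # 3 nodes, 4 edges
]

per_graph = [graph_stats(g) for g in toy_graphs]
node_summary = summarize([n for n, _, _ in per_graph])
edge_summary = summarize([e for _, e, _ in per_graph])
degree_summary = summarize([d for _, _, d in per_graph])

print("nodes ", node_summary)
print("edges ", edge_summary)
print("degree", degree_summary)
```

Because the summary is taken over per-graph values, the mean average degree of a type need not equal its mean edge count divided by its mean node count.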