Pangolin: An Efficient and Flexible Graph Mining System on CPU and GPU

11/16/2019 ∙ by Xuhao Chen, et al. ∙ The University of Texas at Austin

There is growing interest in graph mining algorithms such as motif counting. Generic graph mining systems have been developed to provide unified interfaces for programming these algorithms. However, existing systems take minutes or even hours to mine even simple patterns in moderate-sized graphs, which significantly limits their real-world usability. We present Pangolin, a high-performance and flexible in-memory graph mining framework targeting both shared-memory CPUs and GPUs. Pangolin is the first graph mining system that supports GPU processing. We provide a simple embedding-centric programming interface based on the extend-reduce-filter model, which enables users to express application-specific knowledge such as aggressive pruning of the enumeration search space and elimination of isomorphism tests. We also describe novel optimizations that exploit locality, reduce memory consumption, and mitigate overheads of dynamic memory allocation and synchronization. Evaluation on a 28-core CPU demonstrates that Pangolin outperforms Arabesque and RStream, two state-of-the-art graph mining frameworks, by 49x and 88x on average, respectively. Acceleration on a V100 GPU further improves the performance of Pangolin by 15x on average. Compared to state-of-the-art hand-optimized mining applications, Pangolin provides competitive performance with much less programming effort.




1 Introduction

Applications that use graph data are becoming increasingly important in many fields such as the world wide web, advertising, social media, and biology. Graph analytics algorithms such as PageRank and SSSP have been studied extensively and many frameworks have been proposed to provide both high performance and high productivity [Pregel, GraphLab, PowerGraph, Galois, Ligra, Gemini, Gluon]. Another important class of graph problems deals with graph mining or graph pattern mining, which has applications in areas such as chemical engineering [chemical], bioinformatics [Motifs1, Protein], and social sciences [Social]. Graph mining discovers relevant patterns in a given graph. One example is triangle counting, which is used to mine graphs in security applications [Voegele2017]. Another example is motif counting [Motifs2, Motif3], which counts the frequency of certain structural patterns; this is useful in evaluating network models or classifying vertex roles.

Fig. 1 illustrates the 3-vertex and 4-vertex motifs.

Figure 1: 3-vertex (top) and 4-vertex (bottom) motifs.

Compared to graph analytics algorithms, graph mining algorithms are more difficult to implement on parallel platforms; for example, unlike graph analytics algorithms, they usually generate enormous amounts of intermediate data. Systems such as Arabesque [Arabesque] and RStream [RStream] have been developed to provide abstractions for improving programmer productivity. Instead of the vertex-centric model used in graph analytics systems [Pregel], Arabesque proposed an embedding-centric programming model. In Arabesque, computation is applied on individual embeddings (i.e., subgraphs) concurrently. It provides a simple programming interface that substantially reduces the complexity of application development. However, existing systems suffer dramatic performance loss compared to hand-optimized implementations. For example, Arabesque and RStream take 98s and 39s respectively to count 3-cliques for the Patent graph with 2.7M vertices and 28M edges, while a custom solver (KClist) [KClique] counts them in only 0.16s. This huge performance gap significantly limits the usability of existing mining frameworks in real-world applications.

The first reason for this poor performance is that existing graph mining systems provide limited support for application-specific customization. The state-of-the-art systems focus on generality and provide high-level abstractions to the user for ease of programming. Therefore, they hide as many execution details as possible from the user, which substantially limits the flexibility for algorithmic customization. The complexity of mining algorithms is primarily due to combinatorial enumeration of embeddings and isomorphism tests to find canonical patterns. Hand-optimized implementations exploit application-specific knowledge to aggressively prune the enumeration search space, elide isomorphism tests, or both. Mining frameworks need to support such optimizations to match the performance of hand-optimized applications.

The second reason for poor performance is inefficient implementation of parallel operations and data structures. Programming parallel processors requires exploring trade-offs between synchronization overhead, memory management, load balancing, and data locality. However, the state-of-the-art graph mining systems target either distributed or out-of-core platforms, and thus are not well optimized for shared-memory multicore/manycore architectures.

In this paper, we present Pangolin, a high-performance in-memory graph mining framework. Pangolin provides a simple yet flexible embedding-centric programming interface, based on the extend-reduce-filter model, which enables application-specific customization. Application developers can implement aggressive pruning strategies to reduce the enumeration search space, and apply customized pattern classification methods to elide generic isomorphism tests.

To make full use of parallel hardware, we optimize common parallel operations and data structures, and provide helper routines to the users to compose higher level operations. Pangolin is built as a lightweight layer on top of the Galois [Galois] parallel library and LonestarGPU [LonestarGPU] infrastructure, targeting both shared-memory multicore CPUs and GPUs. Pangolin includes novel optimizations that exploit locality, reduce memory consumption, and mitigate overheads of dynamic memory allocation and synchronization.

Experimental results on a 28-core CPU demonstrate that Pangolin outperforms existing state-of-the-art graph mining frameworks, Arabesque and RStream, by 49x and 88x on average, respectively. Furthermore, Pangolin on a V100 GPU outperforms Pangolin on the 28-core CPU by 15x on average. Pangolin provides performance competitive with state-of-the-art hand-optimized graph mining applications, but with much less programming effort. To mine 4-cliques in a real-world web-crawl graph (gsh) with 988 million vertices and 51 billion edges, Pangolin takes hours on a 48-core Intel Optane PMM machine [optane] with 6 TB of (byte-addressable) memory. To the best of our knowledge, this is the largest graph on which 4-cliques have been mined.

In summary, Pangolin makes the following contributions:

  • We investigate the performance gap between state-of-the-art graph mining systems and hand-optimized approaches, and point out two key application-specific features absent in existing mining systems: pruning enumeration space and eliding isomorphism tests.

  • We present a high-performance in-memory graph mining framework, Pangolin, which enables application-specific optimizations and provides transparent parallelism on CPUs and GPUs. To the best of our knowledge, this is the first graph mining system that supports GPU processing.

  • We propose novel techniques that enable the user to aggressively prune the enumeration search space and elide isomorphism tests.

  • We propose novel optimizations that exploit locality, reduce memory consumption, and mitigate the overheads of dynamic memory allocation and synchronization on both CPUs and GPUs.

  • We evaluate Pangolin on a multicore CPU and a GPU to demonstrate that Pangolin is not only substantially faster than existing mining frameworks but also competitive with hand-optimized applications, while requiring less programming effort.

The rest of this paper is organized as follows. Section 2 provides background on existing mining frameworks and hand-optimized implementations. Section 3 introduces the Pangolin API and execution model. Section 4 describes how Pangolin enables application-specific optimizations using its API. Section 5 describes Pangolin’s implementation and architectural optimizations. Section 6 presents our evaluation of Pangolin and state-of-the-art mining applications and frameworks. Related work is discussed in Section 7 and conclusions are presented in Section 8.

2 Background and Motivation

In this section, we start with describing graph mining concepts and applications. We discuss algorithmic and architectural optimizations in state-of-the-art hand-optimized mining applications. Lastly, we introduce the performance limitations of existing frameworks.

2.1 Graph Mining Concepts

Fig. 2 illustrates the basic concepts of graph mining. We are given an input graph that needs to be mined. A pattern is a template, while an embedding is an instance of that pattern in the graph (i.e., an appearance of the pattern). For the input graph and pattern in Fig. 2, there are four embeddings, shown on the right of the figure. In a graph mining problem, the user may provide a specific pattern such as triangle or clique, or require all “interesting” patterns to be discovered or counted. What makes a pattern interesting is defined by the user application; for example, in frequent subgraph mining, users are interested in patterns that appear frequently (above a given threshold) in the input graph. A measure of the pattern frequency in the input graph, termed support, is defined by the user.

Figure 2: An example of the graph mining problem. In the input graph, colors represent vertex labels, and numbers denote vertex IDs. The 3-vertex pattern is a blue-red-green chain, and there are four embeddings of this pattern in the input graph. To calculate the minimum image-based (MNI) support, we count the number of distinct mappings for each pattern vertex; the MNI support of the pattern is the minimum of these counts.

There are two types of graph mining problems targeting two types of embeddings. In a vertex-induced embedding, a set of vertices is given and the subgraph of interest is obtained from these vertices and the set of edges in the input graph connecting these vertices. Triangle counting uses vertex-induced embeddings. In an edge-induced embedding, a set of edges is given and the subgraph is formed by including all the endpoints of these edges in the input graph. Frequent subgraph mining is an edge-induced mining problem.
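The distinction between the two embedding types can be made concrete with a minimal C++ sketch. The type aliases and the function names `vertexInduced` and `edgeInducedVertices` are illustrative assumptions and are not part of Pangolin.

```cpp
#include <cassert>
#include <set>
#include <utility>
#include <vector>

using Vertex = int;
using Edge = std::pair<Vertex, Vertex>;  // undirected edge, stored once

// Vertex-induced: keep every input edge whose two endpoints are both chosen.
std::set<Edge> vertexInduced(const std::vector<Edge>& graph,
                             const std::set<Vertex>& chosen) {
  std::set<Edge> sub;
  for (const Edge& e : graph)
    if (chosen.count(e.first) && chosen.count(e.second)) sub.insert(e);
  return sub;
}

// Edge-induced: the vertex set is the set of endpoints of the chosen edges;
// the edge set is exactly the chosen edges.
std::set<Vertex> edgeInducedVertices(const std::vector<Edge>& chosenEdges) {
  std::set<Vertex> verts;
  for (const Edge& e : chosenEdges) {
    verts.insert(e.first);
    verts.insert(e.second);
  }
  return verts;
}
```

On a triangle, choosing all three vertices yields a vertex-induced subgraph with three edges, whereas choosing two of its edges yields an edge-induced subgraph on the same three vertices with only two edges (a wedge).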

A graph mining algorithm enumerates the embeddings of the given pattern(s). If duplicate embeddings exist (automorphism), then the algorithm chooses one of them as the canonical one and collects statistical information about these canonical embeddings, such as the total count. The number of embeddings in a graph grows exponentially with the embedding size (number of vertices or edges), so enumeration is computationally expensive and consumes large amounts of memory. In addition, a graph isomorphism test is needed for each embedding to determine whether it is isomorphic to a pattern. Unfortunately, the graph isomorphism problem is not known to be solvable in polynomial time [Garey], and implementing an isomorphism test is compute and memory intensive [Bliss].

2.2 Hand-Optimized Mining Applications

We consider four applications: triangle counting (TC), clique finding (CF), motif counting (MC), and frequent subgraph mining (FSM). Given an undirected input graph, TC counts the number of triangles, while CF enumerates all complete subgraphs (i.e., cliques) contained in the graph. TC is a special case of CF as it counts 3-cliques. MC counts the number of occurrences (i.e., the frequency) of each structural pattern (also known as a motif or graphlet). As shown in Fig. 1, the $k$-clique is one of the patterns in the $k$-motifs.

Footnote 1: A $k$-vertex complete subgraph is a connected subgraph in which each vertex has degree $k-1$ (i.e., each vertex is connected to all other vertices in the subgraph).

FSM finds frequent patterns in a labeled graph. A measure of frequency called support is provided by the application developer, and all patterns with support above a given threshold are considered to be frequent and must be discovered. A simple definition of support is the count of the embeddings associated with the pattern (used by TC, CF, and MC). A more widely used definition is minimum image-based (MNI) support, also known as domain support. As shown in Fig. 2, it is calculated as the minimum number of distinct mappings for any vertex (i.e., domain) in the pattern over all embeddings of the pattern.

Footnote 2: MNI support has the anti-monotonic property, which states that the support of a supergraph should not exceed the support of a subgraph; this allows us to stop extending embeddings as soon as they are recognized as infrequent.
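The MNI definition above can be sketched in a few lines of C++. This is a minimal illustration, not Pangolin's implementation; it assumes each embedding is stored as a vector in which position i holds the graph vertex mapped to pattern vertex i, and the name `mniSupport` is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

using Vertex = int;
// Assumption: embedding[i] is the graph vertex mapped to pattern vertex i.
using Embedding = std::vector<Vertex>;

// MNI (domain) support: for each pattern vertex, count the distinct graph
// vertices mapped to it across all embeddings, then take the minimum.
std::size_t mniSupport(const std::vector<Embedding>& embeddings,
                       std::size_t patternSize) {
  if (embeddings.empty() || patternSize == 0) return 0;
  std::vector<std::set<Vertex>> domains(patternSize);
  for (const Embedding& emb : embeddings)
    for (std::size_t i = 0; i < patternSize; ++i) domains[i].insert(emb[i]);
  std::size_t support = domains[0].size();
  for (const auto& d : domains) support = std::min(support, d.size());
  return support;
}
```

Note that the MNI support can be much smaller than the embedding count: four embeddings whose middle vertex takes only two distinct values have an MNI support of at most two.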

Several hand-optimized implementations exist for each of these applications on multicore CPU [Shun, PGD, BeyondTri, Bressan, ParFSM], GPU [Green, MotifGPU, CUDA-MEME, GpuFSM], distributed CPU [Suri, PDTL, DistGraph], and multi-GPU [TriCore, DistTC, Rossi]. They employ application-specific optimizations to reduce algorithm complexity. The complexity of mining algorithms is primarily due to two aspects: combinatorial enumeration and isomorphism test. Therefore, hand-optimized implementations focus on either pruning the enumeration search space or eliding isomorphism test or both. We describe some of these techniques briefly below.

Pruning Enumeration Search Space: State-of-the-art TC [TriCore], CF [KClique], and MC [ESCAPE] solvers convert the input graph into a directed acyclic graph (DAG) to significantly reduce the search space. In general mining applications, new embeddings are generated by extending existing embeddings, and they may then be discarded because they are either not interesting or a duplicate (automorphism). However, in some applications like CF [KClique], duplicate embeddings can be detected eagerly before extending current embeddings, based on properties of the current embeddings. We term this optimization eager pruning. Eager pruning can significantly reduce the search space.

Eliding Isomorphism Test: In most hand-optimized TC, CF, and MC solvers, the isomorphism test is completely avoided by taking advantage of the pattern characteristics. For example, a parallel MC solver, PGD [PGD], uses an ad-hoc method for a specific $k$. Since it only counts 3-vertex and 4-vertex motifs, all the patterns (two 3-motifs and six 4-motifs as shown in Fig. 1) are known in advance. Therefore, some special (and thus easy-to-count) patterns (e.g., cliques) are counted first, and the frequencies of other patterns are obtained in constant time using the relationship among patterns. In this case, no isomorphism test is needed, and the approach is typically an order of magnitude faster [PGD].

Footnote 3: Cliques can be identified by checking connectivity among vertices without a generic isomorphism test.

Footnote 4: For example, the count of diamonds can be computed directly from the counts of triangles and 4-cliques [PGD].
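To illustrate how one pattern count can be derived from another without any isomorphism test, the following C++ sketch handles the two 3-motifs. It is not PGD's method for 4-motifs; it relies only on the standard identity that a vertex of degree d is the center of d(d-1)/2 two-edge paths and that each triangle contains three such paths. The adjacency-matrix input format and the function names are assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumption: a small undirected simple graph given as an adjacency matrix.
using Graph = std::vector<std::vector<bool>>;

// Count triangles with plain connectivity checks on triples u < v < w.
std::uint64_t countTriangles(const Graph& g) {
  std::uint64_t t = 0;
  const std::size_t n = g.size();
  for (std::size_t u = 0; u < n; ++u)
    for (std::size_t v = u + 1; v < n; ++v) {
      if (!g[u][v]) continue;
      for (std::size_t w = v + 1; w < n; ++w)
        if (g[u][w] && g[v][w]) ++t;
    }
  return t;
}

// Induced wedges follow from the degrees and the triangle count alone:
// (sum over vertices of d*(d-1)/2) minus 3 * triangles.
std::uint64_t countWedges(const Graph& g, std::uint64_t triangles) {
  std::uint64_t paths = 0;
  for (const auto& row : g) {
    std::uint64_t d = 0;
    for (bool adjacent : row) d += adjacent ? 1 : 0;
    if (d >= 2) paths += d * (d - 1) / 2;
  }
  return paths - 3 * triangles;
}
```

Once triangles are counted, the wedge count costs only a pass over the degrees, which mirrors the idea of counting the easy pattern first and deriving the others.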

Summary: Most of the algorithmic optimizations exploit application-specific knowledge, which can only be enabled by application developers. A generic mining framework should be flexible enough to allow users to compose as many of these optimization techniques as possible, and provide parallelization support for ease of programming. Pangolin is the first mining framework to do so.

2.3 Existing Mining Frameworks

Existing graph mining systems target either distributed-memory [Arabesque, Fractal, ASAP] or out-of-core [RStream] platforms, and they make tradeoffs specific for their targeted architectures. None of them target in-memory mining on a multicore CPU or a GPU. Consequently, they do not pay much attention to reducing the synchronization overheads among threads within a CPU/GPU or reducing memory consumption overheads. Due to this, naively porting these mining systems to run on a multicore CPU or GPU would lead to inefficient implementations. We first describe two of these mining systems briefly and then discuss their major limitations.

Arabesque [Arabesque] is a distributed graph mining system. It proposes the “think like an embedding” (TLE) programming paradigm, where computation is performed in an embedding-centric manner. It defines a filter-process computation model which consists of two functions: (1) filter, which indicates whether an embedding should be processed, and (2) process, which examines an embedding and may produce some output.

RStream [RStream] is an out-of-core graph mining system on a single machine. Its programming model is based on relational algebra. Users specify how to generate embeddings using relational operations like select, join, and aggregate. It stores intermediate data (i.e., the embeddings) on disk while the input graph data is kept in memory for reuse. It streams data (or table) from disk and uses relational operations that may produce more intermediate data, which is stored back on disk.

Limitations in API: Most of the application-specific optimizations, such as pruning the enumeration search space and avoiding isomorphism tests, are missing in existing mining frameworks, as they focus on providing high-level abstractions but lack support for application-specific customization. The absence of such key optimizations in existing systems results in a huge performance gap when compared to hand-optimized implementations. Moreover, some frameworks like RStream support only edge-induced embeddings, but for applications like CF, the enumeration search space is much smaller with vertex-induced exploration than with edge-induced exploration.

Data Structures for Embeddings: Data structures used to store embeddings in existing mining systems are not efficient. Both Arabesque and RStream store embeddings in an array of structures (AoS), where each embedding structure consists of a vertex set and an edge set. Arabesque also proposes a space-efficient data structure called the Overapproximating Directed Acyclic Graph (ODAG), but it requires an extra canonical check for each embedding, which has been demonstrated to be very expensive for large graphs [Arabesque].

Materialization of Data Structures: The list or array of intermediate embeddings is always materialized, in memory in Arabesque and on disk in RStream. This has significant overheads as the size of such data grows exponentially. Such materialization may not be needed if the embeddings can be filtered or processed immediately.

Dynamic Memory Allocation: The number of (intermediate) embeddings is not known before executing the mining algorithm, so memory for them needs to be allocated dynamically. Moreover, during parallel execution, different threads might allocate memory for embeddings they create or enumerate. Existing systems use standard (std) maps and sets, which internally use a global lock to dynamically allocate memory. This limits performance and scaling significantly.

Summary: Existing graph mining systems have limitations in their API, execution model, and implementation. Pangolin addresses these issues by permitting application-specific optimizations in its API, optimizing the execution model, and providing an efficient, scalable implementation on multicore CPU and GPU. These optimizations can be applied to existing embedding-centric systems like Arabesque.

Figure 3: An example of vertex extension.

3 Design of Pangolin

This section introduces our graph mining framework, Pangolin, which uses the “think like an embedding” (TLE) programming paradigm. First, we describe the execution model used in Pangolin. We then introduce the programming interface (i.e. APIs) provided to the user. Finally, we present examples of applications expressed in Pangolin.

Figure 4: The reduction operation that calculates pattern frequency using a pattern map.

3.1 Execution Model

Algorithm 1 describes the execution engine in Pangolin, which illustrates our extend-reduce-filter execution model. To begin with, a worklist of embeddings is initialized with all the single-edge embeddings (line 4). The engine then works in an iterative fashion (line 6). In each iteration, i.e., level, there are three phases: Extend (line 8), Reduce (line 10), and Filter (line 12). Pangolin exposes necessary details in each phase to enable a more flexible programming interface (Section 3.2) than existing systems; for example, Pangolin exposes the Extend phase, which is implicit in Arabesque.

The Extend phase takes each embedding in the input worklist and extends it with a vertex (vertex-induced) or an edge (edge-induced). The newly generated embeddings then form the output worklist for the next level. The embedding size grows with each level until the user-defined maximum size is reached (line 14). Fig. 3 shows an example of the first iteration of vertex-based extension. The input worklist consists of all the 2-vertex (i.e., single-edge) embeddings. For each embedding in the worklist, one vertex is added to yield a 3-vertex embedding. For example, the first 2-vertex embedding is extended to two new 3-vertex embeddings.

1: procedure MineEngine(G(V, E), MAX_SIZE)
2:     EmbeddingList in_wl, out_wl            ▷ double buffering
3:     PatternMap p_map
4:     Init(in_wl)                            ▷ insert single-edge embeddings
5:     level ← 1
6:     while true do
7:          out_wl ← ∅                        ▷ clear the new worklist
8:          Extend(in_wl, out_wl)
9:          p_map ← ∅                         ▷ clear the pattern map
10:         Reduce(out_wl, p_map)
11:         in_wl ← ∅                         ▷ clear the old worklist
12:         Filter(out_wl, p_map, in_wl)
13:         level ← level + 1
14:         if level = MAX_SIZE - 1 then
15:              break                        ▷ termination condition
16:     return in_wl, p_map
Algorithm 1 Execution Model for Mining

After vertex/edge extension, a Reduce phase is used to extract some pattern-based statistical information, i.e., pattern frequency or support, from the embedding worklist. The Reduce phase first classifies all the embeddings in the worklist into different categories according to their patterns, and then computes the support for each pattern category, forming pattern-support pairs. All the pairs together constitute a pattern map (the output of Reduce in line 10). Fig. 4 shows an example of the reduction operation. The three embeddings (top) can be classified into two categories, i.e., triangle and wedge (bottom). Within each category, this example counts the number of embeddings to calculate the support. As a result, we get the pattern map {[triangle, 2], [wedge, 1]}. After reduction, a Filter phase may be needed to remove those embeddings in which the user is no longer interested; e.g., FSM removes infrequent embeddings in this phase.
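The reduction described above can be sketched sequentially in C++ for the special case of 3-vertex embeddings, where counting edges is enough to tell a triangle from a wedge. All names here (`EdgeSet`, `addEdge`, `reduce`, the string pattern keys) are illustrative assumptions rather than Pangolin's data structures; the support of each embedding is 1 and aggregation is addition.

```cpp
#include <array>
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

using Vertex = int;
using Embedding = std::array<Vertex, 3>;  // connected 3-vertex embeddings only
using EdgeSet = std::set<std::pair<Vertex, Vertex>>;  // both directions stored
using PatternMap = std::map<std::string, long>;

void addEdge(EdgeSet& g, Vertex u, Vertex v) {
  g.insert({u, v});
  g.insert({v, u});
}

bool isConnected(const EdgeSet& g, Vertex u, Vertex v) {
  return g.count({u, v}) > 0;
}

// Classify by edge count: three edges form a triangle, two form a wedge.
std::string getPattern(const EdgeSet& g, const Embedding& e) {
  int edges = 0;
  if (isConnected(g, e[0], e[1])) ++edges;
  if (isConnected(g, e[1], e[2])) ++edges;
  if (isConnected(g, e[0], e[2])) ++edges;
  return edges == 3 ? "triangle" : "wedge";
}

// Reduce: group embeddings by pattern and sum a support of 1 per embedding.
PatternMap reduce(const EdgeSet& g, const std::vector<Embedding>& worklist) {
  PatternMap pmap;
  for (const Embedding& e : worklist) pmap[getPattern(g, e)] += 1;
  return pmap;
}
```

A parallel version would need per-thread maps or a concurrent map; this sketch deliberately ignores that concern.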

Note that Reduce and Filter phases are not necessary for all applications, and they can be disabled by the user. If they are used, they are also executed soon after initializing single-edge embeddings (line 4). However, we omit this from Algorithm 1 due to lack of space. If Reduce is enabled but Filter is disabled, then reduction is only required and executed for the last iteration, as the pattern map produced by reduction is not used in prior iterations (dead code).

1: procedure Extend(in_wl, out_wl)
2:     for each embedding emb ∈ in_wl in parallel do
3:          for each vertex v in emb do
4:               if toExtend(emb, v) = true then
5:                    for each vertex u in adj(v) do
6:                         if toAdd(emb, u) = true then
7:                              insert emb ∪ u to out_wl
8:
9: procedure Reduce(queue, p_map)
10:
11:     for each embedding emb ∈ queue in parallel do
12:          Pattern pt ← getPattern(emb)
13:          Support sp ← getSupport(emb)
14:          p_map[pt] ← Aggregate(p_map[pt], sp)
15:
16: procedure Filter(in_wl, p_map, out_wl)
17:     for each embedding emb ∈ in_wl in parallel do
18:          if toPrune(emb, p_map) = false then
19:               insert emb to out_wl
Algorithm 2 Compute Phases in Vertex-induced Mining
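As a rough illustration of how the Extend procedure of Algorithm 2 consults the user-defined functions, here is a sequential C++ sketch. The `Hooks` struct, the adjacency-list `Graph` type, and the function name `extend` are assumptions made for this example; Pangolin's engine runs the outer loop in parallel and uses its own embedding structures.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <vector>

using Vertex = int;
using Embedding = std::vector<Vertex>;
using Graph = std::vector<std::vector<Vertex>>;  // adjacency lists
using Worklist = std::vector<Embedding>;

// Hypothetical container for the two user-defined Extend-phase functions.
struct Hooks {
  std::function<bool(const Embedding&, Vertex)> toExtend;
  std::function<bool(const Embedding&, Vertex)> toAdd;
};

// Sequential Extend for vertex-induced mining: for every vertex v of an
// embedding that passes toExtend, try each neighbor u that passes toAdd.
Worklist extend(const Graph& g, const Worklist& in, const Hooks& h) {
  Worklist out;
  for (const Embedding& emb : in) {
    for (Vertex v : emb) {
      if (!h.toExtend(emb, v)) continue;
      for (Vertex u : g[v]) {
        if (!h.toAdd(emb, u)) continue;
        Embedding next = emb;  // copy, then append the new vertex
        next.push_back(u);
        out.push_back(next);
      }
    }
  }
  return out;
}
```

Returning false from either hook prevents the candidate embedding from ever being constructed, which is how the search space shrinks.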

3.2 Programming Interface

Pangolin exposes flexible and simple interfaces to the user to express application-specific optimizations. Listing 1 lists the user-defined functions (APIs) and Algorithm 2 describes how these functions (marked in blue) are invoked by the Pangolin execution engine. A specific application can be created by defining these APIs. Note that not all the functions are mandatory; each of them has a default return value.

In the Extend phase, we provide two functions, toAdd and toExtend, for the user to prune embedding candidates aggressively. When they return false, the execution engine avoids generating an embedding and thus the search space is reduced. More specifically, toExtend checks whether a vertex in the current embedding needs to be extended. Extended embeddings can have duplicates due to automorphism. Fig. 5 illustrates automorphism: two different embeddings can be extended into the same embedding. Therefore, only one of them (the canonical embedding) should be kept, and the other (the redundant one) should be removed. This is done by a canonical test in toAdd, which checks whether the newly generated embedding is a qualified candidate. An embedding is not qualified when it is a duplicate or it does not have certain user-defined characteristics. Application-specific knowledge can be used to specialize the two functions. If left undefined, toExtend returns true and toAdd does a default canonical test. Note that the user specifies whether the embedding exploration is vertex-induced or edge-induced. The only difference for edge-induced extension is in lines 5 to 7 of Algorithm 2: instead of vertices adjacent to the vertex being extended, edges incident on it are used.

bool toExtend(Embedding emb, Vertex v);
bool toAdd(Embedding emb, Vertex u);
bool toAdd(Embedding emb, Edge e);
Pattern getPattern(Embedding emb);
Pattern getCanonicalPattern(Pattern pt);
Support getSupport(Embedding emb);
Support Aggregate(Support s1, Support s2);
bool toPrune(Embedding emb);
Listing 1: User-defined functions in Pangolin.
Figure 5: An example of automorphism.

In the Reduce phase, getPattern function specifies how to obtain the pattern of an embedding. Finding the canonical pattern of an embedding involves an expensive isomorphism test. This can be specialized using application-specific knowledge to avoid such tests. If left undefined, a canonical pattern is returned by getPattern. In this case, to reduce the overheads of invoking the isomorphism test, embeddings in the worklist are first reduced using their quick patterns [Arabesque], and then quick patterns are aggregated using their canonical patterns. In addition, getSupport and Aggregate functions specify the support of an embedding and the reduction operator for the support, respectively.

Lastly, in the Filter stage, toPrune is used to specify those embeddings the user is no longer interested in. This depends on the support for the embedding’s canonical pattern (that is in the computed pattern map).

Pangolin also provides APIs to process the embeddings or pattern maps at the end of each phase (e.g., this is used in clique-listing, which is a variant of clique-finding that requires listing all the cliques). We omit this from Algorithm 2 and Listing 1 for the sake of brevity.

To implement the application-specific functions, users are required to write C++ code for CPU and CUDA __device__ functions for GPU (compiler support can provide a unified interface for both CPU and GPU in the future). Listing 2 lists the helper routines provided by Pangolin to the user. These routines are commonly used in mining applications; e.g., to check connectivity or canonicality. Pangolin also provides an implementation of domain (MNI) support. The helper functions are available on both CPU and GPU, with an efficient implementation on each architecture.

Comparison With Other Graph Mining APIs: Existing graph mining frameworks do not expose toExtend and getPattern to the application developer (instead, they assume these functions always return true and a canonical pattern, respectively). However, existing embedding-centric frameworks like Arabesque can easily expose these functions to enable application-specific optimizations (Section 4) like in Pangolin.

// connectivity checking routines
bool isConnected(Vertex u, Vertex v);

// canonical checking routines
bool isAutoCanonical(Embedding emb, Vertex v);
bool isAutoCanonical(Embedding emb, Edge e);
Pattern getIsoCanonicalBliss(Embedding emb);
Pattern getIsoCanonicalEigen(Embedding emb);

// to get domain (MNI) support
Support getDomainSupport(Embedding emb);
Support mergeDomainSupport(Support s1, Support s2);
Support getPatternSupport(Embedding emb);
Support getPatternSupport(Edge e);
Listing 2: Helper routines provided to the user by Pangolin.
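bool toExtend(Embedding emb, Vertex v) {
  // extend only the last vertex of the embedding
  return (emb.getLastVertex() == v);
}
bool toAdd(Embedding emb, Vertex u) {
  // pseudocode loop: u is a neighbor of the last vertex already,
  // so only the remaining vertices need a connectivity check
  for v in emb.getVertices() except last:
    if (!isConnected(v, u)) return false;
  return true;
}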
Listing 3: Clique finding (vertex induced) in Pangolin.

3.3 Applications in Pangolin

TC, CF, and MC use vertex-induced embeddings, while FSM uses edge-induced embeddings. Listings 3, 4, and 5 show CF, MC, and FSM implemented in Pangolin (we omit TC due to lack of space). For TC, the extension happens only once: the second vertex is extended to get the third vertex. After that, all we need to check is whether the third vertex is connected to the first vertex. If it is, this 3-vertex embedding forms a triangle. For CF in Listing 3, the search space is reduced by extending only the last vertex in the embedding instead of extending every vertex. If the newly added vertex is connected to all the vertices in the embedding, the new embedding forms a clique. Since cliques can only grow from smaller cliques (e.g., 4-cliques can only be generated by extending 3-cliques), all the non-clique embeddings are implicitly pruned. Neither TC nor CF uses the Reduce and Filter phases.

bool toAdd(Embedding emb, Vertex v) {
  return isAutoCanonical(emb, v);
}
Support getSupport(Embedding emb) { return 1; }
Pattern getPattern(Embedding emb) {
  return getIsoCanonicalBliss(emb);
}
Support Aggregate(Support s1, Support s2) {
  return s1 + s2;
}
Listing 4: Motif counting (vertex induced) in Pangolin.

Listing 4 shows MC. An extended embedding is added only if it is canonical according to an automorphism check. In the reduction phase, a quick pattern of each embedding is first obtained (by default) and then the canonical pattern is obtained using an isomorphism test. In Section 4.2, we show a way to customize this pattern classification method for MC to improve performance. The Filter phase is not used by MC.

bool toAdd(Embedding emb, Edge e) {
  return isAutoCanonical(emb, e);
}
Support getSupport(Embedding emb) {
  return getDomainSupport(emb);
}
Pattern getCanonicalPattern(Embedding emb) {
  return getIsoCanonicalBliss(emb);
}
Support Aggregate(Support s1, Support s2) {
  return mergeDomainSupport(s1, s2);
}
bool toPrune(Embedding emb, PatternMap map) {
  return (getPatternSupport(emb, map) < MIN_SUPPORT);
}
Listing 5: Frequent subgraph mining (edge induced) in Pangolin.

FSM is the most complicated mining application. As shown in Listing 5, it uses the custom domain (MNI) support routines provided by Pangolin. An extended embedding is added only if the new embedding is (automorphism) canonical. FSM uses the Filter phase to remove embeddings whose patterns are not frequent from the worklist. Despite the complexity of this application, the Pangolin implementation is still much simpler than hand-optimized FSM implementations [DistGraph, Scalemine, GraMi], thanks to the Pangolin API and helper routines.

4 Supporting Application-Specific Optimizations in Pangolin

In this section, we describe how Pangolin’s API and execution model support application-specific optimizations. We first describe how to enable pruning of the enumeration search space and then how to enable the eliding of isomorphism tests.

4.1 Pruning Enumeration Search Space

Directed Acyclic Graph (DAG): In typical mining applications, the input graph is undirected. In some vertex-induced mining applications, a common optimization technique is orientation, which converts the undirected input graph into a directed acyclic graph (DAG) [Arboricity, Alon]. Enumerating candidate subgraphs in the DAG instead of the undirected graph significantly cuts down the combinatorial search space. Orientation has been adopted in triangle counting [Schank], clique finding [KClique], and motif counting [ESCAPE]. Fig. 7 illustrates an example of the DAG construction process. In this example, vertices are ordered by vertex ID, and edges are directed from vertices with smaller IDs to vertices with larger IDs. In general, any total ordering of the vertices can be used, since a total ordering guarantees that the input graph is converted into a DAG. In our current implementation, we establish the order [DistTC] among the vertices based on their degrees: each edge points towards the endpoint with the higher degree. When there is a tie, the edge points to the vertex with the larger vertex ID. Other orderings can be included in the future. In Pangolin, the user can enable orientation by simply setting a macro.
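The degree-based orientation described above can be sketched as follows. This is a minimal illustration and not Pangolin's actual code: the adjacency-list representation and the names `AdjList`, `pointsTo`, and `orient` are assumptions made for this example.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using VertexId = std::uint32_t;
// Undirected input: every edge {u, v} appears in both g[u] and g[v].
using AdjList = std::vector<std::vector<VertexId>>;

// True if the undirected edge {u, v} is directed u -> v, i.e. towards the
// endpoint with the higher degree, with ties broken by the larger vertex ID.
bool pointsTo(const AdjList& g, VertexId u, VertexId v) {
  if (g[u].size() != g[v].size()) return g[u].size() < g[v].size();
  return u < v;
}

// Keeps, for each vertex, only the neighbors that it points to. Every
// undirected edge survives in exactly one direction, so the result is a DAG.
AdjList orient(const AdjList& g) {
  AdjList dag(g.size());
  for (std::size_t u = 0; u < g.size(); ++u)
    for (VertexId v : g[u])
      if (pointsTo(g, static_cast<VertexId>(u), v)) dag[u].push_back(v);
  return dag;
}
```

Because (degree, vertex ID) is a total order on the vertices, following directed edges always moves to a strictly larger vertex in that order, which is why no cycle can remain.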

Eager Pruning: In some applications like MC and FSM, all vertices in an embedding may need to be extended before determining whether the new embedding candidate is an (automorphism) canonical embedding or a duplicate. However, in some applications like TC and CF [KClique], duplicate embeddings can be detected eagerly before extending current embeddings. In both TC and CF, all embeddings obtained by extending any vertex except the last one lead to duplicate embeddings. Thus, as shown in the CF listing in Section 3.3, only the last vertex of the current embedding needs to be extended. This aggressive pruning can significantly reduce the search space. The toExtend function in Pangolin enables the user to specify such eager pruning.
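To make the effect of eager pruning concrete, the sketch below extends clique embeddings level by level on an oriented graph while extending only the last vertex. The function names toExtend and toAdd come from the text, but their signatures, the `Embedding` and `AdjList` types, and the `extend` driver are hypothetical and chosen only for this example.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

using VertexId = std::uint32_t;
using Embedding = std::vector<VertexId>;
// Oriented input graph (DAG); each neighbor list is sorted by ascending ID.
using AdjList = std::vector<std::vector<VertexId>>;

// Eager pruning: only the last vertex of the embedding is extended.
bool toExtend(const Embedding& emb, std::size_t pos) {
  return pos + 1 == emb.size();
}

// A candidate u is added only if every vertex of the embedding points to it,
// so the grown embedding is still a clique.
bool toAdd(const AdjList& dag, const Embedding& emb, VertexId u) {
  for (VertexId v : emb)
    if (!std::binary_search(dag[v].begin(), dag[v].end(), u)) return false;
  return true;
}

// One level of BFS-style extension over the current worklist.
std::vector<Embedding> extend(const AdjList& dag, const std::vector<Embedding>& worklist) {
  std::vector<Embedding> next;
  for (const Embedding& emb : worklist)
    for (std::size_t pos = 0; pos < emb.size(); ++pos) {
      if (!toExtend(emb, pos)) continue;  // skipped positions would only yield duplicates
      for (VertexId u : dag[emb[pos]])
        if (toAdd(dag, emb, u)) {
          Embedding grown = emb;
          grown.push_back(u);
          next.push_back(grown);
        }
    }
  return next;
}
```

On the oriented 4-clique with single-vertex embeddings as the initial worklist, three calls to `extend` yield 6 edges, 4 triangles, and 1 four-clique, each found exactly once.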

4.2 Eliding Isomorphism Test

Exploiting Memoization: Pangolin avoids redundant computation in each stage with memoization. Memoization is a tradeoff between computation and memory usage. Since graph mining applications are usually memory hungry, we only memoize when it requires a small amount of memory and/or dramatically reduces complexity. For example, in the Filter phase of FSM, Pangolin avoids an isomorphism test to get the pattern of each embedding, since this has already been done in the Reduce phase. The recomputation is avoided by maintaining a pattern ID (hash value) in each embedding after the isomorphism test and setting up a map between the pattern ID and the pattern support. Compared to the isomorphism test, which is extremely compute and memory intensive, storing the pattern ID and a small pattern support map is relatively lightweight. In MC, which is another application that finds multiple patterns, the user can easily enable memoization of the pattern ID in each level. In this case, when the computation proceeds to the next level, the pattern of each embedding can be identified from its pattern ID in the previous level with much less computation than a generic isomorphism test. As shown in Fig. 6, to identify a 4-cycle from a wedge or a diamond from a triangle, we only need to check if vertex 3 is connected to both vertex 1 and vertex 2.
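The FSM case above can be sketched as follows. This is an illustration under simplifying assumptions, not Pangolin's implementation: support is a plain embedding count rather than domain (MNI) support, `classify` stands in for the expensive isomorphism test, and the names `reducePhase` and `filterPhase` are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

using PatternId = std::uint64_t;
// Pattern ID -> support (simplified here to an embedding count).
using PatternMap = std::unordered_map<PatternId, std::size_t>;

struct Embedding {
  std::vector<std::uint32_t> vertices;
  PatternId patternId = 0;  // memoized by the Reduce phase
};

// Reduce: classify each embedding once, cache the resulting ID in the
// embedding, and accumulate the support of its pattern.
template <typename Classify>
PatternMap reducePhase(std::vector<Embedding>& worklist, Classify classify) {
  PatternMap support;
  for (Embedding& emb : worklist) {
    emb.patternId = classify(emb);
    ++support[emb.patternId];
  }
  return support;
}

// Filter: look up the cached ID instead of re-running the classification.
std::vector<Embedding> filterPhase(const std::vector<Embedding>& worklist,
                                   const PatternMap& support, std::size_t minSupport) {
  std::vector<Embedding> kept;
  for (const Embedding& emb : worklist)
    if (support.at(emb.patternId) >= minSupport) kept.push_back(emb);
  return kept;
}
```

The extra memory is one ID per embedding plus one map entry per distinct pattern, which is the lightweight cost the paragraph refers to.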

Figure 6: An example of eliding isomorphism test for 4-MC.

Customized Pattern Classification: In the reduction phase, the embeddings are classified into different categories based on their patterns, as shown in Fig. 4. To get the pattern of an embedding, a generic way is to convert the embedding into a canonical graph that is isomorphic to it (done in two steps, as explained in Section 3.2). Like Arabesque and RStream, Pangolin uses the Bliss [Bliss] library for getting the canonical graph or pattern for an embedding. This graph isomorphism approach is applicable to embeddings of any size, but it is very expensive as it requires frequent dynamic memory allocation and consumes a huge amount of memory. For small embeddings, such as 3-vertex and 4-vertex embeddings in vertex-induced applications and 2-edge and 3-edge embeddings in edge-induced applications, the canonical graph or pattern can instead be computed very efficiently. For example, in 3-motif counting, we know that there are only 2 patterns (i.e., wedge and triangle in Fig. 1), so the only computation needed to differentiate the two patterns is to count the number of edges (i.e., a wedge has 2 edges and a triangle has 3), as shown in Listing 6. This specialized method significantly reduces the computational complexity of pattern classification. The getPattern function in Pangolin enables the user to specify such customized pattern classification.

Pattern getPattern(Embedding emb) {
  if (emb.size() == 3) {
    if (emb.getNumEdges() == 3) return P1;  // triangle
    else return P0;                         // wedge
  } else return getIsoCanonicalBliss(emb);
}
Listing 6: Customized pattern classification for 3-motif.

5 Implementation of Pangolin on CPU and GPU

The user implements application-specific optimizations using the Pangolin API and helper functions, and Pangolin transparently parallelizes the application. Pangolin provides an efficient and scalable parallel implementation on shared-memory multicore CPU and GPU. Its CPU implementation is built using the Galois [Galois] library and its GPU implementation is built using the LonestarGPU [LonestarGPU] infrastructure. Pangolin includes several architectural optimizations. In this section, we briefly describe some of them: (1) exploiting locality and fully utilizing memory bandwidth [Analysis, Locality, Memory]; (2) reducing memory consumption; (3) mitigating the overhead of dynamic memory allocation; (4) minimizing synchronization and other overheads.

Figure 7: Orientation: convert the undirected input graph into a directed acyclic graph (DAG).

5.1 Data Structures for Embeddings

Since the number of possible $k$-embeddings in a graph increases exponentially with $k$, storage for embeddings grows rapidly and easily becomes the performance bottleneck. Existing systems use an array of structures (AoS) to organize the embeddings, which leads to poor locality, especially for GPU computing. In Pangolin, we use a structure of arrays (SoA) to store the embeddings in memory. The SoA layout is particularly beneficial for parallel processing on the GPU as memory accesses to the embeddings are fully coalesced.

Fig. 8 illustrates the embedding list data structure. On the left is the prefix tree that illustrates the embedding extension process in Fig. 3. The numbers in the vertices are vertex IDs (VIDs). Orange VIDs are in the first level $L_1$, and blue VIDs belong to the second level $L_2$. The grey level $L_0$ is a dummy level which does not actually exist but is used to explain the key ideas. On the right, we show the corresponding storage of this prefix tree. For simplicity, we only show the vertex-induced case. Given the maximum size $k$, the embedding list contains $k-1$ levels. In each level, there are two arrays, an index array (idx) and a vertex ID array (vid). The elements at the same position of the two arrays form a pair (idx, vid). In level $L_i$, idx is the index pointing to the vertex of the same embedding in the previous level $L_{i-1}$, and vid is the $i$-th vertex ID of the embedding.

Figure 8: The embedding list data structure (vertex-induced).

We can reconstruct each embedding by backtracking from the last level lists. For example, to get the first embedding in level $L_2$, we use an empty vertex set at the beginning. We start from the first entry $(i_2, v_2)$ in $L_2$, which indicates that the last vertex ID is $v_2$ and the previous vertex is at position $i_2$. We put $v_2$ into the vertex set. Then we go back to the previous level $L_1$ and get the $i_2$-th entry $(i_1, v_1)$. Now we put $v_1$ into the vertex set. Since $L_1$ is the lowest level and its index is the same as the vertex ID in level $L_0$, we put $i_1$ into the vertex set, which is now $\{i_1, v_1, v_2\}$.
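The backtracking procedure can be sketched as below for the vertex-induced case. The `Level`, `EmbeddingList`, and `reconstruct` names are hypothetical, and the sketch assumes that `list[0]` holds the lowest stored level, whose idx values are the first vertex IDs themselves, as the text describes for the lowest level. The sample data in the usage note is made up and does not reproduce Fig. 8.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using VertexId = std::uint32_t;

// One level of the embedding list in SoA form: two parallel arrays.
struct Level {
  std::vector<std::uint32_t> idx;  // position of the parent entry in the previous level
  std::vector<VertexId> vid;       // vertex added at this level
};

// list[0] is the lowest stored level: its idx holds the first vertex ID directly.
using EmbeddingList = std::vector<Level>;

// Rebuilds the embedding that ends at entry `pos` of list[level] by following
// the idx links back to the lowest level.
std::vector<VertexId> reconstruct(const EmbeddingList& list, std::size_t level, std::size_t pos) {
  std::vector<VertexId> emb(level + 2);
  for (std::size_t l = level + 1; l-- > 0;) {  // l runs from level down to 0
    emb[l + 1] = list[l].vid[pos];
    pos = list[l].idx[pos];
  }
  emb[0] = static_cast<VertexId>(pos);  // lowest-level idx is the first vertex ID
  return emb;
}
```

For instance, with `list[0]` storing the edges (0,1), (0,2), (1,2) as idx = {0, 0, 1} and vid = {1, 2, 2}, and `list[1]` storing idx = {0, 1} and vid = {2, 3}, `reconstruct(list, 1, 1)` walks entry 1 of the upper level back to entry 1 of the lower level and returns the vertex set {0, 2, 3}.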

For the edge-induced case, the strategy is similar but requires one more column, his, in each level to hold the history information. Each entry is a triplet (vid, his, idx) that represents an edge instead of a vertex. The history information his indicates at which level the source vertex of this edge is, while vid is the ID of the destination vertex. In this way we can backtrack the source vertex with his and reconstruct the edge connectivity inside the embedding. Note that we use three distinct arrays for vid, his, and idx, which is also an SoA layout. This data layout can improve temporal locality with more data reuse. For example, when the first vid in one level is connected to two vertices in the next level, that vid is reused by both embeddings. Considering high-degree vertices in power-law graphs, there are many reuse opportunities.

5.2 Avoiding Materialization of Data Structures

Loop Fusion: Existing mining systems first collect all the embedding candidates into a list and then call the user-defined function (like toAdd) to select embeddings from the list. This leads to materialization of the candidate embeddings list. In contrast, Pangolin preemptively discards embedding candidates using the toAdd function before adding them to the embedding list (as shown in Algorithm 2), thereby avoiding the materialization of the candidate embeddings (this is similar to loop fusion in array languages). This significantly reduces memory allocations, yielding lower memory usage and execution time.
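The contrast between the two strategies can be sketched as follows. This is a simplified serial illustration, not the code of Pangolin or of the existing systems: all names are hypothetical, and the two-pass version stores each candidate compactly as a (parent index, new vertex) pair.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

using VertexId = std::uint32_t;
using Embedding = std::vector<VertexId>;
using AdjList = std::vector<std::vector<VertexId>>;

// Returns a copy of emb with u appended.
Embedding grow(const Embedding& emb, VertexId u) {
  Embedding out = emb;
  out.push_back(u);
  return out;
}

// Two-pass version: materializes every candidate, then filters the list.
// `materialized` reports how many candidates had to be stored.
template <typename Pred>
std::vector<Embedding> extendThenFilter(const AdjList& g, const std::vector<Embedding>& worklist,
                                        Pred toAdd, std::size_t& materialized) {
  std::vector<std::pair<std::size_t, VertexId>> candidates;  // (parent index, new vertex)
  for (std::size_t i = 0; i < worklist.size(); ++i)
    for (VertexId v : worklist[i])
      for (VertexId u : g[v]) candidates.emplace_back(i, u);
  materialized = candidates.size();
  std::vector<Embedding> next;
  for (const auto& c : candidates)
    if (toAdd(worklist[c.first], c.second)) next.push_back(grow(worklist[c.first], c.second));
  return next;
}

// Fused version: toAdd runs inside the extension loop, so rejected
// candidates are never stored anywhere.
template <typename Pred>
std::vector<Embedding> extendFused(const AdjList& g, const std::vector<Embedding>& worklist,
                                   Pred toAdd) {
  std::vector<Embedding> next;
  for (const Embedding& emb : worklist)
    for (VertexId v : emb)
      for (VertexId u : g[v])
        if (toAdd(emb, u)) next.push_back(grow(emb, u));
  return next;
}
```

Both functions return the same embeddings in the same order; the only difference is the intermediate candidate list, whose size grows with the number of rejected candidates.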

Blocking Schedule: Since memory consumption increases exponentially with the embedding size, existing systems utilize either distributed memory or disk to hold the data. However, Pangolin is a shared-memory framework and could run out of memory for large graphs. In order to support processing large datasets, we introduce an edge-blocking technique in Pangolin. Since an application starts expansion with single-edge embeddings, Pangolin blocks the initial embedding list into smaller chunks and processes the chunks one after another. For large graphs, blocking does not affect parallelism, since there are still a large number of edges in each chunk that can be processed concurrently. Note that edge-blocking does not work for applications that require strict synchronization in each level. For example, in FSM we need to gather the embeddings for each pattern in order to compute the domain support. Thus, all embeddings need to be processed before moving to the next level, and we disable blocking for FSM. Currently, edge-blocking is used only for bounding memory usage. However, it is also potentially beneficial for data locality if the block size is chosen carefully. We leave this for future work.
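As a rough illustration of edge-blocking, the sketch below counts triangles on an oriented graph by processing the initial single-edge embeddings in fixed-size chunks. The names `Edge`, `countTriangles`, and `mineInBlocks` are hypothetical, and a real implementation would expand each chunk through several levels and release that chunk's embedding list before starting the next one.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

using VertexId = std::uint32_t;
// Oriented input graph (DAG); each neighbor list is sorted by ascending ID.
using AdjList = std::vector<std::vector<VertexId>>;

struct Edge {
  VertexId src, dst;
};

// Mines one block: for each edge (u, v) in [first, last), every common
// out-neighbor w of u and v closes a triangle.
std::uint64_t countTriangles(const AdjList& dag, const Edge* first, const Edge* last) {
  std::uint64_t count = 0;
  for (const Edge* e = first; e != last; ++e)
    for (VertexId w : dag[e->dst])
      if (std::binary_search(dag[e->src].begin(), dag[e->src].end(), w)) ++count;
  return count;
}

// Splits the initial single-edge embedding list into chunks of at most
// blockSize edges and mines the chunks one after another.
std::uint64_t mineInBlocks(const AdjList& dag, const std::vector<Edge>& edges,
                           std::size_t blockSize) {
  if (blockSize == 0) blockSize = 1;  // guard against an empty block
  std::uint64_t total = 0;
  for (std::size_t begin = 0; begin < edges.size(); begin += blockSize) {
    const std::size_t end = std::min(begin + blockSize, edges.size());
    total += countTriangles(dag, edges.data() + begin, edges.data() + end);
  }
  return total;
}
```

The result is independent of the block size because each block only contributes a partial count. This is exactly the property that FSM lacks, since its domain support needs all embeddings of a pattern at once.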

5.3 Dynamic Memory Allocation

Inspection-Execution: Compared to graph analytics applications, graph mining applications need significantly more dynamic memory allocations, and memory allocation could become a performance bottleneck. A major source of memory allocation is the embedding list. As the size of the embedding list increases, we need to allocate memory for the embeddings in each round. When generating the embedding list, there are write conflicts as different threads write to the same shared embedding list. In order to avoid frequent resize and insert operations, we use an inspection-execution technique to generate the embedding list.

The generation includes three steps. In the first step, we only calculate the number of newly generated embeddings for each embedding in the current embedding list. In the second step, we use a parallel prefix sum to calculate the start index for each current embedding and allocate the exact amount of memory for all the new embeddings. Finally, we actually write the new embeddings to update the embedding list, according to the start indices. In this way, each thread can write to the shared embedding list simultaneously without conflicts.
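The three steps can be sketched as follows. This is a serial illustration with hypothetical names (`Level`, `extendLevel`); in a parallel implementation the loops of steps 1 and 3 would be distributed over threads and step 2 would use a parallel prefix sum. The predicate `toAdd(i, u)` is an assumed stand-in for the user-defined filter.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using VertexId = std::uint32_t;
using AdjList = std::vector<std::vector<VertexId>>;

// One level of the embedding list in SoA form.
struct Level {
  std::vector<std::uint32_t> idx;  // position of the parent entry in the previous level
  std::vector<VertexId> vid;       // vertex added at this level
};

// Extends every entry of `cur` by the neighbors of its last vertex that pass toAdd.
template <typename Pred>
Level extendLevel(const AdjList& g, const Level& cur, Pred toAdd) {
  const std::size_t n = cur.vid.size();

  // Step 1 (inspection): count the new embeddings produced by each entry.
  std::vector<std::size_t> start(n + 1, 0);
  for (std::size_t i = 0; i < n; ++i)
    for (VertexId u : g[cur.vid[i]])
      if (toAdd(i, u)) ++start[i + 1];

  // Step 2: a prefix sum turns the counts into start offsets, and the next
  // level is allocated once with exactly the required size.
  for (std::size_t i = 0; i < n; ++i) start[i + 1] += start[i];
  Level next;
  next.idx.resize(start[n]);
  next.vid.resize(start[n]);

  // Step 3 (execution): recompute the extensions and write them into the
  // disjoint range [start[i], start[i + 1]) owned by entry i.
  for (std::size_t i = 0; i < n; ++i) {
    std::size_t out = start[i];
    for (VertexId u : g[cur.vid[i]])
      if (toAdd(i, u)) {
        next.idx[out] = static_cast<std::uint32_t>(i);
        next.vid[out] = u;
        ++out;
      }
  }
  return next;
}
```

Since the output ranges of different entries never overlap, step 3 needs no locks or atomic operations. The price is that the extension logic runs twice, once to count and once to write.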

Although inspection-execution requires iterating over the embeddings twice, making this tradeoff for the GPU is reasonable for two reasons. First, the GPU can afford the recomputation as it has a lot of computation power. Second, improving the memory access pattern to better utilize memory bandwidth is more important for the GPU. This is also a more scalable design choice for the CPU as the number of cores on the CPU is increasing.

Scalable Allocators: The pattern reduction in FSM is another case where dynamic memory allocation is frequently invoked. To calculate the domain (MNI) support of each pattern, we need to gather all the embeddings associated with the same pattern. This gathering requires resizing the vertex set of each domain (as shown in Fig. 2). The C++ standard library (std) employs a concurrent allocator implemented using a global lock for each allocation, which could seriously limit performance and scalability. We leverage the Galois memory allocator to alleviate this overhead. Galois provides a built-in efficient and concurrent memory allocator that implements ideas from prior scalable allocators [Hoard, Michael, Schneider]. The allocator uses per-thread memory pools of huge pages. Each thread manages its own memory pool. If a thread has no more space in its memory pool, it uses a global lock to add another huge page to its pool. Most allocations thus avoid locks. Pangolin uses variants of std data structures provided by Galois that use the Galois memory allocator. For example, this is useful in maintaining the pattern map itself. Our GPU infrastructure currently lacks support for efficient dynamic memory allocation inside CUDA kernels. To avoid frequent resize operations inside kernels, we conservatively calculate the memory space required and pre-allocate bit vectors for kernel use. This pre-allocation requires much more memory than is actually needed and restricts our GPU implementation to smaller inputs for FSM.


| Graph | Source | # V | # E | $\bar{d}$ | Labels |
|---|---|---|---|---|---|
| Mi | Mico [GraMi] | 100,000 | 2,160,312 | 22 | 29 |
| Pa | Patents [Patent] | 2,745,761 | 27,930,818 | 10 | 37 |
| Yo | Youtube [Youtube] | 7,066,392 | 114,190,484 | 16 | 29 |
| Pdb | ProteinDB [DistGraph] | 48,748,701 | 387,730,070 | 8 | 25 |
| Lj | LiveJournal [SNAP] | 4,847,571 | 85,702,474 | 18 | 0 |
| Or | Orkut [SNAP] | 3,072,441 | 234,370,166 | 76 | 0 |
| Tw | Twitter [Konect] | 21,297,772 | 530,051,090 | 25 | 0 |
| Gsh | Gsh-2015 [gsh2015] | 988,490,691 | 51,381,410,236 | 52 | 0 |

Table 1: Input graphs (symmetric, no loops, no duplicate edges) and their properties ($\bar{d}$ is the average degree).

5.4 Other Optimizations

Graph mining algorithms make extensive use of connectivity operations for determining how vertices are connected in the input graph. For example, in $k$-cliques, we need to check whether a new vertex is connected to all the vertices in the current embedding. Another common connectivity operation is to determine how many vertices are connected to two given vertices $u$ and $v$, which is usually obtained by computing the intersection of the neighbor lists of the two vertices. A naive solution for connectivity checking is to search for one vertex $v$ in the other vertex $u$’s neighbor list sequentially. If it is found, the two vertices are directly connected. To reduce complexity and improve parallel efficiency, we employ binary search for the connectivity check. Binary search is particularly efficient on the GPU, as it improves GPU memory access efficiency [TriCore]. We provide efficient CPU and GPU implementations of these connectivity operations as helper routines, e.g., isConnected (see the helper-routine listing), which allow the user to easily compose pruning strategies in applications.
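Both connectivity operations can be sketched on sorted neighbor lists as follows. The routine name isConnected comes from the text, but its signature here, the `AdjList` type, and the `countCommonNeighbors` helper are assumptions made for this example rather than Pangolin's actual interface.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

using VertexId = std::uint32_t;
// Neighbor lists sorted by ascending vertex ID, as in the CSR input format.
using AdjList = std::vector<std::vector<VertexId>>;

// Connectivity check: binary search for v in u's sorted neighbor list.
bool isConnected(const AdjList& g, VertexId u, VertexId v) {
  return std::binary_search(g[u].begin(), g[u].end(), v);
}

// Size of the intersection of the two neighbor lists: each vertex of the
// shorter list is looked up in the longer one by binary search.
std::size_t countCommonNeighbors(const AdjList& g, VertexId u, VertexId v) {
  const bool uShorter = g[u].size() <= g[v].size();
  const std::vector<VertexId>& shorter = uShorter ? g[u] : g[v];
  const std::vector<VertexId>& longer = uShorter ? g[v] : g[u];
  std::size_t count = 0;
  for (VertexId w : shorter)
    if (std::binary_search(longer.begin(), longer.end(), w)) ++count;
  return count;
}
```

With neighbor lists of lengths $m \le n$, the intersection costs $O(m \log n)$ comparisons, and each lookup is independent of the others, which is what makes the scheme easy to spread across threads.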


Eager Prune
Customized Pattern


Table 2: Optimizations enabled in each mining application.


| App | Option | Mico Arabesque | Mico RStream | Mico Kaleido | Mico Pangolin | Patent Arabesque | Patent RStream | Patent Kaleido | Patent Pangolin | Youtube Arabesque | Youtube RStream | Youtube Kaleido | Youtube Pangolin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TC | | 30.78 | 2.56 | 0.17 | 0.02 | 100.78 | 7.81 | 0.52 | 0.08 | 601.29 | 39.82 | 2.24 | 0.34 |
| CF | 3 | 32.23 | 7.29 | 0.46 | 0.04 | 97.82 | 39.10 | 0.56 | 0.15 | 617.04 | 862.25 | 2.16 | 0.69 |
| CF | 4 | 41.72 | 637.77 | 3.88 | 1.56 | 108.07 | 62.13 | 1.14 | 0.40 | 1086.88 | - | 7.83 | 3.08 |
| CF | 5 | 311.89 | - | 183.63 | 60.45 | 108.78 | 76.91 | 1.46 | 0.51 | 1123.63 | - | 18.99 | 7.32 |
| MC | 3 | 36.06 | 7137.48 | 1.39 | 0.21 | 101.56 | 3886.88 | 4.74 | 0.93 | 538.37 | 89387.00 | 35.47 | 5.54 |
| MC | 4 | 352.96 | - | 198.17 | 175.56 | 779.75 | - | 152.28 | 209.07 | 5132.80 | - | 4988.96 | 4405.32 |
| 3-FSM | 300 | 104.87 | 56.77 | 7.35 | 3.92 | 340.67 | 230.13 | 25.47 | 14.70 | 666.97 | 1415.12 | 132.59 | 96.90 |
| 3-FSM | 500 | 72.18 | 57.88 | 8.19 | 3.63 | 433.55 | 208.63 | 26.41 | 15.78 | 576.51 | 1083.92 | 133.31 | 97.74 |
| 3-FSM | 1000 | 48.49 | 52.91 | 7.84 | 2.96 | 347.33 | 194.01 | 28.71 | 18.05 | 693.19 | 1179.28 | 136.24 | 98.02 |
| 3-FSM | 5000 | 36.41 | 35.63 | 3.97 | 2.41 | 366.11 | 172.23 | 31.51 | 27.02 | 758.56 | 1248.14 | 155.04 | 102.26 |

Table 3: Execution time (sec) of mining applications (option: minimum support for 3-FSM; $k$ for other applications) in graph mining frameworks on 28 threads (‘-’ indicates out of memory or disk, or timed out in 30 hours).

6 Evaluation

In this section, we first present our experimental setup. Then, we compare Pangolin with state-of-the-art graph mining frameworks and hand-optimized applications. Finally, we analyze the performance of Pangolin in more detail.

6.1 Experimental Setup

We compare Pangolin with state-of-the-art graph mining frameworks: Arabesque [Arabesque], RStream [RStream], and Kaleido [Kaleido]. Kaleido is concurrent work and is not publicly available, but we include the numbers reported in its arXiv paper for comparison.

We test the 4 mining applications discussed in Section 3.3, i.e., TC, CF, MC, and FSM. $k$-MC and $k$-CF terminate when subgraphs reach a size of $k$ vertices. For $k$-FSM, we mine the frequent subgraphs with $k-1$ edges. Table 1 lists the input graphs used in the experiments. We assume that input graphs are symmetric, have no self-loops, and have no duplicated edges. We represent the input graphs in memory in a compressed sparse row (CSR) format. The neighbor list of each vertex is sorted by ascending vertex ID.

The first 3 graphs (Mi, Pa, and Yo) have been previously used by Arabesque, RStream, and Kaleido. We use the same graphs to compare Pangolin with these existing frameworks. In addition, we include larger graphs from the SNAP Collection [SNAP] (Lj, Or), the Koblenz Network Collection [Konect] (Tw), DistGraph [DistGraph] (Pdb), and a very large web crawl [gsh2015] (Gsh). Except for Pdb, the larger graphs do not have vertex labels, so we only use them to test TC, CF, and MC. Pdb is used only for FSM.

Unless specified otherwise, CPU experiments were conducted on a 4-socket machine with Intel Xeon Gold 5120 CPU 2.2GHz with 56 cores (14 cores per socket), 190GB memory, and 3TB SSD. Kaleido was tested using 56 threads (with hyperthreading) on Intel Xeon Gold 5117 CPU 2.0GHz, 2 sockets (14 cores each), 128GB memory, and 480GB SSD. To make our comparison fair, we restrict our experiments to only 2 sockets of our machine, and we use only 28 threads without hyperthreading. For the largest graph, Gsh, we used a 2-socket machine with Intel’s second-generation Xeon Scalable processor with 2.2GHz and 48 cores, equipped with 6TB of Intel Optane PMM [optane] (byte-addressable memory technology). Our GPU platforms are NVIDIA GTX 1080Ti (11GB memory) and Tesla V100 (32GB memory) GPUs with CUDA 9.0. Unless specified otherwise, GPU results reported are on V100.

For RStream, the number of partitions is a key performance knob. For each configuration, we choose the best-performing number of partitions among 10, 20, 50, and 100. Arabesque and Pangolin run all applications in memory, whereas RStream writes its intermediate data to the SSD. We exclude preprocessing time and only report the computation time (on the CPU or GPU) as an average of 3 runs.


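| Input | Arabesque | RStream | PA-CPU | PA-GPU | GAP |
|---|---|---|---|---|---|
| LiveJ | 313.5 | 610.3 | 0.6 | 0.2 | 0.5 |
| Orkut | 1336.2 | 759.3 | 3.9 | 0.7 | 4.2 |
| Twitter | - | - | 38.8 | 8.1 | 40.1 |

(a) TC

| Input | PA-CPU | PA-GPU | KClist |
|---|---|---|---|
| LiveJ | 26.3 | 2.3 | 1.9 |
| Orkut | 82.3 | 4.3 | 4.1 |
| Twitter | 28165.0 | 1508.7 | 628.3 |

(b) 4-CF

| Input | PA-CPU | PA-GPU | PGD |
|---|---|---|---|
| LiveJ | 19.5 | 1.7 | 12.7 |
| Orkut | 174.6 | 18.0 | 46.9 |
| Twitter | 9388.1 | 1163.4 | 1883.3 |

(c) 3-MC

| $\sigma$ | Mico PA-CPU | Mico PA-GPU | Mico DG | Patent PA-CPU | Patent PA-GPU | Patent DG | Youtube PA-CPU | Youtube PA-GPU | Youtube DG | PDB PA-CPU | PDB PA-GPU | PDB DG |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 300 | 3.9 | 0.6 | 52.2 | 14.7 | 2.7 | 19.9 | 96.9 | - | - | 63.7 | - | 281.4 |
| 500 | 3.6 | 0.5 | 52.9 | 15.8 | 2.7 | 18.7 | 97.7 | - | - | 65.6 | - | 279.5 |
| 1000 | 3.0 | 0.4 | 59.1 | 18.1 | 2.7 | 18.6 | 98.0 | - | - | 73.4 | - | 274.5 |
| 5000 | 2.4 | 0.2 | 58.1 | 27.0 | 1.7 | 18.4 | 102.3 | - | - | 145.3 | - | 322.9 |

(d) 3-FSM. DG: DistGraph

| $\sigma$ | PA | DG |
|---|---|---|
| 15K | 438.9 | 129.0 |
| 20K | 224.7 | 81.9 |
| 30K | 31.9 | 26.2 |

(e) 4-FSM for Patent

Table 4: Execution time (sec) of Pangolin (PA) and hand-optimized mining applications ($\sigma$: minimum support).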


| Application | Gsh-2015 |
|---|---|
| TC | 139.3 |
| 3-CF | 659.3 |
| 4-CF | 23474.9 |

Table 5: Execution time (sec) of Pangolin applications on Intel Optane-PMM machine.

6.2 Graph Mining Frameworks

Table 3 reports the execution time of the four systems. On the same 28-core CPU, Pangolin is an order of magnitude faster than both Arabesque and RStream. We observe that for small inputs (e.g., TC and CF with Mi), Arabesque suffers non-trivial overhead due to the startup cost of Giraph. Moreover, due to its lack of eager pruning, its lack of customized pattern classification, and its ODAG data structure, it is also slower than Pangolin for larger datasets. On average, Pangolin is 49x faster than Arabesque.

RStream only supports edge-induced exploration and does not support pattern-specific optimization. This results in extremely large search spaces for CF and MC because there are many more edges than vertices. In addition, RStream does not scale well because of its intensive use of mutex locks for updating shared data. Lastly, Pangolin avoids the inefficient data structures and expensive redundant computation (isomorphism tests) used by RStream. Pangolin is 88x faster than RStream on average (Kaleido also observes that RStream is slower than Arabesque).

Pangolin outperforms Kaleido in all cases except 4-MC on Patent. On average, it is 2.6x faster than Kaleido (7.4x, 3.3x, 2.4x, and 1.6x for TC, CF, MC, and FSM, respectively). This is mainly due to DAG construction and customized pattern classification in Pangolin.

6.3 Hand-Optimized Mining Applications

We compare hand-optimized implementations with Pangolin on CPU and GPU. Depending on the memory usage of each application, we report results for the largest datasets supported on our platform.

For TC, we compare with the graph benchmark suite GAP [GAPBS] in Table 4(a). TC implementations in existing mining frameworks are orders of magnitude slower than the hand-optimized implementation in GAP. In contrast, Pangolin achieves competitive performance compared to GAP, which makes it a practical framework for real-world use cases.

Table 4(d) and Table 4(e) compare our 3-FSM and 4-FSM, respectively, with DistGraph (DG) [DistGraph, ParFSM]. DistGraph supports both shared-memory and distributed platforms. DistGraph supports a runtime parameter $\sigma$, which specifies the minimum support, but we had to modify it to add the maximum size $k$. On CPUs, Pangolin (PA-CPU) outperforms DistGraph for 3-FSM in all cases, except for Pa with support 5K. For the graphs that fit in GPU memory (Mi, Pa), Pangolin on GPUs (PA-GPU) is always faster than DistGraph as well as Pangolin on CPUs. For 4-FSM, Pangolin is 22% to 240% slower than DistGraph. The slowdown is mainly due to algorithmic differences. DistGraph adopts depth-first search (DFS) exploration and a recursive approach, which reduces computation and memory consumption, while Pangolin does BFS exploration. Note that DistGraph employs a complicated customized load balancing scheme for its shared-memory parallel FSM to handle the load imbalance caused by its DFS exploration. This substantially complicates the programming and leads to 17K lines of code as opposed to 300 in Pangolin (Listing 5) with optimizations.

Figure 9: Strong scaling of Pangolin (using Yo).

Table 4(b) compares our 4-clique finding with KClist [KClique]. Pangolin is 10x to 20x slower than KClist on the CPU, although GPU acceleration of Pangolin significantly reduces the performance gap. This is because KClist constructs a shrinking local graph for each edge, which significantly reduces the search space. This optimization can only be enabled in DFS exploration. We observe the same trend for our 3-MC on CPU compared with PGD [PGD] in Table 4(c). However, Pangolin’s GPU implementation is much faster than PGD, due to the GPU’s larger computation power and the lack of DAG construction in PGD. In the future, we plan to consider adding support for DFS exploration to further improve the performance of Pangolin.

Figure 10: Speedup of Pangolin on GPU over Pangolin on 28-thread CPU. Missing bars of 1080Ti are due to out of memory.

6.4 Scalability and GPU Performance

Although Pangolin is an in-memory processing system, it can scale to very large datasets by using large-memory systems. To demonstrate this, we evaluate Pangolin on the Intel Optane PMM system and mine a very large real-world web crawl, Gsh. As shown in Table 5, TC and 3-CF only take 2 and 11 minutes, respectively. 4-CF is much more compute and memory intensive, so it takes about 6.5 hours (23474.9 seconds in Table 5). To the best of our knowledge, this is the largest graph dataset for which 4-CF has been mined.

Fig. 9 illustrates how the performance of Pangolin applications scales as the number of threads increases for different applications on Yo. Pangolin achieves good scalability by utilizing efficient, concurrent, scalable data structures and allocators. For TC, we observe near linear speedup over single-thread execution. In contrast, FSM’s scalability suffers due to the overheads of computing domain support.

Fig. 10 illustrates the speedup of Pangolin applications on GPU over the 28-thread CPU. Note that due to their limited memory size, GPUs fail to run some applications and inputs. On average, the 1080Ti and V100 GPUs achieve speedups of 6x and 15x, respectively, over the CPU execution. Specifically, we observe substantial speedup on CF and MC. For example, the V100 GPU achieves a 50x speedup on 4-MC for Yo, demonstrating the suitability of GPUs for these compute-intensive applications.

Figure 11: Peak memory usage in Arabesque, RStream, and Pangolin for Pa (4-MC in RStream runs out of memory).

6.5 Memory Consumption

The peak memory consumption for Arabesque, RStream, and Pangolin is illustrated in Fig. 11. We observe that Arabesque always requires the most memory because it is implemented in Java using the Giraph [Giraph] framework, which allocates a huge amount of memory. In contrast, Pangolin avoids this overhead and reduces memory usage. Since Pangolin does in-memory computation, it is expected to consume much more memory than RStream, which stores its embeddings on disk. However, we find that the difference in memory usage is trivial because aggressive search space pruning and customized pattern classification significantly reduce memory usage. Since this small memory cost brings substantial performance improvement, we believe Pangolin makes a reasonable trade-off. For 4-MC, RStream runs out of memory due to its edge-induced exploration (Arabesque and Pangolin use vertex-induced exploration).

Figure 12: Performance improvement due to various optimizations: (a) eager pruning and DAG for 4-CF; (b) Galois scalable memory allocator for 3-FSM; (c) customized pattern classification for $k$-MC; (d) avoiding materialization for $k$-MC.

Figure 13: $k$-MC speedup of (a) using embedding list (SoA + inspection-execution) over using embedding queue (AoS) and (b) binary search over linear search.

Figure 14: LLC miss counts in the vertex extension phase of $k$-CF using AoS and SoA for embeddings.

6.6 Impact of Optimizations

We evaluate the performance improvement due to the optimizations described in Section 4 and Section 5. Due to lack of space, we present these comparisons only for the CPU implementations, but the results on the GPU are similar. Fig. 12(a) shows the impact of orientation (DAG) and user-defined eager pruning (Prune) on 4-CF. Both techniques significantly improve performance for TC (not shown) and CF. Fig. 12(b) demonstrates the advantage of using Galois memory allocators instead of std allocators. This is particularly important for FSM as it requires intensive memory allocation for counting support. Fig. 12(c) illustrates that customized pattern classification used in MC and FSM yields huge performance gains by eliding expensive generic isomorphism tests. Fig. 12(d) shows that materialization of temporary embeddings causes 11% to 37% slowdown for MC. This overhead exists in every application of Arabesque (and RStream), and is avoided in Pangolin. In Fig. 13(a), we evaluate the performance of our proposed embedding list data structure with SoA layout and inspection-execution. Compared to the straightforward embedding queue (which mimics the AoS implementation used in Arabesque and RStream), the $k$-MC performance is 2.1x to 4.7x faster. Another optimization is employing binary search for the connectivity check. Fig. 13(b) shows that binary search can achieve up to 6.6x speedup compared to linear search. Finally, Fig. 14 illustrates the last level cache (LLC) miss counts in the vertex extension phase of $k$-CF. We compare two data structure schemes for the embeddings, AoS and SoA. We observe a sharp reduction of the LLC miss count when switching from AoS to SoA. This further confirms that SoA has better locality than AoS, due to the data reuse among embeddings.

7 Related Work

Mining Applications: Hand-optimized graph mining applications target various platforms. For triangle counting, Shun et al. [Shun] present a parallel, cache-oblivious TC solver on multicore CPUs that achieves good cache performance without fine-tuning cache parameters. Load balancing is applied in distributed TC solvers [Suri, PDTL] to evenly distribute workloads. TriCore [TriCore] is a multi-GPU TC solver that uses binary search to increase coalesced memory accesses, and it employs dynamic load balancing.

Chiba and Nishizeki (C&N) [Arboricity] proposed an efficient $k$-clique listing algorithm which computes the subgraph induced by the neighbors of each vertex and then recurses on the subgraph. Searching over a local space of neighborhoods is more efficient than searching over the global space of the entire graph. Danisch et al. [KClique] refine the C&N algorithm for parallelism and construct a DAG using a core-value-based ordering to further reduce the search space. PGD [PGD] counts 3- and 4-motifs by leveraging a number of proven combinatorial arguments for different patterns. Some patterns (e.g., cliques) are counted first, and the frequencies of other patterns are obtained in constant time using these combinatorial arguments. Escape [ESCAPE] extends this approach to 5-vertex subgraphs and leverages a DAG to reduce the search space.

Frequent subgraph mining (FSM) [Huan] is one of the most important mining applications. gSpan [gSpan] is an efficient sequential FSM solver which implements a depth-first search (DFS) based on a lexicographic order called minimum DFS Code. GraMi [GraMi] proposes an approach that finds only the minimal set of instances to satisfy the support threshold and avoids enumerating all instances. This idea has been adopted by most other frameworks. DistGraph [DistGraph] parallelizes gSpan for both shared-memory and distributed CPUs. Each worker thread does the DFS walk concurrently. To balance workload, it introduces a customized dynamic load balancing strategy which splits tasks on the fly and recomputes the embedding list from scratch after the task is sent to a new worker. Scalemine [Scalemine] solves FSM with a two-phase approach, which approximates frequent subgraphs in phase-1, and uses collected information to compute the exact solution in phase-2.

Other important mining applications include maximal cliques [MaximalClique], maximum clique [MaximumClique, Aboulnaga], and subgraph listing [SubgraphListing, CECI, DUALSIM, PathSampling, TurboFlux, Ma, Lai]. They employ various optimizations to reduce computation and improve hardware efficiency. They inspired our work to design a flexible interface for user-defined optimizations. However, they achieve high performance at the cost of tremendous programming effort, while Pangolin provides a unified model for ease of programming.

Mining Frameworks: For high productivity and high performance, graph mining systems such as Arabesque [Arabesque], RStream [RStream], and Kaleido [Kaleido] have been proposed. They provide a unified programming interface to the user, which simplifies application development. However, their interfaces are not flexible enough to enable application-specific optimizations. Instead of the BFS exploration used in these frameworks, Fractal [Fractal] employs a DFS strategy to enumerate subgraphs, which reduces the memory footprint. Pangolin uses the BFS approach, which is inherently more parallel than the DFS approach. In the future, we plan to also support DFS exploration.
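The trade-off between the two exploration orders can be seen in a small Python sketch. It is only an illustration under stated assumptions: `extend`, `mine_bfs`, and `mine_dfs` are hypothetical names, the pattern is restricted to cliques so that the extension rule stays short, and vertices are assumed to be integers. It reflects neither Pangolin's nor Fractal's actual code.

```python
def extend(adj, emb):
    """Yield every embedding that adds one vertex to emb.
    Assumption: we mine cliques, so the new vertex must be adjacent to all
    vertices in emb; requiring a higher ID avoids duplicate embeddings."""
    common = set.intersection(*(adj[v] for v in emb))
    for w in sorted(common):
        if w > emb[-1]:
            yield emb + (w,)

def mine_bfs(adj, k):
    """Level-by-level exploration: the whole level is materialized, so every
    embedding in it can be extended independently (easy to parallelize)."""
    level = [(v,) for v in sorted(adj)]
    peak = len(level)  # largest number of embeddings alive at once
    for _ in range(k - 1):
        level = [e for emb in level for e in extend(adj, emb)]
        peak = max(peak, len(level))
    return level, peak

def mine_dfs(adj, k):
    """Depth-first exploration: only the current path of partial embeddings
    is alive at any time, which keeps the working set small."""
    results = []
    def visit(emb):
        if len(emb) == k:
            results.append(emb)
            return
        for e in extend(adj, emb):
            visit(e)
    for v in sorted(adj):
        visit((v,))
    return results
```

Both functions return the same embeddings. The `peak` value reported by `mine_bfs` shows how many intermediate embeddings must be held in memory at the widest level, which is the cost the DFS variant avoids.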

Approximate Mining: There are approximate solutions for triangle counting [DOULION, Rahman, Tsourakakis], clique finding [Mitzenmacher, Jain], motif counting [Slota, Bressan1], and frequent subgraph mining [Approx]. ASAP [ASAP] is an approximate graph mining framework that supports various mining applications. It extends graph approximation theory to general graph patterns and incurs less than 5% error. Since approximation can reduce computation, it is much faster than exact frameworks such as Arabesque, and can scale to large graphs. In contrast, our work focuses on exact mining and shows that Pangolin can also achieve high performance without sacrificing accuracy.
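As a concrete instance of how approximation reduces computation, the edge-sampling idea behind DOULION can be sketched in a few lines of Python. This is a simplified illustration, not ASAP's estimator and not the code of [DOULION]: the function names and the `seed` parameter are assumptions, and the input is assumed to be a list of unique undirected edges over comparable vertex IDs.

```python
import random

def count_triangles(edges):
    """Exact triangle count; each triangle is counted once, at the edge
    joining its two smallest vertices."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    total = 0
    for u, v in edges:
        a, b = min(u, v), max(u, v)
        total += sum(1 for w in adj[a] & adj[b] if w > b)
    return total

def doulion_estimate(edges, p, seed=None):
    """Keep each edge independently with probability p, count triangles in
    the sparsified graph, and rescale by 1 / p^3 (a triangle survives only
    if all three of its edges are kept)."""
    rng = random.Random(seed)
    kept = [e for e in edges if rng.random() < p]
    return count_triangles(kept) / p ** 3
```

With `p = 1.0` the estimate equals the exact count. Smaller values of `p` shrink the graph that has to be mined, at the price of variance in the estimate.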

8 Conclusion

We present Pangolin, a high-performance, flexible graph mining system on shared-memory CPUs and GPUs. Pangolin provides a simple programming interface that enables the user to specify eager enumeration search space pruning and customized pattern classifications. To exploit locality, Pangolin uses an efficient structure of arrays (SoA) for storing embeddings. It avoids materialization of temporary embeddings and blocks the schedule of embedding exploration to reduce memory usage. It also uses inspection-execution and scalable memory allocators to mitigate the overheads of dynamic memory allocation. These application-specific and architectural optimizations enable Pangolin to outperform the state-of-the-art mining frameworks Arabesque and RStream by 49x and 88x on average, respectively, on the same 28-core CPU. Moreover, Pangolin on a V100 GPU is 15x faster on average than Pangolin on the CPU. Thus, Pangolin provides performance competitive with hand-optimized implementations but with a much better programming experience. To mine 4-cliques in a web-crawl (gsh) with 988 million vertices and 51 billion edges, Pangolin takes hours on a 48-core Intel Optane machine with 6 TB memory.
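To make the SoA layout and the inspection-execution idea concrete, the following Python sketch shows one possible arrangement. It is a sketch under assumptions rather than Pangolin's actual data structure: the class name `EmbeddingList`, the per-level `vid` and `idx` columns, and the `candidates` callback are all hypothetical, and the callback is assumed to return the same list each time it is called on the same embedding.

```python
from itertools import accumulate

class EmbeddingList:
    """Structure of arrays: each level stores one column of vertex IDs and
    one column of parent indices, instead of one object per embedding."""

    def __init__(self, roots):
        self.vid = [list(roots)]          # vid[l][i]: last vertex of embedding i
        self.idx = [[-1] * len(roots)]    # idx[l][i]: parent index at level l-1

    def embedding(self, level, i):
        """Rebuild embedding i of a level by following parent indices."""
        out = []
        while level >= 0:
            out.append(self.vid[level][i])
            i = self.idx[level][i]
            level -= 1
        return out[::-1]

    def extend(self, candidates):
        """Add one level. candidates(emb) returns the vertices to append."""
        last = len(self.vid) - 1
        n = len(self.vid[last])
        # Inspection: count the new embeddings each parent will produce.
        counts = [len(candidates(self.embedding(last, i))) for i in range(n)]
        # A prefix sum turns the counts into write offsets.
        offsets = [0] + list(accumulate(counts))
        # Allocate the new level once, at its exact final size.
        new_vid = [0] * offsets[-1]
        new_idx = [0] * offsets[-1]
        # Execution: each parent writes into its own disjoint slice.
        for i in range(n):
            for j, w in enumerate(candidates(self.embedding(last, i))):
                new_vid[offsets[i] + j] = w
                new_idx[offsets[i] + j] = i
        self.vid.append(new_vid)
        self.idx.append(new_idx)
```

Because every parent knows its write offset before the execution pass, the new level needs a single allocation and no synchronization between parents. The cost is that the candidate computation runs twice.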