A Structure-aware Approach for Efficient Graph Processing

by   Beibei Si, et al.

With the advent of the big data, graph are processed in an iterative manner, which incrementally described in the form of graph in big data applications. Most currently, graph processing methods treat the underlying map data as black boxes. However, as shown in experimental evaluation, graph structures often have diversity, different graph processing methods are very sensitive to the graph structure and show different performance for different data sets. Based on this, a graph processing method for graph structure analysis is proposed in this paper: (1) This paper calculates the vertex activity of a graph according to the in-degree and out-degree, and divide the corresponding vertices into the hot or cold partitions; (2) According to the change of graph structure caused by partial vertex convergence after iteration, this paper reclassifies the partitions, divides the lower active vertices into cold partition and reduces the frequency of calculation, which thereby reducing the cache miss rate and the I/O overhead caused by active vertices as well; (3) The partition with highest vertex status degree are given a priority calculation in this paper. In detail, more pronounced and more frequent vertices have higher processing priority. In this way, the convergence speed of the graph vertices is accelerated, and the running time of the graph algorithm in the big data environment is reduced. Our experiments show that compared with the latest system, the proposed method can double the performance of different graph algorithms and data sets.


page 1

page 2

page 3

page 4


A Closer Look at Lightweight Graph Reordering

Graph analytics power a range of applications in areas as diverse as fin...

Towards Improving Brandes' Algorithm for Betweenness Centrality

Betweenness centrality, measuring how many shortest paths pass through a...

Efficient Constructions for the Győri-Lovász Theorem on Almost Chordal Graphs

In the 1970s, Győri and Lovász showed that for a k-connected n-vertex gr...

A New Perspective of Graph Data and A Generic and Efficient Method for Large Scale Graph Data Traversal

The BFS algorithm is a basic graph data processing algorithm and many ot...

On the coalition number of graphs

Let G be a graph with vertex set V. Two disjoint sets V_1, V_2 ⊆ V form ...

Scalable Mining of Maximal Quasi-Cliques: An Algorithm-System Codesign Approach

Given a user-specified minimum degree threshold γ, a γ-quasi-clique is a...

Load-Balancing for Parallel Delaunay Triangulations

Computing the Delaunay triangulation (DT) of a given point set in R^D is...

1 Introduction

Incrementally described in the form of graph in big data applications, graph are processed in an iterative manner. For example, search services (such as Google [1]) use PageRank algorithm to sort results, social networks (such as Facebook [2]) use Clustering algorithm to analyze user communities, knowledge sharing sites (such as Wikipedia [3]

) use Named Entity Recognition algorithm to identify text information, video sites (such as Netflix 

[4] and Anysee [5]) Based on Collaborative Filtering algorithm to provide film and television recommendations. Relevant studies indicate that the computational and storage characteristics of graph computing make it difficult for data-oriented parallel programming models to provide efficient support. The lack of description of correlation between data and inefficient support for iterative calculations can result in multiple times. Dozens of times the performance loss. The urgent need for an efficient Graph Computation system has made it one of the most important issues to be solved in the field of parallel and distributed processing. Current graph system processing strategy [6, 7, 8, 9, 10, 11, 12] still lack of efficiency which listed below: (1) High cache miss rate; (2) Large I/O access overhead; (3) Slow convergence rate of large-scale graph data.

We profiled the solutions that resulted in the low performance of the existing representative graph systems.Due to the small-world phenomenon, the graph vertices will obey the power function distribution. A few graph vertices will connect the vast majority of graph vertices, while the vast majority of these vertices need to transfer state through these few vertices. Therefore, frequent visits and updates are needed for these core graph vertices while other vertices shortly converge, resulting in low frequency of access, thus confronting the problem mentioned above. So this paper adopts the graph partition of the dynamic increment, which will be explained explicitly in Section 3.

Currently, some work has already been done for graph partition of power law graph, but most of them are based on a distributed environment, regarding the underlying computing nodes as equivalent nodes. However, most graph processing methods treat the underlying graph data as black boxes, lacking research on dynamic graph partitioning and graph processing based on graph structure. However, in the real world, the graph structure is constantly changing. With iteration, a large number of graph vertices may converge in the graph partition. Frequent accesses to a small number of active vertices may result in repetitive loading of the entire graph partition including convergence, but these convergent vertices do not require access and processing, which leads to the severe waste of memory bandwidth and cache. The existing method does not consider the structural features of each partition, and the graph algorithm requires more update times for convergence and each update requires large overhead.

The graph vertex degree and its state degree have particularly critical influence on the convergence of graph vertices. Meanwhile, they also determine the processing order of the graph vertices. In the case of PowerSwitch system as shown in figure [7], vertex 1 has a large degree and is more active. Theoretically, asynchronous method should be adopted to increase the convergence speed as a large number of graph vertices () require state transfer through active vertices. After updating its own data by asynchronous method, each vertex will be immediately updated through sending messages, so that the neighbors can be calculated by using the latest data. The vertices () have lower degree and will shortly converge, and it is of no high value to adopt asynchronous system to increase the convergence rate. The synchronization system should be adopted to reduce the cache miss rate and the time required for state updates of graph vertices.

Presently, the graph structure can be diverse, and its processing performance can be more different in a uniform way. Secondly, the graph structure formed by the unconverged graph vertices are constantly changing in operation, causing large fluctuations in performance. According to the above reasons, the paper proposes graph processing methods for graph structure perception. This paper incrementally obtain the graph structure characteristics formed by unconverged graph vertices in accordance with the analysis, adopting a suitable graph processing method for each graph partition block adaptively according to the underlying operation environment (the processor load, cache miss rate, etc. in each graph partition). More specifically, the main contributions of this work are summarized as follows:

  • This paper analyzes the existing problems in the state-of-the-art distributed graph processing system and points out that the current graph processing system is lacked with targeted processing in the graph structure, affecting system performance.

  • This paper proposes the structure-centered graph partition and graph processing. According to the graph structure (graph vertices heat, etc.), the graph is partitioned by dynamic increment manner. The order of block partition is processed according to the graph schedule map of graph partition state degree.

  • This paper uses the graph structure perception combined with feature analysis in operation to switch each block of graph partition to the appropriate processing method.

  • The method is applied in the latest system. Experiments with five applications on five real-world graphs show that Gemini significantly outperforms existing distributed implementations, and the performance is improved by 2 times.

The remainder of this paper is organized as follows. Section analyzed the defects of the existing graph processing system, which puts forward the dynamic graph partitioning and adaptive graph processing optimization strategy. Section presents the dynamic graph partitioning modus, followed by adaptive graph processing method in Section . Section shows experimental results. The related work is surveyed in Section , and finally, Section concludes this work.

2 Background and Motivation

With the present of big data era, increasing data applications needed to be expressed in the form of vertices and edges, and processed through iterations. While state-of-the-art graph processing systems mainly concentrated on solving load balancing and communication overhead among varies of runtime environment, therefor ignoring the graph structural features of input data which have great impact on system performance. First, Assorted graph structure been processed may lead to immense performance differences with unified method; Second, Structure variations of the vertices that haven’t converged in operation bring out volatile performance.

Figure 1: Ineffective graph processing of partitions

Graph processing methods are very sensitive to the graph structure, therefore graph processing systems show quite different performance among diverse data sets. Most of the previous graph processing methods treats underlying data as black boxes, take neither partitioning nor processing strategy accordingly. Current graph computing model research work are mainly carried out in two aspects: one is focusing on performance optimization for a certain pattern, the other providing a same interface for two patterns(Synchronous mode and Asynchronous mode) that allows the user to choose according to the algorithmic features.Three issues has arisen due to the ignorance of above model, including low cache hit ratios, high input/output overhead, and slow convergence of large scale data.

2.1 Disadvantages of Existing Methods

To study the performance lose, We select some typical graph processing algorithms: PR (PageRank), CC (Connected Components), SSSP (Single-Source Shortest Paths), BFS (Breadth-First Search) and BC (Betweenness Centrality), along with commonly used graph data sets: amazon-2008, WikiTalk and twitter-2010 to evaluate the performance otherness among different algorithms and data sets. We set up experiments on an 8-node high-performance cluster interconnected with Infiniband EDR network (with up to 100Gbps bandwidth), each node containing two Intel Xeon E5-2670 v3 CPUs (12 cores and 30MB L3 cache per CPU) and 128 GB DRAM. We run 100 iterations on Gemini.

Figure 5 shows the vertex convergence of six data sets with different structures under four algorithms through iterations. Figure 5 gives detailed cache miss rate for different algorithms under different data sets. As shown in Figure 5, for the same data set, The structure of subgraph that non-convergent vertices composed of change continuously, traditional methods lack of the reflection to the diversity and dynamic changes of the graph structure, but a integrated graph partitioning and processing methods. The above strategies may depress convergence rate of the whole algorithm: In the iteration, some less active vertices have already converged while other remain active, which keep the entire partition loaded uninterruptedly and lead to decline in cache miss rate. (See Figure I)

2.2 Optimized strategy

We argue that inefficiency of traditional strategies mainly illustrated by following three points:

(1) Static graph partition methods. Structural diversification caused by vertex convergence is not considered. After one iteration. large number of vertices in each partition may converge, several vertices remain active, which result in frequent loading of a whole cache block, eventually wasting memory bandwidth, reducing cache hit rate, and frequent IO as well.

(2) Unified message processing mechanism. In terms of graph processing, The structural differences among graph partitions haven’t been considered by the message passing model of existing systems, instead, they adopt a unified message processing mechanism. Some graph processing systems, such as  [7], allow switching execution modes between synchronous and asynchronism, but are indistinguishably operated on all blocks. When synchronous message passing mechanism is adopted, the convergence speed of graph partition with more active vertices is limited. When asynchronous, High cache miss rate occur in partitions with less active vertices.

(3) Equal treatment to all graph partitions. The partitions are all treated the same as giving the same weight, nevertheless, It is known that natural graphs subject to skewed power-law degree distribution, which means small portion of vertex connects bulk of edge. Therefore, Frequent IO and high cache miss rate will arise in the event of average vertices partition.

For the reasons mentioned above, We present a novel graph structure-aware technique in the paper that obtains graph structure of the vertices that are not convergent by the analysis, and then incrementally partition the graph. After dynamic partition, We schedule the processing order of graph partitions, and for each iteration, adaptively choosing appropriate way to processing the graph partitions.In Summary, we have the following contributions:

  • Our partition method separates the hot vertices from the cold, which endues the former with frequent update and significant change a higher priority, and reach the convergence faster, eventually reduce the average number of updates that an input graph needs to achieve converge.

  • After The graph partition s with dramatically drop-off in active vertices will be repartitioned after specific times of iterations. This method, on the one hand, takes the load balance problem caused by the change of graph structure into consideration, on the other hand, controls the computation overhead caused by the migration of vertices during dynamic graph partition.

  • We put high activity vertices with frequent updates into the same cache, for the vertices will be loaded in memory at the same time. By doing this, we reduce the overhead caused by inactive vertices and their loading times as well.

3 Dynamic Graph Partition

Due to the small-world phenomenon, the graph vertices will obey the power function distribution. A few graph vertices will connect with the vast majority of graph vertices, while the vast majority of these vertices need to transfer state through these few vertices. Therefore, frequent visits and updates are needed for these core graph vertices while other vertices rapidly reaching convergence, resulting in low frequency of access, thus confronting the problem mentioned above. Consequently, according to changes in graph structure caused by the convergence of some vertices during iteration. In this paper, partitions will be redivided, the less active vertices will be moved together to decrease the calculation frequency by graph partition manner of dynamic increment, thereby reducing the I/O overhead caused by active vertices and lowering the cache miss rate.

Figure 2: Example graph

3.1 Active Degree and State Degree

Before getting to details, let us first give the targeted graph processing concepts. As the graph data increases dramatically, the researchers divided the graph data into several partitions and assigned the closely-related graph vertices to the same partition in order to accelerate convergence of the graph vertices. The input graph data is represented by . While represents all the vertices and represents the edges of all the connected vertices. The current graph processing system stores the updated messages in the vertices by default, and the edges exist as fixed values. Therefore, the vertex degree is regarded as a fixed value in the computing.

Symbol Definition
In-degree of vertex
Out-degree of vertex
Degree function of vertex
The maximum degree of all vertices
State degree of vertex
Active degree of vertex
Iteration that re-partitioning the partitions
Iteration that schedule cold partitions to compute
Threshold of vertices active degree
Threshold of vertices convergence
Table I: Definitions of symbols

Degree In this paper, is used to represent in-degree of vertex . The larger the in-degree, the more easily the vertex is affected by the neighbors. Which means, only when most neighbor vertex converge can the vertex tend to converge. Therefore, in the practical computation, vertices with large in-degree should be delayed to reduce the number of unnecessary updates. indicates out-degree of vertex . The greater the out-degree, more vertices will be affected by its update state. That indicates that only when the vertex converges can its neighbors tend to converge. Thence, in the practical computing, vertices with large out-degree should be processed in priority to accelerate the entire graph convergence. Regarding which mentioned above, the paper puts forward the concept of vertex power function, which is used to quantify the static structure features of graph vertices. Its formula is as follows:


The parameter 0.51 is an adjustable parameter, which is dynamically adjusted according to different data sets in the actual computation in order to achieve optimal performance. It can be a challenge to select the condition to match the value when computing heat value. The basis for selection is: value is adjusted according to the structure of input graph data. In the case of road network, a data set, each vertex has even in-edge and out-edge distributions and most graphs have similar vertex activity. The entire graph is of even distribution with value a trending to 0.5. However, in the case of a data set focused on by Weibo users, a few celebrities will have a large number of followers while most people have few followers, which leads to data skew. It amplifies the influence of vertex out-edge on the convergence of the entire graph, so value will trend to 1 accordingly.

Active Degree The vertex activity depends not only on its degree function, but also on its neighbor vertex structure. In order to predict the initial activity information of each vertex in an input graph data set, the graph data is optimally partitioned under the condition of guaranteeing load balancing while improving the computing efficiency of subsequent iterations. The paper puts forward the structure features of quantification graph active degree, which are used as reference factors for the initial graph partition of data of data graph. It relies on the in-degree and out-degree of the vertex as well as the degree of its neighbors.To this end, We use the hot-cold notion as in HotGraph and present our active degree algorithm function, scilicet the following :

Figure 3: Initial Chunk-based partitioning

indicates its neighbors’ degrees while indicates the maximum degree of all vertices, here we explore the feasibility of extending such design with fine-grained quantification of graph structure. The major difference is decoupling the in-degree and out-degree of vertices, note that unlike in HotGraph, in this paper act like an degree function, taking in degree and out degree both into consideration and extending the graph data set to a more common directed graph.

is set as the active degree threshold, and is determined on the basis of user-defined sample size and the ratio of hot vertices, which follows T in HotGraph. For example: if the vertex number is viewed as 10000, the user-defined sample size is =1000 and the ratio of hot vertex is 0.1, then the active degree threshold is , ie the active degree of the 100th vertex in the sample.

The vertices with active degree value greater than are marked as hot vertices and are stored in the hot partition. The vertices with active degree value smaller than are marked as cold vertices and stored in the cold partition. The hot and cold partitions are physically composed of cache blocks. The hot and cold vertices are stored in multiple cache blocks. For instance, vertices with active degree value greater than 50 are hot vertices and the number is 200. On the contrary are cold vertices and the number is 2000. One cache block can store 100 vertices. Therefore, there are 2 hot partition and 20 cold partition. Particularly, vertices with 0 degree neither affect nor been affected by other vertices. Its convergence can be achieved in one iteration. This paper uniformly partitions them into regions with continuous addresses. The region is called as: dead partition.

State Degree According to the characteristics of input graph structure, evaluates the activity of vertices. As the vertices convergence, in iteration process, the activity would alter. Thereby, the state degree, represent the alteration of the activity of vertices in iteration process. The state degree means that when the state degree is higher, the state of graph vertices changed more. Also, more activity vertices have more influence on neighbor vertices. only when the vertex converges can its neighbors tend to converge. Otherwise, the low state degree vertices would continue to be updated. For different algorithmic, the definition of state degree and the methods of calculation are different, we will elaborate on the state degree formula corresponding to the common graph algorithm in section 3.3.

The partition state degree, is the average of all vertices state degree accumulation in this partition. As a result of separation according to active degree value, the state degree of hot vertices is high and the state degree of cold vertices is low, which avoid this situation where low state degree vertices are more and there are fewer high state degree vertices so that the partition state degree improve. In conclusion, that the average of all vertices state degree accumulation in this partition is regarded as the state degree of the whole partition is reasonable.

The vertices state degree, and the partition state degree, are applied in evaluating the activity of graph vertices and partition, respectively. In order to the whole graph can be convergence rapidly, as well as making high state degree vertices synchronous load to reduce cache invalidation, high state degree vertices and partition would be dealt priority.

Vertices active degree, and state degree, play an important role in vertices separation. This essay gives details about how to divide input graph structure and initial graph based in vertices activity in section 3.2 and section 3.3.

3.2 Activity-based Partitioning

In order to reduce the average number of updates that an input graph needs to achieve converge, this paper propose a graph partition strategy about graph structure sense, which not only is extensive, but also it has a characteristics that when the scale of data is larger, the performance is better. According to graph vertices in-degree and out-degree, vertices active degree, is calculated. And then according to , graph vertices sort in descending. Based in this order, vertices are separated. The scale of partition is an exact of multiple of cache number. This separation act only is operated when data input. The time of reordering graph vertices is once in the whole algorithmic process. The expense produced by initial graph separation is divided to every time iteration. Not only it is helpful to improve cache hits rate, but also it lessen number of calculation. It can be proved that systems performance improvement is much more than extra expense. What’s more, for big scale input data, expense producing by every time become fewer contribution to great system extensiveness.

In initial iteration situation, 0 state degree vertices is investigated and put them into the dead partition. We Create a table named the first graph vertices degree table store vertices in-degree and out-degree. Moreover, We Create a table named the second graph vertices degree table to store position of neighbor vertices. The first graph vertices value table and the second graph vertices value table are applied in storing this time calculation value and last time calculation, respectively. Based in the value stored in the first graph vertices value table and the second graph vertices value table, vertices state degree and partition state degree can be known. To store partition ID and partition state degree, we create two tables, one is called ID table and the other one is partition state degree table. After this, we can separated hot partition and cold partition based in heat of vertices. As soon as all vertices are marked and separated to specific partitions, the table, partition state degree,is initialized and output initial partition.

1:procedure Active_Based Partition(v, D(v),D(v))
2:      expected chunk size remain amount remain partitions
3:      while  has unvisited vertex v do
4:            if  D(v) = and D(v) =  then
5:                 P v
6:            end if
7:            if  AD(v) T then
9:                 if  expected chunk size then
11:                 end if
12:                 hot partitions D(v)
13:                 P v
14:            end if
15:            if  AD(v) T then
17:                 if  expected chunk size then
19:                 end if
20:                 cold partitions D(v)
21:                 P v
22:            end if
23:      end while
24:end procedure
Algorithm 1 Initial Activity-based partitioning

Figure 5 gives an example of chunk-based partitioning, showing the vertex set on three nodes, with their corresponding dense mode edge sets. Knowing graph vertices active degree value and sorting them in descending relying on , we separate graph vertices to two partition, and . Each partition is made up of equal cache blocks. To read data conveniently, the scale of cache block is designed as the integral multiple of cache page number. For 0 state degree vertices, not only it does not received message of neighbor vertices, but also it can not transfer and update. And only one iteration can make it convergence. For this reason, we filter 0 state degree vertices firstly and deal with these data alone, which means these vertices would be separated when the state degree of vertices is calculated, would separate store them and would calculate priority in adaptive schedule period. When the iteration of 0 state degree vertices is achieved, there is no any act to reduce expense of iteration.

Figure 4: Dynamic Structure-based graph partition for PageRank
Figure 5: The comparision of cold partition and hot partition

Due to the constant of edge data and input/output degree, we can preprocess input data and distinguish hot vertices and cold vertices relying on edge function, which is useful to increase rate of cache hits rate and decrease expense of I/O. Also, it is a great way to lessen number of iteration. Hot vertices become cold is a common trend in iteration process. There is a few cold vertices affected by neighbor vertices to be hot. It is essential to improve system performance which is separated based in heat graph partition. However, as the graph vertices convergence in calculation process, graph structure would modified so that it can not satisfy requirement that in initial partition, the high activity vertices is calculated priority during iteration. Because of this situation, dynamic increment graph partition is proposed. In initial partition, it would be separated again, according to graph vertices state.

3.3 Structure-based Partitioning

After a certain number of iteration, due to hot vertices convergence continuously, the number of hot vertices plumped. In order to make expense fall, hot partition would be rescheduled which would not result plenty of expense. Only to a marked variance is required to be updated. Time Complexity is


The accumulation of the vertex state degree is obtained every iteration to obtain the average block state degree of all the hot and cold partitions and to determine whether there are hot partition with value smaller than the threshold and whether there are cold partition with value larger than threshold . The hot partition with decreasing activity can be marked as the cold partition, and similarly, the cold partition with increasing activity can be marked as the hot partition. Because in the previous section, it is divided according to active degree value. Generally speaking, the state degree of hot vertices is higher while the state degree of the cold vertex is lower. So the phenomenon will not exist that many vertices with low state degree are in the partition while a few vertices with high state degree raise the state degree of the whole partition. Therefore, it is reasonable to use the average value of the vertex state degree in the partition as the state degree of the entire partition.

However, for some graph algorithms such as , the graph data shows the whole tendency from dense state to sparse state under these algorithms. The case fails to exist that the cold notion tan become the hot notion. In order to optimize the algorithm to reduce the program space occupation, the border variable barrier is maintained to partition cold and hot vertices. As the hot block gradually becomes cold, the barrier also moves accordingly. Compared with the universal partition method mentioned above which requires maintaining a tag variable table, the method only needs to maintain a variable. However, for graph algorithms such as , the graph data tends to be dense and then tend to be sparse as a whole in these algorithms. That is to say, the cold vertices will first become hot and then converge, and a single barrier variable cannot represent the tendency. It requires the application of the universal method first proposed.

1:function Process_Vertex(v, curr[ ], nexr[ ])
2:      #Pragma omp parallel reduction(:reducer)
3:      while  active vertices all been visited do
4:            local_reducer local_reducer Process(v, curr[v], next[v])
5:            v
6:      end while
7:      reducer reducer local_reducer
8:      end Pragma
9:      global_reducer global_reducer reducer
10:      return global_reducer
11:end function
13:procedure Structed_Based Partition(barrier, curr , nexr )
14:      if iteration == I then
15:            for P and P have all been processed do
16:                 for v belongs to Partition i do
17:                       Process_Vertex(v, curr[ ], nexr[ ])
18:                 end for
19:                 if SD(P) T and P then
20:                       P Partition i
22:                 end if
23:                 if SD(P) T and P then
24:                       P Partition i
25:                 end if
26:            end for
27:      end if
28:end procedure
Algorithm 2 Dynamic Structure-based Partition

Figure [7] indicate the process of dynamic graph partitioning in . According to the accumulation state degree of each vertices, the average state degree of hot partition can be calculated. Find out the partition whose average state degree is less than . And then, value of barrier is changed to be ID of the first vertices. This separation methods separate hot vertices again, but for cold vertices there is no effect. When it is calculated, hot vertices would fewer and fewer. Therefore, it is obvious that the scale of reschedule would zoom out.

When measuring the value of , the edge would be operated and divided, so the results relates to input edge and output edge of the vertex. The in-degree and the out-degree would straightly affected the vertices convergence. Hence, the difference of could be applied in evaluation the activity of vertices, which means, for , the definition of state degree is accumulation of the difference between this algorithmic result and last algorithmic result. Give an example, assume that the first result is default 1 and there is no accumulation result at first time. If the second result is 5,and then it is obvious that the difference is 4. The accumulation is also 4. If the third result is 7, the difference between this result 7 and last result 5 is 2 so that the accumulation is 6, 4 and 2.


Figure [8] indicate the process of dynamic graph partitioning in . According to the accumulation state degree of each vertices, the average state degree of hot partition can be calculated. The partition whose average state degree is less than , is marked as hot partition. Otherwise, it is marked as cold partition.

For , there is the same methods. When was applied, because the calculation of the shortest path is relate to the accumulation account, it is not adaptable to evaluate the activity with the difference. In this methods, the smaller edge data between two calculation results is utilized, which is accumulated to decide whether there is modification of the vertices activity. Consequently, for , the definition of state degree is the accumulation of the smaller edge data between this result and last result. Same analogy to , it take a maximum clique. In this situation, the definition of state degree is the accumulation of the larger between this result and last result. The example of is not a separate example here.

Figure 6: Dynamic Structure-based graph partition for SSSP

The number of dynamic Structure-based Partition is positively correlated with the number of iteration. For this reason, when the number of iteration goes up, the adjacent interval of rescheduling increases. Under the condition of ensuring the right results of algorithmic and avoiding extra expense, the rate of convergence is improved and the absence rate of cache is decreased.

4 Adaptive Partition Scheduling

In adaptive scheduling period, because of the different convergence rate of each vertices and the distinguish of the state degree of graph vertices, in calculating the accumulation of state degree of graph vertices process, the vertices that vary more frequent and greater have priority to be measured, as well as increasing the rate of convergence of graph vertices and shortening algorithmic running time. During iteration, hot partition would be separated again. After separation, if there is still hot partition, we adaptively scheduling hot partition and cold partition for calculation. If there is no situation where the whole graph is not convergence, the highest state degree cold partition is measured.

When there is hot partition after rescheduling, in each iteration process, the highest state degree cold partition and highest state degree hot partition are operated. The value of keep pace with the number of CPU. For example, if the number of CPU is 10, would be 10. For iteration, the value of and is decided by the algorithmic. Usually, it have to satisfy the condition . It means that each time in hot partition the highest state degree cache partition are chosen and in cold partition the n highest state degree cache partition are chosen. On the contrary, if it is not iteration, we only apply the highest state degree hot partition. Thus, is equal to 0, and is equal to the number of CPU and equal to 10. Interval vertices stores with orders in ID sequence. If we need the specific partition to calculate, it represents reading in ascending ID order.

The sum of state degree values with all partitions is computed based on the partition state degree values stored in the partition state degree table. The smaller the state degree is, the closer the vertices are to convergence. When the sum of partition state degrees is smaller than a minimum value , it can be regarded that the entire graph converges. Therefore, when the sum of state degree values with all partitions is smaller than the convergence threshold, it is determined that the entire picture converges, and the computation comes to an end and its result is output. Preferably, the specific value of the convergence threshold is defined by the user, and the default value is 0.000001.

According to a preferred mode of execution, the graph processing method stated further includes the step: judging whether it is the initial iteration. In the case of the first iteration, the block with the highest state degree in the hot partition is scheduled to be computed on the basis of computation the mentioned dead partition, and the convergence of the entire graph is determined after the iterative computations based on the sum of status degree value with all partitions. In the case that the entire graph does not converge, the subsequent iteration is proceeded.

1:function Procee_Active(m, n, Partition P, P)
2:      threads = numa_num_configured_cpus()
3:      for each Partition p do
4:            for Vertex v belongs to Partition p do
5:                 SD(p) SD(p) + Process_Vertices(Process(), V)
6:            end for
7:      end for
8:      if  Still remains P then
9:            if iterations == I then
10:                 actives vertices m * P + n * P
11:            else
12:                 actives vertices threads * P
13:            end if
14:      end if
15:      if  Only remains P then
16:            actives vertices threads * P
17:      end if
18:      return actives vertices
19:end function
21:procedure Scheduling(active vertices)
22:      for Still remains untraversed Partition do
23:            Send edge in Partition p to other nodes
24:      end for
25:      for edge in Partition p hasn’t all been received do
26:            Receive edge in Partition p from other nodes
27:      end for
28:      if  P then
29:            master mirror vertex update
30:      end if
31:      if  P then
32:            mirror master vertex update
33:      end if
34:end procedure
Algorithm 3 Adaptive Partition Scheduling
Figure 7: Adaptive Partition Scheduling

One of the challenges of adaptive scheduling is to ensure that the hot partition is sufficiently computed. When the number of hot partitions is greater than that of machine threads, it indicates that one single iteration fails to make all hot partitions to be computed. Therefore, it is necessary to ensure for scheduling of partition that the hot partition with higher heat can be computed after the activity of the hot partition is reduced. It should be noted that it is a long process when the hot partition is repeatedly computed and the activity declines, trending to the activity of the cold partition. Due to the complexity of the structure, even the computation number increases, the convergence condition the hot partition requires is also more than that of the cold partition. And when all the hot partitions tend to converge, the entire graph will tend to converge and the graph algorithm will also be close to the end of computation.

In addition, regarding the convergence threshold: (1) Whether the algorithm actually converges or whether the computation is completed has nothing to do with the convergence threshold. is just a value for judging whether the current algorithm converges as the accomplish time of computation cannot be known in operational process of algorithm. Therefore, the state degree should be obtained at regular intervals to obtain the result that if it has converged compared with . Consequently, it is not the case that the smaller the D2 is set, the faster the algorithm converges. (2) It does not make much sense that different convergence thresholds should be set based on different algorithms or application cases. Because the state degree of all algorithms can be 0 when reaching the convergence. It requires a relatively long time from 0.000001 to 0. In order to improve performance, the state degree 0.000001 is considered as convergence, and the convergence threshold is fixed at 0.000001 while the algorithm result is within the tolerance range.

5 Related Work

Previous graph processing systems, over distributed [13, 14, 6, 15, 16, 7, 8, 9] or single multi-core platform [10, 17, 18, 19, 11, 12], have done plenty of work on effective graph processing such as load balancing and communication overhead reducing. Most of those approaches treat graph data as black box, which means, graph have been managed as a combination of vertices and edges (i.e.Vertex-centric and Edge-centric) rather than logical structure, the difference among which generates performance variability. In this section, we give a brief summary of related categories of prior work.

Distribute Systems: Pregel [13] divide the graph by hashing the vertex id which ensure loading balancing. Yet Pregel uses message communication paradigm, messages that need to be processed will be huge when vertices has many adjacent points. Coincidentally, Pregel performs inefficiency in power-law graph and only allows global synchronization. X-Pregel [14] optimizes Pregel’s messaging mechanism by reducing the number of messages that needed to be delivered in every iteration. Giraph [16] adds more features compare to Pregel, including master computation, out-of-core computation, etc. But the poor locality of data access limits its effective.

GraphLab [16] follows the vertex-centric GAS model, but its partitions are still obtained by randomly division. On the other hand, Its shared memory storage strategy may have performance bottlenecks for large graphs. PowerGraph [15] works well on power-law graph, but no special optimizations are considered for speeding up I/O access just as GraphLab. PowerSwitch [7] proposed an adaptive graph processing method based on PowerGraph, adaptively switching between synchronous and asynchronous processing modes according to the amount of vertices processed per unit time to achieve the best performance. However, it treats all the vertices of a graph as the same, and does not handle the convergence according to the vertices in the iteration. PREDIcT [20] proposes an experimental methodology for predicting the runtime of iterative algorithms, which optimizes cluster resource allocations among multiple workloads of iterative algorithms.

Maiter [21] propose delta-based accumulative iterative computation which reduce costs and accelerate calculations.HybridGraph [8] puts forward a algorithm adaptively switching between pull and push, focusing on performing graph analysis on a cluster IO-efficiently. Compare to GraphLab, PowerGraph employs a vertex-cut mechanism to reduce the network cost of sending requests and transferring messages at the expense of incurring the space cost of vertex replications. GrapH [22] focus on minimize overall communication costs by using an adaptive edge migration strategy to avoid frequent communication over expensive network links. Gemini [9] is a computation-centric distributed graph processing system that uses a hybrid pull/push approach to facilitate state updates and messaging of graph vertices.

Single-machine Systems: GraphChi [10] is a vertex-centric graph processing system and improve IO access efficiency by parallel Sliding Window processing strategy. But the outgoing edges of all vertices have to be loaded into memory before computation, resulting in unnecessary transfer of disk data. Also, all memory blocks have to be scanned when accessing neighboring vertices, which lead to inefficient graph traversal. TurboGraph [17] proposed a Pin-And-Slide model to solve this problem. PAS has no delay in dealing with local graph data, but only applies to some specific parallel algorithms. Compare to the two above, VENUS [18] expands to nearly every algorithm and enables streamlined processing which performs computation while the data is streaming in. Moreover, it uses a fixed buffer to cache the v-shard, which can reduce random IO.

GridGraph [19] uses a 2-level Hierarchical Partitioning scheme to reduce the amount of data transfer, enable streamlined disk access, and maintain locality. But it requires more disk data transfer using TurboGraph-like updating strategy. Besides, it cannot fully utilize the parallelism of multi-thread CPU without sorted edges. NXgraph [11] propose the Destination-Sorted Sub-Shard (DSSS) structure to store graph with three updating strategies: SPU, DPU and MPU. it adaptively choose suitable one to fully utilize the memory space and reduce the amount of data transfer. It achieves higher locality than v-shards in VENUS [18] and reduces the amount of data transfer and enables streamlined disk access pattern. Mosaic [12] combines fast host processors for concentrated memory-intensive operations, with coprocessors for compute and I/O intensive components.

Traditional graph systems, either memory-share nor distribute, take the variable of graph structure into consideration, which appears through constantly convergence of vertices during iterations and plays significant role in program optimization. In this case, We present a novel graph structure-aware technique in the paper that provide adaptive graph partitioning and processing scheduling according to the variety of graph structure. Our strategy reduce the overhead caused by inactive vertices and their loading times as well speed up convergence rate.

6 Conclusion

In this paper, We adopted a structure-centric distributed graph processing method, Through graph structure perception, graph structure features of unconvergent vertices are incrementally obtained according to analysis, adaptively scheduling suitable graph processing methods. Our development reveal that (1) The dynamic incremental partitioning of vertex degree and state degree can significantly reducing IO resource overhead and cache miss rate, and (2) Computation and communication overhead of less active vertices can be reduced by setting priority of graph partitions and scheduling them based on predestinated order, and accelerated the algorithm convergence as well. Our experimental results on a variety of different data sets and their structural features of the graph demonstrate the efficiency, effectiveness and scalability of our approach, in comparison to state-of-the-art race detection approaches.


  • [1] “Google,” http://www.google.com/, 2018.
  • [2] “facebook,” http://www.facebook.com/, 2018.
  • [3] “Wikipedia,” http://zh.wikipedia.org/, 2018.
  • [4] “Netflix,” http://www.netflix.com/, 2018.
  • [5] X. Liao, H. Jin, Y. Liu, L. M. Ni, and D. Deng, “Anysee: Peer-to-peer live streaming,” in Proceedings of the IEEE International Conference on Computer Communications.   IEEE, 2006, pp. 1–10.
  • [6]

    Y. Low, D. Bickson, J. Gonzalez, C. Guestrin, A. Kyrola, and J. M. Hellerstein, “Distributed graphlab: a framework for machine learning and data mining in the cloud,”

    Proceedings of the VLDB Endowment, vol. 5, no. 8, pp. 716–727, 2012.
  • [7] C. Xie, R. Chen, H. Guan, B. Zang, and H. Chen, “Sync or async: Time to fuse for distributed graph-parallel computation,” in ACM SIGPLAN Notices, vol. 50, no. 8.   ACM, 2015, pp. 194–204.
  • [8] Z. Wang, Y. Gu, Y. Bao, G. Yu, and J. X. Yu, “Hybrid pulling/pushing for i/o-efficient distributed and iterative graph computing,” in Proceedings of the 2016 International Conference on Management of Data.   ACM, 2016, pp. 479–494.
  • [9] X. Zhu, W. Chen, W. Zheng, and X. Ma, “Gemini: A computation-centric distributed graph processing system.” in OSDI, 2016, pp. 301–316.
  • [10] A. Kyrola, G. E. Blelloch, and C. Guestrin, “Graphchi: Large-scale graph computation on just a pc,” in OSDI.   USENIX, 2012.
  • [11] Y. Chi, G. Dai, Y. Wang, G. Sun, G. Li, and H. Yang, “Nxgraph: An efficient graph processing system on a single machine,” in Data Engineering (ICDE), 2016 IEEE 32nd International Conference on.   IEEE, 2016, pp. 409–420.
  • [12] S. Maass, C. Min, S. Kashyap, W. Kang, M. Kumar, and T. Kim, “Mosaic: Processing a trillion-edge graph on a single machine,” in Proceedings of the Twelfth European Conference on Computer Systems.   ACM, 2017, pp. 527–543.
  • [13] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski, “Pregel: a system for large-scale graph processing,” in Proceedings of the 2010 ACM SIGMOD International Conference on Management of data.   ACM, 2010, pp. 135–146.
  • [14] N. T. Bao and T. Suzumura, “Towards highly scalable pregel-based graph processing platform with x10,” in Proceedings of the 22nd International Conference on World Wide Web.   ACM, 2013, pp. 501–508.
  • [15] J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin, “Powergraph: distributed graph-parallel computation on natural graphs.” in OSDI, vol. 12, no. 1, 2012, p. 2.
  • [16] C. Avery, “Giraph: Large-scale graph processing infrastructure on hadoop,” in Proceedings of the Hadoop Summit. Santa Clara, vol. 11, no. 3, 2011, pp. 5–9.
  • [17] W.-S. Han, S. Lee, K. Park, J.-H. Lee, M.-S. Kim, J. Kim, and H. Yu, “Turbograph: a fast parallel graph engine handling billion-scale graphs in a single pc,” in Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining.   ACM, 2013, pp. 77–85.
  • [18] J. Cheng, Q. Liu, Z. Li, W. Fan, J. C. Lui, and C. He, “Venus: Vertex-centric streamlined graph computation on a single pc,” in Data Engineering (ICDE), 2015 IEEE 31st International Conference on.   IEEE, 2015, pp. 1131–1142.
  • [19] X. Zhu, W. Han, and W. Chen, “Gridgraph: Large-scale graph processing on a single machine using 2-level hierarchical partitioning.” in USENIX Annual Technical Conference, 2015, pp. 375–386.
  • [20] A. B. Adrian Daniel Popescu, “Towards predicting the runtime of iterative analytics with predict,” in Ecole Polytechnique Federal de Lausanne.   EPFL, 2013.
  • [21] L. G. Yanfeng Zhang, Qixin Gao, “Maiter: An asynchronous graph processing framework for delta-based accumulative iterative computation,” in IEEE transactions on parallel and distribution and distrbuted systems.   IEEE, 2014, pp. 2091–2100.
  • [22] M. A. T. Christian Mayer, “Graph: Heterogeneity-aware graph computation with adaptive partitioning,” in 2016 IEEE 36th International Conference on Distributed Computing Systems.   IEEE, 2016.