Distributed Evolutionary Graph Partitioning

10/03/2011 ∙ by Peter Sanders, et al. ∙ 0

We present a novel distributed evolutionary algorithm, KaFFPaE, to solve the Graph Partitioning Problem, which makes use of KaFFPa (Karlsruhe Fast Flow Partitioner). The use of our multilevel graph partitioner KaFFPa provides new effective crossover and mutation operators. By combining these with a scalable communication protocol we obtain a system that is able to improve the best known partitioning results for many inputs in a very short amount of time. For example, in Walshaw's well known benchmark tables we are able to improve or recompute 76

1 Introduction

Problems of graph partitioning arise in various areas of computer science, engineering, and related fields, for example in high performance computing [27], community detection in social networks [25] and route planning [4]. In particular the graph partitioning problem is very valuable for parallel computing. In this area, graph partitioning is mostly used to partition the underlying graph model of computation and communication. Roughly speaking, vertices in this graph represent computation units and edges denote communication. This graph needs to be partitioned such that there are few edges between the blocks (pieces). In particular, if we want to use $k$ processors we want to partition the graph into $k$ blocks of about equal size.

In this paper we focus on a version of the problem that constrains the maximum block size to $(1+\epsilon)$ times the average block size and tries to minimize the total cut size, i.e., the number of edges that run between blocks. It is well known that this problem is NP-complete [7] and that there is no approximation algorithm with a constant ratio factor for general graphs [7]. Therefore mostly heuristic algorithms are used in practice.

A successful heuristic for partitioning large graphs is the multilevel graph partitioning (MGP) approach depicted in Figure 1 where the graph is recursively contracted to achieve smaller graphs which should reflect the same basic structure as the input graph. After applying an initial partitioning algorithm to the smallest graph, the contraction is undone and, at each level, a local refinement method is used to improve the partitioning induced by the coarser level.

The main focus of this paper is a technique which integrates an evolutionary search algorithm with our multilevel graph partitioner KaFFPa and its scalable parallelization. We present novel mutation and combine operators which, in contrast to previous methods that use a graph partitioner [28, 11], do not need random perturbations of edge weights. We show in Section 6 that the usage of edge weight perturbations decreases the overall quality of the underlying graph partitioner. The new combine operators enable us to combine individuals of different kinds (see Section 4 for more details). Due to the parallelization our system is able to compute partitions whose quality is comparable to or better than previous entries in Walshaw's well known partitioning benchmark within a few minutes for graphs of moderate size. Previous methods of Soper et al. [28] required runtimes of up to one week for graphs of that size. We therefore believe that in contrast to previous methods, our method is very valuable in the area of high performance computing.

The paper is organized as follows. We begin in Section 2 by introducing basic concepts. After briefly presenting related work in Section 3, we continue describing the main evolutionary components in Section 4 and their parallelization in Section 5. A summary of extensive experiments done to tune the algorithm and evaluate its performance is presented in Section 6. A brief outline of the techniques used in the multilevel graph partitioner KaFFPa is provided in Appendix A. We have implemented these techniques in the graph partitioner KaFFPaE (Karlsruhe Fast Flow Partitioner Evolutionary) which is written in C++. Experiments reported in Section 6 indicate that KaFFPaE is able to compute partitions of very high quality and scales well to large networks and machines.

Figure 1: Multilevel graph partitioning.

2 Preliminaries

2.1 Basic concepts

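Consider an undirected graph $G = (V, E, c, \omega)$ with edge weights $\omega : E \to \mathbb{R}_{>0}$, node weights $c : V \to \mathbb{R}_{\geq 0}$, $n = |V|$, and $m = |E|$. We extend $c$ and $\omega$ to sets, i.e., $c(V') := \sum_{v \in V'} c(v)$ and $\omega(E') := \sum_{e \in E'} \omega(e)$. $\Gamma(v) := \{u : \{v, u\} \in E\}$ denotes the neighbors of $v$. We are looking for blocks of nodes $V_1, \ldots, V_k$ that partition $V$, i.e., $V_1 \cup \cdots \cup V_k = V$ and $V_i \cap V_j = \emptyset$ for $i \neq j$. The balancing constraint demands that $\forall i \in \{1, \ldots, k\} : c(V_i) \leq L_{\max} := (1 + \epsilon)\, c(V)/k + \max_{v \in V} c(v)$ for some parameter $\epsilon$. The last term in this equation arises because each node is atomic and therefore a deviation of the heaviest node has to be allowed. The objective is to minimize the total cut $\sum_{i<j} \omega(E_{ij})$ where $E_{ij} := \{\{u, v\} \in E : u \in V_i, v \in V_j\}$. A clustering is also a partition of the nodes, however $k$ is usually not given in advance and the balance constraint is removed. A vertex $v \in V_i$ that has a neighbor $w \in V_j$, $i \neq j$, is a boundary vertex. An abstract view of the partitioned graph is the so called quotient graph, where vertices represent blocks and edges are induced by connectivity between blocks. Given two clusterings $\mathcal{C}_1$ and $\mathcal{C}_2$ the overlay clustering is the clustering where each block corresponds to a connected component of the graph $G_{\mathcal{E}} = (V, E \setminus \mathcal{E})$ where $\mathcal{E}$ is the union of the cut edges of $\mathcal{C}_1$ and $\mathcal{C}_2$, i.e. all edges that run between blocks in either $\mathcal{C}_1$ or $\mathcal{C}_2$. By default, our initial inputs will have unit edge and node weights. However, even those will be translated into weighted problems in the course of the algorithm.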
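The two quantities defined above, total cut and the balance bound, can be sketched in a few lines. This is only an illustration: the edge-list and dictionary representations and the function names are assumptions made here, and the bound is taken to be $(1+\epsilon)\,c(V)/k$ plus the weight of the heaviest node, as suggested by the remark about the last term.

```python
from collections import defaultdict

def edge_cut(edges, block_of):
    """Total weight of edges whose endpoints lie in different blocks.

    edges: iterable of (u, v, weight); block_of: dict node -> block id.
    """
    return sum(w for (u, v, w) in edges if block_of[u] != block_of[v])

def max_block_weight(node_weight, k, eps):
    """Assumed bound: (1 + eps) * c(V) / k + weight of the heaviest node."""
    total = sum(node_weight.values())
    return (1 + eps) * total / k + max(node_weight.values())

def is_balanced(node_weight, block_of, k, eps):
    """Check that every block respects the assumed balance bound."""
    load = defaultdict(float)
    for v, c in node_weight.items():
        load[block_of[v]] += c
    bound = max_block_weight(node_weight, k, eps)
    return all(weight <= bound for weight in load.values())
```

For a unit-weight 4-cycle split into two blocks of two adjacent nodes each, exactly two edges are cut and both blocks are well within the bound.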

A matching $M \subseteq E$ is a set of edges that do not share any common nodes, i.e., the graph $(V, M)$ has maximum degree one. Contracting an edge $\{u, v\}$ means to replace the nodes $u$ and $v$ by a new node $x$ connected to the former neighbors of $u$ and $v$. We set $c(x) = c(u) + c(v)$ so the weight of a node at each level is the number of nodes it is representing in the original graph. If replacing edges of the form $\{u, w\}$, $\{v, w\}$ would generate two parallel edges $\{x, w\}$, we insert a single edge with $\omega(\{x, w\}) = \omega(\{u, w\}) + \omega(\{v, w\})$. Uncontracting an edge $e$ undoes its contraction. In order to avoid tedious notation, $G$ will denote the current state of the graph before and after a (un)contraction unless we explicitly want to refer to different states of the graph. The multilevel approach to graph partitioning consists of three main phases. In the contraction (coarsening) phase, we iteratively identify matchings $M \subseteq E$ and contract the edges in $M$. Contraction should quickly reduce the size of the input and each computed level should reflect the global structure of the input network. Contraction is stopped when the graph is small enough to be directly partitioned using some expensive other algorithm. In the refinement (or uncoarsening) phase, the matchings are iteratively uncontracted. After uncontracting a matching, a refinement algorithm moves nodes between blocks in order to improve the cut size or balance.
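A single edge contraction, as described above, might look as follows. The adjacency-dictionary representation and the name `contract_edge` are assumptions for this sketch and are not taken from KaFFPa.

```python
def contract_edge(adj, node_weight, u, v, x):
    """Contract edge {u, v} into a new node x, in place.

    adj: dict node -> dict neighbor -> edge weight (symmetric).
    node_weight: dict node -> weight.
    """
    merged = {}
    for old in (u, v):
        for w, weight in adj[old].items():
            if w in (u, v):
                continue  # the contracted edge itself disappears
            # parallel edges {u, w} and {v, w} collapse into one summed edge
            merged[w] = merged.get(w, 0) + weight
    for old in (u, v):
        for w in adj[old]:
            if w not in (u, v):
                del adj[w][old]  # remove the back-reference to the old node
        del adj[old]
    adj[x] = merged
    for w, weight in merged.items():
        adj[w][x] = weight
    # the new node represents both old nodes
    node_weight[x] = node_weight.pop(u) + node_weight.pop(v)
```

Contracting all edges of a matching amounts to calling this once per matched edge; because matched edges share no nodes, the order of the calls does not matter.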

KaFFPa, which we use as a base case partitioner, extended the concept of iterated multilevel algorithms which was introduced by [29]. The main idea is to iterate the coarsening and uncoarsening phase. Once the graph is partitioned, edges that are between two blocks are not contracted. An F-cycle works as follows: on each level we perform at most two recursive calls using different random seeds during contraction and local search. A second recursive call is only made the second time that the algorithm reaches a particular level. As soon as the graph is partitioned, edges that are between blocks are not contracted. This ensures nondecreasing quality of the partition since our refinement algorithms guarantee no worsening and break ties randomly. These so called global search strategies are more effective than plain restarts of the algorithm. Extending this idea will yield the new combine and mutation operators described in Section 4.

Local search algorithms find good solutions in a very short amount of time but often get stuck in local optima. In contrast to local search algorithms, genetic/evolutionary algorithms are good at searching the problem space globally. However, genetic algorithms lack the ability of fine tuning a solution, so that local search algorithms can help to improve the performance of a genetic algorithm. The combination of an evolutionary algorithm with a local search algorithm is called a hybrid or memetic evolutionary algorithm [20].

3 Related Work

There has been a huge amount of research on graph partitioning, so we refer the reader to [15, 31] for more material on multilevel graph partitioning and to [20] for more material on genetic approaches for graph partitioning. All general purpose methods that are able to obtain good partitions for large real world graphs are based on the multilevel principle outlined in Section 2. Well known software packages based on this approach include Jostle [31], Metis [19], and Scotch [24]. KaFFPa [17] is an MGP algorithm using local improvement algorithms that are based on flows and more localized FM searches. It obtained the best results for many graphs in [28]. Since we use it as a base case partitioner it is described in more detail in Appendix A. KaSPar [23] is a graph partitioner based on the central idea to (un)contract only a single edge between two levels. KaPPa [17] is a "classical" matching based MGP algorithm designed for scalable parallel execution.

Soper et al. [28] provided the first algorithm that combined an evolutionary search algorithm with a multilevel graph partitioner. Here crossover and mutation operators have been used to compute edge biases, which yield hints for the underlying multilevel graph partitioner. Benlic et al. [5] provided a multilevel memetic algorithm for balanced graph partitioning. This approach is able to compute many entries in Walshaw’s Benchmark Archive [28] for the case . PROBE [8] is a meta-heuristic which can be viewed as a genetic algorithm without selection. It outperforms other metaheuristics, but it is restricted to the case and .

Very recently an algorithm called PUNCH [11] has been introduced. This approach is not based on the multilevel principle. However, it creates a coarse version of the graph based on the notion of natural cuts. Natural cuts are relatively sparse cuts close to denser areas. They are discovered by finding minimum cuts between carefully chosen regions of the graph. They introduced an evolutionary algorithm which is similar to Soper et al. [28], i.e. using a combine operator that computes edge biases yielding hints for the underlying graph partitioner. Experiments indicate that the algorithm computes very good partitions for road networks. For instances without a natural structure such as road networks, natural cuts are not very helpful.

4 Evolutionary Components

The general idea behind evolutionary algorithms (EA) is to use mechanisms which are highly inspired by biological evolution such as selection, mutation, recombination and survival of the fittest. An EA starts with a population of individuals (in our case partitions of the graph) and evolves the population into different populations over several rounds. In each round, the EA uses a selection rule based on the fitness of the individuals (in our case the edge cut) of the population to select good individuals and combine them to obtain improved offspring [16]. Note that we can use the cut as a fitness function since our partitioner almost always generates partitions that are within the given balance constraint, i.e. there is no need to use a penalty function or something similar to ensure that the final partitions generated by our algorithm are feasible. When an offspring is generated an eviction rule is used to select a member of the population and replace it with the new offspring. In general one has to take both into consideration, the fitness of an individual and the distance between individuals in the population [2]. Our algorithm generates only one offspring per generation. Such an evolutionary algorithm is called steady-state [9]. A typical structure of an evolutionary algorithm is depicted in Algorithm 1.

For an evolutionary algorithm it is of major importance to keep the diversity in the population high [2], i.e. the individuals should not become too similar, in order to avoid a premature convergence of the algorithm. In other words, to avoid getting stuck in local optima a procedure is needed that randomly perturbs the individuals. In classical evolutionary algorithms, this is done using a mutation operator. It is also important to have operators that introduce unexplored search space to the population. Through a new kind of crossover and mutation operators, introduced in Section 4.1, we introduce more elaborate diversification strategies which allow us to search the search space more effectively.

Interestingly, Inayoshi et al. [18] noticed that good local solutions of the graph partitioning problem tend to be close to one another. Boese et al. [6] showed that the quality of the local optima overall decreases as the distance from the global optimum increases. We will see in the following that our combine operators can exchange good parts of solutions quite effectively especially if they have a small distance.

  procedure steady-state-EA
   create initial population $P$
   while stopping criterion not fulfilled
    select parents $p_1, p_2$ from $P$
    combine $p_1$ with $p_2$ to create offspring $o$
    mutate offspring $o$
    evict individual in population using $o$
   return the fittest individual that occurred
Algorithm 1 A classic general steady-state evolutionary algorithm.
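The control flow of Algorithm 1 can be written as a small generic skeleton. The sketch below is not KaFFPaE code: the callables `fitness`, `combine`, `mutate` and `evict` are placeholders supplied by the caller, lower fitness is assumed to be better (as with the edge cut), and parents are drawn uniformly at random purely to keep the example short.

```python
import random

def steady_state_ea(init, fitness, combine, mutate, evict, rounds, rng):
    """Generic steady-state loop: one offspring per round.

    init: initial individuals (at least two); fitness: lower is better.
    evict(population, offspring) decides whether and where to insert.
    """
    population = list(init)
    best = min(population, key=fitness)
    for _ in range(rounds):
        p1, p2 = rng.sample(population, 2)      # placeholder selection rule
        offspring = mutate(combine(p1, p2, rng), rng)
        evict(population, offspring)            # may replace one individual
        if fitness(offspring) < fitness(best):
            best = offspring                    # remember the fittest seen
    return best
```

Because the best individual is tracked outside the population, the returned result can never be worse than the best initial individual, whatever the eviction rule does.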

4.1 Combine Operators

We now describe the general combine operator framework. This is followed by three instantiations of this framework. In contrast to previous methods that use a multilevel framework our combine operators do not need perturbations of edge weights since we integrate the operators into our partitioner and do not use it as a complete black box.

Furthermore all of our combine operators assure that the offspring has a partition quality at least as good as the best of both parents. Roughly speaking, the combine operator framework combines an individual/partition $\mathcal{P}$ (which has to fulfill a balance constraint) with a clustering $\mathcal{C}$. Note that the clustering does not necessarily have to fulfill a balance constraint and its number of blocks is not necessarily given in advance. All instantiations of this framework use a different kind of clustering or partition. The partition and the clustering are both used as input for our multi-level graph partitioner KaFFPa in the following sense. Let $\mathcal{E}$ be the set of edges that are cut edges, i.e. edges that run between two blocks, in either $\mathcal{P}$ or $\mathcal{C}$. All edges in $\mathcal{E}$ are blocked during the coarsening phase, i.e. they are not contracted during the coarsening phase. In other words these edges are not eligible for the matching algorithm used during the coarsening phase and therefore are not part of any matching computed. An illustration of this can be found in Figure 2.

Figure 2: At the top, a graph with two partitions, the dark and the light line, is shown. Cut edges are not eligible for the matching algorithm. Contraction is done until no matchable edge is left. The best of the two given partitions is used as initial partition.

The stopping criterion for the multi-level partitioner is modified such that it stops when no contractable edge is left. Note that the coarsest graph is now exactly the same as the quotient graph of the overlay clustering of $\mathcal{P}$ and $\mathcal{C}$ of $G$ (see Figure 3). Hence vertices of the coarsest graph correspond to the connected components of $G_{\mathcal{E}} = (V, E \setminus \mathcal{E})$, i.e. the graph without the blocked edges $\mathcal{E}$, and the weight of the edges between vertices corresponds to the sum of the edge weights running between those connected components in $G$.
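The coarsest graph described above can be computed directly, without running a matching-based coarsening, which makes the construction easy to check on small inputs. The following sketch uses a union-find structure; `part_p` and `clust_c` are hypothetical dictionaries mapping each node to its block in the partition and in the clustering, and node labels are assumed to be comparable (e.g. integers).

```python
def overlay_clustering(nodes, edges, part_p, clust_c):
    """Map each node to a representative of its connected component
    after removing every edge that is cut in part_p or in clust_c."""
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for u, v, _w in edges:
        blocked = part_p[u] != part_p[v] or clust_c[u] != clust_c[v]
        if not blocked:
            parent[find(u)] = find(v)
    return {v: find(v) for v in nodes}

def quotient_graph(edges, comp):
    """Sum the weights of edges running between different components."""
    q = {}
    for u, v, w in edges:
        a, b = comp[u], comp[v]
        if a != b:
            key = (a, b) if a <= b else (b, a)
            q[key] = q.get(key, 0) + w
    return q
```

Every node inside one component lies in a single block of the partition, so the partition can be transferred to the quotient graph without ambiguity.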

As soon as the coarsening phase is stopped, we apply the partition $\mathcal{P}$ to the coarsest graph and use this as initial partitioning. This is possible since we did not contract any cut edge of $\mathcal{P}$. Note that due to the specialized coarsening phase and this specialized initial partitioning we obtain a high quality initial solution on a very coarse graph which is usually not discovered by conventional partitioning algorithms. Since our refinement algorithms guarantee no worsening of the input partition and use random tie breaking we can assure nondecreasing partition quality. Note that the refinement algorithms can effectively exchange good parts of the solution on the coarse levels by moving only a few vertices. Figure 3 gives an example.

Also note that this combine operator can be extended to be a multi-point combine operator, i.e. the operator would use $p$ instead of two parents. However, during the course of the algorithm a sequence of two point combine steps is executed which in effect emulates a multi-point combine step. Therefore, we restrict ourselves to the case $p = 2$. When the offspring is generated we have to decide which solution should be evicted from the current population. We evict the solution that is most similar to the offspring among those individuals in the population that have a cut worse than or equal to that of the offspring itself. The difference of two individuals is defined as the size of the symmetric difference between their sets of cut edges. This ensures some diversity in the population and hence makes the evolutionary algorithm more effective.
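The eviction rule can be illustrated with a short sketch. It assumes unit edge weights, so that the cut is simply the number of cut edges, and it represents an individual as a dictionary from node to block; both choices and the function names are made up for this example.

```python
def cut_edges(edges, block_of):
    """Set of edges (as ordered pairs) whose endpoints are in different blocks."""
    return frozenset(
        (min(u, v), max(u, v)) for u, v in edges if block_of[u] != block_of[v]
    )

def insert_with_eviction(population, offspring, edges):
    """Replace the most similar individual whose cut is not better than
    the offspring's. Returns False if no such individual exists."""
    off = cut_edges(edges, offspring)
    candidates = [
        i for i, ind in enumerate(population)
        if len(cut_edges(edges, ind)) >= len(off)
    ]
    if not candidates:
        return False
    # similarity = small symmetric difference of the cut edge sets
    victim = min(
        candidates,
        key=lambda i: len(off ^ cut_edges(edges, population[i])),
    )
    population[victim] = offspring
    return True
```

Recomputing the cut edge sets on every call keeps the sketch short; an implementation that cares about speed would cache them per individual.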

4.1.1 Classical Combine using Tournament Selection

This instantiation of the combine framework corresponds to a classical evolutionary combine operator. That means it takes two individuals of the population and performs the combine step described above. In this case $\mathcal{P}$ corresponds to the partition having the smaller cut and $\mathcal{C}$ corresponds to the partition having the larger cut. Random tie breaking is used if both parents have the same cut. The selection process is based on the tournament selection rule [22], i.e. each parent is the fittest out of two random individuals from the population, and the two parents are selected independently. Note that in contrast to previous methods the generated offspring will have a cut smaller than or equal to the cut of $\mathcal{P}$. Due to the fact that our multi-level algorithms are randomized, a combine operation performed twice using the same parents can yield different offspring.
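Tournament selection with tournament size two is simple enough to show in full. In this sketch the two contestants are drawn without replacement, which is an assumption (the text only says "two random individuals"), and `cut` is a caller-supplied function returning the cut of an individual.

```python
import random

def tournament_select(population, cut, rng):
    """Return the fitter (smaller cut) of two distinct random individuals."""
    r1, r2 = rng.sample(population, 2)
    return r1 if cut(r1) <= cut(r2) else r2

def select_parents(population, cut, rng):
    """Run two independent tournaments and order the winners so that
    the first has the smaller cut; ties are broken at random."""
    p1 = tournament_select(population, cut, rng)
    p2 = tournament_select(population, cut, rng)
    if cut(p1) > cut(p2) or (cut(p1) == cut(p2) and rng.random() < 0.5):
        p1, p2 = p2, p1
    return p1, p2
```

The first returned individual would play the role of the partition and the second the role of the clustering in the combine framework.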

4.1.2 Cross Combine / (Transduction)

In this instantiation of the combine framework , the clustering corresponds to a partition of . But instead of choosing an individual from the population we create a new individual in the following way. We choose uniformly at random in and uniformly at random in . We then use KaFFPa to create a -partition of fulfilling the balance constraint . In general larger imbalances reduce the cut of a partition which then yields good clusterings for our crossover. To the best of our knowledge there has been no genetic algorithm that performs combine operations combining individuals from different search spaces.

  
Figure 3: A graph and two bipartitions; the dotted and the dashed line (left). Curved lines represent a large cut. The four vertices correspond to the coarsest graph in the multilevel procedure. Local search algorithms can effectively exchange or to obtain the better partition depicted on the right hand side (dashed line).

4.1.3 Natural Cuts

Delling et al. [11] introduced the notion of natural cuts as a preprocessing technique for the partitioning of road networks. The preprocessing technique is able to find relatively sparse cuts close to denser areas. We use the computation of natural cuts to provide another combine operator, i.e. combining a -partition with a clustering generated by the computation of natural cuts. We closely follow their description: The computation of natural cuts works in rounds. Each round picks a center vertex and grows a breadth-first search (BFS) tree. The BFS is stopped as soon as the weight of the tree, i.e. the sum of the vertex weights of the tree, reaches , for some parameters and . The set of the neighbors of in is called the ring of . The core of is the union of all vertices added to before its size reached where is another parameter.

Figure 4: On the top we see the computation of a natural cut. A BFS Tree which starts from is grown. The gray area is the core. The dashed line is the natural cut. It is the minimum cut between the contracted versions of the core and the ring (shown as the solid line). During the computation several natural cuts are detected in the input graph (bottom).

The core is then temporarily contracted to a single vertex $s$ and the ring into a single vertex $t$ to compute the minimum $s$-$t$ cut between them using the given edge weights as capacities.

To assure that every vertex eventually belongs to at least one core, and therefore is inside at least one cut, the vertices are picked uniformly at random among all vertices that have not yet been part of any core in any round. The process is stopped when there are no such vertices left.

In the original work [11] each connected component of the graph , where is the union of all edges cut by the process above, is contracted to a single vertex. Since we do not use natural cuts as a preprocessing technique at this place we don’t contract these components. Instead we build a clustering of such that each connected component of is a block.

This technique yields the third instantiation of the combine framework which is divided into two stages, i.e. the clustering used for this combine step is dependent on the stage we are currently in. In both stages the partition used for the combine step is selected from the population using tournament selection. During the first stage we choose uniformly at random in , uniformly at random in and we set . Using these parameters we obtain a clustering of the graph which is then used in the combine framework described above. This kind of clustering is used until we reach an upper bound of ten calls to this combine step. When the upper bound is reached we switch to the second stage. In this stage we use the clusterings computed during the first stage, i.e. we extract elementary natural cuts and use them to quickly compute new clusterings. An elementary natural cut (ENC) consists of a set of cut edges and the set of nodes in its core. Moreover, for each node in the graph, we store the set of ENCs that contain in their core. With these data structures it is easy to pick a new clustering (see Algorithm 2) which is then used in the combine framework described above.

  unmark all nodes in $V$
  for each node $v \in V$ in random order do
   if $v$ is not marked then
    pick a random ENC among those stored for $v$
    output it
    mark all nodes in its core
Algorithm 2 computeNaturalCutClustering (second stage)
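Algorithm 2 translates almost line by line into code. In the sketch below an ENC is represented as a pair (set of cut edges, set of core nodes) and `encs_at` is a hypothetical dictionary mapping each node to the list of ENCs whose core contains it; it is assumed that every node has at least one such ENC, which holds because the first stage stops only when every vertex has been part of some core.

```python
import random

def compute_natural_cut_clustering(nodes, encs_at, rng):
    """Second-stage selection of elementary natural cuts (ENCs).

    nodes: iterable of nodes; encs_at: node -> list of (cut_edges, core).
    Returns the list of chosen ENCs; their cut edges together define
    the clustering.
    """
    marked = set()
    chosen = []
    order = list(nodes)
    rng.shuffle(order)                 # visit nodes in random order
    for v in order:
        if v in marked:
            continue
        cut_edges, core = rng.choice(encs_at[v])
        chosen.append((cut_edges, core))
        marked.update(core)            # v itself lies in this core
    return chosen
```

The clustering itself is then obtained by removing the union of the chosen cut edges from the graph and taking connected components.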

4.2 Mutation Operators

We define two mutation operators, an ordinary and a modified F-cycle. Both mutation operators use a random individual from the current population. The main idea is to iterate coarsening and refinement several times using different seeds for random tie breaking. The first mutation operator can assure that the quality of the input partition does not decrease. It is basically an ordinary F-cycle, which is an algorithm used in KaFFPa. Edges between blocks are not contracted. The given partition is then used as initial partition of the coarsest graph. In contrast to KaFFPa, we can now use the partition as input to the partitioner from the very beginning. This ensures nondecreasing quality since our refinement algorithms guarantee no worsening. The second mutation operator works quite similarly, with the small difference that the input partition is not used as initial partition of the coarsest graph. That means we obtain very good coarse graphs but we cannot assure that the final individual has a higher quality than the input individual. In both cases the resulting offspring is inserted into the population using the eviction strategy described in Section 4.1.

5 Putting Things Together and Parallelization

We now explain the parallelization and describe how everything is put together. Each processing element (PE) basically performs the same operations using different random seeds (see Algorithm 3). First we estimate the population size $S$: each PE performs a partitioning step and measures the time $t$ spent for partitioning. We then choose $S$ such that the time for creating $S$ partitions is approximately $t_{\text{total}}/f$, where the fraction $f$ is a tuning parameter and $t_{\text{total}}$ is the total running time that the algorithm is given to produce a partition of the graph. Each PE then builds its own population, i.e. KaFFPa is called several times to create $S$ individuals/partitions. Afterwards the algorithm proceeds in rounds as long as time is left. With corresponding probabilities, mutation or combine operations are performed and the new offspring is inserted into the population.

We choose a parallelization/communication protocol that is quite similar to randomized rumor spreading [12]. Let denote the number of PEs used. A communication step is organized in rounds. In each round, a PE chooses a communication partner and sends it the currently best partition of the local population. The communication partner is selected uniformly at random among those PEs to which this partition has not already been sent. Afterwards, a PE checks if there are incoming individuals and if so inserts them into the local population using the eviction strategy described above. If the best partition is improved, all PEs are again eligible. This is repeated times. Note that the algorithm is implemented completely asynchronously, i.e. there is no need for a global synchronisation. The process of creating individuals is parallelized as follows: Each PE makes calls to KaFFPa using different seeds to create individuals. Afterwards we do the following times: The root PE computes a random cyclic permutation of all PEs and broadcasts it to all PEs. Each PE then sends a random individual to its successor in the cyclic permutation and receives an individual from its predecessor in the cyclic permutation. We call this particular part of the algorithm quick start.
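The random cyclic permutation used in the quick start phase can be sketched without any message passing. The function below only computes the successor map that the root PE would broadcast; PEs are identified by the integers 0 to `num_pes - 1`, and the function name is invented for this illustration.

```python
import random

def random_cyclic_permutation(num_pes, rng):
    """Return a dict pe -> successor forming one cycle through all PEs."""
    order = list(range(num_pes))
    rng.shuffle(order)  # random arrangement of the PEs on a ring
    return {order[i]: order[(i + 1) % num_pes] for i in range(num_pes)}
```

Each PE would send one random individual to its successor in this map and receive one from the PE that has it as successor, so every PE sends and receives exactly one individual per exchange.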

The ratio of mutation to crossover operations yields a tuning parameter . As we will see in Section 6 the ratio is a good choice. After some experiments we fixed the ratio of the mutation operators to and the ratio of the combine operators to .

Note that the communication step in the last line of the algorithm could also be performed only every -iterations (where is a tuning parameter) to save communication time. Since the communication network of our test system is very fast (see Section 6), we perform the communication step in each iteration.

  procedure locallyEvolve
   estimate population size
   while time left
    if elapsed time then create individual and insert into local population
    else
     flip coin with corresponding probabilities
     if shows head then
      perform a mutation operation
     else
      perform a combine operation
     insert offspring into population if possible
    communicate according to communication protocol
Algorithm 3 All PEs perform basically the same operations using different random seeds.

6 Experiments

Implementation.

We have implemented the algorithm described above using C++. Overall, our program (including KaFFPa) consists of about 22 500 lines of code. We use two base case partitioners, KaFFPaStrong and KaFFPaEco. KaFFPaEco is a good tradeoff between quality and speed, and KaFFPaStrong is focused on quality. For the following comparisons we used Scotch 5.1.9 and kMetis 5.0 (pre2).

System.

Experiments have been done on two machines. Machine A is a cluster with 200 nodes where each node is equipped with two quad-core Intel Xeon processors (X5355) which run at a clock speed of 2.667 GHz. Each node has 2x4 MB of level 2 cache and runs Suse Linux Enterprise 10 SP 1. All nodes are attached to an InfiniBand 4X DDR interconnect which is characterized by its very low latency of below 2 microseconds and a point to point bandwidth between two nodes of more than 1300 MB/s. Machine B has two Intel Xeon X5550, 48GB RAM, running Ubuntu 10.04. Each CPU has 4 cores (8 cores when hyperthreading is active) running at 2.67 GHz. Experiments in Sections 6.1, 6.2, 6.3 and 6.5 have been conducted on machine A, and experiments in Sections 6.4 and 6.6 have been conducted on machine B. All programs were compiled using GCC Version 4.4.3 and optimization level 3 using OpenMPI 1.5.3. Henceforth, a PE is one core.

Instances.

We report experiments on three suites of instances (small, medium sized and road networks) summarized in Appendix C. is a random geometric graph with nodes where nodes represent random points in the unit square and edges connect nodes whose Euclidean distance is below . This threshold was chosen in order to ensure that the graph is almost connected. is the Delaunay triangulation of random points in the unit square. Graphs ,.. and .. come from Walshaw’s benchmark archive [30]. Graphs and , and are undirected versions of the road networks, used in [10]. is a road network taken from [3]. Our default number of partitions are 2, 4, 8, 16, 32, 64 since they are the default values in [30] and in some cases we additionally use 128 and 256. Our default value for the allowed imbalance is 3% since this is one of the values used in [30] and the default value in Metis. Our default number of PEs is 16.

Methodology.

We mostly present two kinds of data: average values and plots that show the evolution of solution quality (convergence plots). In both cases we perform multiple repetitions. The number of repetitions is dependent on the test that we perform. Average values over multiple instances are obtained as follows: for each instance (graph, $k$), we compute the average edge cut over all repetitions, and we then report the geometric mean of these averages over all instances. We now explain how we compute the convergence plots, starting with a single instance $I$: whenever a PE creates a partition it reports a pair ($t$, cut), where the timestamp $t$ is the currently elapsed time on the particular PE and cut refers to the cut of the partition that has been created. When performing multiple repetitions we report average values ($t$, avgcut) instead. After the completion of KaFFPaE we are left with one sequence of pairs ($t$, cut) per PE, which we now merge into one sequence. The merged sequence is sorted by the timestamp $t$. The resulting sequence is called $S$. Since we are interested in the evolution of the solution quality, we compute another sequence $S_{\min}$. For each entry (in sorted order) in $S$ we insert the entry ($t$, $\min_{t' \leq t} \text{cut}(t')$) into $S_{\min}$. Here $\min_{t' \leq t} \text{cut}(t')$ is the minimum cut that occurred until time $t$. $N_{\min}$ refers to the normalized sequence, i.e. each entry ($t$, cut) in $S_{\min}$ is replaced by ($t_n$, cut) where $t_n = t/t_I$ and $t_I$ is the average time that KaFFPa needs to compute a partition for the instance $I$. To obtain average values over multiple instances we do the following: for each instance we label all entries in $N_{\min}$, i.e. ($t_n$, cut) is replaced by ($t_n$, cut, $I$). We then merge all sequences $N_{\min}$ and sort by $t_n$. The resulting sequence is called $N$. The final sequence $N_g$ presents event based geometric average values. We start by computing the geometric mean cut value $\mathcal{G}$ using the first value of all $N_{\min}$ (over $I$). To obtain $N_g$ we basically sweep through $N$: for each entry (in sorted order) ($t_n$, $c$, $I$) in $N$ we update $\mathcal{G}$, i.e. the cut value of $I$ that took part in the computation of $\mathcal{G}$ is replaced by the new value $c$, and insert ($t_n$, $\mathcal{G}$) into $N_g$. Note that $c$ can only be smaller than or equal to the old cut value of $I$.
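The per-instance part of this procedure, a running minimum over the merged events followed by a time normalization, is sketched below. The function names are invented here, and the normalization is assumed to divide each timestamp by the average time of one partitioner run on that instance (called `t_base` in the code).

```python
def min_so_far(events):
    """events: iterable of (timestamp, cut) pairs from all PEs.

    Returns the events sorted by timestamp, with each cut replaced by
    the minimum cut observed up to and including that timestamp.
    """
    out = []
    best = float("inf")
    for t, cut in sorted(events):
        best = min(best, cut)
        out.append((t, best))
    return out

def normalize(seq, t_base):
    """Rescale timestamps by the average single-run time t_base."""
    return [(t / t_base, cut) for t, cut in seq]
```

Normalizing the time axis makes sequences from instances of very different size comparable before they are merged for the geometric averages.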

6.1 Parameter Tuning

We now tune the fraction parameter and the ratio between mutation and crossover operations. For the parameter tuning we choose our small testset because runtimes for a single graph partitioner call are not too large. To save runtime we focus on for tuning the parameters. For each instance we gave KaFFPaE ten minutes time and 16 PEs to compute a partition. During this test the quick start option is disabled.

For this test the flip coin parameter is set to one. In Figure 5 we can see that the algorithm is not too sensitive about the exact choice of this parameter. However, larger values of speed up the convergence rate and improve the result achieved in the end. Since and are the best parameter in the end, we choose as our default value. For tuning the ratio of mutation and crossover operations, we set to ten. We can see that for smaller values of the algorithm is not too sensitive about the exact choice of the parameter. However, if the exceeds 8 the convergence speed slows down which yields worse average results in the end. We choose because it has a slight advantage in the end. The parameter tuning uses KaFFPaStrong as a partitioner. We also performed the parameter tuning using KaFFPaEco as a partitioner (see Appendix B.1).

Figure 5: Conv. plots for the fraction using (left) and the flip coin using (right).

6.2 Scalability

In this Section we study the scalability of our algorithm. We do the following to obtain a fair comparison: basically each configuration has the same amount of time, i.e. when doubling the number of PEs used, we divide the time that KaFFPaE has to compute a partition per instance by two. To be more precise, when we use one PE, KaFFPaE has time $t_1$ to compute a partition of an instance. When KaFFPaE uses $p$ PEs, it gets time $t_1/p$ to compute a partition of an instance. For all the following tests the quick start option is enabled. To save runtime we use our small sized testset and fix $k$ to 64. Here we perform five repetitions per instance. We can see in Figure 6 that using more processors speeds up convergence and, up to a certain number of PEs, also improves the quality in the end (in these cases the speedups are optimal in the end). This might be due to island effects [1]. For the largest number of PEs the results are worse than with fewer PEs. This is because the algorithm is barely able to perform combine and mutation steps, due to the very small amount of time given to KaFFPaE (60 seconds). On the largest graph of the testset (delaunay16) we need about 20 seconds to create a partition into $k$ blocks.

We now define pseudo speedup $S_p(t_n)$, which is a measure for speedup at a particular normalized time $t_n$ of the configuration using one PE. Let $c_p(t_n)$ be the mean minimum cut that KaFFPaE has computed using $p$ PEs until normalized time $t_n$. The pseudo speedup is then defined as $S_p(t_n) = t_n / t'_n$ where $t'_n = \min \{ t \mid c_p(t) \le c_1(t_n) \}$. If $c_p(t) > c_1(t_n)$ for all $t$ we set $S_p(t_n) = 0$ (in this case the parallel algorithm is not able to compute the result computed by the sequential algorithm at normalized time $t_n$; this only happens for a single configuration). We can see in Figure 6 that after a short amount of time we reach super linear pseudo speedups in most cases.
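As a minimal sketch of this measure, assume that the pseudo speedup at normalized time t_n is t_n divided by the earliest normalized time at which the parallel configuration reaches a mean minimum cut no larger than the one-PE value at t_n, and that it is 0 if this never happens. The helper names `cut_at` and `pseudo_speedup` and the representation of a run as a time-sorted list of `(normalized time, mean minimum cut)` pairs with positive times are assumptions made for this example.

```python
def cut_at(seq, t):
    """Mean minimum cut reported up to normalized time t, or None if nothing was reported yet."""
    best = None
    for time, cut in seq:
        if time > t:
            break
        best = cut
    return best

def pseudo_speedup(seq_one_pe, seq_p_pes, t_n):
    """Ratio of t_n to the first time the p-PE run is at least as good as the one-PE run at t_n."""
    target = cut_at(seq_one_pe, t_n)
    if target is None:
        raise ValueError("the one-PE run has no result at this normalized time")
    for time, cut in seq_p_pes:
        if cut <= target:
            return t_n / time
    return 0.0  # the p-PE run never reaches the one-PE result

one_pe = [(1.0, 100), (4.0, 90), (8.0, 80)]
p_pes = [(0.5, 95), (1.0, 88), (2.0, 85)]
print(pseudo_speedup(one_pe, p_pes, 4.0))  # 4.0
print(pseudo_speedup(one_pe, p_pes, 8.0))  # 0.0
```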

Figure 6: Scalability of our algorithm: (left) a normal convergence plot, (middle) mean minimum cut relative to best cut of KaFFPaE using one PE, (right) pseudo speedup (larger versions can be found in Appendix B.3).

6.3 Comparison with KaFFPa and other Systems

k/Algo.    Reps.    KaFFPaE
            Avg.    impr. %
2            569    0.2%
4          1 229    1.0%
8          2 206    1.5%
16         3 568    2.7%
32         5 481    3.4%
64         8 141    3.3%
128       11 937    3.9%
256       17 262    3.7%
overall    3 872    2.5%
Table 1: Different algorithms after two hours of time on 16 PEs.

In this Section we compare ourselves with repeated executions of KaFFPa and other systems. We switch to our middle sized testset to avoid the effect of overtuning our algorithm parameters to the instances used for calibration. We use 16 PEs and two hours of time per instance when we use KaFFPaE. We parallelized repeated executions of KaFFPa (embarrassingly parallel, different seeds) and also gave 16 PEs and two hours to KaFFPa. We look at $k \in \{2, 4, 8, 16, 32, 64, 128, 256\}$ and performed three repetitions per instance. Figure 7 shows convergence plots for selected values of $k$. All convergence plots can be found in Appendix B.2. As expected, the improvements of KaFFPaE relative to repeated executions of KaFFPa increase with increasing $k$. The largest improvement is obtained for $k = 128$. Here KaFFPaE produces partitions that have a 3.9% smaller cut value than plain restarts of the algorithm. Note that using a weaker base case partitioner, e.g. KaFFPaEco, increases this value. On the small sized testset we obtained an improvement of 5.9% (for a single value of $k$) compared to plain restarts of KaFFPaEco. A table comparing KaFFPaE with the best results out of ten repetitions of Scotch and Metis can be found in Appendix Table 4. Overall, Scotch and Metis produce 19% and 28% larger (best) cuts than KaFFPaE, respectively. However, these methods are much faster than ours (Appendix Table 4).

Figure 7: Convergence plots for the comparison of KaFFPaE with repeated executions of KaFFPa.

6.4 Combine Operator Experiments

k/Algo.    S3R      K3R    KC     SC
           Avg.     improvement %
2            591    2.4    1.6    0.2
4          1 304    3.4    4.0    0.2
8          2 336    3.7    3.6    0.2
16         3 723    2.9    2.0    0.2
32         5 720    2.7    3.3    0.0
64         8 463    2.8    3.0   -0.6
128       12 435    3.6    4.5    0.0
256       17 915    3.4    4.2   -0.1
Table 2: Comparison of quality of different algorithms relative to S3R.

We now look into the effectiveness of our combine operator. We conduct the following experiment: we compare the best result of three repeated executions of KaFFPa (K3R) against a combine step (KC), i.e. after creating two partitions we report the result of the combine step combining both individuals. The same is done using the combine operator of Soper et al. [28] (SC), i.e. we create two individuals using perturbed edge weights as in [28] and report the cut produced by the combine step proposed there (the best out of the three individuals). We also present the best results out of three repetitions when using perturbed edge weights as in Soper et al. (S3R). Since our partitioner does not support double type edge weights, we computed the perturbations and scaled them by a factor of 10 000 (for S3R and SC). We performed ten repetitions on the middle sized testset. Results are reported in Table 2. A table presenting absolute average values and comparing the runtime of these algorithms can be found in Appendix Table 5. We can see that for large $k$ our new combine operator yields improved partition quality in comparable or less time (KC vs. K3R). Most importantly, we can see that edge biases decrease the solution quality (K3R vs. S3R). This is due to the fact that edge biases make edge cuts optimal that are not close to optimal in the unbiased problem. For example, on 2D grid graphs we have straight edge cuts that are optimal. Random edge biases make bent edge cuts optimal. However, these cuts are not close to optimal cuts of the original graph partitioning problem. Moreover, local search algorithms (Flow-based, FM-based) work better if there are a lot of equally sized cuts.
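The relation between the relative values in Table 2 and the absolute averages in Appendix Table 5 can be checked with a short calculation. The formula below is an assumption inferred from the numbers, not a definition stated in the text: the improvement of an algorithm over S3R is taken as the S3R average divided by the algorithm's average, minus one. It reproduces the k = 64 row of Table 2; since the published averages are rounded, other rows can differ in the last digit.

```python
def improvement_percent(reference_cut, cut):
    """Percentage by which reference_cut exceeds cut (positive means cut is smaller)."""
    return (reference_cut / cut - 1.0) * 100.0

# Average cuts for k = 64 as listed in Table 5; S3R is the reference.
table5_k64 = {"S3R": 8463, "K3R": 8236, "KC": 8213, "SC": 8512}

improvements = {
    name: round(improvement_percent(table5_k64["S3R"], cut), 1)
    for name, cut in table5_k64.items()
    if name != "S3R"
}
print(improvements)  # {'K3R': 2.8, 'KC': 3.0, 'SC': -0.6}
```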

6.5 Walshaw Benchmark

We now apply KaFFPaE to Walshaw’s benchmark archive [28] using the rules used there, i.e., running time is not an issue but we want to achieve minimal cut values for $k \in \{2, 4, 8, 16, 32, 64\}$ and balance parameters $\varepsilon \in \{0, 1\%, 3\%, 5\%\}$. We focus on $\varepsilon \in \{1\%, 3\%, 5\%\}$ since KaFFPaE (more precisely KaFFPa) is not made for the case $\varepsilon = 0$. We run KaFFPaE with a time limit of two hours using 16 PEs (two nodes of the cluster) per graph, $k$ and $\varepsilon$, and report the best results obtained in Appendix D. KaFFPaE computed 300 partitions which are better than the previous best partitions reported there: 91 for 1%, 103 for 3% and 106 for 5%. Moreover, it reproduced equally sized cuts in 170 of the 312 remaining cases. When only considering the 15 largest graphs and we are able to reproduce or improve the current result in 224 out of 240 cases. Overall our systems (including KaPPa, KaSPar, KaFFPa, KaFFPaE) now improved or reproduced the entries in 550 out of 612 cases (for $\varepsilon \in \{1\%, 3\%, 5\%\}$).

6.6 Comparison with PUNCH

graph      k    P best   P t[m]   B avg.   B t[m]   B best
ger.       2       164       83      161        6      161
ger.       4       400       96      394        6      393
ger.       8       711      102      694        9      693
ger.      16     1 144       83    1 148       16    1 137
ger.      32     1 960       71    1 928       31    1 898
ger.      64     3 165       83    3 164       62    3 143
eur.       2       129      423      149       39      129
eur.       4       309      358      313       39      310
eur.       8       634      293      693       47      659
eur.      16     1 293      252    1 261       73    1 238
eur.      32     2 289      217    2 259      130    2 240
eur.      64     3 828      241    3 856      248    3 825
Table 3: Results on road networks: best results of PUNCH (P) out of 100 repetitions and total time [m] needed to compute these results; average and best cut results of Buffoon (B) as well as average runtime [m] (including preprocessing).

In this Section we focus on finding partitions for road networks. We implemented a specialized algorithm, Buffoon, which is similar to PUNCH [11] in the sense that it also uses natural cuts as a preprocessing technique to obtain a coarser graph on which the graph partitioning problem is solved. For more information on natural cuts, we refer the reader to [11]. Using our (shared memory) parallelized version of natural cut preprocessing we obtain a coarse version of the graph. Note that our preprocessing uses slightly different parameters than PUNCH (using the notation of [11], we use ). Since partitions of the coarse graph correspond to partitions of the original graph, we use KaFFPaE to partition the coarse version of the graph.

After preprocessing, we gave KaFFPaE a fixed time budget on europe and on germany, respectively, to compute a partition. In both cases we used all 16 cores (hyperthreading active) of machine B for preprocessing and for KaFFPaE. The experiments were repeated ten times. A summary of the results is shown in Table 3. Interestingly, on germany already our average values are smaller or equal to the best result out of 100 repetitions obtained by PUNCH. Overall in 9 out of 12 cases we compute a best cut that is better or equal to the best cut obtained by PUNCH. Note that for obtaining the best cut values we invest significantly more time than PUNCH. However, their machine is about a factor two faster (12 cores running at 3.33GHz compared to 8 cores running at 2.67GHz) and our algorithm is not tuned for road networks. A table comparing the results on road networks against KaFFPa, KaSPar, Scotch and Metis can be found in Appendix Table 6. Measured by the overall average cuts reported there, these algorithms produce 9%, 12%, 93% and 189% larger cuts than Buffoon, respectively.

7 Conclusion and Future Work

KaFFPaE is a distributed evolutionary algorithm to tackle the graph partitioning problem. Due to new crossover and mutation operators as well as its scalable parallelization it is able to compute the best known partitions for many standard benchmark instances in only a few minutes. We therefore believe that KaFFPaE is still helpful in the area of high performance computing.

Regarding future work, we want to integrate other partitioners if they implement the possibility to block edges during the coarsening phase and use the given partitioning as initial solution. It would be interesting to try other domain specific combine operators, e.g. on social networks it could be interesting to use a modularity clusterer to compute a clustering for the combine operation.

References

  • [1] Enrique Alba and Marco Tomassini. Parallelism and evolutionary algorithms. IEEE Trans. Evolutionary Computation, 6(5):443–462, 2002.
  • [2] Thomas Bäck. Evolutionary algorithms in theory and practice : evolution strategies, evolutionary programming, genetic algorithms. PhD thesis, 1996.
  • [3] David Bader, Henning Meyerhenke, Peter Sanders, and Dorothea Wagner. 10th DIMACS Implementation Challenge - Graph Partitioning and Graph Clustering, http://www.cc.gatech.edu/dimacs10/.
  • [4] Reinhard Bauer, Daniel Delling, Peter Sanders, Dennis Schieferdecker, Dominik Schultes, and Dorothea Wagner. Combining hierarchical and goal-directed speed-up techniques for Dijkstra’s algorithm. ACM Journal of Experimental Algorithmics, 15, 2010.
  • [5] Una Benlic and Jin-Kao Hao. A multilevel memetic approach for improving graph k-partitions. In 22nd Intl. Conf. Tools with Artificial Intelligence, pages 121–128, 2010.
  • [6] K.D. Boese, A.B. Kahng, and S. Muddu. A new adaptive multi-start technique for combinatorial global optimizations. Operations Research Letters, 16(2):101–113, 1994.
  • [7] Thang Nguyen Bui and Curt Jones. Finding good approximate vertex and edge partitions is NP-hard. Inf. Process. Lett., 42(3):153–159, 1992.
  • [8] Pierre Chardaire, Musbah Barake, and Geoff P. McKeown. A probe-based heuristic for graph partitioning. IEEE Trans. Computers, 56(12):1707–1720, 2007.
  • [9] Kenneth Alan De Jong. Evolutionary computation : a unified approach. MIT Press, 2006.
  • [10] D. Delling, P. Sanders, D. Schultes, and D. Wagner. Engineering route planning algorithms. In Algorithmics of Large and Complex Networks, volume 5515 of LNCS State-of-the-Art Survey, pages 117–139. Springer, 2009.
  • [11] Daniel Delling, Andrew V. Goldberg, Ilya Razenshteyn, and Renato F. Werneck. Graph Partitioning with Natural Cuts. In 25th International Parallel and Distributed Processing Symposium (IPDPS’11). IEEE Computer Society, 2011.
  • [12] Benjamin Doerr and Mahmoud Fouz. Asymptotically optimal randomized rumor spreading. In ICALP (2), volume 6756 of Lecture Notes in Computer Science, pages 502–513. Springer, 2011.
  • [13] D. Drake and S. Hougardy. A simple approximation algorithm for the weighted matching problem. Information Processing Letters, 85:211–213, 2003.
  • [14] C. M. Fiduccia and R. M. Mattheyses. A Linear-Time Heuristic for Improving Network Partitions. In 19th Conference on Design Automation, pages 175–181, 1982.
  • [15] P.O. Fjallstrom. Algorithms for graph partitioning: A survey. Linkoping Electronic Articles in Computer and Information Science, 3(10), 1998.
  • [16] David E. Goldberg. Genetic algorithms in search, optimization, and machine learning. Addison-Wesley, 1989.
  • [17] M. Holtgrewe, P. Sanders, and C. Schulz. Engineering a Scalable High Quality Graph Partitioner. 24th IEEE International Parallel and Distributed Processing Symposium, 2010.
  • [18] Hiroaki Inayoshi and Bernard Manderick. The weighted graph bi-partitioning problem: A look at GA performance. In PPSN, volume 866 of Lecture Notes in Computer Science, pages 617–625. Springer, 1994.
  • [19] G. Karypis and V. Kumar. Parallel multilevel k-way partitioning scheme for irregular graphs. SIAM Review, 41(2):278–300, 1999.
  • [20] Jin Kim, Inwook Hwang, Yong-Hyuk Kim, and Byung Ro Moon. Genetic approaches for graph partitioning: a survey. In GECCO, pages 473–480. ACM, 2011.
  • [21] J. Maue and P. Sanders. Engineering algorithms for approximate weighted matching. In 6th Workshop on Exp. Algorithms (WEA), volume 4525 of LNCS, pages 242–255. Springer, 2007.
  • [22] Brad L. Miller and David E. Goldberg. Genetic algorithms, tournament selection, and the effects of noise. Complex Systems, 9:193–212, 1995.
  • [23] V. Osipov and P. Sanders. n-Level Graph Partitioning. 18th European Symposium on Algorithms (see also arxiv preprint arXiv:1004.4024), 2010.
  • [24] F. Pellegrini. Scotch home page. http://www.labri.fr/pelegrin/scotch.
  • [25] Josep M. Pujol, Vijay Erramilli, and Pablo Rodriguez. Divide and conquer: Partitioning online social networks. CoRR, abs/0905.4918, 2009.
  • [26] P. Sanders and C. Schulz. Engineering Multilevel Graph Partitioning Algorithms. 19th European Symposium on Algorithms (see also arxiv preprint arXiv:1012.0006v3), 2011.
  • [27] K. Schloegel, G. Karypis, and V. Kumar. Graph Partitioning for High Performance Scientific Simulations. UMSI research report/University of Minnesota (Minneapolis, Mn). Supercomputer institute, page 38, 2000.
  • [28] A.J. Soper, C. Walshaw, and M. Cross. A combined evolutionary search and multilevel optimisation approach to graph-partitioning. Journal of Global Optimization, 29(2):225–241, 2004.
  • [29] C. Walshaw. Multilevel refinement for combinatorial optimisation problems. Annals of Operations Research, 131(1):325–372, 2004.
  • [30] C. Walshaw and M. Cross. Mesh Partitioning: A Multilevel Balancing and Refinement Algorithm. SIAM Journal on Scientific Computing, 22(1):63–80, 2000.
  • [31] C. Walshaw and M. Cross. JOSTLE: Parallel Multilevel Graph-Partitioning Software – An Overview. In F. Magoules, editor, Mesh Partitioning Techniques and Domain Decomposition Techniques, pages 27–58. Civil-Comp Ltd., 2007. (Invited chapter).

Appendix A Karlsruhe Fast Flow Partitioner

We now provide a brief overview of the techniques used in KaFFPa, the underlying graph partitioner that KaFFPaE employs. KaFFPa [26] is a classical matching based multilevel graph partitioner. Recall that a multilevel graph partitioner basically has three phases: coarsening, initial partitioning and uncoarsening.

KaFFPa makes contraction more systematic by separating two issues: A rating function indicates how much sense it makes to contract an edge based on local information. A matching algorithm tries to maximize the sum of the ratings of the contracted edges looking at the global structure of the graph. While the rating function allows a flexible characterization of what a “good” contracted graph is, the simple, standard definition of the matching problem allows us to reuse previously developed algorithms for weighted matching. Matchings are contracted until the graph is “small enough”. In [17] we identified the edge rating function that works best among the rating functions considered there; this rating function is also used in KaFFPa.

We employed the Global Path Algorithm (GPA) as a matching algorithm. It was proposed in [21] as a synthesis of the Greedy algorithm and the Path Growing Algorithm [13]. This algorithm achieves a half-approximation in the worst case, but empirically, GPA gives considerably better results than Sorted Heavy Edge Matching and Greedy (for more details see [17]). GPA scans the edges in order of decreasing weight but rather than immediately building a matching, it first constructs a collection of paths and even cycles. Afterwards, optimal solutions are computed for each of these paths and cycles using dynamic programming.
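The final step of this scheme, solving the matching problem optimally on a single path, can be illustrated with a small dynamic program. This is a minimal sketch and not the GPA implementation of [21]: it covers only the path case, the function name `max_weight_matching_on_path` is hypothetical, and the input is simply the list of edge weights in the order in which the edges appear along the path.

```python
def max_weight_matching_on_path(weights):
    """weights[i] is the weight of the i-th edge along a path.
    Returns (total weight, indices of the chosen edges); chosen edges never share a node."""
    n = len(weights)
    best = [0] * (n + 1)  # best[i]: optimum using only the first i edges
    for i in range(1, n + 1):
        take = weights[i - 1] + (best[i - 2] if i >= 2 else 0)
        best[i] = max(best[i - 1], take)
    # Walk backwards to recover which edges were chosen.
    chosen, i = [], n
    while i >= 1:
        if best[i] == best[i - 1]:
            i -= 1  # edge i-1 is not needed for the optimum
        else:
            chosen.append(i - 1)
            i -= 2  # the neighboring edge cannot be matched as well
    return best[n], chosen[::-1]

print(max_weight_matching_on_path([3, 5, 3]))  # (6, [0, 2])
```

On the path with weights 3, 5, 3 a purely greedy choice would take the middle edge and stop at weight 5, whereas the dynamic program finds the two outer edges with total weight 6. An even cycle can be handled in the same spirit by removing one edge at a time and solving the resulting paths.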

The contraction is stopped when the number of remaining nodes falls below a given threshold. The graph is then small enough to be partitioned by some initial partitioning algorithm. KaFFPa employs Scotch as an initial partitioner since it empirically performs better than Metis.

Recall that the refinement phase iteratively uncontracts the matchings contracted during the contraction phase. After a matching is uncontracted, local search based refinement algorithms move nodes between block boundaries in order to reduce the cut while maintaining the balancing constraint. Local improvement algorithms are usually variants of the FM-algorithm [14]. The algorithm is organized in rounds. In each round, a priority queue is used which is initialized with all vertices that are incident to more than one block, in a random order. The priority is based on the gain $g(v) = \max_P g_P(v)$ where $g_P(v)$ is the decrease in edge cut when moving $v$ to block $P$. Ties are broken randomly if there is more than one block that yields the maximum gain when moving $v$ to it. Local search then repeatedly looks for the highest gain node $v$. Each node is moved at most once within a round. After a node is moved its unmoved neighbors become eligible, i.e. its unmoved neighbors are inserted into the priority queue. When a stopping criterion is reached, all movements made after the best cut found within the balance constraint are undone. This process is repeated several times until no improvement is found.
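One round of such a local search can be sketched as follows. This is a simplified illustration under several assumptions and not KaFFPa's implementation: nodes have unit weight, there are at least two blocks, the balance constraint is a plain upper bound `max_block_size`, moves that would violate it are skipped, stale queue entries are handled by recomputing the gain when a node is extracted, and the round only stops when the queue is empty. The names `cut_size`, `best_move` and `fm_round` are hypothetical.

```python
import heapq
import random

def cut_size(adj, part):
    """Total weight of edges whose endpoints lie in different blocks."""
    return sum(w for u in adj for v, w in adj[u].items() if u < v and part[u] != part[v])

def best_move(adj, part, v, k, rng):
    """Return (gain, target block) of the best move of v; ties are broken randomly."""
    conn = [0] * k  # edge weight between v and each block
    for u, w in adj[v].items():
        conn[part[u]] += w
    own = part[v]
    best = max(conn[b] for b in range(k) if b != own)
    target = rng.choice([b for b in range(k) if b != own and conn[b] == best])
    return best - conn[own], target

def fm_round(adj, part, k, max_block_size, seed=0):
    rng = random.Random(seed)
    part = dict(part)
    sizes = [0] * k
    for b in part.values():
        sizes[b] += 1
    heap = []

    def push(v):
        # The random second key orders nodes of equal gain randomly.
        heapq.heappush(heap, (-best_move(adj, part, v, k, rng)[0], rng.random(), v))

    for v in adj:  # start with all boundary nodes
        if any(part[u] != part[v] for u in adj[v]):
            push(v)
    moved, log = set(), []
    cur = best = cut_size(adj, part)
    best_len = 0
    while heap:
        _, _, v = heapq.heappop(heap)
        if v in moved:
            continue  # each node is moved at most once per round
        gain, target = best_move(adj, part, v, k, rng)  # the queued gain may be stale
        if sizes[target] + 1 > max_block_size:
            continue  # the move would violate the balance bound
        src = part[v]
        part[v] = target
        sizes[src] -= 1
        sizes[target] += 1
        moved.add(v)
        log.append((v, src))
        cur -= gain
        if cur < best:
            best, best_len = cur, len(log)
        for u in adj[v]:
            if u not in moved:
                push(u)  # unmoved neighbors become eligible
    for v, src in reversed(log[best_len:]):
        part[v] = src  # undo everything after the best cut found
    return part, best
```

As a usage example, take two triangles joined by one edge, with one triangle node initially placed in the wrong block; a single round moves it back and the cut drops from 2 to 1:

```python
adj = {0: {1: 1, 2: 1}, 1: {0: 1, 2: 1}, 2: {0: 1, 1: 1, 3: 1},
       3: {2: 1, 4: 1, 5: 1}, 4: {3: 1, 5: 1}, 5: {3: 1, 4: 1}}
part = {0: 0, 1: 0, 2: 1, 3: 1, 4: 1, 5: 1}
new_part, new_cut = fm_round(adj, part, k=2, max_block_size=4)
```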

During the uncoarsening phase KaFFPa additionally uses more advanced refinement algorithms. The first method is based on max-flow min-cut computations between pairs of blocks, i.e., a method to improve a given bipartition. Roughly speaking, this improvement method is applied between all pairs of blocks that share a non-empty boundary. The algorithm basically constructs a flow problem by growing an area around the given boundary vertices of a pair of blocks such that each min cut in this area yields a feasible bipartition of the original graph within the balance constraint. This yields a locally improved $k$-partition of the graph. The second method for improving a given partition is called multi-try FM. Roughly speaking, a $k$-way local search initialized with a single boundary node is repeatedly started. Previous methods are initialized with all boundary nodes.

KaFFPa extended the concept of iterated multilevel algorithms which was introduced by [29]. The main idea is to iterate the coarsening and uncoarsening phase. Once the graph is partitioned, edges that are between two blocks are not contracted. An F-cycle works as follows: on each level we perform at most two recursive calls using different random seeds during contraction and local search. A second recursive call is only made the second time that the algorithm reaches a particular level. Because edges between blocks are never contracted once a partition exists, the quality of the partition cannot decrease, since our refinement algorithms guarantee no worsening and break ties randomly. These so called global search strategies are more effective than plain restarts of the algorithm.

Appendix B Additional Experimental Data

B.1 Further Parameter Tuning

In this Section we perform parameter tuning using KaFFPaEco (a faster but less powerful configuration than KaFFPaStrong) as a base case partitioner. We start by tuning the fraction parameter. As before we set the flip coin parameter to one. In Figure 8 we can see that the algorithm is not too sensitive to the exact choice of this parameter. As before, larger values of the fraction speed up the convergence rate and improve the result achieved in the end. We choose the value that is best in the end as our default value.

We now tune the ratio of mutation to crossover operations. For this test we keep the fraction parameter fixed. The results are similar to the results achieved when using KaFFPaStrong as a base case partitioner. Again we can see that for smaller values of the ratio the algorithm is not too sensitive to the exact choice of the parameter. When no crossover operation is performed, the convergence speed slows down, which yields worse average results in the end. Two of the tested values give comparable results in the end, and we choose between them for consistency.

Figure 8: Conv. plots for the fraction using (left) and the flip coin using (right). In both cases KaFFPaEco is used as a base case partitioner.

B.2 Further Comparison Data

Figure 9: Convergence plots for the comparison with repeated executions of KaFFPa.
k/Algo.    Reps.    KaFFPaE    Scotch             Metis
            Avg.       Avg.     Best.     [s]     Best.     [s]
2            569        568       671    0.22       711    0.12
4          1 229      1 217     1 486    0.41     1 574    0.13
8          2 207      2 173     2 663    0.62     2 831    0.13
16         3 568      3 474     4 192    0.86     4 500    0.14
32         5 481      5 298     6 437    1.15     6 899    0.15
64         8 141      7 879     9 335    1.46    10 306    0.18
128       11 937     11 486    13 427    1.85    14 500    0.20
256       17 262     16 634    18 972    2.28    20 341    0.25
overall    3 872      3 779     4 507    0.87     4 835    0.16
Table 4: Averages of final values of different algorithms on the middle sized testset. KaFFPa (Reps.) and KaFFPaE were each given two hours of time on 16 PEs per repetition and instance. The values for Metis and Scotch are averages of the best cut that occurred in ten repetitions.
k/Algo.    S3R              K3R              KC               SC
             avg.    [s]      avg.    [s]      avg.    [s]      avg.    [s]
2             591     19       577     14       582     12       590     17
4           1 304     30     1 261     28     1 254     22     1 302     27
8           2 336     40     2 252     45     2 255     36     2 332     41
16          3 723     54     3 617     67     3 649     57     3 714     61
32          5 720     82     5 569    110     5 540     99     5 722     84
64          8 463    116     8 236    164     8 213    146     8 512    113
128        12 435    171    12 008    239    11 895    225    12 432    162
256        17 915    217    17 335    327    17 199    329    17 935    232

Table 5: Comparison of different combine operators. Average values of cuts and runtime.

B.3 Larger Scalability Plots

Figure 10: Scalability of our algorithm: (upper) a normal convergence plot, (middle) mean minimum cut relative to best cut of KaFFPaE using one PE, (lower) pseudo speedup .

B.4 Road Networks

               PUNCH                   Buffoon                 KaFFPa Strong           KaSPar Strong           Scotch                  Metis
graph     k    Best    Avg.    t[m]    Best    Avg.    t[m]    Best    Avg.    t[m]    Best    Avg.    t[m]    Best    Avg.    t[m]    Best    Avg.    t[m]
deu       2     164     166    0.83     161     161     6.2     163     166    3.29     167     172    3.86     265     279    0.05     271     296    0.10
deu       4     400     410    0.96     393     394     6.8     395     403    5.25     419     426    4.07     608     648    0.10     592     710    0.10
deu       8     711     746    1.02     693     694     9.7     726     729    5.85     762     773    4.17   1 109   1 211    0.15   1 209   1 600    0.10
deu      16   1 144   1 188    0.83   1 137   1 148    16.8   1 263   1 278    7.05   1 308   1 333    4.64   1 957   2 061    0.20   2 052   2 191    0.10
deu      32   1 960   2 032    0.71   1 898   1 928    31.7   2 115   2 146    7.68   2 182   2 217    4.73   3 158   3 262    0.25   3 225   3 607    0.10
deu      64   3 165   3 253    0.83   3 143   3 164    61.1   3 432   3 440    8.55   3 610   3 631    4.89   4 799   4 937    0.30   4 985   5 320    0.10
eur       2     129     130    4.25     129     175    39.5     130     130   16.88     133     138   32.44     369     448    0.20     412     454    0.55
eur       4     309     309    3.58     310     317    39.1     412     430   30.40     355     375   36.13     727     851    0.40     902   1 698    0.54
eur       8     634     671    2.93     659     671    47.9     749     772   34.45     774     786   37.21   1 338   1 461    0.60   2 473   3 819    0.55
eur      16   1 293   1 353    2.52   1 238   1 257    73.5   1 454   1 493   39.01   1 401   1 440   42.56   2 478   2 563    0.81   3 314   8 554    0.56
eur      32   2 289   2 362    2.17   2 240   2 260   130.2   2 428   2 504   40.76   2 595   2 643   43.31   4 057   4 249    1.00   5 811   7 380    0.55
eur      64   3 828   3 984    2.41   3 825   3 862   248.9   4 240   4 264   42.23   4 502   4 526   42.23   6 518   6 739    1.23  10 264  13 947    0.55
overall         822     847    1.57     812     831    33.9  893.05     909   13.97     911     931   13.03   1 495   1 607    0.30   1 800   2 400    0.23
Table 6: Detailed per instance results for road networks. PUNCH was run 100 times, Buffoon 10 times and KaFFPa, KaSPar, Scotch and Metis were run 5 times.

Appendix C Instances

small sized instances
graph                  n             m
rgg15           $2^{15}$       160 240
rgg16           $2^{16}$       342 127
delaunay15      $2^{15}$        98 274
delaunay16      $2^{16}$       196 575
uk                 4 824         6 837
luxemburg        114 599       119 666
3elt               4 720        13 722
4elt              15 606        45 878
fe_sphere         16 386        49 152
cti               16 840        48 232
fe_body           45 087       163 734

medium sized instances
graph                  n             m
rgg17           $2^{17}$       728 753
rgg18           $2^{18}$     1 547 283
delaunay17      $2^{17}$       393 176
delaunay18      $2^{18}$       786 396
bel              463 514       591 882
nld              893 041     1 139 540
t60k              60 005        89 440
wing              62 032       121 544
fe_tooth          78 136       452 591
fe_rotor          99 617       662 431
memplus           17 758        54 196

road networks
graph                  n             m
germany        4 378 446     5 483 587
europe        18 029 721    22 217 686
Table 7: Basic properties of our benchmark set.

Appendix D Detailed Walshaw Benchmark Results


Graph/k 2 4 8 16 32 64
add20 642 594 1 194 1 159 1 727 1 696 2 107 2 062 2 512 2 687 3 188 3 108
data 188 188 377 378 656 659 1 142 1 135 1 933 1 858 2 966 2 885
3elt 89 89 199 199 340 341 568 569 967 968 1 553 1 553
uk 19 19 40 40 80 82 144 146 251 256 417 419
add32 10 10 33 33 66 66 117 117 212 212 486 493
bcsstk33 10 096 10 097 21 390 21 508 34 174 34 178 55 327 54 763 78 199 77 964 109 811 108 467
whitaker3 126 126 380 380 654 655 1 091 1 091 1 678 1 697 2 532 2 552
crack 183 183 362 362 676 677 1 098 1 089 1 697 1 687 2 581 2 555
wing_nodal 1 695 1 695 3 563 3 565 5 422 5 427 8 353 8 339 12 040 11 828 16 185 16 124
fe_4elt2 130 130 349 349 603 604 1 002 1 005 1 620 1 628 2 530 2 519
vibrobox 11 538 10 310 18 956 19 098 24 422 24 509 33 501 32 102 41 725 40 085 49 012 47 651
bcsstk29 2 818 2 818 8 029 8 029 13 904 13 950 22 618 21 768 35 654 34 841 57 712 57 031
4elt 138 138 320 320 532 533 932 934 1 551 1 547 2 574 2 579
fe_sphere 386 386 766 766 1 152 1 152 1 709 1 709 2 494 2 488 3 599 3 584
cti 318 318 944 944 1 749 1 752 2 804 2 837 4 117 4 129 5 820 5 818
memplus 5 491 5 484 9 448 9 500 11 807 11 776 13 250 13 001 15 187 14 107 17 183 16 543
cs4 366 366 925 934 1 436 1 448 2 087 2 105 2 910 2 938 4 032 4 051
bcsstk30 6 335 6 335 16 596 16 622 34 577 34 604 70 945 70 604 116 128 113 788 176 099 172 929
bcsstk31 2 699 2 699 7 282 7 287 13 201 13 230 23 761 23 807 37 995 37 652 59 318 58 076
fe_pwt 340 340 704 704 1 433 1 437 2 797 2 798 5 523 5 549 8 222 8 276
bcsstk32 4 667 4 667 9 195 9 208 20 204 20 323 35 936 36 399 61 533 60 776 94 523 91 863
fe_body 262 262 598 598 1 026 1 048 1 714 1 779 2 796 2 935 4 825 4 879
t60k 75 75 208 208 454 454 805 815 1 320 1 352 2 079 2 123
wing 784 784 1 610 1 613 2 479 2 505 3 857 3 880 5 584 5 626 7 680 7 656
brack2 708 708 3 013 3 013 7 040 7 099 11 636 11 649 17 508 17 398 26 226 25 913
finan512 162 162 324 324 648 648 1 296 1 296 2 592 2 592 10 560 10 560
fe_tooth 3 814 3 815 6 846 6 867 11 408 11 473 17 411 17 396 25 111 24 933 34 824 34 433
fe_rotor 2 031 2 031 7 180 7 292 12 726 12 813 20 555 20 438 31 428 31 233 46 372 45 911
598a 2 388 2 388 7 948 7 952 15 956 15 924 25 741 25 789 39 423 38 627 57 497 56 179
fe_ocean 387 387 1 816 1 824 4 091 4 134 7 846 7 771 12 711 12 811 20 301 19 989
144 6 478 6 478 15 152 15 140 25 273 25 279 37 896 38 212 56 550 56 868 79 198 80 406
wave 8 658 8 665 16 780 16 875 28 979 29 115 42 516 42 929 61 104 62 551 85 589 86 086
m14b 3 826 3 826 12 973 12 981 25 690 25 852 42 523 42 351 65 835 67 423 98 211 99 655
auto 9 949 9 954 26 614 26 649 45 557 45 470 77 097 77 005 121 032 121 608 172 167 174 482
Table 8: Computing partitions from scratch, $\varepsilon = 1\%$. In each $k$-column the results computed by KaFFPaE are on the left and the current Walshaw cuts are presented on the right side.
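Reading a row of these tables amounts to comparing the left (KaFFPaE) and right (Walshaw archive) value of each column pair. The snippet below is a small illustration of that comparison; the helper name `compare_row` is hypothetical, and the pairs are typed in by hand from the row "uk" of Table 8 because the space used as thousands separator makes the flattened rows ambiguous to parse automatically.

```python
def compare_row(pairs):
    """pairs holds one (KaFFPaE cut, archive cut) tuple per number of blocks.
    Returns how often KaFFPaE is better, equal, or worse (smaller cuts are better)."""
    better = sum(1 for ours, ref in pairs if ours < ref)
    equal = sum(1 for ours, ref in pairs if ours == ref)
    worse = len(pairs) - better - equal
    return better, equal, worse

# Row "uk" of Table 8 for 2, 4, 8, 16, 32 and 64 blocks.
uk = [(19, 19), (40, 40), (80, 82), (144, 146), (251, 256), (417, 419)]
print(compare_row(uk))  # (4, 2, 0)
```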

Graph/k 2 4 8 16 32 64
add20 623 576 1 180 1 158 1 696 1 689 2 075 2 062 2 422 2 387 2 963 3 021
data 185 185 369 369 638 638 1 111 1 118 1 815 1 801 2 905 2 809
3elt 87 87 198 198 334 335 561 562 950 950 1 537 1 532
uk 18 18 39 39 78 78 140 141 240 245 406 411
add32 10 10 33 33 66 66 117 117 212 212 486 490
bcsstk33 10 064 10 064 20 767 20 854 34 068 34 078 54 772 54 455 77 549 77 353 108 645 107 011
whitaker3 126 126 378 378 650 651 1 084 1 086 1 662 1 673 2 498 2 499
crack 182 182 360 360 671 673 1 077 1 077 1 676 1 666 2 534 2 529
wing_nodal 1 678 1 678 3 538 3 542 5 361 5 368 8 272 8 310 11 939 11 828 15 967 15 874
fe_4elt2 130 130 342 342 595 596 991 994 1 599 1 613 2 485 2 503
vibrobox 11 538 10 310 18 736 18 778 24 204 24 170 33 065 31 514 41 312 39 512 48 184 47 651
bcsstk29 2 818 2 818 7 971 7 983 13 717 13 816 22 000 21 410 34 535 34 400 55 544 55 302
4elt 137 137 319 319 522 523 906 908 1 523 1 524 2 543 2 565
fe_sphere 384 384 764 764 1 152 1 152 1 698 1 704 2 474 2 471 3 552 3 530
cti 318 318 916 916 1 714 1 714 2 746 2 758 3 994 4 011 5 579 5 675
memplus 5 353 5 353 9 375 9 362 11 662 11 624 13 088 13 001 14 617 14 107 16 997 16 259
cs4 360 360 917 926 1 424 1 434 2 055 2 087 2 892 2 925 4 016 4 051
bcsstk30 6 251 6 251 16 399 16 497 34 137 34 275 69 592 69 763 113 888 113 788 173 290 171 727
bcsstk31 2 676 2 676 7 150 7 150 12 985 13 003 23 299 23 232 37 109 37 228 58 143 57 953
fe_pwt 340 340 700 700 1 410 1 411 2 773 2 776 5 460 5 488 8 124 8 205
bcsstk32 4 667 4 667 8 725 8 733 19 956 19 962 35 140 35 486 59 716 58 966 91 544 91 715
fe_body 262 262 598 598 1 018 1 016 1 708 1 734 2 738 2 810 4 643 4 799
t60k 71 71 203 203 449 449 793 802 1 304 1 333 2 039 2 098
wing 773 773 1 593 1 602 2 451 2 463 3 807 3 852 5 559 5 626 7 561 7 656
brack2 684 684 2 834 2 834 6 800 6 861 11 402 11 444 17 167 17 194 25 658 25 913
finan512 162 162 324 324 648 648 1 296 1 296 2 592 2 592 10 560 10 560
fe_tooth 3 788 3 788 6 764 6 795 11 287 11 274 17 176 17 310 24 752 24 933 34 230 34 433
fe_rotor 1 959 1 959 7 118 7 126 12 445 12 472 20 076 20 112 30 664 31 233 45 053 45 911
598a 2 367 2 367 7 816 7 838 15 613 15 722 25 563 25 686 38 346 38 627 56 153 56 179
fe_ocean 311 311 1 693 1 696 3 920 3 921 7 657 7 631 12 437 12 539 19 521 19 989
144 6 434 6 438 15 203 15 078 25 092 25 109 37 730 37 762 55 941 56 356 78 636 78 559
wave 8 591 8 594 16 665 16 668 28 506 28 495 42 259 42 295 60 731 61 722 84 533 85 185
m14b 3 823 3 823 12 948 12 948 25 390 25 520 41 778 41 997 65 359 65 180 96 519 96 802
auto 9 673 9 683 25 789 25 836 44 785 44 832 75 719 75 778 119 157 120 086 170 989 171 535
Table 9: Computing partitions from scratch, $\varepsilon = 3\%$. In each $k$-column the results computed by KaFFPaE are on the left and the current Walshaw cuts are presented on the right side.

Graph/k 2 4 8 16 32 64
add20 598 546 1 169 1 149 1 689 1 675 2 061 2 062 2 411 2 387 2 963 3 021
data 182 181 363 363 628 628 1 088 1 084 1 786 1 776 2 832 2 798
3elt 87 87 197 197 329 330 557 558 944 942 1 509 1 519
uk 18 18 39 39 75 76 137 139 237 242 395 400
add32 10 10 33 33 63 63 117 117 212 212 483 486
bcsstk33 9 914 9 914 20 167 20 179 33 919 33 922 54 333 54 296 77 457 77 101 106 903 106 827
whitaker3 126 126 377 378 644 644 1 073 1 079 1 650 1 667 2 477 2 498
crack 182 182 360 360 666 667 1 065 1 076 1 661 1 655 2 505 2 516
wing_nodal 1 669 1 668 3 521 3 522 5 341 5 345 8 241 8 264 11 793 11 828 15 892 15 813
fe_4elt2 130 130 335 335 578 580 983 984 1 575 1 592 2 461 2 482
vibrobox 11 254 10 310 18 690 18 696 23 924 23 930 32 615 31 234 40 816 39 183 47 624 47 361
bcsstk29 2 818 2 818 7 925 7 936 13 540 13 575 21 459 20 924 33 851 33 817 55 029 54 895
4elt 137 137 315 315 515 515 888 895 1 504 1 516 2 514 2 546
fe_sphere 384 384 762 762 1 152 1 152 1 681 1 683 2 434 2 465 3 528 3 522
cti 318 318 889 889 1 684 1 684 2 719 2 721 3 927 3 920 5 512 5 594
memplus 5 281 5 267 9 292 9 297 11 624 11 543 13 095 13 001 14 537 14 107 16 650 16 044
cs4 353 353 909 912 1 420 1 431 2 043 2 079 2 866 2 919 3 973 4 012
bcsstk30 6 251 6 251 16 189 16 186 34 071 34 146 69 337 69 288 112 159 113 321 170 321 170 591
bcsstk31 2 669 2 670 7 086 7 088 12 853 12 865 22 871 23 104 36 502 37 228 57 502 56 674
fe_pwt 340 340 700 700 1 405 1 405 2 743 2 745 5 399 5 423 7 985 8 119
bcsstk32 4 622 4 622 8 441 8 441 19 411 19 601 34 481 35 014 58 395 58 966 90 586 89 897
fe_body 262 262 588 588 1 013 1 014 1 684 1 697 2 696 2 787 4 512 4 642
t60k 65 65 195 195 443 445 788 796 1 299 1 329 2 021 2 089
wing 770 770 1 590 1 593 2 440 2 452 3 775 3 832 5 538 5 564 7 567 7 611
brack2 660 660 2 731 2 731 6 592 6 611 11 193 11 232 16 919 17 112 25 598 25 805
finan512 162 162 324 324 648 648 1 296 1 296 2 592 2 592 10 560 10 560
fe_tooth 3 773 3 773 6 688 6 714 11 154 11 185 17 070 17 215 24 733 24 933 34 320 34 433
fe_rotor 1 940 1 940 6 899 6 940 12 309 12 347 19 680 19 932 30 356 30 974 45 131 45 911
598a 2 336 2 336 7 728 7 735 15 414 15 483 25 450 25 533 38 476 38 550 56 377 56 179
fe_ocean 311 311 1 686 1 686 3 893 3 902 7 385 7 412 12 211 12 362 19 400 19 727
144 6 357 6 359 15 004 14 982 25 030 24 767 37 419 37 122 55 460 55 984 77 430 78 069
wave 8 524 8 533 16 558 16 533 28 489 28 492 42 084 42 134 60 537 61 280 83 413 84 236
m14b 3 802 3 802 12 945 12 945 25 154 25 143 41 465 41 536 65 237 65 077 96 257 96 559
auto 9 450 9 450 25 271 25 301 44 206 44 346 74 636 74 561 119 294 119 111 169 835 171 329
Table 10: Computing partitions from scratch, $\varepsilon = 5\%$. In each $k$-column the results computed by KaFFPaE are on the left and the current Walshaw cuts are presented on the right side.