1 Background and polytopes tested
A convex polyhedron can be represented either by a list of its vertices and extreme rays, called a V-representation, or by a list of its facet-defining inequalities, called an H-representation. The vertex enumeration problem is to convert an H-representation into a V-representation. The computationally equivalent facet enumeration problem performs the reverse transformation. For further background see G. Ziegler's Lectures on Polytopes.
In this note we consider only polytopes (bounded polyhedra), so extreme rays will not be required. Furthermore, for technical simplicity in this description, we assume that all polytopes are full dimensional. Neither condition is required for the algorithms tested, and in fact some of our test problems are not full dimensional. The input for either problem is represented by an m by n+1 matrix. For the vertex enumeration problem this is a list of m inequalities in n variables whose intersection defines the polytope P. For a facet enumeration problem it is a list of the vertices of P, each beginning with a 1 in column one (extreme rays would be indicated by a zero in column one). So in either case, under our assumption, the dimension of P is n.
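To make the matrix format concrete, the following sketch writes the V-representation of the 3-cube in the style of an lrslib input file: m rows, each a leading 1 (vertex marker) followed by n coordinates. The helper name `cube_ext` is ours, and the exact layout should be checked against the lrslib user guide; treat this as illustrative.

```python
from itertools import product

def cube_ext(d=3):
    """Write the vertices of the d-cube as an lrs-style V-representation.
    Each data row is '1' (marking a vertex, not a ray) followed by d
    coordinates, giving an m by d+1 matrix."""
    verts = list(product((0, 1), repeat=d))
    lines = ["cube%d" % d, "V-representation", "begin",
             "%d %d rational" % (len(verts), d + 1)]
    lines += ["1 " + " ".join(map(str, v)) for v in verts]
    lines.append("end")
    return "\n".join(lines)

print(cube_ext(3))
```

Running a facet enumeration code on this file would recover the 6 facet inequalities of the cube.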
One of the features of this type of enumeration problem is that the output size varies widely for given input parameters m and n. This is shown explicitly by McMullen's Upper Bound Theorem (see, e.g., Ziegler's book), which is tight. It states that for a vertex enumeration problem with parameters m and n we have:

$$v \;\le\; f_{\max}(m,n) \;=\; \binom{m-\lceil n/2\rceil}{m-n} + \binom{m-\lfloor n/2\rfloor-1}{m-n} \qquad (1)$$
where v is the number of vertices that are output. For a facet enumeration problem, by the polarity of polytopes, the same inequality holds if we replace v by f, the number of facets. By inverting the formula we can get lower bounds on the output size.
A class of polytopes for which the bound (1) is tight are the cyclic polytopes, which are usually given as a V-representation consisting of m points on the n-dimensional moment curve. So, for example, a cyclic polytope with m = 40 and n = 20 has 40 vertices in dimension 20 and 40,060,020 facets. This implies that if we started with its H-representation, i.e., m = 40,060,020 and n = 20, then the output would consist of only 40 vertices! Problems of this second type are called highly degenerate, since each vertex may be described by many different combinations of facets. This contrasts with a simple polytope, where each vertex is given by the intersection of exactly n facets. Dually, a simplicial polytope is one where each facet contains precisely n vertices. Cyclic polytopes are simplicial.
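The bound (1) is easy to evaluate. A small sketch (the function name `fmax` is ours) reproducing the cyclic polytope counts above:

```python
from math import comb

def fmax(m, n):
    """McMullen upper bound (1): the maximum number of vertices of a
    polytope defined by m inequalities in n variables (equivalently, by
    polarity, the maximum number of facets of a polytope with m vertices
    in dimension n)."""
    return comb(m - (n + 1) // 2, m - n) + comb(m - n // 2 - 1, m - n)

# The bound is tight for cyclic polytopes: 40 points on the moment
# curve in dimension 20 give a polytope with 40,060,020 facets.
print(fmax(40, 20))   # 40060020
```

The same call with m = 30, n = 15 gives the facet count of the cyclic polytope c30-15.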
The polytopes we tested are described in Table 1 and range from simple polytopes to highly degenerate polytopes. This table includes the results of an lrs run on each polytope, as lrs gives the number of cobases in a symbolic perturbation of the polytope, showing how degenerate the polytope is. The corresponding input files are contained in the lrslib-062 distribution in subdirectory lrslib-062/ine/test-062. Note that the input sizes are small, roughly comparable and, except for cp6, much smaller than the output sizes. Five of the problems were used in earlier studies:
c30-15, c40-20: cyclic polytopes, described above. These have very large integer coefficients, the longest having 23 digits for c30-15 and 33 digits for c40-20.
mit: a configuration polytope used in materials science, created by G. Garbulsky. The inequality coefficients are mostly small integers, with a few larger values.
perm10: the permutahedron for permutations of length 10, whose vertices are the 10! permutations of (1, 2, ..., 10). It is a 9-dimensional simple polytope. More generally, for permutations of length n, this polytope is described by 2^n - 2 facets and one equation, and has n! vertices. The variables all have coefficients 0 or 1.
bv7: an extended formulation of the permutahedron based on the Birkhoff-von Neumann polytope, here for permutations of length 7. It is described by inequalities and equations in a larger set of variables and likewise has n! vertices. The inequalities are all 0/1 valued and the equations have single-digit integer coefficients. The input matrix is very sparse and the polytope is highly degenerate.
The new problems are:
fq48-19: related to the travelling salesman problem, created by F. Quondam (private communication). The coefficients are all 0/1 valued and it is moderately degenerate.
mit71: a correlation polytope related to problem mit, created by G. Garbulsky. The coefficients are similar to those of mit and it is moderately degenerate.
zfw91: a polytope based on a sensor network that is extremely degenerate and has a large output size, created by Z.F. Wang. There are three non-zeroes per row.
cp6: the cut polytope for the complete graph K6, solved in the 'reverse' direction: from an H-representation to a V-representation. The output consists of the 32 cut vectors of K6. It is extremely degenerate, approaching the lower bound of 19 vertices implied by (1) for these parameters. The coefficients of the variables are small integers.
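As a sanity check on this description, the following sketch enumerates the cut vectors of K6 and recovers the lower bound of 19 by inverting McMullen's bound in the binomial form f_max(m,n) = C(m-⌈n/2⌉, m-n) + C(m-⌊n/2⌋-1, m-n); all function names are ours.

```python
from itertools import combinations
from math import comb

# Cut vectors of K6: for each vertex subset S, the 0/1 vector over the
# 15 edges with a 1 exactly on the edges crossing the cut (S, V \ S).
# S and its complement give the same vector, so only 2^6 / 2 = 32 remain.
edges = list(combinations(range(6), 2))
cuts = {tuple(int((u in S) != (v in S)) for u, v in edges)
        for k in range(7) for S in map(set, combinations(range(6), k))}
print(len(cuts))   # 32

def fmax(m, n):
    """Upper bound (1) on the number of vertices given m inequalities."""
    return comb(m - (n + 1) // 2, m - n) + comb(m - n // 2 - 1, m - n)

# Inverting the bound: a polytope with v vertices in dimension n has at
# most fmax(v, n) facets, so 368 facets in dimension 15 force
# v >= min { v : fmax(v, 15) >= 368 }.
v = 16
while fmax(v, 15) < 368:
    v += 1
print(v)   # 19
```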
zfw91 | H | 91 | 38 | 7.1K | 2787415 | 205M | 10819289888124 (computed by mplrs1 v. 6.2 in 2144809 seconds using 289 cores; see Section 2) | *
cp6 | H | 368 | 16 | 18K | 32 | 1.6K | 4844923002 | 153 | 1762156 (computed by lrs v. 6.0)
2 Algorithms, implementations and machines used
There are basically two approaches to this problem: pivoting, using reverse search, and the Fourier-Motzkin double description method. The conventional wisdom is to use the double description method if the polytope is highly degenerate, and to use a pivoting method if it is simple or has low degeneracy. The results below shed some doubt on the first part of this rule, especially when parallel processing is used. They do, however, confirm the second part. We tested five sequential codes: four based on the double description method and one based on pivoting:
cddr+ (v. 0.77): Double description code developed by K. Fukuda.
normaliz (v. 3.0.0): Hybrid parallel double description code developed by the Normaliz project.
porta (v. 1.4.1): Double description code developed by T. Christof and A. Lobel.
ppl (v. 1.1): Double description code developed by the Parma Polyhedra Library project.
lrs (v. 6.2): C vertex enumeration code based on reverse search, developed by D. Avis.
Of these five codes, lrs and normaliz offer parallelization. For normaliz this occurs automatically with the standard implementation if it is run on a shared memory multicore machine. The number of cores used can be controlled with the -x option, which we used extensively in our tests. For lrs two wrappers have been developed:
plrs (v. 6.2): C++ wrapper for lrs using the Boost library, developed by G. Roumanis. It runs on a single shared memory multicore machine.
mplrs (v. 6.2): C wrapper for lrs using the MPI library, developed by the authors. It runs on a network of multicore machines.
For cp6 the lrs times in Tables 1–2 and the plrs times in Table 4 were obtained using v. 6.0, which has a smaller backtrack cache size than v. 6.2. Hence the mplrs speedups against lrs for cp6 in Table 2 are probably somewhat larger than they would be against lrs v. 6.2. The mplrs times in Table 6 were obtained using v. 5.1b.
All of the above codes compute in exact integer arithmetic and, with the exception of porta, are compiled with the GMP library for this purpose. However, normaliz uses hybrid arithmetic, giving a very large speedup on certain inputs, as described in the next section. In addition, porta can be run in either fixed or extended precision.
Finally, lrs is also available in a fixed precision 64-bit version, lrs1, which does no overflow checking. Since overflow goes undetected, the results are unpredictable and need independent verification. In practice, for cases where no arithmetic overflow occurs, lrs1 runs about 4–6 times faster than lrs (see Computational Results on the lrs home page). The parallel version of lrs1, mplrs1, was used to compute the number of cobases for zfw91, taking roughly 25 days on 289 cores.
The tests were performed using the following machines:
mai20: 2x Xeon E5-2690 (10-core 3.0GHz), 20 cores, 128GB memory, 3TB hard drive
mai32abcd: 4 nodes, each containing: 2x Opteron 6376 (16-core 2.3GHz), 32GB memory, 500GB hard drive (128 cores in total)
mai32ef: 4x Opteron 6376 (16-core 2.3GHz), 64 cores, 256GB memory, 4TB hard drive
mai64: 4x Opteron 6272 (16-core 2.1GHz), 64 cores, 64GB memory, 500GB hard drive
mai12: 2x Xeon X5650 (6-core 2.66GHz), 12 cores, 24GB memory, 60GB hard drive
mai24: 2x Opteron 6238 (12-core 2.6GHz), 24 cores, 16GB memory, 600GB RAID5 array
Tsubame2: supercomputer located at Tokyo Institute of Technology
The first six machines total 312 cores and are located at Kyoto University. They were purchased between 2011 and 2015 for a combined total of 3.9 million yen (about $33,200).
3 Computational results
Table 2 contains the results obtained by running the five sequential codes on the problems described in Table 1. The times for lrs shown in Table 1 are included for comparison. The time limit was one week (604,800 seconds), except for cp6. The programs cddr+, lrs and ppl were run with no command-line parameters.
cp6 | 1762156 (computed by lrs v. 6.0) | 1463829 | 6570000 | 138162 | 1518785 | ** | 4925580
The program normaliz performs many additional functions, but was set to perform only vertex enumeration/facet enumeration for these tests. By default, it begins with 64-bit integer arithmetic and only switches to GMP arithmetic (used by all other programs except porta) in case of overflow. In that case, all work done with 64-bit arithmetic is discarded. For our test problems this happened on c30-15, c40-20 and mit71; however, the first two problems terminated abnormally after switching to GMP. With the -B flag, normaliz does all computations in GMP arithmetic. We give times both for the default hybrid arithmetic and for GMP-only arithmetic. Note that mit71 runs significantly faster with the -B flag, reflecting the time wasted in 64-bit arithmetic mode.
As mentioned above, porta supports arithmetic using either 64-bit integers or its own extended precision arithmetic package. The program terminates if overflow occurs. We tested both options on each problem and found the extended precision option outperformed the 64-bit option in all cases.
It is hard to draw many general conclusions from the results in Table 2, especially since the four double description implementations behaved remarkably differently on most of the problems. This may be because the method is highly sensitive to the insertion order of the input, and the codes may use different orderings. One clear result is that none of these codes could solve the cyclic polytope problem c40-20, and they struggled even on c30-15. We also observed that the double description codes use substantial memory, especially normaliz. In fact the machines with 32GB or less of memory were not able to solve either mit71 or cp6 using these codes, and even in single-processor mode most of the 128GB available on mai20 was required for some problems. Memory use by lrs/plrs/mplrs was negligible, making them good background processes. On the extremely degenerate problem cp6, lrs was in the middle of the pack, about 20% slower than cddr+, whereas normaliz was nearly 13 times faster than lrs. On this problem, neither ppl nor porta was able to produce any output in the time allotted (76 and 57 days respectively). Only porta and normaliz could effectively solve the sparse 0/1 polytope zfw91: a 289-core run with mplrs1 was approximately 70 times slower than porta and 11 times slower than normaliz. Note that, with about 151 million cobases per vertex, cp6 is far more degenerate than zfw91, which has about 4 million cobases per vertex.
To put the above results in perspective, we recall that the problem mit was a big challenge in the early 1990s. At that time, early versions of both cddr+ and lrs took over a month to solve this problem. Combined hardware and software improvements over the years give speedups of over 5000 times; both codes now complete the job in less than 10 minutes. We will see that parallelization of lrs can lead to further dramatic reductions in running time: on our 312-core cluster the problem now requires only 12 seconds.
We move now to the three parallel codes. For mplrs and plrs we used the default settings (see User’s guide  for details):
plrs: -id 4
mplrs: -id 2 -lmin 3 -maxc 50 -scale 100 -maxbuf 500
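The mplrs options above govern how the reverse search tree is split into jobs: workers explore subtrees up to a budget of nodes and return the roots of unexplored subtrees to the master's job queue. The following toy sketch of this budgeted scheme uses a synthetic binary tree in place of the reverse search oracle; all names are ours, and it is not mplrs's actual implementation.

```python
from collections import deque

def children(node):
    """Toy stand-in for the reverse-search oracle: a complete binary
    tree of depth 4, nodes encoded as tuples of 0/1 moves."""
    return [node + (b,) for b in (0, 1)] if len(node) < 4 else []

def explore(root, budget):
    """DFS from root, visiting at most `budget` nodes; return the
    visited nodes and the unexplored subtree roots (cf. -maxc)."""
    visited, leftover, stack = [], [], [root]
    while stack:
        if len(visited) >= budget:
            leftover.extend(stack)   # hand the rest back as new jobs
            break
        node = stack.pop()
        visited.append(node)
        stack.extend(children(node))
    return visited, leftover

def parallel_enumerate(budget):
    """Master loop: real mplrs distributes jobs to MPI workers; here
    jobs are processed one at a time for clarity."""
    queue, all_visited = deque([()]), []
    while queue:
        visited, leftover = explore(queue.popleft(), budget)
        all_visited.extend(visited)
        queue.extend(leftover)
    return all_visited

print(len(parallel_enumerate(5)))   # 31 nodes, each visited once
```

Because each node either is visited in its job or re-enters the queue as a subtree root, every node is output exactly once regardless of the budget.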
For normaliz we used the default settings, which imply that hybrid arithmetic is used. Table 3 contains results for low scale parallelization; all problems were run on the single workstation mai20. With 4 cores available, plrs usually outperforms mplrs; they give similar performance with 8 cores; and mplrs is usually faster with 12 or more cores. With 16 cores, mplrs gave quite consistent speedups, in the range 10–12.3. On the problems it could solve, the speedups obtained by normaliz showed a much higher variance, in the range 0.93–15.7.
Name | 4 cores | 8 cores | 12 cores | 16 cores
Table 4 contains results for medium scale parallelization on the 64-core shared memory machine mai64. Note that these processors are considerably slower than those of mai20 on a per-core basis. We used 8, 16, 32 and 64 cores, and speedups are measured relative to the running time on 8 cores. With 64 cores, mplrs was the clear winner over plrs, with speedups ranging from 4.3 to 7.2; plrs showed little improvement after 32 cores and normaliz again had very large variance.
Name | 8 cores | 16 cores | 32 cores | 64 cores
(each entry: seconds / speedup vs. 8 cores)
cp6 (plrs times computed using v. 6.0) | 727771 | 565915 | 38621 | 326214 | 377857 | 23773 | 171194 | 298408 | 17468 | 100676 | 229713 | 15480
Table 5 contains results for large scale parallelization on our 312-core cluster of computers. Only mplrs is able to use all cores in this heterogeneous environment. The machines were scheduled in the order given at the end of Section 2 (excluding Tsubame2). Due to the heterogeneous selection of machines we do not present speedups in this table. For example, we observed that mai20 is substantially faster than the other machines, more so than would be expected by simply comparing clock speeds and numbers of cores: it was more than twice as fast as mai12 on c40-20. Jobs completing in under a minute profit little, if at all, from extra cores, but the longer running jobs show continuous improvement. Excluding zfw91, lrs required about 3 weeks on the fastest machine (mai20) to complete the other 9 problems; using the 312-core cluster this time is reduced to 4 hours and 40 minutes. These total times are dominated by cp6. Excluding this problem as well, the lrs total running time of 12 hours 13 minutes is improved to roughly 8 minutes using the cluster.
16 cores | 32 cores | 64 cores | 128 cores | 256 cores | 312 cores
Table 6 shows results for large scale parallelization obtained by Kazuki Yoshizoe using mplrs v. 5.1b on the Tsubame2 supercomputer at the Tokyo Institute of Technology. He ran tests using problems mit71 and cp6 and observed near-linear speedup between 12 and 1200 cores for both problems (the cp6 baseline was taken on mai12, which has a processor (Xeon X5650) similar to those used on Tsubame2 (Xeon X5670)). With 1200 cores, mplrs solved cp6 in about 42 minutes: nearly 600 times faster than cddr+, 55 times faster than normaliz in single-processor mode, and over 6 times faster than normaliz running on 64 cores, the largest shared memory machine available to us.
Name | mplrs (v. 5.1b)
12 cores | 36 cores | 72 cores | 144 cores | 300 cores | 600 cores | 1200 cores
4 Conclusions
These results show that the difficulty of solving vertex/facet enumeration problems varies enormously, even for inputs of roughly the same size. Any given problem may be tractable or intractable depending on the method used to solve it. General rules are dangerous and likely to be contradicted by further examples, but we hazard two: learn about your polytope and use multicore hardware.
4.1 Learn something about your polytope
Unfortunately not much can be learned by simply inspecting the input file. Many 0/1 input files are highly degenerate, but not all: perm10, for example, is a simple polytope. Fortunately the degeneracy of a polytope can be checked by doing a partial run of lrs for a few minutes, stopping after a certain number of bases have been computed. As seen from Table 1, the ratio of bases computed to V/H output gives a good estimate of the degeneracy of the problem. The partial run will also indicate whether the output is binary (cp6), consists of small integers (bv7, perm10, zfw91), huge integers (c30-15, c40-20) or rationals (fq48-19, mit, mit71). lrs also has an estimate feature that gives an unbiased estimate of the output size, the number of bases and the total lrs running time. These estimates have high variance but do give some indication of the tractability of the problem.
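As a rough illustration of the ratio test, the cobasis and output counts quoted earlier for the two extremely degenerate problems give very different cobases-per-output-row figures (numbers taken from Table 1; the script itself is only a sketch):

```python
# (total cobases, output rows) from Table 1
problems = {
    "zfw91": (10819289888124, 2787415),  # about 3.9 million per vertex
    "cp6":   (4844923002, 32),           # about 151 million per vertex
}
for name, (bases, out) in problems.items():
    print(name, round(bases / out))
```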
For problems with low degeneracy or very large output sizes pivoting methods such as the lrs family may be the only tractable approach. For extremely degenerate problems with binary or small integer output it is not so clear, as can be seen by comparing the results obtained for cp6 and zfw91.
4.2 Use multicore hardware
Comparing Table 2 with the remaining tables clearly indicates the necessity of using parallel processing for hard vertex/facet enumeration problems: even just 16 cores gives an order of magnitude improvement. A supercomputer on the scale of Tsubame2 may seem out of reach for most researchers. However, at current prices, a 1200-core cluster could be built for roughly $100,000, and considerably more cheaply with used hardware. This price will certainly fall substantially in the near future, making this amount of computing power readily available to more researchers. The problem will not be the availability of the hardware but the availability of software that can make effective use of it.
We thank Kazuki Yoshizoe for kindly allowing us to use the results of his Tsubame2 experiments and for helpful discussions concerning the MPI library which improved mplrs’ performance. This work was partially supported by JSPS Kakenhi Grants 23700019 and 15H00847, Grant-in-Aid for Scientific Research on Innovative Areas, ‘Exploring the Limits of Computation (ELC)’.
-  Assarf, B., Gawrilow, E., Herr, K., Joswig, M., Lorenz, B., Paffenholz, A., Rehn, T.: Computing convex hulls and counting integer points with polymake. Mathematical Programming Computation (to appear) pp. 1–38 (2016)
-  Avis, D.: lrs home page (2013). http://cgm.cs.mcgill.ca/~avis/C/lrs.html
-  Avis, D., Fukuda, K.: A pivoting algorithm for convex hulls and vertex enumeration of arrangements and polyhedra. Discrete & Computational Geometry 8, 295–313 (1992)
-  Avis, D., Jordan, C.: mplrs: A scalable parallel vertex/facet enumeration code. CoRR abs/1511.06487 (2015). URL http://arxiv.org/abs/1511.06487
-  Avis, D., Roumanis, G.: A portable parallel implementation of the lrs vertex enumeration code. In: Combinatorial Optimization and Applications - 7th International Conference, COCOA 2013, Lecture Notes in Computer Science, vol. 8287, pp. 414–429. Springer (2013)
-  BUGSENG: The Parma Polyhedra Library (2013). http://bugseng.com/products/ppl
-  Ceder, G., Garbulsky, G., Avis, D., Fukuda, K.: Ground states of a ternary fcc lattice model with nearest- and next-nearest-neighbor interactions. Phys Rev B Condens Matter 49(1), 1–7 (1994)
-  Christof, T., Lobel, A.: (2009). http://typo.zib.de/opt-long_projects/Software/Porta/
-  Fukuda, K.: (2012). http://www.inf.ethz.ch/personal/fukudak/cdd_home
-  Moran, B., Cohen, F., Wang, Z., Suvorova, S., Cochran, D., Taylor, T., Farrell, P., Howard, S.: Bounds on multiple sensor fusion. ACM Transactions on Sensor Networks 12(2) (2016)
-  Normaliz: (2015). http://www.home.uni-osnabrueck.de/wbruns/normaliz/
-  Ziegler, G.M.: Lectures on Polytopes, Graduate Texts in Mathematics, vol. 152. Springer (1995)