1 Background and polytopes tested
A convex polyhedron can be represented by either a list of vertices and extreme rays, called a V-representation, or a list of its facet-defining inequalities, called an H-representation. The vertex enumeration problem is to convert an H-representation to a V-representation. The computationally equivalent facet enumeration problem performs the reverse transformation. For further background see G. Ziegler [12].
In this note we consider only polytopes (bounded polyhedra) so extreme rays will not be required. Furthermore, for technical simplicity in this description, we assume that all polytopes are full dimensional. Neither condition is required for the algorithms tested and in fact some of our test problems are not full dimensional. The input for either problem is represented by an $m$ by $n$ matrix. For the vertex enumeration problem this is a list of $m$ inequalities in $n-1$ variables whose intersection defines the polytope $P$. For a facet enumeration problem it is a list of the $m$ vertices of $P$, each beginning with a 1 in column one (extreme rays would be indicated by a zero in column one). So in either case, under our assumption, the dimension of $P$ is $n-1$.
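As a small illustration of the two input matrices, the sketch below writes both representations of the unit square. The sign convention for inequality rows (a row $(b, a_1, a_2)$ meaning $b + a_1 x_1 + a_2 x_2 \ge 0$) is an assumption made for this example, and the helper names are hypothetical.

```python
# H-representation of the unit square: m = 4 rows, n = 3 columns.
# Assumed convention: a row (b, a1, a2) encodes b + a1*x1 + a2*x2 >= 0.
H = [
    (0, 1, 0),    # x1 >= 0
    (0, 0, 1),    # x2 >= 0
    (1, -1, 0),   # x1 <= 1
    (1, 0, -1),   # x2 <= 1
]

# V-representation: m = 4 rows, n = 3 columns, each vertex row starts with 1.
V = [
    (1, 0, 0),
    (1, 1, 0),
    (1, 0, 1),
    (1, 1, 1),
]

def satisfies(h_row, v_row):
    """True if the vertex in v_row satisfies the inequality in h_row.

    Because the vertex row starts with 1, the dot product of the two rows
    equals b + a1*x1 + a2*x2.
    """
    return sum(h * v for h, v in zip(h_row, v_row)) >= 0

def dimension(matrix):
    """Dimension n - 1 of a full-dimensional polytope given by an m-by-n matrix."""
    return len(matrix[0]) - 1
```

Both matrices have $n = 3$ columns, so the square has dimension $n - 1 = 2$, and every vertex row satisfies every inequality row.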
One of the features of this type of enumeration problem is that the output size varies widely for given input parameters $m$ and $n$. This is shown explicitly by McMullen's Upper Bound Theorem (see, e.g., [12]), which is tight. It states that for a vertex enumeration problem with parameters $m, n$ we have:
$$v \le \binom{m - \lfloor n/2 \rfloor}{m-n+1} + \binom{m - \lfloor (n+1)/2 \rfloor}{m-n+1} \qquad (1)$$
where $v$ is the number of vertices that are output. For a facet enumeration problem, by polarity of polytopes, the same inequality holds if we replace $v$ by $f$, the number of facets. By inverting the formula we can get lower bounds on the output size.
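The bound and its inversion are easy to evaluate numerically. The sketch below uses the standard form of the Upper Bound Theorem for a polytope of dimension $n-1$, namely $\binom{m - \lfloor n/2 \rfloor}{m-n+1} + \binom{m - \lfloor (n+1)/2 \rfloor}{m-n+1}$; the function names are hypothetical, and the inversion is a plain linear search rather than anything used by the codes tested.

```python
from math import comb

def upper_bound(m, n):
    """Upper Bound Theorem for an m-by-n input (polytope dimension n - 1)."""
    k = m - n + 1
    return comb(m - n // 2, k) + comb(m - (n + 1) // 2, k)

def lower_bound(m, n):
    """Smallest output count v whose upper bound reaches m (the inverted bound)."""
    v = n  # a polytope of dimension n - 1 has at least n vertices and n facets
    while upper_bound(v, n) < m:
        v += 1
    return v
```

For the two cyclic polytopes of Table 1, `upper_bound(30, 16)` and `upper_bound(40, 21)` reproduce the listed output counts, and `lower_bound(368, 16)` gives the lower bound of 19 quoted for cp6.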
The cyclic polytopes are a class of polytopes for which the bound (1) is tight. They are usually given as a V-representation consisting of $m$ points on the $(n-1)$-dimensional moment curve. So, for example, a cyclic polytope with $m=40$ and $n=21$ has 40 vertices in dimension 20 and 40060020 facets. This implies that if we started with its H-representation, i.e., $m=40060020$ and $n=21$, then the output would consist of only 40 vertices! Problems of this second type are called highly degenerate since each vertex may be described by many different combinations of facets. This contrasts with a simple polytope where each vertex is given by the intersection of exactly $n-1$ facets. Dually a simplicial polytope is one where each facet contains precisely $n-1$ vertices. Cyclic polytopes are simplicial.
The polytopes we tested are described in Table 1 and range from simple polyhedra to highly degenerate polyhedra. This table includes the results of an lrs run on each polytope, as lrs gives the number of cobases in a symbolic perturbation of the polytope, showing how degenerate the polytope is. The corresponding input files are contained in the lrslib062 distribution [2] in subdirectory lrslib062/ine/test062. Note that the input sizes are small, roughly comparable and, except for cp6, much smaller than the output sizes. Five of the problems were previously used in [5]:

c3015, c4020: cyclic polytopes described above. These have very large integer coefficients, the longest having 23 digits for c3015 and 33 digits for c4020.

mit: a configuration polytope used in materials science, created by G. Garbulsky [7]. The inequality coefficients are mostly small integers, with a few larger values.

perm10: the permutahedron for permutations of length 10, whose vertices are the $10!$ permutations of $(1, 2, \ldots, 10)$. It is a 9-dimensional simple polytope. More generally, for permutations of length $k$, this polytope is described by $2^k - 2$ facets and one equation and has $k!$ vertices. The variables all have coefficients 0 or 1.

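bv7: an extended formulation of the permutahedron based on the Birkhoff-von Neumann polytope. For permutations of length $k$ it is described by $k^2$ inequalities and $3k-1$ equations in $k^2+k$ variables and also has $k!$ vertices; for $k=7$ this gives the 69 rows, 56 variables and 5040 vertices shown in Table 1. The inequality coefficients are restricted to a small set of values and the equations have single digit integers. The input matrix is very sparse and the polytope is highly degenerate.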
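The V-representation of a cyclic polytope described above can be sketched in a few lines. The choice of parameter values $t = 1, \ldots, m$ on the moment curve is an assumption made for this example (it is not read from the distributed input files), although it does reproduce the 23 and 33 digit coefficients quoted for c3015 and c4020. The function names are hypothetical.

```python
def cyclic_polytope_rows(m, n):
    """m rows (1, t, t^2, ..., t^(n-1)) for t = 1..m.

    Each row is a point on the (n-1)-dimensional moment curve, preceded by
    the leading 1 that marks a vertex in a V-representation.
    """
    return [[t ** k for k in range(n)] for t in range(1, m + 1)]

def longest_entry_digits(rows):
    """Number of decimal digits of the largest entry in the matrix."""
    return max(len(str(x)) for row in rows for x in row)
```

With these assumptions, `cyclic_polytope_rows(30, 16)` is a 30 by 16 integer matrix whose largest entry is $30^{15}$, which has 23 digits.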
The new problems are:

fq4819: related to the travelling salesman problem, created by F. Quondam (private communication). The coefficients are restricted to a small set of values and it is moderately degenerate.

mit71: a correlation polytope related to problem mit, created by G. Garbulsky [7]. The coefficients are similar to mit and it is moderately degenerate.

zfw91: polytope based on a sensor network that is extremely degenerate and has large output size, created by Z.F. Wang [10]. There are three nonzeroes per row.

cp6: the cut polytope for $K_6$ solved in the 'reverse' direction: from an H-representation to a V-representation. The output consists of the 32 cut vectors of $K_6$. It is extremely degenerate, approaching the lower bound of 19 vertices implied by (1) for these parameters. The coefficients of the variables are $0, \pm 1, \pm 2$.
| Name | Input H/V | $m$ | $n$ | Input size | Output V/H | Output size | lrs bases | lrs depth | lrs secs |
|---|---|---|---|---|---|---|---|---|---|
| bv7 | H | 69 | 57 | 8.1K | 5040 | 867K | 84707280 | 17 | 8300 |
| c3015 | V | 30 | 16 | 4.7K | 341088 | 73.8M | 319770 | 14 | 39 |
| c4020 | V | 40 | 21 | 12K | 40060020 | 15.6G | 20030010 | 19 | 9445 |
| fq4819 | H | 48 | 19 | 1.3K | 119184 | 8.7M | 7843390 | 24 | 251 |
| mit71 | H | 71 | 61 | 9.5K | 3149579 | 1.1G | 57613364 | 20 | 20688 |
| mit | H | 729 | 9 | 21K | 4862 | 196K | 1375608 | 101 | 496 |
| perm10 | H | 1023 | 11 | 29K | 3628800 | 127M | 3628800 | 45 | 2230 |
| zfw91 | H | 91 | 38 | 7.1K | 2787415 | 205M | 10819289888124 † | * | |
| cp6 | H | 368 | 16 | 18K | 32 | 1.6K | 4844923002 | 153 | 1762156 ‡ |

† Computed by mplrs1 v6.2 in 2144809 seconds using 289 cores (see Section 2).
‡ Computed by lrs v6.0.
2 Algorithms, implementations and machines used
There are basically two approaches to this problem: pivoting using reverse search [3] and the Fourier-Motzkin double description method, see [12]. The conventional wisdom is to use the double description method if the polytope is highly degenerate and use a pivoting method if it is simple or has low degeneracy (see e.g. [1], Section 3). Results below shed some doubt on the first part of this rule, especially when parallel processing is used. They do, however, confirm the second part of the rule. We tested five sequential codes, including four based on the double description method and one based on pivoting:

cddr+ (v. 0.77): Double description code developed by K. Fukuda [9].

normaliz (v. 3.0.0): Hybrid parallel double description code developed by the Normaliz project [11].

porta (v. 1.4.1): Double description code developed by T. Christof and A. Lobel [8].

ppl (v. 1.1): Double description code developed by the Parma Polyhedral Library project [6].

lrs (v. 6.2): C vertex enumeration code based on reverse search developed by D. Avis [2].
Of these five codes, lrs and normaliz offer parallelization. For normaliz this occurs automatically with the standard implementation if it is run on a shared memory multicore machine. The number of cores used can be controlled with the -x option, which we used extensively in our tests. For lrs two wrappers have been developed:

plrs (v. 6.2): C++ wrapper for lrs using the Boost library, developed by G. Roumanis [5]. It runs on a single shared memory multicore machine.

mplrs (v. 6.2): C wrapper for lrs using the MPI library, developed by the authors [4]. It runs on a network of multicore machines.
For cp6 the lrs times in Tables 1–2 and the plrs times in Table 4 were obtained using v. 6.0, which has a smaller backtrack cache size than v. 6.2. Hence the mplrs speedups against lrs for cp6 in Table 3 are probably somewhat larger than they would be against lrs v. 6.2. The mplrs times in Table 6 were obtained using v. 5.1b.
All of the above codes compute in exact integer arithmetic and, with the exception of porta, are compiled with the GMP library for this purpose. However, normaliz uses hybrid arithmetic, giving a very large speedup for certain inputs as described in the next section. In addition, porta can be run in either fixed or extended precision.
Finally, lrs is also available in a fixed precision 64-bit version, lrs1, which does no overflow checking. In general, this gives unpredictable results that need independent verification. In practice, for cases when there is no arithmetic overflow, lrs1 runs about 4–6 times faster than lrs (see Computational Results on the lrs home page [2]). The parallel version of lrs1, mplrs1, was used to compute the number of cobases for zfw91, taking roughly 25 days on 289 cores.
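To see why unchecked 64-bit arithmetic needs independent verification, the following sketch emulates a range check on exact Python integers. It is only an illustration of the overflow condition, not code from lrs1 or any of the packages tested, and the helper names are hypothetical.

```python
INT64_MIN = -(2 ** 63)
INT64_MAX = 2 ** 63 - 1

def fits_int64(x):
    """True if the exact integer x is representable as a signed 64-bit value."""
    return INT64_MIN <= x <= INT64_MAX

def checked_mul(a, b):
    """Multiply exactly and raise if the product would overflow 64 bits."""
    product = a * b
    if not fits_int64(product):
        raise OverflowError(f"{a} * {b} does not fit in 64 bits")
    return product
```

A signed 64-bit integer holds at most 19 decimal digits, so the 23 and 33 digit coefficients of the cyclic polytopes cannot even be read in. Inputs that do fit can still overflow in intermediate products, which is the case `checked_mul` detects.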
The tests were performed using the following machines:

mai20: 2x Xeon E5-2690 (10-core 3.0GHz), 20 cores, 128GB memory, 3TB hard drive

mai32abcd: 4 nodes, each containing: 2x Opteron 6376 (16-core 2.3GHz), 32GB memory, 500GB hard drive (128 cores in total)

mai32ef: 4x Opteron 6376 (16-core 2.3GHz), 64 cores, 256GB memory, 4TB hard drive

mai64: 4x Opteron 6272 (16-core 2.1GHz), 64 cores, 64GB memory, 500GB hard drive

mai12: 2x Xeon X5650 (6-core 2.66GHz), 12 cores, 24GB memory, 60GB hard drive

mai24: 2x Opteron 6238 (12-core 2.6GHz), 24 cores, 16GB memory, 600GB RAID5 array

Tsubame2: supercomputer located at Tokyo Institute of Technology
The first six machines total 312 cores and are located at Kyoto University. They were purchased between 2011 and 2015 for a combined total of 3.9 million yen ($33,200).
3 Computational results
Table 2 contains the results obtained by running the five sequential codes on the problems described in Table 1. The times for lrs shown in Table 1 are included for comparison. The time limit was one week (604,800 seconds) except for cp6. Programs cddr+, lrs, ppl were used with no parameters.
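| Name | lrs secs | cddr+ secs | ppl secs | normaliz secs (hybrid) | normaliz secs (GMP) | porta secs (64-bit) | porta secs (extended) |
|---|---|---|---|---|---|---|---|
| bv7 | 8300 | * | 578 | 122 | 1030 | 315 | 310 |
| c3015 | 39 | 2991 | 3040 | ** | ** | ** | ** |
| c4020 | 9445 | * | * | ** | ** | ** | ** |
| fq4819 | 252 | 437 | 1355 | 41 | 300 | 5103 | 4561 |
| mit71 | 20688 | * | 260347 | 503564 | 364354 | 108993 | 107689 |
| mit | 496 | 368 | 40644 | 175 | 2174 | ** | 47478 |
| perm10 | 2230 | * | * | 1025 | 33240 | * | * |
| zfw91 | * | * | * | 189763 | * | 31348 | 30787 |
| cp6 | 1762156 § | 1463829 | 6570000 | 138162 | 1518785 | ** | 4925580 |

§ Computed by lrs v6.0.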
The program normaliz performs many additional functions, but was set to perform only vertex enumeration/facet enumeration for these tests. By default, it begins with 64-bit integer arithmetic and only switches to GMP arithmetic (used by all other programs except porta) in case of overflow. In this case, all work done with 64-bit arithmetic is discarded. For our test problems this happens on c3015, c4020 and mit71; however, the first two problems terminated abnormally after switching to GMP. Using the -B flag, normaliz will do all computations using GMP arithmetic. We give times for the default hybrid arithmetic and also for GMP-only arithmetic. Note that mit71 runs significantly faster with the -B flag, reflecting the time wasted in 64-bit arithmetic mode.
As mentioned above, porta supports arithmetic using either 64-bit integers or its own extended precision arithmetic package. The program terminates if overflow occurs. We tested both options on each problem and found the extended precision option outperformed the 64-bit option in all cases.
It is hard to draw many general conclusions from the results in Table 2, especially since the four double description implementations behaved remarkably differently on most of the problems. This could be due to the fact that this method is highly sensitive to the insertion order of the input and the codes may be using different orderings. One clear result was that none of the double description codes could solve the cyclic polytope problem c4020, and they struggled even on c3015. We also observed that the double description codes use substantial memory, especially normaliz. In fact the machines with 32GB or less of memory were not able to solve either mit71 or cp6 using these codes, and even in single processor mode most of the 128GB available on mai20 was required for some problems. Memory use by lrs/plrs/mplrs was negligible, making them good background processes. On the extremely degenerate problem cp6, lrs was in the middle of the pack, about 20% slower than cddr+, whereas normaliz was nearly 13 times faster than lrs. On this problem, neither ppl nor porta was able to produce any output in the time allotted (76 and 57 days respectively). Only porta and normaliz could effectively solve the sparse $0, \pm 1$ polytope zfw91. A 289-core run with mplrs1 was approximately 70 times slower than porta and 11 times slower than normaliz. Note that with about 151 million cobases/vertex cp6 is far more degenerate than zfw91, which has about 4 million cobases/vertex.
To put the above results in perspective, we recall that the problem mit was a big challenge in the early 1990s. At that time, early versions of both cddr+ and lrs took over a month to solve this problem. Combined hardware and software improvements over the years give speedups of over 5000 times; both codes now complete the job in less than 10 minutes. We will see that parallelization of lrs can lead to further dramatic reductions in running time: on our 312-core cluster the problem now requires only 12 seconds.
We move now to the three parallel codes. For mplrs and plrs we used the default settings (see User’s guide [2] for details):

plrs: -id 4

mplrs: -id 2 -lmin 3 -maxc 50 -scale 100 -maxbuf 500
For normaliz we used the default settings, which imply that hybrid arithmetic is used. Table 3 contains results for low scale parallelization and all problems were run on the single workstation mai20. With 4 cores available, plrs usually outperforms mplrs, they give similar performances with 8 cores, and mplrs is usually faster with 12 or more cores. With 16 cores mplrs gave quite consistent speedups, in the range 10 to 12.3. On the problems it could solve, the speedups obtained by normaliz show a much higher variance, in the range 0.93 to 15.7.
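Each cell gives secs / speedup.

| Name | 4 cores mplrs | 4 cores plrs | 4 cores normaliz | 8 cores mplrs | 8 cores plrs | 8 cores normaliz | 12 cores mplrs | 12 cores plrs | 12 cores normaliz | 16 cores mplrs | 16 cores plrs | 16 cores normaliz |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| bv7 | 5219 / 1.6 | 2399 / 3.5 | 43 / 1.4 | 1739 / 4.8 | 1213 / 6.9 | 23 / 2.6 | 1045 / 8.0 | 818 / 10.2 | 17 / 3.7 | 747 / 11.1 | 624 / 13.3 | 13 / 4.8 |
| c3015 | 28 / 1.4 | 17 / 2.4 | ** | 9 / 4.4 | 11 / 3.6 | ** | 6 / 6.7 | 9 / 4.4 | ** | 4 / 10 | 10 / 4 | ** |
| c4020 | 5979 / 1.6 | 3628 / 2.6 | ** | 2023 / 4.7 | 2564 / 3.7 | ** | 1219 / 7.7 | 2237 / 4.2 | ** | 873 / 10.8 | 2066 / 4.6 | ** |
| fq4819 | 146 / 1.7 | 99 / 2.6 | 18 / 2.3 | 49 / 5.2 | 52 / 4.7 | 13 / 3.2 | 30 / 8.3 | 36 / 7.0 | 12 / 3.4 | 21 / 12.0 | 29 / 12.7 | 12 / 3.4 |
| mit71 | 11386 / 1.8 | 6479 / 3.2 | 107482 / 4.7 | 3983 / 5.2 | 3320 / 6.2 | 65507 / 7.7 | 2390 / 8.6 | 2254 / 9.2 | 50910 / 9.9 | 1709 / 12.1 | 1724 / 12.0 | 42916 / 11.7 |
| mit | 293 / 1.7 | 152 / 3.3 | 70 / 2.5 | 99 / 5.0 | 89 / 5.6 | 42 / 4.2 | 61 / 8.1 | 68 / 7.3 | 33 / 5.3 | 44 / 11.3 | 57 / 8.7 | 29 / 6 |
| perm10 | 1422 / 1.6 | 709 / 3.2 | 1085 / 0.94 | 481 / 4.7 | 445 / 5.0 | 960 / 1.1 | 292 / 7.7 | 367 / 6.1 | 1090 / 0.94 | 215 / 10.4 | 320 / 7.0 | 1093 / 0.93 |
| zfw91 | * | * | 46741 / 4.1 | * | * | 23885 / 7.9 | * | * | 15975 / 11.9 | * | * | 12110 / 15.7 |
| cp6 | 968550 / 1.8 | 486667 / 3.6 | 42774 / 3.2 | 331235 / 5.3 | 268066 / 6.6 | 23493 / 5.9 | 199501 / 8.8 | 201792 / 8.7 | 18585 / 7.4 | 143006 / 12.3 | 169352 / 10.4 | 16980 / 8.1 |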
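The speedup entries for mplrs in Table 3 can be checked against the sequential lrs times. The sketch below assumes the speedup is the sequential lrs time from Table 1 divided by the parallel time, which is an inference from the tabulated values rather than a definition stated in the text; the dictionary names are hypothetical.

```python
# Sequential lrs times (Table 1) and 16-core mplrs times (Table 3), in seconds.
lrs_secs = {"bv7": 8300, "c4020": 9445, "mit71": 20688}
mplrs16_secs = {"bv7": 747, "c4020": 873, "mit71": 1709}

def speedup(sequential_secs, parallel_secs):
    """Ratio of sequential to parallel running time."""
    return sequential_secs / parallel_secs

# Speedup of mplrs on 16 cores for each of the three sample problems.
speedups = {
    name: round(speedup(lrs_secs[name], mplrs16_secs[name]), 1)
    for name in lrs_secs
}
```

Under that assumption the three sample rows reproduce the 16-core mplrs speedups listed in Table 3.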
Table 4 contains results for medium scale parallelization on the 64-core shared memory machine mai64. Note that these processors are considerably slower than mai20 on a per-core basis. We used 8, 16, 32 and 64 cores and speedups are measured by comparing with the running time on 8 cores. With 64 cores, mplrs was the clear winner over plrs with speedups ranging from 4.3 to 7.2. plrs showed little improvement after 32 cores and normaliz again had very large variance.
Each cell gives secs / speedup vs 8 cores.

| Name | 8 cores mplrs | 8 cores plrs | 8 cores normaliz | 16 cores mplrs | 16 cores plrs | 16 cores normaliz | 32 cores mplrs | 32 cores plrs | 32 cores normaliz | 64 cores mplrs | 64 cores plrs | 64 cores normaliz |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| bv7 | 3238 / 1 | 2255 / 1 | 60 / 1 | 1478 / 2.2 | 1212 / 1.9 | 39 / 1.5 | 1206 / 2.7 | 726 / 3.1 | 29 / 2.1 | 515 / 6.3 | 506 / 4.4 | 21 / 2.9 |
| c3015 | 17 / 1 | 22 / 1 | ** | 9 / 1.9 | 20 / 1.1 | ** | 5 / 3.4 | 22 / 1.0 | ** | 4 / 4.3 | 21 / 1.0 | ** |
| c4020 | 3882 / 1 | 4694 / 1 | ** | 1876 / 2.1 | 4163 / 1.1 | ** | 1141 / 3.4 | 4192 / 1.1 | ** | 717 / 5.4 | 4086 / 1.1 | ** |
| fq4819 | 89 / 1 | 95 / 1 | 28 / 1 | 42 / 2.1 | 57 / 1.7 | 24 / 1.2 | 23 / 3.9 | 39 / 2.4 | 22 / 1.3 | 14 / 6.4 | 31 / 3.1 | 23 / 1.2 |
| mit71 | 7395 / 1 | 6218 / 1 | 115088 / 1 | 3401 / 2.2 | 3441 / 1.8 | 77436 / 1.5 | 1900 / 3.9 | 2130 / 2.9 | 60694 / 1.9 | 1251 / 5.9 | 1640 / 3.8 | 51594 / 2.2 |
| mit | 195 / 1 | 175 / 1 | 111 / 1 | 93 / 2.1 | 123 / 1.4 | 83 / 1.3 | 53 / 3.7 | 120 / 1.5 | 75 / 1.5 | 42 / 4.6 | 124 / 1.4 | 82 / 1.4 |
| perm10 | 909 / 1 | 841 / 1 | 1951 / 1 | 432 / 2.1 | 617 / 1.4 | 1870 / 1.1 | 253 / 3.6 | 569 / 1.5 | 1840 / 1.1 | 171 / 5.3 | 573 / 1.5 | 1930 / 1 |
| zfw91 | * | * | 42409 / 1 | * | * | 24822 / 1.7 | * | * | 14452 / 2.9 | * | * | 7332 / 5.8 |
| cp6 ¶ | 727771 / 1 | 565915 / 1 | 38621 / 1 | 326214 / 2.2 | 377857 / 1.5 | 23773 / 1.6 | 171194 / 4.3 | 298408 / 1.9 | 17468 / 2.2 | 100676 / 7.2 | 229713 / 2.5 | 15480 / 2.5 |

¶ plrs times computed using v6.0.
Table 5 contains results for medium scale parallelization on a 312-core cluster of computers. Only mplrs is able to use all cores in this heterogeneous environment. The machines were scheduled in the order given at the end of Section 2 (excluding Tsubame2). Due to the heterogeneous selection of machines we do not present speedups in this table. For example, we observed that mai20 is substantially faster than the other machines, more than would be expected by simply comparing clock speeds and number of cores. It was more than twice as fast as mai12 on c4020. Jobs completing in under a minute do not profit much, if at all, as extra cores are added. However, the longer running jobs show continuous improvement. Excluding zfw91, lrs required about 3 weeks on the fastest machine (mai20) to complete the other 8 problems. Using the 312-core cluster this time is reduced to 4 hours and 40 minutes. These total times are dominated by cp6. Excluding this problem as well, the lrs total running time of 12 hours 13 minutes is improved to roughly 8 minutes using the cluster.
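| Name | mplrs 16 cores (secs) | mplrs 32 cores (secs) | mplrs 64 cores (secs) | mplrs 128 cores (secs) | mplrs 256 cores (secs) | mplrs 312 cores (secs) |
|---|---|---|---|---|---|---|
| bv7 | 747 | 389 | 262 | 179 | 101 | 88 |
| c3015 | 4 | 3 | 2 | 3 | 2 | 2 |
| c4020 | 873 | 528 | 328 | 218 | 133 | 121 |
| fq4819 | 21 | 11 | 7 | 5 | 4 | 5 |
| mit71 | 1709 | 956 | 625 | 421 | 228 | 199 |
| mit | 44 | 26 | 21 | 23 | 13 | 12 |
| perm10 | 215 | 118 | 89 | 75 | 53 | 55 |
| cp6 | 143006 | 75712 | 50225 | 33684 | 18657 | 16280 |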
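The cluster totals quoted for Table 5 follow from summing its last column. The sketch below redoes that arithmetic for the 312-core times and for the sequential lrs times of Table 1; the variable and function names are hypothetical.

```python
# Sequential lrs times on mai20 (Table 1) and mplrs times on 312 cores (Table 5).
lrs_secs = {"bv7": 8300, "c3015": 39, "c4020": 9445, "fq4819": 251,
            "mit71": 20688, "mit": 496, "perm10": 2230, "cp6": 1762156}
cluster_secs = {"bv7": 88, "c3015": 2, "c4020": 121, "fq4819": 5,
                "mit71": 199, "mit": 12, "perm10": 55, "cp6": 16280}

def hms(secs):
    """Split a number of seconds into (hours, minutes, seconds)."""
    hours, rem = divmod(secs, 3600)
    minutes, seconds = divmod(rem, 60)
    return hours, minutes, seconds

lrs_total_days = sum(lrs_secs.values()) / 86400          # about 3 weeks
cluster_total = hms(sum(cluster_secs.values()))          # about 4 h 40 min
cluster_without_cp6 = hms(sum(t for name, t in cluster_secs.items()
                              if name != "cp6"))         # about 8 min
```

The totals come to roughly 20.9 days for lrs and 4 hours 39 minutes for the cluster, or about 8 minutes on the cluster once cp6 is excluded.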
Table 6 shows results for large scale parallelization obtained by Kazuki Yoshizoe using mplrs v. 6.0 on the Tsubame2 supercomputer at the Tokyo Institute of Technology. He ran tests using problems mit71 and cp6 and observed near linear speedup between 12 and 1200 cores for both problems. (The 12-core cp6 benchmark was taken with mai12, which has a similar processor (Xeon X5650) to those we used on Tsubame2 (Xeon X5670).) With 1200 cores mplrs solved cp6 in about 42 minutes, nearly 600 times faster than cddr+, 55 times faster than normaliz in single processor mode and over 6 times faster than normaliz running on 64 cores, the largest shared memory machine available to us.
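mplrs (v. 5.1b); each cell gives secs / speedup vs 12 cores.

| Name | 12 cores | 36 cores | 72 cores | 144 cores | 300 cores | 600 cores | 1200 cores |
|---|---|---|---|---|---|---|---|
| cp6 | 283403 (mai12) / 1 | * | * | 20383 / 14 | 9782 / 29 | 4913 / 58 | 2487 / 114 |
| mit71 | 4207 / 1 | 1227 / 3.4 | 602 / 7.0 | 297 / 14 | 146 / 29 | 81 / 52 | 45 / 94 |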
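One way to quantify "near linear" is parallel efficiency: the observed speedup divided by the ideal speedup implied by the increase in core count. The sketch below applies this to the 12-core and 1200-core entries of Table 6. The efficiency measure is a common convention brought in for this example, not one used in the text, and the function name is hypothetical.

```python
def efficiency(base_cores, base_secs, cores, secs):
    """Observed speedup relative to the ideal speedup cores / base_cores."""
    observed = base_secs / secs
    ideal = cores / base_cores
    return observed / ideal

# 12-core and 1200-core running times from Table 6, in seconds.
mit71_eff = efficiency(12, 4207, 1200, 45)
cp6_eff = efficiency(12, 283403, 1200, 2487)
```

For mit71 the efficiency at 1200 cores is a little over 0.93. For cp6 it exceeds 1, which is consistent with the 12-core baseline having been measured on mai12 rather than on Tsubame2.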
4 Conclusions
These results show that the difficulty of solving vertex/facet enumeration problems varies enormously, even for inputs of roughly the same size. Any given problem may be tractable or intractable depending on the method used to solve it. General rules are dangerous and likely to be contradicted by further examples, but we hazard two: learn about your polytope and use multicore hardware.
4.1 Learn something about your polytope
Unfortunately not much can be learned by simply inspecting the input file. Many 0/1 input files are highly degenerate, but not all: perm10, for example, is a simple polytope. Fortunately the degeneracy of a polytope can be checked by doing a partial run of lrs for a few minutes, stopping after a certain number of bases have been computed. As seen from Table 1, the ratio of bases computed to V/H output gives a good estimate of the degeneracy of the problem. The partial run will also give an indication as to whether the output is binary (cp6), consists of small integers (bv7, perm10, zfw91), huge integers (c3015, c4020) or rationals (fq4819, mit, mit71). lrs also has an estimate feature that gives an unbiased estimate of the output size, number of bases and total lrs running time. These estimates have high variance but do give some indication of the tractability of the problem.
For problems with low degeneracy or very large output sizes, pivoting methods such as the lrs family may be the only tractable approach. For extremely degenerate problems with binary or small integer output it is not so clear, as can be seen by comparing the results obtained for cp6 and zfw91.
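The degeneracy indicator described above is just the ratio of bases to output size. The sketch below evaluates it for the H-representation problems of Table 1 using the completed runs; in practice the same ratio would be taken from a partial run. The names are hypothetical.

```python
# (lrs bases, V output size) for the H-representation problems of Table 1.
table1 = {
    "bv7": (84707280, 5040),
    "fq4819": (7843390, 119184),
    "mit71": (57613364, 3149579),
    "mit": (1375608, 4862),
    "perm10": (3628800, 3628800),
    "zfw91": (10819289888124, 2787415),
    "cp6": (4844923002, 32),
}

def bases_per_output(bases, output):
    """Average number of bases computed per vertex that is output."""
    return bases / output

ratios = {name: bases_per_output(*row) for name, row in table1.items()}
most_degenerate = max(ratios, key=ratios.get)
```

The simple polytope perm10 has a ratio of exactly 1, while cp6 and zfw91 have ratios of about 151 million and 4 million respectively.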
4.2 Use multicore hardware
Comparing Table 2 with the remaining tables clearly indicates the necessity of using parallel processing for hard vertex/facet enumeration problems: even just 16 cores gives an order of magnitude improvement. A supercomputer on the scale of Tsubame2 may seem out of reach for most researchers. However, at current prices, a 1200core cluster could be built for roughly $100,000 and would be considerably cheaper with used hardware. This price will certainly fall substantially in the near future making this amount of computing power readily available to more researchers. The problem will not be the availability of the hardware but the availability of software that can make effective use of it.
Acknowledgements
We thank Kazuki Yoshizoe for kindly allowing us to use the results of his Tsubame2 experiments and for helpful discussions concerning the MPI library which improved mplrs' performance. This work was partially supported by JSPS Kakenhi Grants 23700019 and 15H00847, Grant-in-Aid for Scientific Research on Innovative Areas, 'Exploring the Limits of Computation (ELC)'.
References
 [1] Assarf, B., Gawrilow, E., Herr, K., Joswig, M., Lorenz, B., Paffenholz, A., Rehn, T.: Computing convex hulls and counting integer points with polymake. Mathematical Programming Computation (to appear) pp. 1–38 (2016)
 [2] Avis, D.: (2013). http://cgm.cs.mcgill.ca/~avis/C/lrs.html
 [3] Avis, D., Fukuda, K.: A pivoting algorithm for convex hulls and vertex enumeration of arrangements and polyhedra. Discrete & Computational Geometry 8, 295–313 (1992)
 [4] Avis, D., Jordan, C.: mplrs: A scalable parallel vertex/facet enumeration code. CoRR abs/1511.06487 (2015). URL http://arxiv.org/abs/1511.06487

 [5] Avis, D., Roumanis, G.: A portable parallel implementation of the lrs vertex enumeration code. In: Combinatorial Optimization and Applications - 7th International Conference, COCOA 2013, Lecture Notes in Computer Science, vol. 8287, pp. 414–429. Springer (2013)
 [6] Bugseng.org: (2013). http://bugseng.com/products/ppl
 [7] Ceder, G., Garbulsky, G., Avis, D., Fukuda, K.: Ground states of a ternary fcc lattice model with nearest and next-nearest-neighbor interactions. Phys Rev B Condens Matter 49(1), 1–7 (1994)
 [8] Christof, T., Lobel, A.: (2009). http://typo.zib.de/optlong_projects/Software/Porta/
 [9] Fukuda, K.: (2012). http://www.inf.ethz.ch/personal/fukudak/cdd_home
 [10] Moran, B., Cohen, F., Wang, Z., Suvorova, S., Cochran, D., Taylor, T., Farrell, P., Howard, S.: Bounds on multiple sensor fusion. ACM Transactions on Sensor Networks 12(2) (2016)
 [11] Normaliz: (2015). http://www.home.uni-osnabrueck.de/wbruns/normaliz/
 [12] Ziegler, G.M.: Lectures on Polytopes, Graduate Texts in Mathematics, vol. 152. Springer (1995)