A Scalable Shared-Memory Parallel Simplex for Large-Scale Linear Programming

04/12/2018 ∙ by Demetrios Coutinho, et al.

We present a shared-memory parallel implementation of the Simplex tableau algorithm for dense, large-scale Linear Programming (LP) problems. We present the general scheme and explain each parallelization step of the standard simplex algorithm, emphasizing important solutions for performance bottlenecks. We analyzed the speedup and the parallel efficiency of the proposed implementation relative to the standard Simplex algorithm using a shared-memory system with 64 processing cores. The experiments were performed for several different problems, with up to 8192 variables and constraints, in their primal and dual formulations. The results show that performance is generally much better when we use the formulation with more variables than inequality constraints. They also show that the parallelization strategies applied to avoid bottlenecks allow the implementation to scale well with the problem size and the core count, up to a certain limit of problem size. Further analysis showed that this limit is an effect of resource limitation. Even so, our implementation was able to reach speedups on the order of 19x.


I Introduction

Parallel processing has arisen as an important lever for enabling high-performance scientific computing and sustained productivity. Since the mid 2000s, we have been facing a paradigm change in which computers are no longer produced with only one processing core. This tendency has been called the multicore era, detailed by Koch and Borkar [1, 2], whose central idea is the doubling of the number of processing cores in a single chip with each technology generation. Thus, it is important to validate parallel algorithms for scalable performance.

In this context, this work aims to improve the performance of Linear Programming (LP) solvers on multi-core systems. Solving large-scale LP problems may require more time than is affordable, given the speed at which data is generated nowadays. Implementations that exploit the resources of parallel systems are not only desirable to fill this gap but also necessary to make efficient use of the ever-increasing number of cores in modern processors.

The simplex method is widely used academically and commercially to solve LP problems. There have been many studies on parallel versions of the simplex method. Many of those have focused on the revised simplex method, due to its advantage in solving sparse problems and its effectiveness when the number of variables is much larger than the number of constraints [3]. However, the revised method is not well suited for parallelization [4]. Eckstein et al. [5] describe three parallel simplex implementations, including a revised method for dense LP problems on the Connection Machine CM-2. They showed that, on the CM-2, their implementation can yield faster execution times than a workstation of the same era and processing power. Thomadakis and Liu [6] worked on parallelizing the standard and dual simplex algorithms, comparing them on two versions of the MasPar (MP-1 and MP-2). Their results show that, as the problem size increases, the speedups obtained by the 1,024-processor MP-1 and the 16,384-processor MP-2 reach 100 times and 1,000 times, respectively, over the sequential standard simplex algorithm. Hall and McKinnon [7] studied an asynchronous variant of the revised simplex method on the Cray T3D machine. They present different ways of improving the performance of this algorithm using distinct T3D routines. They also discuss the potential of this variant for shared-memory processors, as opposed to the distributed parallel implementation on the Cray T3D.

In 2000, Maros and Mitra [8] presented a cooperative parallel version of the sparse simplex method for linear programs. They adopted a distributed-memory multiprocessor, the Parsytec CC12 computer with 12 processors and 64 MB of RAM per processor node. The objective was to reserve and dedicate each processor node to execute one of the chosen column selection strategies. They identified that a considerable improvement in efficiency can be achieved by changing the strategy depending on problem behavior.

Later, in 2009, Yarmish and Slyke [9] presented a scalable simplex implementation for large-scale linear programming problems using 7 workstations connected by Ethernet. Their parallel standard algorithm proved to be more efficient than the revised method. In the same year, Ploskas et al. [10] presented a parallel implementation of the standard simplex algorithm using a personal computer with two cores. Due to dense matrices and heavy communication, the ratio of computation to communication is extremely low. Their computational results show that a linear speedup is hard to achieve even with carefully selected partitioning patterns and communication optimization.

In 2013, a parallelization of the revised simplex method for large extensive forms of two-stage stochastic linear programming problems was proposed by Lubin et al. [11]. They developed the linear algebra techniques necessary to exploit the dual block-angular structure of an LP problem inside the revised simplex method, as well as a technique for applying product-form updates efficiently in a parallel context. Their approach was most effective on large instances that could be considered very difficult to solve using the revised simplex algorithm. In the same year, a parallel implementation of the revised simplex algorithm using OpenMP on a shared-memory multiprocessor architecture was proposed by Ploskas et al. [12]. It focused on reducing the time taken to perform the basis matrix inversion. Their computational results with a set of benchmark problems from Netlib (a repository of freely available software, documents, and databases of interest to the numerical and scientific computing communities, maintained by AT&T Bell Laboratories, the University of Tennessee and Oak Ridge National Laboratory; see http://www.netlib.org/) reported an average speedup of 1.79 on a quad-core computer.

A few years later, in 2015, Ploskas and Samaras [13] proposed two efficient GPU-based implementations: the revised simplex method and a primal-dual exterior point simplex algorithm. Both parallel algorithms were implemented in MATLAB using MATLAB's Parallel Computing Toolbox. They performed a computational study on large-scale randomly generated optimal sparse and dense LPs and found that both GPU-based algorithms outperformed MATLAB's interior point method. The GPU-based primal-dual exterior point simplex algorithm shows an average speedup of 2.3 over MATLAB's interior point method on a set of benchmark LP problems.

Remarkably, the major efforts to parallelize simplex-based solvers for LP problems have focused on the revised method, while strategies involving the standard method are scarcer. This is probably due to the fact that the revised method is more efficient for sparse LP problems [3, 9]. Nonetheless, dense LP problems do occur in a number of applications [5]. Hence, since the revised method is less suitable for parallelization [4], and since parallelism is the major means of performance increase nowadays, efforts toward the development of scalable versions of the standard simplex method are opportune.

In this paper, we present a scalable parallel implementation of the standard simplex algorithm, based on [14], for large-scale dense LP problems using OpenMP. We detail how we parallelize each step of the algorithm, presenting solutions for performance bottlenecks that resulted in better performance and efficiency compared to the previous version [14]. We analyzed the performance gains of our parallel implementation over the sequential version using a shared-memory computer with 64 processing cores. We measured the speedups for various numbers of variables and constraints for different problems. With that, we found out in which cases it is more advantageous to solve problems with more variables than constraints, or vice versa, which informs the choice between the primal and the dual formulation of the LP problem. The performance results show improvements of up to a factor of 19 for the proposed parallel implementation over the sequential standard simplex. The whole project is publicly available at https://gitlab.com/lappsufrn/MulticoreParallelSimplex.

This paper is organized as follows: in Section II, the standard Simplex algorithm is described; in Section III, we present the proposed parallel scheme and its implementation details; in Section IV, we show the scalability and performance results; and, finally, in the Conclusions, we present our final considerations about our main contributions.

II The Standard Simplex Algorithm

Linear Programming (LP) is the process of minimizing or maximizing a linear objective function subject to a number of linear equality and inequality constraints. Several methods are available for solving LP problems, among which the Simplex algorithm is the most widely used [13]. An LP problem can be modeled as follows.

(1)   maximize $z = c^{T} x$
subject to
      $A x \le b$, $x \ge 0$,

where $A \in \mathbb{R}^{m \times n}$ is the constraints matrix, $x \in \mathbb{R}^{n}$ is the decision variables vector, $b \in \mathbb{R}^{m}$ is the right-hand-side (RHS) constants vector, and $c^{T}$ is the objective function vector transposed, where $c \in \mathbb{R}^{n}$.

In this paper, we assume less-than inequality linear problems for test purposes. In addition, solving an LP problem using the Simplex method requires it to be in the standard form. Thus, besides the original variables, $A$ usually gains columns corresponding to slack variables, introduced to transform inequality into equality constraints. These extra variables are initially called basic variables, while the original ones are called nonbasic variables.
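For concreteness, a small example of this transformation (our own illustration, not taken from the paper): the problem

  maximize $3x_1 + 2x_2$ subject to $x_1 + x_2 \le 4$, $x_1, x_2 \ge 0$

becomes, after adding the slack variable $x_3 \ge 0$,

  maximize $3x_1 + 2x_2$ subject to $x_1 + x_2 + x_3 = 4$, $x_1, x_2, x_3 \ge 0$,

where $x_3$ is the initial basic variable and $x_1, x_2$ are the nonbasic ones.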

The following steps describe the Simplex procedure; a compact sequential sketch of these steps is given right after the list.

  1. Build the initial tableau: the Simplex data is placed in a tableau, which contains a row for each constraint, a row for the objective function (written as $-c^{T}$), columns for the basic (slack) variables with indexes in the range $n+1, \dots, n+m$, and columns for the nonbasic (original) variables with indexes in the range $1, \dots, n$. Table I presents the structure of the tableau.

            x_1    ...   x_n     x_{n+1}  x_{n+2}  ...  x_{n+m}   RHS
    a_{11}  ...  a_{1n}     1        0      ...      0        b_1
    a_{21}  ...  a_{2n}     0        1      ...      0        b_2
    ...
    a_{m1}  ...  a_{mn}     0        0      ...      1        b_m
    -c_1    ...  -c_n       0        0      ...      0        0
    TABLE I: Initial tableau for the standard Simplex
  2. Determine a nonbasic variable to become basic: if all reduced costs $\bar{c}_j \ge 0$ for $j \in N$, with $N$ being the index set of the nonbasic variables and $B$ the current basis, the optimal basic feasible solution has been found and the method must stop, i.e., the stopping criteria are met. Otherwise, choose a pivot column $k$ such that $\bar{c}_k < 0$. The column $k$ is obtained by finding the maximum absolute value of $\bar{c}_j$ over all $j \in N$ with $\bar{c}_j < 0$.

  3. Find a basic variable to leave the current basic solution: for each positive element $a_{ik} > 0$, $i = 1, \dots, m$, where $a_{ik}$ is the element in the $i$-th row of the pivot column $k$, calculate the ratio $b_i / a_{ik}$. Choose the pivot row $r$ corresponding to the smallest ratio. If no $a_{ik} > 0$ exists, then halt; the problem is unbounded.

  4. Pivot: divide the pivot row $r$ by $a_{rk}$, where $a_{rk}$ is the pivot coefficient found in the previous steps. That is, scale row $r$ so that the new value of $a_{rk}$ is 1, and then subtract multiples of row $r$ from all the other rows such that every other element of the pivot column becomes zero. Return to step 2.
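To make these steps concrete, the listing below gives a minimal sequential sketch of the tableau iteration in C++ (our own illustration, not the authors' code; the layout, with the objective row stored last and the RHS in the last column, is an assumption consistent with Table I).

#include <vector>

// Minimal sequential sketch of the standard simplex tableau iteration
// (illustrative only). T has m constraint rows plus the objective row,
// and n_total variable columns (original + slack) plus the RHS column.
using Tableau = std::vector<std::vector<double>>;

// Returns 0 on optimality and 1 if the problem is unbounded.
int simplex(Tableau& T, int m, int n_total) {
    const int obj = m;        // index of the objective row
    const int rhs = n_total;  // index of the RHS column
    while (true) {
        // Step 2: entering column = most negative reduced cost.
        int col = -1; double best = 0.0;
        for (int j = 0; j < n_total; ++j)
            if (T[obj][j] < 0.0 && -T[obj][j] > best) { best = -T[obj][j]; col = j; }
        if (col < 0) return 0;                         // optimal
        // Step 3: leaving row = minimum-ratio test over positive pivots.
        int row = -1; double min_ratio = 0.0;
        for (int i = 0; i < m; ++i)
            if (T[i][col] > 0.0) {
                double ratio = T[i][rhs] / T[i][col];
                if (row < 0 || ratio < min_ratio) { min_ratio = ratio; row = i; }
            }
        if (row < 0) return 1;                         // unbounded
        // Step 4: pivot. Scale the pivot row, then eliminate the pivot
        // column from every other row, including the objective row.
        const double piv = T[row][col];
        for (int j = 0; j <= rhs; ++j) T[row][j] /= piv;
        for (int i = 0; i <= m; ++i) {
            if (i == row) continue;
            const double f = T[i][col];
            for (int j = 0; j <= rhs; ++j) T[i][j] -= f * T[row][j];
        }
    }
}

Each pass of the while loop corresponds to one simplex iteration; the parallel scheme of Section III distributes exactly these column scans, ratio tests and row updates among the threads.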

III Parallel Scheme and Implementation

Fig. 1: Flowchart of the proposed Simplex algorithm.

The proposed parallel scheme is fundamentally related to the standard simplex scheme presented in Section II. A few modifications to the original scheme were necessary to enable the code optimizations aimed at resolving performance bottlenecks such as synchronization and load balancing. The resulting scheme contains 6 steps instead of 4. Most of the steps in the proposed parallel scheme retained the objectives of the original steps. Mainly, steps 2 and 4 of the original scheme had significant changes in their objectives, and their work is jointly covered in the new scheme by steps II, IV, V and VI. Fig. 1 presents an illustration of the proposed parallel scheme and its control flow. The parallel scheme was implemented in C++ using the OpenMP [15] programming model. Algorithm 1 presents the code skeleton of the scheme, containing the main OpenMP parallelization directives used in each step of the algorithm.

Step I of the scheme assembles the initial simplex tableau with all constraints and variables. Then, the threads are created and remain active until the end of the execution, in order to avoid the extra overhead of repeatedly creating and destroying threads.

Fig. 1 and Algorithm 1 show how step II was modified to require less synchronization. Step II runs before the main loop structure (lines 3 to 6). Each thread works locally on a set of columns to find the index of the column holding the maximum absolute value among the negative elements in the objective row. This value is then reduced to a global variable by the reduction(max:max) clause (line 3). The schedule(guided,chunk) clause (line 3) assigns the iterations to the threads in the team dynamically, in chunks that decrease in size as the iteration count progresses; each thread executes a chunk of iterations and then requests another chunk, until no chunks remain to be assigned. The implicit barrier of the for directive (line 3) is necessary to ensure that this step completes before the global column index is used in the next step. Using the single directive modified by the nowait clause (line 7), a single thread updates the control data before asynchronously joining the other threads in the next step.
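For illustration, the fragment below sketches this column selection as a user-declared OpenMP (4.0 or later) reduction over a value/index pair. It is shown as a self-contained parallel loop for readability, whereas in the actual scheme the loop lives inside the single long-lived parallel region of Algorithm 1; all names (ColMax, select_pivot_column, T, chunk) are ours, not the authors'.

#include <vector>

// Value/index pair used to reduce a column argmax across threads.
struct ColMax { double val; int idx; };

// Combiner: keep the candidate with the larger absolute reduced cost.
#pragma omp declare reduction(colmax : ColMax : \
    omp_out = (omp_in.val > omp_out.val ? omp_in : omp_out)) \
    initializer(omp_priv = ColMax{0.0, -1})

// Step II sketch: returns the pivot column, or -1 if all reduced costs are >= 0.
int select_pivot_column(const std::vector<std::vector<double>>& T,
                        int obj_row, int n_cols, int chunk) {
    ColMax best{0.0, -1};
    #pragma omp parallel for schedule(guided, chunk) reduction(colmax : best)
    for (int j = 0; j < n_cols; ++j) {
        double v = T[obj_row][j];            // reduced cost of column j
        if (v < 0.0 && -v > best.val) {      // negative cost => candidate
            best.val = -v;                   // keep the largest magnitude
            best.idx = j;
        }
    }
    return best.idx;
}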

Step III has a structure similar to step II, except that the threads work with rows instead of columns. The loop selects the row with the minimum ratio, as explained in Section II, which is then globally reduced by the reduction(min:min) clause (line 10). An implicit barrier synchronizes the global row among the threads (line 10). Another barrier (line 16), placed after the single directive (line 14) and before step IV, is necessary to ensure that all threads access the correct pivot row and column. The number of negative values in the pivot column is counted locally and then reduced to a global count by the reduction(+:count) clause (line 10). Using the single directive modified by the nowait clause (line 14), a single thread verifies whether the variable count equals the number of constraints, which is the condition for an unbounded solution.

Step IV updates the pivot row in the same way as the original scheme described in Section II. The implicit barrier at the end of the for directive (line 17) ensures that this step completes before the values in this row are used in the next step.

Step V updates the remaining constraint rows, as described in Section II, in a loop without the implicit barrier of the for directive, by using the nowait clause (line 21). A barrier is not necessary because the next step is independent of this one. The GCC ivdep pragma (line 24) allows the inner loop to be vectorized, asserting that there are no loop-carried dependencies that would prevent consecutive iterations of the following loop from executing concurrently with SIMD (single instruction, multiple data) instructions.
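A possible body for this step is sketched below, as it would appear inside the long-lived parallel region of Algorithm 1 (the names T, m, n_cols, prow and pcol are ours, and the pivot row prow is assumed to have already been scaled in step IV).

// Step V sketch: eliminate the pivot column from all non-pivot constraint
// rows. The nowait clause removes the barrier, since step VI only touches
// the objective row and therefore does not depend on these updates.
#pragma omp for nowait
for (int i = 0; i < m; ++i) {
    if (i == prow) continue;          // pivot row already handled in step IV
    const double f = T[i][pcol];      // elimination multiplier for row i
    #pragma GCC ivdep                 // assert: no loop-carried dependence
    for (int j = 0; j < n_cols; ++j)
        T[i][j] -= f * T[prow][j];    // rank-1 update; T[i][pcol] becomes 0
}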

Step VI is a new step aimed at better performance and less synchronization. It has the same structure as step II, but it performs three operations in the same loop: first it updates the objective row, then it selects the maximum absolute value among the negative elements of the objective row, and finally it counts the negative values of the objective row. A barrier ensures the counting is correct before the stopping criterion is assessed. If there are negative values in the objective row, the stopping criterion is not reached and the main loop restarts at step III in order to select the row with the minimum ratio. Before the loop restarts, a single directive (line 33) with an implicit barrier ensures that the global control variables are reset, preventing steps III and VI from using stale values. The number of negative values is counted locally and then reduced to a global count by the reduction(+:count) clause (line 27).

 1  #pragma omp parallel <data scope definitions>
 2  {
 3      #pragma omp for schedule(guided,chunk) reduction(max:max) // Step II starts here.
 4      for (j = 0; j <= column_size; j++) {
 5          /* Select the column with the minimum negative value in the objective row. */
 6      }
 7      #pragma omp single nowait
 8      /* Update variables. */
 9      do {
10          #pragma omp for reduction(+:count) reduction(min:min) schedule(guided,chunk) // Step III starts here.
11          for (i = 0; i < row_size; i++) {
12              /* Select the row with the minimum ratio. */
13          }
14          #pragma omp single nowait
15          /* Verify the condition of an unbounded solution. */
16          #pragma omp barrier
17          #pragma omp for // Step IV starts here.
18          for (j = 0; j <= column_size; j++) {
19              /* Update the pivot row. */
20          }
21          #pragma omp for nowait // Step V starts here.
22          for (i = 0; i < row_size; i++) {
23              /* Update the remaining constraint rows. */
24              #pragma GCC ivdep
25              for (j = 0; j <= column_size; j++) { /* Subtract a multiple of the pivot row from row i. */ }
26          }
27          #pragma omp for reduction(+:count) reduction(max:max) schedule(guided,chunk) // Step VI starts here.
28          for (j = 0; j <= column_size; j++) {
29              /* Update the objective row. */
30              /* Select the maximum absolute value among the negative elements in the objective row. */
31              /* Count negative values in the objective row (count). */
32          }
33          #pragma omp single
34          /* Update variables. */
35      } while (count > 0);
36  }
Algorithm 1: OpenMP code skeleton of the parallel scheme.


IV Experimental Results and Discussion

In this section, we present numerical results on various randomly generated problems. The problems are generated using the following Octave (GNU Octave is a high-level language primarily intended for numerical computations) code [16]:

function [lpp, f, A, b] = gen_rand_deme(m, n, d)
    c = zeros(n, 1);
    A = [];
    b = [];
    while c >= 0
        pl = inline('(abs(x)+x)/2');
        A = sprand(m, n, d);
        A = A * 50;
        x = sparse(10 * pl(rand(n, 1)));
        u = spdiags((sign(pl(rand(m, 1) - rand(m, 1)))), 0, m, m) * (rand(m, 1) - rand(m, 1));
        b = A * x;
        c = A' * u;   % assumed completion of a truncated line (u is otherwise unused)
    end
    f = c';
    A = full(A);
    b = full(b);
    lpp = [A eye(m) b; f zeros(1, m+1)];
end
Algorithm 2: Generates a random solvable LP problem.

The program of Algorithm 2 generates a random matrix A for a given number of constraints (m), number of variables (n) and degree of matrix density d, along with the vector b. The parameter d can vary between 0 and 1, i.e., from sparse to very dense. We generated very dense matrices (d = 0.9), that is, almost all elements are nonzero. The elements of A are uniformly distributed between 0 and 50. The dimension of each problem was defined as a combination of the values 256, 512, 1024, 2048, 4096 and 8192. With these 6 possible dimensions for the number of variables and the number of constraints, we generated 3 different problem instances for each combination, thereby yielding 108 (6 × 6 × 3) problems. For example, the problem with 256 variables and 512 constraints has 3 different instances. All problems were generated assuming less-than inequality constraints for maximization. Therefore, slack variables were always added to the initial tableau, so that the actual number of variables equals the number of initial variables plus the number of constraints; hence, we always have more variables than constraints. Throughout this section, however, we name the problems by the number of constraints and variables of the original problem.

All experiments were performed on a server with four AMD Opteron 6376 processors with 16 cores each, totaling 64 cores, each running at 2.3 GHz, with 256 GB of DDR3 RAM, 768 KB L1 and 16 MB L2 individual caches per core, and 16 MB of L3 cache per socket, running Ubuntu 16.04.2 LTS. The source code was compiled using GNU GCC 6.3.0. The experiments involved the proposed parallel algorithm with 2, 4, 8, 16, 32 and 64 threads, and our serial implementation of the standard simplex described in Section II. We bound the threads to specific cores using the environment variable GOMP_CPU_AFFINITY.

Table II shows the median execution time in seconds over 10 runs of the 3 problems of the same size, for all problem sizes analyzed. The first column gives the dimension of the problem; the remaining columns hold the execution times of the proposed parallel simplex implementation with 2, 4, 8, 16, 32 and 64 threads.

Proposed implementation with number of threads equal to
Problem 2 4 8 16 32 64
256x256 0.036 0.027 0.021 0.031 0.092 0.301
256x512 0.056 0.038 0.028 0.031 0.091 0.338
256x1024 0.086 0.052 0.038 0.038 0.082 0.286
256x2048 0.358 0.169 0.120 0.098 0.156 0.473
256x4096 0.438 0.221 0.174 0.089 0.119 0.356
256x8192 1.351 0.627 0.547 0.347 0.291 0.532
512x256 0.021 0.014 0.010 0.009 0.020 0.064
512x512 0.084 0.046 0.031 0.026 0.045 0.168
512x1024 0.775 0.451 0.348 0.231 0.349 0.863
512x2048 1.155 0.662 0.589 0.332 0.342 0.891
512x4096 3.103 1.277 1.057 0.674 0.530 0.967
512x8192 5.759 2.838 2.275 1.172 0.972 0.878
1024x256 0.327 0.168 0.131 0.060 0.071 0.236
1024x512 1.684 1.048 0.856 0.468 0.490 1.113
1024x1024 2.163 1.019 0.882 0.508 0.463 0.865
1024x2048 1.227 0.519 0.467 0.277 0.173 0.327
1024x4096 38.935 20.207 15.588 6.817 3.944 4.347
1024x8192 90.958 63.478 40.201 45.997 46.334 12.040
2048x256 8.171 4.505 3.661 1.511 0.937 1.269
2048x512 0.799 0.541 0.421 0.244 0.114 0.128
2048x1024 15.935 8.810 8.703 7.249 2.680 1.742
2048x2048 59.883 38.809 28.364 31.859 26.620 6.708
2048x4096 39.007 24.344 20.444 19.423 22.391 6.058
2048x8192 911.651 519.929 482.603 405.369 332.439 268.802
4096x256 54.202 33.635 24.085 24.954 25.742 17.898
4096x512 46.711 31.147 23.529 22.518 23.134 19.016
4096x1024 197.108 133.184 92.854 79.553 91.360 67.271
4096x2048 0.528 0.435 0.377 0.343 0.323 0.305
4096x4096 1382.832 888.342 799.383 592.892 622.866 373.573
4096x8192 1071.709 671.992 528.183 398.582 442.300 266.595
8192x256 274.244 150.246 119.035 104.966 104.110 65.755
8192x512 18.183 11.740 10.290 9.970 10.635 9.000
8192x1024 126.048 72.757 60.564 57.594 53.484 36.189
8192x2048 52.146 28.557 26.544 24.463 25.285 16.443
8192x4096 944.021 544.355 485.721 385.115 344.792 201.071
8192x8192 25208.584 16344.648 13628.544 11215.784 9979.548 5458.975
TABLE II: Time in seconds of proposed implementation.

IV-A Speedup

In this section, we discuss the speedup performance of the proposed algorithm. The speedup is a measure of how much faster the parallel code runs compared to the sequential code. Due to the excessive amount of time needed to run the sequential simplex algorithm, for all speedup and parallel efficiency calculations we assumed that the sequential execution time was twice the execution time of the case with 2 threads. By that, we are assuming a linear speedup for 2 threads and accepting in our results any potential inaccuracy arising from that assumption.
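In other words, denoting by $T(p)$ the measured median time with $p$ threads, the reported metrics are

  $S(p) = \dfrac{2\,T(2)}{T(p)}, \qquad E(p) = \dfrac{S(p)}{p},$

so that $S(2) = 2$ and $E(2) = 1$ by construction.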

Fig. 2(a) shows speedups when we fix the number of constraints at 256. Notice that the speedup starts at two, because of our assumption that the serial time is twice the time for two threads. Observe in the figure that, as the speedup increases and then drops for each problem, the drop occurs at larger thread counts for larger problems. This is an indication that the performance loss is due to the overhead of adding more threads, which does not pay off for smaller problems. In addition, from 32 threads on, the largest speedups were those of the problems with the largest number of variables. This may indicate that the parallel portion of the code grows with the number of variables. Fig. 2(b) shows speedups when the number of variables is fixed at 256. The same pattern of Fig. 2(a) can be observed, although the speedup drops happen at larger numbers of threads.

(a) Problems with 256 constraints.
(b) Problems with 256 variables.
Fig. 2: Speedups of the proposed implementation fixing one of the problem dimensions at 256, for the primal (a) and dual (b) formulations.

Fig. 3(a) shows speedups fixing the number of constraints at 512. All problems with fewer than 4096 variables have a drop at 32 threads. The drop for the second-largest problem happens at 64 threads, while no drop was observed for the largest problem. At 64 threads, we can see that the larger the number of variables, the larger the speedup, indicating that increasing the number of variables improves speedup performance. Similarly, the patterns just described also appear in Fig. 3(b), where the number of variables is fixed at 512. Here, too, the speedup drops happen at larger numbers of threads. Notice that, when the problems are formulated this way, all but one of the problems have the number of variables increased by the number of constraints to accommodate the slack variables. This causes the problems to grow faster in size when the number of constraints increases than in the other formulation when the number of variables increases. For this reason, the speedup drop caused by the extra threads happens at larger numbers of threads in this formulation. The exceptions are the two largest problems, which experience neither the same speedup increase as the two smallest problems nor the same speedup drop as all the other problems. Being much larger than the other problems, their speedup is probably limited by the insufficient amount of cache memory. This also explains their expressive speedup increase when going from 32 to 64 threads, when the number of L3 caches used in the computation goes from 2 to 4.

(a) Problems with 512 constraints.
(b) Problems with 512 variables.
Fig. 3: Speedups of the proposed implementation fixing one of the problem dimensions at 512, for the primal (a) and dual (b) formulations.

The patterns of Figs. 2 and 3 repeat throughout Figs. 4 to 7. These patterns are characterized by a moderate increase in speedup followed by a drop as the number of threads gets larger. Most likely, the speedup drops have two causes. First, for smaller problems, increasing the number of threads causes the threading overhead to become more significant than the gains in concurrency. For larger problems, however, the drop cannot be attributed to the same cause, because the workload assigned to each thread dilutes the extra-thread overhead. The second cause can thus be attributed to insufficient cache memory to store the simplex tableau. The strongest evidence for that in some plots of Figs. 2 to 7 is the sudden speedup rise when going from 32 to 64 threads, when the number of L3 caches goes from 2 to 4. Note that this rise in the speedups happens even when the drop due to threading overhead was already observed at smaller numbers of threads.

(a) Problems with 1024 constraints.
(b) Problems with 1024 variables.
Fig. 4: Speedups of the proposed implementation fixing one of the problem dimensions at 1024, for the primal (a) and dual (b) formulations.
(a) Problems with 2048 constraints.
(b) Problems with 2048 variables.
Fig. 5: Speedups of the proposed implementation fixing one of the problem dimensions at 2048, for the primal (a) and dual (b) formulations.
(a) Problems with 4096 constraints.
(b) Problems with 4096 variables.
Fig. 6: Speedups of the proposed implementation fixing one of the problem dimensions at 4096, for the primal (a) and dual (b) formulations.
(a) Problems with 8192 constraints.
(b) Problems with 8192 variables.
Fig. 7: Speedups of the proposed implementation fixing one of the problem dimensions at 8192, for the primal (a) and dual (b) formulations.

IV-B Parallel Efficiency and the Limits of Cache

In this subsection, we present the plots of parallel efficiency and compute the limits on the problem sizes that make efficient use of the cache. Parallel efficiency is a normalized measure of speedup: it indicates how effectively each processing element is used. Figs. 8 to 13 present the parallel efficiencies of all problems and their primal/dual counterparts. In these normalized versions of the speedups, the effect of the cache-size limit is even clearer than in Figs. 2 to 7. Throughout the figures, increasing the dimensions of the problem causes an efficiency rise followed by a drop. This effect becomes predominant at lower numbers of variables and constraints as we move from Fig. 8 to Fig. 13.

(a) Problems with 256 constraints.
(b) Problems with 256 variables.
Fig. 8: Parallel efficiency of the proposed implementation fixing one of the problem dimensions at 256, for the primal (a) and dual (b) formulations.
(a) Problems with 512 constraints.
(b) Problems with 512 variables.
Fig. 9: Parallel efficiency of the proposed implementation fixing one of the problem dimensions at 512, for the primal (a) and dual (b) formulations.
(a) Problems with 1024 constraints.
(b) Problems with 1024 variables.
Fig. 10: Parallel efficiency of the proposed implementation fixing one of the problem dimensions at 1024, for the primal (a) and dual (b) formulations.
(a) Problems with 2048 constraints.
(b) Problems with 2048 variables.
Fig. 11: Parallel efficiency of the proposed implementation fixing one of the problem dimensions at 2048, for the primal (a) and dual (b) formulations.
(a) Problems with 4096 constraints.
(b) Problems with 4096 variables.
Fig. 12: Parallel efficiency of the proposed implementation fixing one of the problem dimensions at 4096, for the primal (a) and dual (b) formulations.
(a) Problems with 8192 constraints.
(b) Problems with 8192 variables.
Fig. 13: Parallel efficiency of the proposed implementation fixing one of the problem dimensions at 8192, for the primal (a) and dual (b) formulations.

In Fig. 14, the effect of the cache-size limit can be clearly observed. The figure shows the efficiency of the proposed parallel algorithm with 64 threads for varying numbers of constraints and variables.

Fig. 14: Parallel Efficiency for 64 threads. The red line delimits the problem sizes that reach approximately the full use of all L3 caches.

As the problem size increases, we observe the efficiency scaling up and then dropping, regardless of whether the increase is due to more variables or more constraints. There is not enough cache space to accommodate larger problems, causing data to be fetched more frequently from the main memory. This limit can be estimated by calculating the number of variables and constraints that fit in the four L3 caches. The total number of bytes used by the LP problem cannot exceed 64 MB, which is the total size of the four L3 caches. Considering double floating-point precision, where each value requires 8 bytes of storage, it follows that

(2)   $8\, m\, (n + m) \le 64 \times 2^{20}$

where $m$ is the number of constraints and $n$ the number of variables; the tableau holds roughly $m$ rows of $n + m$ double-precision values. Solving for $m$, we find the maximum number of constraints relative to a given number of variables that avoids the cache resource limitation, as follows.

(3)   $m \le \dfrac{-n + \sqrt{n^{2} + 2^{25}}}{2}$

Replacing $n$ by the numbers of variables used in the problem dimensions, i.e., 256, 512, 1024, 2048, 4096 and 8192, (3) gives as maximum numbers of constraints 2771, 2651, 2429, 2048, 1499 and 1499, respectively. For example, for 1024 variables, the number of constraints cannot exceed 2429. The red line in Fig. 14 shows the boundary of (2), which roughly estimates the limit on the numbers of variables and constraints that fit in the L3 caches.
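As a sanity check of (3) (our own arithmetic), substituting $n = 1024$ gives

  $m \le \dfrac{-1024 + \sqrt{1024^{2} + 2^{25}}}{2} = \dfrac{-1024 + \sqrt{34603008}}{2} \approx \dfrac{-1024 + 5882.4}{2} \approx 2429,$

which matches the value reported above.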

Noticeably, Fig. 14 indicates that the algorithm is scalable up to a specific problem size. Scalable problems sustain efficiency as the problem size increases. Past that limit, the proposed parallel implementation fails to sustain efficiency for larger problems due to the cache resource limitation.

V Conclusions

This paper presented a parallel implementation of the standard Simplex algorithm on multicore processors for solving large-scale LP problems. We described the general scheme of the parallelization, explaining each step of the standard simplex algorithm and detailing important optimization steps of our implementation. We compared the speedups of the proposed parallel algorithm against the standard simplex implementation, using up to 64 threads on a shared-memory machine with 64 cores. The proposed parallel simplex algorithm demonstrated scalable performance for various combinations of variables and constraints. We have shown that the implementation is weakly scalable up to the limits of efficient cache use. We have derived the equations that describe this limit and verified it experimentally.

Most probably, this limit to the scalability can be improved by using techniques that maximize cache locality, such as blocking and polyhedral transformations, which we believe are worthwhile investigating in the future.

References

  • [1] G. Koch, “Discovering multi-core: Extending the benefits of Moore's law,” Intel Corporation, Technical Report, Jul. 2005.
  • [2] S. Borkar, “Thousand core chips-a technology perspective,” 44th ACM/IEEE Design Automation Conference, 2007.
  • [3] G. Yarmish and R. V. Slyke, “RetroLP, an implementation of the standard simplex method,” Department of Computer and Information Science, Tech. Rep., 2003.
  • [4] J. A. J. Hall, “Towards a practical parallelisation of the simplex method,” Computational Management Science, 2007.
  • [5] J. Eckstein, I. Boduroglu, L. C. Polymenakos, and D. Goldfarb, “Data-parallel implementations of dense simplex methods on the connection machine cm-2,” 1995.
  • [6] M. E. Thomadakis and J.-C. Liu, Eds., An Efficient Steepest-Edge Simplex Algorithm for SIMD Computers.   Department of Computer Science, Texas, 1996.
  • [7] J. A. J. Hall and K. McKinnon, “Asynplex, an asynchronous parallel revised simplex algorithm,” APMOD95 Conference, 1998.
  • [8] I. Maros and G. Mitra, “Investigating the sparse simplex algorithm on a distributed memory multiprocessor,” Parallel Computing, vol. 26, no. August 1997, pp. 151–170, 2000.
  • [9] G. Yarmish and R. V. Slyke, “A distributed, scaleable simplex method,” J Supercomput, vol. 49, pp. 373–381, 2009.
  • [10] N. Ploskas, N. Samaras, and A. Sifaleras, Eds., A parallel implementation of an exterior point algorithm for linear programming problems.   University of Macedonia, 2009.
  • [11] M. Lubin, J. A. J. Hall, C. G. Petra, and M. Anitescu, “Parallel distributed-memory simplex for large-scale stochastic LP problems,” Computational Optimization and Applications, vol. 55, no. 3, pp. 571–596, 2013. [Online]. Available: http://link.springer.com/10.1007/s10589-013-9542-y
  • [12] N. Ploskas, N. Samaras, and K. Margaritis, “A Parallel Implementation of the Revised Simplex Algorithm Using OpenMP: Some Preliminary Results,” Springer Proceedings in Mathematics and Statistics, vol. 31, pp. 15–38, 2013. [Online]. Available: http://www.scopus.com/inward/record.url?eid=2-s2.0-84883436297{&}partnerID=tZOtx3y1
  • [13] N. Ploskas and N. Samaras, “Efficient GPU-based implementations of simplex type algorithms,” Applied Mathematics and Computation, vol. 250, pp. 552–570, 2015. [Online]. Available: http://linkinghub.elsevier.com/retrieve/pii/S0096300314014714
  • [14] D. Coutinho and S. Souza, “16 a 19,” in Simpósio Brasileiro de Pesquisa Operacional, 2013, pp. 1332–1343.
  • [15] OpenMP, “The OpenMP API specification for parallel programming,” May 2013. [Online]. Available: http://openmp.org/wp/
  • [16] S. Ketabchi, H. Moosaei, H. Sahleh, and M. Hedayati, “New Methods for Solving Large Scale Linear Programming Problems in the Windows and Linux computer operating systems,” vol. 1556, no. 4, pp. 1553–1556, 2012. [Online]. Available: http://arxiv.org/abs/1209.4308
  • [18] J. L. Gustafson, “Reevaluating amdahl’s law,” Commun. ACM, vol. 31, no. 5, pp. 532–533, May 1988. [Online]. Available: http://doi.acm.org/10.1145/42411.42415