I Introduction
Parallel processing has emerged as an important lever for enabling high-performance scientific computing and sustained productivity. Since the mid-2000s, we have been facing a paradigm change in which computers are no longer produced with only one processing core. This trend has been called the Multicore Era, detailed by Koch and Borkar [1, 2], whose idea includes the principle of doubling the number of processing cores in a single chip with each technology generation. Thus, it is important to validate parallel algorithms for scalable performance.
In this context, this work aims to improve the performance of Linear Programming (LP) solvers on multicore systems. Solving large-scale LP problems may require more time than is affordable, given the speed at which data is generated nowadays. Implementations that exploit the resources of parallel systems are not only desirable to fill this gap but also necessary to make efficient use of the ever-increasing number of cores in modern processors.
The simplex method is widely used academically and commercially to solve LP problems, and there have been many studies on parallel versions of it. Many of those have focused on the revised simplex method, due to its advantage in solving sparse problems and its effectiveness when the number of variables is much larger than the number of constraints [3]. However, the revised method is not well suited for parallelization [4]. Eckstein et al. [5] describe three parallel simplex implementations, including a revised method for dense LP problems on the Connection Machine CM-2. They showed that, on the CM-2, their implementation can yield faster execution times than a workstation of the same era and processing power. Thomadakis and Liu [6] worked on parallelizing the standard and dual simplex algorithms, comparing them on two versions of the MasPar (MP-1 and MP-2). Their results show that, as the problem size increases, the speedup obtained by the 1,024-processor MP-1 and the 16,384-processor MP-2 reaches 100 times and 1,000 times, respectively, over the sequential standard simplex algorithm. Hall and McKinnon [7] studied an asynchronous variant of the revised simplex method on the Cray T3D machine. They present different ways of improving the performance of this algorithm using distinct T3D routines. They also explain the potential of this variant for shared-memory processors, as opposed to the distributed parallel implementation on the Cray T3D.
In 2000, Maros and Mitra [8] presented a cooperative parallel version of the sparse simplex method for linear programs. They adopted a distributed-memory multiprocessor, the Parsytec CC-12 computer, with 12 processors and 64 MB of RAM per processor node. The objective was to reserve and dedicate each processor node to execute one of the chosen column selection strategies. They identified that a considerable improvement in efficiency can be achieved by changing the strategy depending on problem behavior.
Later, in 2009, Yarmish and Slyke [9] presented a scalable simplex implementation for large-scale linear programming problems using 7 workstations connected by Ethernet. Their parallel standard algorithm proved to be more efficient than the revised method. In the same year, Ploskas et al. [10] presented a parallel implementation of the standard simplex algorithm on a personal computer with two cores. Due to dense matrices and heavy communication, the ratio of computation to communication was extremely low. Their computational results show that a linear speedup is hard to achieve even with carefully selected partitioning patterns and communication optimization.
In 2013, a parallelization of the revised simplex method for large extensive forms of two-stage stochastic linear programming problems was proposed by Lubin et al. [11]. They developed the linear algebra techniques necessary to exploit the dual block-angular structure of an LP problem inside the revised simplex method, as well as a technique for applying product-form updates efficiently in a parallel context. Their approach was most effective on large instances that could be considered very difficult to solve with the revised simplex algorithm. In the same year, a parallel implementation of the revised simplex algorithm using OpenMP on a shared-memory multiprocessor architecture was proposed by Ploskas [12], focusing on reducing the time taken to compute the basis matrix inverse. Their computational results with a set of benchmark problems from Netlib (a repository of freely available software, documents, and databases of interest to the scientific computing and numerical communities, maintained by AT&T Bell Laboratories, the University of Tennessee, and Oak Ridge National Laboratory; see http://www.netlib.org/) reported an average speedup of 1.79 on a quad-core computer.
A few years later, in 2015, Ploskas and Samaras [13] proposed two efficient GPU-based implementations of the revised method and of a primal-dual exterior point simplex algorithm. Both parallel algorithms were implemented in MATLAB using MATLAB's Parallel Computing Toolbox. They performed a computational study on large-scale, randomly generated optimal sparse and dense LPs and found that both GPU-based algorithms outperformed MATLAB's interior point method. The GPU-based primal-dual exterior point simplex algorithm showed an average speedup of 2.3 over MATLAB's interior point method on a set of benchmark LP problems.
Remarkably, the major efforts to parallelize simplex-based solvers for LP problems have focused on the revised method, while strategies involving the standard method are scarcer. This is probably due to the fact that the revised method is more efficient for sparse LP problems [3, 9]. Nonetheless, dense LP problems do occur in a number of applications [5]. Hence, since the revised method is less suitable for parallelization [4], and parallelism is the major means of performance increase nowadays, efforts toward the development of scalable versions of the standard simplex method are opportune.

In this paper, we present a scalable parallel implementation of the standard simplex algorithm, based on [14], for large-scale dense LP problems using OpenMP. We detail how we parallelize each step of the algorithm, revealing some solutions to performance bottlenecks that resulted in better performance and efficiency when compared to the previous version [14]. We analyzed the performance gains of our parallel implementation over the sequential version using a shared-memory computer with 64 processing cores. We measured the speedups for various numbers of variables and constraints for different problems. With that, we found out in which cases it may be more advantageous to solve problems with more variables than constraints, or vice versa, which informs a better choice between the primal and the dual formulation of the LP problem. The performance results show an improvement factor of up to 19 for the proposed parallel implementation over the sequential standard simplex. The whole project is publicly available at https://gitlab.com/lappsufrn/MulticoreParallelSimplex.
This paper is organized as follows: in Section II, the standard simplex algorithm is described; in Section III, we present the proposed parallel scheme and its implementation details; in Section IV, we show the scalability and performance results; finally, in the Conclusion, we present our final considerations about our main contributions.
II The Standard Simplex Algorithm
Linear Programming (LP) is the process of minimizing or maximizing a linear objective function subject to a number of linear equality and inequality constraints. Several methods are available for solving LP problems, among which the simplex algorithm is the most widely used [13]. An LP problem can be modeled in the standard form as follows.
max z = c^T x    (1)

subject to Ax ≤ b, x ≥ 0,

where A ∈ R^{m×n} is the constraints matrix, x ∈ R^n is the decision-variables vector, b ∈ R^m is the right-hand-side (RHS) constants vector, and c^T is the transposed objective-function vector, where c ∈ R^n. In this paper, we assume less-than inequality linear problems for test purposes. Solving an LP problem using the simplex method requires it to be in the standard form. Thus, besides the original variables, A usually contains columns corresponding to slack variables, introduced to transform inequality into equality constraints. These extra variables are initially called basic variables, while the original ones are called nonbasic variables.
The following steps describe the simplex procedure.

Step 1. Build the initial tableau: the simplex data is placed in a tableau, which contains a row for the objective function (written as z), columns for the basic variables with indexes in the range n+1, …, n+m, and columns for the nonbasic variables with indexes in the range 1, …, n. Table I presents the structured tableau.
TABLE I: Initial tableau for the standard simplex

      x_1   …   x_n    x_{n+1}  x_{n+2}  …  x_{n+m}   RHS
      a_11  …   a_1n   1        0        …  0         b_1
      a_21  …   a_2n   0        1        …  0         b_2
      ⋮          ⋮     ⋮        ⋮           ⋮         ⋮
      a_m1  …   a_mn   0        0        …  1         b_m
  z   -c_1  …   -c_n   0        0        …  0         0
Step 2. Determine a nonbasic variable to become basic: if all reduced costs c_j ≥ 0 for j ∈ N, with N being the index set of the nonbasic variables given the current basis, the optimal basic feasible solution has been found and the method must stop, i.e., the stopping criteria are met. Otherwise, choose a pivot column k such that c_k < 0. The method used to obtain k is to find the maximum absolute value of c_j among the negative coefficients, for all j ∈ N.

Step 3. Find a basic variable to leave the current basic solution: for each positive element a_{ik}, i = 1, …, m, where a_{ik} is the element in the i-th row of the pivot column k, calculate the ratio b_i / a_{ik}. Choose the pivot row r that corresponds to the smallest ratio. If a_{ik} ≤ 0 for all i, then halt; the problem is unbounded.

Step 4. Pivot: divide the pivot row by a_{rk}, where a_{rk} is the pivot coefficient found in the previous steps. That is, scale row r so that the new value of a_{rk} is 1, and then subtract multiples of row r from all the other rows so that every other element of the pivot column becomes zero. Return to Step 2.
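The four steps above can be condensed into a short sequential sketch over a dense tableau. This is our illustrative reconstruction, not the paper's implementation; the names (`Tableau`, `simplex_solve`) are ours. Row 0 holds the objective row and the last column holds the RHS.

```cpp
#include <cmath>
#include <vector>

// Dense simplex tableau, stored row-major: rows = m + 1, cols = n + m + 1.
struct Tableau {
    int rows, cols;
    std::vector<double> a;
    double& at(int i, int j) { return a[i * cols + j]; }
};

enum class Status { Optimal, Unbounded };

Status simplex_solve(Tableau& t) {
    const int rhs = t.cols - 1;
    for (;;) {
        // Step 2: entering column = negative objective coefficient with
        // the largest absolute value.
        int k = -1;
        double best = 0.0;
        for (int j = 0; j < rhs; ++j)
            if (t.at(0, j) < 0.0 && std::fabs(t.at(0, j)) > best) {
                best = std::fabs(t.at(0, j));
                k = j;
            }
        if (k < 0) return Status::Optimal;      // no negative cost: done

        // Step 3: leaving row by the minimum-ratio test.
        int r = -1;
        double min_ratio = 0.0;
        for (int i = 1; i < t.rows; ++i)
            if (t.at(i, k) > 0.0) {
                double ratio = t.at(i, rhs) / t.at(i, k);
                if (r < 0 || ratio < min_ratio) { min_ratio = ratio; r = i; }
            }
        if (r < 0) return Status::Unbounded;    // no positive pivot element

        // Step 4: scale the pivot row, then eliminate column k elsewhere.
        double piv = t.at(r, k);
        for (int j = 0; j < t.cols; ++j) t.at(r, j) /= piv;
        for (int i = 0; i < t.rows; ++i) {
            if (i == r) continue;
            double f = t.at(i, k);
            for (int j = 0; j < t.cols; ++j) t.at(i, j) -= f * t.at(r, j);
        }
    }
}
```

After termination, the RHS entry of the objective row holds the optimal value of z.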
III Parallel Scheme and Implementation
The proposed parallel scheme is fundamentally related to the standard simplex scheme presented in Section II. A few modifications to the original scheme were necessary to enable the code optimizations aimed at performance bottlenecks such as synchronization and load balancing. The resulting scheme contains 6 steps instead of 4. Most of the steps in the proposed parallel scheme retain the same objectives as the original steps. Mainly, the original steps 2 and 4 had significant changes in their objectives, which are jointly covered in the new scheme by steps II, IV, V, and VI. Fig. 1 presents an illustration of the proposed parallel scheme containing its control flow. The parallel scheme was implemented in C++ using the OpenMP [15] programming model. Algorithm 1 presents the code skeleton of the scheme with the main OpenMP parallelization directives used in each step of the algorithm.
Step I of the scheme assembles the initial simplex table with all constraints and variables. Then, the threads are created and remain active until the end of the execution in order to avoid extra overhead of creating and destroying threads.
Fig. 1 and Algorithm 1 show how step II was modified to require less synchronization. Step II runs before the main loop structure (lines 3 to 6). Each thread works locally on a set of columns to find the index of the column holding the maximum absolute value among the negative elements of the objective row. This index is then reduced to a global variable by the reduction(maximum:max) clause (line 3). The schedule(guided,chunk) clause (line 3) assigns the iterations to the threads in the team dynamically, in chunks that decrease in size as the iteration number progresses. Each thread executes a chunk of iterations and then requests another chunk until none remain. The implicit barrier of the for directive (line 3) is necessary to ensure that this step completes before the global column index is used in the next step. Using the single directive modified by the nowait clause (line 7), a single thread updates control data before asynchronously joining the other threads in the next step.
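One way to realize the reduction described above in standard OpenMP is a user-declared reduction over a (value, index) pair; the paper's Algorithm 1 writes the clause as reduction(maximum:max), and the sketch below is our hedged interpretation of it, with names of our own choosing. Compiled without OpenMP, the pragmas are ignored and the loop runs serially with the same result.

```cpp
#include <cmath>

// (value, index) pair carried through the reduction.
struct MaxCol { double val; int idx; };

// User-declared "maximum" reduction: keep the pair with the larger value.
#pragma omp declare reduction(maximum : MaxCol : \
        omp_out = (omp_in.val > omp_out.val ? omp_in : omp_out)) \
        initializer(omp_priv = MaxCol{0.0, -1})

// Scan the objective row `obj` of length `n` for the negative coefficient
// with the largest absolute value; returns -1 when none exists (optimal).
int select_pivot_column(const double* obj, int n, int chunk) {
    MaxCol max{0.0, -1};
    #pragma omp parallel for schedule(guided, chunk) reduction(maximum : max)
    for (int j = 0; j < n; ++j) {
        if (obj[j] < 0.0 && std::fabs(obj[j]) > max.val)
            max = MaxCol{std::fabs(obj[j]), j};
    }
    return max.idx;  // the for loop's implicit barrier synchronizes here
}
```

The guided schedule keeps load balanced when the per-column work is uneven, at lower scheduling overhead than a purely dynamic schedule.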
Step III has a structure similar to step II, except that the threads work with rows instead of columns. The loop selects the row with the minimum ratio, as explained in Section II, which is then globally reduced by the reduction(minimum:min) clause (line 10). An implicit barrier synchronizes the global row among the threads (line 10). Another barrier, before step IV and after the single directive (line 14), is necessary to ensure that all threads access the correct pivot row and column. The number of negative values in the pivot column is counted locally and then reduced to a global count by the reduction(+:count) clause (line 10). Using the single directive modified by the nowait clause (line 14), a single thread verifies whether the variable count equals the number of constraints, which is the condition for an unbounded solution.
Step IV updates the pivot row in the same way as the original scheme described in Section II. The implicit barrier at the end of the for directive (line 17) ensures that this step completes before the values in this row are used in the next step.
Step V updates the remaining constraint rows, as described in Section II, in a loop without the for directive's implicit barrier, by using the nowait clause (line 21). A barrier is not necessary because the next step is independent of this one. The GCC ivdep pragma (line 24) allows the loop to be vectorized by asserting that there are no loop-carried dependencies that would prevent consecutive iterations of the loop from executing concurrently with SIMD (single instruction, multiple data) instructions.
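The step-V update can be sketched as follows. This is our reconstruction, not the paper's exact code: an orphaned worksharing loop (meant to run inside the surrounding parallel region) distributes the constraint rows, exits without a barrier via nowait, and marks the inner elimination loop with GCC's ivdep pragma for vectorization. The pivot row is assumed to be already scaled, and the objective row (row 0) is left to step VI.

```cpp
// Eliminate the pivot column from every constraint row except the pivot
// row itself; `tab` is the row-major tableau of size rows x cols.
void update_rows(double* tab, int rows, int cols,
                 int pivot_row, int pivot_col) {
    #pragma omp for schedule(guided) nowait
    for (int i = 0; i < rows; ++i) {
        if (i == 0 || i == pivot_row) continue;  // objective and pivot rows
        double factor = tab[i * cols + pivot_col];
        const double* pr = &tab[pivot_row * cols];
        #pragma GCC ivdep    // promise: no loop-carried dependencies
        for (int j = 0; j < cols; ++j)
            tab[i * cols + j] -= factor * pr[j];
    }
}
```

Without the ivdep promise the compiler would have to assume `tab` and `pr` might alias across iterations and could refuse to emit SIMD code.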
Step VI is a new step focused on better performance and synchronization. It has the same structure as step II, but performs three operations in the same loop: first it updates the objective row, then it selects the maximum absolute value among the negative elements of the objective row, and finally it counts the negative values of the objective row. The number of negative values is counted locally and then reduced to a global count by the reduction(+:count) clause (line 27). A barrier ensures the count is correct before the stopping criterion is assessed. If there are negative values in the objective row, the stopping criterion is not met and the main loop restarts at step III in order to select the row with the minimum ratio. Before restarting the loop, a single directive (line 33) with an implicit barrier ensures that the global control variables are reset, preventing steps III and VI from reading stale values.
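The fusion of the three operations can be sketched in a single pass, shown serially below for clarity. This is our illustration with our own names; in the full scheme the loop would carry the OpenMP reduction clauses described above (the maximum reduction for the column index and reduction(+:count) for the counter). Here `cols` counts only the cost columns, so the objective RHS is excluded from the negativity test.

```cpp
// Result of one fused pass over the objective row.
struct ObjScan {
    double maxabs;   // largest |c_j| among negative entries
    int col;         // its column index, -1 if none
    int negatives;   // count of negative entries (stopping test)
};

// Apply the pivot elimination to the objective row, while simultaneously
// selecting the next pivot column and counting remaining negatives.
ObjScan update_and_scan_objective(double* obj, const double* pivot_row,
                                  int cols, int pivot_col) {
    double factor = obj[pivot_col];
    ObjScan s{0.0, -1, 0};
    for (int j = 0; j < cols; ++j) {
        obj[j] -= factor * pivot_row[j];
        if (obj[j] < 0.0) {
            ++s.negatives;
            if (-obj[j] > s.maxabs) { s.maxabs = -obj[j]; s.col = j; }
        }
    }
    return s;
}
```

Fusing the update, selection, and count saves two extra sweeps over the objective row, and more importantly removes the barriers that separate passes would require.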
IV Experimental Results and Discussion
In this section we present numerical results on various randomly generated problems. The problems are generated using the following GNU Octave (a high-level language primarily intended for numerical computations) code [16]:
The program of Algorithm 2 generates a random matrix A for a given number of constraints (m), number of variables (n), and matrix density d, together with the vector b. The parameter d can vary from 0 to 1, i.e., from sparse to very dense. We generated very dense matrices (d = 0.9), i.e., almost all elements are nonzero. The elements of A are uniformly distributed between 0 and 50. The dimension of each problem was defined as a combination of the values 256, 512, 1024, 2048, 4096, and 8192. With these 6 possible dimensions for the number of variables and the number of constraints, we generated 3 different problem instances for each combination, thereby yielding 108 (6 × 6 × 3) problems. For example, a problem with 256 variables and 512 constraints comes in 3 different instances. All problems were generated assuming less-than inequality constraints for maximization. Therefore, slack variables were always added to the initial tableau, so that the actual number of variables was equal to the number of original variables plus the number of constraints. Hence, we always have more variables than constraints. Throughout this section, however, we name the problems by the numbers of constraints and variables of the original problem.

All experiments were performed on a server with four AMD Opteron 6376 processors of 16 cores each, totaling 64 cores, each running at 2.3 GHz, with 256 GB of DDR3 RAM, 768 KB L1 and 16 MB L2 individual caches per core, and 16 MB of L3 cache per socket, running Ubuntu 16.04.2 LTS. The source codes were compiled using GNU GCC 6.3.0. The experiments involved the proposed parallel algorithm with 2, 4, 8, 16, 32, and 64 threads, and our serial implementation of the standard simplex described in Section II. We bound the threads to specific cores using the environment variable GOMP_CPU_AFFINITY.
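For reference, the generation procedure described above can be sketched in C++ as well. This is our analogue of the Octave generator cited as [16], not the code itself; since the cited snippet for b is not reproduced here, we simply make b strictly positive (an assumption of ours) so that the origin is a feasible starting point.

```cpp
#include <random>
#include <vector>

// A dense random LP instance: row-major m x n matrix A and RHS vector b.
struct Problem {
    std::vector<double> A, b;
    int m, n;
};

// Generate an instance with density d (fraction of nonzeros) and entries
// uniformly distributed in [0, 50], mirroring the description in the text.
Problem generate(int m, int n, double d, unsigned seed) {
    std::mt19937 gen(seed);
    std::uniform_real_distribution<double> value(0.0, 50.0);
    std::uniform_real_distribution<double> keep(0.0, 1.0);
    Problem p{std::vector<double>(static_cast<size_t>(m) * n, 0.0),
              std::vector<double>(m, 0.0), m, n};
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < n; ++j)
            if (keep(gen) < d)                // keep this entry with prob. d
                p.A[static_cast<size_t>(i) * n + j] = value(gen);
        p.b[i] = value(gen) + 1.0;            // strictly positive RHS (assumption)
    }
    return p;
}
```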
Table II shows the median execution time in seconds, over 10 runs of the 3 problems of the same size, for all problems and methods analyzed. The first column gives the dimension of the problem; the remaining columns hold the execution times of the proposed parallel simplex implementation with 2, 4, 8, 16, 32, and 64 threads.
Proposed implementation with number of threads equal to
Problem  2  4  8  16  32  64 
256x256  0.036  0.027  0.021  0.031  0.092  0.301 
256x512  0.056  0.038  0.028  0.031  0.091  0.338 
256x1024  0.086  0.052  0.038  0.038  0.082  0.286 
256x2048  0.358  0.169  0.120  0.098  0.156  0.473 
256x4096  0.438  0.221  0.174  0.089  0.119  0.356 
256x8192  1.351  0.627  0.547  0.347  0.291  0.532 
512x256  0.021  0.014  0.010  0.009  0.020  0.064 
512x512  0.084  0.046  0.031  0.026  0.045  0.168 
512x1024  0.775  0.451  0.348  0.231  0.349  0.863 
512x2048  1.155  0.662  0.589  0.332  0.342  0.891 
512x4096  3.103  1.277  1.057  0.674  0.530  0.967 
512x8192  5.759  2.838  2.275  1.172  0.972  0.878 
1024x256  0.327  0.168  0.131  0.060  0.071  0.236 
1024x512  1.684  1.048  0.856  0.468  0.490  1.113 
1024x1024  2.163  1.019  0.882  0.508  0.463  0.865 
1024x2048  1.227  0.519  0.467  0.277  0.173  0.327 
1024x4096  38.935  20.207  15.588  6.817  3.944  4.347 
1024x8192  90.958  63.478  40.201  45.997  46.334  12.040 
2048x256  8.171  4.505  3.661  1.511  0.937  1.269 
2048x512  0.799  0.541  0.421  0.244  0.114  0.128 
2048x1024  15.935  8.810  8.703  7.249  2.680  1.742 
2048x2048  59.883  38.809  28.364  31.859  26.620  6.708 
2048x4096  39.007  24.344  20.444  19.423  22.391  6.058 
2048x8192  911.651  519.929  482.603  405.369  332.439  268.802 
4096x256  54.202  33.635  24.085  24.954  25.742  17.898 
4096x512  46.711  31.147  23.529  22.518  23.134  19.016 
4096x1024  197.108  133.184  92.854  79.553  91.360  67.271 
4096x2048  0.528  0.435  0.377  0.343  0.323  0.305 
4096x4096  1382.832  888.342  799.383  592.892  622.866  373.573 
4096x8192  1071.709  671.992  528.183  398.582  442.300  266.595 
8192x256  274.244  150.246  119.035  104.966  104.110  65.755 
8192x512  18.183  11.740  10.290  9.970  10.635  9.000 
8192x1024  126.048  72.757  60.564  57.594  53.484  36.189 
8192x2048  52.146  28.557  26.544  24.463  25.285  16.443 
8192x4096  944.021  544.355  485.721  385.115  344.792  201.071 
8192x8192  25208.584  16344.648  13628.544  11215.784  9979.548  5458.975 
IV-A Speedup
In this section, we discuss the speedup performance of the proposed algorithm. The speedup is a measure of how much faster the parallel code runs compared to the sequential code. Due to the excessive time required to run the sequential simplex algorithm, for all speedup and parallel efficiency calculations we assumed that the sequential execution time was twice the execution time of the case with 2 threads. In doing so, we are assuming a linear speedup for 2 threads and accepting in our results any potential inaccuracy arising from that assumption.
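Two small helpers (ours, for illustration) make the resulting definitions explicit: under the assumption above, the sequential time is taken as 2·T_2, so speedup on p threads is 2·T_2 / T_p and parallel efficiency is that speedup divided by p.

```cpp
// Speedup under the paper's assumption T_seq = 2 * T_2, where t2 is the
// 2-thread time and tp the time measured on p threads.
double speedup(double t2, double tp) {
    return (2.0 * t2) / tp;
}

// Parallel efficiency: speedup normalized by the number of threads p.
double efficiency(double t2, double tp, int p) {
    return speedup(t2, tp) / p;
}
```

For example, for the 512x8192 row of Table II (T_2 = 5.759 s, T_64 = 0.878 s), this gives a speedup of about 13.1 and an efficiency of about 0.20 on 64 threads.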
Fig. 2(a) shows speedups when we fix the number of constraints at 256. Notice that the speedup starts at two because of our assumption that the serial time is twice the time for two threads. Observe in the figure that, while the speedup increases and then drops for each problem, the drop occurs at larger numbers of threads for larger problems. This indicates that the performance loss is due to the overhead of adding more threads, which does not pay off for smaller problems. In addition, from 32 threads on, the largest speedups were those of the problems with the largest numbers of variables. This may indicate that the parallel portion of the code grows with the number of variables. Fig. 2(b) shows speedups when the number of variables is fixed at 256. The same pattern of Fig. 2(a) can be observed, although the speedup drops happen at larger numbers of threads.
Fig. 3(a) shows speedups fixing the number of constraints at 512. All problems with fewer than 4096 variables have a drop at 32 threads. The drop for the second largest happens at 64 threads, while no drop was observed for the largest problem. At 64 threads, we can see that the larger the number of variables, the larger the speedup, indicating that increasing the number of variables improves speedup performance. Similarly, the same patterns of Fig. 2(b) appear in Fig. 3(b), where the number of variables is fixed at 512. Here, too, the speedup drops happen at larger numbers of threads. Notice that, formulating the problems this way, all problems but one have the number of variables increased by the number of constraints to accommodate the slack variables. This causes the problems to grow faster in size when the number of constraints increases than in the other formulation, where the number of variables increases. For this reason, the speedup drop caused by the extra threads happens at larger numbers of threads in this formulation, except for the two largest problems, which experience neither the same speedup increase as the two smaller problems nor the same speedup drop as all the other problems. Being much larger than the other problems, their speedup is probably limited by the insufficient amount of cache memory. This also explains their expressive speedup increase when going from 32 to 64 threads, given that the number of L3 caches used in the computation goes from 2 to 4.
The patterns of Figs. 2 and 3 repeat throughout Figs. 4 to 7. These patterns are characterized by a moderate increase in speedup followed by a drop as the number of threads gets larger. Most likely, the speedup drops have two causes. First, for smaller problems, increasing the number of threads causes the threading overhead to become more significant than the gains in concurrency. For larger problems, however, the drop cannot be attributed to the same cause, because the workload assigned to each thread dilutes the extra threading overhead. The second cause can thus be attributed to insufficient cache memory to store the simplex tableau. The strongest evidence for this, in some plots of Figs. 2 to 7, is the sudden speedup rise when going from 32 to 64 threads, when the number of L3 caches in use goes from 2 to 4. Note that this rise in the speedups happens even when the drop due to threading overhead had already been observed at smaller numbers of threads.
IV-B Parallel Efficiency and the Limits of Cache
In this subsection, we present the plots for parallel efficiency and compute the limits on problem sizes that make efficient use of cache. Parallel efficiency is a normalized measure of speedup: it indicates how effectively each processing element is used. Figs. 8-13 present the parallel efficiencies of all problems and their primal/dual counterparts. In these normalized versions of the speedups, the effect of cache-size limits is even clearer than in Figs. 2-7. Throughout the figures, increasing the dimensions of the problem causes an efficiency rise followed by a drop. This effect becomes predominant at lower numbers of variables and constraints as we move from Fig. 8 to Fig. 13.
In Fig. 14, the effect of cachesize limits can be clearly observed. The figure draws the efficiency of the proposed parallel algorithm with 64 threads for varying constraints and variables.
As problem sizes increase, we observe the efficiency scaling up and then dropping, regardless of whether the increase is due to more variables or more constraints. There is not enough cache space to accommodate larger problems, causing data to be fetched more frequently from main memory. This limit can be estimated by calculating the number of variables and constraints that fit in the four L3 caches. The total number of bytes used by the LP problem cannot exceed 64 MB, which is the total size of the four L3 caches. Considering double floating-point precision, where each value requires 8 bytes of storage, it follows that:

8 m (n + m) ≤ 64 × 2^20,    (2)

where m is the number of constraints and n the number of variables. Solving for m, we find the maximum number of constraints, relative to a given number of variables, that avoids the cache resource limitation, as follows:

m ≤ (−n + √(n² + 2^25)) / 2.    (3)

Replacing n by the numbers of variables used in the problem dimensions, i.e., 256, 512, 1024, 2048, 4096, and 8192, (3) gives as maximum numbers of constraints 2771, 2651, 2429, 2048, 1499, and 920, respectively. For example, for 1024 variables, the number of constraints cannot exceed 2429. The red line (Fig. 14) shows the function described by (2), which roughly estimates the limit on the number of variables and constraints that fit in the L3 caches.
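The bound in (3) can be checked with a few lines of code; this helper (ours) floors the positive root of the quadratic m(n + m) = 2^23, i.e., the number of 8-byte tableau entries that fit in 64 MB of L3.

```cpp
#include <cmath>

// Maximum number of constraints m such that the tableau of m constraints
// and n variables, at 8 bytes per entry, fits in 64 MB of combined L3:
// 8 * m * (n + m) <= 64 * 2^20, i.e. m * (n + m) <= 2^23.
int max_constraints(int n) {
    const double cap = 64.0 * 1024 * 1024 / 8.0;   // 2^23 tableau entries
    double nn = static_cast<double>(n);
    return static_cast<int>((-nn + std::sqrt(nn * nn + 4.0 * cap)) / 2.0);
}
```

Evaluating it at the six problem dimensions reproduces the limits quoted in the text.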
Noticeably, Fig. 14 indicates that the algorithm is scalable up to a specific problem size. Scalable problems sustain efficiency as the problem size increases; past this limit, the proposed parallel implementation fails to sustain efficiency for larger problems due to the cache resource limitation.
V Conclusions
This paper presents a parallel implementation of the standard simplex algorithm on multicore processors for solving large-scale LP problems. We described the general scheme of the parallelization, explaining each step of the standard simplex algorithm and detailing important optimization steps of our implementation. We compared the speedups of the proposed parallel algorithm against the sequential standard simplex implementation, using up to 64 threads on a shared-memory machine with 64 cores. The proposed parallel simplex algorithm demonstrated scalable performance for various combinations of variables and constraints. We have shown that the implementation is weakly scalable up to the limits of efficient use of the caches. We have derived the equations that describe this limit and verified it experimentally.
Most probably, this scalability limit can be pushed further by using techniques that maximize cache locality, such as blocking and polyhedral transformations, which we believe are worth investigating in the future.
References
 [1] G. Koch, “Discovering multicore: Extending the benefits of Moore's law,” Intel Corporation, Technical Report, Jul. 2005.
 [2] S. Borkar, “Thousand core chips: a technology perspective,” 44th ACM/IEEE Design Automation Conference, 2007.
 [3] G. Yarmish and R. V. Slyke, “RetroLP, an implementation of the standard simplex method,” Department of Computer and Information Science, Tech. Rep., 2003.
 [4] J. A. J. Hall, “Towards a practical parallelisation of the simplex method,” Computational Management Science, 2007.
 [5] J. Eckstein, I. Boduroglu, L. C. Polymenakos, and D. Goldfarb, “Data-parallel implementations of dense simplex methods on the Connection Machine CM-2,” 1995.
 [6] M. E. Thomadakis and J.-C. Liu, Eds., An Efficient Steepest-Edge Simplex Algorithm for SIMD Computers. Department of Computer Science, Texas, 1996.
 [7] J. A. J. Hall and K. McKinnon, “Asynplex, an asynchronous parallel revised simplex algorithm,” APMOD95 Conference, 1998.
 [8] I. Maros and G. Mitra, “Investigating the sparse simplex algorithm on a distributed memory multiprocessor,” Parallel Computing, vol. 26, pp. 151–170, 2000.
 [9] G. Yarmish and R. V. Slyke, “A distributed, scaleable simplex method,” J. Supercomput., vol. 49, pp. 373–381, 2009.
 [10] N. Ploskas, N. Samaras, and A. Sifaleras, Eds., A parallel implementation of an exterior point algorithm for linear programming problems. University of Macedonia, 2009.
 [11] M. Lubin, J. A. J. Hall, C. G. Petra, and M. Anitescu, “Parallel distributedmemory simplex for largescale stochastic LP problems,” Computational Optimization and Applications, vol. 55, no. 3, pp. 571–596, 2013. [Online]. Available: http://link.springer.com/10.1007/s105890139542y
 [12] N. Ploskas, N. Samaras, and K. Margaritis, “A Parallel Implementation of the Revised Simplex Algorithm Using OpenMP: Some Preliminary Results,” Springer Proceedings in Mathematics and Statistics, vol. 31, pp. 15–38, 2013. [Online]. Available: http://www.scopus.com/inward/record.url?eid=2s2.084883436297{&}partnerID=tZOtx3y1
 [13] N. Ploskas and N. Samaras, “Efficient GPUbased implementations of simplex type algorithms,” Applied Mathematics and Computation, vol. 250, pp. 552–570, 2015. [Online]. Available: http://linkinghub.elsevier.com/retrieve/pii/S0096300314014714
 [14] D. Coutinho and S. Souza, “16 a 19,” in Simpósio Brasileiro de Pesquisa Operacional, 2013, pp. 1332–1343.
 [15] OpenMP, “The OpenMP API specification for parallel programming,” May 2013. [Online]. Available: http://openmp.org/wp/
 [16] S. Ketabchi, H. Moosaei, H. Sahleh, and M. Hedayati, “New Methods for Solving Large Scale Linear Programming Problems in the Windows and Linux computer operating systems,” vol. 1556, no. 4, pp. 1553–1556, 2012. [Online]. Available: http://arxiv.org/abs/1209.4308
 [18] J. L. Gustafson, “Reevaluating amdahl’s law,” Commun. ACM, vol. 31, no. 5, pp. 532–533, May 1988. [Online]. Available: http://doi.acm.org/10.1145/42411.42415