## 1 Introduction

Traditional CPU computations are now increasingly being carried out on CPU-GPU heterogeneous systems, by offloading data parallel tasks to a GPU for accelerating performance. Some of the application domains where GPUs have been used for performance acceleration include medical image processing ^{1, 2}, weather research and forecasting (WRF) ^{3}, Proteomics (to speed-up peptide spectrum matching ^{4}), signal processing for radio astronomy ^{5}, simulation of various physical and mechanical systems (using variants of Monte Carlo algorithm) ^{6, 7} and large scale graph processing ^{8}. However, drawing performance from a GPU requires insights on its architecture. Some of the common challenges of computing in a heterogeneous architecture are load balancing, efficient memory access and an effective mapping of computations in the SIMD paradigm of computing.

Linear Programming is a method of maximizing or minimizing a linear objective function subject to a set of linear constraints. Linear programs (LPs) appear extensively in a large number of application domains such as business process modeling to maximize profit, economics to design optimized demand-supply model ^{9}, transport assignment ^{10}, job scheduling ^{11} and packets routing in computer networks, to name just some. Our work is motivated from applications that require solving a large number of small-size independent LPs. We propose a hybrid CPU-GPU LP solver which can solve a batch of many independent problems simultaneously. Our work assumes the setting that computations begin in a CPU where LPs are created, batched and then offloaded to a GPU for an accelerated solution. The solutions are transferred back to the CPU from the GPU for further processing. As an example, we show an application of our batched LP solver in model-based design of control systems. In model-based design, state-space exploration is a standard technique to analyze models of control systems. State of the art methods and tools for state-space exploration, heavily rely on solving many independent LPs ^{12, 13}. Moreover, the LPs are generally of small size (the size of an LP is the number of constraints number of variables). We show that using our batched LP solver, the state-space exploration tools can improve the performance significantly.

Prior work on solving an LP on a GPU and on multi-GPU architectures are many. The focus of almost all such works has been on methods to improve the performance of algorithms to solve one LP at a time. Performance gain is reported generally when offloading large LPs of size (500 constraints, 500 variables) and above ^{14, 15, 16, 17, 18, 19}. It is seen that for small size LPs, the time spent in offloading the problems from CPU to GPU memory is more than the time gained with parallel execution in the GPU. Our work in this paper target applications that involves solving small to medium size LPs, but many of them. Although modern LP solvers like CPLEX ^{20} and GLPK ^{21} are very efficient in solving small LPs, solving many of them one by one may consume considerable time. Note that using any of the prior work to solve an LP at a time in GPU will not provide acceleration in such applications since they perform well only on large LPs. We show that with batched computation, performance acceleration can be achieved even for small LPs (e.g. LPs of size 5) for a considerably large *batch size*, where *batch size* refers to the number of LPs in a batch. We present a CUDA C/C++ implementation of a solver which implements the simplex method ^{22}, with an effort to keep coalescent memory accesses, efficient CPU-GPU memory transfer and an effective load balancing. To the best of our knowledge, this is the first work in the direction of batched LP solving on a GPU. Batched computations on a GPU to draw performance on small problem instances is recently adopted in ^{23, 24, 25, 26, 27, 28}. The solver source and necessary instructions for repeatability evaluation can be found at https://bitbucket.org/rajgurung777/simplexprojects. We report solutions of LPs of dimension up to ( variables, constraints) with our solver. Beyond a sufficiently large batch size, our implementation shows significant gain in performance compared to solving them sequentially in the CPU using the GLPK library ^{21}, an open source LP solver and the CPLEX solver from IBM. The evaluation on selected LP benchmarks from the Netlib repository displays a maximum speed-up of and with respect to CPLEX and GLPK solver respectively, for a batch of LPs. In addition, we consider a special class of LPs having the feasible region as an hyper-rectangle and exploit the fact that these can be solved cheaply without using the simplex algorithm. We implement this special case LP solver as part of the solver.

The rest of the paper is organized as follows. Related works are discussed in Section 2. A motivating application of batched solving of small LPs is discussed in Section 3. In Section 4, we discuss the simplex method that is needed to appreciate the rest of the paper. Section 5 illustrates our CUDA implementation for solving batched LPs on a GPU, with memory coalescence, effective load balancing and efficient CPU-GPU memory transfer using CUDA streams. Section 6 presents an evaluation of our solver on batched solving of LP benchmarks in comparison to solving the same sequentially with GLPK and CPLEX. Section 7 demonstrates the performance enhancement in state-space exploration of models of control systems, on using our solver. We conclude in Section 8.

## 2 Related Work

A multi-GPU implementation of the simplex algorithm in ^{14} reports a speed-up of on dense randomly generated LPs of dimension . An average speed-up of has been reported for problems of dimension or higher on a single GPU as compared to a sequential CPU implementation. An implementation of the revised simplex method using the OpenGL graphics library is reported in ^{15}. An average speed-up of is reported, compared to the GLPK library, for randomly generated dense problems of size or higher. The paper also reports that the number of iterations in the simplex procedure is considerably less when the “projected steepest-edge” pivoting rule is used as against Dantzig’s “maximum reduced cost” rule. A GPU implementation of the revised simplex algorithm is also reported in ^{16} with a speed-up of to in comparison to a serial ATLAS-based CPU implementation, for LPs of dimension to . Automatically Tuned Linear Algebra Software (ATLAS^{29}) is a library providing BLAS (Basic Linear Algebra Subprograms) APIs for C and Fortran. BLAS ^{30}

is a specification that prescribes routines for basic vector and matrix operations. A GPU-based implementation of the Revised Simplex algorithm and a Primal-Dual Exterior Point Simplex Algorithm (PDEPSA) is presented in

^{17}. Pre-processing techniques on LPs have been proposed in order to gain performance. The LP solving algorithms use the steepest-edge pivoting rule which the authors claim to show the best performance in comparison to the other pivoting rules

^{31}. Experimental results on Netlib

^{32}, Kennington and Mèszàros benchmarks shows that, on average the PDEPSA is and faster than the MATLAB’s multi-threaded built-in function linprog, which implements the interior point method, and a sequential CPU-based PDEPSA respectively. A CPU-GPU hybrid implementation of the Simplex algorithm is proposed in

^{33}. The conditional control instructions are executed in the CPU and the GPU execute the compute intensive tasks. Experimental result display a maximum speed-up of for LPs of size 7000. A single and a multi-GPU CUDA implementation of the standard simplex method with the steepest edge pivoting rule is proposed in

^{18}. A comparison of their implementation with an open source solver, CLP is provided. It is shown empirically, that the performance of the revised simplex implementation (CLP) degrades with an increase in the density of the LP. However, the performance of their simplex implementation remains stable even when the LP density increases. Another implementation of the standard simplex method is presented in

^{19}for solving dense LP problems with double precision arithmetic. An average speed-up of is shown for LPs of size greater than 3000 as compared to the sequential CPU counterpart. We observe that almost all prior works report speed-up when the LP size is generally 500 or above.

^{33}is an exception which shows a marginal speed-up on LPs of size 100.

## 3 Motivating Application

In model-based design of control systems, a standard technique of analysis is to compute the state-space of the model using exploration algorithms. Properties of the control system such as safety and stability are analyzed by observing the computed state-space. In this section, we consider two open-source tools that perform state-space exploration of control systems with linear dynamics, namely SpaceEx ^{12} and XSpeed ^{13}. These tools can analyze systems modeled using a mathematical formalism known as *hybrid automaton* ^{34}. A conservative over-approximation of the exact state space is computed by both the tools. The state of the art state-space exploration algorithm in these tools compute the state-space as a union of convex sets, each having a symbolic representation, known as the support function representation ^{35}. The algorithm requires a conversion of these convex sets from its symbolic support function representation to concrete convex polytope representation, in order to have certain operations efficient. This conversion involves solving a number of linear programs. Moreover, the precision of the conversion and consequently, the precision of the computed state-space depends on the number of LPs solved. Table 1 shows the number of LPs and its dimension that these tools solve for a fairly accurate state-space computation over a time horizon of just 100 seconds, on some standard control systems benchmarks.

Benchmark | LP Dimension | Number of LPs |
---|---|---|

Fourth Order Filtered Oscillator | 6 | |

Eight Order Filtered Oscillator | 10 | |

Helicopter Controller | 28 |

We see that the number of LPs to be solved in the above examples is in the order of which cannot be solved in practical time limits even by the fast modern LP solvers like GLPK or CPLEX, when solved sequentially. Figure 1 shows the computed state-space of a Filtered Oscillator model using the tool XSpeed. This computation required solving LPs. Note that although a solver like CPLEX take approximately seconds to solve an LP of dimension 6 in a modern CPU, it will take nearly 7 hours to solve LPs sequentially. Therefore, we believe that there is a need to accelerate applications where such bulk LP solving is necessary. A brief description of the control systems benchmarks of Table 1 and the performance improvement when the tool XSpeed use our batched LP solver is reported in Section 7.

## 4 Linear Programming

A linear program in *standard form* is maximizing an objective function under the given set of linear constraints, represented as follows:

(1) |

subject to the constraints

(2) |

and

(3) |

In Expression (1), is the objective function to be maximized and Inequality (2) shows the constraints over variables. Inequality (3) shows the non-negativity constraints over variables. An LP in *standard form* can be converted into *slack form* by introducing additional slack variables (), one for each inequality constraint, to convert it into an equality constraint, as shown below:

(4) |

An algorithm that solves LP problems efficiently in practice is the *simplex method* described in ^{22}. The variables on the left-hand side of the Equation (4) are referred as basic variables and those on the right-hand side are non-basic variables. The *initial basic solution* of an LP is obtained by assigning its non-basic variables to zero. The *initial basic solution* may not be always feasible (when one or more of the ’s are negative, resulting in the violation of the non-negativity constraint). For such LPs, the simplex method employs a two-phase algorithm. In the first phase, a new auxiliary LP is formed by having a new objective function , which is the sum of the newly introduced artificial variables. The simplex algorithm is employed on this auxiliary LP and it is checked if the optimal solution to the objective function is zero. If a zero optimal is found then it implies that the original LP has a feasible solution and the simplex method initiates the second phase. In the second phase, the feasible slack form obtained from the first phase is considered and the original objective function is restored with appropriate substitutions and elimination of the artificial variables. The simplex algorithm is then employed to solve the LP.

Prior to the simplex method, many LP solvers apply preconditioning techniques such as a simple geometric mean scaling in combination with equilibration to reduce the condition number of the constraint matrix in order to decrease the computational effort for solving an LP

^{36, 37, 38, 39}. In this work, we do not apply any pre-conditioning on the LP for simplicity and use the simplex algorithm described in the following section.

### 4.1 The Simplex Algorithm

The simplex algorithm is an iterative process of solving a LP. Each iteration of the simplex algorithm attempts to increase the value of the objective function by replacing one of the basic variables (also known as the leaving variable), by a non-basic variable (called the entering variable). The exchange of these two variables is obtained by a *pivot operation*. The index of the leaving and the entering variables are called the pivot row and pivot column respectively. The simplex algorithm iterates on a tabular representation of the LP, called the simplex tableau. The simplex tableau stores the coefficients of the non-basic, slack and artificial variables in its rows. It contains auxiliary columns for storing intermediate computations. In our implementation, we consider a tableau of size , where and . The ()th row stores, the best solution to the objective function found until the last iteration, along with the coefficients of the non-basic variables in the objective function.

There are two auxiliary columns, the first column stores the index of the basic variables and the second stores ’s of inequality (2). Figure 2 shows a schematic of the simplex tableau.

Step 1: Determine the entering variable.

At each iteration, the algorithm identifies a new *entering variable* from the non-basic variables. It is called an entering variable since it enters the set of basic variables. The choice of the entering variable is with the goal that increasing its value from 0 increases the objective function value. The index of the entering variable is referred to as the *pivot column*. The most common rule for selecting an entering variable is by choosing the index of the maximum in the last row of the simplex tableau (excluding the current optimal solution).

Step 2: Determine the leaving variable.

Once the pivot column is determined (say ), the algorithm identifies the row index with the minimum positive ratio , say , called the *pivot row*. The variable is called the leaving variable because it leaves the set of basic variables. This ratio represents the extend to which the entering variable (in Step 1) can be increased without violating the constraints.

Step 3: Obtain the new improved value of the objective function.

The algorithm then performs the *pivot operation* which updates the simplex tableau such that the new set of basic variables are expressed as a linear combination of the non-basic ones, using substitution and rewriting. An improved value for the objective function is found after the pivot operation.

The above steps are iterated until the halt condition is reached. The halt condition is met when either the LP is found to be *unbounded* or the optimal solution is found. An LP is unbounded when no new leaving variable can be computed, i.e., when the ratio () in Step 2 is either negative or undefined for all . An optimal solution is obtained when no new entering variable can be found, i.e., the coefficients of the non-basic variables in the last row of the tableau are all negative values

## 5 Simultaneous Solving of Batched LPs on a GPU

We present our CUDA implementation that solves batched LPs in parallel on a GPU. In this discussion, we shall refer a CPU by *host* and a GPU by *device*. The LP batching is performed on the host and transferred to the device. Our solver implementation assumes that all the LPs in a batch are of the same size. The batch size is adjustable, depending on the device memory size and LP size. Our batching routine considers the maximum batch size that can be accommodated in the device memory.

### 5.1 CPU-GPU Memory Transfer and Load Balancing

First, we allocate device memory (global memory) from the host, that is required for creating a simplex tableau for the LPs in the batch. The maximum number of LPs that can be batched depends on the size of the device global memory in the device. The tableau for every LP in the batch is populated with all the coefficients and indices of the variables in the host side, before transferring to the device. To speed-up populating the tableau in the host, we initialize the tableau in parallel using OpenMP threads. Once initialized, the Simplex tableaux are copied from the host to the device memory (referred to as H2D-ST in Figure 6). The LP batching routine is shown in Algorithm 1. The GPU kernel modifies the tableau to obtain solutions using the simplex method for every LP in the batch and copies back from the device to the host memory (referred to as D2H-res in Figure 6). We discuss further on our CPU-GPU memory transfer using CUDA streams for efficiency in Section 5.4.

#### Load Balancing

We assign a CUDA block of threads to solve an LP in the batch. Since blocks are scheduled to Streaming Multiprocessors (SMs), this ensures that all SMs are busy when there are sufficiently large number of LPs to be solved in the batch. As CUDA blocks execute asynchronously, such a task division emulates solving many LPs independently in parallel. Moreover, each block is made to consist of () threads, which is a multiple of , as threads in GPU are scheduled and executed as warps. The block of threads is utilized in manipulating the simplex tableaux in parallel, introducing another level of parallelism. In Figure 3, we show a block diagram of the logical threads in CUDA, of our GPU kernel.

.

### 5.2 Implementation of the Simplex Algorithm

Finding the pivot column in Step 1 of the simplex algorithm above requires to determine the index of the maximum value from the last row of the tableau. We parallelize Step 1 by utilizing (out of ) threads in parallel to determine the pivot column using parallel reduction described in ^{40}. A parallel reduction is a technique applied to achieve data parallelism in GPU when a single result (e.g. min, max) is to be computed from an array of data. We have implemented a parallel reduction by using two auxiliary arrays, one for storing the data and the other for storing the array indices of the corresponding data. The result of a parallel reduction provides us the maximum value in the data array and its corresponding index in the indices array.

We also apply parallel reduction in Step 2 by utilizing (out of ) threads in parallel to determine the pivot row ( being the row-size of the simplex tableau). It involves finding a minimum positive value from an array of ratios (as described in Step 2 above) and therefore ratios which are not positive needs to be excluded from the minimum computation. This leads to a conditional statement in the parallel reduction algorithm and degrades the performance due to warp divergence. Even if we re-size the array to store only the positive values, the kernel still contains conditional statements to check the threads that need to process this smaller size array. To overcome performance degradation with conditional statements, we substituted a large positive number in place of ratios that are negative or undefined. This creates an array that is suitable for parallel reduction in our kernel implementation.

Data parallelism is also employed in the pivot operation in Step 3, involving substitution and re-writing, using the () threads (out of threads in the block).

### 5.3 Coalescent Memory Access

In this section, we discuss our efforts on keeping a coalescent access to global memory to reduce performance loss due to cache misses. When threads in a warp access contiguous locations in the memory, the access is said to be coalescent. A coalescent memory access results in performance benefits due to an increased cache hit rate.

As discussed earlier, we use global memory to store the simplex tableaux of the LPs in a batch as described in Section 5.1 (Since the global memory in a GPU is of the maximum size in the memory hierarchy, it can accommodate many tableaux). We store the simplex tableau in memory as a two-dimensional array. High level languages like C and C++ use the row-major order by default for representing a two-dimensional array in the memory. CUDA is an extension to C/C++ and also use the row-major order. The choice of row or column major order representation of two-dimensional arrays plays an important role in deciding the efficiency of the implementation, depending on whether the threads in a warp access the adjacent rows or adjacent columns of the array and what is the offset between the consecutive rows and columns.

We use the term column-operation, when elements of all rows from a specific column are accesses simultaneously by each thread in a warp. If the array is in a row-major order, then this operation is not a coalesced memory access, as each thread access elements from the memory separated by the size equal to the column-width of the array. When elements of a specific row are accessed simultaneously by threads of a warp, we called this a row-operation. Note that for a two dimensional array stored in row-major order, a row-operation is coalesced since each thread access data from contiguous locations in the memory.

We now show that in the simplex algorithm described above, there are more column-operations than row operations and thus, storing our data (i.e. simplex tableau) in a column-major order ensure more coalesced memory access in comparison to having a row-major storage.

Step 1 of the simplex algorithm determines the entering variable (also known as the pivot column), which requires finding the index of the maximum positive coefficient from the last row. This requires a row-operation and as mentioned in Section 5.2, we use parallel reduction using two auxiliary arrays, Data and Indices as shown in Figure 4. Although accessing from the last row of the simplex tableau is not coalesced (due to our column-major ordering) but copying into the Data (and Indices) array is coalesced and so is the parallel reduction algorithm on the Data (and Indices) array. We use the technique of *Parallel Reduction: Sequential Addressing* in ^{40}, a technique that ensures coalesced memory access.

Step 2 of the simplex algorithm determines the leaving variable (also called the pivot row) by computing the row index with the minimum positive ratio , as described in Section 4.1. This requires two column-operations involving the access to all elements from columns and as shown in Figure 5. To compute the row index with the minimum positive ratio, we use parallel reduction as described above in Section 5.2. Our tableau being stored in a column-major order, access to columns and are both coalesced. The ratio and its corresponding indices (represented by the thread ID) are stored in the auxiliary arrays, Data and Indices which is also coalesced. Like in Step 1, we use the same technique of *Parallel Reduction: Sequential Addressing* in ^{40} for coalesced memory access.

Step 3 performs the pivot operation that updates the elements of the simplex tableau and is the most expensive of the three steps. It first involves a non-coalescent row-operation which computes the new modified pivot row (denoted by the index ) as {}, where PE is the element in cell at the intersection of the pivot row and the pivot column for that iteration, known as the pivot element. The modified row () is then substituted to update each element of all the rows of the simplex tableau, using the formula (see code Listing 1). The elements of the pivot column are first stored in an array named which is a column-operation, and so is coalesced, due to the column-major representation of the tableau. The crucial operation is updating each element for every row (except the pivot row ) of the simplex tableau, which requires a nested for-loop operation. We parallelize the outer for-loop that maps the rows of the simplex tableau. Our data being represented in a column-major order, parallel access to all rows for each element in the column of the inner for-loop is coalesced.

To verify the performance gain due to coalesced memory access, we experiment with Step 3 which is the most expensive of the three steps in the simplex algorithm, by modifying it to have non-coalesced memory accesses. In the code Listing 1, we interchange the inner for-loop with the outer loop (loop interchange, a common technique to improve cache performance ^{41}). This loop interchanging converts the Step 3 to have non-coalesced memory access since our simplex tableau is represented in a column-major order. Table 2 presents the experimental results to show the gain in performance when the accesses to memory are coalesced, as compared to having non-coalesced memory accesses. The results show a significant gain in performance on a Tesla K40c GPU, for LPs with the initial basic solution as feasible.

We observe that Step 1 has a row-operation, Step 2 has two column-operations and Step 3 has a row and a column operation each with a nested for-loop which can be expressed both in row as well as column operations. We see that there are more column operations than row operations.

### 5.4 Overlapping data transfer with kernel operations using CUDA Streams

The memory bandwidth of host-device data copy is a major bottleneck in CUDA applications. We use nvprof ^{42} to profile time for memory transfer and kernel operation of our implementation discussed above in Section 5. The result of profiling in a Tesla K40c GPU, for LPs with an initial basic solution as feasible, is reported in Table 3. We observe that, for a small batch size (e.g. in the Table 3), the memory copy operation takes a maximum of of the execution time, whereas for larger batch size, the memory copy operation takes to of the execution time.

A standard technique to improve the performance in CPU-GPU heterogeneous applications is by using CUDA streams. CUDA streams allow overlapping memory copy operation with kernel execution. A stream in CUDA consists of a sequence of operations executed on the device in the order in which they are issued by the host procedure. These operations can not only be executed in an interleaved manner, but also be executed concurrently in order to gain performance ^{43}.

A GPU in general, has a separate kernel and a copy engine. All kernel operations are executed using the kernel engine and memory copy operations to and from the device are performed by the copy engine. However, some GPUs have two copy engines, one each for copying data from host to device and from device to host, for better performance. Figure 6 illustrates the overlapping of kernel executions with memory copy operation, when the GPU has one kernel and one copy engine. Streaming by batching similar operations causes more overlap of copies with kernel executions, as depicted in the figure. Adding all host-to-device copy to the CUDA streams followed by all kernel launches and device-to-host data copies, can result in a significant overlap of memory copy operations with kernel executions, resulting in a performance gain. When there are two copy engines, looping the operations in the order of a host-to-device copy followed by kernel launch and device-to-host copy, for all streams may result in a better performance than the former method. For GPUs with compute capability 3.5 and above, both the methods result in the same performance due to the Hyper-Q ^{44} feature.

Although a large number of CUDA streams achieves more concurrency and interleaving among operations, it incurs stream creation overhead. The number of CUDA streams that gives optimal performance is found by experimentation. From our experimental observations, we conclude that with varying the batch size and the LP dimension, the optimal number of streams also varies. In this paper, we report results with streams for batch size more than LPs. We use a single stream when the batch size is less than (for LPs of any dimension).

### 5.5 Limitations of the Implementation

The memory required for an LP (i.e., a tableau) in our implementation can be approximately computed as:

(5) |

where and are the sizes of rows and columns of the simplex tableau respectively, is the number of variables (dimension of the LP problem), is the number of slack variables and is the number of artificial variables in the given LP. The size of each LP is bytes, where is the data type of the LP coefficients and is the size of array used for performing parallel reduction operation, the number in the equation signify the use of two auxiliary arrays. The size of the is described in Subsection 4.1. Thus, if is the size of total global memory (in bytes) available in the GPU, then our threshold limit or the number of LPs that can be solved at a time is determined by the equation . As the current limit on threads per block is in CUDA, our implementation limits the size of LPs having *initial basic solution as feasible* to . The size limit for LPs having *initial basic solution as infeasible* is . This limit is derived from (6). We intend to address this limitation in future work.

(6) |

### 5.6 Solving a Special Case of LP

The feasible region of an LP given by its constraints defines a convex polytope. We observe that when the feasible region is a hyper-rectangle, which is a special case of a convex polytope, the LP can be solved cheaply. Equation (7) shows that maximizing the objective function is the sum of the results on dot products.

(7) |

where is the sampling directions over the given hyper-rectangle .

An implementation for solving this special case of LPs is incorporated in our solver. In order to solve many LPs in parallel, we organize CUDA threads in a one-dimensional block of threads with each block used to solve an LP. Each block is made to consist of only 32 threads, the warp size. Within each block, we used only a single thread to perform the operations of the kernel. A preliminary introduction to this technique is introduced in the paper ^{13}.

## 6 Evaluation

We evaluate our solver on two set of LPs. The first is a set of randomly constructed LPs. The LPs in this set are constructed by randomly selecting the coefficients of the constraint matrix from a range of , the bounds of the constraints from a range of and the coefficients of the objective functions from the range of respectively. The second set of LPs are selected from the Netlib repository. On both the set of LPs, we evaluate the performance of our batched solver on varying batch sizes. In the text that follows, we refer to our solver on a GPU as BLPG, abbreviating Batched LP solver on a GPU. The solver using CUDA streams is referred as BLPG-SM. We perform our experiment in Intel Xeon E5-2670 v3 CPU, GHz, Core (without hyper-threading), GB RAM with Nvidia’s Tesla K40c GPU. The reported performance is an average over runs. A performance comparison of our solver to GLPK is shown in Figure 13, for the first set of randomly generated LPs of various sizes and various batch sizes. We observe a maximum speed-up of for LPs having the *initial basic solution as feasible*, for a batch of 20k LPs of dimension 100. For the same type of LPs, we see a maximum speed-up of , for a batch of LPs of dimension 100, using BLPG-SM. We observe that for LPs of large size, BLPG performs better even with a few LPs in parallel (e.g., batch size=50 for a 500 dimensional LP). However, for small size LPs, BLPG out-performs GLPK only for larger batch sizes (e.g. a batch size of for a 5 dimensional LP).

Table 4 shows a performance comparison of BLPG with GLPK on LPs having the initial basic solution as infeasible. In-spite of the fact that BLPG executes the kernel twice due to the two-phase simplex algorithm as discussed in Section 4 (an extra overhead of data exchange between the two kernels), we observe a better performance. We gain a maximum speed-up of nearly , for a batch of LPs of dimension .

On profiling BLPG-SM, we observe that for small sized LPs, the processing time of the kernel is much larger than the data transfer time, as shown in Table 3. As a result, the gain in performance due to overlapping data transfer with kernel execution is negligible. This is evident from the results in Figures (a)a,(b)b and (c)c. As the LP size increases, the volume of data transfer also significantly increases. Hence, the operation of data transfer for all the streams (except the first) can be overlapped while the first kernel is in execution, thereby saving the time for data transfer in the rest of the streams. This results in performance gain for LPs of large sizes, as evident from Figure (d)d, (e)e and (f)f.

Table 8 shows a comparative evaluation of our solver to CPLEX and GLPK, on a set of selected LPs from the Netlib repository. The experimental platform is a Core Intel Xeon CPU E5-1607 v4, GHz, GB RAM with Nvidia’s Tesla K40m GPU. The LP benchmarks in the Netlib repository are present in MPS format. We use the MATLAB’s built-in function mpsread to read the benchmarks and then convert them into the standard form denoted by Equations 1 and 2. The converted sizes of the benchmarks is shown in Table 5

with column heading “Rows” indicating the number of converted constraints and “Cols” indicating the size of the benchmark. The table also shows the number of floating point operations per second (in Giga flops), giving an estimation of the floating point computations in the Simplex algorithm and the utilization of GPU by our proposed batched LP solver. We use the visual profiler (nvvp) available in the CUDA Toolkit

^{45}. The batched LP solver gives a maximum of Gflops/s for a batch size of LPs on a Nvidia Tesla K40m GPU that has a theoretical peak of 1.43 Tflops/s, for double precision arithmetic.

Due to the LP size limitation of BLPG discussed earlier in Section 5.5, we choose benchmarks that satisfy this limitation. Table (a)a shows the execution time (in seconds) on the Netlib benchmarks by the three solvers and Table (b)b displays the performance comparison of the two LP solvers with that of BLPG. Note that Netlib benchmarks are highly sparse in nature and LP solvers such as IBM CPLEX and GLPK are optimized for sparse LPs. Our implementation is the original Simplex method proposed by Dantzig and does not have any optimization for sparse LPs. We observe that in some benchmarks such as SC205 in Table 8 (and some others, not included in the table) CPLEX performs better than GLPK, whereas in other benchmark instances, GLPK outperforms CPLEX. Our proposed BLPG achieves a maximum of and nearly speed-up on the Netlib benchmarks w.r.t. CPLEX and GLPK respectively.

## 7 Performance on Motivational Application

In this section, we demonstrate the performance enhancement of state-space exploration of models of control systems, using our batched LP solver. As discussed in Section 3, the state-space exploration routines in the state of the art tools requires solving a large number of LPs. We consider two benchmarks, the Helicopter controller and a Five dimensional dynamical system for its state-space computation using the tools SpaceEx-LGG and XSpeed. The Helicopter controller benchmark is a model of a twin-engined multi-purpose military helicopter with 8 continuous variable modeling the motion and 20 controller variables that governs the various controlling actions of the helicopter ^{46, 12}. The Five dimensional dynamical system benchmark is a model of a five dimensional linear continuous system as defined in ^{47}. We direct the reader to the paper ^{47} for details on the dynamics of the model.

In order to show the impact of our solver BLPG, we evaluate the performance of the tools by solving the resulting LPs generated from the state-space exploration routines on these benchmarks. The LPs are sequentially solved using GLPK and in parallel using BLPG. The performance comparison is shown in Table 9, in an experimental setup of Intel Q9950 CPU, 2.84 Ghz, 4 Core (no hyper-threading), 8 GB RAM with a GeForce GTX 670 GPU. We observe a maximum speed-up of and in the tools performance with parallel solving of the LPs in BLPG w.r.t. sequential solving in GLPK. When compared to the tool SpaceEx-LGG, we observe a maximum of and speed-up in XSpeed using our solver. Note that the LPs generated by the tool on these benchmarks have the property that their feasible region is an hyper-rectangle. Therefore, BLPG solves these using the technique mentioned in Section 5.6.

## 8 Conclusion

Solving a linear program on a GPU for an accelerated performance on a CPU-GPU heterogeneous platform has been extensively studied. To the best of our knowledge, all such work report a performance gain only on linear programs of large size. We present a solver implemented in CUDA that can accelerate applications having to solve small to medium size LPs, but a large number of them. Our solver batches the LPs in an application and solves them in parallel on a GPU using the simplex algorithm. We report significant performance gain on benchmarks in comparison to solving them in CPU using GLPK and CPLEX solvers. We show the utility of our solver in an application of state-space exploration of models of control systems which involves solving many small to medium size LPs, by showing significant performance improvement.

## Acknowledgements

This work was supported by the National Institute of Technology Meghalaya, India and by the the DST-SERB, GoI under project grant No. YSS/2014/000623. The authors thank Santibrata Parida, for his help in experimental evaluations.

## References

- 1 Birk Matthias, Dapp Robin, Ruiter Nicole V, Becker Jürgen. GPU-based iterative transmission reconstruction in 3D ultrasound computer tomography. Journal of Parallel and Distributed Computing. 2014;74(1):1730–1743.
- 2 Yan Guorui, Tian Jie, Zhu Shouping, Dai Yakang, Qin Chenghu. Fast cone-beam CT image reconstruction using GPU hardware. Journal of X-Ray Science and Technology. 2008;16(4):225–234.
- 3 Michalakes John, Vachharajani Manish. GPU Acceleration of Numerical Weather Prediction. Parallel Processing Letter. 2008;:1–18.
- 4 Zhang Jian, McQuillan Ian, Wu FangXiang. Speed Improvements of Peptide - Spectrum Matching Using SIMD Instructions. IEEE International Conference on Bioinformatics and Biomedicine Workshops (BIBMW). 2010;:1–22.
- 5 Clark Michael A, La Plante PC, Greenhill Lincoln J. Accelerating radio astronomy cross-correlation with graphics processing units. International Journal of High Performance Computing Applications. 2012;:1094342012444794.
- 6 Karimi Kamran, Dickson Neil, Hamze Firas. High-Performance physics simulations using multi-core CPUs and GPGPUs in a volunteer computing context. International Journal of High Performance Computing Applications. 2011;25(1):61–69.
- 7 Lim Rone Kwei, Pro J William, Begley Matthew R, Utz Marcel, Petzold Linda R. High-performance simulation of fracture in idealized ‘brick and mortar’composites using adaptive Monte Carlo minimization on the GPU. International Journal of High Performance Computing Applications. 2015;:1094342015593395.
- 8 Shirahata Koichi, Sato Hitoshi, Suzumura Toyotaro, Matsuoka Satoshi. A GPU Implementation of generalized graph processing algorithm GIM-V. Proceedings - 2012 IEEE International Conference on Cluster Computing Workshops, Cluster Workshops 2012. 2012;:207–212.
- 9 Chandra Jeya, Schall Susan O. Economic justification of flexible manufacturing systems using the Leontief input-output model. The Engineering Economist. 1988;34(1):27–50.
- 10 Munkres James. Algorithms for the assignment and transportation problems. Journal of the Society for Industrial and Applied Mathematics. 1957;5(1):32–38.
- 11 Drozdowski Maciej, Lawenda Marcin, Guinand Frédéric. Scheduling multiple divisible loads. International Journal of High Performance Computing Applications. 2006;20(1):19–30.
- 12 Frehse Goran, Le Guernic Colas, Donzé Alexandre, et al. SpaceEx: Scalable Verification of Hybrid Systems. In: Ganesh Gopalakrishnan Shaz Qadeer, ed. Proc. 23rd International Conference on Computer Aided Verification (CAV), LNCSSpringer; 2011.
- 13 Ray Rajarshi, Gurung Amit, Das Binayak, Bartocci Ezio, Bogomolov Sergiy, Grosu Radu. XSpeed: Accelerating Reachability Analysis on Multi-core Processors. In: Piterman Nir, ed. Hardware and Software: Verification and Testing - 11th International Haifa Verification Conference, HVC 2015, Haifa, Israel, November 17-19, 2015, Proceedings, Lecture Notes in Computer Science, vol. 9434: :3–18Springer; 2015.
- 14 Lalami Mohamed Esseghir, El-Baz Didier, Boyer Vincent. Multi GPU implementation of the simplex algorithm. In: ; 2011.
- 15 Bieling Jakob, Peschlow Patrick, Martini Peter. An efficient GPU implementation of the revised simplex method. In: :1–8IEEE; 2010.
- 16 Spampinato Daniele G., Elster Anne C.. Linear optimization on modern GPUs. In: :1–8IEEE; 2009.
- 17 Ploskas Nikolaos, Samaras Nikolaos. Efficient GPU-based implementations of simplex type algorithms. Applied Mathematics and Computation. 2015;250:552–570.
- 18 Meyer Xavier, Albuquerque Paul, Chopard Bastien. A multi-GPU implementation and performance model for the standard simplex method. In: :312–319; 2011.
- 19 Lalami Mohamed Esseghir, Boyer Vincent, El-Baz Didier. Efficient implementation of the simplex method on a CPU-GPU system. In: :1999–2006IEEE; 2011.
- 20 CPLEX IBM ILOG. V12. 1: User’s Manual for CPLEX. International Business Machines Corporation. 2009;46(53):157.
- 21 Makhorin Andrew. GNU Linear Programming Kit, v.4.37. http://www.gnu.org/software/glpk; 2009.
- 22 Dantzig George B., Thapa Mukund N.. Linear Programming 1: Introduction. Springer; 1997.
- 23 Haidar Azzam, Dong Tingxing, Luszczek Piotr, Tomov Stanimire, Dongarra Jack. Batched matrix computations on hardware accelerators based on GPUs. The International Journal of High Performance Computing Applications. 2015;29(2):193–208.
- 24 Abdelfattah Ahmad, Haidar Azzam, Tomov Stanimire, Dongarra Jack. Performance, design, and autotuning of batched GEMM for GPUs. In: :21–38Springer; 2016.
- 25 Jhurani Chetan, Mullowney Paul. A GEMM interface and implementation on NVIDIA GPUs for multiple small matrices. Journal of Parallel and Distributed Computing. 2015;75:133–140.
- 26 Anderson Michael J, Sheffield David, Keutzer Kurt. A predictive model for solving small linear algebra problems in gpu registers. In: :2–13IEEE; 2012.
- 27 Villa Oreste, Gawande Nitin, Tumeo Antonino. Accelerating subsurface transport simulation on heterogeneous clusters. In: :1–8IEEE; 2013.
- 28 Wainwright Ian, Sweden High Performance Consulting. Optimized LU-decomposition with full pivot for small batched matrices. In: ; 2013.
- 29 Whaley R Clint. ATLAS (Automatically Tuned Linear Algebra Software). In: Springer 2011 (pp. 95–101).
- 30 Geijn Robert, Goto Kazushige. BLAS (Basic Linear Algebra Subprograms). In: Springer 2011 (pp. 157–164).
- 31 Ploskas Nikolaos, Samaras Nikolaos. GPU accelerated pivoting rules for the simplex algorithm. Journal of Systems and Software. 2014;96:1–9.
- 32 Dongarra J, Grosse E, others . Netlib repository. 2006.
- 33 Li Jianming, Lv Renping, Hu Xiangpei, Jiang Zhongqiang. A GPU-based parallel algorithm for large scale linear programming problem. In: Springer 2011 (pp. 37–46).
- 34 Alur Rajeev, Courcoubetis Costas, Henzinger Thomas A, Ho Pei-Hsin. Hybrid automata: An algorithmic approach to the specification and verification of hybrid systems. In: Springer 1993 (pp. 209–229).
- 35 Girard Antoine, Le Guernic Colas. Efficient Reachability Analysis for Linear Systems using Support Functions. In: ; 2008.
- 36 Tomlin John A. On scaling linear programming problems. Computational practice in mathematical programming. 1975;:146–166.
- 37 Larsson T. On scaling linear programs–Some experimental results. Optimization. 1993;27(4):355–373.
- 38 Elble Joseph M, Sahinidis Nikolaos V. Scaling linear optimization problems prior to application of the simplex method. Computational Optimization and Applications. 2012;52(2):345–371.
- 39 Ploskas Nikolaos, Samaras Nikolaos. A computational comparison of scaling techniques for linear optimization problems on a graphical processing unit. International Journal of Computer Mathematics. 2015;92(2):319–336.
- 40 Harris Mark. Optimizing parallel reduction in CUDA. NVIDIA Developer Technology. 2007;.
- 41 Hennessy John L, Patterson David A. Computer architecture: a quantitative approach. Elsevier; 2011.
- 42 Nvidia . CUDA C Programming Guide. 2015.
- 43 Harris Mark. How to Overlap Data Transfers in CUDA C/C++ .
- 44 Bradley Thomas. Hyper-Q Example. 2012.
- 45 Nvidia . CUDA Toolkit Documentation v8.0 (2017). Nvidia Corporation. 2017;.
- 46 Skogestad Sigurd, Postlethwaite Ian. Multivariable Feedback Control: Analysis and Design. John Wiley & Sons; 2005.
- 47 Girard Antoine. Reachability of Uncertain Linear Systems Using Zonotopes. In: Morari Manfred, Thiele Lothar, eds. HSCC, Lecture Notes in Computer Science, vol. 3414: :291-305Springer; 2005.

## Author Biography

Amit Gurung is a Ph.D. student at the National Institute of Technology Meghalaya, India. He received an MCA (Master in Computer Applications) degree in 2010 and B.Sc. in Computer Science in 1999 from North Eastern Hill University, Shillong, India. His research includes High Performance Computing, Parallel Programming and Reachability Analysis of Hybrid Systems.

Rajarshi Ray is an Assistant Professor at the department of Computer Science and Engineering, National Institute of Technology Meghalaya. He received his B.Sc. in Computer Science from University of Pune in 2005 and M.Sc. in Computer Science from the Chennai Mathematical Institute in 2007. He received his Ph.D. in Computer Science from Verimag, University of Grenoble in 2012. His research interests are in formal methods in systems verification, cyber-physical systems and approximate computing.

Comments

There are no comments yet.