Efficient Interleaved Batch Matrix Solvers for CUDA

by Andrew Gloster, et al.

In this paper we present a new methodology for data accesses when solving batches of tridiagonal and pentadiagonal systems that all share the same left-hand-side (LHS) matrix. Storing only one copy of this matrix significantly reduces storage overhead, and we show that it also reduces compute time. Together, these two results yield a more efficient implementation than the current state-of-the-art algorithms cuThomasBatch and cuPentBatch, allowing a greater number of systems to be solved on a single GPU.
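The idea of one shared LHS with an interleaved batch of right-hand sides can be sketched on the CPU as follows. This is a minimal illustration only, not the authors' CUDA implementation: it assumes the standard Thomas algorithm without pivoting, and the function names (`factor_tridiagonal`, `solve_interleaved`) and the layout rule (entry `i` of system `j` stored at index `i * batch + j`) are hypothetical choices made for this example.

```python
def factor_tridiagonal(a, b, c):
    """Pre-factor one tridiagonal matrix shared by every system in the batch.

    a: sub-diagonal (a[0] unused), b: main diagonal, c: super-diagonal
    (c[n-1] unused). Assumes no pivoting is needed (e.g. diagonal dominance).
    """
    n = len(b)
    c_prime = [0.0] * n
    inv_denom = [0.0] * n
    inv_denom[0] = 1.0 / b[0]
    c_prime[0] = c[0] * inv_denom[0]
    for i in range(1, n):
        inv_denom[i] = 1.0 / (b[i] - a[i] * c_prime[i - 1])
        c_prime[i] = c[i] * inv_denom[i]
    return c_prime, inv_denom


def solve_interleaved(a, c_prime, inv_denom, rhs, n, batch):
    """Solve all systems in place; rhs is a flat, interleaved list.

    Entry i of system j lives at rhs[i * batch + j] (assumed layout), so
    neighbouring systems sit next to each other in memory at every step.
    """
    for j in range(batch):  # on a GPU, each j would map to one thread
        # Forward sweep using the shared, pre-computed factors.
        rhs[j] *= inv_denom[0]
        for i in range(1, n):
            prev = rhs[(i - 1) * batch + j]
            rhs[i * batch + j] = (rhs[i * batch + j] - a[i] * prev) * inv_denom[i]
        # Back substitution.
        for i in range(n - 2, -1, -1):
            rhs[i * batch + j] -= c_prime[i] * rhs[(i + 1) * batch + j]
    return rhs


if __name__ == "__main__":
    # One 3x3 matrix (2 on the diagonal, -1 off it) shared by two systems.
    a = [0.0, -1.0, -1.0]
    b = [2.0, 2.0, 2.0]
    c = [-1.0, -1.0, 0.0]
    c_prime, inv_denom = factor_tridiagonal(a, b, c)
    # Right-hand sides [0, 0, 4] and [1, 0, 1], interleaved.
    rhs = [0.0, 1.0, 0.0, 0.0, 4.0, 1.0]
    print(solve_interleaved(a, c_prime, inv_denom, rhs, n=3, batch=2))
```

In this sketch the factorisation arrays are computed once and read by every system, so only the right-hand sides grow with the batch size. That is the storage saving the abstract describes, shown here under the stated assumptions.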




A Batched GPU Methodology for Numerical Solutions of Partial Differential Equations

In this paper we present a methodology for data accesses when solving ba...

Accelerating Domain Propagation: an Efficient GPU-Parallel Algorithm over Sparse Matrices

Fast domain propagation of linear constraints has become a crucial compo...

Grassmanian packings: Trust region stochastic tuning for matrix incoherence

We provide a new numerical procedure for constructing low coherence matr...

Programming Matrices as Staged Sparse Rows to Generate Efficient Matrix-free Differential Equation Solver

Solving differential equations is a critical task in scientific computin...

Matrix oriented reduction of space-time Petrov-Galerkin variational problems

Variational formulations of time-dependent PDEs in space and time yield ...

Engineering Kernelization for Maximum Cut

Kernelization is a general theoretical framework for preprocessing insta...

Optimizing Block-Sparse Matrix Multiplications on CUDA with TVM

We implemented and optimized matrix multiplications between dense and bl...

