
SoaAlloc: Accelerating Single-Method Multiple-Objects Applications on GPUs
We propose SoaAlloc, a dynamic object allocator for Single-Method Multip...

Efficient Sparse-Dense Matrix-Matrix Multiplication on GPUs Using the Customized Sparse Storage Format
Multiplication of a sparse matrix to a dense matrix (SpDM) is widely use...

Bandwidth-Optimal Random Shuffling for GPUs
Linear-time algorithms that are traditionally used to shuffle data on CP...

GNNAdvisor: An Efficient Runtime System for GNN Acceleration on GPUs
As the emerging trend of the graph-based deep learning, Graph Neural Net...

Automatic Kernel Generation for Volta Tensor Cores
A commonly occurring computation idiom in neural networks is to perform ...

Technical Report about Tiramisu: a Three-Layered Abstraction for Hiding Hardware Complexity from DSL Compilers
High-performance DSL developers work hard to take advantage of modern ha...

Backpropagation for long sequences: beyond memory constraints with constant overheads
Naive backpropagation through time has a memory footprint that grows lin...
Automatic acceleration of Numpy applications on GPUs and multicore CPUs
Frameworks like Numpy are a popular choice for application developers in fields ranging from image processing to bioinformatics to machine learning. Numpy is often used for prototyping and for deployment because it provides efficient implementations of array operations. However, Numpy executes every operation eagerly. The result of each operation must be stored in memory, which increases the application's memory footprint, and every use of that result must read it back from memory, which increases bandwidth requirements. We propose an approach that records the sequence of Numpy operations for deferred execution. When the values of an array are needed, for example when they are stored to disk or displayed on screen, the sequence of operations required to compute these values is compiled into a function and executed. This removes the need to store and load intermediates in slow memory, resulting in better performance. Where a library implementation is more efficient (as for matrix-matrix multiply), it is used instead. The approach also allows us to seamlessly target both multicore CPUs and NVIDIA GPUs, thereby porting the Numpy application to these architectures without changing the user program. We evaluate the benefit of the approach on computation samples from various domains and observe, on average, an order-of-magnitude performance improvement over Numpy.
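The record-then-compile idea described in the abstract can be illustrated with a toy sketch. This is not the authors' system: the `Lazy` class, its `array` and `materialize` methods, and the generated `fused` function are all hypothetical names chosen for illustration, plain Python lists stand in for Numpy arrays, and only elementwise `+`, `-`, and `*` with numeric constants are handled. The sketch shows only the general pattern, in which operators record an expression graph and the whole graph is turned into one function with a single loop when the values are requested, so no intermediate arrays are created.

```python
class Lazy:
    """Node of a recorded expression graph (hypothetical toy class).

    Constructing a node performs no arithmetic; it only remembers
    which operation should be applied to which operands.
    """

    def __init__(self, op, args=(), data=None):
        self.op, self.args, self.data = op, args, data

    @staticmethod
    def array(values):
        # Leaf node holding concrete input data.
        return Lazy("leaf", data=list(values))

    def _binary(self, other, op):
        if not isinstance(other, Lazy):
            other = Lazy("const", data=other)  # assumes a plain int or float
        return Lazy(op, (self, other))

    def __add__(self, other):
        return self._binary(other, "+")

    def __sub__(self, other):
        return self._binary(other, "-")

    def __mul__(self, other):
        return self._binary(other, "*")

    def materialize(self):
        """Compile the recorded graph into one function and run it."""
        slots, leaves = {}, []

        def emit(node):
            # Translate the graph into a single Python expression string.
            if node.op == "leaf":
                key = id(node)
                if key not in slots:
                    slots[key] = len(leaves)
                    leaves.append(node)
                return f"a{slots[key]}[i]"
            if node.op == "const":
                return repr(node.data)
            lhs, rhs = node.args
            return f"({emit(lhs)} {node.op} {emit(rhs)})"

        body = emit(self)
        if not leaves:
            raise ValueError("expression contains no input array")
        n = len(leaves[0].data)
        if any(len(leaf.data) != n for leaf in leaves):
            raise ValueError("all input arrays must have the same length")

        params = ", ".join(f"a{k}" for k in range(len(leaves)))
        src = f"def fused(n, {params}):\n    return [{body} for i in range(n)]\n"
        namespace = {}
        exec(src, namespace)  # one fused loop, no intermediate arrays
        return namespace["fused"](n, *(leaf.data for leaf in leaves))


x = Lazy.array([1.0, 2.0, 3.0])
y = Lazy.array([10.0, 20.0, 30.0])
z = (x + y) * 2.0 - x      # only records three operations
result = z.materialize()   # computation happens here
print(result)              # [21.0, 42.0, 63.0]
```

An eager implementation of the same expression would allocate one temporary for `x + y` and another for the product. In the sketch the generated function evaluates `((a0[i] + a1[i]) * 2.0) - a0[i]` once per element. A real system would presumably emit native CPU or GPU code rather than Python source, and would fall back to library routines for operations such as matrix-matrix multiply, as the abstract notes.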