
-
Matrix Engines for High Performance Computing: A Paragon of Performance or Grasping at Straws?
Matrix engines or units, in different forms and affinities, are becoming...
-
Scaling Distributed Deep Learning Workloads beyond the Memory Capacity with KARMA
The dedicated memory of hardware accelerators can be insufficient to sto...
-
The Case for Strong Scaling in Deep Learning: Training Large 3D CNNs with Hybrid Parallelism
We present scalable hybrid-parallel algorithms for training large-scale ...
-
A Study of Single and Multi-device Synchronization Methods in Nvidia GPUs
GPUs are playing an increasingly important role in general-purpose compu...
-
A Survey on Coarse-Grained Reconfigurable Architectures from a Performance Perspective
With the end of both Dennard's scaling and Moore's law, computer users a...
-
High-Performance High-Order Stencil Computation on FPGAs Using OpenCL
In this paper we evaluate the performance of FPGAs for high-order stenci...
-
AN5D: Automated Stencil Framework for High-Degree Temporal Blocking on GPUs
Stencil computation is one of the most widely-used compute patterns in h...
-
The Memory Controller Wall: Benchmarking the Intel FPGA SDK for OpenCL Memory Interface
Supported by their high power efficiency and recent advancements in High...
-
iFDK: A Scalable Framework for Instant High-resolution Image Reconstruction
Computed Tomography (CT) is a widely used technology that requires compu...
-
A Versatile Software Systolic Execution Model for GPU Memory-Bound Kernels
This paper proposes a versatile high-performance execution model, inspir...
-
Batched Sparse Matrix Multiplication for Accelerating Graph Convolutional Networks
Graph Convolutional Networks (GCNs) are recently getting much attention ...
-
A New Linear Time Correctness Condition for Multiplicative Linear Logic
In this paper, we give a new linear time correctness condition for proof...
-
Adaptive Pattern Matching with Reinforcement Learning for Dynamic Graphs
Graph pattern matching algorithms to handle million-scale dynamic graphs...
-
Second-order Optimization Method for Large Mini-batch: Training ResNet-50 on ImageNet in 35 Epochs
Large-scale distributed training of deep neural networks suffers from the...
-
Double-precision FPUs in High-Performance Computing: an Embarrassment of Riches?
Among the (uncontended) common wisdom in High-Performance Computing (HPC...
-
μ-cuDNN: Accelerating Deep Learning Frameworks with Micro-Batching
NVIDIA cuDNN is a low-level library that provides GPU kernels frequently...
-
High-performance sparse matrix-matrix products on Intel KNL and multicore architectures
Sparse matrix-matrix multiplication (SpGEMM) is a computational primitiv...
-
Combined Spatial and Temporal Blocking for High-Performance Stencil Computation on FPGAs Using OpenCL
Recent developments in High Level Synthesis tools have attracted softwar...