
Matrix Engines for High Performance Computing: A Paragon of Performance or Grasping at Straws?
Matrix engines or units, in different forms and affinities, are becoming...
Scaling Distributed Deep Learning Workloads beyond the Memory Capacity with KARMA
The dedicated memory of hardware accelerators can be insufficient to sto...
The Case for Strong Scaling in Deep Learning: Training Large 3D CNNs with Hybrid Parallelism
We present scalable hybrid-parallel algorithms for training large-scale ...
A Study of Single and Multi-device Synchronization Methods in Nvidia GPUs
GPUs are playing an increasingly important role in general-purpose compu...
A Survey on Coarse-Grained Reconfigurable Architectures from a Performance Perspective
With the end of both Dennard's scaling and Moore's law, computer users a...
High-Performance High-Order Stencil Computation on FPGAs Using OpenCL
In this paper we evaluate the performance of FPGAs for high-order stenci...
AN5D: Automated Stencil Framework for High-Degree Temporal Blocking on GPUs
Stencil computation is one of the most widely-used compute patterns in h...
The Memory Controller Wall: Benchmarking the Intel FPGA SDK for OpenCL Memory Interface
Supported by their high power efficiency and recent advancements in High...
iFDK: A Scalable Framework for Instant High-resolution Image Reconstruction
Computed Tomography (CT) is a widely used technology that requires compu...
A Versatile Software Systolic Execution Model for GPU Memory-Bound Kernels
This paper proposes a versatile high-performance execution model, inspir...
Batched Sparse Matrix Multiplication for Accelerating Graph Convolutional Networks
Graph Convolutional Networks (GCNs) are recently getting much attention ...
A New Linear Time Correctness Condition for Multiplicative Linear Logic
In this paper, we give a new linear time correctness condition for proof...
Adaptive Pattern Matching with Reinforcement Learning for Dynamic Graphs
Graph pattern matching algorithms to handle million-scale dynamic graphs...
Second-order Optimization Method for Large Mini-batch: Training ResNet-50 on ImageNet in 35 Epochs
Large-scale distributed training of deep neural networks suffers from the...
Double-precision FPUs in High-Performance Computing: an Embarrassment of Riches?
Among the (uncontended) common wisdom in High-Performance Computing (HPC...
μ-cuDNN: Accelerating Deep Learning Frameworks with Micro-Batching
NVIDIA cuDNN is a low-level library that provides GPU kernels frequently...
High-performance sparse matrix-matrix products on Intel KNL and multicore architectures
Sparse matrix-matrix multiplication (SpGEMM) is a computational primitiv...
Combined Spatial and Temporal Blocking for High-Performance Stencil Computation on FPGAs Using OpenCL
Recent developments in High Level Synthesis tools have attracted softwar...
Satoshi Matsuoka
