
Matrix Engines for High Performance Computing: A Paragon of Performance or Grasping at Straws?
Matrix engines or units, in different forms and affinities, are becoming...

Scaling Distributed Deep Learning Workloads beyond the Memory Capacity with KARMA
The dedicated memory of hardware accelerators can be insufficient to sto...

The Case for Strong Scaling in Deep Learning: Training Large 3D CNNs with Hybrid Parallelism
We present scalable hybrid-parallel algorithms for training large-scale ...

A Study of Single and Multi-device Synchronization Methods in Nvidia GPUs
GPUs are playing an increasingly important role in general-purpose compu...

A Survey on Coarse-Grained Reconfigurable Architectures from a Performance Perspective
With the end of both Dennard's scaling and Moore's law, computer users a...

High-Performance High-Order Stencil Computation on FPGAs Using OpenCL
In this paper we evaluate the performance of FPGAs for high-order stenci...

AN5D: Automated Stencil Framework for High-Degree Temporal Blocking on GPUs
Stencil computation is one of the most widely-used compute patterns in h...

The Memory Controller Wall: Benchmarking the Intel FPGA SDK for OpenCL Memory Interface
Supported by their high power efficiency and recent advancements in High...

iFDK: A Scalable Framework for Instant High-resolution Image Reconstruction
Computed Tomography (CT) is a widely used technology that requires compu...

A Versatile Software Systolic Execution Model for GPU Memory-Bound Kernels
This paper proposes a versatile high-performance execution model, inspir...

Batched Sparse Matrix Multiplication for Accelerating Graph Convolutional Networks
Graph Convolutional Networks (GCNs) are recently getting much attention ...

A New Linear Time Correctness Condition for Multiplicative Linear Logic
In this paper, we give a new linear time correctness condition for proof...

Adaptive Pattern Matching with Reinforcement Learning for Dynamic Graphs
Graph pattern matching algorithms to handle million-scale dynamic graphs...

Second-order Optimization Method for Large Mini-batch: Training ResNet-50 on ImageNet in 35 Epochs
Large-scale distributed training of deep neural networks suffer from the...

Double-precision FPUs in High-Performance Computing: an Embarrassment of Riches?
Among the (uncontended) common wisdom in High-Performance Computing (HPC...

μ-cuDNN: Accelerating Deep Learning Frameworks with Micro-Batching
NVIDIA cuDNN is a low-level library that provides GPU kernels frequently...

High-performance sparse matrix-matrix products on Intel KNL and multicore architectures
Sparse matrix-matrix multiplication (SpGEMM) is a computational primitiv...

Combined Spatial and Temporal Blocking for High-Performance Stencil Computation on FPGAs Using OpenCL
Recent developments in High Level Synthesis tools have attracted softwar...
Satoshi Matsuoka