
-
Systolic Computing on GPUs for Productive Performance
We propose a language and compiler to productively build high-performanc...
-
MISIM: An End-to-End Neural Code Similarity System
Code similarity systems are integral to a range of applications from cod...
-
Context-Aware Parse Trees
The simplified parse tree (SPT) presented in Aroma, a state-of-the-art c...
-
K-TanH: Hardware Efficient Activations For Deep Learning
We propose K-TanH, a novel, highly accurate, hardware efficient approxim...
-
A Study of BFLOAT16 for Deep Learning Training
This paper presents the first comprehensive empirical study demonstratin...
-
SysML: The New Frontier of Machine Learning Systems
Machine learning (ML) techniques are enjoying rapidly increasing adoptio...
-
Mixed Precision Training of Convolutional Neural Networks using Integer Operations
The state-of-the-art (SOTA) for mixed precision training is dominated by...
-
On Scale-out Deep Learning Training for Cloud and HPC
The exponential growth in use of large deep neural networks has accelera...
-
Galactos: Computing the Anisotropic 3-Point Correlation Function for 2 Billion Galaxies
The nature of dark energy and the complete theory of gravity are two cen...
-
Deep Learning at 15PF: Supervised and Semi-Supervised Classification for Scientific Data
This paper presents the first, 15-PetaFLOP Deep Learning system for solv...
-
Ternary Residual Networks
Sub-8-bit representations of DNNs incur some discernible loss of accuracy...
-
Ternary Neural Networks with Fine-Grained Quantization
We propose a novel fine-grained quantization (FGQ) method to ternarize p...
-
Parallelizing Word2Vec in Multi-Core and Many-Core Architectures
Word2vec is a widely used algorithm for extracting low-dimensional vecto...
-
Faster CNNs with Direct Sparse Convolutions and Guided Pruning
Phenomenally successful in practical inference problems, convolutional n...
-
Parallelizing Word2Vec in Shared and Distributed Memory
Word2Vec is a widely used algorithm for extracting low-dimensional vecto...
-
BlackOut: Speeding up Recurrent Neural Network Language Models With Very Large Vocabularies
We propose BlackOut, an approximation algorithm to efficiently train mas...