
CoSA: Scheduling by Constrained Optimization for Spatial Accelerators
Recent advances in Deep Neural Networks (DNNs) have led to active develo...
Avoiding Communication in Logistic Regression
Stochastic gradient descent (SGD) is one of the most widely used optimiz...
Training EfficientNets at Supercomputer Scale: 83% Accuracy in One Hour
EfficientNets are a family of state-of-the-art image classification mode...
The Limit of the Batch Size
Large-batch training is an efficient approach for current distributed de...
Communication-Optimal Tilings for Projective Nested Loops with Arbitrary Bounds
Reducing communication, either between levels of a memory hierarchy or ...
Auto-Precision Scaling for Distributed Deep Learning
In recent years, large-batch optimization is becoming the key of distrib...
An improved analysis and unified perspective on deterministic and randomized low rank matrix approximations
We introduce a Generalized LU-Factorization (GLU) for low-rank matrix ap...
A Generalized Randomized Rank-Revealing Factorization
We introduce a Generalized Randomized QR-decomposition that may be appli...
Reducing BERT Pre-Training Time from 3 Days to 76 Minutes
Large-batch training is key to speeding up deep neural network training ...
Large-Batch Training for LSTM and Beyond
Large-batch training approaches have enabled researchers to utilize larg...
A 3D Parallel Algorithm for QR Decomposition
Interprocessor communication often dominates the runtime of large matrix...
Accurate, Fast and Scalable Kernel Ridge Regression on Parallel and Distributed Systems
We propose two new methods to address the weak scaling problems of KRR: ...
Communication-Optimal Convolutional Neural Nets
Efficiently executing convolutional neural nets (CNNs) is important in m...
Avoiding Synchronization in First-Order Methods for Sparse Convex Optimization
Parallel computing has played an important role in speeding up convex op...
Avoiding Communication in Proximal Methods for Convex Optimization Problems
The fast iterative soft thresholding algorithm (FISTA) is used to solve ...
ImageNet Training in Minutes
Finishing 90-epoch ImageNet-1k training with ResNet-50 on a NVIDIA M40 G...
Communication Lower Bounds of Bilinear Algorithms for Symmetric Tensor Contractions
Accurate numerical calculations of electronic structure are often domina...
James Demmel
EECS Department Chair and Professor of Mathematics and Computer Science at the University of California, Berkeley