We introduce a method that dramatically reduces fine-tuning VRAM require...
Generating texts with a large language model (LLM) consumes massive amou...
The emergence of the Internet of Things (IoT) has resulted in a remarkab...
As computing hardware becomes more specialized, designing environmentall...
Deep learning recommendation systems serve personalized content under di...
The ability to accurately predict deep neural network (DNN) inference pe...
On-device machine learning (ML) inference can enable the use of private ...
Graph Neural Networks (GNNs) are a class of neural networks designed to ...
This paper proposes Impala, a new cryptographic protocol for private inf...
Autonomous machines (e.g., vehicles, mobile robots, drones) require soph...
Multiparty computation approaches to secure neural network inference tra...
Specialized accelerators are increasingly used to meet the power-perform...
The design of heterogeneous systems that include domain specific acceler...
Post-Moore's law area-constrained systems rely on accelerators to delive...
Repeated off-chip memory accesses to DRAM drive up operating power for d...
The memory wall bottleneck is a key challenge across many data-intensive...
We show that aggregated model updates in federated learning may be insec...
Deep learning recommendation systems must provide high quality, personal...
This work analyzes how attention-based Bidirectional Long Short-Term Mem...
Building domain-specific architectures for autonomous aerial robots is c...
Neural personalized recommendation models are used across a wide variety...
Transformer-based language models such as BERT provide significant accur...
Given recent algorithm, software, and hardware innovation, computing has...
Deep learning based recommendation systems form the backbone of most per...
As the application of deep learning continues to grow, so does the amoun...
The current trend for domain-specific architectures (DSAs) has led to re...
Neural personalized recommendation is the cornerstone of a wide collect...
In recent years, there have been tremendous advances in hardware accelera...
We present a new algorithm for training neural networks with binary acti...
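
As context for the binary-activation entry above: a common way to train networks with binary activations is the straight-through estimator (STE), which binarizes in the forward pass but lets gradients flow in the backward pass. The sketch below is a generic PyTorch illustration of that standard trick, not the algorithm proposed in the paper; the BinaryActivation name and the clipping window are assumptions for the example.

```python
# Generic straight-through-estimator (STE) baseline for binary activations.
# Illustrative only; NOT the paper's training algorithm.
import torch

class BinaryActivation(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)          # constrain activations to {-1, 0, +1}

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # Straight-through estimator: pass the gradient through unchanged,
        # but zero it where |x| > 1 (the usual hard-tanh window).
        return grad_output * (x.abs() <= 1).to(grad_output.dtype)

x = torch.randn(4, 8, requires_grad=True)
y = BinaryActivation.apply(x)
y.sum().backward()                    # gradients flow despite the non-differentiable sign()
```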
Machine learning is experiencing an explosion of software and hardware s...
Conventional hardware-friendly quantization methods, such as fixed-point...
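
For reference, the fixed-point quantization mentioned above typically maps floating-point weights onto a small set of signed integers with a per-tensor scale. The following is a minimal 8-bit sketch of that generic baseline; the quantize_fixed_point helper and its defaults are illustrative assumptions, not the paper's scheme.

```python
# Minimal symmetric 8-bit fixed-point quantization of a tensor (per-tensor scale).
# Generic baseline for illustration; not a specific paper's method.
import numpy as np

def quantize_fixed_point(w):
    qmax = 127                                       # signed 8-bit range
    scale = max(np.max(np.abs(w)) / qmax, 1e-8)      # one scale per tensor
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(64, 64).astype(np.float32)
q, s = quantize_fixed_point(w)
print("max abs error:", np.max(np.abs(w - dequantize(q, s))))
```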
Training deep learning models is compute-intensive and there is an indus...
Low-rank approximation is an effective model compression technique to no...
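
As a generic example of the low-rank approximation mentioned above, an m x n weight matrix can be replaced by two rank-r factors obtained from a truncated SVD, cutting both storage and compute. The sketch below shows that standard baseline; the rank r and the low_rank_factors helper are hypothetical and not tied to the paper's specific scheme.

```python
# Low-rank compression of a weight matrix via truncated SVD (generic sketch).
import numpy as np

def low_rank_factors(w, r):
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    # Keep only the top-r singular components: w ~= (u_r * s_r) @ vt_r.
    a = u[:, :r] * s[:r]        # shape (m, r)
    b = vt[:r, :]               # shape (r, n)
    return a, b

w = np.random.randn(512, 256)
a, b = low_rank_factors(w, r=32)
# Storage drops from 512*256 parameters to 512*32 + 32*256.
print("relative error:", np.linalg.norm(w - a @ b) / np.linalg.norm(w))
```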
Model compression techniques, such as pruning and quantization, are beco...
Pruning is an efficient model compression technique to remove redundancy...
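
As a concrete illustration of pruning, the standard magnitude-based baseline zeroes out the weights with the smallest absolute values until a target sparsity is reached. The sketch below shows that generic approach only; the magnitude_prune helper and the sparsity setting are assumptions for the example, not the paper's criterion or schedule.

```python
# Global magnitude pruning of a weight matrix (generic sketch).
import numpy as np

def magnitude_prune(w, sparsity=0.9):
    k = int(sparsity * w.size)                      # number of weights to remove
    threshold = np.sort(np.abs(w), axis=None)[k]    # k-th smallest magnitude
    mask = np.abs(w) >= threshold
    return w * mask, mask                           # pruned weights and binary mask

w = np.random.randn(256, 256)
pruned, mask = magnitude_prune(w, sparsity=0.9)
print("kept fraction:", mask.mean())
```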
This paper takes the position that, while cognitive computing today reli...
The large memory requirements of deep neural networks limit their deploy...