Post-training quantization (PTQ) is the go-to compression technique for ...
This paper accelerates video perception, such as semantic segmentation a...
Quantizing neural networks is one of the most effective methods for achi...
Neural network pruning and quantization techniques are almost as old as neural networks themselves...
Transformer models have been widely adopted in various domains over the ...
Recently, the idea of using FP8 as a number format for neural network training...
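For a rough feel of what an FP8 grid looks like, the sketch below rounds values to a simplified IEEE-style E4M3 layout (4 exponent bits, 3 mantissa bits). This is an illustrative assumption, not the scheme of the paper above: real E4M3 reclaims part of the top exponent range to reach ±448 and has its own special-value rules, all of which this sketch ignores.

```python
import numpy as np

def quantize_fp8_e4m3(x, exp_bits=4, man_bits=3):
    """Round values to a simplified FP8-like grid (no NaN/Inf handling)."""
    bias = 2 ** (exp_bits - 1) - 1                 # 7 for E4M3
    max_exp = 2 ** exp_bits - 2 - bias             # largest normal exponent
    min_exp = 1 - bias                             # smallest normal exponent
    max_val = 2.0 ** max_exp * (2 - 2.0 ** -man_bits)

    x = np.clip(x, -max_val, max_val)
    # Exponent of each value, clamped so tiny values land in the subnormal range.
    exp = np.floor(np.log2(np.maximum(np.abs(x), 1e-45)))
    exp = np.clip(exp, min_exp, max_exp)
    # Spacing between representable numbers at this exponent.
    step = 2.0 ** (exp - man_bits)
    return np.round(x / step) * step
```

The key contrast with integer formats: the step size grows with the magnitude of the value, so FP8 spends its precision near zero rather than uniformly across the range.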
Neural network quantization is frequently used to optimize model size, latency...
Transformer language models such as GPT-2 are difficult to quantize beca...
When quantizing neural networks for efficient inference, low-bit integer...
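For context on what such low-bit integer formats mean in practice: the standard building block is uniform affine quantization, where a real-valued tensor is mapped to integers via a scale and zero-point. A minimal NumPy sketch of the textbook recipe (not the specific scheme of any paper listed here):

```python
import numpy as np

def affine_quantize(x, num_bits=8):
    """Uniform affine (asymmetric) quantization to unsigned integers."""
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = int(round(qmin - x.min() / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def affine_dequantize(q, scale, zero_point):
    return scale * (q.astype(np.float32) - zero_point)

x = np.random.randn(1000).astype(np.float32)
q, s, z = affine_quantize(x)
err = np.abs(x - affine_dequantize(q, s, z)).max()   # worst-case rounding error
```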
In this paper, we introduce a novel method of neural network weight compression...
Federated Learning (FL) is a machine learning paradigm to distributively...
When training neural networks with simulated quantization, we observe th...
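The line above refers to simulated ("fake") quantization: weights stay in floating point, but the forward pass rounds them onto the integer grid while gradients bypass the rounding via a straight-through estimator. A minimal PyTorch sketch of that standard trick (the symmetric per-tensor scale is an illustrative choice, not this paper's method):

```python
import torch

def fake_quantize(w, num_bits=4):
    """Simulated quantization with a straight-through estimator (STE)."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = w.detach().abs().max() / qmax          # symmetric per-tensor scale
    w_q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
    # Forward uses w_q; backward sees identity, so gradients flow to w.
    return w + (w_q - w).detach()

w = torch.randn(64, 64, requires_grad=True)
loss = fake_quantize(w).pow(2).sum()
loss.backward()                                    # grads reach w via the STE
```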
Current methods for pruning neural network weights iteratively apply magnitude-based pruning...
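For reference, iterative magnitude pruning in its simplest form repeatedly zeroes out the smallest-magnitude weights and re-trains the survivors. A bare-bones NumPy sketch (the sparsity schedule and fine-tuning step are placeholders, not the method of the paper above):

```python
import numpy as np

def magnitude_prune(w, sparsity):
    """Zero out the smallest-magnitude weights until `sparsity` is reached."""
    k = int(sparsity * w.size)
    threshold = np.sort(np.abs(w).ravel())[k]      # k-th smallest magnitude
    mask = np.abs(w) >= threshold
    return w * mask, mask

w = np.random.randn(256, 256)
for s in (0.5, 0.7, 0.9):                          # iteratively raise sparsity
    w, mask = magnitude_prune(w, s)
    # ...fine-tune the remaining weights here before the next pruning step...
```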
While neural networks have advanced the frontiers in many machine learni...
We propose a method to compress full-resolution video sequences with imp...
Transformer-based architectures have become the de-facto standard models...
While neural networks have advanced the frontiers in many applications, ...
Quantization techniques applied to the inference of deep neural networks...
We introduce Bayesian Bits, a practical method for joint mixed precision...
When quantizing neural networks, assigning each floating-point weight to...
Unlike ReLU, newer activation functions (like Swish, H-swish, Mish) that...
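The activation functions named above have simple closed forms, and unlike ReLU they can produce negative outputs, which changes the activation ranges a quantizer must cover. For reference, their standard definitions (not tied to any particular paper here):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swish(x):            # a.k.a. SiLU: x * sigmoid(x)
    return x * sigmoid(x)

def hard_swish(x):       # piecewise-linear variant used in MobileNetV3
    return x * np.clip(x + 3.0, 0.0, 6.0) / 6.0

def mish(x):             # x * tanh(softplus(x))
    return x * np.tanh(np.log1p(np.exp(x)))

x = np.linspace(-4, 4, 9)
# Unlike ReLU, all three dip below zero for negative inputs.
```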
The success of deep neural networks in many real-world applications is l...
We introduce a data-free quantization method for deep neural networks th...
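One well-known ingredient of data-free quantization is cross-layer weight equalization: rescale the output channels of one layer and the matching input channels of the next so their per-channel weight ranges match, which leaves the network function unchanged when a positively homogeneous activation such as ReLU sits in between. A NumPy sketch for two linear layers (names are illustrative, and whether this matches the truncated abstract's actual method is an assumption):

```python
import numpy as np

def equalize_pair(w1, b1, w2):
    """Cross-layer equalization for y = ReLU(w1 @ x + b1); z = w2 @ y.

    Since ReLU(s * t) == s * ReLU(t) for s > 0, dividing channel i of
    layer 1 by s_i and multiplying the matching input of layer 2 by s_i
    preserves the output while equalizing per-channel weight ranges.
    """
    r1 = np.abs(w1).max(axis=1)        # per-output-channel range of layer 1
    r2 = np.abs(w2).max(axis=0)        # per-input-channel range of layer 2
    s = np.sqrt(r1 * r2) / r2          # assumes all ranges are non-zero
    return w1 / s[:, None], b1 / s, w2 * s[None, :]

rng = np.random.default_rng(0)
w1, b1, w2 = rng.normal(size=(8, 4)), rng.normal(size=8), rng.normal(size=(3, 8))
w1e, b1e, w2e = equalize_pair(w1, b1, w2)
x = rng.normal(size=4)
orig = w2 @ np.maximum(w1 @ x + b1, 0)
eq = w2e @ np.maximum(w1e @ x + b1e, 0)
assert np.allclose(orig, eq)           # the network function is unchanged
```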