Fast matrix multiplication for binary and ternary CNNs on ARM CPU

05/18/2022
by Anton Trusov, et al.

Low-bit quantized neural networks (QNNs) are of great interest in practical applications because they significantly reduce the consumption of both memory and computational resources. Binary neural networks (BNNs) are memory- and computation-efficient, since they require only one bit per weight and activation and can be computed using Boolean logic and bit-count operations. QNNs with ternary weights and activations (ternary neural networks, TNNs) and with binary weights and ternary activations (ternary-binary networks, TBNs) aim to improve recognition quality compared to BNNs while preserving a low bit-width. However, their efficient implementation is usually considered only on ASICs and FPGAs, which limits their applicability in real-life tasks. At the same time, one of the areas where efficient recognition is most in demand is recognition on mobile devices using their CPUs, yet no fast implementations of TBNs and TNNs are known for this setting; only the daBNN library provides fast BNN inference. In this paper, we propose novel fast algorithms for ternary, ternary-binary, and binary matrix multiplication on mobile devices with the ARM architecture. In our algorithms, ternary weights are represented using a 2-bit encoding and binary weights using a single bit. This allows us to replace multiplications with Boolean logic operations that process 128 bits at once using the ARM NEON SIMD extension. The multiplication results are accumulated in 16-bit integer registers. We also apply a special reordering of the values of the left and right matrices. Together, these techniques let us compute the matrix product efficiently while minimizing the number of loads and stores compared to the algorithm from daBNN. Our algorithms can be used to implement inference of the convolutional and fully connected layers of TNNs, TBNs, and BNNs. We evaluate them experimentally on an ARM Cortex-A73 CPU and compare their inference speed with that of efficient implementations of full-precision, 8-bit, and 4-bit quantized matrix multiplication.
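
To make the idea concrete, the sketch below shows minimal NEON inner kernels for binary and ternary dot products of the kind the abstract describes. It is an illustration only: the bit-plane encoding for ternary values (+1 sets a "plus" bit, -1 sets a "minus" bit, 0 sets neither), the function names binary_dot and ternary_dot, and the flat vector layout are all assumptions for this sketch; the paper's actual 2-bit encoding, matrix reordering, and blocking scheme are not reproduced here.

#include <arm_neon.h>
#include <stddef.h>
#include <stdint.h>

/* Binary (+1/-1) dot product: bit 1 encodes +1, bit 0 encodes -1,
 * so dot(a, b) = n - 2 * popcount(a XOR b). n_bits is assumed to be a multiple of 128. */
static int32_t binary_dot(const uint8_t *a, const uint8_t *b, size_t n_bits) {
    uint16x8_t acc = vdupq_n_u16(0);
    for (size_t i = 0; i < n_bits / 8; i += 16) {            /* 128 values per iteration */
        uint8x16_t x = veorq_u8(vld1q_u8(a + i), vld1q_u8(b + i));
        acc = vpadalq_u8(acc, vcntq_u8(x));                   /* per-byte popcount, widened into 16-bit lanes */
    }
    return (int32_t)n_bits - 2 * (int32_t)vaddlvq_u16(acc);
}

/* Ternary dot product over the assumed bit-plane encoding:
 * (+1)(+1) and (-1)(-1) contribute +1; (+1)(-1) and (-1)(+1) contribute -1; zeros contribute nothing. */
static int32_t ternary_dot(const uint8_t *a_plus, const uint8_t *a_minus,
                           const uint8_t *b_plus, const uint8_t *b_minus,
                           size_t n_bits) {
    uint16x8_t acc_pos = vdupq_n_u16(0);                      /* 16-bit accumulators, as in the abstract */
    uint16x8_t acc_neg = vdupq_n_u16(0);
    for (size_t i = 0; i < n_bits / 8; i += 16) {
        uint8x16_t ap = vld1q_u8(a_plus + i),  am = vld1q_u8(a_minus + i);
        uint8x16_t bp = vld1q_u8(b_plus + i),  bm = vld1q_u8(b_minus + i);
        uint8x16_t pos = vorrq_u8(vandq_u8(ap, bp), vandq_u8(am, bm));   /* products equal to +1 */
        uint8x16_t neg = vorrq_u8(vandq_u8(ap, bm), vandq_u8(am, bp));   /* products equal to -1 */
        acc_pos = vpadalq_u8(acc_pos, vcntq_u8(pos));
        acc_neg = vpadalq_u8(acc_neg, vcntq_u8(neg));
    }
    return (int32_t)vaddlvq_u16(acc_pos) - (int32_t)vaddlvq_u16(acc_neg);
}

Accumulating with vpadalq_u8 into 16-bit lanes mirrors the 16-bit accumulation mentioned above; for very deep dot products those lanes would eventually overflow, so a production kernel would flush them to wider registers periodically and would also pack and reorder matrix blocks so the kernels read contiguous memory, both of which are omitted from this sketch.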

Related research:
- 09/14/2020: Fast Implementation of 4-bit Convolutional Neural Networks for Mobile Devices
- 05/19/2017: Espresso: Efficient Forward Propagation for BCNNs
- 10/01/2020: BCNN: A Binary CNN with All Matrix Ops Quantized to 1 Bit Precision
- 09/04/2019: Engineering Boolean Matrix Multiplication for Multiple-Accelerator Shared-Memory Architectures
- 02/19/2020: Fast Implementation of Morphological Filtering Using ARM NEON Extension
- 10/25/2018: Automating Generation of Low Precision Deep Learning Operators
- 11/13/2022: FullPack: Full Vector Utilization for Sub-Byte Quantized Inference on General Purpose CPUs
