
Energy Efficient Hardware Acceleration of Neural Networks with Power-of-Two Quantisation

09/30/2022
by   Dominika Przewlocka-Rus, et al.
AGH

Deep neural networks virtually dominate the domain of most modern vision systems, providing high performance at the cost of increased computational complexity. Since such systems are often required to operate both in real time and with minimal energy consumption (e.g., wearable devices, autonomous vehicles, edge Internet of Things (IoT), sensor networks), various network optimisation techniques are used, e.g., quantisation, pruning, or dedicated lightweight architectures. Due to the logarithmic distribution of weights in neural network layers, Power-of-Two (PoT) quantisation, whose quantisation levels also follow a logarithmic distribution, provides high performance even with a significant reduction in computational precision (4-bit weights and less). This method additionally makes it possible to replace the Multiply and ACcumulate (MAC) units typical of neural networks (performing, e.g., convolution operations) with more energy-efficient Bitshift and ACcumulate (BAC) units. In this paper, we show that a hardware neural network accelerator with PoT weights implemented on the Zynq UltraScale+ MPSoC ZCU104 SoC FPGA can be at least 1.4x more energy efficient than the uniform quantisation version. To further reduce the actual power requirement by omitting part of the computation for zero weights, we also propose a new pruning method adapted to logarithmic quantisation.


1 Introduction

Computer vision algorithms are a central component of many of the most modern and emerging technologies, such as autonomous vehicles (drones, cars), augmented and virtual reality, and advanced driver assistance systems (ADAS). These systems must be characterised by three features: reliability, real-time operation, and, due to the battery power source, energy efficiency. The highest performance of vision systems is currently delivered by machine learning algorithms, in particular deep neural networks, which at the same time are known for their high computational complexity. To guarantee real-time operation, these calculations need to be massively parallelised, with the third factor, the requirement for energy efficiency, taken into account. Since regular CPUs are not designed for such multi-threaded operations, and high-end GPUs consume hundreds of watts, embedded devices seem to be a promising solution. However, these platforms offer relatively modest computational resources, and neural network acceleration requires some model optimisation. The most commonly used methods to reduce the number of parameters, as well as the computational complexity of the network, are quantisation and pruning.

Reducing the precision of computations using quantisation leads to a significant reduction in memory requirements; e.g., going from 32-bit floating-point numbers to an 8-bit integer representation reduces the needed storage fourfold. It also simplifies the computations themselves (multiplying integers is less complex than multiplying floating-point numbers). The gain is greater with a more aggressive reduction in precision, such as to 4 or even fewer bits. To avoid a clear decrease in network performance, one should use a quantisation method that takes into account the logarithmic distribution of weights in a neural network layer, which are concentrated around zero and more sparse elsewhere. One possibility is to use weights that are powers of two (PoT weights), see Fig. 1. Such quantisation has another major advantage, as it allows the multiplication operations (in convolution filters and fully connected layers) to be converted into very simple bit-shift operations. Another method for reducing the memory complexity of the network is pruning, which zeroes very small weights, as they have little effect on the final result of input processing by the network.
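To make the multiplication-to-shift substitution concrete, below is a minimal Python sketch of our own (not code from the paper); the number of accumulator fractional bits is an assumed, illustrative choice. Multiplying an integer activation by a PoT weight w = ±2^(-s) amounts to a right bit shift of its fixed-point representation.

a = 93                # example 8-bit integer activation
s = 3                 # stored bit-shift value, i.e. |w| = 2^-3 = 0.125
sign = -1             # weight sign, kept as a separate bit

FRAC_BITS = 6         # fractional bits of the accumulator (illustrative assumption)
a_fx = a << FRAC_BITS                  # activation in fixed-point form
partial = sign * (a_fx >> s)           # shift and negate instead of multiply

assert partial / 2 ** FRAC_BITS == sign * a * 2 ** -s   # -11.625, exactly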

In this paper, we focus on demonstrating the energy gains for FPGA-based (Field Programmable Gate Array) neural network accelerators with PoT weights. FPGA devices are programmed by defining connections between the electronic components available in the device. They allow massive parallelisation of operations and usually consume only a few watts. With appropriate model optimisation methods, FPGAs can satisfy all of the constraints discussed above for vision systems.

The main contributions of this paper are:

  • Design of a Bitshift ACcumulate (BAC) module implementing convolutional filtering with PoT weights, thus with bit shifting instead of multiplication, along with appropriate encoding of the weights.

  • Implementation and comparison of hardware accelerators of a convolution layer based on uniform and logarithmic quantisation, in particular demonstrating a significant reduction in power demand when using PoT weights, also at high clock frequencies.

  • A pruning algorithm adapted to logarithmic quantisation, so that the computational complexity and thus the energy requirement can be further reduced.

The remainder of the article is organised as follows. In Section 2 we describe the algorithmic basis of PoT quantisation. In Section 3, we present a brief summary of related work. Section 4 describes the hardware implementation of convolution filter modules, which are then used to implement a hardware accelerator, the benchmark results of which are summarised in Section 4.1. The proposed pruning algorithm is described in Section 4.2. Section 5 summarises the results of the experiments performed.

2 Power-of-Two Quantisation

By default (e.g. in high-end GPUs), neural network computations are implemented using at least 32-bit floating point numbers. Quantisation involves reducing the precision of the weights, but also the activations, in such a way that they can be represented on fewer bits, in particular as integers. The most common quantisation method is uniform quantisation, which for a given set of numbers (e.g., weights in a neural network layer) defines equidistant quantisation levels. Logarithmic quantisation works differently, guaranteeing a higher density of quantisation levels around zero and thus mimicking well the distribution of weights in a neural network.

Logarithmic quantisation is defined by Equation (1), where $w$ is the floating point weight, $b$ is the weight bit width, and the Full Scale Range (FSR) constant is responsible for determining the minimum and maximum quantisation levels.

$$w_q = \mathrm{LogQuant}(w) = \begin{cases} 0, & w = 0,\\ \mathrm{sign}(w)\cdot 2^{\tilde{w}}, & \text{otherwise,} \end{cases} \qquad \tilde{w} = \mathrm{clip}\big(\mathrm{round}(\log_2 |w|),\, \mathrm{FSR} - 2^{b-1} + 2,\, \mathrm{FSR}\big) \qquad (1)$$

Following [pot:DPR], the weights of a given network layer are first normalised so that their absolute values lie in the interval (0, 1]: w_n = w / SF, where SF is a Scaling Factor equal to the maximum absolute weight value in the layer. Then, w_n is transformed according to Eq. (1) and the quantised weights w_q are obtained. The w_q are powers of two, allowing the multiplications in the convolution operation to be reduced to bit shifts, and thus greatly simplifying the computations.

Consider an example convolution filter

The subsequent steps for 4 bit width quantisation will be as follows:

The next step is to transform the weights according to Eq. (1); only the absolute values of the weights are quantised. We assume the value of FSR equal to 0, which corresponds to the maximum absolute value of the normalised weights (2^0 = 1). For the exemplar filter, we obtain:

The exponent vector is in fact a vector of bit-shift values used for the simplified multiplication with the input activation. The direction of the shift is fixed for all weights (the minus sign of the exponent corresponds to a shift to the right), so it can be omitted. In its place, we insert the weight sign, so the exponent vector now has the form:

Thus, each weight can be stored on 4 bits: the first for the weight sign and the rest for the bit shift value.

SF is constant across the layer (the same for all filters in the layer), so it can be combined with the batch normalisation gain and will not introduce additional computational load.
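Since the numerical values of the example filter are not reproduced above, the following Python sketch of our own illustrates the whole procedure on a hypothetical 3x3 filter; the helper names, the clipping range and the choice of the spare code used for the zero weight are assumptions based on the description in this section (FSR = 0, bit shifts from 0 to 6, sign stored separately).

import numpy as np

def pot_quantise(weights, bits=4, fsr=0):
    # Sketch of the PoT quantisation described above (not the authors' code).
    w = np.asarray(weights, dtype=np.float64)
    sf = np.max(np.abs(w))                      # Scaling Factor (per layer)
    wn = w / sf                                 # normalised weights, |wn| <= 1

    max_shift = 2 ** (bits - 1) - 2             # largest bit shift, 6 for 4-bit weights
    nonzero = wn != 0
    exp = np.zeros_like(wn)
    exp[nonzero] = np.clip(np.round(np.log2(np.abs(wn[nonzero]))),
                           fsr - max_shift, fsr)

    wq = np.where(nonzero, np.sign(wn) * 2.0 ** exp, 0.0)      # PoT weights (normalised scale)
    sign = (wn < 0).astype(int)                                 # 1 bit for the weight sign
    shift = np.where(nonzero, -exp, max_shift + 1).astype(int)  # assumed spare code 7 encodes zero
    return wq, sign, shift

# Hypothetical 3x3 filter, values chosen only for illustration
f = [[ 0.42, -0.07,  0.00],
     [-0.91,  0.24,  0.03],
     [ 0.11, -0.56,  0.65]]
wq, sign, shift = pot_quantise(f)

To recover the actual weight values, wq is multiplied back by SF which, as noted above, can be folded into the batch normalisation gain.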

Figure 1: The left plot shows the distribution of weights in an exemplar layer of ResNet18 trained on ImageNet, whose logarithmic shape motivates the use of the Power-of-Two quantisation method. The weights' distribution after quantisation to 4 bit width is shown on the right.

3 Related Work

Weight quantisation is the basic method for reducing the computational and memory complexity of neural networks. The variety of existing approaches is well surveyed in [gholami21], with a comparison of uniform and non-uniform quantisation, different quantisation granularities (layerwise, channelwise, etc.), symmetry, and quantisation algorithms (Quantisation Aware Training or Post Training Quantisation).

There are several papers on logarithmic quantisation in the literature, in particular on the quantisation algorithm itself or on Quantisation Aware Training (QAT). Logarithmic quantisation for neural networks was first introduced in [Miyashita16]. The paper [deepshift19] proposes an approach in which the network learns the bit shift values. The authors implemented GPU kernels for PoT-weighted computation and showed gains over standard multiplication-based filtering. The work of [APoT19] extends the PoT weights to Additive-Powers-of-Two, thereby increasing the resolution of the quantisation levels. Although this solution represents the state of the art in terms of achieved accuracy, as shown in [pot:DPR], the need to sum two partial bit shift results, as well as the additional logic to decode the weights, introduces a higher degree of complexity compared to PoT quantisation. The latter paper also proposes learning PoT networks using STE (Straight Through Estimator: quantised weights are used for the forward pass, while a full precision version is used for backpropagation) and demonstrates gains from using bit shifting over multiplication.

The papers [9577718] and [Chmiel21] are also devoted to learning algorithms, as is [Vogel18], where the authors extend the considerations to a hardware implementation of an accelerator based on logarithmic quantisation. The proposed processing element (PE) based on bit shifting for 5 bit weights allowed for a reduction in power consumption compared to a PE with an 8 bit fixed-point multiplier, for an implementation on a Xilinx Virtex 7 FPGA. However, it also introduced additional operations resulting from the use of a logarithmic base different from 2. Similarly, in [Xu20] the authors drop the logarithmic base 2 and show energy gains for 5-bit logarithmic quantisation over a standard 16-bit multiplier. Finally, the article [Lee17] presents dynamic power consumption for a hardware implementation of matrix multiplication for uniformly and PoT quantised values. With 3 bit width weights, one can assume a decrease in power consumption of about , and for 5 bits of about , in favour of PoT weights.

This paper complements the research described above by showing energy gains resulting from the use of PoT weights for a hardware accelerator implemented in FPGA.

4 Hardware Design

To compare the acceleration of a convolution layer for networks with uniform and PoT quantisation, two modules performing the convolution operation were designed: one based on a Multiply and ACcumulate (MAC, Fig. 2) unit and the second on a Bitshift and ACcumulate (BAC, Fig. 3) unit. In both cases, a test has been added to check whether the weight is equal to zero, which allows the multiplication (or bit shift) to be skipped and thus reduces the actual power consumption. The obvious difference between the modules is the way the multiplication is performed, simplified to a bit shift for the PoT weights.

An additional aspect that needs to be solved is how to encode the logarithmic weights, which consist of a sign and an absolute bit-shift value, and are therefore not written in two's complement code. First, it is necessary to distinguish between a zero weight and a weight with a power equal to 0 (i.e., with absolute value 2^0 = 1). According to Equation (1), for weights encoded on 4 bits we obtain 14 different non-zero levels (the signed bit-shift values {-6, -5, …, -0, +0, …, 5, 6}), which therefore leaves the possibility of encoding the zero weight in two ways. Second, the sign of the weight is encoded independently of the bit-shift value itself, so the partial product accumulation must distinguish between the case where the weight is negative (in which case a subtraction occurs) and positive (an addition).

Both modules, MAC and BAC, operate with the same latency, i.e., they need two clock cycles to compute the partial weighted sum of the inputs. The computation is pipelined, which means that the convolution result for a filter is obtained after a number of clock cycles corresponding to the number of filter weights (plus the latency of the unit). Both designs have clock, enable, and reset (for accumulator reset) signals, which are used by the external module to control the data flow.

Figure 2: Schematic of a simple Multiply Accumulate unit with zero weight detection.
Figure 3: Schematic of the Bitshift Accumulate unit with zero weight detection and proper weight encoding. For bit shifting we use only the absolute value of the weight (we always shift to the right). After the shift, the weight sign bit decides whether the obtained partial product is added to or subtracted from the accumulator (this is necessary since the weights are not coded in two's complement).
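As a behavioural reference for the BAC unit in Fig. 3, the following Python model of our own (a sketch, not the actual HDL) mimics a single accumulation step: the assumed spare code for a zero weight is skipped, the absolute shift value is applied to the activation, and the sign bit selects between addition and subtraction.

ZERO_CODE = 7          # assumed spare 3-bit code reserved for the zero weight

def bac_step(acc, activation, sign_bit, shift_code, frac_bits=6):
    # One Bitshift and ACcumulate step (behavioural sketch, not the RTL).
    if shift_code == ZERO_CODE:                          # zero weight detected: skip the work
        return acc
    partial = (activation << frac_bits) >> shift_code    # right shift = multiply by 2^-shift
    return acc - partial if sign_bit else acc + partial  # sign bit selects add or subtract

# Toy example: a 1x3 filter applied to three activations
weights = [(0, 1), (1, 3), (0, ZERO_CODE)]   # (sign, shift) pairs; the last weight is zero
acts = [17, 40, 99]

acc = 0
for (s, sh), a in zip(weights, acts):
    acc = bac_step(acc, a, s, sh)
print(acc / 2 ** 6)                          # 17*0.5 - 40*0.125 = 3.5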

4.1 Benchmark Results

The proposed modules were implemented in the programmable logic (FPGA) of the Zynq UltraScale+ MPSoC ZCU104 platform using the Vivado 2020.2 software. For a proper and fair comparison of the two modules, the use of DSPs (customarily used for multiplication operations) was disabled in the synthesis process, which forced the implementation of the multiplication using LUTs for the MAC unit. Finally, the Area Explore implementation strategy was chosen, although it should be noted that for the others (Power Explore, default) the obtained results did not differ significantly, due to the low complexity of the proposed designs. Table 1 shows the resource usage for single MAC and BAC modules, assuming that the network weights are, regardless of the method, quantised to 4 bits and the activations to 8 bits. Despite the additional logic needed to accumulate the bit-shift results depending on the sign of the weight, the BAC module needs about 30% fewer LUTs. The FFs (flip-flops) implement registers (memory cells), and the difference in their use, although slight, is also due to the change from a multiplication operation to a bit shift.

PE                  LUT   FF
Uniform (MAC) 8x4   63    45
PoT (BAC) 8x4       44    41
Table 1: Comparison of the post-implementation resources used for single MAC and BAC units with 8 (input activation) x 4 (weight) bit widths.

To investigate the power requirements of neural network accelerators based on the analysed quantisation methods, a sample convolution layer module (from the ResNet family of networks) was designed, consisting of filters of size . The construction of a complete accelerator is a separate challenge, and several approaches are encountered in the literature (as well as in commercial solutions), including the implementation of a "universal" layer, i.e., one configurable in such a way that it can perform the computations of both the "smallest" and the "largest" layer. Since most modern networks have a regular structure, this usually comes down to implementing the "largest" layer. Such an accelerator then iteratively computes the results of each layer, using memory (BRAM or DRAM, i.e., Block or Distributed RAM) to read and write the results of intermediate layers. This approach makes it possible to implement deep networks even on small embedded devices; for devices with more resources, multiple layers can be implemented instead of a single universal layer, with pipelined computation between them, thus reducing the number of reads from and writes to memory. In view of the above, the implementation of a single layer of representative size allows the performance of the entire accelerator to be demonstrated.
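The iterative reuse of a single "largest" layer described above can be sketched, in simplified Python pseudocode of our own, as the control loop below; the buffer handling and the hw_layer interface (load_weights, compute) are hypothetical names, not part of the actual design.

def run_network(input_frame, layer_configs, hw_layer):
    # Iteratively reuse one configurable hardware layer module for all network layers.
    # Intermediate results are written to and read back from on-chip memory
    # (BRAM in the real design); here a plain variable stands in for that buffer.
    feature_maps = input_frame
    for cfg in layer_configs:                    # one pass of the hardware layer per network layer
        hw_layer.load_weights(cfg.weights)       # reconfigure the module for the current layer
        feature_maps = hw_layer.compute(feature_maps, cfg)
    return feature_maps                          # final feature maps / logits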

For both layers based on uniform and PoT quantisation, a similar test design was proposed.

  • we implemented a layer module, consisting of units, with input clock, enable and reset signals, and input weights for each filter (unit) and input activation,

  • we created a wrapper, with BRAM instances for weights and activation, logic controlling value reading and MAC/BAC units, as well as a clock signal generator,

  • in order to confirm the correctness of the calculations (and also to prevent the optimisation away of the outputs of unconnected modules; Vivado performs multiple optimisations during design synthesis, which remove unused elements), we added an integrated logic analyser (ILA) to the layer output data.

Table 2 shows the resource usage and power consumption for both layers (excluding the other elements of the test design). The clock was set to 350 MHz and the dynamic power consumption was estimated using the Vivado tools, based on the implemented design. Using PoT weights, and thus replacing the multiplication typical for convolution operations with a bit shift, allowed for an increase in power efficiency of at least 1.4x. At the same time, it should also be noted, as shown in [pot:DPR], that the accuracy of low-precision networks (e.g., 4 bit width) is much higher with logarithmic quantisation than with a linear distribution of quantisation levels. The above also confirms the synthesis results for single MAC and BAC units presented in [pot:DPR].

PE                  LUT     FF      Clock [MHz]   Power [W]
Uniform (MAC) 8x4   24193   23265   350
PoT (BAC) 8x4       19557   17024   350
Table 2: Resource usage and dynamic power consumption (estimated using Vivado 2020.2) for the MAC and BAC based layers with filters.

The actual processing time of an input frame depends on the latency of the modules (identical for the MAC and BAC units), as well as on the clock frequency. To investigate the approximate limit of this frequency, we conducted a series of experiments, summarised in Fig. 4, showing the relationship between the clock rate and the dynamic power consumed by the convolution layer module. For the MAC layer, the highest operating frequency is MHz; for the higher value of MHz, the implementation failed to meet the timing constraints. For the BAC layer, a similar problem occurred only at MHz.

Figure 4: Relation between the clock frequency and power consumption for both the uniform and the PoT layer: for higher frequencies the gap between the MAC and the BAC layer is more visible, with the BAC layer consuming around of the power needed by the MAC layer.

As described in Section 3, most of the existing works show decreases in the power used relative to standard (e.g. 16 bits) multipliers. For low-precision computation processing elements, the work of [Lee17] indicated a decrease of about , with 5 bits – , in favour of PoT weights over uniformly quantised ones. Our implementation allows for a reduction in dynamic power at 4 bit width weights. Because the aforementioned publication is sparing in its description of the design, we are unable to make a more accurate comparison.

4.2 Going Lower on Power - Pruning

A further reduction in energy consumption can be achieved by introducing more zero weights through pruning. Uniform quantisation of weights with very low precision (e.g., 4 bit width) introduces automatic pruning that does not affect the number of quantisation steps in any way: values below the minimum non-zero quantised weight are rounded to zero, resulting in a relatively high percentage of zero weights. This is different for logarithmic quantisation: due to the dense distribution of quantisation levels around zero, zeroing out weights with low absolute values discards the lowest quantisation levels and thus reduces the number of levels actually used. This can significantly affect the accuracy of the network at very low computational precision.

At the same time, properly performed pruning can zero out up to a few tens of percent of the weights while having little effect on the final network output, which in turn can bring further gains in the hardware implementation (on top of those from replacing multiplication with a bit shift):

  • In [pot:DPR], the authors show that zeroing out the weights allows compression and thus reduces the memory requirements (which also implies a decrease in energy requirements).

  • The designed MAC and BAC units skip the multiplication/bit-shift operation when a zero weight is detected. Depending on the network architecture and the degree of pruning, this will result in smaller or larger energy gains during system operation.

In view of the above, a pruning algorithm adapted to the peculiarities of logarithmic quantisation is proposed, introducing the so-called dead zone around zero (in the distribution of weights of a given layer) by the following modification of the algorithm described in Section 2:

  1. After normalising the absolute values of the weights to the interval [0, 1], we apply pruning with a fixed degree given by the Pruning Factor (PF), defined relative to the weight with the maximum modulus (i.e. 1).

  2. We normalise the pruned weights, excluding zero weights, to the interval .

  3. We perform logarithmic quantisation.

Thus, the value of the Scaling Factor is equal to , where is different from zero. To reconstruct the proper quantised weight value, we must then introduce an offset equal to , so now . Thus, we move the minimum quantisation level away from zero, creating a large enough ’dead zone’ while keeping the logarithmic distribution outside of it – Fig. 5.
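The modified pipeline can be sketched in Python as follows (our own simplified illustration; the survivor renormalisation and the exponent offset described above are omitted, so the sketch only shows the dead zone followed by the PoT quantisation of Section 2).

import numpy as np

def pot_quantise_with_dead_zone(weights, bits=4, pf=0.05):
    # Dead-zone pruning followed by PoT quantisation (simplified sketch).
    w = np.asarray(weights, dtype=np.float64)
    wn = w / np.max(np.abs(w))                   # step 1a: normalise, |wn| <= 1
    keep = np.abs(wn) >= pf                      # step 1b: prune weights inside the dead zone

    max_shift = 2 ** (bits - 1) - 2              # largest bit shift, 6 for 4-bit weights
    exp = np.clip(np.round(np.log2(np.abs(wn[keep]))), -max_shift, 0)

    wq = np.zeros_like(wn)                       # pruned weights stay exactly zero
    wq[keep] = np.sign(wn[keep]) * 2.0 ** exp    # PoT levels outside the dead zone
    return wq

# The fraction of zero weights grows with the Pruning Factor, cf. Table 3.
w = np.random.randn(64, 64) * 0.1
print((pot_quantise_with_dead_zone(w, pf=0.1) == 0).mean())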

Figure 5: Visualisation of the proposed pruning method: the left plot shows the distribution of 4-bit width PoT quantised weights of an exemplar layer of ResNet18 trained on ImageNet. After applying pruning, the minimum quantisation value is shifted away from zero, creating a dead zone, marked in red in the right plot.

To evaluate the proposed pruning method, ResNet20 was trained on the CIFAR10 dataset with varying degrees of pruning: weights with absolute values below the set threshold were set to zero. The STE algorithm was used to train the quantised, PoT 4 bit width network. That is, we initialise the weights with the full precision version, and then in each training iteration we perform the forward pass with the quantised weights and the backward pass using the floating point values.
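The STE-based training loop described above can be sketched in PyTorch as shown below; this is our own minimal illustration, not the authors' training code, and the quantiser mirrors the simplified version from Section 2 (the commented usage line with self.weight is hypothetical).

import torch

MAX_SHIFT = 6   # largest bit shift for 4-bit PoT weights (1 sign bit + 3 shift bits)

class PoTQuantSTE(torch.autograd.Function):
    # Forward: power-of-two quantisation; backward: straight-through gradient.

    @staticmethod
    def forward(ctx, w):
        sf = w.abs().max()                                  # per-layer Scaling Factor
        wn = w / sf
        exp = torch.clamp(torch.round(torch.log2(wn.abs().clamp_min(1e-12))),
                          -MAX_SHIFT, 0)
        return torch.sign(wn) * torch.pow(2.0, exp) * sf    # sign(0) = 0 keeps zero weights at zero

    @staticmethod
    def backward(ctx, grad_output):
        # Straight Through Estimator: pass the gradient unchanged to the FP32 master weights.
        return grad_output

# Inside a convolution layer's forward pass, the quantised weights would be used, e.g.:
# y = torch.nn.functional.conv2d(x, PoTQuantSTE.apply(self.weight), bias=self.bias)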

PF Accuracy Number of zeros
no pruning
Table 3: Pruning experiments for the 4-bit width PoT quantised ResNet20 trained on CIFAR10. Increasing the value of Pruning Factor causes more weights to be set to zero, while retaining very good accuracy of the 4-bit width quantised network.

Table 3 shows the results of the experiment. Since the tested network is very redundant, and due to the introduction of a ’dead zone’ that allowed all quantisation levels to be maintained, the decrease in network performance is small even when of weights are set to zero. We also observed a slight increase in performance for lower PF values, indicating that the introduced pruning prevented overfitting.

The presented pruning algorithm, in combination with the BAC module described in Section 4, which skips the computation for zero weights, will therefore make it possible to further significantly reduce the power requirements of a real hardware accelerator for sparse neural networks.

5 Conclusion

In this article, we demonstrate the effect of logarithmic quantisation on the energy demand of an FPGA-based accelerator of convolutional neural networks. Replacing the multiplication operation with a bit shift reduced the power requirement, making the accelerator at least 1.4x more energy efficient. Moreover, a pruning algorithm adapted to logarithmic quantisation is proposed, so that power consumption can be further reduced while maintaining high network accuracy. In view of this and previous works, we conclude that neural network quantisation using PoT weights, compared to uniform quantisation, allows for highly energy-efficient acceleration in embedded devices. Further work should focus on the hardware implementation of a complete vision system based on the presented solution. The use of PoT weights and the proposed convolution filter hardware design should allow for real-time and energy-efficient implementation of particularly complex, state-of-the-art deep convolutional neural networks.

5.0.1 Acknowledgements

The work presented in this paper was supported by the AGH University of Science and Technology project no. 16.16.120.773.

References