1 Introduction
Computer vision algorithms are a central component of many modern and emerging technologies, such as autonomous vehicles (drones, cars), augmented and virtual reality, and advanced driver assistance systems (ADAS). These systems must be characterised by three features: reliability, real-time operation, and, due to the battery power source, energy efficiency. The highest performance of vision systems is currently achieved by machine learning algorithms, in particular deep neural networks, which at the same time are known for their high computational complexity. To guarantee real-time operation, these calculations need to be massively parallelised, while the third requirement, energy efficiency, must also be taken into account. Since regular CPUs are not designed for such multithreaded operations, and high-end GPUs consume hundreds of watts, embedded devices seem to be a promising solution. However, these platforms have relatively limited computational resources, so neural network acceleration requires some model optimisation. The most commonly used methods to reduce the number of parameters, as well as the computational complexity of the network, are quantisation and pruning.
Reducing the precision of computations through quantisation leads to a significant reduction in memory requirements (e.g., going from 32-bit floating-point numbers to an 8-bit integer representation reduces the required storage fourfold) and also simplifies the computations, since multiplying integers is less complex than multiplying floating-point numbers. The gain is greater with a more aggressive reduction in precision, such as to 4 bits or even fewer. To avoid a clear decrease in network performance, one should use a quantisation method that takes into account the logarithmic distribution of weights in a neural network layer, which are concentrated around zero and sparser elsewhere. One possibility is to use weights that are powers of two (PoT weights) – Fig. 1. Such quantisation has another major advantage: it allows the multiplication operation (in convolution filters and fully connected layers) to be replaced by a very simple bit-shift operation. Another method of reducing the memory complexity of the network is pruning, which zeroes very small weights, as they have little effect on the final result of the network's processing of the input.
In this paper, we focus on demonstrating the energy gains for FPGA-based (Field-Programmable Gate Array) neural network accelerators with PoT weights. FPGA devices are programmed by defining connections between the electronic components available in the device. They allow massive parallelisation of operations and usually consume only a few watts. With appropriate model optimisation methods, FPGAs can satisfy all the constraints discussed above for vision systems.
The main contributions of this paper are:

Design of a Bitshift ACcumulate (BAC) module implementing convolutional filtering with PoT weights, thus with bit shifting instead of multiplication, along with appropriate encoding of the weights.

Implementation and comparison of hardware accelerators of the convolution layer based on uniform and logarithmic quantisation, in particular demonstrating a significant reduction in power demand when using PoT weights, also at high clock frequencies.

A pruning algorithm adapted to logarithmic quantisation, so that the computational complexity and thus the energy requirement can be further reduced.
The remainder of the article is organised as follows. In Section 2 we describe the algorithmic basis of PoT quantisation. In Section 3, we present a brief summary of related work. Section 4 describes the hardware implementation of convolution filter modules, which are then used to implement a hardware accelerator, the benchmark results of which are summarised in Section 4.1. The proposed pruning algorithm is described in Section 4.2. Section 5 summarises the results of the experiments performed.
2 Power-of-Two Quantisation
By default (e.g., in high-end GPUs), neural network computations are implemented using at least 32-bit floating-point numbers. Quantisation involves reducing the precision of the weights, and also of the activations, so that they can be represented on fewer bits, in particular as integers. The most common quantisation method is uniform quantisation, which for a given set of numbers (e.g., the weights in a neural network layer) defines equidistant quantisation levels. Logarithmic quantisation works differently: it guarantees a higher density of quantisation levels around zero and thus closely mimics the distribution of weights in a neural network.
Logarithmic quantisation is defined by Equation (1), where w is the floating-point weight and the Full Scale Range (FSR) constant determines the minimum and maximum quantisation levels.
(1) 
Following [pot:DPR], the weights of a given network layer are first normalised to the interval [-1, 1] by dividing them by SF, a Scaling Factor equal to the maximum absolute value of the weights in the layer. Then, the normalised weights are transformed according to Eq. (1) and the quantised weights are obtained. The quantised weights are powers of two, allowing the multiplication in the convolution operation to be reduced to a bit shift, and thus greatly simplifying the computations.
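One form of the quantiser that is consistent with the description above is sketched below. The symbols $w_n$ (normalised weight), $e$ (exponent), $w_q$ (quantised weight) and $b$ (bit width) are notation assumed here for illustration, as is the exact clipping range, which reserves one code of the shift field for the zero weight:

```latex
\begin{align}
  w_n &= \frac{w}{SF}, \qquad SF = \max_i |w_i|, \\
  e   &= \operatorname{clip}\bigl(\operatorname{round}(\log_2 |w_n|),\; FSR - (2^{b-1} - 2),\; FSR\bigr), \\
  w_q &= \begin{cases} 0, & w_n = 0, \\ \operatorname{sign}(w_n)\, 2^{e}, & \text{otherwise.} \end{cases}
\end{align}
```

Under these assumptions, with $b = 4$ and $FSR = 0$ the exponent $e$ ranges from $-6$ to $0$, so each non-zero weight magnitude is one of $2^{-6}, \dots, 2^{0}$.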
Consider an example convolution filter
The subsequent steps for 4-bit quantisation are as follows:
The next step is to transform the normalised weights according to Eq. (1); only the absolute values of the weights are quantised. We assume an FSR value of 0, which corresponds to the maximum absolute value of the normalised weights (2^0 = 1). For the example filter, we obtain:
The exponent vector is actually a vector of bitshift values used for simplified multiplication with input activation. The direction of the shift is fixed for all weights (minus sign – to the right), so it can be omitted. In its place, we insert the weight sign, so the exponent vector now has the form:
Thus, each weight can be stored on 4 bits: the first for the weight sign and the rest for the bit shift value.
The scaling factor SF is constant across the layer (the same for all filters in the layer), so it can be combined with the batch normalisation gain and does not introduce any additional computational load.
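The steps of this section can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the implementation used in the paper: the function names `pot_quantise` and `pack_code`, the reservation of shift code 7 for the zero weight, and the clipping of very small magnitudes to the lowest level are all assumptions made for illustration.

```python
import math


def pot_quantise(weights, bits=4, fsr=0):
    """Quantise one layer's weights to powers of two.

    Returns (scaling factor, list of (sign, shift) codes, normalised quantised values).
    """
    # Normalise by the largest absolute weight so that |w_n| <= 1.
    sf = max(abs(w) for w in weights) or 1.0
    # Assumption: one code of the (bits - 1)-bit shift field is reserved for zero,
    # so the exponents span [fsr - (2**(bits - 1) - 2), fsr], i.e. [-6, 0] for 4 bits.
    min_exp = fsr - (2 ** (bits - 1) - 2)
    zero_code = 2 ** (bits - 1) - 1
    codes, values = [], []
    for w in weights:
        if w == 0:
            codes.append((0, zero_code))
            values.append(0.0)
            continue
        # Round the base-2 logarithm of the normalised magnitude, then clip.
        exp = round(math.log2(abs(w) / sf))
        exp = max(min_exp, min(fsr, exp))
        sign = 1 if w < 0 else 0
        # The stored shift is the number of right shifts, i.e. the negated exponent.
        codes.append((sign, fsr - exp))
        values.append((-1.0 if sign else 1.0) * 2.0 ** exp)
    return sf, codes, values


def pack_code(sign, shift, bits=4):
    """Pack a (sign, shift) pair into one word: sign bit first, then the shift value."""
    return (sign << (bits - 1)) | shift
```

For example, `pot_quantise([0.8, -0.4, 0.1, 0.0])` yields a scaling factor of 0.8 and the codes `(0, 0)`, `(1, 1)`, `(0, 3)` and `(0, 7)`, i.e. the normalised values 1, -0.5, 0.125 and 0.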
3 Related Work
Weight quantisation is a basic method for reducing the computational and memory complexity of neural networks. The variety of existing approaches is well surveyed in [gholami21], which compares uniform and non-uniform quantisation, as well as different quantisation granularities (layer-wise, channel-wise, etc.), symmetry, and quantisation algorithms (Quantisation Aware Training or Post Training Quantisation).
There are several papers on logarithmic quantisation in the literature, in particular on the quantisation algorithm itself or on Quantisation Aware Training (QAT). Logarithmic quantisation for neural networks was first introduced in [Miyashita16]. The paper [deepshift19] proposes an approach in which the network learns the bit-shift values. The authors implemented GPU kernels for PoT-weighted computation and showed gains over standard multiplication-based filtering. The work of [APoT19] extends PoT weights to Additive Powers-of-Two, thereby increasing the resolution of the quantisation levels. Although this solution represents the state of the art in terms of achieved accuracy, as shown in [pot:DPR], the need to sum two partial bit-shift results, as well as the additional logic to decode the weights, introduces a higher degree of complexity compared to PoT quantisation. This latter paper also proposes learning PoT networks using the STE (Straight-Through Estimator: quantised weights are used for the forward pass, while a full-precision version is used for backpropagation) and demonstrates the gains of using bit shifting over multiplication.
The papers [9577718] and [Chmiel21] are devoted to learning algorithms, as is [Vogel18], where the authors extend the considerations to a hardware implementation of an accelerator based on logarithmic quantisation. The proposed processing element (PE) based on bit shifting for 5-bit weights allowed for a reduction in power consumption compared to a PE with an 8-bit fixed-point multiplier, for an implementation on a Xilinx Virtex-7 FPGA. However, it also introduced additional operations resulting from the use of a logarithmic base different from 2. Similarly, in [Xu20] the authors drop the logarithmic base 2 and show energy gains from using 5-bit logarithmic quantisation over a standard 16-bit multiplier. Finally, the article [Lee17] presents dynamic power consumption for a hardware implementation of matrix multiplication for uniformly and PoT quantised values. With 3 bit width weights, one can assume a decrease in power consumption of about , for 5 bits – , in favour of PoT weights.
This paper complements the research described above by showing energy gains resulting from the use of PoT weights for a hardware accelerator implemented in FPGA.
4 Hardware Design
To compare the acceleration of a convolution layer for networks with uniform and PoT quantisation, two modules were designed to perform the convolution operation: one based on a Multiply and ACcumulate (MAC – Fig. 2) unit and the other on a Bitshift and ACcumulate (BAC – Fig. 3) unit. In both cases, a test has been added to check whether the weight is equal to zero, which allows the multiplication (bit shift) to be skipped and thus reduces the actual power consumption. The obvious difference between the modules is the way the multiplication is performed: it is simplified to a bit shift for the PoT weights. An additional aspect that needs to be addressed is how to encode the logarithmic weights, which consist of a sign and an absolute bit-shift value and are therefore not written in two’s complement code. First, it is necessary to distinguish between a zero weight and a weight with a power equal to 0 (i.e., with absolute value 2^0 = 1). According to Equation (1), for weights encoded on 4 bits, we will obtain different levels ({6, 5, …, 0, 0, …, 5, 6}), which therefore leaves the possibility of encoding the zero weight in two ways. Second, the sign of the weight is encoded independently of the bit-shift value itself, so partial product accumulation must distinguish between the case where the weight is negative (in which case subtraction occurs) and positive (addition).
Both modules, MAC and BAC, operate with the same latency, i.e., they need two clock cycles to compute the partial weighted sum of the inputs. The computation is streamlined, which means that for a filter of size , the convolution result will be obtained after clock cycles. Both designs have clock, enable, and reset (for accumulator reset) signals, which are used by the external module to control the data flow.
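The arithmetic performed by the two units can be illustrated with a short behavioural model in Python. This is only a sketch of the data path, not the hardware description: the names `mac`, `bac` and `ZERO_CODE` are hypothetical, clocking and latency are not modelled, and the use of a truncating right shift on integer activations is an assumption.

```python
ZERO_CODE = 7  # assumed shift code reserved for a zero weight (4-bit encoding)


def mac(activations, weights):
    """Multiply-and-accumulate with integer weights; zero weights are skipped."""
    acc = 0
    for x, w in zip(activations, weights):
        if w == 0:
            continue  # zero test: no multiplication is performed
        acc += x * w
    return acc


def bac(activations, codes):
    """Bitshift-and-accumulate on (sign, shift) weight codes."""
    acc = 0
    for x, (sign, shift) in zip(activations, codes):
        if shift == ZERO_CODE:
            continue  # zero test: no shift is performed
        partial = x >> shift  # x * 2**(-shift), truncated towards zero for x >= 0
        # The sign bit selects subtraction or addition of the partial product.
        acc = acc - partial if sign else acc + partial
    return acc
```

With the activations `[64, 128, 200, 255]` and the codes `[(0, 0), (1, 1), (0, 3), (0, 7)]` (weights 1, -0.5, 0.125 and 0), `bac` accumulates 64 - 64 + 25 = 25, which matches the result of multiplying by the corresponding real-valued weights.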
4.1 Benchmark Results
The proposed modules were implemented in the programmable logic (FPGA) of the Zynq UltraScale+ MPSoC ZCU104 platform using the Vivado 2020.2 software. For a fair comparison of the two modules, the use of DSPs (customarily used for multiplication operations) was disabled in the synthesis process, which forced the multiplication in the MAC unit to be implemented using LUTs. Finally, the Area Explore implementation strategy was chosen, although it should be noted that for the others (Power Explore, default) the results obtained were not significantly different, due to the low complexity of the proposed designs. Table 1 shows the resource usage for single MAC and BAC modules, assuming that the network weights are, regardless of the method, quantised to 4 bits and the activations to 8 bits. Despite the additional logic needed to accumulate bit-shift results depending on the sign of the weight, the BAC module needs about 30% fewer LUTs (44 versus 63). FFs implement registers (memory cells), and the difference in their use, although slight, is also due to the change from a multiplication operation to a bit shift.
Table 1: Resource usage of single MAC and BAC modules (8-bit activations, 4-bit weights).
PE           LUT  FF
Uniform 8x4  63   45
PoT 8x4      44   41
To investigate the power requirements of neural network accelerators based on the analysed quantisation methods, a sample convolution layer module (from the ResNet family of networks) was designed, consisting of filters of size . The construction of a complete accelerator is another challenge, and several approaches are encountered in the literature (as well as in commercial solutions), including the implementation of a “universal” layer, i.e., one configurable in such a way that it can perform computations for both the “smallest” and the “largest” layer. Since most modern networks have a regular structure, this usually comes down to implementing the “largest” layer. Such an accelerator then iteratively computes the results of each layer, using memory (BRAM or DRAM – Block or Distributed RAM) to read and write the results of intermediate layers. This approach makes it possible to implement deep networks even in small embedded devices. For devices with more resources, multiple layers can be implemented instead of a single universal layer, with streamlined computations between them, thus reducing the number of reads from and writes to memory. In view of the above, the implementation of a single layer of representative size allows the performance of the entire accelerator to be demonstrated.
For both layers based on uniform and PoT quantisation, a similar test design was proposed.

we implemented a layer module, consisting of units, with input clock, enable and reset signals, and input weights for each filter (unit) and input activation,

we created a wrapper, with BRAM instances for weights and activation, logic controlling value reading and MAC/BAC units, as well as a clock signal generator,

in order to confirm the correctness of the calculations (and also to prevent the outputs of unconnected modules from being optimised away, since Vivado performs multiple optimisations during design synthesis that remove unused elements), we added an integrated logic analyser (ILA) to the layer output data.
Table 2 shows the resource usage and power consumption for both layers (excluding the other elements of the test design). The clock was set to 350 MHz and the dynamic power consumption was estimated using the Vivado tools, based on the implemented design. Using PoT weights, and thus replacing the typical multiplication in convolution operations with a bit shift, allowed for an increase in power efficiency of about . At the same time, it should also be noted, as shown in [pot:DPR], that the accuracy of low-precision networks (e.g., 4-bit) is much higher with logarithmic quantisation than with a linear distribution of quantisation levels. The above also confirms the results of the synthesis of single MAC and BAC units shown in [pot:DPR].
Table 2: Resource usage and dynamic power consumption of the convolution layer modules.
PE           LUT    FF     Clock [MHz]  Power [W]
Uniform 8x4  24193  23265  350
PoT 8x4      19557  17024  350
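The relative resource savings implied by Tables 1 and 2 can be checked with a few lines of Python. The helper name `relative_saving` is introduced here only for illustration; the LUT and FF counts are copied from the tables.

```python
def relative_saving(baseline, candidate):
    """Percentage by which `candidate` is smaller than `baseline`."""
    return 100.0 * (baseline - candidate) / baseline


# (uniform, PoT) resource counts for a single processing element (Table 1).
single_pe = {"LUT": (63, 44), "FF": (45, 41)}
# (uniform, PoT) resource counts for the whole convolution layer (Table 2).
layer = {"LUT": (24193, 19557), "FF": (23265, 17024)}

for name, table in (("single PE", single_pe), ("layer", layer)):
    for resource, (uniform, pot) in table.items():
        print(f"{name} {resource}: {relative_saving(uniform, pot):.1f}% fewer with PoT")
```

For a single processing element this gives roughly 30% fewer LUTs and 9% fewer FFs, and for the whole layer roughly 19% fewer LUTs and 27% fewer FFs.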
The actual processing time of the input frame depends on the latency of the modules (identical for both MAC and BAC units), as well as the clock frequency. To investigate the approximate limit of this frequency, we conducted a series of experiments summarised in Fig. 4, showing the relationship between the clock rate and the dynamic power consumed by the convolution layer module. For the MAC layer, the highest operating frequency is MHz, for the higher value of MHz the implementation failed to meet the timing constraints. For the BAC layer, a similar problem occurred only at MHz.
As described in Section 3, most of the existing works show decreases in the power used relative to standard (e.g., 16-bit) multipliers. For low-precision processing elements, the work of [Lee17] indicated a decrease of about , with 5 bits – , in favour of PoT weights over uniformly quantised ones. Our implementation allows for a reduction in dynamic power at 4 bit width weights. Because the aforementioned publication is sparing in its description of the design, we are unable to make a more accurate comparison.
4.2 Going Lower on Power – Pruning
A further reduction in energy consumption can be achieved by introducing more zero weights using pruning. Uniform quantisation for weights with very low precision (e.g., 4 bit width) introduces automatic pruning that does not affect the number of quantisation steps in any way – values below the minimum quantised weight (different from zero) are rounded to zero, resulting in a relatively high percentage of zero weights. This is different for logarithmic quantisation: due to the dense distribution of quantisation levels around zero, the number of quantisation levels is reduced by discarding the lowest ones, when weights with low absolute values are zeroed out. This can significantly affect the efficiency of the network at very low computational precision.
At the same time, properly performed pruning can zero out up to a few tens of percent of the weights with little effect on the final network output, which in turn can bring further gains in the hardware implementation (in addition to those from introducing a bit shift in place of multiplication):

In [pot:DPR], the authors show that zeroing out the weights allows compression and thus reduces the memory requirements (which also implies a decrease in energy requirements).

The designed MAC and BAC units skip the multiplication/bit shifting operations when zero weight is detected. Depending on the network architecture and the degree of pruning, this will result in smaller or larger energy gains during system operation.
In view of the above, a pruning algorithm adapted to the peculiarities of logarithmic quantisation is proposed, introducing a so-called dead zone around zero (in the distribution of weights of a given layer) through the following modification of the algorithm described in Section 2:

After normalising the absolute values of the weights to the interval (and getting ) we implement pruning () with a fixed degree, e.g. relative to the weight with the maximum modulus (i.e. 1).

We normalise the pruned weights, excluding zero weights, to the interval .

We perform logarithmic quantisation.
Thus, the value of the Scaling Factor is equal to , where is different from zero. To reconstruct the proper quantised weight value, we must then introduce an offset equal to , so now . Thus, we move the minimum quantisation level away from zero, creating a large enough ’dead zone’ while keeping the logarithmic distribution outside of it – Fig. 5.
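The three steps above can be sketched in Python as follows. This is a sketch under loudly stated assumptions, not the authors' implementation: it assumes that the surviving magnitudes are re-normalised with a min-max mapping onto [0, 1], that the offset is the smallest surviving normalised magnitude, and that the scaling factor is one minus that offset. The function name `prune_and_quantise` and the parameter `pf` (pruning threshold relative to the maximum modulus) are hypothetical, and at least one non-zero weight is assumed.

```python
import math


def prune_and_quantise(weights, pf=0.1, bits=4):
    """Dead-zone pruning followed by PoT quantisation; returns reconstructed weights."""
    max_abs = max(abs(w) for w in weights)
    # Step 1: normalise magnitudes to [0, 1] and drop those below the threshold.
    kept = [abs(w) / max_abs for w in weights if abs(w) / max_abs >= pf]
    offset = min(kept)      # assumed offset: smallest surviving magnitude
    sf = 1.0 - offset       # assumed scaling factor of the re-normalisation
    min_exp = -(2 ** (bits - 1) - 2)  # lowest exponent, -6 for 4 bits (assumption)
    result = []
    for w in weights:
        mag = abs(w) / max_abs
        if mag < pf:
            result.append(0.0)  # inside the dead zone
            continue
        # Step 2: re-normalise the surviving magnitude to [0, 1].
        renorm = (mag - offset) / sf if sf > 0 else 1.0
        # Step 3: logarithmic quantisation with clipping to the lowest level.
        if renorm == 0:
            exp = min_exp
        else:
            exp = max(min_exp, min(0, round(math.log2(renorm))))
        level = 2.0 ** exp
        # Reconstruction: undo both normalisations and restore the sign.
        result.append(math.copysign((level * sf + offset) * max_abs, w))
    return result
```

With `pf=0.1`, the weights `[1.0, -0.5, 0.3, 0.05, -0.02]` are reconstructed as approximately `[1.0, -0.475, 0.311, 0.0, 0.0]`: the two smallest weights fall into the dead zone, while every surviving weight keeps a magnitude of at least 0.3 and all power-of-two levels remain available outside the dead zone.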
To evaluate the proposed pruning method, ResNet-20 was trained on the CIFAR-10 dataset with varying degrees of pruning – weights below a value set of were set to zero. The STE algorithm is used to train the quantised 4-bit PoT network. That is, we initialise the weights with the full-precision version, and then in each training iteration we perform the forward pass with the quantised weights and the backward pass using floating-point values.
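The STE training loop described above can be illustrated on a toy problem. The sketch below is not the ResNet-20 experiment: it assumes a two-weight linear model with a mean squared error loss, and the names `pot_quantise`, `loss_and_grad` and `train_ste`, the learning rate and the number of steps are all chosen for illustration.

```python
import math


def pot_quantise(weights, bits=4):
    """Layer-wise PoT quantisation; returns de-normalised quantised weights."""
    sf = max(abs(w) for w in weights) or 1.0
    min_exp = -(2 ** (bits - 1) - 2)  # assumed lowest exponent (-6 for 4 bits)
    out = []
    for w in weights:
        if w == 0:
            out.append(0.0)
            continue
        exp = max(min_exp, min(0, round(math.log2(abs(w) / sf))))
        out.append(math.copysign(sf * 2.0 ** exp, w))
    return out


def loss_and_grad(used_weights, samples):
    """Mean squared error of a linear model and its gradient w.r.t. the weights used."""
    n = len(samples)
    loss, grad = 0.0, [0.0] * len(used_weights)
    for x, y in samples:
        err = sum(w * xi for w, xi in zip(used_weights, x)) - y
        loss += err * err / n
        for i, xi in enumerate(x):
            grad[i] += 2.0 * err * xi / n
    return loss, grad


def train_ste(weights, samples, lr=0.1, steps=200):
    """Straight-through estimator: quantised forward pass, full-precision update."""
    weights = list(weights)
    for _ in range(steps):
        # Forward pass and gradient are computed with the quantised weights...
        _, grad = loss_and_grad(pot_quantise(weights), samples)
        # ...but the gradient is applied unchanged to the full-precision copy.
        weights = [w - lr * g for w, g in zip(weights, grad)]
    return weights


samples = [((1.0, 0.0), 0.5), ((0.0, 1.0), -0.25)]
trained = train_ste([0.9, 0.1], samples)
```

The key design choice is that the rounding step is treated as the identity during the backward pass, so the full-precision weights keep accumulating small updates even when the quantised weights do not change between iterations.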
PF  Accuracy  Number of zeros 

no pruning  
Table 3 shows the results of the experiment. Since the tested network is very redundant, and due to the introduction of a ’dead zone’ that allowed all quantisation levels to be maintained, the decrease in network performance is small even when of weights are set to zero. We also observed a slight increase in performance for lower PF values, indicating that the introduced pruning prevented overfitting.
The presented pruning algorithm, in combination with the designed BAC module presented in Section 4, which allows one to skip the computation for zero weights, will therefore allow one to further significantly reduce the power requirements of the real hardware accelerator for sparse neural networks.
5 Conclusion
In this article, we demonstrate the effect of logarithmic quantisation on energy demand in an FPGA-based accelerator of convolutional neural networks. The replacement of the multiplication operation with a bit shift reduced the power requirement by . Moreover, a pruning algorithm adapted to logarithmic quantisation is proposed, so that power consumption can be further reduced while maintaining high network efficiency. In view of this and previous works, we conclude that neural network quantisation using PoT weights, compared to uniform quantisation, allows for highly energy-efficient acceleration in embedded devices. Further work should focus on the hardware implementation of a complete vision system based on the presented solution. The use of PoT weights and the proposed convolution filter hardware design should allow for real-time and energy-efficient implementation of particularly complex, state-of-the-art deep convolutional neural networks.
Acknowledgements
The work presented in this paper was supported by the AGH University of Science and Technology project no. 16.16.120.773.