AdderNet and its Minimalist Hardware Design for Energy-Efficient Artificial Intelligence

01/25/2021 · Yunhe Wang et al. · HUAWEI Technologies Co., Ltd. and The University of Sydney

Convolutional neural networks (CNN) have been widely used to boost the performance of many machine intelligence tasks. However, CNN models are usually computationally intensive and energy consuming, since they are often designed with numerous multiply operations and considerable parameters for accuracy reasons. Thus, it is difficult to directly apply them in resource-constrained environments such as 'Internet of Things' (IoT) devices and smart phones. To reduce the computational complexity and energy burden, here we present a novel minimalist hardware architecture using the adder convolutional neural network (AdderNet), in which the original convolution is replaced by an adder kernel using only additions. To maximally excavate the potential energy savings, we explore a low-bit quantization algorithm for AdderNet with a shared-scaling-factor method, and we design both specific and general-purpose hardware accelerators for AdderNet. Experimental results show that the adder kernel with int8/int16 quantization also exhibits high performance, meanwhile consuming much fewer resources (theoretically about 81%-off). In addition, we deploy the quantized AdderNet on an FPGA (Field Programmable Gate Array) platform. The whole AdderNet can practically achieve a 16% speed-up, a 67.6%-71.4% decrease in logic resource utilization and a 47.85%-77.9% decrease in power consumption compared to CNN under the same circuit architecture. With a comprehensive comparison of performance, power consumption, hardware resource consumption and network generalization capability, we conclude that AdderNet is able to surpass all the other competitors, including the classical CNN, the novel memristor-network, XNOR-Net and the shift-kernel based network, indicating its great potential in future high-performance and energy-efficient artificial intelligence applications.


1 Introduction

Deep neural networks have become ubiquitous in modern artificial intelligence tasks such as computer vision Deng et al. (2009); He et al. (2016); LeCun et al. (2015) and natural language processing Vaswani et al. (2017) due to their outstanding performance. In order to achieve higher accuracy, it has been a general trend to create deeper and more complicated neural network models, which are usually computationally intensive and energy consuming. However, in many real-world applications, especially on edge platforms such as sensors, drones, robots and mobile phones, complex recognition tasks have to be carried out under resource-constrained conditions with limited battery capacity, computation resources and chip area. Thus, new deep learning models that can run artificial intelligence algorithms with both high performance and low power consumption are strongly desired.

Many significant works have been proposed to address the model compression issue in the past several years Han et al. (2016); Forrest et al. (2017); Li et al. (2017). To directly decrease the amount of computation, network pruning Li et al. (2017) removes redundant weights and thereby compresses the original network. To construct efficient neural networks with smaller model sizes, new convolution blocks and architectures have been proposed. For example, SqueezeNet Forrest et al. (2017) introduces a bottleneck architecture and achieves AlexNet-level accuracy with only 2% of the parameters. MobileNetV1 Howard et al. (2017) decomposes the conventional convolution filters into point-wise and depth-wise convolution filters, so that far fewer (about 11%) multiply-accumulate (MAC) operations are needed. MobileNetV2 Sandler et al. (2018) uses inverted residuals and linear bottlenecks to obtain further improvements over MobileNetV1. However, these efficient network architectures still rely on multiplication-based convolution, which is energy consuming. To make the convolution operation hardware-friendly and reduce the computational complexity of networks, low-bit model quantization methods have been developed, among which FIX16 and INT8 models are widely used due to their high accuracy and moderate model size Jacob et al. (2018). Binary/ternary weighted neural networks have been proposed to further reduce the energy consumption Hubara et al. (2016); Rastegari et al. (2016). However, these low-bit networks usually suffer from a large degradation in accuracy, especially on large datasets such as ImageNet.

Instead of the conventional multiplication-based convolution LeCun et al. (1998), several new deep learning computation schemes have been proposed, including Deep-shift CNN Elhoushi et al. (2019), memristor CNN Yao et al. (2020); Wang et al. (2019), XNORnet Rastegari et al. (2016) and mixed-precision CNN Chakraborty et al. (2020). The Deep-shift CNN replaces all multiplications with bit-wise shifts, sign flipping operations and additions, but it still shows a performance gap compared to the CNN baseline. The multi-bit version of Deep-shift exhibits much higher performance, but its energy consumption also increases. The memristor CNN integrates novel memristor Strukov et al. (2008) (a new kind of electronic device) arrays to implement parallel MAC operations. By directly using Ohm's law for the multiplication and Kirchhoff's law for the accumulation, the in-memory computing structure of the memristor network greatly improves energy efficiency Wang et al. (2019); Li et al. (2019). However, its network performance and integration level are still far from practical artificial intelligence applications. What is worse, the analogue circuit has to involve numerous digital-to-analog and analog-to-digital converters (DAC/ADC), which inevitably and largely increases both the chip area and the power consumption Yao et al. (2017). Moreover, the conductance variation issue of the memristor still has to be conquered. The XNORnet and its improved version, the mixed-precision CNN, mainly take advantage of bit-level logic operations for the calculation, so the energy consumption can be largely suppressed Chakraborty et al. (2020). However, their performance is also constrained by their low precision. In addition, the mixed-precision CNN allocates quantized weights with different bit-widths to different layers. It can be a meaningful compromise that bridges the gap between high performance and low power consumption, but it still contains massive multiplication operations and suffers from accuracy loss.

In this work, we present the adder-kernel convolutional neural network, in which the convolution kernel contains only adder operations Chen et al. (2020). AdderNet can fundamentally decrease both the logic resource consumption and the energy consumption, because an adder operation is much more lightweight than a multiplication operation Horowitz (2014). Besides, AdderNet shows high generalization ability across versatile neural network architectures and performs well on tasks ranging from pattern recognition to super-resolution reconstruction Xu et al. (2020). Experiments are conducted on several benchmark datasets and neural networks to verify the effectiveness of our approach. The performance of quantized AdderNet is very close to, or even exceeds, that of the CNN baselines with the same architecture. Finally, we thoroughly analyze the practical power and chip area consumption of AdderNet and CNN on FPGA. The whole AdderNet achieves a 67.6%-71.4% decrease in logic resource utilization (equivalent to the chip area) and a 47.85%-77.9% decrease in energy consumption compared to the traditional CNN. In short, with a comprehensive discussion of the network performance, accuracy, quantization performance, power consumption, generalization capability, and the practical implementations on the FPGA platform, AdderNet is verified to surpass all the other competitors such as the novel memristor-based neural network and the shift-operation based neural network.

2 Preliminaries

Figure 1: Structure of Convolutional Neural Networks. (a) The typical structure of a convolutional neural network: input layer, a series of convolutional and pooling layers, a fully-connected layer, and the output layer. (b) The classical multiply core used in CNN. (c) The shift-operation core used in Shift-CNN, in which both a shift operation and a sign operation are needed; if the data width of the weight is larger than 1 bit, adder operations are also needed. (d) The analogue memristor core used in memristor-CNN. (e) The logic XNOR-operation core used in XNORnet (binary neural network). (f) The adder core used in AdderNet (this work).

2.1 Comparison of Different Convolutional Kernels

In deep convolutional neural networks (Figure 1(a)), the convolutional layer is designed to detect local conjunctions of features using a series of convolution kernels (filters) by calculating the similarity between filters and inputs LeCun et al. (1998, 2015). Generally, there are 5 kinds of convolution kernels worth discussing, namely the multiplication-kernel in the traditional CNN (Figure 1(b)), the shift-operation kernel (Figure 1(c)), the analogue memristor kernel (Figure 1(d)), the XNOR logic operation kernel (Figure 1(e)), and the proposed adder-kernel (Figure 1(f)).

For the representative CNN models, the vector multiplication (cross-correlation) is used to measure the similarity between the input feature and the convolution filter. Consider a filter whose weights are denoted as $W \in \mathbb{R}^{K \times K \times C_{in} \times C_{out}}$, where $K \times K$ is the kernel size and $C_{in}$ and $C_{out}$ are the numbers of input and output channels, and an input feature denoted as $F \in \mathbb{R}^{H \times W \times C_{in}}$, where $C_{in}$ is the number of feature channels and $H$ and $W$ are the height and width of the input feature, respectively. Then the output $O$ can be calculated by

$$O(m,n,t) = \sum_{i=0}^{K-1}\sum_{j=0}^{K-1}\sum_{k=0}^{C_{in}-1} S\big(F(m+i,\,n+j,\,k),\,W(i,j,k,t)\big), \qquad (1)$$

where $S(F, W)$ is a pre-defined similarity metric function. In CNN models, the metric function $S$ is multiplication, namely $S(F,W) = F \times W$. In AdderNet, the L1-norm is used to measure the similarity between filters and inputs by calculating the absolute difference of $F$ and $W$, namely $S(F,W) = -|F - W|$, in which the negative sign ensures that a smaller L1-distance corresponds to a larger similarity value Chen et al. (2020).
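To make the two similarity metrics concrete, the short NumPy sketch below (our own illustration, not the authors' released code) evaluates equation (1) at a single output position with both the CNN cross-correlation metric and the AdderNet negative L1-distance metric.

```python
import numpy as np

def conv_output(feature_patch, weight, metric):
    """Evaluate equation (1) for one output position and one filter.
    feature_patch and weight both have shape (K, K, C_in)."""
    return np.sum(metric(feature_patch, weight))

cnn_metric = lambda f, w: f * w             # CNN: S(F, W) = F x W
adder_metric = lambda f, w: -np.abs(f - w)  # AdderNet: S(F, W) = -|F - W|

patch = np.random.randn(3, 3, 8)            # K = 3, C_in = 8
filt = np.random.randn(3, 3, 8)
print("CNN similarity:     ", conv_output(patch, filt, cnn_metric))
print("AdderNet similarity:", conv_output(patch, filt, adder_metric))
```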

The adder kernel proposed in this work is of great significance. On one hand, it is widely recognized that custom large-scale convolutional neural networks involve massive multiplication operations which consume considerable energy and computation resources. On the other hand, although other kernels exist, such as the shift-kernel Elhoushi et al. (2019); You et al. (2020) and the memristor-kernel Yao et al. (2020), they are essentially variants of the classical multiplier-kernel. In particular, the shift operation is mathematically equivalent to multiplying by a positive (or negative) power of 2, and the memristor-kernel is only an analogue implementation of a multiplier. The key point of this work is that we experimentally show that the CNN convolution kernel itself also contains redundant information and can be further simplified. In AdderNet, the L1-norm, instead of the cross-correlation used in CNNs, is taken to measure the similarity between features and filters, which is fundamentally different from the other kernels.

Figure 2: Comparison of different kernels. (a) Recognition accuracy comparison of the different kernels, in which CNN and our AdderNet exhibit the best performance across all of the neural networks. Note that the current largest memristor-CNN model contains only 2 convolutional layers, so no data are available for it in the comparison table. (b) Normalized performance of the different kernel methods. The data are summarized from table (a), with CNN taken as the baseline. (c) Energy consumption of the different kernel operations.

2.2 AdderNet: High-Performance and Energy-Efficient Network

Performance is the most important metric for a neural network. Among the five kernels, the multiplication-kernel of CNN has been well explored and is considered to have the best performance Simonyan and Zisserman (2015); Ioffe and Szegedy (2015). Impressively, the performance of our adder-kernel based AdderNet is comparable to or even better than that of CNN in many computer vision tasks. As shown in the performance comparison table of Figure 2(a), the top-1 and top-5 accuracy of AdderNet ResNet-50 on ImageNet Deng et al. (2009) is 76.8% and 93.3%, respectively, while that of CNN is 76.13% and 92.86%. Besides, the accuracy of AdderNet ResNet-20 on CIFAR100 is 69.93%, whereas that of CNN is only 68.75%, a 1.2% accuracy improvement Xu et al. (2020). By contrast, the other kinds of convolution kernels (or their variations) show inferior performance. For example, the 6bit-weight DeepShift network and the low-bit CNN usually exhibit about 1% and 4% accuracy degradation compared to CNN, respectively, and the 1bit-weight DeepShift Elhoushi et al. (2019), ShiftAddNet You et al. (2020), mixed-precision CNN Chakraborty et al. (2020) and XNORnet Rastegari et al. (2016) show even worse performance.

For the memristor network (or other types of electronic synaptic devices such as ferroelectric devices and phase-change memory devices), although it has attracted great attention in recent years, its performance and integration level are still far from practical artificial intelligence applications. To date, the state-of-the-art memristor network is a demonstration of a 2-layer CNN model Yao et al. (2020), and its practical performance (recognition accuracy) on a small dataset (MNIST LeCun et al. (1998)) is only about 79.76%, which is much lower than that of all the digital computation systems (above 95%) mentioned above. What is worse, it can hardly be further improved due to the device conductance variation issue Huang et al. (2020).

Figure 2(b) shows the normalized network performance comparison for the different convolution kernels, in which the data are extracted from Figure 2(a). Obviously, the performance of all the other methods is lower than that of the original CNN baseline and our AdderNet. From high to low, the performance ranking is AdderNet, CNN, 8-5 bit CNN, DeepShift, mixed-precision CNN (depending on the precision), ShiftAddNet, XNORnet (BNN) and memristor CNN.

Next, we compare the circuit complexity and energy consumption of each kernel. The adder-kernel consists of one Comparator and one Adder (1C1A) or two Adders (2A) to calculate the absolute difference between weight and feature (Supplemental Information S1). The CNN multiplication-kernel contains one Multiplier. The 1-bit shift-kernel consists of a Serial Shift Register for the shift operation and one Multiplexer as well as one N-bit×1-bit Multiplier (N is the data width of the feature) for the sign function. Note that if the data width of the weight (denoted as M-bit weight) in DeepShift is larger than 1, it also needs (M-1) N-bit adders and M groups of N-bit Serial Shift Registers. The memristor-kernel is made of two parallel one-Transistor-one-Memristor (1T1R) cells together with a differential circuit (Supplemental Information S2). The XNOR-kernel is the simplest one, containing only several AND/NAND logic gates (Supplemental Information S3).

Supplemental Information S4 lists the power consumption of the different kernels Horowitz (2014); You et al. (2020); Thakre and Srivastava (2015). It can be seen that the bit-level operation and the analogue memristor cost the least energy. The adder operation and the shift operation consume similar hardware resources and energy, and both can save a large amount of consumption compared to the multiply operation. For instance, an FP32 (floating-point 32-bit, a typical data type used in CPUs or GPUs) multiplication consumes 4.11 times the energy and 1.84 times the logic area of an FP32 addition, and a FIX16 (fixed-point 16-bit, a typical data type used on FPGAs) multiplication consumes 15.7 times the energy and 14.8 times the logic area of a FIX16 addition. The typical circuit area of each kernel can be found in Supplemental Information S5.

According to the above analysis, we summarize the energy consumption of a single kernel operation for the different networks in Figure 2(c). The BNN and the memristor network consume the least energy thanks to their bit-level operations and analogue circuits, respectively; however, these two also achieve the worst network performance. Besides, the memristor networks have to employ a number of digital-to-analog and analog-to-digital converters, which will indeed largely increase the chip area as well as the power consumption and is not counted in this kernel-level energy comparison. Our AdderNet exhibits extremely low power due to the low cost of the adder and comparator operations. The low-bit (usually 4-8 bit) CNN and the shift-based networks (which contain a number of flip or adder operations) require somewhat more computation resources than AdderNet but can still theoretically achieve about a 50%-90% decrease in energy dissipation compared to CNN.

Figure 3: Quantization with a shared scaling factor. (a) and (b) Distributions of the input features and weights in different layers of AdderNet ResNet18. (c) Symmetric-mode quantization. (d) Quantization results of AdderNet ResNet18. A shared-scaling-factor quantization method for features and weights is proposed, and near-zero accuracy loss is achieved at 8-bit quantization.

3 Energy-efficient AdderNet on FPGA

3.1 Quantization with Shared Scaling Factor

Figure 4: Universal AdderNet accelerator. (a) Detailed structure of the FPGA accelerator, which can be used for universal convolutional neural networks. (b) Parallel computation in the convolution kernel. The AdderNet convolution kernel mainly contains two adders and one multiplexer, while the CNN kernel contains one multiplier; although the AdderNet kernel appears more complex, it is actually much more lightweight than that of CNN. (c1-c2) The components of the 16-bit CNN and 16-bit AdderNet synthesized on FPGA at different parallelism levels, respectively. To make a fair comparison, no DSP resources are used. It can be seen that the convolution kernel occupies more and more of the logic resources as the parallelism increases. (c3) The convolutional part saves about 80% in logic resource utilization, and the total system achieves 67.6%-off. (d1-d3) Comparison results of AdderNet and CNN for the 8-bit network. The convolutional part and the total network save about 70% and 58% in logic resource utilization, respectively.

Quantization is an important model compression technique that represents weights and features with low-bit values Jacob et al. (2018); Hubara et al. (2016). To balance high accuracy and low power consumption, an 8-bit or 16-bit quantization scheme is often used. In a CNN, the input features and the weights are usually quantized with different scaling factors, so the decimal points of features and weights are also different; this does not cause trouble for the low-bit MAC operation. In AdderNet, however, if the features and weights were compressed with different scaling factors, the convolution operands would first have to be point-aligned with shift operations before the addition, increasing the power and logic consumption in hardware.

Figures 3(a) and 3(b) show the distributions of the input features and weights in AdderNet, respectively, in which different colors represent different convolution layers of ResNet18. It can be seen that the majority (about 90%) of the input features and of the weights each fall within a narrow range of powers of two, and the two ranges largely overlap. This relatively small difference makes it possible to quantize the two sets of data together; that is, the input features and weights can be quantized with the same scaling factor, which differs from CNN quantization (separate scaling factors for feature and weight). Figure 3(c) exhibits the typical quantization method. If the clip region of the 8-bit quantization is chosen to cover this shared range, 96% of the feature information and 98% of the weight information can be retained, thus guaranteeing high performance. Besides, the shared scaling factor ensures that the adder convolution kernel in hardware can directly start the calculation without point aligning, which is hardware friendly.
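The following sketch illustrates the shared-scaling-factor idea in plain NumPy; the symmetric clip range and rounding mode here are illustrative assumptions rather than the exact deployed implementation, but they show why a single shared scale lets the quantized adder kernel work directly on integers without point alignment.

```python
import numpy as np

def quantize_shared(features, weights, bits=8):
    """Symmetric quantization with one scaling factor shared by features
    and weights, so the adder kernel needs no point alignment."""
    qmax = 2 ** (bits - 1) - 1                      # e.g. 127 for 8 bit
    clip = max(np.abs(features).max(), np.abs(weights).max())
    scale = clip / qmax                             # single shared scaling factor
    q = lambda x: np.clip(np.round(x / scale), -qmax, qmax).astype(np.int32)
    return q(features), q(weights), scale

f = np.random.randn(3, 3, 8)
w = np.random.randn(3, 3, 8)
qf, qw, s = quantize_shared(f, w, bits=8)
# Because both operands share the same scale, -|F - W| can be accumulated
# entirely in integer arithmetic and rescaled once at the end.
print("quantized adder output:", -np.abs(qf - qw).sum() * s)
print("full-precision output: ", -np.abs(f - w).sum())
```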

The detailed performance of the quantized AdderNet ResNet-18 is shown in Figure 3(d). The original Top-1 and Top-5 accuracies of the full-precision network are 68.8% and 88.6%, respectively. After 8-bit quantization, the Top-1 and Top-5 accuracies remain 68.8% and 88.5%, with near-zero accuracy loss. We then further study the compression performance at even lower bit-widths. As plotted in Figure 3(d), the network can still reach 65.5% Top-1 and 86.2% Top-5 accuracy at 5-bit precision, while 4-bit quantization exhibits a relatively large performance degradation, which is consistent with the above discussion of the parameter distributions. The compression of ResNet-50 and its comparison to the quantized CNN can be found in Supplemental Information S6 and S7, respectively.

4 Hardware Design for Energy-efficient AdderNet

The state-of-the-art hardware platforms for accelerating deep learning algorithms are field programmable gate arrays (FPGA), application specific integrated circuits (ASIC), central processing units (CPU), graphics processing units (GPU), and digital signal processors (DSP). Among these approaches, FPGA-based accelerators have attracted great attention thanks to their good performance (much better than CPU, comparable to GPU), high energy efficiency, fast development cycle (several months, much faster than ASIC), and capability of reconfiguration.

The structure of an FPGA convolutional accelerator Chen et al. (2014a, b); Zhang et al. (2016, 2015); Ma et al. (2018) can be roughly divided into 4 parts: the parallel kernel-operation core, the data storage unit and Input/Output ports, the data path control module, and other modules such as the Pooling and BN units. Generally, the convolutional kernel core occupies most of the logic resources due to the Single Instruction Multiple Data (SIMD) style of FPGA computation Zhang et al. (2015). Therefore, the hardware simplification and optimization of the convolutional kernel is of great significance.

The numbers of input and output channels in a convolutional network are usually powers of 2 (such as 64/128/256/512), which makes them suitable for parallel computing. Typical examples are Cambricon's DianNao Chen et al. (2014b, a), Google's TPU Jouppi et al. (2017), and NVIDIA's open-source CNN accelerator, the NVIDIA Deep Learning Accelerator (NVDLA) [27]. In this work, we consider P input channels to be summed in the adder tree for each of the P output channels (Figure 4(a)). Suppose the data width in the network is DW; then the consumption of AdderNet in the parallel convolutional kernels is P*(P*DW*2), where the factor 2 comes from the fact that there are two adders in each AdderNet kernel. The consumption of AdderNet in the adder tree is P*([DW+log2(P)]*(P-1)), in which (P-1) and [DW+log2(P)] represent the number of adders and the data width in the adder tree, respectively. Therefore the theoretical logic resource consumption (and also energy consumption) of AdderNet is

$$C_{AdderNet} = P \cdot (P \cdot DW \cdot 2) + P \cdot \big[(DW + \log_2 P)\cdot(P-1)\big]. \qquad (2)$$

The consumption of CNN in the parallel convolutional kernels can be calculated as P*(P*DW*DW), where the second DW represents the theoretical multiplier-to-adder cost ratio. The consumption in the adder tree is P*([2*DW+log2(P)-1]*(P-1)), in which (P-1) represents the number of adders, 2*DW is the data width after the multiplication kernel, and [2*DW+log2(P)-1] is the data width in the adder tree. Thus the theoretical resource consumption of CNN is

$$C_{CNN} = P \cdot (P \cdot DW \cdot DW) + P \cdot \big[(2 \cdot DW + \log_2 P - 1)\cdot(P-1)\big]. \qquad (3)$$

If DW is fixed at 16 and P is designed to be 64, AdderNet theoretically obtains 81.6%-off in logic resource utilization (and also power consumption) compared to CNN. However, accurately accounting for the practical energy-efficiency improvement of AdderNet is still a challenge. This mainly comes from the Data Input/Output overhead: moving data from the external main memory (Dynamic Random Access Memory, DRAM) to the computation units causes an enormous amount of energy consumption. To verify the actual energy and logic resource consumption, we have designed the following two kinds of hardware implementations.
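As a sanity check on equations (2) and (3), the short script below (our own arithmetic, not part of the accelerator code) evaluates both expressions for DW = 16 and P = 64 and reproduces the quoted saving.

```python
import math

def addernet_cost(P, DW):
    # Equation (2): parallel adder kernels + adder tree
    return P * (P * DW * 2) + P * ((DW + math.log2(P)) * (P - 1))

def cnn_cost(P, DW):
    # Equation (3): parallel multipliers + adder tree
    return P * (P * DW * DW) + P * ((2 * DW + math.log2(P) - 1) * (P - 1))

P, DW = 64, 16
saving = 1 - addernet_cost(P, DW) / cnn_cost(P, DW)
print(f"theoretical logic-resource saving: {saving:.2%}")  # ~81.65%, i.e. the quoted 81.6%-off
```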

Figure 4(b) shows the overall design of our general-purpose accelerator implemented on a Xilinx Zynq UltraScale+ MPSoC series board, which integrates the Processing System (PS, namely the ARM Cortex-series CPU) and the Programmable Logic (PL, namely the programmable FPGA fabric). The convolution accelerator is composed of external memory, on/off-chip interconnect, on-chip buffers, and the parallel computation processing elements. The data transportation between the PS and PL is controlled by the Advanced eXtensible Interface (AXI) interconnection protocol, in which the AXI-Lite protocol is used for parameter configuration and the AXI-Full protocol is used for weight/feature transportation.

Figure 5: FPGA AdderNet without off-chip data access. (a) Left: components of a typical FPGA convolutional neural network accelerator: the multi-channel kernel-operation core, data storage (including both the data buffer and Data Input/Output), the data path control module, and other modules such as the Pooling and BN units. Right: structure of the FPGA-based accelerator for LeNet-5, in which no Data Input/Output overhead is needed and all the calculations and data storage are implemented on chip. (b) and (c) Comparison of AdderNet and CNN in terms of logic resource utilization and energy consumption, respectively.

Figures 4(c) and 4(d) show the synthesized results of the 16-bit and 8-bit convolution data paths on FPGA, respectively. To make a fair comparison, no DSP resources are used here and all of the modules in the accelerator are constructed using LUT (look-up table) resources. Figure 4(c1) plots the distribution of logic resources among the components (the convolution kernel, data storage/buffer, data path control module, and others) at different parallelism levels. It can be seen that the convolution kernel occupies more and more of the logic resources as the parallelism increases. For example, when the parallelism is 128, the computation unit occupies 50.48% of the whole system; when the parallelism is 2048, the computation unit dominates and takes up about 83.9% of the total logic resources. A similar trend is observed for the 16-bit AdderNet, as shown in Figure 4(c2).

A more important question is how much logic resource AdderNet can save compared to CNN. Figure 4(c3) plots the result, in which the red dots represent the convolution kernel and the black dots represent the total accelerator. It is clear that the total logic resource saving of AdderNet over CNN increases with the parallelism level. When the parallelism is 2048, AdderNet achieves 67.6%-off for the total system and 80%-off for the convolutional part, which is consistent with the theoretical value given by equations (2) and (3). Figure 4(d) shows the corresponding comparison of AdderNet and CNN for the 8-bit network. The 8-bit AdderNet saves about 70% in the convolution computation and 58% in the total system, and these values can be further improved if the parallelism grows even larger.

We now turn to the practical comparison of AdderNet and CNN on board (Xilinx Zynq UltraScale+ MPSoC ZCU104). Due to the limited logic resources of the ZCU104, the parallelism of CNN is restricted to 1024, and so is that of AdderNet. The first advantage of AdderNet is its capability of operating at a higher frequency and thus delivering higher performance. Because a multiplier (in combinational logic design) has a much larger logic gate delay than an adder, it is difficult for CNN to obtain positive setup and hold times in static timing analysis, especially at high frequency. In our design, the highest operating frequency of CNN is 214 MHz, and that of AdderNet is 250 MHz. When running ResNet18, the CNN achieves 424 GOPs (giga operations per second) for the convolution calculation and 307 GOPs for the whole network, while AdderNet achieves as high as 495 GOPs for the convolution and 358.6 GOPs for the whole network, which is among the best results on similar hardware platforms (Supplemental Information S8).
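The reported throughput numbers are consistent with a simple peak estimate that counts two operations per parallel kernel per clock cycle; the calculation below is only a back-of-the-envelope check under that assumption, not the authors' measurement procedure.

```python
# Rough peak-throughput estimate: 2 operations per parallel kernel per cycle
# (the common MAC-counting convention) -- an assumption, not a measurement.
parallelism = 1024
for name, freq_mhz, measured_gops in [("CNN", 214, 424), ("AdderNet", 250, 495)]:
    peak_gops = 2 * parallelism * freq_mhz / 1000
    print(f"{name}: peak ~{peak_gops:.0f} GOPs, measured {measured_gops} GOPs "
          f"({measured_gops / peak_gops:.0%} of peak)")
```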

Besides, the operating power of the adder convolution is much lower than that of CNN. We measured the practical power consumption of AdderNet and CNN at 214 MHz during convolution, counting all activities such as the off-chip data access, data input/output transportation, control module, and the convolution itself. After subtracting the ZCU104 embedded system operating power (which can be treated as the baseline or background, 14 W), the intrinsic power of the AdderNet convolution is 1.34 W, while that of CNN is 2.57 W, which means AdderNet achieves 47.85%-off in practical energy consumption.
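The quoted 47.85% figure follows directly from the two intrinsic power numbers after the 14 W baseline is subtracted, as the small check below shows.

```python
baseline_w = 14.0                 # ZCU104 embedded-system power (background)
addernet_w, cnn_w = 1.34, 2.57    # intrinsic convolution power after subtraction
print(f"measured totals: AdderNet {baseline_w + addernet_w:.2f} W, "
      f"CNN {baseline_w + cnn_w:.2f} W")
print(f"power saving: {1 - addernet_w / cnn_w:.2%}")   # ~47.86%, i.e. the 47.85%-off figure
```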

The deviation from the theoretical value above mainly comes from the data transportation between the on-chip computation part and the off-chip memory. To avoid this, we deploy a small-scale network (i.e., LeNet-5) on a Zynq-7020 (with a Xilinx XC7Z020 chip), in which no Data Input/Output overhead is needed and all of the weight parameters and intermediate computation results are stored on chip. As shown in Figure 5(a), LeNet-5 contains 2 convolutional layers, where the first has 1 input channel and 6 output channels, and the second has 6 input channels and 16 output channels. To accelerate the convolution operation, we use parallel computation over both input and output channels. Therefore, there are 6 parallel kernel operators for the first convolutional layer and 96 for the second layer.

Figures 5(b) and 5(c) show the comparison results of CNN and AdderNet. For the 16-bit network, our AdderNet achieves 70.3%-off, 80.32%-off and 71.4%-off compared to CNN in the number of LUTs (equivalent to the chip area) for conv-layer1, conv-layer2 and the total network, respectively. Besides, the drop in energy consumption of AdderNet reaches 70.22%-off, 88.29%-off and 77.91%-off, respectively. These results are quite close to the estimated value of about 81%. For the 8-bit network, AdderNet achieves 46.76%-off, 66.86%-off and 61.63%-off compared to CNN in logic resource utilization, and 48.33%-off, 72.96%-off and 56.57%-off in energy consumption, which is also close to the estimated value.

5 Implementation Details

Optimization of AdderNet.

The back-propagation algorithm is utilized for calculating the gradients of AdderNet Chen et al. (2020), just as in conventional neural networks. The loss function for the image recognition task is the cross-entropy loss. To improve the performance of AdderNet, we also apply a distillation loss Xu et al. (2020) to AdderNet by using CNNs as teacher networks (Supplemental Information S9).

Datasets.

We have verified the performance of AdderNet on various computer vision datasets including CIFAR100 and ImageNet. The CIFAR100 dataset Krizhevsky and Hinton (2009) consists of 50,000 32x32 colour training images and 10,000 test images from 100 classes. ImageNet ILSVRC2012 Deng et al. (2009) is a large-scale image classification dataset composed of about 1.2 million training images and 50,000 validation images belonging to 1000 categories. The commonly used data augmentation strategies, including random crop and random flip, are adopted for image pre-processing He et al. (2016).

Training details.

All the networks are trained on NVIDIA Tesla V100 GPUs using PyTorch. For CIFAR100, the models are trained for 400 epochs with a batch size of 256; the learning rate starts from 0.1 and decays following a cosine scheduler, and the weight decay is set to 0.0005. For ImageNet, the models are trained for 150 epochs with a batch size of 256; the learning rate starts from 0.1 and decays following a cosine scheduler, and the weight decay is set to 0.0001.
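For reference, a minimal PyTorch sketch of the training schedule described above; only the learning rate, weight decay, batch size, epoch count and cosine decay are stated in the text, so the choice of SGD with momentum 0.9 is an assumption on our part.

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR

def build_training_setup(model, dataset="cifar100"):
    # Hyper-parameters stated in the text; SGD with momentum 0.9 is assumed.
    epochs = 400 if dataset == "cifar100" else 150           # ImageNet: 150 epochs
    weight_decay = 5e-4 if dataset == "cifar100" else 1e-4   # ImageNet: 1e-4
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                                momentum=0.9, weight_decay=weight_decay)
    scheduler = CosineAnnealingLR(optimizer, T_max=epochs)   # cosine learning-rate decay
    return optimizer, scheduler, epochs                      # batch size: 256 in both cases
```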

Hardware information.

The accelerator is designed and simulated with Vivado (v2018.1). The hardware platform for the large-scale network is the Zynq UltraScale+ MPSoC ZCU104 board, which carries a Xilinx XCZU7EV-2FFVC1156 MPSoC chip; the small-scale network is deployed on a ZYNQ7020 board with a Xilinx XC7Z020 chip. The hardware implementation runs in the PYNQ (Python Productivity for ZYNQ) zcu104_v2.5 environment.

6 Conclusion

This paper studies the low-bit quantization algorithm for the adder neural network (AdderNet) and its hardware design, achieving state-of-the-art performance in terms of both accuracy and hardware overhead. Specifically, AdderNet exhibits high-performance and low-power advantages over its counterparts such as the conventional multiply-kernel CNN, the novel memristor-kernel CNN, and the shift-kernel CNN. Besides, we propose the shared-scaling-factor quantization method for AdderNet, which is not only hardware friendly but also preserves performance with nearly zero accuracy loss down to a 6-bit network. Finally, we design both specific and general-purpose convolution accelerators for implementing neural networks on FPGA. In practice, the adder kernel can theoretically obtain about 81%-off in resource consumption compared to the multiplication kernel (i.e., the conventional CNN). Experimental results illustrate that by applying AdderNet we can achieve about a 1.16x speed-up ratio, a 67.6%-71.4% decrease in logic resource utilization (chip area), and a 47.85%-77.9% decrease in power consumption compared to CNN with exactly the same circuit architecture.

Acknowledgments

This work was supported by Shenzhen Science and Technology Innovation Committee JCYJ 20200109115210307.

Supplemental Information

S1: Details of adder convolution kernel

To calculate the absolute difference between weight and feature, the adder kernel can be designed using a one-Comparator-one-Adder (1C1A) scheme or a two-Adder (2A) scheme (Figure 8). The 1C1A scheme first compares the two inputs and then subtracts the smaller one from the larger one. The advantage of such a design is the lower resource consumption of the digital comparator; however, it suffers from a relatively larger gate delay, which may decrease the circuit speed. The 2A scheme can operate at a higher frequency thanks to its two parallel adders, but an adder is more complex than a comparator, so the circuit area is larger. In this work, we choose the 2A scheme for higher performance.
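A simple behavioural model of the two schemes (software only; the actual kernels are implemented in logic) makes the trade-off explicit:

```python
def abs_diff_1c1a(f, w):
    """1C1A: one comparator decides the operand order, then a single
    adder (subtractor) produces the non-negative difference."""
    return f - w if f >= w else w - f

def abs_diff_2a(f, w):
    """2A: two adders compute both differences in parallel and a
    multiplexer selects the non-negative one (faster, more area)."""
    d0, d1 = f - w, w - f          # the two parallel adders
    return d0 if d0 >= 0 else d1   # the multiplexer

assert abs_diff_1c1a(5, 9) == abs_diff_2a(5, 9) == abs(5 - 9)
```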

S2: Memristor and memristor network

The reason that a memristor can be used in a neural network is that its conductance can be modulated by electrical pulses. Figure 9 shows a typical memristor and its performance. The conductance can be tuned across several discrete levels, which is similar to the weight parameters in a network. However, the number of conductance levels in a memristor is usually limited to about 4-6 bits, and the conductance variation issue is unavoidable.

The memristor network is usually designed with a one-Transistor-one-Memristor (1T1R) structure, in which the transistor individually controls the memristor through the word-line (WL) and select-line (SL). Because the conductance of a memristor is always positive while the weight parameters of a network can be either positive or negative, a differential 1T1R-1T1R circuit is used (green region in Figure 9(b)). The multiply-accumulate (MAC) operation can be realized by directly using Ohm's law and Kirchhoff's law in the memristor array. However, it requires a large number of digital-to-analog and analog-to-digital converters (DAC/ADC) in the peripheral control module, which will inevitably and largely increase both the chip area and the power consumption.

S3: Binary neural network and Binary memristor network

The XNOR-kernel based binary neural network (BNN) consumes the least logic resources and energy, because the XNOR kernel contains only several AND/NAND logic gates and is much more lightweight than all of the other kernels.

The combination of the binary neural network and the memristor network is the novel binary memristor network Huang et al. (2020). It is lightweight, highly robust and can work well on challenging visual tasks. However, the device may suffer from getting stuck in the Set/Reset state (Figure 10(d)), and it also needs DAC/ADC in the peripheral circuits.

Figure 6: The original Top-1 and Top-5 accuracies of the full-precision ResNet-50 network are 76.8% and 93.3%, respectively. After the 8-bit quantization, the Top-1 and Top-5 accuracies can still reach 76.6% and 93.2%, respectively.
Figure 7: The accuracy of CNN-ResNet20 at 8-bit and 4-bit precision is 91.76% and 89.54%, respectively, while that of AdderNet-ResNet20 is 91.78% and 87.57%, respectively. It can be seen that AdderNet shows similar performance to CNN at higher bit precision. However, if the data width decreases to 4 bits, the AdderNet accuracy degrades considerably, because the shared scaling factor in AdderNet quantization may lose more information.

S4: Detailed comparison of energy consumption in different convolution kernels

We compare the energy consumption of different convolution kernels including AdderNet, CNN, DeepShift Elhoushi et al. (2019), XNOR-net Rastegari et al. (2016) and the memristor kernel Yao et al. (2020). The results are shown in Figure 11.

Figure 8: (a) and (b) Logic circuits of a 1-bit comparator and a 1-bit adder; the adder is more complex than the comparator. (c) and (d) The typical designs of the adder convolution kernel in the 1C1A and 2A schemes, respectively. The multiplexer (MUX) is constructed from several AND gates and is much more lightweight than the other logic parts.
Figure 9: (a) Performance of a typical memristor. (b) Structure of memristor network.

S5: Detailed comparison of logic circuit area in different convolution kernels

We compare the logic circuit area of different convolution kernels including AdderNet, CNN, DeepShift Elhoushi et al. (2019), XNOR-net Rastegari et al. (2016) and the memristor kernel Yao et al. (2020). The results are shown in Figure 12.

S6: Quantization of ResNet-50

The quantization results with the proposed shared-scale quantization method are shown in Figure 6.

S7: Performance comparison of AdderNet and CNN after Quantization

We compare the quantization results of AdderNet and CNN, as shown in Figure 7.

S8: Performance comparison of different FPGA neural network

We compare the performance of different FPGA neural network accelerators, including the models in Rahman et al. (2016), Li et al. (2016), Aydonat et al. (2017), Guo et al. (2017), Zhang et al. (2016), Wei et al. (2017), and Guan et al. (2017). The results are shown in Figure 13.

Figure 10: (a) The XNOR gate used in binary neural network. (b) to (d) Performance of the Binary memristor network.
Figure 11: Detailed comparison of energy consumption of different convolution kernels Horowitz (2014); Thakre and Srivastava (2015); You et al. (2020).
Figure 12: Detailed comparison of logic circuit area in different convolution kernels Thakre and Srivastava (2015).
Figure 13: Performance comparison of different FPGA neural network accelerators.

S9: Accuracy and loss of AdderNet in training

We plot the accuracy and loss curves to better illustrate the training of AdderNet (Figure 14).

Figure 14: Accuracy and loss of AdderNet in training.

References

  • [1] U. Aydonat, S. O’Connell, D. Capalija, A. C. Ling, and G. R. Chiu (2017) An opencl™ deep learning accelerator on arria 10. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 55–64. Cited by: S8: Performance comparison of different FPGA neural network.
  • [2] I. Chakraborty, D. Roy, I. Garg, A. Ankit, and K. Roy (2020) Constructing energy-efficient mixed-precision neural networks through principal component analysis for edge intelligence. Nature Machine Intelligence 2 (1), pp. 43–55. Cited by: §1, §2.2.
  • [3] H. Chen, Y. Wang, C. Xu, B. Shi, C. Xu, Q. Tian, and C. Xu (2020) AdderNet: do we really need multiplications in deep learning?. In CVPR, Cited by: §1, §2.1, §5.
  • [4] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam (2014) DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning. ACM SIGARCH Computer Architecture News 42 (1), pp. 269–284. Cited by: §4, §4.
  • [5] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, et al. (2014) Dadiannao: a machine-learning supercomputer. In 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 609–622. Cited by: §4, §4.
  • [6] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In CVPR, pp. 248–255. Cited by: §1, §2.2, §5.
  • [7] M. Elhoushi, Z. Chen, F. Shafiq, Y. H. Tian, and J. Y. Li (2019) Deepshift: towards multiplication-less neural networks. arXiv preprint arXiv:1905.13298. Cited by: §1, §2.1, §2.2, S4: Detailed comparison of energy consumption in different convolution kernels, S5: Detailed comparison of logic circuit area in different convolution kernels.
  • [8] N. I. Forrest, H. Song, W. Matthew, A. Khalid, and J. W. Dally (2017) SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. In ICLR, Cited by: §1.
  • [9] Y. Guan, H. Liang, N. Xu, W. Wang, S. Shi, X. Chen, G. Sun, W. Zhang, and J. Cong (2017) FP-dnn: an automated framework for mapping deep neural networks onto fpgas with rtl-hls hybrid templates. In 2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), pp. 152–159. Cited by: S8: Performance comparison of different FPGA neural network.
  • [10] K. Guo, L. Sui, J. Qiu, J. Yu, J. Wang, S. Yao, S. Han, Y. Wang, and H. Yang (2017) Angel-eye: a complete design flow for mapping cnn onto embedded fpga. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 37 (1), pp. 35–47. Cited by: S8: Performance comparison of different FPGA neural network.
  • [11] S. Han, H. Mao, and W. J. Dally (2016) Deep compression: compressing deep neural networks with pruning, trained quantization and huffman coding. In ICLR, Cited by: §1.
  • [12] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778. Cited by: §1, §5.
  • [13] M. Horowitz (2014) 1.1 computing’s energy problem (and what we can do about it). In 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), pp. 10–14. Cited by: §1, §2.2, Figure 11.
  • [14] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam (2017) Mobilenets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861. Cited by: §1.
  • [15] M. Huang, G. Zhao, X. Wang, W. Zhang, P. Coquet, B. K. Tay, G. Zhong, and J. Li (2020) Global-gate controlled one-transistor one-digital-memristor structure for low-bit neural network. IEEE Electron Device Letters 42 (1), pp. 106–109. Cited by: §2.2, S3: Binary neural network and Binary memristor network.
  • [16] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio (2016) Binarized neural networks. In NeurIPS, pp. 4107–4115. Cited by: §1, §3.1.
  • [17] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In ICML, Cited by: §2.2.
  • [18] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko (2018) Quantization and training of neural networks for efficient integer-arithmetic-only inference. In CVPR, Cited by: §1, §3.1.
  • [19] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, et al. (2017) In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture, pp. 1–12. Cited by: §4.
  • [20] A. Krizhevsky and G. Hinton (2009) Learning multiple layers of features from tiny images. Technical report Citeseer. Cited by: §5.
  • [21] Y. LeCun, Y. Bengio, and G. Hinton (2015) Deep learning. nature 521 (7553), pp. 436–444. Cited by: §1, §2.1.
  • [22] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §1, §2.1, §2.2.
  • [23] C. Li, Z. Wang, M. Rao, D. Belkin, W. Song, H. Jiang, P. Yan, Y. Li, P. Lin, M. Hu, et al. (2019) Long short-term memory networks in memristor crossbar arrays. Nature Machine Intelligence 1 (1), pp. 49–57. Cited by: §1.
  • [24] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf (2017) Pruning filters for efficient convnets. In ICLR, Cited by: §1.
  • [25] H. Li, X. Fan, L. Jiao, W. Cao, X. Zhou, and L. Wang (2016) A high performance fpga-based accelerator for large-scale convolutional neural networks. In 2016 26th International Conference on Field Programmable Logic and Applications (FPL), pp. 1–9. Cited by: S8: Performance comparison of different FPGA neural network.
  • [26] Y. Ma, Y. Cao, S. Vrudhula, and J. Seo (2018) Optimizing the convolution operation to accelerate deep neural networks on fpga. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 26 (7), pp. 1354–1367. Cited by: §4.
  • [27] (2018) NVIDIA deep learning accelerator. Note: http://nvdla.org/primer.html Cited by: §4.
  • [28] A. Rahman, J. Lee, and K. Choi (2016) Efficient fpga acceleration of convolutional neural networks using logical-3d compute array. In 2016 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 1393–1398. Cited by: S8: Performance comparison of different FPGA neural network.
  • [29] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi (2016) Xnor-net: imagenet classification using binary convolutional neural networks. In ECCV, pp. 525–542. Cited by: §1, §1, §2.2, S4: Detailed comparison of energy consumption in different convolution kernels, S5: Detailed comparison of logic circuit area in different convolution kernels.
  • [30] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen (2018) MobileNetV2: inverted residuals and linear bottlenecks. In CVPR, pp. 4510–4520. Cited by: §1.
  • [31] K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. In ICLR, Cited by: §2.2.
  • [32] D. B. Strukov, G. S. Snider, D. R. Stewart, and R. S. Williams (2008) The missing memristor found. nature 453 (7191), pp. 80–83. Cited by: §1.
  • [33] S. Thakre and P. Srivastava (2015) Design and analysis of low-power high-speed clocked digital comparator. In 2015 Global Conference on Communication Technologies (GCCT), pp. 650–654. Cited by: §2.2, Figure 11, Figure 12.
  • [34] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 6000–6010. Cited by: §1.
  • [35] Z. Wang, C. Li, P. Lin, M. Rao, Y. Nie, W. Song, Q. Qiu, Y. Li, P. Yan, J. P. Strachan, et al. (2019) In situ training of feed-forward and recurrent convolutional memristor networks. Nature Machine Intelligence 1 (9), pp. 434–442. Cited by: §1.
  • [36] X. Wei, C. H. Yu, P. Zhang, Y. Chen, Y. Wang, H. Hu, Y. Liang, and J. Cong (2017) Automated systolic array architecture synthesis for high throughput cnn inference on fpgas. In Proceedings of the 54th Annual Design Automation Conference 2017, pp. 1–6. Cited by: S8: Performance comparison of different FPGA neural network.
  • [37] Y. Xu, C. Xu, X. Chen, W. Zhang, C. Xu, and Y. Wang (2020) Kernel based progressive distillation for adder neural networks. In NeurIPS, Cited by: §1, §2.2, §5.
  • [38] P. Yao, H. Wu, B. Gao, S. B. Eryilmaz, X. Huang, W. Zhang, Q. Zhang, N. Deng, L. Shi, H. P. Wong, et al. (2017) Face classification using electronic synapses. Nature Communications 8 (1), pp. 1–8. Cited by: §1.
  • [39] P. Yao, H. Wu, B. Gao, J. Tang, Q. Zhang, W. Zhang, J. J. Yang, and H. Qian (2020) Fully hardware-implemented memristor convolutional neural network. Nature 577 (7792), pp. 641–646. Cited by: §1, §2.1, §2.2, S4: Detailed comparison of energy consumption in different convolution kernels, S5: Detailed comparison of logic circuit area in different convolution kernels.
  • [40] H. You, X. Chen, Y. Zhang, C. Li, S. Li, Z. Liu, Z. Wang, and Y. Lin (2020) Shiftaddnet: a hardware-inspired deep network. In NeurIPS, Cited by: §2.1, §2.2, §2.2, Figure 11.
  • [41] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong (2015) Optimizing fpga-based accelerator design for deep convolutional neural networks. In Proceedings of the 2015 ACM/SIGDA international symposium on field-programmable gate arrays, pp. 161–170. Cited by: §4.
  • [42] C. Zhang, D. Wu, J. Sun, G. Sun, G. Luo, and J. Cong (2016) Energy-efficient cnn implementation on a deeply pipelined fpga cluster. In Proceedings of the 2016 International Symposium on Low Power Electronics and Design, pp. 326–331. Cited by: §4, S8: Performance comparison of different FPGA neural network.