1 Introduction
Deep neural networks have become ubiquitous in modern artificial intelligence tasks such as computer vision Deng et al. (2009); He et al. (2016); LeCun et al. (2015) and natural language processing Vaswani et al. (2017) due to their outstanding performance. To achieve higher accuracy, it has been a general trend to create deeper and more complicated neural network models, which are usually computationally intensive and energy consuming. However, in many real-world applications, especially on edge-device platforms such as sensors, drones, robots and mobile phones, complex recognition tasks have to be carried out under resource-constrained conditions with limited battery capacity, computation resources and chip area. Thus new deep learning models that can run artificial intelligence algorithms with both high performance and low power consumption are strongly desired.
Many significant works have been proposed to address the model compression issue in the past several years Han et al. (2016); Forrest et al. (2017); Li et al. (2017). To directly decrease the amount of computation, network pruning Li et al. (2017) removes redundant weights and thus compresses the original network. To construct efficient neural networks with smaller model sizes, new convolution blocks and architectures have been proposed. For example, SqueezeNet Forrest et al. (2017) introduces a bottleneck architecture and achieves AlexNet-level accuracy with only 2% of the parameters. MobileNetV1 Howard et al. (2017) decomposes the conventional convolution filters into pointwise and depthwise convolution filters, so that much fewer (11%) multiply-accumulate (MAC) operations are involved. MobileNetV2 Sandler et al. (2018) uses inverted residuals and linear bottlenecks to obtain further improvement over MobileNetV1. However, these efficient network architectures still rely on multiplication-based convolution, which is energy consuming. To make the convolution operation hardware-friendly and reduce the computational complexity of networks, low-bit model quantization methods have been developed, among which the FIX16 and INT8 models are widely used due to their high accuracy and moderate model size Jacob et al. (2018). Binary/ternary weighted neural networks have been proposed to further reduce the energy consumption Hubara et al. (2016); Rastegari et al. (2016). However, these low-bit networks usually suffer from large degradation in accuracy, especially on large datasets such as ImageNet.
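The MAC savings of depthwise separable convolution quoted above can be checked with a quick count. The sketch below uses the standard MAC formulas; the example layer shape (3x3 kernel, 256 channels, 14x14 output) is an illustrative assumption, not a layer taken from the paper.

```python
def conv_macs(k, c_in, c_out, h, w):
    """MACs for a standard k x k convolution producing an h x w output map."""
    return k * k * c_in * c_out * h * w

def depthwise_separable_macs(k, c_in, c_out, h, w):
    """MACs for depthwise (k x k per channel) plus pointwise (1 x 1) convolution."""
    return k * k * c_in * h * w + c_in * c_out * h * w

# Illustrative layer: 3x3 kernel, 256 -> 256 channels, 14x14 output map
std = conv_macs(3, 256, 256, 14, 14)
sep = depthwise_separable_macs(3, 256, 256, 14, 14)
print(f"{sep / std:.3f}")  # 0.115, roughly the 11% quoted for MobileNetV1
```

The ratio reduces to 1/k^2 + 1/c_out, so for a 3x3 kernel with many output channels it is close to 1/9.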
Instead of the conventional multiplication-based convolution LeCun et al. (1998), several new deep learning computation schemes have been proposed, including the DeepShift CNN Elhoushi et al. (2019), the memristor CNN Yao et al. (2020); Wang et al. (2019), the XNOR-net Rastegari et al. (2016) and the mixed-precision CNN Chakraborty et al. (2020). The DeepShift CNN replaces all multiplications with bitwise shifts, sign flips and additions, but it still shows a performance gap to the CNN baseline. The multi-bit version of DeepShift exhibits much higher performance, but its energy consumption also increases. The memristor CNN integrates novel memristor Strukov et al. (2008) (a new kind of electronic device) arrays to implement parallel MAC operations. By directly using Ohm's law for multiplication and Kirchhoff's law for accumulation, the in-memory computing structure of the memristor network greatly improves energy efficiency Wang et al. (2019); Li et al. (2019). However, its network performance and integration level are still far from a practical artificial intelligence application. Worse still, the analogue circuit has to involve numbers of digital-to-analog and analog-to-digital converters (DAC/ADC), which inevitably and largely increase both the chip area and the power consumption Yao et al. (2017). Moreover, the conductance variation issue of the memristor remains to be conquered. The XNOR-net and its improved version, the mixed-precision CNN, mainly take advantage of bit-level logic operations for the calculation, so the energy consumption can be largely suppressed Chakraborty et al. (2020); however, their performance is also constrained by their low precision. In addition, the mixed-precision CNN allocates quantized weights with different bit-widths to different layers. This can be a meaningful compromise to bridge the gap between high performance and low power consumption, but it still contains massive multiplication operations and suffers from the accuracy loss issue.
In this work, we present the adder-kernel convolutional neural network (AdderNet), whose convolution kernel contains only adder operations Chen et al. (2020). AdderNet can fundamentally decrease both the logic resource consumption and the energy consumption, because an adder operation is much more lightweight than a multiplication operation Horowitz (2014). Besides, AdderNet shows high generalization ability across versatile neural network architectures and performs well on tasks ranging from pattern recognition to super-resolution reconstruction Xu et al. (2020). Experiments are conducted on several benchmark datasets and neural networks to verify the effectiveness of our approach. The performance of quantized AdderNet is very close to, or even exceeds, that of the CNN baselines with the same architecture. Finally, we thoroughly analyze the practical power and chip area consumption of AdderNet and CNN on FPGA. The whole AdderNet achieves a 67.6%-71.4% decrease in logic resource utilization (equivalent to chip area) and a 47.85%-77.9% decrease in energy consumption compared to the traditional CNN. In short, through a comprehensive discussion of the network performance, accuracy, quantization performance, power consumption, generalization capability, and the practical implementations on the FPGA platform, AdderNet is verified to surpass all the other competitors such as the novel memristor-based neural network and the shift-operation based neural network.

2 Preliminaries
2.1 Comparison of Different Convolutional Kernels
In deep convolutional neural networks (Figure 1(a)), the convolutional layer is designed to detect local conjunctions of features using a series of convolution kernels (filters) by calculating the similarity between filters and inputs LeCun et al. (1998, 2015). Generally, there are 5 kinds of convolution kernels worth discussing, namely the multiplication kernel in the traditional CNN (Figure 1(b)), the shift-operation kernel (Figure 1(c)), the analogue memristor kernel (Figure 1(d)), the XNOR logic operation kernel (Figure 1(e)), and the proposed adder kernel (Figure 1(f)).
For representative CNN models, vector multiplication (cross-correlation) is used to measure the similarity between the input feature and the convolution filter. Denote the filter weights as W of shape K*K*C_in*C_out, where K*K is the kernel size and C_in, C_out are the numbers of input and output channels, and denote the input feature as F of shape H*W*C_in, where H and W are the height and width of the input feature, respectively. Then the output Y can be calculated by

Y(m,n,t) = sum_{i=0}^{K-1} sum_{j=0}^{K-1} sum_{k=0}^{C_in-1} S(F(m+i, n+j, k), W(i, j, k, t)),   (1)

where S(.,.) is a predefined similarity metric function. In CNN models, the metric function S is multiplication, namely S(x, y) = x*y. In AdderNet, the L1-norm is used instead to measure the similarity between filters and inputs by calculating the absolute differences of F and W, namely S(x, y) = -|x - y|, in which the negative sign rectifies the output to be positively related to similarity (a smaller L1-distance gives a larger output) Chen et al. (2020).
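As a minimal illustration of the two similarity metrics in equation (1), the sketch below contrasts the multiplication-based CNN metric with the AdderNet L1-based metric on a toy flattened patch (the sample values are arbitrary):

```python
def cnn_similarity(patch, kernel):
    """Cross-correlation: the multiplication-based CNN similarity metric."""
    return sum(p * k for p, k in zip(patch, kernel))

def adder_similarity(patch, kernel):
    """Negative L1-distance: the AdderNet metric, built from additions,
    subtractions and absolute values only. The minus sign makes larger
    outputs mean 'more similar', matching the CNN convention."""
    return -sum(abs(p - k) for p, k in zip(patch, kernel))

patch  = [1.0, 2.0, 0.5, -1.0]   # a flattened 2x2 input patch
kernel = [1.0, 2.0, 0.5, -1.0]   # identical filter -> maximal similarity
print(cnn_similarity(patch, kernel))    # 6.25
print(adder_similarity(patch, kernel))  # 0.0 (zero L1-distance is the maximum)
```

Both metrics peak when filter and input match; the adder metric simply reaches its peak at 0 from below.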
The adder kernel proposed in this work is of great significance. On the one hand, it is widely recognized that common large-scale convolutional neural networks involve massive multiplication operations that consume much energy and many computation resources. On the other hand, though there exist other kernels such as the shift kernel Elhoushi et al. (2019); You et al. (2020) and the memristor kernel Yao et al. (2020), they are in essence special cases of the classical multiplier kernel. In particular, the shift operation is mathematically the same as multiplying by a positive (or negative) power of 2, and the memristor kernel is only an analogue implementation of a multiplier. The key point of this work is that we experimentally show that the CNN convolution kernel itself also contains redundant information and can be further simplified. In AdderNet, the L1-norm, instead of the cross-correlation used in CNNs, is taken to measure the similarity between features and filters, which is fundamentally different from the other kernels.
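The claim that a shift kernel is a special case of the multiplier kernel can be made concrete: a shift weight can only take the value sign * 2**s. The helper below, a rough sketch and not the actual DeepShift rounding rule, projects an arbitrary weight onto that set:

```python
import math

def to_shift_weight(w):
    """Round a weight to sign * 2**s, the only values a shift kernel can
    represent. A sketch of the idea, not the exact DeepShift rounding rule."""
    if w == 0:
        return 0.0
    s = round(math.log2(abs(w)))       # nearest integer exponent
    return math.copysign(2.0 ** s, w)  # shifting by s multiplies by 2**s

print(to_shift_weight(0.30))   # 0.25  (= 2**-2, a right shift by 2)
print(to_shift_weight(-1.70))  # -2.0  (= -(2**1), shift left by 1 and flip sign)
```

Every representable weight is a power of 2 times a sign, which is exactly why shift networks lose precision relative to a full multiplier.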
2.2 AdderNet: HighPerformance and EnergyEfficient Network
Performance is the most important metric for a neural network. Among the five kernels, the multiplication kernel of CNN has been well explored and is considered to have the best performance Simonyan and Zisserman (2015); Ioffe and Szegedy (2015). Impressively, the performance of our adder-kernel based AdderNet is comparable to, or even better than, that of CNN in many computer vision tasks. As shown in the performance comparison table of Figure 2(a), the top-1 and top-5 accuracy of AdderNet ResNet-50 on ImageNet Deng et al. (2009) is 76.8% and 93.3%, respectively, while that of CNN is 76.13% and 92.86%. Besides, the accuracy of AdderNet ResNet-20 on CIFAR-100 is 69.93%, whereas that of CNN is only 68.75%, a 1.2% accuracy improvement Xu et al. (2020). By contrast, the other kinds of convolution kernels (or their variations) show inferior performance. For example, the 6-bit-weight DeepShift network and the low-bit CNN usually exhibit about 1% and 4% accuracy degradation compared to CNN, respectively. And the 1-bit-weight DeepShift Elhoushi et al. (2019), ShiftAddNet You et al. (2020), mixed-precision CNN Chakraborty et al. (2020) and XNOR-net Rastegari et al. (2016) show even worse performance. For the memristor network (or the other types of electronic synaptic devices such as ferroelectric devices and phase change memory devices), though it has attracted great attention in recent years, its performance and integration level are still far from a practical artificial intelligence application. To date, the state-of-the-art memristor network is just the demonstration of a 2-layer CNN model Yao et al. (2020). And its practical performance (recognition accuracy) on a small dataset (MNIST LeCun et al. (1998)) is only about 79.76%, which is much lower than that of all the digital computation systems (above 95%) mentioned above. Worse still, it can hardly be further improved due to the device conductance variation issue Huang et al. (2020).

Figure 2(b) shows the normalized network performance comparison for the different convolution kernels, in which the data is extracted from Figure 2(a). Obviously, the performance of all the other methods is lower than that of the original CNN baseline and our AdderNet. From high to low, the performance ranking is AdderNet, CNN, 8- to 5-bit CNN, DeepShift, mixed-precision CNN (depending on the precision), ShiftAddNet, XNOR-net (BNN) and memristor CNN.
Next, we compare the circuit complexity and energy consumption of each kernel. The adder kernel consists of one-Comparator-one-Adder (1C1A) or two Adders (2A) to calculate the absolute difference between weight and feature (Supplemental Information S1). The CNN multiplication kernel contains one multiplier. The 1-bit shift kernel consists of a serial shift register for the shift operation plus one multiplexer and one N-bit*1-bit multiplier (N is the data width of the feature) for the sign function. Note that if the data width of the weight (denoted as M-bit weight) in DeepShift is larger than 1, it also needs (M-1) N-bit adders and M groups of N-bit serial shift registers. The memristor kernel is made of two parallel 1-Transistor-1-Memristor (1T1R) cells followed by one differential circuit (Supplemental Information S2). The XNOR kernel is the simplest one, containing only several AND/NAND logic gates (Supplemental Information S3).
Supplemental Information S4 lists the power consumption of the different kernels Horowitz (2014); You et al. (2020); Thakre and Srivastava (2015). It can be seen that the bit-level operation and the analogue memristor cost the least. The adder operation and the shift operation consume similar hardware resources and energy, and both save a great deal compared to the multiply operation. For instance, the FP32 (floating-point 32-bit, a typical data type used in CPUs or GPUs) multiplication operation consumes 4.11 times the energy and 1.84 times the logic area of the FP32 adder. And the FIX16 (fixed-point 16-bit, a typical data type used in FPGAs) multiplication operation consumes 15.7 times the energy and 14.8 times the logic area of the FIX16 adder. The typical circuit area of each kernel can be checked in Supplemental Information S5.
According to the above analysis, we summarize the energy consumption of a single kernel operation for the different networks in Figure 2(c). The BNN and the memristor network consume the least energy due to the bit-level operation and the analogue circuits, respectively. However, these two achieve the worst network performance. Besides, the memristor networks have to involve numbers of digital-to-analog and analog-to-digital converters, which will indeed largely increase the chip area as well as the power consumption, and this has not been counted in the kernel energy comparison. Our AdderNet exhibits extremely low power due to the low cost of the adder and comparator operations. The low-bit (usually 4-8 bit) CNN and the shift-based network (containing numbers of flip or adder operations) exhibit somewhat higher computation cost but can still theoretically achieve about a 50%-90% decrease in energy dissipation compared to CNN.
3 Energyefficient AdderNet on FPGA
3.1 Quantization with Shared Scaling Factor
Quantization is an important model compression technique that represents weights and features with low-bit values Jacob et al. (2018); Hubara et al. (2016). To strike a balance between high accuracy and low power consumption, an 8-bit or 16-bit quantization scheme is often used. In a CNN, the input features and the weights are usually quantized with different scaling factors, so the decimal points of features and weights are also different; this does not cause trouble in the low-bit MAC operation. In AdderNet, however, if the features and weights were compressed with different scaling factors, the convolution operands would first have to be point-aligned with shift operations before the adder operation, increasing the power and logic consumption in hardware.
Figure 3(a) and 3(b) show the distributions of input features and weights in AdderNet, respectively, in which the different colors represent different convolution layers of ResNet-18. It can be seen that the majority (90%) of the input features range from -2 to 2, and so do the weights. This relatively small difference makes it possible to quantize the two sets of data together, namely the input features and weights can be quantized with the same scaling factor, which is different from CNN (separate scaling factors for features and weights). Figure 3(c) exhibits the typical quantization method. If the clip region is chosen to be -2 to 2 in 8-bit quantization, 96% of the feature information and 98% of the weight information can be retained, thus guaranteeing high performance. Besides, the shared scaling factor ensures that the adder convolution kernel in hardware can start the calculation directly, without point aligning, which is hardware friendly.
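A minimal sketch of the shared-scaling-factor scheme may help; the [-2, 2] clip region and the rounding rule used here are illustrative assumptions, not the paper's exact hardware rule:

```python
def quantize_shared(features, weights, bits=8, clip=2.0):
    """Quantize features and weights with ONE shared scaling factor.
    Sketch only: clip = 2.0 assumes a [-2, 2] clip region, and Python's
    built-in rounding stands in for whatever rule the hardware uses."""
    qmax = 2 ** (bits - 1) - 1                             # 127 for 8-bit
    scale = clip / qmax                                    # shared by both operands
    q = lambda x: max(-qmax, min(qmax, round(x / scale)))  # quantize + clip
    return [q(v) for v in features], [q(v) for v in weights], scale

f, w = [0.5, -1.25, 1.9], [0.25, 0.0, -2.0]
qf, qw, scale = quantize_shared(f, w)
# Because both operands share one scale, |qf - qw| needs no point alignment;
# the result is rescaled just once at the end:
out = -sum(abs(a - b) for a, b in zip(qf, qw)) * scale
```

With separate scales (as in CNN quantization), the integer subtraction inside the adder kernel would first require a shift to align the decimal points, which is exactly the overhead the shared factor avoids.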
The detailed performance of the quantized AdderNet ResNet-18 is shown in Figure 3(d). The original Top-1 and Top-5 accuracies of the full-precision network are 68.8% and 88.6%, respectively. After 8-bit quantization, the Top-1 and Top-5 accuracies remain 68.8% and 88.5%, with nearly zero accuracy loss. We then study the compression performance at even lower bit-widths. As plotted in Figure 3(d), the network can still reach 65.5% Top-1 and 86.2% Top-5 accuracy at 5-bit precision, while 4-bit quantization exhibits relatively large performance degradation, which is consistent with the above discussion of the parameter distributions. The compression of ResNet-50 and its comparison to the quantized CNN can be checked in Supplemental Information S6 and S7, respectively.
4 Hardware Design for Energyefficient AdderNet
The state-of-the-art hardware platforms for accelerating deep learning algorithms are field programmable gate arrays (FPGA), application specific integrated circuits (ASIC), central processing units (CPU), graphics processing units (GPU), and digital signal processors (DSP). Among these approaches, FPGA-based accelerators have attracted great attention thanks to their good performance (much better than CPU, comparable to GPU), high energy efficiency, fast development cycle (several months, much faster than ASIC), and capability of reconfiguration.
The structure of an FPGA convolutional accelerator Chen et al. (2014a, b); Zhang et al. (2016, 2015); Ma et al. (2018) can be roughly divided into 4 parts: the parallel kernel-operation core, the data storage unit and Input/Output ports, the data path control module, and others such as the pooling and BN units. Generally, the convolutional kernel core occupies most of the logic resources due to the Single Instruction Multiple Data (SIMD) nature of FPGA designs Zhang et al. (2015). Therefore, hardware simplification and optimization of the convolutional kernel is of great significance.
The numbers of input and output channels in a convolutional network are usually powers of 2 (such as 64/128/256/512), which makes them suitable for parallel computing. Typical examples are Cambricon's DianNao Chen et al. (2014b, a), Google's TPU Jouppi et al. (2017), and NVIDIA's open-source CNN accelerator, namely the NVIDIA Deep Learning Accelerator (NVDLA). In this work, we consider P input channels to be summed in the adder tree for each of the P output channels (Figure 4(a)). Suppose the data width in the network is DW; then the consumption of AdderNet in the parallel convolutional kernels is P*(P*DW*2), where the factor 2 comes from the fact that there are two adders in each AdderNet kernel. The consumption of AdderNet in the adder tree is P*([DW+log2(P)]*(P-1)), in which (P-1) is the number of adders and [DW+log2(P)] is the data width in the adder tree. Therefore the theoretical logic resource consumption (and also energy consumption) of AdderNet is

C_AdderNet = P*(P*DW*2) + P*([DW+log2(P)]*(P-1)).   (2)
The consumption of CNN in the parallel convolutional kernels can be calculated as P*(P*DW*DW), where the second DW represents the theoretical cost ratio of a multiplier to an adder. The consumption in the adder tree is P*([2*DW+log2(P)-1]*(P-1)), in which (P-1) is the number of adders, 2*DW is the data width after the multiplication kernel, and [2*DW+log2(P)-1] is the data width in the adder tree. Thus the theoretical resource consumption of CNN is

C_CNN = P*(P*DW*DW) + P*([2*DW+log2(P)-1]*(P-1)).   (3)
If DW is fixed at 16 and P is designed to be 64, AdderNet theoretically obtains an 81.6% reduction in logic resource utilization (and also power consumption) compared to CNN. However, measuring the practical energy-efficiency improvement of AdderNet is still a challenge. This mainly comes from the data Input/Output overhead: moving data from the outside main memory (Dynamic Random Access Memory, DRAM) to the computation part causes an enormous amount of energy consumption. To verify the accurate energy and logic resource consumption, we have designed two kinds of hardware implementations as follows.
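The theoretical reduction follows directly from equations (2) and (3); a short check for DW = 16 and P = 64:

```python
import math

def adder_cost(P, DW):
    """Theoretical logic cost of AdderNet kernels plus adder tree, equation (2)."""
    kernels = P * (P * DW * 2)                    # two adders per kernel
    tree = P * ((DW + math.log2(P)) * (P - 1))    # (P-1) adders of width DW+log2(P)
    return kernels + tree

def cnn_cost(P, DW):
    """Theoretical logic cost of CNN kernels plus adder tree, equation (3)."""
    kernels = P * (P * DW * DW)                            # multiplier ~ DW adder-equivalents
    tree = P * ((2 * DW + math.log2(P) - 1) * (P - 1))     # width doubles after the multiply
    return kernels + tree

saving = 1 - adder_cost(64, 16) / cnn_cost(64, 16)
print(f"{saving:.4f}")  # 0.8165, i.e. the ~81.6% reduction quoted above
```

The dominant term is the kernel cost, where the multiplier's DW-fold cost over an adder drives nearly all of the savings.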
Figure 4(b) shows the overall design of our general-purpose accelerator implemented on a Xilinx Zynq UltraScale+ MPSoC series board, which integrates the Processing System (PS, namely the ARM Cortex-series CPU) and the Programmable Logic (PL, namely the programmable FPGA) together. The convolution accelerator is composed of the external memory, the on/off-chip interconnect, the on-chip buffers, and the parallel computation processing elements. The data transportation between the PS and PL is controlled by the Advanced eXtensible Interface (AXI) interconnection protocol, in which the AXI-Lite protocol is used for parameter configuration and the AXI-Full protocol is used for weight/feature transmission.
Figure 4(c) and 4(d) show the synthesized results of the 16-bit and 8-bit convolution paths in FPGA, respectively. To make a fair comparison, no DSP resources are used here, and all of the modules in the accelerator are constructed using LUT (look-up table) resources. Figure 4(c1) draws the distribution of logic resources among the components (the convolution kernel, data storage/buffer, data path control module, and others) at different parallelism levels. It can be seen that the convolution kernel occupies more and more logic resources as the parallelism increases. For example, when the parallelism is 128, the computation unit occupies 50.48% of the whole system; when the parallelism is 2048, the computation unit dominates and takes up about 83.9% of the total logic resources. A similar result is obtained for the 8-bit AdderNet, as shown in Figure 4(c2).
More important is how much logic resource AdderNet can save compared to CNN. Figure 4(c3) plots the result, in which the red dots represent the convolution kernel and the black dots represent the total accelerator. It is clear that the total logic resource AdderNet saves over CNN increases with the parallelism level. When parallelism = 2048, AdderNet obtains a 67.6% reduction for the total system and an 80% reduction in the convolutional part, which is consistent with the theoretical value according to equations (2) and (3). Figure 4(d) shows the similar comparison of AdderNet and CNN for the 8-bit network. The 8-bit AdderNet saves about 70% in the convolution computation and 58% in the total system, and these values can be further improved if the parallelism grows even larger.
We now turn to the practical comparison of AdderNet and CNN on board (Xilinx Zynq UltraScale+ MPSoC ZCU104). Due to the limited logic resources of the ZCU104, the parallelism of CNN is restricted to 1024, and so is that of AdderNet. The first advantage of AdderNet is its capability of operating at a higher frequency with higher performance. Because a multiplier (in a combinational logic design) has a much higher logic gate delay than an adder, it is difficult for CNN to obtain positive setup-time and hold-time slack in static timing analysis, especially at high frequency. In our design, the highest operating frequency of CNN is 214 MHz, while that of AdderNet is 250 MHz. For running ResNet-18, the CNN exhibits 424 GOPs (Giga operations per second) for the convolution calculation and 307 GOPs for the whole network, while AdderNet achieves as high as 495 GOPs for the convolution and 358.6 GOPs for the whole network, which is among the best results on similar hardware platforms (Supplemental Information S8).
Besides, the operating power of the adder convolution is much lower than that of CNN. We measured the practical power consumption of AdderNet and CNN at 214 MHz during convolution, counting all activities such as the off-chip data access, data input/output transportation, control module, and the convolution itself. After subtracting the ZCU104 embedded system operating power (which can be treated as the baseline or background, 14 W), the intrinsic power of the AdderNet convolution is 1.34 W, while that of CNN is 2.57 W, which means AdderNet obtains a 47.85% reduction in practical energy consumption.
The above deviation from the theoretical value mainly comes from the data transportation between the on-chip computation part and the off-chip memory. To avoid this, we deploy a small-scale network (i.e., LeNet-5) on a Zynq-7020 (with the Xilinx XC7Z020 chip), in which no data Input/Output overhead is involved and all of the weight parameters and intermediate computation results are stored on board. As shown in Figure 5(a), the LeNet-5 contains 2 convolutional layers, where the first one has 1 input channel and 6 output channels, and the second one has 6 input channels and 16 output channels. To accelerate the convolution operation, we use parallel computation in both the input and output channels. Therefore, there are 6 parallel kernel operators for the first convolutional layer and 96 for the second layer.
Figure 5(b) and 5(c) show the comparison results of CNN and AdderNet. For the 16-bit network, our AdderNet obtains 70.3%, 80.32% and 71.4% reductions compared to CNN in the number of LUTs (equivalent to the chip area) in conv-layer1, conv-layer2 and the total network, respectively. Besides, the drops in energy consumption of AdderNet reach 70.22%, 88.29%, and 77.91%, respectively. The result is quite close to the estimated value of 81%. For the 8-bit network, AdderNet obtains 46.76%, 66.86% and 61.63% reductions over CNN in logic resource utilization, and 48.33%, 72.96%, and 56.57% in energy consumption, which is also close to the estimated value.

5 Implementation Details
Optimization of AdderNet.
The backpropagation algorithm is utilized to calculate the gradients of AdderNet Chen et al. (2020), just as in conventional neural networks. The loss function for the image recognition task is the cross-entropy loss. To improve the performance of AdderNet, we also apply the distillation loss Xu et al. (2020) by using the CNN as the teacher network (Supplemental Information S9).

Datasets.
We have verified the performance of AdderNet on various computer vision datasets including CIFAR100 and ImageNet. CIFAR100 dataset Krizhevsky and Hinton (2009) consists of 50,000 32x32 colour training images and 10,000 test images from 100 classes. ImageNet ILSVRC2012 Deng et al. (2009) is a largescale image classification dataset which is composed of about 1.2 million training images and 50,000 validation images belonging to 1000 categories. The commonly used data augmentation strategy including random crop and random flip is adopted for image preprocessing He et al. (2016).
Training details.
All the networks are trained on NVIDIA Tesla V100 GPUs using PyTorch. For CIFAR-100, the models are trained for 400 epochs with a batch size of 256; the learning rate starts from 0.1 and decays with a cosine scheduler, and the weight decay is set to 0.0005. For ImageNet, the models are trained for 150 epochs with a batch size of 256; the learning rate starts from 0.1 and decays with a cosine scheduler, and the weight decay is set to 0.0001.
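The cosine decay mentioned above follows the standard schedule; the helper below is a sketch of it (the exact scheduler options, e.g. a minimum rate or warm-up, are not specified in the text and are assumed absent):

```python
import math

def cosine_lr(epoch, total_epochs, base_lr=0.1):
    """Cosine-decayed learning rate starting at base_lr and ending at zero;
    a standard cosine schedule, assumed here since the exact scheduler
    options are not given in the text."""
    return 0.5 * base_lr * (1 + math.cos(math.pi * epoch / total_epochs))

print(cosine_lr(0, 400))    # 0.1 at the first epoch
print(cosine_lr(200, 400))  # 0.05 halfway through the CIFAR-100 run
```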
Hardware information.
The accelerator is designed and simulated with Vivado (v2018.1). The hardware platform for the largescale network is the Zynq UltraScale + MPSoC ZCU104 board which has a Xilinx FPGA chip of XCZU7EV2FFVC1156 MPSoC, and the smallscale network is deployed on ZYNQ7020 board with Xilinx XC7Z020 chip. The hardware implementation runs on PYNQ (Python Productivity for ZYNQ) zcu104_v2.5 environment.
6 Conclusion
This paper studies the low-bit quantization algorithm for the adder neural network (AdderNet) and the corresponding hardware design, achieving state-of-the-art performance in terms of both accuracy and hardware overhead. Specifically, AdderNet exhibits high-performance and low-power advantages over its counterparts such as the conventional multiplication-kernel CNN, the novel memristor-kernel CNN, and the shift-kernel CNN. Besides, we propose the shared-scaling-factor quantization method for AdderNet, which is not only hardware friendly but also shows guaranteed performance with nearly zero accuracy loss down to the 6-bit network. Finally, we design both specific and general-purpose convolution accelerators for implementing neural networks on FPGA. In theory, the adder kernel can obtain about an 81% reduction in resource consumption compared to the multiplication kernel (i.e., the conventional CNN). Experimental results illustrate that by applying AdderNet, we achieve about a 1.16x speedup ratio, a 67.6%-71.4% decrease in logic resource utilization (chip area), and a 47.85%-77.9% decrease in power consumption compared to CNN with exactly the same circuit design.
Acknowledgments
This work was supported by Shenzhen Science and Technology Innovation Committee JCYJ 20200109115210307.
Supplemental Information
S1: Details of adder convolution kernel
To calculate the absolute difference between weight and feature, the adder kernel can be designed with a one-Comparator-one-Adder (1C1A) scheme or a two-Adder (2A) scheme (Figure 8). The 1C1A scheme first compares the two inputs, then subtracts the smaller one from the larger one. The advantage of this design is the lower resource consumption of the digital comparator. However, it suffers from a relatively larger gate delay, which may decrease the circuit speed. The 2A scheme can operate at a higher frequency thanks to its two parallel adders, but an adder is more complex than a comparator, so the circuit area is larger. In this work, we choose the 2A scheme for higher performance.
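In software terms, the two schemes compute the same absolute difference in a different order; a toy sketch of the two data paths:

```python
def abs_diff_1c1a(f, w):
    """1-Comparator-1-Adder scheme: compare first, then do ONE subtraction,
    always larger minus smaller (serial, hence the longer gate delay)."""
    return f - w if f >= w else w - f

def abs_diff_2a(f, w):
    """Two-Adder scheme: compute both differences in parallel and keep the
    non-negative one (in hardware, selected by the sign bit of one result)."""
    a, b = f - w, w - f
    return a if a >= 0 else b

# Both schemes agree with |f - w| on every input pair
for f, w in [(7, 3), (3, 7), (-2, 5), (4, 4)]:
    assert abs_diff_1c1a(f, w) == abs_diff_2a(f, w) == abs(f - w)
```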
S2: Memristor and memristor network
The reason that the memristor can be used in neural networks is that its conductance can be modulated by electrical pulses. Figure 9 shows one typical memristor and its performance. The conductance can be tuned across several discrete levels, which is similar to the weight parameters of a network. However, the number of conductance levels in a memristor is usually limited to only 4-6 bits, and the conductance variation issue is unavoidable.
The memristor network is usually designed with the one-Transistor-one-Memristor (1T1R) structure, in which the transistor individually controls the memristor through the write-line (WL) and select-line (SL). Because the conductance of a memristor is always positive, while a weight parameter in a network can be either positive or negative, a differential 1T1R-1T1R circuit is used (green region in Figure 9(b)). The multiply-accumulate (MAC) operation can be realized by directly using Ohm's law and Kirchhoff's law in the memristor array. However, it needs a large number of digital-to-analog and analog-to-digital converters (DAC/ADC) in the peripheral controller module, which will inevitably and largely increase both the chip area and the power consumption.
S3: Binary neural network and Binary memristor network
The XNOR-kernel based binary neural network (BNN) consumes the least logic resources and energy, because the XNOR kernel contains only several AND/NAND logic gates, which is much more lightweight than all of the other kernels.
The combination of the binary neural network and the memristor network is the novel binary memristor network Huang et al. (2020). It is lightweight, highly robust, and works well on challenging visual tasks. However, the device may suffer from being stuck in the Set/Reset state (Figure 10(d)), and it also needs DAC/ADCs in the peripheral circuits.
S4: Detailed comparison of energy consumption in different convolution kernels
S5: Detailed comparison of logic circuit area in different convolution kernels
S6: Quantization of ResNet50
The quantization results with the proposed sharedscale quantization method are shown in Figure 6.
S7: Performance comparison of AdderNet and CNN after Quantization
We compare the quantization results of AdderNet and CNN as shown in Figure 7.
S8: Performance comparison of different FPGA neural network
S9: Accuracy and loss of AdderNet in training
We plot the accuracy and loss curves for better understanding the training of AdderNet (Figure 14).
References
 [1] (2017) An opencl™ deep learning accelerator on arria 10. In Proceedings of the 2017 ACM/SIGDA International Symposium on FieldProgrammable Gate Arrays, pp. 55–64. Cited by: S8: Performance comparison of different FPGA neural network.

[2] (2020) Constructing energy-efficient mixed-precision neural networks through principal component analysis for edge intelligence. Nature Machine Intelligence 2 (1), pp. 43–55. Cited by: §1, §2.2.
[3] (2020) AdderNet: do we really need multiplications in deep learning?. In CVPR. Cited by: §1, §2.1, §5.

[4] (2014) DianNao: a small-footprint high-throughput accelerator for ubiquitous machine learning. ACM SIGARCH Computer Architecture News 42 (1), pp. 269–284. Cited by: §4.
[5] (2014) DaDianNao: a machine-learning supercomputer. In 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 609–622. Cited by: §4.
[6] (2009) ImageNet: a large-scale hierarchical image database. In CVPR, pp. 248–255. Cited by: §1, §2.2, §5.
[7] (2019) DeepShift: towards multiplication-less neural networks. arXiv preprint arXiv:1905.13298. Cited by: §1, §2.1, §2.2, S4: Detailed comparison of energy consumption in different convolution kernels, S5: Detailed comparison of logic circuit area in different convolution kernels.
[8] (2017) SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. In ICLR. Cited by: §1.
[9] (2017) FP-DNN: an automated framework for mapping deep neural networks onto FPGAs with RTL-HLS hybrid templates. In 2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), pp. 152–159. Cited by: S8: Performance comparison of different FPGA neural networks.
[10] (2017) Angel-Eye: a complete design flow for mapping CNN onto embedded FPGA. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 37 (1), pp. 35–47. Cited by: S8: Performance comparison of different FPGA neural networks.
[11] (2016) Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. In ICLR. Cited by: §1.
[12] (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778. Cited by: §1, §5.
[13] (2014) 1.1 Computing's energy problem (and what we can do about it). In 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), pp. 10–14. Cited by: §1, §2.2, Figure 11.
[14] (2017) MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861. Cited by: §1.
[15] (2020) Global-gate controlled one-transistor one-digital-memristor structure for low-bit neural network. IEEE Electron Device Letters 42 (1), pp. 106–109. Cited by: §2.2, S3: Binary neural network and Binary memristor network.
[16] (2016) Binarized neural networks. In NeurIPS, pp. 4107–4115. Cited by: §1, §3.1.
[17] (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In ICML. Cited by: §2.2.
[18] (2018) Quantization and training of neural networks for efficient integer-arithmetic-only inference. In CVPR. Cited by: §1, §3.1.

[19] (2017) In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture, pp. 1–12. Cited by: §4.
[20] (2009) Learning multiple layers of features from tiny images. Technical report, Citeseer. Cited by: §5.
[21] (2015) Deep learning. Nature 521 (7553), pp. 436–444. Cited by: §1, §2.1.
[22] (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §1, §2.1, §2.2.
[23] (2019) Long short-term memory networks in memristor crossbar arrays. Nature Machine Intelligence 1 (1), pp. 49–57. Cited by: §1.
[24] (2017) Pruning filters for efficient ConvNets. In ICLR. Cited by: §1.
[25] (2016) A high performance FPGA-based accelerator for large-scale convolutional neural networks. In 2016 26th International Conference on Field Programmable Logic and Applications (FPL), pp. 1–9. Cited by: S8: Performance comparison of different FPGA neural networks.
[26] (2018) Optimizing the convolution operation to accelerate deep neural networks on FPGA. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 26 (7), pp. 1354–1367. Cited by: §4.
[27] (2018) NVIDIA deep learning accelerator. Note: http://nvdla.org/primer.html. Cited by: §4.
[28] (2016) Efficient FPGA acceleration of convolutional neural networks using logical-3D compute array. In 2016 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 1393–1398. Cited by: S8: Performance comparison of different FPGA neural networks.
[29] (2016) XNOR-Net: ImageNet classification using binary convolutional neural networks. In ECCV, pp. 525–542. Cited by: §1, §2.2, S4: Detailed comparison of energy consumption in different convolution kernels, S5: Detailed comparison of logic circuit area in different convolution kernels.
[30] (2018) MobileNetV2: inverted residuals and linear bottlenecks. In CVPR, pp. 4510–4520. Cited by: §1.
[31] (2015) Very deep convolutional networks for large-scale image recognition. In ICLR. Cited by: §2.2.
[32] (2008) The missing memristor found. Nature 453 (7191), pp. 80–83. Cited by: §1.
[33] (2015) Design and analysis of low-power high-speed clocked digital comparator. In 2015 Global Conference on Communication Technologies (GCCT), pp. 650–654. Cited by: §2.2, Figure 11, Figure 12.
[34] (2017) Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 6000–6010. Cited by: §1.
[35] (2019) In situ training of feed-forward and recurrent convolutional memristor networks. Nature Machine Intelligence 1 (9), pp. 434–442. Cited by: §1.
[36] (2017) Automated systolic array architecture synthesis for high throughput CNN inference on FPGAs. In Proceedings of the 54th Annual Design Automation Conference 2017, pp. 1–6. Cited by: S8: Performance comparison of different FPGA neural networks.
[37] (2020) Kernel based progressive distillation for adder neural networks. In NeurIPS. Cited by: §1, §2.2, §5.

[38] (2017) Face classification using electronic synapses. Nature Communications 8 (1), pp. 1–8. Cited by: §1.
[39] (2020) Fully hardware-implemented memristor convolutional neural network. Nature 577 (7792), pp. 641–646. Cited by: §1, §2.1, §2.2, S4: Detailed comparison of energy consumption in different convolution kernels, S5: Detailed comparison of logic circuit area in different convolution kernels.
[40] (2020) ShiftAddNet: a hardware-inspired deep network. In NeurIPS. Cited by: §2.1, §2.2, Figure 11.
[41] (2015) Optimizing FPGA-based accelerator design for deep convolutional neural networks. In Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 161–170. Cited by: §4.
[42] (2016) Energy-efficient CNN implementation on a deeply pipelined FPGA cluster. In Proceedings of the 2016 International Symposium on Low Power Electronics and Design, pp. 326–331. Cited by: §4, S8: Performance comparison of different FPGA neural networks.