1 Introduction
Powerful deep neural networks (DNNs) come at the price of prohibitive resource costs during both inference and training, limiting the feasibility and scope of DNN applications on resource-constrained devices for more pervasive intelligence. DNNs are largely composed of multiplication operations in both forward and backward propagation, which are much more computationally costly than additions [horowitz2014energy]. This roadblock has driven several attempts to design new types of hardware-friendly deep networks that rely less on heavy multiplications in order to achieve higher energy efficiency. ShiftNet [wu2018shift, chen2019all] adopted spatial shift operations paired with pointwise convolutions to replace a large portion of convolutions. DeepShift [elhoushi2019deepshift] employed an alternative of bitwise shifts, which are equivalent to multiplying the input by powers of 2. Lately, AdderNet [chen2019addernet] pioneered demonstrating the feasibility and promise of replacing all convolutions with merely addition operations.
This paper takes one step further along this direction of multiplication-less deep networks, by drawing on a very fundamental idea in hardware-design practice, computer processors, and even digital signal processing. It has long been known that multiplications can be performed with additions and logical bit-shifts [xue1986adaptive, 7780065], whose hardware implementations are very simple and much faster [marchesi1993fast], without compromising the result quality or precision. On currently available processors, a bit-shift instruction is also faster than a multiply instruction and can be leveraged to multiply (shift left) and divide (shift right) by powers of two. Multiplication (or division) by a constant is then implemented using a sequence of shifts and adds (or subtracts). This clever “shortcut” saves arithmetic operations, and can readily be applied to accelerating the hardware implementation of any machine learning algorithm involving multiplication (whether scalar, vector, or matrix). But our curiosity goes well beyond this:
can we learn from this hardware-level “shortcut” to design efficient learning algorithms? This question uniquely motivates our work: in order to be more “hardware-friendly”, we strive to re-design our model to be “hardware-inspired”, leveraging the successful experience directly from the efficient hardware-design community. Specifically, we explicitly re-parameterize our deep networks by replacing all convolutional and fully-connected layers (both built on multiplications) with two multiplication-free layers: bit-shift and add. Our new type of deep model, named ShiftAddNet, immediately leads to both energy-efficient inference and training algorithms.
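To make the hardware “shortcut” above concrete, here is a minimal, illustrative sketch (our own hypothetical helper, not from the paper) of multiplying by a constant using only shifts and adds, by walking the binary expansion of the constant:

```python
def shift_add_mul(x: int, c: int) -> int:
    """Multiply x by a non-negative integer constant c using only
    shifts and adds: x * c = sum of (x << i) over the set bits i of c.
    Illustrative only; real hardware designs would hard-wire the shift
    schedule for a fixed constant c.
    """
    result, i = 0, 0
    while c:
        if c & 1:              # bit i of c is set
            result += x << i   # add x * 2^i
        c >>= 1
        i += 1
    return result
```

For example, `shift_add_mul(7, 10)` computes `(7 << 1) + (7 << 3) = 14 + 56 = 70`, using two shifts and one add instead of a multiply.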
We note that ShiftAddNet seamlessly integrates bit-shift and addition together, with strong motivations that address several pitfalls in prior arts [wu2018shift, chen2019all, elhoushi2019deepshift, chen2019addernet]. Compared to utilizing spatial or bitwise shifts alone [wu2018shift, elhoushi2019deepshift], ShiftAddNet can be fully expressive as standard DNNs, while [wu2018shift, elhoushi2019deepshift] only approximate the original expressive capacity, since shift operations cannot span the entire continuous space of multiplicative mappings (e.g., bit-shifts can only represent the subset of power-of-2 multiplications). Compared to the fully additive model [chen2019addernet], we note that while repeated additions can in principle replace any multiplicative mapping, they do so in a very inefficient way. In contrast, by also exploiting bit-shifts, ShiftAddNet is expected to be more parameter-efficient than [chen2019addernet], which relies on adding templates. As a bonus, we notice that the bit-shift and add operations naturally correspond to coarse- and fine-grained input manipulations. We can exploit this property to trade off more flexibly between training efficiency and achievable accuracy, e.g., by freezing all bit-shifts and only training the add layers. ShiftAddNet with fixed shift layers can achieve up to 90% and 82.8% energy savings over fully additive models [chen2019addernet] and shift models [elhoushi2019deepshift] under floating-point or fixed-point precision training, while leading to comparable or better accuracies (−3.7% ∼ +31.2% and 3.5% ∼ 23.6%), respectively. Our contributions can be summarized as follows:

Uniquely motivated by hardware-design expertise, we combine two multiplication-free and complementary operations (bit-shift and add) to develop a hardware-inspired network called ShiftAddNet that is fully expressive and ultra-efficient.

We develop training and inference algorithms for ShiftAddNet. Leveraging the two operations’ distinct granularity levels, we also investigate ShiftAddNet’s trade-offs between training efficiency and achievable accuracy, e.g., by freezing all the bit-shift layers.

We conduct extensive experiments to compare ShiftAddNet with existing DNNs and multiplication-less models. Results on multiple benchmarks demonstrate its superior compactness, accuracy, training efficiency, and robustness. Specifically, we implement ShiftAddNet on a ZYNQ-7 ZC706 FPGA board [guide2012zynq] and collect real energy measurements for all benchmarks.
2 Related works
Multiplication-less DNNs. Shrinking the cost-dominant multiplications has been widely considered in many DNN designs for reducing computational complexity [howard2017mobilenets, 8050797]: [howard2017mobilenets] decomposes convolutions into separate depthwise and pointwise modules, which require fewer multiplications; and [lin2015neural, courbariaux2016binarized, NIPS2019_8971] binarize the weights or activations to construct DNNs consisting of sign changes paired with much fewer multiplications. Another trend is to replace multiplications with other, cheaper operations. Specifically, [chen2019all, wu2018shift] leverage spatial shift operations to shift feature maps, which need to be coupled with pointwise convolutions to aggregate spatial information; [elhoushi2019deepshift] fully replaces multiplications with both bitwise shift operations and sign changes; and [chen2019addernet, song2020addersr, xu2020kernel] trade multiplications for cheaper additions and develop a special backpropagation scheme for effectively training the add-only networks.
Hardware costs of basic operations. Compared to shift and add, multipliers can be very inefficient in hardware, as they incur high costs in terms of consumed energy/time and chip area. Shift and add operations can be a substitute for such multipliers; for example, they have been adopted to save compute resources and can be easily and efficiently performed by a digital processor [sanchez2013approach]. This hardware idea has been adopted to accelerate multilayer perceptrons (MLPs) in digital processors [marchesi1993fast]. Motivated by such hardware expertise, we fully replace multiplications in modern DNNs with merely shifts and adds, aiming to solve the drawbacks in existing shift-only or add-only replacement methods and to boost network efficiency over multiplication-based DNNs.
Relevant observations in DNN training. It has been shown that DNN training contains redundancy in various aspects [wang2019e2, you2019drawing, yang2019legonet, han2020ghostnet, zhou2020go, chen2020frequency]. For example, [liu2020orthogonal] explores an orthogonal weight training algorithm that over-parameterizes the networks via the multiplication between a learnable orthogonal matrix and fixed, randomly initialized weights, and argues that fixing the weights during training and only learning a proper coordinate system can yield good generalization for over-parameterized networks; and [juefei2017local] separates a convolution into spatial and pointwise convolutions, freezing the binary spatial convolution filters (called anchor weights) and only learning the pointwise convolutions. These works inspire the ablation study of fixing the shift parameters in our ShiftAddNet.
3 The proposed model: ShiftAddNet
In this section, we present our proposed ShiftAddNet. First, we introduce the motivation and hypotheses behind ShiftAddNet, and then discuss ShiftAddNet’s component layers (i.e., shift and add layers) from both hardware-cost and algorithmic perspectives, providing high-level background and justification for ShiftAddNet. Finally, we discuss a more efficient variant of ShiftAddNet.
3.1 Motivation and hypothesis
Driven by the long-standing tradition in the field of energy-efficient hardware implementation of replacing expensive multiplications with lower-cost bit-shifts and adds, we re-design DNNs by pipelining shift and add layers. We hypothesize that (1) while DNNs with merely either shift or add layers are in general less capable than their multiplication-based DNN counterparts, integrating these two weak players can lead to networks with much improved expressive capacity while maintaining their hardware-efficiency advantages; and (2) thanks to the coarse- and fine-grained input manipulations resulting from the complementary shift and add layers, such a new network pipeline may even lead to models that are comparable with multiplication-based DNNs in terms of task accuracy, while offering superior hardware efficiency.
3.2 ShiftAddNet: shift Layers
This subsection discusses the shift layers adopted in our proposed ShiftAddNet in terms of hardware efficiency and algorithmic capacity.
Tab. 1: Unit energy costs of multiplication, add, and shift operations on ASIC (45nm) and FPGA (ZYNQ-7 ZC706).
Operation | Format | ASIC (45nm) Energy (pJ) | Improv. | FPGA (ZYNQ-7 ZC706) Energy (pJ) | Improv.
Mult.     | FP32   | 3.7   | –    | 18.8  | –
Mult.     | FIX32  | 3.1   | –    | 19.6  | –
Mult.     | FIX8   | 0.2   | –    | 0.2   | –
Add       | FP32   | 0.9   | 4.1x | 0.4   | 47x
Add       | FIX32  | 0.1   | 31x  | 0.1   | 196x
Add       | FIX8   | 0.03  | 6.7x | 0.1   | 2x
Shift     | FIX32  | 0.13  | 24x  | 0.1   | 196x
Shift     | FIX8   | 0.024 | 8.3x | 0.025 | 8x
Hardware perspective. The shift operation is a well-known efficient hardware primitive, motivating the recent development of various shift-based efficient DNNs [elhoushi2019deepshift, wu2018shift, chen2019all]. Specifically, the shift layers in [elhoushi2019deepshift] reduce DNNs’ computation and energy costs by replacing the regular, cost-dominant multiplication-based convolution and linear operations (a.k.a. fully-connected layers) with bit-shift-based convolution and linear operations, respectively. Mathematically, such bit-shift operations are equivalent to multiplying by powers of 2. As summarized in Tab. 1, such shift operations can be extremely efficient compared to their corresponding multiplications. In particular, bit-shifts can save as much as 196× and 24× in energy cost over their multiplication counterparts, when implemented in a 45nm CMOS technology and a SOTA FPGA [zc706], respectively. In addition, for a 16-bit design, it has been estimated that the average power and area of multipliers are at least 9.7× and 1.45×, respectively, those of bit-shifts [elhoushi2019deepshift].
Algorithmic perspective. Despite their promising hardware efficiency, networks constructed with bit-shifts can compare unfavorably with their multiplication-based counterparts in terms of expressive efficiency. Formally, the expressive efficiency of architecture A is higher than that of architecture B if any function realized by B can be replicated by A, but there exist functions realized by A that cannot be replicated by B unless its size grows significantly larger [sharir2018on]. For example, it is commonly accepted that DNNs are exponentially more efficient than shallow networks, because a shallow network must grow exponentially large to approximate the functions represented by a DNN of polynomial size. For ease of discussion, we refer to [zhong2018shift] and use a loosely defined metric of expressiveness, called expressive capacity (accuracy), in this paper without loss of generality. Specifically, expressive capacity refers to the accuracy achieved by networks under the same or similar hardware cost, i.e., network A is deemed to have a better expressive capacity than network B if the former achieves a higher accuracy at the cost of the same or even fewer FLOPs (or energy cost). From this perspective, networks with bit-shift layers without full-precision latent weights are observed to be inferior to networks with add layers or multiplication-based convolution layers, as shown in prior arts [chen2019addernet, howard2017mobilenets] and validated in our experiments (see Sec. 4) under various settings and datasets.
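To make concrete why bit-shifts alone cover only power-of-2 multiplications, the sketch below projects a real-valued weight onto the shift-representable set {s · 2^p}. This is an illustrative rounding rule of our own, not necessarily the exact DeepShift quantization procedure:

```python
import numpy as np

def to_shift_weight(w):
    """Project real-valued weights onto the shift-representable set
    {s * 2^p : s a sign, p an integer}: keep the sign and round
    log2|w| to the nearest integer. Illustrative only; the actual
    DeepShift quantizer may differ in details.
    """
    w = np.asarray(w, dtype=np.float64)
    p = np.round(np.log2(np.maximum(np.abs(w), 1e-12)))
    return np.sign(w) * np.exp2(p)
```

For example, 0.3 maps to 0.25 and −0.1 to −0.125: arbitrary continuous weights collapse onto a sparse grid of signed powers of two, which is exactly the limitation on expressive capacity discussed above.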
3.3 ShiftAddNet: add Layers
As in the preceding subsection, here we discuss the add layers adopted in our proposed ShiftAddNet in terms of hardware efficiency and algorithmic capacity.
Hardware perspective. Addition is another well-known efficient hardware primitive. This has motivated the design of efficient DNNs built mostly on additions [chen2019addernet], and many works try to trade multiplications for additions in order to speed up DNNs [afrasiyabi2018non, chen2019addernet, Wang_2019_CVPR]. In particular, [chen2019addernet] investigates the feasibility of replacing multiplications with additions in DNNs, and presents AdderNets, which trade the massive multiplications in DNNs for much cheaper additions to reduce computational costs. As a concrete example in Tab. 1, additions can save up to 196× and 31× in energy cost over multiplications in fixed-point formats, and 47× and 4.1× in (more expensive) floating-point formats, when implemented in a 45nm CMOS technology and a SOTA FPGA [zc706], respectively. Note that the pioneering work [chen2019addernet] on addition-dominant DNNs presents its networks in floating-point formats.
Algorithmic perspective. While there have been no prior works studying add layers in terms of expressive efficiency or capacity, the results of SOTA bit-shift-based networks [elhoushi2019deepshift] and add-based networks [chen2019addernet], as well as our experiments under various settings, show that add-based networks in general have a better expressive capacity than their bit-shift-based counterparts. In particular, AdderNet [chen2019addernet] achieves a 1.37% higher accuracy than DeepShift [elhoushi2019deepshift] at the cost of similar or even lower FLOPs on ResNet-18 with the ImageNet dataset. Furthermore, when it comes to DNNs, the diversity of the learned features’ granularity is another factor important to the achieved accuracy [7280542]. In this regard, shift layers are deemed to extract large-grained features, as compared to the small-grained features learned by add layers.
3.4 ShiftAddNet implementation
3.4.1 Overview of the structure
To better validate our aforementioned Hypothesis (1), i.e., that integrating the two weak players (shift and add) into one network can lead to much improved task accuracy and hardware efficiency as compared to networks with merely one of the two, we adopt the designs of SOTA bit-shift-based and add-based networks to implement the shift and add layers of our ShiftAddNet in this paper. In this way, we can better verify that the resulting designs’ improved performance comes merely from the integration of the two, ruling out potential impacts of a different design. The overall structure of ShiftAddNet is illustrated in Fig. 1, and can be formulated as follows:
O = k_a(k_s(I, s · 2^p), w_a),    (1)
where I and O denote the input and output activations, respectively; k_s and k_a are kernel functions that perform the inner products and subtractions of a convolution, respectively; w_a denotes the weights in the add layers, and w_s = s · 2^p represents the weights in the shift layers, in which s are sign flip operators and the power-of-2 parameters p represent the bitwise shifts.
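The composition in Equ. (1) can be sketched in NumPy for a fully-connected setting (convolutions behave analogously): the shift layer is an inner product with s · 2^p weights, and the add layer computes negative ℓ1 distances to its weight columns, following the AdderNet-style subtraction kernel. All function and variable names here are our own, for illustration only:

```python
import numpy as np

def shift_layer(I, s, p):
    """Inner-product kernel with shift-representable weights s * 2^p
    (fully-connected analog of a convolutional shift layer)."""
    return I @ (s * np.exp2(p))          # I: (batch, d_in)

def add_layer(I, Wa):
    """Subtraction kernel: negative l1 distance between each input row
    and each column of Wa (AdderNet-style add layer)."""
    return -np.abs(I[:, :, None] - Wa[None, :, :]).sum(axis=1)

def shiftadd_forward(I, s, p, Wa):
    """O = k_a(k_s(I, s * 2^p), Wa): a shift layer feeding an add layer."""
    return add_layer(shift_layer(I, s, p), Wa)
```

Note how neither kernel uses a weight-by-activation multiplication that could not be realized with shifts and adds in hardware: the shift layer scales only by signed powers of two, and the add layer uses only subtractions and accumulation.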
Dimensions of shift and add layers. A shift layer in ShiftAddNet adopts the same strides and weight dimensions as that of the corresponding multiplication-based DNN (e.g., ConvNet), followed by an add layer which adapts its kernel sizes and input channels to match the reduced feature maps. Although ShiftAddNet thus contains slightly more weights than the corresponding ConvNet/AdderNet (e.g., 1.3 MB vs. MB in ConvNet/AdderNet (FP32) on ResNet-20), it consumes less energy to achieve similar accuracies, because data movement is the cost bottleneck in hardware acceleration [zhao2020smartexchange, li2020timely, 9197673]. ShiftAddNet can further be quantized to MB (FIX8) without hurting the accuracy, as will be demonstrated by the experiments in Sec. 4.2.
3.4.2 Backpropagation in ShiftAddNet
ShiftAddNet adopts the designs of SOTA bit-shift-based and add-based networks during backpropagation. Here, we explicitly formulate both the inference and backpropagation of the shift and add layers. The add layers during inference can be expressed as:
O[m, n, t] = − Σ_{i=0}^{d−1} Σ_{j=0}^{d−1} Σ_{k=0}^{c_in−1} |I[m+i, n+j, k] − w_a[i, j, k, t]|,    (2)
where I ∈ R^{H×W×c_in}, w_a ∈ R^{d×d×c_in×c_out}, and O ∈ R^{H′×W′×c_out}; specifically, c_in and c_out, H×W and H′×W′, and d stand for the number of input and output channels, the sizes of the input and output feature maps, and the size of the weight filters, respectively; O, I, and w_a denote the output activations, the input activations, and the weights, respectively. Based on the above notation, we formulate the add layers’ backpropagation in the following equations:
∂O[m, n, t] / ∂w_a[i, j, k, t] = I[m+i, n+j, k] − w_a[i, j, k, t],    (3)
∂O[m, n, t] / ∂I[m+i, n+j, k] = HT(w_a[i, j, k, t] − I[m+i, n+j, k]),    (4)
where HT denotes the HardTanh function, following AdderNet [chen2019addernet], to prevent gradients from exploding. Note that the difference from AdderNet is that the strides of ShiftAddNet’s add layers are always equal to one, while its shift layers share the same strides as the corresponding ConvNet.
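The two gradient rules of the add layer can be sketched for a simplified fully-connected analog of the notation above (shapes and names are ours, for illustration): the weight gradient uses the full-precision difference, while the input gradient is the HardTanh-clipped difference, following AdderNet:

```python
import numpy as np

def add_layer_grads(I, Wa):
    """Per-output gradients of O[t] = -sum_k |I[k] - Wa[k, t]|.

    dO/dWa is the full-precision difference I - Wa (Equ. (3));
    dO/dI is HardTanh(Wa - I), i.e., the difference clipped to
    [-1, 1] (Equ. (4)), which keeps gradients from exploding.
    """
    diff = I[:, None] - Wa                 # (d_in, d_out)
    dO_dWa = diff                          # Equ. (3)
    dO_dI = np.clip(-diff, -1.0, 1.0)      # HardTanh(Wa - I), Equ. (4)
    return dO_dWa, dO_dI
```

Both rules agree in sign with the exact (sub)gradient −sign(I − Wa); the full-precision and clipped variants simply provide smoother, bounded updates.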
Next, we use the above notation to introduce both the inference and backpropagation of ShiftAddNet’s shift layers, with one additional symbol S denoting the stride:
O[m, n, t] = Σ_{i=0}^{d−1} Σ_{j=0}^{d−1} Σ_{k=0}^{c_in−1} I[m·S+i, n·S+j, k] · s[i, j, k, t] · 2^{p[i, j, k, t]},    (5)
∂O[m, n, t] / ∂w_s[i, j, k, t] = I[m·S+i, n·S+j, k],    (6)
where O, I, and w_s = s · 2^p denote the output activations, the input activations, and the weights of the shift layers, respectively; ∂O/∂I follows Equ. (4) to perform backpropagation, since a shift layer is followed by an add layer in ShiftAddNet.
3.4.3 ShiftAddNet variant: fixing the shift layers
Inspired by the recent success of freezing anchor weights [juefei2017local, liu2020orthogonal] in over-parameterized networks, we hypothesize that freezing the “over-parameterized” shift layers (large-grained anchor weights) in ShiftAddNet can potentially lead to good generalization, motivating us to develop a variant of ShiftAddNet with fixed shift layers. In particular, ShiftAddNet with fixed shift layers simply means that the shift weight filters, i.e., s and p in Equ. (1), remain unchanged after initialization. Training this variant is straightforward: the shift weight filters do not need to be updated (i.e., the corresponding gradient calculations are skipped), while the error can be backpropagated through the fixed shift layers in the same way as through learnable shift layers (see Equ. (6)). Moreover, we further prune the fixed shift layers to retain only the necessary large-grained anchor weights, yielding an even more energy-efficient ShiftAddNet.
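A minimal sketch of the fixed-shift-layer variant’s update rule: gradients still flow through the frozen shift layers, but the shift parameters themselves simply skip the optimizer step. Parameter names ('s', 'p', 'Wa') and the plain-SGD form are our own assumptions for illustration:

```python
def sgd_step(params, grads, lr=0.1, fix_shift=True):
    """One plain-SGD update for a ShiftAddNet-style parameter dict.

    With fix_shift=True, the shift parameters ('s', 'p') keep their
    initialization (frozen anchor weights) and only the add-layer
    weights ('Wa') are updated; backpropagation through the shift
    layers is unchanged, we merely skip their own weight updates.
    """
    for name, g in grads.items():
        if fix_shift and name in ("s", "p"):
            continue                      # frozen: skip the update
        params[name] = params[name] - lr * g
    return params
```

Usage: with `params = {"s": 1.0, "p": 2.0, "Wa": 0.5}` and unit gradients, one step leaves s and p untouched and moves only Wa, saving the cost of computing and applying the shift-weight gradients.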
4 Experiment results
In this section, we first describe our experiment setup, and then benchmark ShiftAddNet over SOTA DNNs. After that, we evaluate ShiftAddNet in the context of domain adaptation. Finally, we present ablation studies of ShiftAddNet’s shift and add layers.
4.1 Experiment setup
Models and datasets. We consider two DNN models (i.e., ResNet-20 [He_2016_CVPR] and VGG19-small [simonyan2014very]) on six datasets: two classification datasets (i.e., CIFAR-10/100) and four IoT datasets (including MHEALTH [banos2014mhealthdroid], FlatCam Face [FlatCamFace], USC-HAD [zhang2012usc], and Headpose detection [gourier2004estimating]). Specifically, the Headpose dataset contains 2,760 images, of which a randomly sampled 80% is adopted for training and the remaining 20% for testing the correctness of three outputs: front, left, and right [boominathan2020phlatcam]; the FlatCam Face dataset contains 23,838 face images captured using a FlatCam lensless imaging system [FlatCamFace], which are resized to 76 × 76 before being used.
Training settings.
For the CIFAR-10/100 and Headpose datasets, training takes a total of 160 epochs with a batch size of 256, where the initial learning rate is set to 0.1 and then divided by 10 at the 80th and 120th epochs, respectively, and an SGD solver is adopted with a momentum of 0.9 and a weight decay of following [liu2018rethinking]. For the FlatCam Face dataset, we follow [Cao18] to pretrain the network on the VGGFace 2 dataset for 20 epochs before adapting to the FlatCam Face images. For the trainable shift layers, we follow [elhoushi2019deepshift] to adopt 50% sparsity by default; and for the MHEALTH and USC-HAD datasets, we follow [jiang2015human] to use a DCNN model trained for 40 epochs.
Baselines and evaluation metrics.
Baselines: We evaluate the proposed ShiftAddNet against two SOTA multiplication-less networks, AdderNet [chen2019addernet] and DeepShift (DeepShift (PS) by default) [elhoushi2019deepshift], and also compare it to the multiplication-based ConvNet [frankle2018the] under a comparable energy cost (∼30% more than AdderNet (FP32)). Evaluation metrics: For evaluating real hardware efficiency, we measure the energy cost of all DNNs on a SOTA FPGA platform, ZYNQ-7 ZC706 [guide2012zynq]. Note that our energy measurements in all experiments include DRAM access costs.
4.2 ShiftAddNet over SOTA DNNs on standard training
Experiment settings. For this set of experiments, we consider the general ShiftAddNet with learnable shift layers. We compare against the two SOTA multiplication-less baselines, AdderNet [chen2019addernet] and DeepShift [elhoushi2019deepshift]; the latter quantizes its activations to 16-bit fixed-point for shifting purposes, while its backpropagation uses floating-point precision. As floating-point additions are more expensive than multiplications [FPadder], we refer to SOTA quantization techniques [yang2020training, banner2018scalable] to quantize both the forward (weights and activations) and backward (errors and gradients) parameters to 32/8-bit fixed-point (FIX32/8), for evaluating the potential energy savings of both ShiftAddNet and AdderNet.
ShiftAddNet over SOTA on classification. The results on four datasets and two DNNs in Fig. 2 (a), (b), (d), and (e) show that ShiftAddNet can consistently outperform all competitors in terms of measured energy cost while improving task accuracies. Specifically, with full-precision floating-point (FP32), ShiftAddNet even surpasses both the multiplication-based ConvNet and AdderNet: when training ResNet-20 on CIFAR-10, ShiftAddNet reduces the training energy costs by 33.7% and 44.6% compared to AdderNet and ConvNet [frankle2018the], respectively, outperforming the SOTA multiplication-based ConvNet and thus validating our Hypothesis (2) in Section 3.1. ShiftAddNet also demonstrates notably improved robustness to quantization as compared to AdderNet: a quantized ShiftAddNet with 8-bit fixed-point representation reduces 65.1% ∼ 75.0% of the energy costs over the reported AdderNet (with floating-point precision, denoted as FP32) while offering comparable accuracies (−1.79% ∼ +0.18%), and achieves greatly higher accuracies (7.2% ∼ 37.1%) over the quantized AdderNet (FIX32/8) while consuming comparable or even less energy (−25.2% ∼ +25.2%). Meanwhile, ShiftAddNet achieves 2.41% ∼ 16.1% higher accuracies while requiring 34.1% ∼ 70.9% less energy, as compared to DeepShift [elhoushi2019deepshift]. This set of results also verifies our Hypothesis (1) in Section 3.1 that integrating the weak shift and add players can lead to improved network expressive capacity with negligible or even lower hardware costs.
We also compare ShiftAddNet with the baselines in an apples-to-apples manner based on the same quantization format (e.g., FIX32). For example, when evaluated on VGG19-small with CIFAR-10 (see Fig. 2 (c)), ShiftAddNet consistently (1) improves the accuracies by 11.6%, 10.6%, and 37.1% as compared to AdderNet in FIX32/16/8 formats, at comparable energy costs (−25.2% ∼ +15.7%); and (2) improves the accuracies by 26.8%, 26.2%, and 24.2% as compared to DeepShift (PS) using FIX32/16/8 formats, with comparable or slightly higher energy overheads. To further analyze ShiftAddNet’s improved robustness to quantization, we compare the discriminative power of AdderNet and ShiftAddNet by visualizing the class divergences using the t-SNE algorithm [van2014t], as shown in the supplement.
ShiftAddNet over SOTA on IoT applications. We further evaluate ShiftAddNet against the SOTA baselines on two IoT datasets to evaluate its effectiveness on real-world IoT tasks. As shown in Fig. 2 (c) and (f), ShiftAddNet again consistently outperforms the baselines under all settings in terms of efficiency-accuracy trade-offs. Specifically, compared with AdderNet, ShiftAddNet achieves 34.1% ∼ 80.9% energy cost reductions while offering 1.08% ∼ 3.18% higher accuracies; and compared with DeepShift (PS), ShiftAddNet achieves 34.1% ∼ 50.9% energy savings while improving accuracies by 5.5% ∼ 6.9%. This set of experiments shows that ShiftAddNet’s effectiveness and superiority extend to real-world IoT applications. We also observe similarly improved efficiency-accuracy trade-offs on the MHEALTH [banos2014mhealthdroid] and USC-HAD [zhang2012usc] datasets and report the performance in the supplement.
ShiftAddNet over SOTA on training trajectories. Fig. 3 (a) and (b) visualize the testing accuracy trajectories of ShiftAddNet and the two baselines versus both the training epochs and the energy cost, respectively, on ResNet-20 with CIFAR-10. We can see that ShiftAddNet achieves a comparable or higher accuracy with fewer epochs and lower energy cost, indicating its better generalization capability.
4.3 ShiftAddNet over SOTA on domain adaptation and finetuning
To further evaluate ShiftAddNet’s potential for on-device learning [li2020halo], we consider two training settings, adaptation and finetuning:

Adaptation. We split CIFAR-10 into two non-overlapping subsets. We first pretrain the model on one subset and then retrain it on the other to see how accurately and efficiently it can adapt to the new task. The same splitting is applied to the test set.

Finetuning. Similarly, we randomly split CIFAR-10 into two non-overlapping subsets; the difference is that each subset contains all classes. After pretraining on the first subset, we finetune the model on the other, expecting to see continuous growth in performance.
Tab. 2 compares the testing accuracies and training energy costs of ShiftAddNet and the baselines. We can see that ShiftAddNet always achieves a better accuracy than the two SOTA multiplication-less networks. First, compared to AdderNet, ShiftAddNet boosts the accuracy by 5.31% and 0.65% while reducing the energy cost by 56.6%, in the adaptation and finetuning scenarios, respectively; second, compared to DeepShift, ShiftAddNet notably improves the accuracy by 26.69% and 33.57% in the adaptation and finetuning scenarios, respectively, with a marginally increased energy cost (10.5%).
Tab. 2: Setting: ResNet-20 on CIFAR-10.
Methods             | Accuracy (%): Adaptation | Accuracy (%): Finetuning | Energy Costs (MJ)
DeepShift           | 58.41 | 51.31 | 9.88
AdderNet            | 79.79 | 84.23 | 25.41
ShiftAddNet         | 81.50 | 84.61 | 16.82
ShiftAddNet (Fixed) | 85.10 | 84.88 | 11.04
4.4 Ablation studies of ShiftAddNet
We next study ShiftAddNet’s shift and add layers for better understanding this new network.
4.4.1 ShiftAddNet: fixing the shift layers or not
ShiftAddNet with fixed shift layers. In this set of experiments, we study ShiftAddNet with the shift layers fixed versus learnable. As shown in Fig. 4, we can see that (1) overall, ShiftAddNet with fixed shift layers can achieve up to 90.0% and 82.8% energy savings over AdderNet (with floating-point or fixed-point precision) and DeepShift, while leading to comparable or better accuracies (−3.74% ∼ +31.2% and 3.5% ∼ 23.6%), respectively; and (2) interestingly, ShiftAddNet with fixed shift layers also surpasses the generic ShiftAddNet in two respects: first, it always demands less energy (25.2% ∼ 40.9%) to achieve a comparable or even better accuracy; and second, it can even achieve a better accuracy and better robustness to quantization (up to 10.8% improvement for 8-bit fixed-point training) than the generic ShiftAddNet with learnable shift layers, when evaluated with VGG19-small on CIFAR-100.
ShiftAddNet with its fixed shift layers pruned. As it has become common practice to prune multiplication-based DNNs before deploying them to resource-constrained devices, we are curious whether this practice extends to our ShiftAddNet. To this end, we randomly prune the shift layers at different ratios and compare the testing accuracy versus the training epochs for the pruned ShiftAddNets and the corresponding AdderNet. Fig. 5 shows that ShiftAddNet maintains its fast-convergence benefit even when the shift layers are largely pruned (e.g., up to 70%).
4.4.2 ShiftAddNet: sparsify the add layers or not
Sparsifying the add layers allows us to further reduce the number of parameters used and save training costs. Similarly to quantization, we observe that even slightly pruning AdderNet incurs an accuracy drop. As shown in Fig. 6, we visualize the distribution of weights in the 11th add layer, using ResNet-20 as the backbone, under different pruning ratios. Note that only non-zero weights are shown in the histograms for better visualization. We can see that networks with only adder layers, i.e., AdderNet, fail to provide a wide dynamic range for the weights (which collapse to narrow distribution ranges) at high pruning ratios, while ShiftAddNet preserves a consistently wide dynamic range. This explains ShiftAddNet’s improved robustness to sparsification. The test accuracy comparisons in Fig. 6 (c) demonstrate that when 50% of the parameters in the add layers are pruned, ShiftAddNet still achieves 80.42% test accuracy, while the accuracy of AdderNet collapses to 51.47%.
5 Conclusion
We propose a multiplication-free network, ShiftAddNet, for efficient DNN training and inference, inspired by the well-known shift-and-add hardware expertise, and show that ShiftAddNet achieves improved expressiveness and parameter efficiency, solving the drawbacks of networks with merely shift or add operations. Moreover, ShiftAddNet enables more flexible control over different levels of granularity in network training than ConvNet. Interestingly, we find that fixing ShiftAddNet’s shift layers even leads to comparable or better accuracy for over-parameterized networks on our considered IoT applications. Extensive experiments and ablation studies demonstrate the superior energy efficiency, convergence, and robustness of ShiftAddNet over its add-only or shift-only counterparts. Many promising problems remain open for the proposed network; an immediate future work is to explore the theoretical grounds of such fixed-layer regularization.
Broader impact
Efficient DNN training goal. Recent DNN breakthroughs rely on massive data and computational power. Modern DNN training also requires massive yet inefficient multiplications in convolutions, making DNN training very challenging and limiting practical applications on resource-constrained mobile devices. First, training DNNs incurs prohibitive computational costs. For example, training a medium-scale DNN, ResNet-50, requires 10^18 floating-point operations (FLOPs) [you2017imagenet]. Second, DNN training has raised pressing environmental concerns. For instance, the carbon emission of training one DNN can be as high as an American car’s lifelong emissions [strubell2019energy, li2020halo]. Therefore, efficient DNN training has become a very important research problem.
Generic hardware-inspired algorithm. To achieve this efficient-training goal, this paper takes one further step along the direction of multiplication-less deep networks, by drawing on a very fundamental idea in hardware-design practice, computer processors, and even digital signal processing. It has long been known that multiplications can be performed with additions and logical bit-shifts [xue1986adaptive], whose hardware implementation is very simple and much faster [marchesi1993fast], without compromising the result quality or precision. This clever “shortcut” saves arithmetic operations and can readily be applied to accelerating the hardware implementation of any machine learning algorithm involving multiplication (whether scalar, vector, or matrix). But our curiosity goes well beyond this: we aim to learn from this hardware-level “shortcut” to design efficient learning algorithms.
Societal consequences. Success of this project enables both efficient online training and inference of state-of-the-art DNNs on pervasive, resource-constrained platforms and applications. As machine-learning-powered edge devices have penetrated all walks of life, this work is expected to generate tremendous impacts on societies and economies. Progress on this front will enable ubiquitous DNN-powered intelligent functions in edge devices, across numerous camera-based Internet-of-Things (IoT) applications such as traffic monitoring, self-driving and smart cars, personal digital assistants, surveillance and security, and augmented reality. We believe the hardware-inspired ShiftAddNet is a significant step toward efficient network training methods that can make an impact on society.
References
Appendix A Appendix
A.1 Visualize the divergence of different classes
To further analyze the improved robustness (to quantization) of combining the two weak players, shift and add, we compare the discriminative power of AdderNet and ShiftAddNet. Specifically, we visualize the class divergences using the t-distributed stochastic neighbor embedding (t-SNE) algorithm [van2014t], which is well suited for embedding high-dimensional data for visualization in a low-dimensional space of two or three dimensions (2/3D). After reducing the dimensions of the learned features to 2/3D, we are able to analyze the discrimination among different classes, which further allows us to compare the effectiveness of different networks. As shown in Fig. 7, the 2/3D visualizations show that the proposed ShiftAddNet discriminates between different classes better (i.e., the boundaries among classes are easier to identify than for AdderNet [chen2019addernet]) in both the floating-point (FP32) and fixed-point (FIX8) scenarios.
A.2 Evaluation on two more IoT datasets
Accuracy and training cost trade-offs. We evaluate DCNN [jiang2015human] on the popular MHEALTH [banos2014mhealthdroid] and USC-HAD [zhang2012usc] IoT benchmarks. As shown in Fig. 8 (a) and (b), ShiftAddNet again consistently outperforms the baselines under all settings in terms of efficiency-accuracy trade-offs: (1) over AdderNet, ShiftAddNet reduces energy costs by 32.8% ∼ 90.6% while achieving comparable accuracies (−0.65% ∼ +9.87%); and (2) over DeepShift, ShiftAddNet achieves 7.85% ∼ 30.7% higher accuracies while requiring 44.1% ∼ 74.7% less energy.
Inference costs: As shown in Fig. 8 (a), when training DCNN on the IoT datasets, ShiftAddNet with fixed shift layers (FIX32) costs 1.7 J, whereas AdderNet (FIX32) costs 1.9 J and DeepShift (FIX32) costs 2.6 J, leading to 10.5% / 34.6% savings, respectively.