Powerful deep neural networks (DNNs) come at the price of prohibitive resource costs during both inference and training, limiting the feasibility and scope of DNN applications on resource-constrained devices for more pervasive intelligence. DNNs are largely composed of multiplication operations in both forward and backward propagation, which are much more computationally costly than additions [horowitz2014energy]. This roadblock has driven several attempts to design new types of hardware-friendly deep networks that rely less on heavy multiplications in order to achieve higher energy efficiency. ShiftNet [wu2018shift, chen2019all] adopted spatial shift operations paired with pointwise convolutions to replace a large portion of convolutions. DeepShift [elhoushi2019deepshift] employed an alternative of bit-wise shifts, which are equivalent to multiplying the input by powers of 2. Lately, AdderNet [chen2019addernet] pioneered the demonstration that it is feasible and promising to replace all convolutions with merely addition operations.
This paper takes one step further along this direction of multiplication-less deep networks by drawing on a very fundamental idea in hardware-design practice, computer processors, and even digital signal processing. It has long been known that multiplications can be performed with additions and logical bit-shifts [xue1986adaptive, 7780065], whose hardware implementations are very simple and much faster [marchesi1993fast], without compromising the result quality or precision. Also, on currently available processors, a bit-shift instruction is faster than a multiply instruction and can be leveraged to multiply (shift left) and divide (shift right) by powers of two. Multiplication (or division) by a constant is then implemented using a sequence of shifts and adds (or subtracts). This clever “shortcut” saves arithmetic operations and can readily be applied to accelerating the hardware implementation of any machine learning algorithm involving multiplication (whether scalar, vector, or matrix). But our curiosity goes well beyond this: can we learn from this hardware-level “shortcut” to design efficient learning algorithms?
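To make the hardware “shortcut” concrete, the following minimal Python sketch (an illustration of ours, not code from any cited work) multiplies by a non-negative integer constant using nothing but left-shifts and additions, contributing one shifted copy of the input per set bit of the constant:

```python
def multiply_by_constant(x: int, c: int) -> int:
    """Compute x * c (c >= 0) using only left-shifts and additions.

    Each set bit k of the constant c contributes (x << k), so the
    product is the sum of a few shifted copies of x -- no multiply
    instruction is needed.
    """
    result, k = 0, 0
    while c:
        if c & 1:
            result += x << k  # add the shifted copy for this bit
        c >>= 1
        k += 1
    return result
```

For a constant with few set bits (e.g., a power of two), the loop degenerates into a single shift, which is exactly the case bit-shift layers exploit.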
The above uniquely motivates our work: in order to be more “hardware-friendly”, we strive to re-design our model to be “hardware-inspired”, leveraging the successful experience directly from the efficient hardware design community. Specifically, we explicitly re-parameterize our deep networks by replacing all convolutional and fully-connected layers (both built on multiplications) with two multiplication-free layers: bit-shift and add. Our new type of deep model, named ShiftAddNet, immediately leads to both energy-efficient inference and training algorithms.
We note that ShiftAddNet seamlessly integrates bit-shift and addition together, with strong motivations that address several pitfalls in prior arts [wu2018shift, chen2019all, elhoushi2019deepshift, chen2019addernet]. Compared to utilizing spatial- or bit-shifts alone [wu2018shift, elhoushi2019deepshift], ShiftAddNet can be fully expressive as standard DNNs, while [wu2018shift, elhoushi2019deepshift] only approximate the original expressive capacity, since shift operations cannot span the entire continuous space of multiplicative mappings (e.g., bit-shifts can only represent the subset of power-of-2 multiplications). Compared to the fully additive model [chen2019addernet], we note that while repeated additions can in principle replace any multiplicative mapping, they do so in a very inefficient way. In contrast, by also exploiting bit-shifts, ShiftAddNet is expected to be more parameter-efficient than [chen2019addernet], which relies on adding templates. As a bonus, we notice that the bit-shift and add operations naturally correspond to coarse- and fine-grained input manipulations. We can exploit this property to more flexibly trade off between training efficiency and achievable accuracy, e.g., by freezing all bit-shifts and only training the add layers. ShiftAddNet with fixed shift layers can achieve up to 90% and 82.8% energy savings over fully additive models [chen2019addernet] and shift models [elhoushi2019deepshift] under floating-point or fixed-point precision training, while leading to comparable or better accuracies (-3.7% ∼ +31.2% and 3.5% ∼ 23.6%), respectively. Our contributions can be summarized as follows:
Uniquely motivated by the hardware design expertise, we combine two multiplication-less and complementary operations (bit-shift and add) to develop a hardware-inspired network called ShiftAddNet that is fully expressive and ultra-efficient.
We develop training and inference algorithms for ShiftAddNet. Leveraging the two operations’ distinct granularity levels, we also investigate ShiftAddNet trade-offs between training efficiency and achievable accuracy, e.g., by freezing all the bit-shift layers.
We conduct extensive experiments to compare ShiftAddNet with existing DNNs or multiplication-less models. Results on multiple benchmarks demonstrate its superior compactness, accuracy, training efficiency, and robustness. Specifically, we implement ShiftAddNet on a ZYNQ-7 ZC706 FPGA board [guide2012zynq] and collect all real energy measurements for benchmarking.
2 Related works
Multiplication-less DNNs. Shrinking the cost-dominant multiplications has been widely considered in many DNN designs for reducing computational complexity [howard2017mobilenets, 8050797]: [howard2017mobilenets] decomposes the convolutions into separate depthwise and pointwise modules which require fewer multiplications; and [lin2015neural, courbariaux2016binarized, NIPS2019_8971] binarize the weights or activations to construct DNNs consisting of sign changes paired with much fewer multiplications. Another trend is to replace multiplications with other, cheaper operations. Specifically, [chen2019all, wu2018shift] leverage spatial shift operations to shift feature maps, which need to be cooperated with pointwise convolutions to aggregate spatial information; [elhoushi2019deepshift] fully replaces multiplications with both bit-wise shift operations and sign changes; and [chen2019addernet, song2020addersr, xu2020kernel] trade multiplications for cheaper additions and develop a special backpropagation scheme for effectively training add-only networks.
Hardware costs of basic operations. Compared to shift and add, multipliers can be very inefficient in hardware, incurring high costs in terms of consumed energy/time and chip area. Shift and add operations can substitute for such multipliers: for example, they have been adopted to save computing resources and can be easily and efficiently performed by a digital processor [sanchez2013approach]. This hardware idea has been adopted to accelerate multilayer perceptrons (MLPs) in digital processors [marchesi1993fast]. We are here motivated by such hardware expertise to fully replace multiplications in modern DNNs with merely shifts and adds, aiming to solve the drawbacks of existing shift-only or add-only replacement methods and to boost network efficiency over multiplication-based DNNs.
Relevant observations in DNN training. It has been shown that DNN training contains redundancy in various aspects [wang2019e2, you2019drawing, yang2019legonet, han2020ghostnet, zhou2020go, chen2020frequency]. For example, [liu2020orthogonal] explores an orthogonal weight training algorithm which over-parameterizes the networks with the multiplication between a learnable orthogonal matrix and fixed, randomly initialized weights, and argues that fixing weights during training and only learning a proper coordinate system can yield good generalization for over-parameterized networks; and [juefei2017local] separates the convolution into spatial and pointwise convolutions, while freezing the binary spatial convolution filters (called anchor weights) and only learning the pointwise convolutions. These works inspire the ablation study of fixing shift parameters in our ShiftAddNet.
3 The proposed model: ShiftAddNet
In this section, we present our proposed ShiftAddNet. First, we introduce the motivation and hypotheses behind ShiftAddNet, and then discuss ShiftAddNet’s component layers (i.e., shift and add layers) from both hardware-cost and algorithmic perspectives, providing high-level background and justification for ShiftAddNet. Finally, we discuss a more efficient variant of ShiftAddNet.
3.1 Motivation and hypothesis
Driven by the long-standing tradition in energy-efficient hardware implementation of replacing expensive multiplications with lower-cost bit-shifts and adds, we re-design DNNs by pipelining shift and add layers. We hypothesize that (1) while DNNs with merely shift layers or merely add layers are in general less capable than their multiplication-based DNN counterparts, integrating these two weak players can lead to networks with much improved expressive capacity, while maintaining their hardware-efficiency advantages; and (2) thanks to the coarse- and fine-grained input manipulations resulting from the complementary shift and add layers, such a new network pipeline can even lead to new models that are comparable with multiplication-based DNNs in terms of task accuracy, while offering superior hardware efficiency.
3.2 ShiftAddNet: shift Layers
This subsection discusses the shift layers adopted in our proposed ShiftAddNet in terms of hardware efficiency and algorithmic capacity.
|Operation||Format||ASIC (45nm): Energy (pJ)||Improv.||FPGA (ZYNQ-7 ZC706): Energy (pJ)||Improv.|
Hardware perspective. The shift operation is a well-known efficient hardware primitive, motivating the recent development of various shift-based efficient DNNs [elhoushi2019deepshift, wu2018shift, chen2019all]. Specifically, the shift layers in [elhoushi2019deepshift] reduce DNNs’ computation and energy costs by replacing the regular cost-dominant multiplication-based convolution and linear operations (a.k.a. fully-connected layers) with bit-shift-based convolution and linear operations, respectively. Mathematically, such bit-shift operations are equivalent to multiplying by powers of 2. As summarized in Tab. 1, such shift operations can be extremely efficient as compared to their corresponding multiplications. In particular, bit-shifts can save as much as 196× and 24× energy costs over their multiplication counterparts, when implemented in a 45nm CMOS technology and a SOTA FPGA [zc706], respectively. In addition, for a 16-bit design, it has been estimated that the average power and area of multipliers are at least 9.7× and 1.45×, respectively, those of bit-shifts [elhoushi2019deepshift].
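To illustrate what a shift layer can and cannot represent, the short sketch below (a hypothetical helper of ours, not code from the cited works) rounds a real-valued weight to the nearest s · 2^p value that a bit-shift can realize:

```python
import math

def to_shift_weight(w: float):
    """Round a real weight to the form s * 2**p with s in {-1, 0, +1} and
    integer p -- the only multiplicative factors a bit-shift can realize."""
    if w == 0.0:
        return 0, 0
    s = 1 if w > 0 else -1
    p = round(math.log2(abs(w)))  # nearest power-of-2 exponent
    return s, p

def from_shift_weight(s: int, p: int) -> float:
    """Recover the multiplicative factor represented by (s, p)."""
    return s * (2.0 ** p)
```

For example, a weight of 0.3 is snapped to 0.25 = 2^-2; this rounding to the sparse power-of-2 grid is exactly the source of the expressiveness gap discussed next.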
Algorithmic perspective. Despite their promising hardware efficiency, networks constructed with bit-shifts can compare unfavorably with their multiplication-based counterparts in terms of expressive efficiency. Formally, the expressive efficiency of architecture A is higher than that of architecture B if any function realized by B can be replicated by A, but there exist functions realized by A which cannot be replicated by B unless its size grows significantly larger [sharir2018on]. For example, it is commonly accepted that deep networks are exponentially more efficient than shallow networks, because a shallow network must grow exponentially large to approximate the functions represented by a deep network of polynomial size. For ease of discussion, we refer to [zhong2018shift] and use a loosely defined metric of expressiveness called expressive capacity (accuracy) in this paper, without loss of generality. Specifically, expressive capacity refers to the achieved accuracy of networks under the same or similar hardware cost, i.e., network A is deemed to have a better expressive capacity than network B if the former achieves a higher accuracy at the cost of the same or even fewer FLOPs (or energy). From this perspective, networks with bit-shift layers without full-precision latent weights are observed to be inferior to networks with add layers or multiplication-based convolution layers, as shown in prior arts [chen2019addernet, howard2017mobilenets] and validated in our experiments (see Sec. 4) under various settings and datasets.
3.3 ShiftAddNet: add Layers
Similar to the aforementioned subsection, here we discuss the add layers adopted in our proposed ShiftAddNet in terms of hardware efficiency and algorithmic capacity.
Hardware perspective. Addition is another well-known efficient hardware primitive. This has motivated the design of efficient DNNs built mostly on additions [chen2019addernet], and there are many works trying to trade multiplications for additions in order to speed up DNNs [afrasiyabi2018non, chen2019addernet, Wang_2019_CVPR]. In particular, [chen2019addernet] investigates the feasibility of replacing multiplications with additions in DNNs, and presents AdderNets which trade the massive multiplications in DNNs for much cheaper additions to reduce computational costs. As a concrete example in Tab. 1, additions can save up to 196× and 31× energy costs over multiplications in fixed-point formats, and can save 47× and 4.1× energy costs in the (more expensive) floating-point formats, when implemented in a 45nm CMOS technology and a SOTA FPGA [zc706], respectively. Note that the pioneering work [chen2019addernet], which investigates addition-dominant DNNs, presents its networks in floating-point formats.
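A minimal NumPy sketch of such an add layer (our simplified stride-1, no-padding toy version; shapes and names are our assumptions): the usual inner product of a convolution is replaced by a negative L1 distance between each input patch and the weight filter:

```python
import numpy as np

def adder_layer_2d(x, w):
    """AdderNet-style add layer: a sliding-window "convolution" whose
    similarity score is a negative L1 distance instead of an inner product.
    x: (c_in, H, W) input; w: (c_out, c_in, d, d) weights; stride 1, no pad."""
    c_out, c_in, d, _ = w.shape
    _, H, W = x.shape
    out = np.zeros((c_out, H - d + 1, W - d + 1))
    for t in range(c_out):
        for e in range(H - d + 1):
            for f in range(W - d + 1):
                patch = x[:, e:e + d, f:f + d]
                out[t, e, f] = -np.abs(patch - w[t]).sum()  # only adds/subtracts
    return out
```

The output is maximal (zero) when a patch exactly matches the filter, so each filter acts as an additive template.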
Algorithmic perspective. While there have been no prior works studying add layers in terms of expressive efficiency or capacity, the results in SOTA bit-shift-based networks [elhoushi2019deepshift] and add-based networks [chen2019addernet], as well as our experiments under various settings, show that add-based networks in general have better expressive capacity than their bit-shift-based counterparts. In particular, AdderNet [chen2019addernet] achieves a 1.37% higher accuracy than DeepShift [elhoushi2019deepshift] at a cost of similar or even lower FLOPs on ResNet-18 with the ImageNet dataset. Furthermore, when it comes to DNNs, the diversity of the learned features’ granularity is another factor important to the achieved accuracy. In this regard, shift layers are deemed to extract large-grained features, as compared to the small-grained features learned by add layers.
3.4 ShiftAddNet implementation
3.4.1 Overview of the structure
To better validate our aforementioned Hypothesis (1), i.e., that integrating the two weak players (shift and add) into one can lead to networks with much improved task accuracy and hardware efficiency as compared to networks with merely one of the two, we adopt the SOTA bit-shift-based and add-based networks’ designs to implement the shift and add layers of our ShiftAddNet in this paper. In this way, we can better verify that the resulting design’s improved performance comes merely from the integration of the two, ruling out potential impacts due to a different design. The overall structure of ShiftAddNet is illustrated in Fig. 1, which can be formulated as follows:
O = k_a\big(w_a,\; k_s(s \cdot 2^p,\; I)\big), \quad (1)

where I and O denote the input and output activations, respectively; k_s(·,·) and k_a(·,·) are kernel functions that perform the inner products and subtractions of a convolution; w_a denotes the weights in the add layers, and w_s = s · 2^p represents the weights in the shift layers, in which the s are sign flip operators and the power-of-2 parameters p represent the bit-wise shifts.
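The composition above can be illustrated with a tiny fully-connected sketch (a toy version of ours with made-up shapes, not the actual convolutional implementation): the shift layer is an ordinary matrix product whose weights are constrained to s · 2^p, and the add layer then scores its output by negative L1 distance:

```python
import numpy as np

def shiftadd_layer(I, s, p, w_a):
    """Toy fully-connected ShiftAddNet layer: O = k_a(w_a, k_s(s * 2**p, I)).
    I: (n,) input; s, p: (m, n) sign and exponent of the shift weights;
    w_a: (k, m) add-layer weights."""
    h = (s * np.power(2.0, p)) @ I                 # shift layer (coarse-grained)
    return -np.abs(h[None, :] - w_a).sum(axis=1)   # add layer (fine-grained)
```

In hardware, every product in the first line reduces to a sign flip plus a bit-shift, and the second line uses only subtractions and additions.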
Dimensions of shift and add layers.
A shift layer in ShiftAddNet adopts the same strides and weight dimensions as those in the corresponding multiplication-based DNN (e.g., ConvNet), followed by an add layer which adapts its kernel sizes and input channels to match the reduced feature maps. Although ShiftAddNet thus contains slightly more weights than ConvNet/AdderNet (e.g., 1.3MB vs. MB in ConvNet/AdderNet (FP32) on ResNet20), it consumes less energy to achieve similar accuracies because data movement is the cost bottleneck in hardware acceleration [zhao2020smartexchange, li2020timely, 9197673]. ShiftAddNet can be further quantized to MB (FIX8) without hurting the accuracy, as demonstrated by the experiments in Sec. 4.2.
3.4.2 Backpropagation in ShiftAddNet
ShiftAddNet adopts the SOTA bit-shift-based and add-based networks’ designs during backpropagation. Here we explicitly formulate both the inference and backpropagation of the shift and add layers. The add layers during inference can be expressed as:

O[c_o, e, f] = - \sum_{c_i} \sum_{i} \sum_{j} \big| I[c_i, e+i, f+j] - w_a[c_o, c_i, i, j] \big|,

where c_in and c_out, h × w and h' × w', and d × d stand for the number of the input and output channels, the size of the input and output feature maps, and the size of the weight filters, respectively; O, I, and w_a denote the output and input activations, and the weights, respectively. Based on the above notation, we formulate the add layers’ backpropagation in the following equations:

∂O[c_o, e, f] / ∂w_a[c_o, c_i, i, j] = I[c_i, e+i, f+j] - w_a[c_o, c_i, i, j],
∂O[c_o, e, f] / ∂I[c_i, e+i, f+j] = HT\big( w_a[c_o, c_i, i, j] - I[c_i, e+i, f+j] \big),
where HT(·) denotes the HardTanh function, following AdderNet [chen2019addernet], to prevent gradients from exploding. Note that the difference from AdderNet is that the strides of ShiftAddNet’s add layers are always equal to one, while its shift layers share the same strides as the corresponding ConvNet.
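The two gradient rules can be sketched for a toy fully-connected add layer (our illustrative shapes; the real layers are convolutional): the weight gradient keeps the full-precision difference, while the input gradient clips it with HardTanh, mirroring AdderNet's surrogate gradients:

```python
import numpy as np

def hardtanh(z):
    """HT(z): clip to [-1, 1], used to keep the input gradient bounded."""
    return np.clip(z, -1.0, 1.0)

def add_layer_grads(I, w_a, g_out):
    """Surrogate gradients of O_k = -sum_m |I_m - w_a[k, m]| (AdderNet-style).
    I: (m,) input; w_a: (k, m) weights; g_out: upstream dL/dO of shape (k,)."""
    diff = I[None, :] - w_a                               # (k, m)
    g_w = g_out[:, None] * diff                           # dL/dw_a: full-precision
    g_I = (g_out[:, None] * hardtanh(-diff)).sum(axis=0)  # dL/dI: HT-clipped
    return g_w, g_I
```

Replacing the true sign(·) derivative with the raw difference for the weights, and clipping it for the inputs, is the scheme the equations above describe.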
Next, we use the above notation to introduce both the inference and backpropagation designs of ShiftAddNet’s shift layers, with one additional symbol S denoting the stride:

O[c_o, e, f] = \sum_{c_i} \sum_{i} \sum_{j} I[c_i, e \cdot S + i, f \cdot S + j] \cdot s[c_o, c_i, i, j] \cdot 2^{p[c_o, c_i, i, j]},

where O, I, and s · 2^p denote the output and input activations, and the weights of the shift layers, respectively; ∂L/∂I follows Equ. (4) to perform backpropagation, since a shift layer is followed by an add layer in ShiftAddNet.
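For completeness, a NumPy sketch of the strided shift layer (again a toy implementation of ours with assumed shapes; in hardware each power-of-2 product reduces to a sign flip plus a bit-shift):

```python
import numpy as np

def shift_layer_2d(x, s, p, stride=1):
    """Shift 'convolution': a plain convolution whose weights are constrained
    to s * 2**p. x: (c_in, H, W); s, p: (c_out, c_in, d, d); no padding."""
    w = s * np.power(2.0, p)   # each product is a sign flip plus a bit-shift
    c_out, c_in, d, _ = w.shape
    _, H, W = x.shape
    Ho, Wo = (H - d) // stride + 1, (W - d) // stride + 1
    out = np.zeros((c_out, Ho, Wo))
    for t in range(c_out):
        for e in range(Ho):
            for f in range(Wo):
                patch = x[:, e * stride:e * stride + d, f * stride:f * stride + d]
                out[t, e, f] = (patch * w[t]).sum()
    return out
```

With stride S > 1 this layer also performs the downsampling, so the add layer that follows can always use stride one, as noted above.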
3.4.3 ShiftAddNet variant: fixing the shift layers
Inspired by the recent success of freezing anchor weights [juefei2017local, liu2020orthogonal] for over-parameterized networks, we hypothesize that freezing the “over-parameterized” shift layers (large-grained anchor weights) in ShiftAddNet can potentially lead to good generalization ability, motivating us to develop a variant of ShiftAddNet with fixed shift layers. In particular, ShiftAddNet with fixed shift layers simply means that the shift weight filters s and p in Equ. (1) remain the same after initialization. Training this variant is straightforward because the shift weight filters (i.e., s and p in Equ. (1)) do not need to be updated (i.e., the corresponding gradient calculations are skipped), while the error is backpropagated through the fixed shift layers in the same way as through learnable shift layers (see Equ. (6)). Moreover, we further prune the fixed shift layers, retaining only the necessary large-grained anchor weights, to design a more energy-efficient ShiftAddNet.
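Operationally, the fixed-shift variant only requires the optimizer to skip the shift parameters; a minimal sketch (parameter names are ours, not from the paper):

```python
def sgd_step(params, grads, frozen, lr=0.1):
    """One SGD update that skips every parameter named in `frozen`.
    Freezing the shift layers means listing their s and p tensors here:
    their gradient computation can be skipped entirely, while errors still
    flow through the fixed layers during backpropagation."""
    return {name: (v if name in frozen else v - lr * grads[name])
            for name, v in params.items()}
```

For example, `sgd_step(params, grads, frozen={"shift_s", "shift_p"})` updates only the add-layer weights.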
4 Experiment results
In this section, we first describe our experiment setup, and then benchmark ShiftAddNet over SOTA DNNs. After that, we evaluate ShiftAddNet in the context of domain adaptation. Finally, we present ablation studies of ShiftAddNet’s shift and add layers.
4.1 Experiment setup
Models and datasets. We consider two DNN models (i.e., ResNet-20 [He_2016_CVPR] and VGG19-small [simonyan2014very]) on six datasets: two classification datasets (i.e., CIFAR-10/100) and four IoT datasets (including MHEALTH [banos2014mhealthdroid], FlatCam Face [FlatCamFace], USCHAD [zhang2012usc], and Head-pose detection [gourier2004estimating]). Specifically, the Head-pose dataset contains 2,760 images, of which we adopt a randomly sampled 80% for training and the remaining 20% for testing the correctness of three outputs: front, left, and right [boominathan2020phlatcam]; the FlatCam Face dataset contains 23,838 face images captured using a FlatCam lensless imaging system [FlatCamFace], which are resized to 76 × 76 before being used.
For the CIFAR-10/100 and Head-pose datasets, the training takes a total of 160 epochs with a batch size of 256, where the initial learning rate is set to 0.1 and then divided by 10 at the 80-th and 120-th epochs, respectively, and an SGD solver is adopted with a momentum of 0.9 and a weight decay following [liu2018rethinking]. For the FlatCam Face dataset, we follow [Cao18] to pre-train the network on the VGGFace 2 dataset for 20 epochs before adapting to the FlatCam Face images. For the trainable shift layers, we follow [elhoushi2019deepshift] to adopt 50% sparsity by default; and for the MHEALTH and USCHAD datasets, we follow [jiang2015human] to use a DCNN model and train it for 40 epochs.
Baselines and evaluation metrics. Baselines: We evaluate the proposed ShiftAddNet against two SOTA multiplication-less networks, AdderNet [chen2019addernet] and DeepShift (DeepShift (PS) by default) [elhoushi2019deepshift], and also compare it to the multiplication-based ConvNet [frankle2018the] under a comparable energy cost (within 30% more than AdderNet (FP32)). Evaluation metrics: For evaluating real hardware efficiency, we measure the energy cost of all DNNs on a SOTA FPGA platform, ZYNQ-7 ZC706 [guide2012zynq]. Note that our energy measurements in all experiments include the DRAM access costs.
4.2 ShiftAddNet over SOTA DNNs on standard training
Experiment settings. For this set of experiments, we consider the general ShiftAddNet with learnable shift layers. Of the two SOTA multiplication-less baselines, AdderNet [chen2019addernet] and DeepShift [elhoushi2019deepshift], the latter quantizes its activations to 16-bit fixed-point for shifting purposes while its backpropagation uses floating-point precision. As floating-point additions are more expensive than multiplications [FP-adder], we refer to SOTA quantization techniques [yang2020training, banner2018scalable] to quantize both the forward (weights and activations) and backward (errors and gradients) parameters to 32/8-bit fixed-point (FIX32/8), for evaluating the potential energy savings of both ShiftAddNet and AdderNet.
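As a reference point, a simple uniform symmetric fixed-point quantizer of the kind such schemes build on (a generic sketch under our assumptions, not the exact method of the cited works):

```python
import numpy as np

def quantize_fixed(x, bits=8):
    """Fake-quantize x onto a signed `bits`-bit fixed-point grid:
    scale to the integer range, round, then de-scale."""
    qmax = 2 ** (bits - 1) - 1
    m = np.abs(x).max()
    scale = m / qmax if m > 0 else 1.0
    return np.round(x / scale) * scale
```

Applying such a quantizer to weights, activations, errors, and gradients yields the FIX32/8 settings compared in this section.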
ShiftAddNet over SOTA on classification. The results on four datasets and two DNNs in Fig. 2 (a), (b), (d), and (e) show that ShiftAddNet can consistently outperform all competitors in terms of the measured energy cost, while improving the task accuracies. Specifically, with full-precision floating-point (FP32) parameters, ShiftAddNet even surpasses both the multiplication-based ConvNet and AdderNet: when training ResNet-20 on CIFAR-10, ShiftAddNet reduces the training energy costs by 33.7% and 44.6% as compared to AdderNet and ConvNet [frankle2018the], respectively, outperforming the SOTA multiplication-based ConvNet and thus validating our Hypothesis (2) in Section 3.1. ShiftAddNet also demonstrates notably improved robustness to quantization as compared to AdderNet: a quantized ShiftAddNet with 8-bit fixed-point representation reduces 65.1% ∼ 75.0% of the energy costs over the reported AdderNet (with floating-point precision, denoted as FP32) while offering comparable accuracies (-1.79% ∼ +0.18%), and achieves a greatly higher accuracy (7.2% ∼ 37.1%) over the quantized AdderNet (FIX32/8) while consuming comparable or even less energy (-25.2% ∼ 25.2%). Meanwhile, ShiftAddNet achieves 2.41% ∼ 16.1% higher accuracies while requiring 34.1% ∼ 70.9% less energy costs, as compared to DeepShift [elhoushi2019deepshift]. This set of results also verifies our Hypothesis (1) in Section 3.1 that integrating the weak shift and add players can lead to improved network expressive capacity with negligible or even lower hardware costs.
We also compare ShiftAddNet with the baselines in an apples-to-apples manner based on the same quantization format (e.g., FIX32). For example, when evaluated on VGG-19 with CIFAR-10 (see Fig. 2 (c)), ShiftAddNet consistently (1) improves the accuracies by 11.6%, 10.6%, and 37.1% as compared to AdderNet in FIX32/16/8 formats, at comparable energy costs (-25.2% ∼ 15.7%); and (2) improves the accuracies by 26.8%, 26.2%, and 24.2% as compared to DeepShift (PS) in FIX32/16/8 formats, with comparable or slightly higher energy overheads. To further analyze ShiftAddNet’s improved robustness to quantization, we compare the discriminative power of AdderNet and ShiftAddNet by visualizing the class divergences using the t-SNE algorithm [van2014t], as shown in the supplement.
ShiftAddNet over SOTA on IoT applications. We further evaluate ShiftAddNet against the SOTA baselines on two IoT datasets to assess its effectiveness on real-world IoT tasks. As shown in Fig. 2 (c) and (f), ShiftAddNet again consistently outperforms the baselines under all settings in terms of efficiency-accuracy trade-offs. Specifically, compared with AdderNet, ShiftAddNet achieves 34.1% ∼ 80.9% energy cost reductions while offering 1.08% ∼ 3.18% higher accuracies; and compared with DeepShift (PS), ShiftAddNet achieves 34.1% ∼ 50.9% energy savings while improving accuracies by 5.5% ∼ 6.9%. This set of experiments shows that ShiftAddNet’s effectiveness and superiority extend to real-world IoT applications. We also observe similarly improved efficiency-accuracy trade-offs on the MHEALTH [banos2014mhealthdroid] and USCHAD [zhang2012usc] datasets and report the performance in the supplement.
ShiftAddNet over SOTA on training trajectories. Fig. 3 (a) and (b) visualize the testing accuracy’s trajectories of ShiftAddNet and the two baselines versus both the training epoch and energy cost, respectively, on ResNet-20 with CIFAR-10. We can see that ShiftAddNet achieves a comparable or higher accuracy with fewer epochs and energy costs, indicating its better generalization capability.
4.3 ShiftAddNet over SOTA on domain adaptation and fine-tuning
To further evaluate the potential capability of ShiftAddNet for on-device learning [li2020halo], we consider the training settings of adaptation and fine-tuning:
Adaptation. We split CIFAR-10 into two non-overlapping subsets. We first pre-train the model on one subset and then retrain it on the other to see how accurately and efficiently it can adapt to the new task. The same splitting is applied to the test set.
Fine-tuning. Similarly, we randomly split CIFAR-10 into two non-overlapping subsets, the difference is that each subset contains all classes. After pre-training on the first subset, we fine-tune the model on the other, expecting to see a continuous growth in performance.
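The two splits differ only in whether classes overlap between the subsets; a small sketch over a label array (illustrative helpers of ours, not the paper's data pipeline):

```python
import numpy as np

def split_for_adaptation(labels, first_classes):
    """Adaptation split: the two subsets contain disjoint classes."""
    first = np.isin(labels, first_classes)
    return np.where(first)[0], np.where(~first)[0]

def split_for_finetuning(labels, frac=0.5, seed=0):
    """Fine-tuning split: a random split, so each subset keeps all classes."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(labels))
    cut = int(frac * len(labels))
    return idx[:cut], idx[cut:]
```

The adaptation split forces the model to learn genuinely new classes in the second stage, while the fine-tuning split only adds more data for the same classes.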
Tab. 2 compares the testing accuracies and training energy costs of ShiftAddNet and the baselines. We can see that ShiftAddNet always achieves a better accuracy than the two SOTA multiplication-less networks. First, compared to AdderNet, ShiftAddNet boosts the accuracy by 5.31% and 0.65% on the adaptation and fine-tuning scenarios, respectively, while reducing the energy cost by 56.6%; second, compared to DeepShift, ShiftAddNet notably improves the accuracy by 26.69% and 33.57% on the adaptation and fine-tuning scenarios, respectively, at a marginally increased energy cost (10.5%).
|Setting||Methods||Accuracy (%): Adaptation||Accuracy (%): Fine-tuning||Energy Costs (MJ)|
|ResNet20 on CIFAR10||DeepShift||58.41||51.31||9.88|
4.4 Ablation studies of ShiftAddNet
We next study ShiftAddNet’s shift and add layers for better understanding this new network.
4.4.1 ShiftAddNet: fixing the shift layers or not
ShiftAddNet with fixed shift layers. In this set of experiments, we study ShiftAddNet with the shift layers fixed or learnable. As shown in Fig. 4, we can see that (1) overall, ShiftAddNet with fixed shift layers can achieve up to 90.0% and 82.8% energy savings over AdderNet (with floating-point or fixed-point precisions) and DeepShift, while leading to comparable or better accuracies (-3.74% ∼ +31.2% and 3.5% ∼ 23.6%), respectively; and (2) interestingly, ShiftAddNet with fixed shift layers also surpasses the generic ShiftAddNet in two aspects: first, it always demands less energy (25.2% ∼ 40.9%) to achieve a comparable or even better accuracy; and second, it can even achieve a better accuracy and better robustness to quantization (up to 10.8% improvement for 8-bit fixed-point training) than the generic ShiftAddNet with learnable shift layers, when evaluated with VGG19-small on CIFAR-100.
ShiftAddNet with its fixed shift layers pruned. As it has become common practice to prune multiplication-based DNNs before deploying them into resource-constrained devices, we are curious whether this extends to our ShiftAddNet. To do so, we randomly prune the fixed shift layers at various ratios and compare the testing accuracy versus the training epochs for the pruned ShiftAddNets and the corresponding AdderNet. Fig. 5 shows that ShiftAddNet maintains its fast-convergence benefit even when the shift layers are largely pruned (e.g., up to 70%).
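The pruning itself can be sketched as a random mask over the fixed shift weights (an assumption of ours; the paper's actual pruning granularity may differ):

```python
import numpy as np

def random_prune(w, ratio, seed=0):
    """Zero out roughly `ratio` of the entries of a (fixed) shift layer's
    weights; the surviving entries act as large-grained anchor weights."""
    rng = np.random.default_rng(seed)
    mask = rng.random(w.shape) >= ratio
    return w * mask
```

Since the pruned layers are frozen anyway, the mask is applied once after initialization and never revisited during training.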
4.4.2 ShiftAddNet: sparsify the add layers or not
Sparsifying the add layers allows us to further reduce the number of used parameters and save training costs. Similar to quantization, we observe that even slightly pruning AdderNet incurs an accuracy drop. As shown in Fig. 6, we visualize the distribution of weights in the 11-th add layer, using ResNet-20 as the backbone, under different pruning ratios. Note that only non-zero weights are shown in the histograms for better visualization. We can see that networks with only add layers, i.e., AdderNet, fail to provide a wide dynamic range for the weights (which collapse to narrow distribution ranges) at high pruning ratios, while ShiftAddNet preserves consistently wide dynamic ranges of weights. This explains the improved robustness of ShiftAddNet to sparsification. The test accuracy comparisons in Fig. 6 (c) demonstrate that when pruning 50% of the parameters in the add layers, ShiftAddNet can still achieve 80.42% test accuracy while the accuracy of AdderNet collapses to 51.47%.
5 Conclusions
We propose a multiplication-free ShiftAddNet for efficient DNN training and inference, inspired by the well-known shift-and-add hardware expertise, and show that ShiftAddNet achieves improved expressiveness and parameter efficiency, solving the drawbacks of networks with merely shift or merely add operations. Moreover, ShiftAddNet enables more flexible control over different levels of granularity in network training than ConvNet. Interestingly, we find that fixing ShiftAddNet’s shift layers even leads to a comparable or better accuracy for over-parameterized networks on our considered IoT applications. Extensive experiments and ablation studies demonstrate the superior energy efficiency, convergence, and robustness of ShiftAddNet over its add-only or shift-only counterparts. Many promising problems remain open for the proposed new network; an immediate future work is to explore the theoretical grounds of such fixed regularization.
Broader impact
Efficient DNN training goal. Recent DNN breakthroughs rely on massive data and computational power. Modern DNN training requires massive yet inefficient multiplications in the convolutions, making DNN training very challenging and limiting practical applications on resource-constrained mobile devices. First, training DNNs incurs prohibitive computational costs: for example, training a medium-scale DNN, ResNet-50, requires on the order of 10^18 floating-point operations (FLOPs) [you2017imagenet]. Second, DNN training has raised pressing environmental concerns: for instance, the carbon emission of training one DNN can be as high as an American car's life-long emission [strubell2019energy, li2020halo]. Therefore, efficient DNN training has become a very important research problem.
Generic hardware-inspired algorithm. To achieve the efficient training goal, this paper takes one further step along the direction of multiplication-less deep networks by drawing on a very fundamental idea in hardware-design practice, computer processors, and even digital signal processing. It has long been known that multiplications can be performed with additions and logical bit-shifts [xue1986adaptive], whose hardware implementation is very simple and much faster [marchesi1993fast], without compromising the result quality or precision. This clever “shortcut” saves arithmetic operations and can readily be applied to accelerating the hardware implementation of any machine learning algorithm involving multiplication (whether scalar, vector, or matrix). But our curiosity goes well beyond this: we aim to learn from this hardware-level “shortcut” to design efficient learning algorithms.
Societal consequences. Success of this project will enable both efficient online training and inference of state-of-the-art DNNs on pervasive resource-constrained platforms and applications. As machine-learning-powered edge devices have penetrated all walks of life, the project is expected to generate tremendous impacts on societies and economies. Progress on this paper will enable ubiquitous DNN-powered intelligent functions in edge devices, across numerous camera-based Internet-of-Things (IoT) applications such as traffic monitoring, self-driving and smart cars, personal digital assistants, surveillance and security, and augmented reality. We believe the hardware-inspired ShiftAddNet is a significant step toward efficient network training methods that can make an impact on society.
Appendix A Appendix
A.1 Visualizing the divergence of different classes
To further analyze the improved robustness (to quantization) of combining the two weak players, shift and add, we compare the discriminative power of AdderNet and ShiftAddNet. Specifically, we visualize the class divergences using the t-distributed stochastic neighbor embedding (t-SNE) algorithm [van2014t], which is well-suited for embedding high-dimensional data for visualization in a low-dimensional space of two or three dimensions (2/3-D). After reducing the dimensions of the learned features to 2/3-D, we are able to analyze the discrimination among different classes, which further allows us to compare the effectiveness of different networks. As shown in Fig. 7, the 2/3-D visualizations show that the proposed ShiftAddNet discriminates different classes better (i.e., the boundaries among different classes are easier to identify than with AdderNet [chen2019addernet]) in both the floating-point (FP32) and fixed-point (FIX8) scenarios.
A.2 Evaluation on two more IoT datasets
Accuracy and training cost trade-offs. We evaluate DCNN [jiang2015human] on the popular MHEALTH [banos2014mhealthdroid] and USCHAD [zhang2012usc] IoT benchmarks. As shown in Fig. 8 (a) and (b), ShiftAddNet again consistently outperforms the baselines under all settings in terms of efficiency-accuracy trade-offs: (1) over AdderNet, ShiftAddNet reduces energy costs by 32.8% ∼ 90.6% while achieving comparable accuracies (-0.65% ∼ 9.87%); and (2) over DeepShift, ShiftAddNet achieves 7.85% ∼ 30.7% higher accuracies while requiring 44.1% ∼ 74.7% less energy.
Inference costs: As shown in Fig. 8 (a), when training DCNN on the IoT datasets, ShiftAddNet with fixed shift layers (FIX32) costs 1.7 J, while AdderNet (FIX32) costs 1.9 J and DeepShift (FIX32) costs 2.6 J, leading to 10.5% and 34.6% savings, respectively.