1. Introduction
Deep learning has been instrumental in driving the current wave of ubiquitous Artificial Intelligence (AI), especially in the domains of visual recognition and natural language processing. Over the past decade, extensive research in deep learning has enabled machines to go beyond image classification (Krizhevsky et al., 2012; Szegedy et al., 2015; He et al., 2016; Girshick et al., 2015) and natural language processing (Mikolov et al., 2013) to the extent of outperforming humans in games such as Atari (Mnih et al., 2015) and Go (Silver et al., 2017). Despite this unprecedented success, standard deep neural network architectures are intensive in terms of both memory and computational resources and require expensive GPU-based platforms for execution. With the advent of the ‘Internet of Things’ and a proliferating need for enabling AI in low-power edge devices, designing energy- and memory-efficient neural networks is quintessential. This has driven researchers to look for ways to reduce model complexity while still meeting algorithmic requirements such as accuracy and reliability.
One way to reduce model size is to modify the network architecture itself so that it has fewer parameters. SqueezeNet (Iandola et al., 2016), for example, employs a series of 1×1 convolutions to compress and expand feature maps as they pass through the network. Another method of compression is pruning, which aims to reduce redundancies in over-parameterized networks; researchers have investigated several pruning techniques, both during training (Alvarez and Salzmann, 2017; Weigend et al., 1991) and during inference (Han et al., 2015; Ullrich et al., 2017).
A different model-compression technique is to represent weights and activations with reduced bit precision. Quantized networks achieve energy efficiency and memory compression compared to full-precision networks, and several training algorithms have been proposed for binary and ternary neural networks (Hubara et al., 2017; Mellempudi et al., 2017). Although these algorithms attain performance close to that of a full-precision network for smaller datasets such as MNIST and CIFAR-10, scaling them to ImageNet remains a challenge. As a solution, XNOR-Nets (binary weights and activations) and BWNs (binary weights and full-precision activations) (Rastegari et al., 2016) were proposed. They offer a different binarization scheme that uses a scaling factor per weight filter bank and were able to scale to ImageNet, albeit with a degradation in accuracy compared to a full-precision network. Researchers have also looked at a hybrid network structure combining BWNs and XNOR-Nets (Prabhu et al., 2018), where activations are full-precision for certain layers and binary for others. However, despite such hybridization techniques, the gulf between full-precision networks and quantized networks remains considerably wide, especially for deep networks and large datasets.
In light of these shortcomings of quantization algorithms, we propose hybrid network architectures combining binary and full-precision sections to attain performance close to full-precision networks while achieving significant energy efficiency and memory compression. We explore several hybrid networks that involve adding full-precision residual connections and breaking the network into binary and full-precision sections, both layer-wise and within a layer. We evaluate the performance of the proposed networks on datasets such as CIFAR-100 and ImageNet and explore the trade-offs between classification accuracy, energy efficiency and memory compression. We compare the different kinds of hybrid network architectures to identify the optimal networks, which recover the performance degradation of extremely quantized networks while achieving excellent compression. Our approach provides an effective way of designing hybrid networks that attain classification performance close to full-precision networks with significant compression, thereby increasing the feasibility of using low-precision networks in low-power edge devices.
2. Design Methodology for Hybrid Networks
We address the problem of performance degradation due to extreme quantization by proposing hybrid network architectures constituted by binary networks with full-precision elements in different forms. Introducing full-precision elements can take the form of adding full-precision residual layers or of splitting the network into binary and full-precision sections. A full-precision layer here means both full-precision weights and activations unless mentioned otherwise. We use the binarization scheme developed in (Rastegari et al., 2016), as it has been demonstrated to scale to large datasets. The binary convolution operation between inputs and weights is approximated as:
I ∗ W ≈ (I ⊕ B) α   (1)
Here, B = sign(W) is the binary weight filter, ⊕ denotes a convolution implemented with XNOR and bit-counting operations, and α = (1/n)‖W‖ℓ1 is the scaling factor given by the L1-norm of W divided by the number of elements n. These binary convolutions are similar to XNOR operations and hence these networks are called XNOR-Nets. We define the XNOR-Net as our baseline binary network architecture. As in (Rastegari et al., 2016), we keep the first and final layers of the hybrid networks full-precision and apply our hybridization techniques to the binary layers of the XNOR-Net. The types of hybrid network architectures explored are described below:

Hybrid networks with full-precision residual connections: These hybrid networks comprise a binary network along with a few full-precision residual connections. Residual connections are usually identity connections that run parallel to the convolutional layers and are added to the convolution output every few layers. Some residual connections may contain weight filters to downsample the input maps when the filter size changes. As these weight layers are computationally inexpensive, making them full-precision promises improved classification accuracy while still achieving large compression compared to full-precision networks. For networks that do not have residual connections, we add a few full-precision connections to form a hybrid network.

Sectioned hybrid networks: We explore an alternative hybridization technique, network sectioning, which splits the network into binary and full-precision sections. Sectioning can take two forms:

Inter-layer sectioning: This type of sectioning splits the layers of the neural network into a set of binary layers and a set of full-precision layers.

Intra-layer sectioning: This type of sectioning splits each layer of the neural network into binary and full-precision sections according to a fixed fraction, such that a fixed percentage of the weight filters of each layer is full-precision.
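As a rough sketch of intra-layer sectioning (the helper name and the example fraction are ours, for illustration only), the filter banks of one layer can be partitioned as:

```python
def section_filters(num_filters, fp_fraction):
    """Split a layer's weight filters into a full-precision group and a
    binary group; fp_fraction of the filters stay full-precision."""
    n_fp = int(round(fp_fraction * num_filters))
    return n_fp, num_filters - n_fp

# Example: a 64-filter layer with a quarter of its filters full-precision
n_fp, n_bin = section_filters(64, 0.25)
```

Applying this split to every layer of the network yields the intra-layer sectioned hybrid described above.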

In our analysis, we consider ResNet and VGG network architectures. For ResNets, we also analyze all the proposed hybrid ResNet-based networks for different network widths. However, as VGG networks are inherently over-parameterized, we explore the hybrid VGG networks for only one particular network width.
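The binarization scheme of Eq. (1) can be sketched in a few lines of NumPy; `binarize_filter` is our illustrative helper, not the authors' implementation:

```python
import numpy as np

def binarize_filter(W):
    """Approximate a real-valued weight filter W by a binary filter
    B = sign(W) and a scaling factor alpha = mean(|W|), i.e. the
    L1-norm of W divided by the number of elements n."""
    B = np.where(W >= 0, 1.0, -1.0)   # sign, with sign(0) taken as +1
    alpha = np.abs(W).mean()
    return B, alpha

W = np.array([[0.5, -0.25], [1.0, -0.75]])
B, alpha = binarize_filter(W)   # alpha = 0.625
```

The binary convolution of Eq. (1) then reduces to convolving the input with B, which needs only XNOR and bit-count operations, followed by a multiplication with alpha.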
2.1. Hybrid ResNets with full-precision residual connections
We consider the network architecture ResNet-N (a ResNet with N layers), which has N−1 convolutional and 1 fully-connected weight layers. We propose a hybrid ResNet architecture constituted by an XNOR ResNet along with full-precision residual connections. The residual connections in ResNets are usually identity connections, except where the connection needs to be downsampled with convolution kernels. The network proposed using full-precision residual connections is described below:
Hybrid-Res A: This configuration comprises a binary ResNet along with full-precision residual convolutional layers. Fig. 1 (a) shows this architecture, where the full-precision downsampling residual layers are shown in color.
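A minimal sketch of the precision assignment in Hybrid-Res A, assuming a naming convention of ours in which downsampling residual convolutions contain 'shortcut' (the first and last layers stay full-precision, as in the XNOR-Net baseline):

```python
def hybrid_res_a_plan(layer_names):
    """Assign a precision to each layer: downsampling residual
    (shortcut) convolutions, plus the first and last layers, are kept
    full-precision; every other layer is binarized."""
    plan = {}
    for i, name in enumerate(layer_names):
        first_or_last = i in (0, len(layer_names) - 1)
        plan[name] = 'full' if first_or_last or 'shortcut' in name else 'binary'
    return plan

plan = hybrid_res_a_plan(['conv1', 'conv2', 'shortcut1', 'conv3', 'fc'])
```

The layer names here are hypothetical; in a real PyTorch model the same rule would be applied while iterating over the model's named modules.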
2.2. Sectioned Hybrid ResNets
We also explore an alternative hybridization technique that breaks the neural network into binary and full-precision sections. This can be done in two ways, inter-layer and intra-layer. The networks are described below:

Hybrid-Res B: In this configuration, we perform inter-layer sectioning, splitting the ResNet into binary layers and full-precision layers. For example, in ResNet-20, this sectioning can lead to 6 full-precision layers and 12 binary layers, as shown in Fig. 1 (b).

Hybrid-Res C: In this configuration, we split each layer of the ResNet into binary and full-precision sections according to a fraction, as defined earlier (shown in Fig. 1 (c)).
2.3. Hybrid VGG Networks with full-precision residual connections
We consider the network architecture VGG-N, which has N−3 convolutional layers and 3 fully-connected layers. We propose a hybrid VGG network design by extending the concept of full-precision residual connections to the standard VGG network. The network is described below and depicted in Fig. 2 (a):
Hybrid-VGG A: In this network, we add residual connections to the standard VGG-N architecture every two convolutional layers. Each time the number of output maps changes, we include a full-precision downsampling convolutional layer in the residual path, shown in color in Fig. 2 (a). For ImageNet simulations, we additionally made the second fully-connected layer of this configuration full-precision; we call this variant Hybrid-VGG A’.
2.4. Sectioned Hybrid VGG Networks
We also explore sectioned hybridization techniques that break the VGG network into binary and full-precision sections, considering both inter-layer and intra-layer sectioning. The networks are described below and depicted in Fig. 2 (b) and (c):

Hybrid-VGG B: In this network, we consider intra-layer sectioning, making a fraction of each convolutional layer full-precision. The full-precision sections are shown in color in Fig. 2 (b).

Hybrid-VGG C: This network considers inter-layer sectioning, with some layers full-precision and the rest binary. For example, in the VGG-19 architecture (Fig. 2 (c)), we can split the binary section of the XNOR-Net into 15 binary convolutional layers and 2 full-precision linear layers.
For VGG networks, we further compare the proposed hybrid networks with networks of increased width and increased bit-precision. These networks are based on the basic VGG XNOR-Net shown in Fig. 2 (d), and are described below:

VGG-2-bit: We increase the bit-precision of the weights of the quantized layers from 1 bit, as in the XNOR-Nets explored thus far, to 2 bits. The training algorithm for any k-bit quantized network can be readily derived from XNOR-Net (Rastegari et al., 2016), where a k-bit quantization function over the quantized levels replaces the sign function used in binary networks.

VGG-Inflate: We increase the network width 2 times, i.e., the number of filter banks in each convolutional layer is doubled.
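One common way to realize such a k-bit quantization function is uniform quantization over [−1, 1]; the exact level placement used for VGG-2-bit is not specified here, so this is only an assumed sketch:

```python
import numpy as np

def quantize_k_bit(W, k):
    """Quantize weights to 2**k uniformly spaced levels in [-1, 1].
    For k = 1 this roughly reduces to the sign function (levels -1, +1)."""
    steps = 2 ** k - 1                 # number of intervals in [-1, 1]
    Wc = np.clip(W, -1.0, 1.0)
    return np.round((Wc + 1.0) / 2.0 * steps) / steps * 2.0 - 1.0

W = np.array([-0.9, -0.2, 0.1, 0.7])
Wq = quantize_k_bit(W, 2)   # levels: -1, -1/3, 1/3, 1
```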
3. Experiments, Results and Discussion
3.1. Energy and Memory Calculations
We perform a storage and computation analysis to calculate the energy efficiency and memory compression of the proposed networks. For any two networks A and B, the energy efficiency of Network A with respect to Network B is defined as Energy Efficiency (E.E) = (Energy consumed by Network B)/(Energy consumed by Network A), so that a higher value means Network A is more efficient. Similarly, the memory compression of Network A with respect to Network B is defined as Memory Compression (M.C) = (Storage required by Network B)/(Storage required by Network A). E.E (FP) and M.C (FP) denote the energy efficiency and memory compression of a network with respect to the full-precision network, whereas E.E (XNOR) and M.C (XNOR) denote them with respect to the XNOR network.
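These ratios are straightforward to compute; a small helper (ours, for illustration) with the convention that higher is better:

```python
def energy_efficiency(ref_energy, net_energy):
    """E.E of a network w.r.t. a reference network: how many times
    less energy the network consumes than the reference."""
    return ref_energy / net_energy

def memory_compression(ref_storage, net_storage):
    """M.C of a network w.r.t. a reference network: how many times
    less storage the network requires than the reference."""
    return ref_storage / net_storage

# e.g. a network consuming 1 unit of energy vs. 24 for full precision
ee = energy_efficiency(24.0, 1.0)   # E.E (FP) = 24.0
```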
3.1.1. Energy Efficiency
We consider the energy consumed by computations (multiply-and-accumulate, or MAC, operations) and memory accesses in our calculations for energy efficiency; we do not account for the energy consumed by data flow and instruction flow in the architecture. Consider a convolutional layer with n_i input channels and n_o output channels, input maps of size I × I, kernels of size k × k and output maps of size O × O. Table 1 presents the number of memory accesses and computations for standard full-precision networks:
Operation  Number of Operations

Input Read  n_i × I × I
Weight Read  n_i × n_o × k × k
Computations (MAC)  n_i × n_o × k × k × O × O
Memory Write  n_o × O × O
The number of binary operations in the binary layers of the hybrid networks is the same as the number of full-precision operations in the corresponding layers of the full-precision networks. Since we use the XNOR-Net training algorithm for the binary weights, we consider additional full-precision memory accesses and computations for the parameter α, the scaling factor for each filter bank in a convolutional layer. The number of accesses for α is equal to the number of output maps, n_o, and the number of additional full-precision computations is n_o × O × O.
We calculated the energy consumption from projections for 10 nm CMOS technology (Keckler et al., 2011). Considering 32-bit representation as full-precision, the energy consumption for binary and 32-bit memory accesses and computations is shown in Table 2.
Operation  Energy (pJ)  Operation  Energy (pJ)

32-bit Memory Access  80  Binary Memory Access  2.5
32-bit MAC  3.25  Binary MAC  0.1
The energy numbers for binary memory accesses and MAC operations are scaled down 32 times from the corresponding full-precision values. Let the number of full-precision (binary) memory accesses in a layer be N_FP (N_B) and the number of full-precision (binary) computations be C_FP (C_B). The energy consumed by the layer is then E = 80 N_FP + 2.5 N_B + 3.25 C_FP + 0.1 C_B (in pJ). For a binary layer, the only full-precision terms are those associated with α, i.e., N_FP = n_o and C_FP = n_o × O × O; for a full-precision layer, N_B = C_B = 0. Note that this calculation is a rather conservative estimate which does not take into account other hardware architectural aspects such as input-sharing or weight-sharing. However, our approach concerns modifications of the network architecture, and we compare ratios of energy consumption; these hardware aspects affect all the networks equally and hence can be left out of consideration.
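The per-layer energy model, using the Table 2 numbers, can be sketched as follows (the function name and argument order are ours):

```python
# Energy per operation at 10 nm, in pJ (Table 2)
E_MEM_FP, E_MEM_BIN = 80.0, 2.5
E_MAC_FP, E_MAC_BIN = 3.25, 0.1

def layer_energy_pj(n_fp_access, n_bin_access, n_fp_mac, n_bin_mac):
    """Total energy of one layer: weighted sum of full-precision and
    binary memory accesses and MAC operations."""
    return (n_fp_access * E_MEM_FP + n_bin_access * E_MEM_BIN
            + n_fp_mac * E_MAC_FP + n_bin_mac * E_MAC_BIN)

# A purely full-precision layer has no binary terms:
e_fp = layer_energy_pj(100, 0, 100, 0)   # 100 * (80 + 3.25) pJ
# A binary layer still pays a small full-precision cost for alpha:
e_bin = layer_energy_pj(4, 1000, 4, 1000)
```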
3.1.2. Memory Compression
The memory required by a network is the product of the total number of weights in the network and the precision of the weights. The number of weights in a convolutional layer is n_i × n_o × k × k. Thus, the memory required by a full-precision layer is 32 × n_i × n_o × k × k bits, and that of a binary layer is n_i × n_o × k × k bits.
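The storage of a single convolutional layer follows directly; here we assume n_in input channels, n_out output channels and k × k kernels:

```python
def layer_memory_bits(n_in, n_out, k, bits_per_weight):
    """Storage of one convolutional layer: number of weights
    (n_in * n_out * k * k) times the precision of each weight."""
    return n_in * n_out * k * k * bits_per_weight

fp = layer_memory_bits(64, 64, 3, 32)   # full-precision: 32-bit weights
bn = layer_memory_bits(64, 64, 3, 1)    # binary: 1-bit weights
ratio = fp // bn                        # 32x compression for this layer
```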
Note that the assumptions in the energy and storage calculations for binary layers hold for custom hardware capable of handling fixed-point binary representations of data, thus leveraging the benefits offered by quantized networks.
3.2. Image Classification Framework
We evaluated the performance of all the described networks in PyTorch. We explore the hybridization of two network architectures, namely ResNet-20 and VGG-19, where the training algorithm for the binarized layers is adopted from XNOR-Net (Rastegari et al., 2016). We perform image classification on the CIFAR-100 dataset (Krizhevsky and Hinton, 2009), which has 50,000 training images and 10,000 testing images of size 32×32 across 100 classes. We report classification performance using top-1 test accuracy.
3.3. Drawbacks of XNOR-Nets
First, we evaluate the performance of our baseline binary networks, or XNOR-Nets, for VGG-19 and ResNet-20. This analysis helps us understand the drawbacks of using purely binary networks without any hybridization. Table 3 lists the accuracy, energy efficiency and memory compression of the XNOR-Nets:
Network  Full-Precision Accuracy (%)  XNOR-Net Accuracy (%)  E.E (FP)  M.C (FP)

VGG-19  67.21  37.47  24.13  24.08
ResNet-20  65.81  50.2  18.67  17.26
We observe that VGG-19 and ResNet-20 have similar full-precision accuracies; however, the XNOR-Net VGG-19 suffers a significantly higher degradation in accuracy than the XNOR-Net ResNet-20 on CIFAR-100. Inflating the networks, i.e., making them wider, is one way to recover the accuracy degradation suffered by XNOR-Nets. As shown in Table 4, inflating the networks brings the accuracy of the XNOR ResNet-20 close to full-precision accuracy at the cost of memory compression; in the case of the XNOR VGG-19, however, the improvement is not significant. To address this degradation, we propose hybrid network architectures that use a few full-precision layers in extremely quantized networks to improve the performance of XNOR-Nets while still achieving significant energy efficiency and memory compression with respect to full-precision networks.
3.4. Hybrid ResNets
We compare the proposed hybrid ResNet architectures, namely Hybrid-Res A, Hybrid-Res B and Hybrid-Res C. Hybrid-Res A is an XNOR ResNet with full-precision residual connections. Hybrid-Res B consists of 6 full-precision layers and 12 binary layers. Hybrid-Res C has a fixed full-precision fraction in each layer. These numbers are chosen to maintain reasonable compression with respect to full-precision networks. We explore these network architectures for varying network width. Table 5 lists the accuracy and other metrics for the hybrid networks, and Fig. 3 (a) (and (b)) compares the XNOR and explored hybrid ResNet architectures in terms of energy efficiency (and memory compression) versus accuracy.
We observe that the hybrid network with full-precision residual connections, Hybrid-Res A, achieves a superior trade-off between accuracy, energy efficiency and memory compression compared to the sectioned hybrid networks Hybrid-Res B and Hybrid-Res C. In fact, Fig. 3 shows that Hybrid-Res A is even superior to simply inflating the network width. These results highlight the importance of full-precision residual connections during binarization.
3.5. Hybrid VGG networks
Full-precision residual connections offer an effective way of hybridizing ResNets. We apply the same concept to VGG networks, proposing a hybrid VGG network formed by adding full-precision residual connections to a binary VGG network, namely Hybrid-VGG A. We also explore the sectioned hybrid networks described in Section 2.4, Hybrid-VGG B and Hybrid-VGG C, to identify the optimal hybrid VGG architecture. Note that Hybrid-VGG B has a fixed full-precision fraction in each layer, and Hybrid-VGG C has full-precision linear layers with the rest binary. These numbers are chosen to maintain reasonable compression with respect to full-precision networks. Table 6 compares the explored networks.
Fig. 4 (a) (and (b)) compares the networks in terms of accuracy and energy efficiency (and memory compression). We observe that the hybrid VGG network with full-precision residual connections (Hybrid-VGG A) achieves the best accuracy while also achieving the highest compression and energy efficiency among the hybrid networks. Hybrid-VGG A improves the performance of the VGG XNOR baseline by 25 % while retaining substantial energy efficiency and memory compression compared to a full-precision network. Sectioning the network such that the final 3 fully-connected layers are full-precision, as in Hybrid-VGG C, does not match Hybrid-VGG A in terms of accuracy despite losing significantly in efficiency and compression. Interestingly, Hybrid-VGG A also performs better than Hybrid-VGG B, despite the latter carrying full-precision information in parallel. To summarize, Hybrid-VGG A emerges as the optimal hybrid network, and is also significantly superior to the VGG-2-bit and inflated binarized VGG networks. These results further show that hybrid networks with full-precision residual connections offer the best trade-off for improving the performance of extremely quantized networks.
3.6. Scaling to ImageNet
ImageNet (Deng et al., 2009) is among the most challenging datasets for image classification and is used in the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC). The subset used in our experiments consists of 1.2 million training images and 50,000 validation images divided into 1000 categories. We consider the VGG-19 network for our evaluation and compare Hybrid-VGG A with the VGG-19 XNOR baseline and Hybrid-VGG C. For further comparison, we use the variant of Hybrid-VGG A described in Section 2.3, Hybrid-VGG A’. Table 7 lists the accuracy, energy efficiency and memory compression of the explored networks. We observe that although Hybrid-VGG A does not enhance the performance significantly, using an additional full-precision linear layer, as in Hybrid-VGG A’, can increase the performance of the XNOR-Net baseline by 20 % while still achieving 13.1× energy efficiency compared to full-precision networks. Hybrid-VGG C matches the performance of Hybrid-VGG A’, however, at the cost of compression. These results show that the superiority of hybrid networks with full-precision residual connections over other hybrid networks holds even for larger datasets.
3.7. Discussion
Hybrid networks with full-precision residual connections achieve superior performance compared to the other hybrid network architectures explored in this work. Residual connections offer a parallel path for carrying information from input to output, and hybrid networks with full-precision residual connections exploit this characteristic to partially preserve information lost due to quantization. Residual connections are also computationally simple, and the number of weight layers in the residual path is small. Due to this low overhead, using full-precision residual connections in binary networks is a promising technique for approaching the performance of full-precision networks while still achieving significant compression and energy efficiency.
The enormous computing power and memory requirements of deep networks stand in the way of the ubiquitous use of AI for performing on-chip analytics in low-power edge devices. The memory compression, along with the close match to state-of-the-art accuracies, offered by the proposed hybrid extremely quantized networks goes a long way toward addressing that challenge. The significant energy efficiency of these compressed hybrid networks increases the viability of using AI, powered by deep neural networks, in edge devices. With the proliferation of connected devices in the IoT environment, AI-enabled edge computing can reduce the communication overhead of cloud computing and augment the functionality of devices beyond primitive tasks such as sensing, transmission and reception, toward in-situ processing.
4. Conclusion
Binary neural networks suffer from significant degradation in accuracy for deep networks and larger datasets. In this work, we propose extremely quantized hybrid networks with both binary and full-precision sections that closely match full-precision networks in classification accuracy while still achieving significant energy efficiency and memory compression. We explore several hybrid network architectures, such as binary networks with full-precision residual connections and sectioned hybrid networks, to study the trade-offs between performance, energy efficiency and memory compression. Our analysis of ResNet and VGG networks on datasets such as CIFAR-100 and ImageNet shows that hybrid networks with full-precision residual connections emerge as the optimum in terms of accuracy, energy efficiency and memory compression compared to other hybrid networks. This work sheds light on effective ways of designing compressed neural network architectures and potentially paves the way toward using energy-efficient hybrid networks for AI-based on-chip analytics in low-power edge devices with accuracy comparable to full-precision networks.
Acknowledgement
This work was supported in part by the Center for Brain-inspired Computing Enabling Autonomous Intelligence (C-BRIC), one of six centers in JUMP, a Semiconductor Research Corporation (SRC) program sponsored by DARPA, in part by the National Science Foundation, in part by Intel, in part by the ONR-MURI program and in part by the Vannevar Bush Faculty Fellowship.
References

Krizhevsky et al. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
Szegedy et al. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
He et al. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
Girshick et al. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440–1448, 2015.
Mikolov et al. Efficient estimation of word representations in vector space. In Proceedings of the Workshop at the International Conference on Learning Representations, 2013.
Mnih et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
Silver et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354, 2017.
Iandola et al. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360, 2016.
Alvarez and Salzmann. Compression-aware training of deep networks. In Advances in Neural Information Processing Systems, pages 856–867, 2017.
Weigend et al. Generalization by weight-elimination with application to forecasting. In Advances in Neural Information Processing Systems, pages 875–882, 1991.
Han et al. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pages 1135–1143, 2015.
Ullrich et al. Soft weight-sharing for neural network compression. arXiv preprint arXiv:1702.04008, 2017.
Hubara et al. Quantized neural networks: Training neural networks with low precision weights and activations. The Journal of Machine Learning Research, 18(1):6869–6898, 2017.
Mellempudi et al. Ternary neural networks with fine-grained quantization. arXiv preprint arXiv:1705.01462, 2017.
Rastegari et al. XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision, pages 525–542. Springer, 2016.
Prabhu et al. Hybrid binary networks: Optimizing for accuracy, efficiency and memory. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 821–829. IEEE, 2018.
Keckler et al. GPUs and the future of parallel computing. IEEE Micro, (5):7–17, 2011.
Krizhevsky and Hinton. Learning multiple layers of features from tiny images. Technical report, 2009.
Deng et al. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 248–255. IEEE, 2009.