Introduction
1 Introduction
The recent advent of ‘Internet of Things’ (IOT) has deeply impacted our lives by enabling connectivity, communication and autonomous intelligence. With rapid proliferation of connected devices, the amount of data that needs to be processed is ever increasing. These data collected from numerous distributed devices are usually noisy, unstructured and heterogeneous [1]
. Deep learning succeeds in reliably processing such complex and large volumes of data where conventional machine learning techniques fail
[2]. Thus, it has become the driving force behind ubiquitous Artificial Intelligence (AI), and we see the pervasiveness of deep learning in various applications such as speech recognition, predictive systems and image and video classification
[3, 4, 5, 6].Traditionally, IOT devices act as data collecting interfaces that feed the deep learning models deployed in centralized cloud computing systems. However, such systems have their own issues and vulnerabilities. In realtime application such as selfdriving cars, the latency of communication between IOT devices and the cloud can pose a serious safety risk. As more IOT devices connect to the cloud, it strains the available shared bandwidth for communication. Furthermore, rising concerns around data privacy and overcentralization of information has propelled the need for decentralized userspecific systems [7, 8]. Edge computing [9] is a promising alternative that enables IOT devices to process data, thus reducing communication overhead and latency and ensuring decentralization of data. The facilitation of onchip analytics offered by edge computing can prove to be pivotal for autonomous platforms such as drones and selfdriving cars as well as smart appliances. In addition, smart edge devices can play a significant role in healthcare monitoring systems and medical applications. Intelligent edge devices can be further leveraged for swarm intelligence based applications. However, computing in these resource constrained edge devices comes with its own challenges. Deep learning models are usually large in size and computationally intensive, thus making them difficult to implement in lowpower and memoryconstrained IOT devices. Thus, there is a need to design deep learning models which can perform effectively while requiring less memory and less computations.
One approach toward compressing neural network models is to modify the network architecture itself, such that it has fewer parameters, such as SqueezeNet [10]. Another method of compression is pruning which aims to reduce redundancies in overparameterized networks. To that effect, researchers have investigated several network pruning techniques, both during training [11, 12] and inference [13, 14].
A different technique of model compression is representing weights and activations with reduced precision. Quantized networks [15] help achieve reduction in energy consumption as well as improve memory compression compared to fullprecision networks. Binary neural networks [16]
are an extreme case of quantization where the activations and weights are reduced to binary representations. These networks drastically reduce the energy consumption by replacing the expensive multiply and accumulate (MAC) operations with simple add or subtract operations. This massive reduction in memory usage and computational cost make them particularly suitable for edge computing. However, despite these benefits, the networks suffer from performance and scalability issues, especially for complex pattern recognition tasks.
Several training algorithms [17]
have been proposed to optimize network performance to achieve stateofart accuracy in extremely quantized neural networks. Although such training methodologies recover the performance hit caused by binarizations weights alone, they fail to completely counter the degradation caused by binarizing both weighs and activations. In this work, we present HybridNet, a mixedprecision network topology fashioned by the combination of binary and highprecision inputs and activations in different layers of a network. We use Principal Component Analysis (PCA) to determine significance of layers based on the ability of a layer to expand data into higher dimensional space, with the ultimate aim of linear separability. Viewing a neural network as an iterative projection of input onto a successively higher dimensional manifold at each layer, until the data is eventually linearly separable allows us to identify layers that contribute relevant transformations. Following the algorithm in
[18], we find the ‘significant dimensions’ in a layer as the number of dimensions that cumulatively explain 99.9% of the total variance of the output activation map generated by that layer. Since we want the data to be expanded into higher dimensions at each layer, we deem the layers at which significant dimensions increase from the previous layer as significant. Following the identification of significant layers, we increase the bitprecision of the inputs and weights of those layers, keeping the rest of the layers entirely binary. Traditionally, PCA has been used primarily as a dimensionality reduction technique. It was also recently used to identify redundancies in different layers of a neural network and prune out the redundant features
[18]. We propose a methodology where we use PCA in a reverse manner, i.e., to increase the precision of the important layers. HybridNet remarkably improves the performance of extremely quantized neural networks, while keeping the activations and weights of the most of the layers binary. This ensures low energy consumption and high memory compression of extremely quantized neural networks while achieving significantly enhanced classification performance compared to binary networks such as XNOR networks. This work not only achieves signficant progress in the challenge of quantizing neural networks to binary representations but also paves way for optimized yet highly accurate quantized networks suitable for enabling intelligence at the edge.2 Related Work
Various techniques have been proposed to improve the performance of quantized networks. Fully binary networks [16, 17] are constructed by replacing the activations with their sign. However, these networks usually suffer from significant degradation in accuracy, especially for larger datasets such as CIFAR100 and ImageNet. One intuitive way of recovering quantization errors is using wider networks [19] but it comes at the cost of increased energy consumption. There have been efforts focusing on gradient calculations for approximated sign functions to ameliorate the effect of binarization [20]. More general quantization schemes have also been explored for weights and activations [21, 22]. Although weight quantization can be compensated by training the network with quantized weights [15]
, it has been observed that input quantization pose a serious challenge to classification performance for precisions lower than 4 bits. One approach that addresses this challenge involve clipping the activations by setting an upper bound. Although this approach seems to be heuristic, recent efforts have focused on using trainable quantization that can be dynamically manipulated
[23, 24]. One such approach involves parameterized clipping where the clipping level is dynamically adjusted through gradient descent [25]. Another approach proposed the use of batchnormalization layers after ReLU activations to bound the activation values for effective quantization
[26]. Note, that most of these works focus on optimizing the activations when the quantization precision is 2 bits or more. Binary networks with both 1bit activations and weights, despite offering the most benefits in terms of computation cost and memory compression, still suffer from significant degradation in performance compared to fullprecision networks.An alternative path towards improving the accuracy of binary neural networks focuses on network design techniques. To that effect, improved input representations through shortcut connections in deep networks can significantly improve performance of binary neural networks without any increase in computation cost [20]. This is because shortcut connections are usually identity in nature and do not comprise of expensive MAC operations. Combinations of different kinds of input precisions have also been explored across different layers to circumvent the significant decrease in classification accuracy of such binarized networks [27]. There has been considerable effort in making the search for optimum neural architecture more sophisticated through efficient design space exploration [28]. A theoretical approach towards predicting layerwise precision requirement has been also explored [29]. Our work differs from most of the current efforts in quantized neural networks as it lies in the realm of hybrid network design for more optimal performance of neural networks where most of the layers still have 1bit weights and activations. This motivates us to propose an algorithm to identify important layers and judiciously reinforce those particular layers with higher bitprecision representation. To follow such a motivation, it is necessary to understand the significance of layers, which we explain in the next section.
3 PCAdriven HybridNet Design
A HybridNet is a neural network that employs two different bit precisions for its weights and activations. The base network is of low precision, for example 1 bit, and certain layers are selected and set to a higher bit precision. For selecting the layers, we use Principal Component Analysis (PCA) on the output feature maps of each of the layers. Given any set of correlated variables, such as the feature maps, PCA does an orthogonal transformation to map them to uncorrelated variables called Principal Components(PCs), which also form the orthogonal basis set for these tensors. Each of these resulting basis vectors identify directions of varying variance in the data, and are ordered in decreasing manner, with the first vector in the direction of highest variance.
In a neural network, each layer applies a transformation on its input and projects it to a new feature space of ideally higher or same dimension with the objective of achieving linear separability. PCA provides the ability to study the directions of maximum variance in the input data. The preReLU activation map generated by a filter is considered to composed of many instances of that particular filter. Performing PCA and finding the number of filters needed to explain a predefined cumulative percentage of variance identifies the number of significant dimensions for each layer. More the number of principal components needed to preserve a significant percentage, say , of the total variance in the input, lesser is the redundant information carried by those tensors, and higher is the significant dimensionality of those tensors. Ideally, we want the number of PC’s required to explain
% of the total variance of the feature space to increase as we move deeper into the network in order to extract more uncorrelated, unique features from the data, and project it into a higher dimensional space that will eventually lead to linear separability at the classifier layer. Thus, the layers for which the number of PCs explaining variance in the output data is more than that in the input data, contribute to significant transformations on the input data. In this section, we propose a methodology to identify these significant layers and subsequently design mixedprecision networks by increasing the bitprecision of those layers.
algocf[h!]
3.1 PCAdriven identification of significant layers
We perform our analysis on activations of each layer, which provide a notion of activity of each filter in that layer. Let us consider the activation matrix of the layer, . Layer has filter banks, each containing filters of size . and are the number of input and output channels. The first element of the output map of is the result of convolution of the first sized input patch with the filter bank. The rest of the elements of any particular output map of
can be obtained by striding over the entire input. Thus, if we consider
as the size of a minibatch, is a 4dimensional matrix of size where and are the height and width of the each output map. If we flatten the 4D data into a 2D matrix , we would obtain samples, each containing elements equivalent to the number of filter banks. This process is shown in Fig. LABEL:fig:pcaexp_(a).When PCA is performed over the aforementioned 2D matrix
, the singular value decomposition (SVD) of the mean normalized, symmetric matrix
generates eigenvectorsand eigenvalues
. The total variance in the data is given by sum of the variances of individual parameters:(1) 
The contribution of any component, , towards the total variance can be expressed as . To calculate the number of significant components, we set a threshold value which is amount of variance the first significant components are able to explain. This can be expressed as:
(2) 
An example of a typical curve of the cumulative sum of variance for different filter numbers, obtained by PCA, is shown in Fig. 1 (a) (rightmost). As the PCA analysis produces the most significant components to explain fraction of the total variance, we proceed to identify the significant layers. We define a significant layer as the layer which transforms the input data such that the number of significant components to explain fraction of the variance, increase from that required for the output of previous layer. Let be the number of significant components corresponding to the layer. Then, it can be said that layer contributes a relevant transformation on the input data if . It means that the layer requires more significant components to explain the variance in the data at its output than the previous layer. However, for a better control on deciding the important layers, we check the condition whether to determine if the layer is significant. This is explained in Fig. 1 (b) (middle) where the dots marked in red denote the significant layers where .
3.2 HybridNet Design
The PCA analysis helps us identify a set of important layers in an Nlayer network. We design a hybrid network, ‘HybridNet’, where we set the bitprecision of weights and inputs of the important layers to a higher value, , than the other layers which have binary weights and inputs. This is shown in Fig. 1 (b) (rightmost). The weights and inputs of the first and the final layers of a Nlayer network are kept fullprecision, according to standard practice [17, 21, 25]. The quantization algorithm for any bit quantized layer can be readily derived from XNORNet [17] where the quantized levels are:
(3) 
In a layer with bit weights and activations, is used instead of the function in layers with binary weights and activations. We use a slightly modified version of quantized networks, proposed in [17], where the weights have a scaling factor instead of just being quantized. The convolution operation in between inputs and weights in such a network is approximated as:
Binary:  (4)  
bit:  (5) 
Here, is the L1norm of and act as a scaling factor for the binary weights. In binary layers, the activation gradients are clipped such that they lie between 1 and 1. In the
bit layers, we get rid of the activation gradient clipping for better representation. Each layer of a Nlayer HybridNet have either binary or
bit weight kernels and the activations after each convolution are again quantized before passing to the next layer.HybridNet is expected to have a higher computation cost than a binary network. The parameter
decides the number of important layers to consider and hence a penalty is incurred due to increase in bitprecision. We can estimate the penalty in computation cost incurred due to the increase in bitprecision a in network with
number of significant layers as(6) 
Here is the computation cost of a binary layer and is the overhead of bit computation over binary computation. We will present a detailed analysis of energy consumption and memory usage later in the manuscript.
For residual networks, ResNets, we include another design feature in addition to the PCAdriven HybridNet, improving input representations through residual connections. This has been alluded to by Liu et al in
[20] where adding identity shortcut connections at every layer improves representational capability of binary networks. In standard residual networks [5], such identity connections are added to address the vanishing gradient problem in deep neural networks. However, in case of binary networks, these connections serve to provide an improved representation by carrying floatingpoint information from the previous layer. As a result, the HybridNet design also considers the effect of adding such highway connections at every layer. Note, in case of convolution layers which induce a change in size of each feature map, the shortcut connections consist of
convolution weight layers to account for the change in size [5].4 Experiments, Results and Discussion
4.1 Experiments
We evaluated the performance of all the networks described in this section 4.2 in PyTorch
[30]. We perform image classification on the datasets CIFAR100 [31] and ImageNet [32]. The CIFAR100 dataset has 50000 training images and 10000 testing images of size for 100 classes. For the CIFAR100 dataset, we explore the proposed HybridNet design methodology on standard network architectures, ResNet20, ResNet32 and VGG15, where the training algorithm for the quantized layers has been adopted from [17]. We extended our analysis to the ImageNet dataset [32] which is the most challenging dataset pertaining to image classification tasks. It consists of 1.2 million training images and 50000 validation images divided into 1000 categories. For simplicity, we considered ResNet18 for our ImageNet evaluation. We explore different network configurations, for ResNet, shown in Fig. 2 (a) and VGG architectures, shown in Fig. 2 (b) to compare the proposed HybridNet. Note that HybridComp A is formed by interlayer sectioning, i.e., dividing the network into 2 parts ( binary and bit layers) where is the number of the layers between the first and last layer. The widths of the network architectures, shown in Fig. 2 (a) and Fig. 2 (b) are for CIFAR100 dataset. For ImageNet, we have used a wider network architecture, which we describe in Table. 1.ResNet  18 

77 conv 64 stride 2 
33 maxpool stride 2 
33 conv 64 stride 1 ( 4) 
33 conv 128 stride 2 
33 conv 128 stride 1 ( 3) 
33 conv 256 stride 2 
33 conv 256 stride 1 ( 3) 
33 conv 512 stride 2 
33 conv 512 stride 1 ( 3) 
Linear 1000 
4.1.1 Energy efficiency and Memory compression
We have briefly alluded to the possible penalty incurred due to increasing the bitprecision of certain layers in a network. To identify its effect with respect to the entire network metrics and further illustrate the benefits of the proposed HybridNets, we perform a storage and computation analysis to calculate the energy efficiency and memory compression of the proposed networks. For any two networks A and B, the energy efficiency and memory compression of Network A with respect to Network B can be defined as:
(7)  
where and are the energy consumed by Network A and Network B respectively, and are the memory used for storing the weights of Network A and Network B, respectively. We estimate energy efficiency (E.E) and memorycompression (M.C) with respect to an fullprecision network and normalize it with respect to an XNORNet network which is an entirely binary network except the first and final layer. Thus, the normalized E.E () and normalized M.C () of any network A can be written as:
(8) 
Here, is the energy (memory) consumed by the layer of a network with fullprecision weights and activations whereas is the energy (memory) consumed by the layer of any network A under consideration.
4.2 Results  PCA
We perform PCA analysis on the activations of each convolutional layer and extract the number of principal components required to explain a fraction of variance in the data. The design parameters such as and subsequently are heuristically. For all analysis, we fix as this makes the increases in significant components, across various layers clearly distinguishable. The values are chosen based on the variation in across layers. A higher value yields less number of significant layers. For clarity, we perform our analysis for various values.
4.2.1 ResNet Architectures  CIFAR100
For ResNet architectures, we perform the PCA on a plain version of a binary network devoid of any residual connections. We decided to do this to isolate the effect of the convolution layers on the activations, instead of having residual connections. This is done because we focus on the quantization of the filters of the layers and the residual additions may distort the output feature space and hence the information we seek from it. Fig. 3 (a) and (b) shows the variation in the number of filters required to explain with different layers for ResNet20 and ResNet32 architectures respectively. As expected, the maximum change in occurs when the number of output channels increase. However, we observe a trend in both networks, that the layers just after the output channels increase from, say 16 to 32 or 32 to 64, attribute for the maximum change in the number of significant filters. Based on our criteria for significant layers, discussed in Section 3.1, we fix a for Resnet20 and for ResNet32 to identify the layers where the number of significant components undergo a change more than . Fig. 2 (a) also shows those layers marked by red dots. Note, by varying the , more or less number of layers can be considered as significant. After performing this analysis on a plain version of the ResNet architecture, we perform network simulations on the standard version with residual connections.
CIFAR100  

Network Arch  Significant layers 
ResNet20 ()  8, 9, 10, 14, 15, 16, 18 
ResNet20 ()  8, 9, 14, 15 
ResNet32 ()  12, 13, 22, 23, 24 
VGG15 ()  3, 5, 8, 11, 12 
VGG15 ()  3, 5, 8 
ImageNet  
Network Arch  Significant layers 
ResNet18 ()  6, 10, 14, 15 
ResNet18 ()  6, 10, 11, 14, 15, 16 
ResNet18 ()  6, 7, 10, 11, 14, 15, 16 
4.2.2 VGG Architectures  CIFAR100
For VGG architecture, we perform the PCA of a binary network which has binary weights and activations for all layers except the first and the last. Fig. 3 (c) shows the plot showing how the number of filters required to explain of the variance changes with different layers for a VGG15 architecture. We observe that the number of significant filters mostly increase when the number of filter bank increases at a particular layer. For rest of the layers, it remains fairly constant. As the PCA plot shows very little variation across layers, we consider a relatively lower with respect to the number of filters. We mark the significant layers by red dots in Fig. 3 (c). Table. 2 lists the different combination of significant layers obtained from ResNet and VGG architectures through the PCA analysis for different values for CIFAR100 dataset. Note, we did not choose a lower for ResNet32 as it would have included many layers which would increase the computation cost without a significant benefit in accuracy.
4.2.3 ResNet Architectures  ImageNet
We further perform PCA analysis on ResNet18 architecture for the ImageNet dataset. Fig. 3 (d) shows the plot showing how the number of filters required to explain of the variance changes with different layers for ResNet18 for . The significant layers identified by our proposed methodology are marked with red dots. We observe a similar trend as in case of CIFAR100, that maximum increase in the number of significant filters, , occur in the first few layers after every change in filter size. We perform the PCA analysis for to identify the significant layers, listed in Table. 2.
4.3 Image Classification Results  CIFAR100
4.3.1 ResNet Architectures
The ResNetN architecture consists of N1 convolution layers and a fullyconnected classifier. As discussed before, the first convolution layer and the classifier have full precision inputs and weights. For CIFAR100 dataset, we consider and . Further, we consider a slightly modified version of ResNet, where we add identity shortcut connections at every layer instead of every two layers for better input representation, as discussed earlier. We increase the bitprecision of weights and inputs of the layers obtained from PCA analysis to bit precisions and to form HybridNet (, ). The rest of the layers have binary representations for weights and inputs. We also compare the proposed HybridNet with HybridComp A (, ) (), which is formed by splitting the entire network into binary and bit sections. Table. 3 shows that accuracy, energy efficiency and memory compression of the proposed HybridNet based on ResNet20 and ResNet32 in comparison to XNORNet and other kinds of hybrid networks discussed in Fig. 2.
ResNet20  

FP Accuracy  69.49%  
 16.35,  17.26  
Network Type  Accuracy (%)  
XNOR  51.55  1  1 
BinaryShortcut 1  53.91  0.99  1 
HybridNet (2,2) ()  62.18  0.87  0.77 
HybridNet (4,4) ()  62.66  0.7  0.53 
HybridNet (2,2) ()  60.38  0.93  0.88 
HybridNet (4,4) ()  61.37  0.82  0.7 
Quantized (2,2)  64.55  0.73  0.65 
XNOR  2x width  65.11  0.39  0.33 
HybridComp A (2,2) (k=6)  61.49  0.88  0.71 
HybridComp A (2,2) (k=12)  62.17  0.8  0.67 
ResNet32  
FP Accuracy  70.62%  
 18.42,  20.44  
Network Type  Accuracy (%)  
XNOR  53.21  1  1 
BinaryShortcut 1  57.2  0.99  1 
HybridNet (2,2) ()  63.87  0.94  0.87 
HybridNet (4,4) ()  64.27  0.84  0.69 
Quantized (2,2)  67.46  0.7  0.61 
XNOR  2x width  63  0.38  0.31 
HybridComp A (2,2) (k=6)  62.26  0.91  0.76 
VGG15  
FP Accuracy  68.31%  
 21.77,  26.24  
Network Type  Accuracy (%)  
XNOR  48.69  1  1 
HybridNet (2,2) ()  61.81  0.84  0.75 
HybridNet (4,4) ()  63.38  0.64  0.5 
HybridNet (2,2) ()  59.55  0.93  0.92 
HybridNet (4,4) ()  60.02  0.81  0.80 
Quantized (2,2)  67.65  0.65  0.55 
XNOR  2x width  57.03  0.29  0.3 
HybridComp A (2,2) (k=3)  57.62  0.85  0.72 
We observe that the proposed HybridNet achieves a much superior tradeoff between accuracy, energy efficiency and memory compression compared to other kinds of hybridization techniques. Moreover, in case of both ResNet20 and ResNet32, HybridNet increases the classification accuracy by 1011% compared to a XNORNet with minimal degradation in efficiency and compression. While quantizing the entire network to 2bit inputs and weights (Quantized (2,2)) achieves a slightly higher accuracy, we show that the our principle of increasing the bitprecision of few significant layers captures most of the increase in accuracy from an XNORNet to a 2bit networks. HybridNet thus consumes less energy and less memory for ResNet20 than a 2bit network with a performance within of the latter. For ResNet32, the benefits of HybridNet is even pronounced where it consumes less energy and less memory than a 2bit network while achieving accuracy within of the latter. HybridNet thus ensures a signficant improvement in accuracy over a binary network without making the entire network 2bit. We also show that HybridNet achieves a higher accuracy than HybridComp A networks while consuming less energy for both ResNet20 and ResNet32, thus demonstrating the effectiveness of the design methodology.
4.3.2 VGG architecture
We further extend our analysis to VGG architectures. We considered VGG15, which consists of 13 convolutional and 2 fullyconnected layers as shown in Fig. 2 (b). We kept one of the fullyconnected layer binary to preserve energyefficiency. Table 3 lists the accuracy, energy efficiency and memory compression results for VGG15 on CIFAR100 for different networks. We consider and for our analysis and for each of the network configurations we use for the significant layers. We observe that HybridNet achieves 13% higher accuracy than a XNORNet with minimal degradation in efficiency. When we make the inputs and weights of the entire network 2bit (Quantize (2,2)), we achieve an even higher accuracy. We also show that HybridNet with 2bit layers achieve a better performance than HybridComp A for isoefficiency in energy and memory. Similar to trends in ResNet, we observe that making the significant layers 4bit while keeping the rest of the layers binary improves performance, however, at the cost of energyefficiency. An entirely 2bit network proves to be a more efficient solution. In summary, even for VGG architecture, we show that HybridNet achieves significantly superior performance compared to a XNOR network while keeping most of the layers binary.
Resnet18  
FP Accuracy  69.15%  
 17.00,  13.35  
Network Type  Accuracy (%)  E.E  M.C  
XNOR  50.33  1  1  
BinaryShortcut 1  54.05  1  1  
HybridNet (2,2) ()  60.38  0.9  0.87  
HybridNet (2,2) ()  61.95  0.84  0.8  
HybridNet (2,2) ()  62.73  0.82  0.8  
HybridNet (4,4) ()  61.70  0.75  0.7  
Quantize (2,2)  XNORkbit  64.51  0.7  0.71 
DoReFA  62.6  –  –  
PACT  67  
HybridComp A (2,2) (k=4)  59.47  0.86  0.77 
4.4 Image Classification Results  ImageNet
We evaluate the proposed HybridNet design Table. 4. We observe that the XNOR network suffers a significant degradation in accuracy from a fullprecision network. Even the BinaryShortcut 1 network with residual connections at every layer fail to recover the classification accuracy. Compared to these binary network, we observe that the proposed HybridNet, considering both 2bit and 4bit weights and activations achieves upto 10 % higher accuracy than corresponding XNOR network, while achieving a normalized energyefficiency of 90% and normalized memory compression of 87% of a fully binary network. Quantizing the activations and weights of the entire network to 2bits can further increase the accuracy by 12% but at the cost of a 1520% increase in energy consumption than HybridNet. Note, that we have provided results for different input quantization algorithms, such as DoReFANet [21] and PACT [25], for the Quantize (2,2) network although we have used the XNOR quantization (described in Section 3.2) in this work. This work shows that increasing the bitprecision of a few significant layers can remarkably boost the performance of binary neural networks without making the entire network higher precision.
4.5 Discussion
The proposed HybridNet design uses PCAdriven hybridization of extremely quantized neural networks, resulting in significant improvements and observations as listed. One key contribution of the proposed methodology is that we can design hybrid networks without any iterations. It does not require an iterative design space exploration to identify optimal networks. Moreover, this methodology shows that increasing the bitprecision of only the significant layers in a binary network achieves performance close to that of a network that is entirely composed of layers with higher bitprecision weights and activations. Intuitively, a 2bit network performs much better than a binary network. However, our analysis shows that it is not necessary to make the weights and activations of the entire network 2bit. HybridNet achieves more than improvement over a XNOR network, which is a fully binary network except the first and final layers, by increasing the bitprecision of less than half of the entire network. In fact, for deeper networks, like ResNet34, this improvement is achieved with only increase in energy consumption from a XNORNet [17]. Thus, HybridNet goes a long way in reaching close to highprecision accuracies with networks which are mostly binary and attain comparable energyefficiency and memory compression to binary networks such as XNORNet [17] and BNN [16]. Moreover, this methodology can be extended to any network where making significant layers of the network bit while keeping the rest of the network bit (), can potentially produce comparable performance with enhanced energyefficiency than an entirely bit network.
Secondly, the performance of HybridNet depends on the nature of the plots obtained from PCA on the binary version of the networks. For example, for ResNet architectures (Fig. 3 (a), (b) and (c) for CIFAR-100 and Fig. 3 (d) for ImageNet), we observe that the number of significant components increases for the layers adjacent to the ones where the number of output channels increases, and then decreases for the later layers which have the same number of output channels. It can be said that the later layers do not add to the linear separability of the data, and binarizing them preserves the accuracy, as observed. In other words, the significant layers identified using our proposed methodology contribute remarkably more to the linear separability than the other layers. This is reflected in the results, where we show that the performance difference between HybridNet and a 2-bit network is small. However, for VGG architectures, we observe that the PCA plot remains fairly flat, which means that the identified significant layers are not remarkably different in their contribution toward the linear separability of the data in comparison to the other layers. This is reflected in the larger performance difference of HybridNet from a 2-bit network for a VGG-15 network.
Thirdly, we observe that increasing the bit-precision of the weights and activations of the significant layers to 4 bits while keeping the rest of the layers binary is not the most energy-efficient way of improving the accuracy of a network. An entirely 2-bit network proves to be more energy-efficient while performing better. This may be because the loss due to binarization cannot be significantly recovered by raising the bit-precision of a few layers far above binary while keeping most of the layers binary. Thus, the proposed methodology performs best when the precision of the significant layers is close to the base precision of the network (binary in our case).
In this work, we have considered the quantization scheme explored in [17]. Since then, there has been a plethora of work focused on improving quantization for both inputs and weights [21, 25]. HybridNet focuses on improving the performance of binary neural networks through mixed-precision network design, and we believe the improved quantization schemes should further increase the accuracy of both HybridNets and entirely 2-bit or 4-bit networks.
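For reference, the XNOR quantization of [17] used here approximates each filter bank W by a scaled sign: alpha * sign(W), with one full-precision scaling factor alpha = mean(|W|) per output filter. A minimal numpy sketch, with function name and tensor layout as our own assumptions:

```python
import numpy as np

def xnor_binarize(weights):
    """XNOR-style binarization: approximate each filter bank W_f by
    alpha_f * sign(W_f), where alpha_f = mean(|W_f|) is a per-filter
    full-precision scaling factor.

    weights: array of shape (n_o, n_i, k, k).
    Caveat: np.sign maps exact zeros to 0; real implementations
    typically map zeros to +1.
    """
    n_o = weights.shape[0]
    flat = weights.reshape(n_o, -1)
    alpha = np.abs(flat).mean(axis=1)      # one alpha per output filter
    binary = np.sign(weights)              # +1 / -1 weights
    return binary * alpha[:, None, None, None], alpha
```

The binary tensor admits XNOR-and-popcount arithmetic, while the per-filter alphas account for the handful of extra full-precision accesses and computations counted in the energy model of Section 6.1.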
The enormous computing power and memory requirements of deep networks stand in the way of the ubiquitous use of AI for performing on-chip analytics in low-power edge devices. The significant energy efficiency offered by the compressed hybrid networks increases the viability of using AI, powered by deep neural networks, in edge devices. With the proliferation of connected devices in the IoT environment, AI-enabled edge computing can reduce the communication overhead of cloud computing and augment the functionality of the devices beyond primitive tasks such as sensing, transmission and reception to in-situ processing.
5 Conclusion
Binary neural networks offer significant energy efficiency and memory compression compared to full-precision networks. In this work, we propose a one-shot methodology for designing mixed-precision, hybrid networks with binary and higher bit-precision inputs and weights to improve the performance of extremely quantized neural networks in terms of classification accuracy while still achieving significant energy efficiency and memory compression. The proposed methodology uses PCA to identify the significant layers in a binary network: those which transform the input data such that the output feature space requires more significant dimensions to explain the variance in the data. PCA is usually exploited to perform layer-wise dimensionality reduction; we use it in the opposite manner, to determine which layers cause the number of significant dimensions to increase from input to output. Next, we increase the bit-precision of the weights and activations of the significant layers while keeping those of the other layers binary. The proposed HybridNet achieves a substantial improvement over XNOR networks for ResNet and VGG architectures on CIFAR-100 and ImageNet with only a small increase in energy consumption, while ensuring a large reduction in energy consumption and memory footprint compared to full-precision networks. The memory compression, along with the close match to high-precision accuracies offered by the proposed mixed-precision network design using layer-wise information, allows us to explore interesting possibilities in the realm of hardware-software co-design. This work thus proposes an effective, one-shot methodology for designing hybrid, compressed neural networks and potentially paves the way toward using energy-efficient hybrid networks for AI-based on-chip analytics in low-power edge devices with accuracy close to full-precision networks.
6 Methods
6.1 Energy Efficiency and Memory Calculations
6.1.1 Energy Efficiency
The primary model-dependent metrics that affect the energy consumption of a classification task are the energies consumed by the computations (multiply-and-accumulate, or MAC, operations) and the memory accesses; these form the basis of our energy-efficiency calculations. We exclude the energy consumed by data flow and instruction flow in the architecture. Consider a convolutional layer with n_i input channels and n_o output channels, input of size I × I, kernels of size k × k and output of size O × O. In Table 5 we present the number of memory accesses and computations for standard full-precision (FP) networks:
Operation            Number of Operations
Input Read           n_i · I^2
Weight Read          n_o · n_i · k^2
Computations (MAC)   n_o · n_i · k^2 · O^2
Memory Write         n_o · O^2
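These per-layer counts can be tallied with a small helper; the symbols n_i, n_o, I, k, O follow the notation of this section, and the dictionary keys are illustrative:

```python
def conv_layer_ops(n_i, n_o, I, k, O):
    """Operation counts for one full-precision convolutional layer,
    following Table 5: n_i input channels, n_o output channels,
    I x I input, k x k kernels, O x O output."""
    return {
        "input_read": n_i * I * I,           # read every input pixel
        "weight_read": n_o * n_i * k * k,    # read every kernel weight
        "mac": n_o * n_i * k * k * O * O,    # n_i * k^2 MACs per output pixel
        "mem_write": n_o * O * O,            # write every output pixel
    }
```

For example, a 3-to-16-channel layer with 3 × 3 kernels on a 32 × 32 map performs n_o · n_i · k^2 · O^2 = 16 · 3 · 9 · 1024 = 442,368 MACs.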
The number of binary memory accesses (N_mem^k) and computations (N_MAC^k) in a binary layer is the same as the corresponding number in a full-precision layer of equivalent dimensions. As explained in Eq. 1, we consider additional full-precision memory accesses and computations for the parameter α, where α is the scaling factor for each filter bank in a convolutional layer. The number of accesses for α is equal to the number of output maps, n_o. The number of full-precision computations is n_o · O^2. Table 6 lists the number of k-bit and full-precision memory accesses and computations of any layer.
Operation                  Term        Number of Operations
k-bit Memory Access        N_mem^k     n_i · I^2 + n_o · n_i · k^2 + n_o · O^2
k-bit Computations (MAC)   N_MAC^k     n_o · n_i · k^2 · O^2
FP Memory Access           N_mem^FP    n_o
FP Computations            N_MAC^FP    n_o · O^2
We calculated the energy consumption from projections for 45 nm CMOS technology [33, 13]. Considering a 32-bit representation as full-precision, the energy consumption for both k-bit and 32-bit memory accesses and computations is shown in Table 7.
Operation             Term        Energy (pJ)
k-bit Memory Access   E_mem^k     (2.5 · k)/32
32b MULT FP                       3.7
32b MULT INT                      3.1
32b ADD FP                        0.9
32b ADD INT                       0.1
k-bit MAC INT         E_MAC^k     (3.1 · k)/32 + 0.1
k-bit MAC FP          E_MAC^FP    4.6
Then, the energy consumed by any layer with k-bit weights and activations is given by

E_layer = N_mem^k · E_mem^k + N_MAC^k · E_MAC^k + N_mem^FP · E_mem^FP + N_MAC^FP · E_MAC^FP    (9)

where E_mem^FP and E_MAC^FP are the full-precision (k = 32) memory-access and MAC energies.
Note that this calculation is a rather conservative estimate which does not take into account other hardware-architectural aspects such as input sharing or weight sharing. However, our approach concerns modifications of the network architecture, and we compare ratios of energy consumption; these aspects of the hardware architecture affect all the networks equally and hence can be left out of consideration. Further, FP MAC operations can be optimized for lower energy consumption; in our calculations, we have simply taken the FP MAC energy as the sum of a 32b FP Multiply and a 32b FP Add operation. Such optimizations are catered toward FP networks and reduce the FP energy consumption, which would, in turn, reduce the reported energy efficiency of the binary and hybrid networks. Since this work focuses on comparing different kinds of binary and hybrid networks, this assumption about the FP MAC energy does not affect the analysis.
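Putting Tables 6 and 7 and Eq. 9 together, the per-layer energy model can be sketched as follows. The k/32 scaling of the memory-access energy and the function signature are our reading of the tables above, not a hardware measurement:

```python
# Energy per operation (pJ) at 45 nm, following Table 7
E_MEM_32 = 2.5                    # 32-bit memory access
E_MULT_FP, E_ADD_FP = 3.7, 0.9    # 32-bit floating-point multiply / add
E_MULT_INT, E_ADD_INT = 3.1, 0.1  # 32-bit integer multiply / add

def layer_energy(n_mem_k, n_mac_k, n_mem_fp, n_mac_fp, k):
    """Energy (pJ) of one layer with k-bit weights and activations (Eq. 9).
    k-bit accesses and multiplies are assumed to scale as k/32 relative
    to their 32-bit integer counterparts; the additions do not scale."""
    e_mem_k = E_MEM_32 * k / 32.0              # k-bit memory access
    e_mac_k = E_MULT_INT * k / 32.0 + E_ADD_INT  # k-bit integer MAC
    e_mac_fp = E_MULT_FP + E_ADD_FP            # 4.6 pJ full-precision MAC
    return (n_mem_k * e_mem_k + n_mac_k * e_mac_k
            + n_mem_fp * E_MEM_32 + n_mac_fp * e_mac_fp)
```

Summing `layer_energy` over all layers, with k set per layer, gives the network-level estimates compared in Table 4.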
6.1.2 Memory Compression
The memory required for any network is given by the total number of weights in the network multiplied by the precision of the weights. The number of weights in any layer is given by

N_W = n_i · n_o · k^2    (10)
following the usual notation described earlier. Thus, the total memory requirement can simply be written as M = Σ_l N_W^l · Q_l, where Q_l is the precision of the weights in layer l. We estimate the memory compression (M.C.) with respect to a full-precision network and normalize it with respect to an XNOR-Net, which is an entirely binary network except for the first and final layers.
Note that the assumptions behind the energy and storage calculations for binary layers hold for custom hardware capable of handling fixed-point binary representations of data, thus leveraging the benefits offered by quantized networks.
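The storage estimate of Eq. 10 and the resulting compression ratio can be sketched as below; the per-layer tuple layout is an assumption of this illustration:

```python
def layer_weights(n_i, n_o, k):
    """Number of weights in a convolutional layer (Eq. 10)."""
    return n_i * n_o * k * k

def memory_bits(layers):
    """Total storage in bits: sum over layers of (weight count x precision).
    layers: list of (n_i, n_o, k, precision_bits) tuples."""
    return sum(layer_weights(n_i, n_o, k) * q for n_i, n_o, k, q in layers)

def compression_vs_fp(layers):
    """Memory compression relative to an all-32-bit version of the
    same network."""
    fp_bits = sum(layer_weights(n_i, n_o, k) * 32
                  for n_i, n_o, k, _ in layers)
    return fp_bits / memory_bits(layers)
```

An entirely binary layer compresses storage 32x; a hybrid network lands between that and the 2-bit (16x) case depending on how many layers are promoted.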
Acknowledgement
This work was supported in part by the Center for Brain-inspired Computing Enabling Autonomous Intelligence (C-BRIC), one of six centers in JUMP, a Semiconductor Research Corporation (SRC) program sponsored by DARPA, in part by the National Science Foundation, in part by Intel, in part by the ONR-MURI program and in part by the Vannevar Bush Faculty Fellowship.
References
 [1] Gubbi, J., Buyya, R., Marusic, S. & Palaniswami, M. Internet of things (IoT): A vision, architectural elements, and future directions. Future Generation Computer Systems 29, 1645–1660, DOI: 10.1016/j.future.2013.01.010 (2013).
 [2] Yao, S., Hu, S., Zhao, Y., Zhang, A. & Abdelzaher, T. DeepSense. In Proceedings of the 26th International Conference on World Wide Web  WWW’ 17, DOI: 10.1145/3038912.3052577 (ACM Press, 2017).

 [3] Krizhevsky et al. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, 1097–1105 (2012).
 [4] Szegedy et al. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, 1–9 (2015).
 [5] He et al. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 770–778 (2016).
 [6] Girshick et al. Fast R-CNN. In Proceedings of the IEEE international conference on computer vision, 1440–1448 (2015).
 [7] Kaufman, L. M. Data security in the world of cloud computing. IEEE Security & Privacy 7, 61–64 (2009).
 [8] Gonzalez, N. et al. A quantitative analysis of current security concerns and solutions for cloud computing. Journal of Cloud Computing: Advances, Systems and Applications 1, 11 (2012).
 [9] Li, D., Salonidis, T., Desai, N. V. & Chuah, M. C. Deepcham: Collaborative edgemediated adaptive deep learning for mobile object recognition. In 2016 IEEE/ACM Symposium on Edge Computing (SEC), 64–76 (IEEE, 2016).
 [10] Iandola et al. SqueezeNet: Alexnet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360 (2016).
 [11] Alvarez & Salzmann. Compressionaware training of deep networks. In Advances in Neural Information Processing Systems, 856–867 (2017).
 [12] Weigend et al. Generalization by weightelimination with application to forecasting. In Advances in neural information processing systems, 875–882 (1991).
 [13] Han et al. Learning both weights and connections for efficient neural network. In Advances in neural information processing systems, 1135–1143 (2015).
 [14] Ullrich et al. Soft weightsharing for neural network compression. arXiv preprint arXiv:1702.04008 (2017).
 [15] Hubara et al. Quantized neural networks: Training neural networks with low precision weights and activations. The Journal of Machine Learning Research 18, 6869–6898 (2017).
 [16] Courbariaux, M., Hubara, I., Soudry, D., El-Yaniv, R. & Bengio, Y. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830 (2016).
 [17] Rastegari et al. XNOR-Net: Imagenet classification using binary convolutional neural networks. In European Conference on Computer Vision, 525–542 (Springer, 2016).
 [18] Garg, I., Panda, P. & Roy, K. A low effort approach to structured cnn design using pca. arXiv preprint arXiv:1812.06224 (2018).
 [19] Mishra, A., Nurvitadhi, E., Cook, J. J. & Marr, D. WRPN: Wide reduced-precision networks. arXiv preprint arXiv:1709.01134 (2017).
 [20] Liu, Z. et al. Bi-Real Net: Enhancing the performance of 1-bit CNNs with improved representational capability and advanced training algorithm. In Proceedings of the European Conference on Computer Vision (ECCV), 722–737 (2018).
 [21] Zhou, S. et al. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160 (2016).
 [22] Zhou, S.C., Wang, Y.Z., Wen, H., He, Q.Y. & Zou, Y.H. Balanced quantization: An effective and efficient approach to quantized neural networks. Journal of Computer Science and Technology 32, 667–682 (2017).
 [23] Zhang, D., Yang, J., Ye, D. & Hua, G. Lqnets: Learned quantization for highly accurate and compact deep neural networks. In Proceedings of the European Conference on Computer Vision (ECCV), 365–382 (2018).
 [24] Jung, S. et al. Joint training of lowprecision neural network with quantization interval parameters. arXiv preprint arXiv:1808.05779 (2018).
 [25] Choi, J. et al. PACT: Parameterized clipping activation for quantized neural networks. arXiv preprint arXiv:1805.06085 (2018).
 [26] Graham, B. Lowprecision batchnormalized activations. arXiv preprint arXiv:1702.08231 (2017).
 [27] Prabhu et al. Hybrid binary networks: Optimizing for accuracy, efficiency and memory. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), 821–829 (IEEE, 2018).
 [28] Wu, B. et al. Mixed precision quantization of convnets via differentiable neural architecture search. arXiv preprint arXiv:1812.00090 (2018).
 [29] Sakr, C. & Shanbhag, N. Pertensor fixedpoint quantization of the backpropagation algorithm. arXiv preprint arXiv:1812.11732 (2018).
 [30] Paszke, A. et al. Automatic differentiation in pytorch. In NIPSW (2017).
 [31] Krizhevsky & Hinton. Learning multiple layers of features from tiny images. Tech. Rep., Citeseer (2009).
 [32] Deng et al. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, 248–255 (IEEE, 2009).
 [33] Keckler et al. Gpus and the future of parallel computing. IEEE Micro 7–17 (2011).