PCA-driven Hybrid network design for enabling Intelligence at the Edge

06/04/2019 ∙ by Indranil Chakraborty, et al.

The recent advent of IOT has increased the demand for enabling AI-based edge computing in several applications, including healthcare monitoring systems and autonomous vehicles. This has necessitated the search for efficient implementations of neural networks in terms of both computation and storage. Although extreme quantization has proven to be a powerful tool to achieve significant compression over full-precision networks, it can result in significant degradation in performance for complex image classification tasks. In this work, we propose a Principal Component Analysis (PCA) driven methodology to design mixed-precision, hybrid networks. Unlike standard practices of using PCA for dimensionality reduction, we leverage PCA to identify significant layers in a binary network, namely those which contribute relevant transformations on the input data by increasing the number of significant dimensions. Subsequently, we propose Hybrid-Net, a network with increased bit-precision of the weights and activations of the significant layers in a binary network. We show that the proposed Hybrid-Net achieves over 10% improvement in classification accuracy over binary networks such as XNOR-Net for ResNet and VGG architectures on the CIFAR-100 and ImageNet datasets, while still achieving up to 94% of the energy efficiency of XNOR-Net. The proposed design methodology allows us to move closer to the accuracy of standard full-precision networks while keeping more than half of the network binary. This work demonstrates an effective, one-shot methodology for designing hybrid, mixed-precision networks which significantly improve the classification performance of binary networks while attaining remarkable compression. The proposed hybrid networks further the feasibility of using highly compressed neural networks for energy-efficient neural computing in IOT-based edge devices.


1 Introduction

The recent advent of the ‘Internet of Things’ (IOT) has deeply impacted our lives by enabling connectivity, communication and autonomous intelligence. With the rapid proliferation of connected devices, the amount of data that needs to be processed is ever increasing. The data collected from numerous distributed devices are usually noisy, unstructured and heterogeneous [1]. Deep learning succeeds in reliably processing such complex and large volumes of data where conventional machine learning techniques fail [2]. Thus, it has become the driving force behind ubiquitous Artificial Intelligence (AI), and we see the pervasiveness of deep learning in various applications such as speech recognition, predictive systems, and image and video classification [3, 4, 5, 6].

Traditionally, IOT devices act as data-collecting interfaces that feed deep learning models deployed in centralized cloud computing systems. However, such systems have their own issues and vulnerabilities. In real-time applications such as self-driving cars, the latency of communication between IOT devices and the cloud can pose a serious safety risk. As more IOT devices connect to the cloud, they strain the available shared communication bandwidth. Furthermore, rising concerns around data privacy and over-centralization of information have propelled the need for decentralized, user-specific systems [7, 8]. Edge computing [9] is a promising alternative that enables IOT devices to process data themselves, thus reducing communication overhead and latency and ensuring decentralization of data. The on-chip analytics offered by edge computing can prove pivotal for autonomous platforms such as drones and self-driving cars, as well as smart appliances. In addition, smart edge devices can play a significant role in healthcare monitoring systems and medical applications, and can be further leveraged for swarm-intelligence-based applications. However, computing in these resource-constrained edge devices comes with its own challenges. Deep learning models are usually large and computationally intensive, making them difficult to implement in low-power, memory-constrained IOT devices. Thus, there is a need to design deep learning models which perform effectively while requiring less memory and fewer computations.

One approach to compressing neural network models is to modify the network architecture itself so that it has fewer parameters, as in SqueezeNet [10]. Another method of compression is pruning, which aims to reduce redundancies in over-parameterized networks. To that effect, researchers have investigated several network pruning techniques, both during training [11, 12] and inference [13, 14].

A different model compression technique is representing weights and activations with reduced precision. Quantized networks [15] achieve reductions in energy consumption as well as improved memory compression compared to full-precision networks. Binary neural networks [16] are an extreme case of quantization, where the activations and weights are reduced to binary representations. These networks drastically reduce energy consumption by replacing the expensive multiply-and-accumulate (MAC) operations with simple add or subtract operations. This massive reduction in memory usage and computational cost makes them particularly suitable for edge computing. However, despite these benefits, such networks suffer from performance and scalability issues, especially for complex pattern recognition tasks.
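To make the computational saving concrete, the following sketch (ours, not from the paper) shows how a dot product between {-1, +1} vectors reduces to bitwise XNOR and popcount when each value v is encoded as the bit b = (v + 1)/2:

    # Illustrative only: a {-1,+1} dot product via XNOR + popcount, the
    # bitwise form that lets binary networks avoid MAC operations.
    # Real kernels pack bits into machine words; here we use 0/1 lists.
    def xnor_popcount_dot(a_bits, w_bits):
        """a_bits, w_bits: lists of 0/1, where bit b encodes the value 2*b - 1."""
        n = len(a_bits)
        # XNOR is 1 exactly where the two {-1,+1} values agree (product = +1).
        agreements = sum(1 for a, w in zip(a_bits, w_bits) if a == w)
        # dot = (#agreements) - (#disagreements) = 2*popcount(xnor) - n
        return 2 * agreements - n

    # [-1, +1, +1, -1] . [-1, -1, +1, +1] = 1 - 1 + 1 - 1 = 0
    print(xnor_popcount_dot([0, 1, 1, 0], [0, 0, 1, 1]))  # -> 0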

Several training algorithms [17] have been proposed to optimize network performance and achieve state-of-the-art accuracy in extremely quantized neural networks. Although such training methodologies recover the performance hit caused by binarizing the weights alone, they fail to completely counter the degradation caused by binarizing both weights and activations. In this work, we present Hybrid-Net, a mixed-precision network topology fashioned by combining binary and higher-precision weights and activations in different layers of a network. We use Principal Component Analysis (PCA) to determine the significance of layers based on a layer's ability to expand data into a higher-dimensional space, with the ultimate aim of linear separability. Viewing a neural network as an iterative projection of the input onto successively higher-dimensional manifolds, until the data is eventually linearly separable, allows us to identify the layers that contribute relevant transformations. Following the algorithm in [18], we define the ‘significant dimensions’ of a layer as the number of dimensions that cumulatively explain 99.9% of the total variance of the output activation map generated by that layer. Since we want the data to be expanded into higher dimensions at each layer, we deem significant those layers at which the number of significant dimensions increases from the previous layer. Following the identification of significant layers, we increase the bit-precision of the inputs and weights of those layers, keeping the rest of the layers entirely binary. Traditionally, PCA has been used primarily as a dimensionality reduction technique; it was also recently used to identify redundancies in the layers of a neural network and prune out the redundant features [18]. We propose a methodology where we use PCA in a reverse manner, i.e., to increase the precision of the important layers. Hybrid-Net remarkably improves the performance of extremely quantized neural networks while keeping the activations and weights of most of the layers binary. This preserves the low energy consumption and high memory compression of extremely quantized neural networks while achieving significantly enhanced classification performance compared to binary networks such as XNOR networks. This work not only achieves significant progress on the challenge of quantizing neural networks to binary representations, but also paves the way for optimized yet highly accurate quantized networks suitable for enabling intelligence at the edge.

2 Related Work

Various techniques have been proposed to improve the performance of quantized networks. Fully binary networks [16, 17] are constructed by replacing the activations with their sign. However, these networks usually suffer from significant degradation in accuracy, especially on larger datasets such as CIFAR-100 and ImageNet. One intuitive way of recovering quantization errors is using wider networks [19], but this comes at the cost of increased energy consumption. There have been efforts focusing on gradient calculations for approximated sign functions to ameliorate the effect of binarization [20]. More general quantization schemes have also been explored for weights and activations [21, 22]. Although weight quantization can be compensated for by training the network with quantized weights [15], it has been observed that input quantization poses a serious challenge to classification performance for precisions lower than 4 bits. One approach to this challenge involves clipping the activations by setting an upper bound. Although this approach may seem heuristic, recent efforts have focused on trainable quantization that can be dynamically manipulated [23, 24]. One such approach involves parameterized clipping, where the clipping level is dynamically adjusted through gradient descent [25]. Another approach proposed the use of batch-normalization layers after ReLU activations to bound the activation values for effective quantization [26]. Note that most of these works focus on optimizing the activations when the quantization precision is 2 bits or more. Binary networks with both 1-bit activations and weights, despite offering the most benefits in terms of computation cost and memory compression, still suffer from significant degradation in performance compared to full-precision networks.

An alternative path towards improving the accuracy of binary neural networks focuses on network design techniques. To that effect, improved input representations through shortcut connections in deep networks can significantly improve the performance of binary neural networks without any increase in computation cost [20]. This is because shortcut connections are usually identity in nature and do not involve expensive MAC operations. Combinations of different input precisions have also been explored across different layers to circumvent the significant decrease in classification accuracy of such binarized networks [27]. There has been considerable effort in making the search for optimal neural architectures more sophisticated through efficient design-space exploration [28]. A theoretical approach towards predicting layer-wise precision requirements has also been explored [29]. Our work differs from most current efforts in quantized neural networks, as it lies in the realm of hybrid network design for more optimal performance of networks in which most layers still have 1-bit weights and activations. This motivates us to propose an algorithm to identify important layers and judiciously reinforce those particular layers with higher bit-precision representations. To follow such a motivation, it is necessary to understand the significance of layers, which we explain in the next section.

Figure 1: (a) The PCA function in Algorithm 1 for a particular layer, showing the flattening of an instance of a 4-D activation tensor and the subsequent PCA analysis, which yields a plot of how the cumulative explained variance accrues with the number of filters in that layer. The number of significant filters, N_s, is defined as the number of filters at which the cumulative sum reaches a threshold τ (say 0.99). (b) The Main() function in Algorithm 1: we take a standard binary network (leftmost, with first and final layers having full-precision weights and activations) and apply the PCA function to each binary layer (shown in white). The resulting plot (middle) shows the layer-wise variation of N_s, here for a ResNet-18 on ImageNet. A layer is considered significant (red markers) when N_s increases by at least δ from the previous layer. The Hybrid-Net (rightmost) is designed by increasing the precision of the weights and activations of the significant layers (shown in blue).

3 PCA-driven Hybrid-Net Design

A Hybrid-Net is a neural network that employs two different bit-precisions for its weights and activations. The base network is of low precision, for example 1 bit, and certain layers are selected and set to a higher bit-precision. To select the layers, we use Principal Component Analysis (PCA) on the output feature maps of each layer. Given any set of correlated variables, such as the feature maps, PCA applies an orthogonal transformation to map them to uncorrelated variables called Principal Components (PCs), which also form an orthogonal basis set for these tensors. Each of the resulting basis vectors identifies a direction of variance in the data, and they are ordered by decreasing variance, with the first vector along the direction of highest variance.

In a neural network, each layer applies a transformation on its input and projects it to a new feature space of ideally higher or equal dimension, with the objective of achieving linear separability. PCA provides the ability to study the directions of maximum variance in the input data. The pre-ReLU activation map generated by a filter is considered to be composed of many instances of that particular filter's response. Performing PCA and finding the number of filters needed to explain a pre-defined cumulative percentage of variance identifies the number of significant dimensions for each layer. The more principal components needed to preserve a significant fraction, say τ, of the total variance in the input, the less redundant the information carried by those tensors, and the higher their significant dimensionality. Ideally, we want the number of PCs required to explain a fraction τ of the total variance of the feature space to increase as we move deeper into the network, in order to extract more uncorrelated, unique features from the data and project it into a higher-dimensional space that will eventually lead to linear separability at the classifier layer. Thus, the layers for which the number of PCs explaining a fraction τ of the variance in the output data is greater than that for the input data contribute relevant transformations on the input data. In this section, we propose a methodology to identify these significant layers and subsequently design mixed-precision networks by increasing the bit-precision of those layers.

[Algorithm 1: PCA-driven identification of significant layers, comprising the PCA and Main() functions referenced in Fig. 1.]

3.1 PCA-driven identification of significant layers

We perform our analysis on the activations of each layer, which provide a notion of the activity of each filter in that layer. Let us consider the activation matrix A_l of the l-th layer. Layer l has n_o filter banks, each containing n_i filters of size k × k, where n_i and n_o are the numbers of input and output channels. The first element of the output map of A_l is the result of convolving the first k × k × n_i input patch with the filter bank; the remaining elements of any particular output map of A_l are obtained by striding over the entire input. Thus, if we consider B as the size of a minibatch, A_l is a 4-dimensional tensor of size B × n_o × h × w, where h and w are the height and width of each output map. If we flatten this 4-D data into a 2-D matrix M_l, we obtain B·h·w samples, each containing n_o elements, one per filter bank. This process is shown in Fig. 1 (a).
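As an illustrative sketch (with hypothetical tensor sizes), this flattening step can be written in a few lines of NumPy:

    import numpy as np

    # Flattening step: a (B, n_o, h, w) activation tensor is reshaped so that
    # every spatial position of every image becomes one sample with n_o
    # features (one per filter bank), ready for PCA.
    B, n_o, h, w = 32, 64, 8, 8           # hypothetical minibatch of activations
    acts = np.random.randn(B, n_o, h, w)  # stand-in for pre-ReLU activations

    # Move channels last, then collapse the batch and spatial dimensions:
    flat = acts.transpose(0, 2, 3, 1).reshape(-1, n_o)
    print(flat.shape)  # (B*h*w, n_o) = (2048, 64)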

When PCA is performed on the aforementioned 2-D matrix M_l, the singular value decomposition (SVD) of the mean-normalized, symmetric matrix M_l^T M_l generates eigenvectors v_i and eigenvalues λ_i. The total variance in the data is given by the sum of the variances of the individual components:

σ²_total = Σ_{i=1..n_o} λ_i    (1)

The contribution of any component i towards the total variance can be expressed as λ_i / σ²_total. To calculate the number of significant components, we set a threshold τ on the amount of variance that the first N_s significant components must be able to explain:

(Σ_{i=1..N_s} λ_i) / (Σ_{i=1..n_o} λ_i) ≥ τ    (2)

An example of a typical curve of the cumulative sum of variance against the number of filters, obtained by PCA, is shown in Fig. 1 (a) (rightmost). As the PCA analysis produces the number of significant components, N_s, needed to explain a fraction τ of the total variance, we proceed to identify the significant layers. We define a significant layer as one which transforms the input data such that the number of significant components needed to explain a fraction τ of the variance increases over that required for the output of the previous layer. Let N_s^l be the number of significant components for the l-th layer. Then layer l contributes a relevant transformation on the input data if N_s^l > N_s^{l-1}, i.e., the layer requires more significant components to explain the variance of the data at its output than the previous layer. However, for better control in deciding the important layers, we check the condition N_s^l - N_s^{l-1} > δ to determine whether layer l is significant. This is illustrated in Fig. 1 (b) (middle), where the dots marked in red denote the significant layers, those with N_s^l - N_s^{l-1} > δ.
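A minimal sketch of this procedure, assuming the flattened activation matrices from the previous step and the N_s and δ notation above (the function names are ours, not the paper's):

    import numpy as np

    def num_significant_components(flat_acts, tau=0.99):
        """N_s: number of principal components explaining a fraction tau of
        the variance of a (num_samples, n_o) flattened activation matrix."""
        centered = flat_acts - flat_acts.mean(axis=0)
        cov = centered.T @ centered / (centered.shape[0] - 1)
        eigvals = np.linalg.eigvalsh(cov)[::-1]         # descending eigenvalues
        explained = np.cumsum(eigvals) / eigvals.sum()  # cumulative variance ratio
        return int(np.searchsorted(explained, tau) + 1)

    def significant_layers(ns_per_layer, delta=2):
        """Indices l with N_s[l] - N_s[l-1] > delta (the significance test)."""
        return [l for l in range(1, len(ns_per_layer))
                if ns_per_layer[l] - ns_per_layer[l - 1] > delta]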

3.2 Hybrid-Net Design

The PCA analysis helps us identify a set of important layers in an N-layer network. We design a hybrid network, ‘Hybrid-Net’, in which we set the bit-precision of the weights and inputs of the important layers to a higher value, k, than that of the other layers, which have binary weights and inputs. This is shown in Fig. 1 (b) (rightmost). The weights and inputs of the first and final layers of the N-layer network are kept full-precision, following standard practice [17, 21, 25]. The quantization algorithm for any k-bit quantized layer can be readily derived from XNOR-Net [17], where, for x ∈ [-1, 1], the quantized levels are:

q_k(x) = 2·round((2^k - 1)·(x + 1)/2) / (2^k - 1) - 1    (3)
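A small sketch of this uniform k-bit quantizer (as reconstructed in Eq. 3; the vectorized form is ours):

    import numpy as np

    def q_k(x, k):
        """Quantize values in [-1, 1] to 2^k uniformly spaced levels."""
        levels = 2 ** k - 1                  # number of gaps between levels
        x = np.clip(x, -1.0, 1.0)
        return 2.0 * np.round(levels * (x + 1.0) / 2.0) / levels - 1.0

    # k = 2 gives the four levels -1, -1/3, 1/3, 1:
    print(q_k(np.array([-0.8, 0.1, 0.7]), 2))  # -> [-1.  0.333  1.]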

In a layer with k-bit weights and activations, q_k is used in place of the sign function used in layers with binary weights and activations. We use a slightly modified version of the quantized networks proposed in [17], where the weights carry a scaling factor instead of just being quantized. The convolution operation between inputs I and weights W in such a network is approximated as:

Binary: I ∗ W ≈ (sign(I) ⊛ sign(W))·α    (4)
k-bit: I ∗ W ≈ (q_k(I) ⊛ q_k(W))·α    (5)

Here, α is derived from the L1-norm of W and acts as a scaling factor for the binary (or quantized) weights. In binary layers, the activation gradients are clipped so that they lie between -1 and 1. In the k-bit layers, we remove the activation gradient clipping for better representation. Each layer of an N-layer Hybrid-Net has either binary or k-bit weight kernels, and the activations after each convolution are quantized again before being passed to the next layer.
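The scaled binarization in Eq. 4 can be sketched as follows, using the XNOR-Net convention that α is the mean absolute value (the L1-norm divided by the number of elements) of a filter bank:

    import numpy as np

    def binarize_weights(W):
        """Approximate a filter bank W by alpha * sign(W), as in Eq. 4."""
        alpha = np.abs(W).mean()   # per-filter-bank scaling factor
        return alpha, np.sign(W)   # note: sign(0) = 0 in this simple sketch

    W = np.random.randn(64, 3, 3)         # hypothetical 3x3 filter bank
    alpha, W_b = binarize_weights(W)
    err = np.abs(W - alpha * W_b).mean()  # reconstruction error of the
    print(alpha, err)                     # scaled-binary approximation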

Hybrid-Net is expected to have a higher computation cost than a binary network. The parameter δ decides the number of important layers to consider, and hence the penalty incurred due to the increase in bit-precision. We can estimate the computation cost of a network with n_s significant layers as

C_hybrid = (N - n_s)·C_b + n_s·β·C_b    (6)

where C_b is the computation cost of a binary layer and β is the overhead of k-bit computation over binary computation. We present a detailed analysis of energy consumption and memory usage later in the manuscript.
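A worked example of Eq. 6 with hypothetical numbers (N, n_s, C_b and β below are illustrative, not values from the paper):

    # Cost estimate of Eq. 6: n_s of the N layers are promoted to k-bit,
    # each costing beta times a binary layer.
    N, n_s = 18, 4         # total layers; significant (promoted) layers
    C_b, beta = 1.0, 2.0   # binary layer cost (normalized); k-bit overhead
    C_hybrid = (N - n_s) * C_b + n_s * beta * C_b
    print(C_hybrid / (N * C_b))  # ~1.22x the cost of the fully binary network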

For residual networks (ResNets), we include another design feature in addition to the PCA-driven Hybrid-Net: improving input representations through residual connections. This has been alluded to by Liu et al. in [20], where adding identity shortcut connections at every layer improves the representational capability of binary networks. In standard residual networks [5], such identity connections are added to address the vanishing gradient problem in deep neural networks. In binary networks, however, these connections serve to provide an improved representation by carrying floating-point information from the previous layer. As a result, the Hybrid-Net design also considers the effect of adding such highway connections at every layer. Note that for convolution layers which induce a change in the size of the feature maps, the shortcut connections consist of 1×1 convolution weight layers to account for the change in size [5].

Figure 2: Different network configurations based on the ResNet-20/ResNet-32 and VGG-15 architectures, showing the binary, k-bit and full-precision layers in each network, as described in Section 4.2. The layer widths shown in this figure are for the CIFAR-100 dataset.

4 Experiments, Results and Discussion

4.1 Experiments

We evaluated the performance of all the networks described in Section 4.2 in PyTorch [30]. We perform image classification on the CIFAR-100 [31] and ImageNet [32] datasets. The CIFAR-100 dataset has 50,000 training images and 10,000 testing images of size 32×32 for 100 classes. For CIFAR-100, we explore the proposed Hybrid-Net design methodology on standard network architectures, ResNet-20, ResNet-32 and VGG-15, where the training algorithm for the quantized layers has been adopted from [17]. We extended our analysis to the ImageNet dataset [32], the most challenging dataset for image classification tasks; it consists of 1.2 million training images and 50,000 validation images divided into 1000 categories. For simplicity, we considered ResNet-18 for our ImageNet evaluation. We explore different network configurations for the ResNet architectures, shown in Fig. 2 (a), and the VGG architectures, shown in Fig. 2 (b), to compare against the proposed Hybrid-Net. Note that Hybrid-Comp A is formed by inter-layer sectioning, i.e., dividing the network into two contiguous parts, k higher-precision layers and the remaining binary layers, counted among the layers between the first and last layers. The widths of the network architectures shown in Fig. 2 (a) and Fig. 2 (b) are for the CIFAR-100 dataset. For ImageNet, we have used a wider network architecture, which we describe in Table 1.

ResNet-18
7×7 conv, 64, stride 2
3×3 maxpool, stride 2
3×3 conv, 64, stride 1 (×4)
3×3 conv, 128, stride 2
3×3 conv, 128, stride 1 (×3)
3×3 conv, 256, stride 2
3×3 conv, 256, stride 1 (×3)
3×3 conv, 512, stride 2
3×3 conv, 512, stride 1 (×3)
Linear, 1000
Table 1: Network architecture for the ImageNet classification task

4.1.1 Energy efficiency and Memory compression

We have briefly alluded to the possible penalty incurred by increasing the bit-precision of certain layers in a network. To identify its effect on whole-network metrics and further illustrate the benefits of the proposed Hybrid-Nets, we perform a storage and computation analysis to calculate the energy efficiency and memory compression of the proposed networks. For any two networks A and B, the energy efficiency (E.E.) and memory compression (M.C.) of network A with respect to network B can be defined as

E.E. = E_B / E_A,    M.C. = M_B / M_A    (7)

where E_A and E_B are the energies consumed by networks A and B, and M_A and M_B are the memories used for storing the weights of networks A and B, respectively. We estimate energy efficiency and memory compression with respect to a full-precision network and normalize them with respect to XNOR-Net, which is an entirely binary network except for the first and final layers. Thus, the normalized E.E. (η_E) and normalized M.C. (η_M) of any network A can be written as

η_E(A) = (Σ_i E_i^FP / Σ_i E_i^A) / (Σ_i E_i^FP / Σ_i E_i^XNOR),  with η_M(A) defined analogously using M in place of E    (8)

Here, E_i^FP (M_i^FP) is the energy (memory) consumed by the i-th layer of a network with full-precision weights and activations, whereas E_i^A (M_i^A) is the energy (memory) consumed by the i-th layer of the network A under consideration.
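A small sketch of the normalization in Eq. 8, with hypothetical layer-wise energies (the function and numbers are ours):

    # Normalized energy efficiency of Eq. 8: network A's layer-wise energies
    # are compared against full-precision (FP) and XNOR-Net baselines.
    def normalized_ee(e_fp_layers, e_a_layers, e_xnor_layers):
        ee_a = sum(e_fp_layers) / sum(e_a_layers)        # E.E. of A vs FP
        ee_xnor = sum(e_fp_layers) / sum(e_xnor_layers)  # E.E. of XNOR vs FP
        return ee_a / ee_xnor                            # XNOR-Net maps to 1.0

    # Hypothetical 3-layer energies (arbitrary units):
    print(normalized_ee([10, 10, 10], [1.2, 1.0, 1.5], [1.0, 1.0, 1.0]))  # ~0.81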

4.2 Results - PCA

We perform the PCA analysis on the activations of each convolutional layer and extract the number of principal components required to explain a fraction τ of the variance in the data. The design parameters τ and, subsequently, δ are chosen heuristically. For all analyses, we fix τ = 0.99, as this makes the increases in significant components across the various layers clearly distinguishable. The δ values are chosen based on the variation in N_s across layers; a higher δ yields a smaller number of significant layers. For clarity, we perform our analysis for various δ values.

4.2.1 ResNet Architectures - CIFAR-100

For the ResNet architectures, we perform the PCA on a plain version of the binary network devoid of any residual connections. We do this to isolate the effect of the convolution layers on the activations: since we focus on the quantization of the layers' filters, the residual additions may distort the output feature space and hence the information we seek from it. Fig. 3 (a) and (b) show the variation in the number of filters required to explain a fraction τ of the variance across layers for the ResNet-20 and ResNet-32 architectures, respectively. As expected, the maximum change in N_s occurs when the number of output channels increases. Moreover, we observe in both networks that the layers just after the number of output channels increases, say from 16 to 32 or from 32 to 64, account for the maximum change in the number of significant filters. Based on our criterion for significant layers, discussed in Section 3.1, we fix one value of δ for ResNet-20 and another for ResNet-32 to identify the layers where the number of significant components changes by more than δ. Fig. 3 (a) and (b) also show those layers, marked by red dots. Note that by varying δ, more or fewer layers can be considered significant. After performing this analysis on the plain version of the ResNet architecture, we perform network simulations on the standard version with residual connections.

Figure 3: PCA analysis showing the number of principal components required to explain 99% of the variance at the output of the convolutional layers, across layers, for (a) ResNet-32, (b) ResNet-20 and (c) VGG-15 on the CIFAR-100 dataset, and (d) ResNet-18 on the ImageNet dataset.
CIFAR-100
Network | Significant layers
ResNet-20 (smaller δ) | 8, 9, 10, 14, 15, 16, 18
ResNet-20 (larger δ) | 8, 9, 14, 15
ResNet-32 | 12, 13, 22, 23, 24
VGG-15 (smaller δ) | 3, 5, 8, 11, 12
VGG-15 (larger δ) | 3, 5, 8
ImageNet
Network | Significant layers
ResNet-18 (largest δ) | 6, 10, 14, 15
ResNet-18 (intermediate δ) | 6, 10, 11, 14, 15, 16
ResNet-18 (smallest δ) | 6, 7, 10, 11, 14, 15, 16
Table 2: Significant layers identified by the PCA analysis

4.2.2 VGG Architectures - CIFAR-100

For the VGG architecture, we perform the PCA on a binary network which has binary weights and activations for all layers except the first and last. Fig. 3 (c) shows how the number of filters required to explain a fraction τ of the variance changes across layers for the VGG-15 architecture. We observe that the number of significant filters mostly increases when the number of filter banks increases at a particular layer; for the rest of the layers, it remains fairly constant. As the PCA plot shows very little variation across layers, we consider a relatively low δ with respect to the number of filters. We mark the significant layers with red dots in Fig. 3 (c). Table 2 lists the combinations of significant layers obtained for the ResNet and VGG architectures through the PCA analysis for different δ values on the CIFAR-100 dataset. Note that we did not choose a lower δ for ResNet-32, as it would have included many layers, increasing the computation cost without a significant benefit in accuracy.

4.2.3 ResNet Architectures - ImageNet

We further perform the PCA analysis on the ResNet-18 architecture for the ImageNet dataset. Fig. 3 (d) shows how the number of filters required to explain a fraction τ of the variance changes across layers for ResNet-18. The significant layers identified by our proposed methodology are marked with red dots. We observe a trend similar to that for CIFAR-100: the maximum increases in the number of significant filters, N_s, occur in the first few layers after every change in the number of filters. We perform the PCA analysis for several δ values to identify the significant layers, listed in Table 2.

4.3 Image Classification Results - CIFAR-100

4.3.1 ResNet Architectures

The ResNet-N architecture consists of N-1 convolution layers and a fully-connected classifier. As discussed before, the first convolution layer and the classifier have full-precision inputs and weights. For the CIFAR-100 dataset, we consider N = 20 and N = 32. Further, we consider a slightly modified version of ResNet in which we add identity shortcut connections at every layer, instead of every two layers, for better input representation, as discussed earlier. We increase the bit-precision of the weights and inputs of the layers obtained from the PCA analysis to 2 and 4 bits to form Hybrid-Net (2,2) and Hybrid-Net (4,4), respectively, where the two indices denote the weight and activation precisions of the significant layers. The rest of the layers have binary representations for weights and inputs. We also compare the proposed Hybrid-Net with Hybrid-Comp A (2,2), which is formed by splitting the entire network into binary and 2-bit sections. Table 3 shows the accuracy, energy efficiency and memory compression of the proposed Hybrid-Net based on ResNet-20 and ResNet-32, in comparison to XNOR-Net and the other kinds of hybrid networks discussed in Fig. 2.

ResNet-20 (FP accuracy: 69.49%; XNOR-Net vs. full-precision: E.E. 16.35, M.C. 17.26)
Network Type | Accuracy (%) | E.E. | M.C.
XNOR | 51.55 | 1 | 1
Binary-Shortcut 1 | 53.91 | 0.99 | 1
Hybrid-Net (2,2) (smaller δ) | 62.18 | 0.87 | 0.77
Hybrid-Net (4,4) (smaller δ) | 62.66 | 0.7 | 0.53
Hybrid-Net (2,2) (larger δ) | 60.38 | 0.93 | 0.88
Hybrid-Net (4,4) (larger δ) | 61.37 | 0.82 | 0.7
Quantized (2,2) | 64.55 | 0.73 | 0.65
XNOR, 2× width | 65.11 | 0.39 | 0.33
Hybrid-Comp A (2,2) (k=6) | 61.49 | 0.88 | 0.71
Hybrid-Comp A (2,2) (k=12) | 62.17 | 0.8 | 0.67
ResNet-32 (FP accuracy: 70.62%; XNOR-Net vs. full-precision: E.E. 18.42, M.C. 20.44)
Network Type | Accuracy (%) | E.E. | M.C.
XNOR | 53.21 | 1 | 1
Binary-Shortcut 1 | 57.2 | 0.99 | 1
Hybrid-Net (2,2) | 63.87 | 0.94 | 0.87
Hybrid-Net (4,4) | 64.27 | 0.84 | 0.69
Quantized (2,2) | 67.46 | 0.7 | 0.61
XNOR, 2× width | 63 | 0.38 | 0.31
Hybrid-Comp A (2,2) (k=6) | 62.26 | 0.91 | 0.76
VGG-15 (FP accuracy: 68.31%; XNOR-Net vs. full-precision: E.E. 21.77, M.C. 26.24)
Network Type | Accuracy (%) | E.E. | M.C.
XNOR | 48.69 | 1 | 1
Hybrid-Net (2,2) (smaller δ) | 61.81 | 0.84 | 0.75
Hybrid-Net (4,4) (smaller δ) | 63.38 | 0.64 | 0.5
Hybrid-Net (2,2) (larger δ) | 59.55 | 0.93 | 0.92
Hybrid-Net (4,4) (larger δ) | 60.02 | 0.81 | 0.80
Quantized (2,2) | 67.65 | 0.65 | 0.55
XNOR, 2× width | 57.03 | 0.29 | 0.3
Hybrid-Comp A (2,2) (k=3) | 57.62 | 0.85 | 0.72
Table 3: Comparison of different networks on CIFAR-100 (E.E. and M.C. are normalized per Eq. 8, so XNOR-Net = 1)

We observe that the proposed Hybrid-Net achieves a much superior trade-off between accuracy, energy efficiency and memory compression compared to the other hybridization techniques. Moreover, for both ResNet-20 and ResNet-32, Hybrid-Net increases the classification accuracy by 10-11% compared to XNOR-Net, with minimal degradation in efficiency and compression. While quantizing the entire network to 2-bit inputs and weights (Quantized (2,2)) achieves a slightly higher accuracy, we show that our principle of increasing the bit-precision of a few significant layers captures most of the accuracy gain from an XNOR-Net to a 2-bit network. Hybrid-Net thus consumes less energy and less memory for ResNet-20 than a 2-bit network, with performance within about 2% of the latter. For ResNet-32, the benefits of Hybrid-Net are even more pronounced: it consumes less energy and less memory than a 2-bit network while achieving accuracy within 3-4% of the latter. Hybrid-Net thus ensures a significant improvement in accuracy over a binary network without making the entire network 2-bit. We also show that Hybrid-Net achieves higher accuracy than the Hybrid-Comp A networks while consuming less energy for both ResNet-20 and ResNet-32, demonstrating the effectiveness of the design methodology.

4.3.2 VGG architecture

We further extend our analysis to the VGG architecture. We considered VGG-15, which consists of 13 convolutional layers and 2 fully-connected layers, as shown in Fig. 2 (b). We kept one of the fully-connected layers binary to preserve energy efficiency. Table 3 lists the accuracy, energy efficiency and memory compression results for VGG-15 on CIFAR-100 for the different networks. We consider 2-bit and 4-bit precisions for the significant layers identified for each network configuration. We observe that Hybrid-Net achieves roughly 13% higher accuracy than XNOR-Net with minimal degradation in efficiency. When we make the inputs and weights of the entire network 2-bit (Quantized (2,2)), we achieve an even higher accuracy. We also show that Hybrid-Net with 2-bit layers achieves better performance than Hybrid-Comp A at iso-efficiency in energy and memory. Similar to the trends for ResNet, we observe that making the significant layers 4-bit while keeping the rest of the layers binary improves performance, however, at the cost of energy efficiency; an entirely 2-bit network proves to be a more efficient solution. In summary, even for the VGG architecture, we show that Hybrid-Net achieves significantly superior performance to an XNOR network while keeping most of the layers binary.

ResNet-18 (FP accuracy: 69.15%; XNOR-Net vs. full-precision: E.E. 17.00, M.C. 13.35)
Network Type | Accuracy (%) | E.E. | M.C.
XNOR | 50.33 | 1 | 1
Binary-Shortcut 1 | 54.05 | 1 | 1
Hybrid-Net (2,2) (largest δ) | 60.38 | 0.9 | 0.87
Hybrid-Net (2,2) (intermediate δ) | 61.95 | 0.84 | 0.8
Hybrid-Net (2,2) (smallest δ) | 62.73 | 0.82 | 0.8
Hybrid-Net (4,4) | 61.70 | 0.75 | 0.7
Quantized (2,2), XNOR k-bit | 64.51 | 0.7 | 0.71
Quantized (2,2), DoReFa | 62.6 | |
Quantized (2,2), PACT | 67 | |
Hybrid-Comp A (2,2) (k=4) | 59.47 | 0.86 | 0.77
Table 4: Comparison of different networks on ImageNet (E.E. and M.C. are normalized per Eq. 8, so XNOR-Net = 1)

4.4 Image Classification Results - ImageNet

We evaluate the proposed Hybrid-Net design on ImageNet in Table 4. We observe that the XNOR network suffers a significant degradation in accuracy compared to the full-precision network. Even the Binary-Shortcut 1 network, with residual connections at every layer, fails to recover the classification accuracy. Compared to these binary networks, the proposed Hybrid-Net, considering both 2-bit and 4-bit weights and activations, achieves up to 10% higher accuracy than the corresponding XNOR network, while achieving a normalized energy efficiency of 90% and a normalized memory compression of 87% relative to a fully binary network. Quantizing the activations and weights of the entire network to 2 bits can further increase the accuracy by 1-2%, but at the cost of a 15-20% increase in energy consumption over Hybrid-Net. Note that we have provided results for different input quantization algorithms, such as DoReFa-Net [21] and PACT [25], for the Quantized (2,2) network, although we have used the XNOR quantization (described in Section 3.2) in this work. This work shows that increasing the bit-precision of a few significant layers can remarkably boost the performance of binary neural networks without making the entire network higher-precision.

4.5 Discussion

The proposed Hybrid-Net design uses PCA-driven hybridization of extremely quantized neural networks, resulting in the significant improvements and observations discussed below. One key contribution of the proposed methodology is that we can design hybrid networks without any iterations: it does not require an iterative design-space exploration to identify optimal networks. Moreover, this methodology shows that increasing the bit-precision of only the significant layers of a binary network achieves performance close to that of a network entirely composed of layers with higher bit-precision weights and activations. Intuitively, a 2-bit network performs much better than a binary network; however, our analysis shows that it is not necessary to make the weights and activations of the entire network 2-bit. Hybrid-Net achieves more than 10% improvement over an XNOR network, which is a fully binary network except for the first and final layers, by increasing the bit-precision of fewer than half of the layers of the network. In fact, for deeper networks, like ResNet-34, this improvement is achieved with only a marginal increase in energy consumption over an XNOR-Net [17]. Thus, Hybrid-Net goes a long way toward reaching close to high-precision accuracies with networks which are mostly binary and attain energy efficiency and memory compression comparable to binary networks such as XNOR-Net [17] and BNN [16]. Moreover, this methodology can be extended to any network: making the significant layers m-bit while keeping the rest of the network n-bit (m > n) can potentially produce comparable performance with enhanced energy efficiency relative to an entirely m-bit network.

Secondly, the performance of Hybrid-Net is subject to the nature of the plots obtained from PCA on the binary version of the networks. For the ResNet architectures (Fig. 3 (a) and (b) for CIFAR-100 and Fig. 3 (d) for ImageNet), we observe that the number of significant components increases for the layers adjacent to those where the number of output channels increases, and then decreases for the later layers which have the same number of output channels. It can be said that these later layers do not add to the linear separability of the data, and binarizing them preserves the accuracy, as observed. In other words, the significant layers identified using our proposed methodology contribute remarkably more to the linear separability than the other layers. This is reflected in the results, where the performance difference between Hybrid-Net and a 2-bit network is small (within roughly 2-4%, Table 3). For the VGG architecture, by contrast, we observe that the PCA plot remains fairly flat, which means that the identified significant layers are not remarkably different in their contribution towards the linear separability of the data in comparison to the other layers. This is reflected in the larger performance difference (roughly 4-6%, Table 3) between Hybrid-Net and a 2-bit network for VGG-15.

Thirdly, we observe that increasing the bit-precision of the weights and activations of the significant layers to 4 bits while keeping the rest of the layers binary is not the most energy-efficient way of improving the accuracy of a network. An entirely 2-bit network proves to be more energy-efficient while also performing better. This may be because the loss due to binarization cannot be significantly recovered by raising the bit-precision of a few layers much higher than binary while keeping most of the layers binary. Thus, the proposed methodology performs best when the precision of the significant layers is close to the base precision of the network (binary in our case).

In this work, we have considered the quantization scheme explored in [17]. Since then, there has been a plethora of work focused on improving quantization for both inputs and weights [21, 25]. Hybrid-Net focuses on improving the performance of binary neural networks through mixed-precision network design, and we believe that improved quantization schemes should further increase the accuracy of both Hybrid-Nets and entirely 2-bit or 4-bit networks.

The enormous computing power and memory requirements of deep networks stand in the way of the ubiquitous use of AI for performing on-chip analytics in low-power edge devices. The significant energy efficiency offered by the proposed compressed hybrid networks increases the viability of using AI, powered by deep neural networks, in edge devices. With the proliferation of connected devices in the IOT environment, AI-enabled edge computing can reduce the communication overhead of cloud computing and augment the functionality of devices beyond primitive tasks such as sensing, transmission and reception, toward in-situ processing.

5 Conclusion

Binary neural networks offer significant energy efficiency and memory compression compared to full-precision networks. In this work, we propose a one-shot methodology for designing mixed-precision, hybrid networks with binary and higher bit-precision inputs and weights, to improve the classification accuracy of extremely quantized neural networks while still achieving significant energy efficiency and memory compression. The proposed methodology uses PCA to identify the significant layers of a binary network: those which transform the input data such that the output feature space requires more significant dimensions to explain the variance of the data. Whereas PCA is usually exploited to perform layer-wise dimensionality reduction, we use it in the opposite manner, to determine the layers at which the number of significant dimensions increases from input to output. We then increase the bit-precision of the weights and activations of the significant layers while keeping the other layers binary. The proposed Hybrid-Net achieves more than 10% accuracy improvement over XNOR networks for the ResNet and VGG architectures on CIFAR-100 and ImageNet with only a modest increase in energy consumption, thus preserving an order-of-magnitude reduction in energy consumption and memory requirements relative to full-precision networks. This memory compression, along with the close match to high-precision accuracies offered by the proposed mixed-precision network design using layer-wise information, allows us to explore interesting possibilities in the realm of hardware-software co-design. This work thus proposes an effective, one-shot methodology for designing hybrid, compressed neural networks and potentially paves the way toward using energy-efficient hybrid networks for AI-based on-chip analytics in low-power edge devices, with accuracy close to that of full-precision networks.

6 Methods

6.1 Energy Efficiency and Memory calculations

6.1.1 Energy Efficiency

The primary model-dependent metrics affecting the energy consumption of a classification task are the energies consumed by the computations (multiply-and-accumulate, or MAC, operations) and by the memory accesses; we exclude the energy consumed by data flow and instruction flow in the architecture. Consider a convolutional layer with n_i input channels and n_o output channels, input of size I×I, kernels of size k×k and output of size O×O. Table 5 lists the kinds of memory accesses and computations counted for standard full-precision (FP) networks:

Operation | Number of operations
Input read |
Weight read |
Computations (MAC) |
Memory write |
Table 5: Operations in neural networks

The numbers of binary memory accesses and computations in a binary layer are the same as the corresponding numbers in a full-precision layer of equivalent dimensions. As explained in Eq. 4, we consider additional full-precision memory accesses and computations for the parameter α, the scaling factor for each filter bank in a convolutional layer. The number of accesses for α is equal to the number of output maps, n_o, with corresponding full-precision computations to scale the output maps. Table 6 lists the kinds of k-bit and full-precision memory accesses and computations for any layer.

Operation | Number of operations
k-bit memory accesses |
k-bit computations (MAC) |
FP memory accesses |
FP computations |
Table 6: Number of operations in a k-bit layer

We calculated the energy consumption from projections for the 45 nm CMOS technology [33, 13]. Considering a 32-bit representation as full-precision, the energy consumption of memory accesses and computations is shown in Table 7.

Operation | Energy (pJ)
k-bit memory access | 2.5
32-b MULT FP | 3.7
32-b MULT INT | 3.1
32-b ADD FP | 0.9
32-b ADD INT | 0.1
k-bit MAC INT | (3.1·k)/32 + 0.1
k-bit MAC FP | 4.6
Table 7: Energy consumption chart

The energy consumed by any layer with k-bit weights and activations is then given by

E_layer = N_mem^k·E_mem^k + N_MAC^k·E_MAC^k + N_mem^FP·E_mem^FP + N_MAC^FP·E_MAC^FP    (9)

where the N terms are the operation counts of Table 6 and the E terms are the corresponding per-operation energies of Table 7.

Note that this calculation is a rather conservative estimate which does not take into account other hardware-architectural aspects such as input sharing or weight sharing. However, our approach concerns modifications of the network architecture, and we compare ratios of energy consumption; these aspects of the hardware architecture affect all the networks equally and hence can be left out of consideration. Further, FP MAC operations can be optimized for lower energy consumption; in our calculations, we have simply taken the FP MAC energy as the sum of a 32-b FP multiply and a 32-b FP add. Such optimizations are catered towards FP networks and reduce the FP energy consumption, which would in turn reduce the reported energy efficiency of the binary and hybrid networks. Since this work is focused on comparing different kinds of binary and hybrid networks, this assumption about the FP MAC energy does not affect the analysis.
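As an illustration of Eq. 9, the sketch below combines the per-operation energies of Table 7 with operation counts supplied by the caller; the counts used in the example convolution are our own rough, no-reuse accounting, since the exact entries of Tables 5 and 6 did not survive extraction:

    # Per-layer energy estimate of Eq. 9, using Table 7's per-operation
    # energies (pJ, 45 nm projections). Operation counts are passed in.
    def layer_energy(n_mac_k, n_mem_k, k_bits, n_mem_fp=0, n_mac_fp=0):
        e_mac_k = (3.1 * k_bits) / 32 + 0.1   # k-bit INT MAC (Table 7)
        e_mem_k = 2.5                         # k-bit memory access (Table 7)
        e_mem_fp, e_mac_fp = 2.5, 4.6         # 32-b access and FP MAC
        return (n_mac_k * e_mac_k + n_mem_k * e_mem_k
                + n_mem_fp * e_mem_fp + n_mac_fp * e_mac_fp)

    # Rough example: 3x3 conv, 64 -> 64 channels, 8x8 output, binary (k = 1);
    # alpha adds n_o FP accesses plus an FP multiply per output element.
    n_i, n_o, O, k = 64, 64, 8, 3
    macs = O * O * k * k * n_i * n_o        # one MAC per kernel tap
    mems = O * O * k * k * n_i * (n_o + 1)  # assumed: no operand reuse
    print(layer_energy(macs, mems, k_bits=1,
                       n_mem_fp=n_o, n_mac_fp=O * O * n_o) / 1e6, "uJ")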

6.1.2 Memory Compression

The memory required for any network is given by the total number of weights in the network multiplied by the precision of the weights. The number of weights in any convolutional layer, in the usual notation described earlier, is

N_W = n_i · n_o · k²    (10)

Thus, the total memory requirement can be written as M = Σ_l N_W^l · b_l, where b_l is the precision of the weights in layer l. We estimate memory compression (M.C.) with respect to a full-precision network and normalize it with respect to XNOR-Net, which is an entirely binary network except for the first and final layers.
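A worked example of Eq. 10 and the total-memory sum, for a hypothetical three-layer network with one 2-bit significant layer:

    # Total storage = sum over layers of (weight count) x (weight precision).
    layers = [              # (n_i, n_o, k, precision_bits)
        (3, 64, 3, 32),     # first layer kept full-precision
        (64, 64, 3, 1),     # binary layer
        (64, 64, 3, 2),     # significant layer promoted to 2-bit
    ]
    total_bits = sum(n_i * n_o * k * k * b for n_i, n_o, k, b in layers)
    print(total_bits / 8 / 1024, "KiB")  # -> 20.25 KiB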

Note that the assumptions in the energy and storage calculations for binary layers hold for custom hardware capable of handling fixed-point binary representations of data, thus leveraging the benefits offered by quantized networks.

Acknowledgement

This work was supported in part by the Center for Brain-inspired Computing Enabling Autonomous Intelligence (C-BRIC), one of six centers in JUMP, a Semiconductor Research Corporation (SRC) program sponsored by DARPA, in part by the National Science Foundation, in part by Intel, in part by the ONR-MURI program and in part by the Vannevar Bush Faculty Fellowship.

References