Cluster-Based Partitioning of Convolutional Neural Networks: A Solution for Computational Energy and Complexity Reduction

06/29/2020 · Ali Mirzaeian et al. · George Mason University, University of California-Davis, University of Maryland Baltimore County

In this paper, we propose a novel solution to reduce the computational complexity of convolutional neural network models used for many-class image classification. Our proposed model breaks the classification task into three stages: 1) general feature extraction, 2) mid-level clustering, and 3) hyper-class classification. Stages 1 and 2 can be repeated to build larger hierarchical models. We illustrate that our proposed classifier reaches the accuracy of the best-in-class classification models with far less computational complexity (FLOP count) by activating only the parts of the model that are needed to classify a given image.


1. Introduction

The research and development of Deep Neural Networks (DNNs), combined with the availability of parallel processing units for training and executing them, have significantly improved their applicability, performance, modeling capability, and accuracy. Many recent publications and products affirm that state-of-the-art DNN solutions achieve superior accuracy across a wide range of applications when compared to the same task performed or programmed by a human. This is especially true when DNN models are deployed to solve problems that either have no closed-form solution or are too complex for a programmable solution. The development, deployment, and use of DNNs is energized by the rapid development of massively parallel processing hardware (and its supporting software) such as Graphical Processing Units (GPUs) (Sanders and Kandrot, 2010), Tensor Processing Units (TPUs) (Abadi et al., 2016), Field Programmable Gate Arrays (FPGAs), Neural Processing Units (NPUs) (Chen et al., 2014; Du et al., 2015; Chen et al., 2016; Mirzaeian et al., 2020, 2020b; Faraji et al., 2019, 2020), and many-core solutions for parallel processing of these complex, yet parallelizable, models.

The ability to train and execute deeper models, in turn, has resulted in significant improvement in the modeling capability and accuracy of CNNs, a trend that can be tracked from early CNN solutions such as the 5-layer LeNet-5 (LeCun et al., 2015) for handwritten digit detection to the much deeper, more complex, and considerably more sophisticated 152-layer ResNet-152 (He et al., 2016) used for 1000-class image classification with an accuracy that significantly surpasses human capability. Generally, going deeper (or wider) in CNNs improves their accuracy at the expense of increased computational complexity. However, increasing a model's complexity reduces the range of hardware that can execute the model and increases the energy consumed per model invocation (Neshatpour et al., 2018; Neshatpour and et al., 2019). Hence, many researchers in the past few years have visited the problem of reducing the computational complexity of CNNs (Xiao et al., 2014; Neshatpour et al., 2018; Srivastava and Salakhutdinov, 2013; Deng et al., 2014; Hosseini et al., 2019) to widen their applications.

In this paper, we propose an efficient solution to reduce the computational complexity of CNNs used for many-class image classification. Our proposed model breaks the classification task into two stages: 1) clustering and 2) conditional classification. More precisely, we transform a difficult N-class classification problem into a K-group clustering task followed by an M-class classification task, with M much smaller than N. The K-group (a.k.a. hyper-class) clustering problem is solved by a convolutional encoder (the first stage of our proposed model) followed by a Fully Connected (FC) layer that clusters the input image into one of the K hyper-classes. In this model, each hyper-class is composed of a set of classes with shared features that are closely related to one another. The decision of which classes are grouped into the same cluster is made by applying the spectral clustering algorithm (Ng et al., 2002) on the similarity matrix obtained from the K-Nearest Neighbor (KNN) algorithm (Shi and Malik, 2000) applied to the latent spaces corresponding to the input samples. After validating the membership of an input image in a cluster, the output of the convolutional encoder is pushed to a small M-class classifier that is specifically tuned for the classification of that hyper-class. By knowing the hyper-class (cluster), the complexity of detecting the exact class is reduced, as we can train and use a smaller CNN when the classification space (the number of classes) is reduced. To generalize the solution, we formulate a systematic transformation flow for converting state-of-the-art CNNs (original models) into a 2-stage clustering-classification model with significantly reduced computational complexity and negligible impact on the classification accuracy of the overall classifier.

2. Related Works

Utilizing hierarchical structures in the training and inference phases of Convolutional Neural Networks to improve their classification accuracy has been studied previously (Xiao et al., 2014; Neshatpour et al., 2018; Mirzaeian et al., 2020a; Srivastava and Salakhutdinov, 2013; Deng et al., 2014; Liu et al., 2013). However, the focus of most of these studies was on improving the model's accuracy rather than addressing its complexity problem. Notably, some of these studies show that employing hierarchical structures can even degrade the model's efficiency. For example, in (Yan et al., 2015), the authors reported an increase in both memory footprint and classification delay (computational complexity) as noticeable side effects of deploying hierarchical classification to improve the model's accuracy. Similar to this group of studies, we explore the hierarchical staging of CNN models, but with a different design objective: we propose a systematic solution for converting a CNN model into a hierarchical 2-stage model that reduces the computational complexity and the model's memory footprint with negligible impact on its accuracy.

The problem of model complexity reduction has been visited by many scholars. A group of related previous studies addresses the problem of reducing average-case computational complexity by breaking CNN models into multiple stages and giving the option of an early exit using mid-model classifiers (Neshatpour et al., 2018; Panda et al., 2016; Teerapittayanon et al., 2016). For example, in (Neshatpour et al., 2018) the average computational complexity of the model (over many input samples) is reduced by breaking a large CNN model into a set of smaller CNNs (uCNNs) that are executed sequentially. In this model, each uCNN can classify and terminate the classification if an identified class has reached a desired (and user-defined) confidence threshold. Similarly, in (Panda et al., 2016), a Conditional Deep Learning Network (CDLN) is proposed in which FC layers are added to the intermediate layers to produce early classification results. The forward pass of the CDLN starts with the first layer and monitors the confidence to decide whether a sample can be classified early, skipping the computation in the subsequent layers. While the CDLN only uses FC layers at each exit point, BranchyNet (Teerapittayanon et al., 2016) proposes using additional CONV layers at each exit point (branch) to enhance performance. Unfortunately, this group of solutions suffers from two general problems: 1) although they reduce the average-case computational complexity, their worst-case complexity (when all uCNNs or additional FC and CONV layers are executed) is worse than that of non-branched (no early termination) solutions; 2) introducing additional Fully Connected (FC) layers makes them suffer from parameter-size explosion, as FC layers require far more parameters than CONV layers, worsening their memory footprint. Our proposed solution addresses the shortcomings of these models by making the execution time uniform across different input samples and keeping the FC-layer memory footprint in check, while reducing the complexity of the model.

3. Proposed Method

A CNN model is composed of several Convolution (CONV) layers and usually one or more Fully Connected (FC) layers for final classification. Each CONV layer extracts a set of features from its input feature map (ifmap) and generates a more discriminative output feature map (ofmap). The ofmap of each layer is the ifmap of the following layer. The CONV layers close to the image input become specialized in extracting generic (class-independent) features. But as we move deeper into the CNN, the CONV layers extract more abstract (higher-level representation) features of the input image from their ifmap. The CONV layers close to the output (softmax layer) become specialized in extracting the most abstract and class-specific features. This allows the last layer (i.e., the FC and softmax layer) to identify and assign a probability to each class based on the activation map of the neurons in the last CONV layer. In short, earlier CONV layers extract low-level features needed for the classification of all input images, while later CONV layers are specialized for extracting abstract features for the classification of specific classes.

Figure 1. Applying the proposed method to a DNN. Switches s0 to sN control which path is connected, and Hyper-class1 to Hyper-classN are the clusters, which are trained separately and attached together.

Motivated by this view of the CONV layers' functionality, we present a simple yet efficient and systematic solution for re-architecting state-of-the-art CNN models into a hierarchical CNN model such that any given input image activates only the parts of the model that are needed for its classification. Our proposed (target) model architecture, as illustrated in Fig. 1 (bottom), is composed of three main modules: (1) S: shared clustering layer(s), (2) M: mid cluster classifier(s) (a.k.a. clustifier), and (3) H: a set of hyper-class-specific micro CNN models. The S layer(s) are used to extract low-level features from an input image. The M layer is used to classify the input image into one of K clusters (hyper-classes). Based on the result of the clustifier M, the associated cluster-specific model H_k is activated to classify the image into one of its possible classes. Considering that the number of clusters K and the number of classes within each cluster are both far smaller than the total number of classes N, clustering and classification can each be performed by a much shallower (and smaller) CNN. Also note that clusters may have different sizes: we divide the N-class classification problem into K groups, the k-th group containing N_k classes such that the N_k sum to N, while still keeping both K and each N_k much smaller than N. Finally, note that, as illustrated in Fig. 1 (bottom), by using additional S and M layers, we can hierarchically break a large cluster into smaller clusters and use a dedicated H for each of the smaller clusters, while allowing many of the clusters to share a larger set of shared (S) CONV layers.
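To make this data flow concrete, the following is a minimal PyTorch-style sketch of a one-level CP-CNN, assuming hypothetical module names (CPCNN, shared, clustifier, hyper_classifiers) and the simplest activation policy (top-1 cluster only); it illustrates the structure described above rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class CPCNN(nn.Module):
    """Sketch: shared CONV stem (S), mid clustifier (M), and one small
    hyper-class classifier (H_k) per cluster."""
    def __init__(self, shared: nn.Module, clustifier: nn.Module,
                 hyper_classifiers: nn.ModuleList):
        super().__init__()
        self.shared = shared                        # S: layers borrowed from the original CNN
        self.clustifier = clustifier                # M: maps features to K cluster probabilities
        self.hyper_classifiers = hyper_classifiers  # H_0 .. H_{K-1}: cluster-specific micro CNNs

    def forward(self, x: torch.Tensor):
        # Assumes a batch of one image for clarity.
        feats = self.shared(x)                      # low-level, class-independent features
        cluster_probs = self.clustifier(feats)      # softmax over the K hyper-classes
        k = int(cluster_probs.argmax(dim=1))        # simplest policy: activate only the top cluster
        class_probs = self.hyper_classifiers[k](feats)
        return cluster_probs, class_probs
```

The confidence-thresholding policy described in Section 3.1 (Algorithm 1) would replace the single argmax above with a minimum activation set of hyper-class classifiers.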

3.1. Proposed Architecture

To build our proposed model, we (1) designed a mechanism to break and translate a state-of-the-art CNN into a trainable 3-stage clustifier-classifier model that preserves the model accuracy, and (2) developed an effective solution for clustering classes with shared features into the same cluster. Details of our systematic solutions for constructing the model and training it are discussed next.

Shared Classifier S: In our proposed solution, the shared layers S are directly borrowed from the original model. To construct the shared section, we sweep the number of shared layers included in S and investigate the tradeoff between the resulting clustifier-classifier model complexity and model accuracy. The model architect then determines the number of layers (of the original model) that should be partitioned into the shared section of the new model. To illustrate this tradeoff, a detailed case study on ResNet18 is shown in Table 1. The green blocks in the first section of the table are the CONV layers that are shared across different clusters. These shared layers are borrowed from the original ResNet18, and the values of their CONV parameters are fixed (frozen) during the clustifier-classifier model training.
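As a small illustration of the freezing step, assuming a model instance built from the sketch above, the borrowed parameters can be excluded from training as follows (a sketch, not the authors' training code):

```python
import torch

# `model` is assumed to be an instance of the CPCNN sketch shown earlier.
# Freeze the CONV layers borrowed from the original ResNet18 (the green, shared section),
# so only the clustifier and the hyper-class classifiers are updated during training.
for p in model.shared.parameters():
    p.requires_grad = False

optimizer = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad],  # optimize only the non-frozen parameters
    lr=0.01, momentum=0.9)
```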

Mid Clustifier M: The implementation of the clustifier M is more involved, as its performance significantly impacts the accuracy of the overall solution. For a given input X, if M activates an incorrect hyper-class classifier, the input is misclassified. To improve the accuracy of the proposed solution, we propose a confidence-thresholding mechanism in which the clustifier activates a minimum set of hyper-class classifiers such that the cumulative confidence of the hyper-classes selected by the clustifier is above a given threshold. We refer to this group of selected clusters as the minimum activation set.

To achieve this objective, we consider the cluster probabilities (confidence) suggested by the clustifier M along with the data in the confusion matrix (CM) of the clustifier to activate the related hyper-classes for each input sample X. The confidence of the clustifier is the probability suggested by the softmax layer of the clustifier for the input label. The confusion matrix of the clustifier is a two-dimensional table that contains the confusion score of each class with the other classes and is obtained by benchmarking the clustifier on a set (i.e., a test set) of labeled inputs. In this paper, CM[i, j] is the value of the (i, j) unit of the confusion matrix when label j is predicted. We also refer to the i-th highest-score class that is confused with the class of input X, as suggested by the confusion matrix, where i determines the ranking of the confused class in the matrix (i.e., i = 1 represents the class that is most often confused with the class of X).

To increase the likelihood of including the correct hyper-class classifier in the activation set, we first define a confidence threshold T_h (e.g., 90%) and a running confidence sum that is initially set to the highest cluster probability suggested by M. If the clustifier's confidence (suggested probability) for the selected hyper-class is below the confidence threshold, we refer to the confusion matrix of the clustifier and select the next hyper-class (i.e., i = 1 for the class most confused with the selected class). Then we find the confidence suggested by M for the newly selected hyper-class and add it to the running sum. This process is repeated until the cumulative confidence reaches T_h. The exit condition is expressed in Eq. 1.

$\sum_{c_i \in \mathcal{A}} P_M(c_i \mid X) \;\geq\; T_h$   (1)

where $\mathcal{A}$ denotes the activation set and $P_M(c_i \mid X)$ is the clustifier's softmax probability for hyper-class $c_i$.

At this point, the clustifier activates all hyper-class classifiers selected in the activation set that contributes to the cumulative confidence. This procedure is captured in Alg. 1.

procedure Activator(Clustifier M, Input X, Confidence threshold Th, Hyper-class pool HPool, Confusion matrix CM)
      V = M(X)                                       ▷ softmax probabilities over the hyper-classes
      index = argmax(V)
      if V[index] ≥ Th then
            activate HPool[index]
            return
      for (n = 2; n ≤ size of HPool; n = n + 1) do
            actives = argNmax(N=n, CM[:, index])     ▷ the n clusters most confused with the predicted one
            temp = Nmax(N=n, CM[:, index])
            cmVal = temp / sum(temp)                 ▷ normalized confusion scores used as weights
            if sum(V[actives]) ≥ Th then
                  activate (HPool[actives] * cmVal)
                  return
Algorithm 1 Hyper-class activation policy
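The Python sketch below mirrors our reading of Algorithm 1; the names cluster_probs, conf_matrix, and select_hyper_classes are hypothetical. Starting from the most confident cluster, it keeps adding the clusters most confused with it (per the confusion matrix) until the cumulative clustifier confidence passes the threshold.

```python
import numpy as np

def select_hyper_classes(cluster_probs: np.ndarray,
                         conf_matrix: np.ndarray,
                         threshold: float = 0.9) -> list:
    """Return the minimum activation set of hyper-class indices for one input.

    cluster_probs : softmax output of the clustifier M for this input, shape [K].
    conf_matrix   : K x K confusion matrix of M measured on a labeled test set.
    """
    top = int(np.argmax(cluster_probs))
    selected = [top]
    total = float(cluster_probs[top])
    # Clusters most often confused with the predicted cluster, in decreasing order.
    confused = np.argsort(conf_matrix[:, top])[::-1]
    for c in confused:
        if total >= threshold:
            break
        if c not in selected:
            selected.append(int(c))
            total += float(cluster_probs[c])
    return selected
```

For example, if the top cluster has probability 0.55 and the threshold is 0.9, the function keeps adding the clusters most confused with it until their summed probabilities reach 0.9 (or all clusters are selected).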

Fig. 2 shows an example of this algorithm in which three hyper-classes are activated. In this example, the clustifier has predicted a hyper-class label for the input sample X; however, its confidence does not pass the defined threshold T_h, so the two hyper-classes most confused with the predicted one, together with their probabilities suggested by the clustifier, are added to the activation set.

Figure 2. An example of the hyper-class selection policy. X is the sample input, V is the output of the clustifier's softmax layer, the shared-layer output is fed to each activated hyper-class, CM[i, j] is the value of the (i, j) unit of the confusion matrix when label j is predicted, and H_k is the dedicated part of the k-th hyper-class classifier.

The next challenge in training a clustifier is identifying which classes should be grouped together to improve the accuracy of the clustifier. We propose that grouping similar classes in a cluster is an efficient solution for achieving high clustering accuracy while keeping the computational and model complexity of the clustifier in check. Note that this approach improves the accuracy of the mid-clustifier at the expense of posing a harder task to the hyper-class classifier. Nevertheless, because the hyper-class classifier is a deeper network than the mid-clustifier, it should be more capable of discriminating between classes that are grouped in the same cluster for their high similarity. To achieve our objective of grouping similar classes in the same cluster, we employ the unnormalized spectral clustering introduced in (Ng et al., 2002; Shi and Malik, 2000). Note that the cluster sizes in this approach are not uniform, suggesting that the sizes of the hyper-class classifiers could also differ. Our implementation of spectral clustering is discussed next.

Given a set of N points, they can be clustered into k coarse classes following Algorithm 2. The first step of spectral clustering is to define a similarity matrix between the different classes. To obtain the similarity matrix, we first obtain the probability vector of each class on a (labeled) evaluation set; then we compute the average probability vector of each class across all input images available for that class in the evaluation set. We refer to this vector of average probabilities as the indicator vector, denoted by I_c for class c. The indicator vector is computed using Eq. 2.

$I_c = \frac{1}{|\{j : y_j = c\}|} \sum_{j : y_j = c} P(x_j)$   (2)

In this equation, $y_j$ is the ground-truth label for image $x_j$, and $P(x_j)$ is the vector of probabilities generated for image $x_j$. The next step is to apply the K-Nearest Neighbor (KNN) algorithm to the indicator vectors to build a similarity matrix. The connectivity parameter of the KNN algorithm (which indicates the number of nearest neighbors) is set to the smallest value (in the range [1, N]) that leads to a connected graph, because the spectral clustering algorithm performs best when the similarity matrix represents a connected graph. The similarity matrix is then fed to the unnormalized spectral clustering algorithm (described in (Ng et al., 2002) and (Shi and Malik, 2000)). Then, using the eigengap heuristic (described in (Von Luxburg, 2007)), the number of suitable coarse classes is selected. As described earlier, using our proposed solution, the number of classes in each cluster may differ. For example, after executing Algorithm 2, the obtained number of hyper-classes for the CIFAR100 dataset is 6, and the numbers of members in hyper-classes C0 to C5 are 9, 28, 23, 15, 14, and 11, respectively (see Table 1 in the results section).

procedure Extractor(I_1, I_2, …, I_N)
      1) Construct the similarity matrix A using K-Nearest Neighbors (KNN):
         A = KNN(I_1, I_2, …, I_N).
      2) Define the degree matrix D:
         D_ii = Σ_j A_ij.
      3) Construct the unnormalized Laplacian matrix L:
         L = D − A.
      4) Solve the generalized eigenproblem:
         L u = λ D u.
      5) Take the eigenvectors u_1, …, u_k corresponding to the lowest k eigenvalues.
      6) Construct the matrix Y as:
         Y = [u_1 u_2 … u_k] ∈ R^{N×k}.
      7) Apply K-means to each row of Y as a data point in R^k.
      8) Assign data point i to cluster j if and only if row i of Y is assigned to cluster j by K-means.
Algorithm 2 Cluster Membership Assignment
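A compact stand-in for this membership-assignment step, assuming a pretrained reference model's softmax outputs on a labeled evaluation set, is sketched below using scikit-learn's SpectralClustering with a KNN affinity. Note that scikit-learn applies a normalized Laplacian internally, unlike the unnormalized variant of Algorithm 2, so this is an approximation rather than the exact procedure above.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def indicator_vectors(probs: np.ndarray, labels: np.ndarray, n_classes: int) -> np.ndarray:
    """Eq. 2: average softmax probability vector of each class over the evaluation set."""
    return np.stack([probs[labels == c].mean(axis=0) for c in range(n_classes)])

def cluster_classes(probs: np.ndarray, labels: np.ndarray, n_classes: int,
                    n_clusters: int, n_neighbors: int) -> np.ndarray:
    """Group the original classes into hyper-classes via spectral clustering
    on a KNN similarity graph built from the indicator vectors."""
    ind = indicator_vectors(probs, labels, n_classes)
    sc = SpectralClustering(n_clusters=n_clusters,
                            affinity='nearest_neighbors',
                            n_neighbors=n_neighbors,
                            assign_labels='kmeans')
    return sc.fit_predict(ind)  # membership[c] = hyper-class index of original class c
```

The number of clusters (6 for CIFAR100 in the paper) and the KNN connectivity are inputs here; in the paper they are chosen with the eigengap heuristic and the smallest neighborhood that yields a connected graph, respectively.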

Section 1: model architecture and selection of architectural parameters for each CNN block (Block1: 7x7, 64, stride 2; Block2–Block6; Average Pooling, FC, Softmax) of the transformed ResNet18, for configurations L0 (original model), L1, L2, L22, L3, L33, L4, and L44.

Top-1 Accuracy

Cluster     #Members   L0     L1      L2     L22    L3     L33    L4     L44
C0          9          0.830  0.900   0.898  0.892  0.869  0.869  0.843  0.843*
C1          28         0.671  0.769*  0.726  0.705  0.687  0.666  0.660  0.601
C2          23         0.623  0.748*  0.716  0.691  0.708  0.689  0.673  0.667
C3          15         0.800  0.810*  0.791  0.755  0.750  0.707  0.753  0.650
C4          14         0.832  0.921   0.886  0.900  0.876  0.894  0.871  0.841*
C5          11         0.757  0.855   0.861  0.853  0.845  0.816  0.841  0.832*
Clustifier  6          0.869  0.953   0.876  0.957  0.941  0.919  0.931  0.923*

Computational Complexity Reduction of the Overall Model

             L0    L1      L2      L22     L3      L33     L4      L44
Reduction    0%    14.7%   28.7%   42.8%   60.5%   39.7%   53.8%   79.3%

Table 1. L1, L2, L3, L4, L22, L33, L44 are some of the compressed micro CNNs that can be generated by applying our proposed model compression flow to ResNet18. The green blocks in the first section of the table indicate CONV layers borrowed (unchanged) from the original ResNet18 and frozen during training. The blue blocks are the sections that are replaced with bottleneck or bottleneck-compression blocks and are trained. The middle section of the table presents the top-1 accuracy of different clusters when different compressed CNN configurations are used for classification, while the bottom section captures the associated flop count (computational complexity) reduction. Cells marked with * show one possible mapping of each cluster to a structure; accordingly, the selected configuration is {C0:L44, C1:L1, C2:L1, C3:L1, C4:L44, C5:L44, Clustifier:L44}, in which X:Y means cluster X is mapped to structure Y.

Hyper-Class Classifier H: The hyper-class classifiers are smaller (micro) hyper-class-specialized CNNs that are trained from scratch to specialize in classifying each cluster. Considering that the sizes of the clusters may differ, the sizes of the hyper-class classifiers may also vary. To design the hyper-class classifiers we need to solve two issues: 1) considering that more than one hyper-class classifier may be activated at a time, we need a solution to select or sort the classes suggested by the different classifiers; 2) we need a mechanism to transform the non-shared portion of the original CNN into these smaller and hyper-class-specific CNNs. Each of these is discussed next:

To solve the first problem, we propose sorting the weighted confidences of the classifiers' predictions and choosing the top k (i.e., top 1 or top 5) as the prediction of the overall model. To compute the weighted confidence, we use the cluster confidence scores obtained from the confusion matrix (which were used for activating the hyper-class classifiers) to scale the class probabilities, and then sort the weighted probabilities to determine the top-1 or top-5 classes. Eq. 3 illustrates how the class probabilities are weighted for the example given in Fig. 2.

$P(c \mid X) = \mathrm{cmVal}_j \cdot P_{H_j}(c \mid X), \quad c \in \text{cluster } j$   (3)

where $\mathrm{cmVal}_j$ is the normalized confusion-matrix score of activated hyper-class $j$ and $P_{H_j}(c \mid X)$ is the probability that hyper-class classifier $H_j$ assigns to class $c$.
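The following sketch illustrates this re-weighting with hypothetical data structures (hyper_outputs, cluster_weights, cluster_members): each activated hyper-class classifier's softmax output is scaled by its normalized confusion-matrix score, and the scaled scores are merged before taking the top-k.

```python
def weighted_topk(hyper_outputs: dict, cluster_weights: dict,
                  cluster_members: dict, k: int = 5) -> list:
    """hyper_outputs[j]   : softmax over the classes of activated hyper-class j
       cluster_weights[j] : normalized confusion-matrix score of hyper-class j (cmVal)
       cluster_members[j] : original class ids covered by hyper-class j"""
    scored = []
    for j, probs in hyper_outputs.items():
        for local_idx, class_id in enumerate(cluster_members[j]):
            # Weight each class probability by the confidence of its hyper-class.
            scored.append((cluster_weights[j] * probs[local_idx], class_id))
    scored.sort(reverse=True)  # highest weighted confidence first
    return [class_id for _, class_id in scored[:k]]
```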

The next problem is designing the micro CNNs that act as hyper-class classifiers. For this purpose, we propose a solution that automates the transformation of the non-shared layers of the original model into micro CNN models: we reduce the size of the non-shared CONV layers by replacing some of the CONV layers with a combination of two CONV-layer configurations, in which an entry of the form (k×k, c) represents a kernel of size k×k with c channels. The first block is known as a bottleneck block, and we refer to the second block as a bottleneck-compression block.

Our model compression flow is as follows: 1) Starting from the last CONV layer of the original model, we identify target blocks that could be replaced with bottleneck layers. Let us assume the ifmap to the first CONV layer of an identified block is of size W×H×C and the ofmap of the last CONV layer in the identified block is of size W'×H'×C', in which W and H are the width and height of each channel and C is the number of channels. The targeted block can be replaced by a bottleneck block if the spatial size is preserved (W' = W, H' = H) or halved (W' = W/2, H' = H/2). In the first case, the stride of the bottleneck block is set to 1, and in the second case, the stride is set to 2. In addition, for each targeted block, if the ifmap and ofmap shapes match, a skip connection (as described in (He et al., 2016)) is added. The compression can be pushed further by identifying two consecutive bottleneck blocks and replacing them with a bottleneck-compression block. This translation process is illustrated in Fig. 3. Depending on how many bottleneck or bottleneck-compression blocks are inserted, we can obtain a wide range of compressed CNNs.

Figure 3. Left: compression flow for a hypothetical target block. Right: replacing two consecutive bottleneck blocks with a bottleneck-compression block. In a configuration (X, Y, Z), X is the number of channels and Y and Z are the width and height of the feature map.
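To make the replacement rule concrete, here is a hedged PyTorch sketch of one possible bottleneck block: a 1x1 channel reduction, a 3x3 CONV that carries the stride (1 or 2, matching the two cases above), a 1x1 expansion, and a skip connection only when the ifmap and ofmap shapes match. The specific kernel and channel choices are illustrative assumptions, not the paper's exact bottleneck or bottleneck-compression configurations.

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Sketch of a bottleneck block used to replace a target block of the original model.
    stride=1 keeps the spatial size; stride=2 halves the width and height."""
    def __init__(self, c_in: int, c_out: int, stride: int = 1, reduce: int = 4):
        super().__init__()
        c_mid = c_out // reduce                       # assumed channel-reduction factor
        self.body = nn.Sequential(
            nn.Conv2d(c_in, c_mid, 1, bias=False), nn.BatchNorm2d(c_mid), nn.ReLU(inplace=True),
            nn.Conv2d(c_mid, c_mid, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(c_mid), nn.ReLU(inplace=True),
            nn.Conv2d(c_mid, c_out, 1, bias=False), nn.BatchNorm2d(c_out))
        # Identity skip connection (He et al., 2016) only when input and output shapes match.
        self.use_skip = (c_in == c_out and stride == 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.body(x)
        return torch.relu(y + x) if self.use_skip else torch.relu(y)
```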

4. Experimental Results

In this section, we evaluate the effectiveness of our model compression solution (in terms of accuracy and computational complexity reduction) when translating a complex model (e.g., ResNet18) into its CP-CNN counterpart. We further investigate the impact of changing the value of the confidence threshold on the model complexity and accuracy.

4.1. Evaluating the Model Compression Solution for Building Hyper-Class Classifiers

We first illustrate the effectiveness of our proposed compression process in terms of its impact on model complexity and accuracy. For this purpose, we apply our solution to compress ResNet18. We also use Algorithm 2 to divide the CIFAR100 dataset into different clusters. The algorithm suggests 6 clusters with 9, 28, 23, 15, 14, and 11 classes per hyper-class. These hyper-classes are respectively denoted C0, C1, C2, C3, C4, and C5.

The first section of Table 1 captures some of the possible configurations resulting from the application of bottleneck and bottleneck-compression blocks to ResNet18. As illustrated, the compression solution generates a wide range of compressed micro CNNs. The second section of the table captures the accuracy of the compressed network for each cluster and each compressed network configuration, while the third section captures the reduction in complexity for each compressed model (compared to the original). As illustrated, the compressed networks are still able to achieve very high accuracy with a significant reduction (up to 79%) in their computational complexity.

4.2. Evaluating the CP-CNN Accuracy and Complexity

In Section 4.1, only the accuracy of a model composed of the shared CONV layers (green blocks in Table 1) and the hyper-class-specific compressed layers (blue blocks in Table 1) was evaluated. However, the overall accuracy of the model is also impacted by the accuracy of the mid-clustifier and the combined accuracy of the selected hyper-class classifiers. To evaluate the overall accuracy of the CP-CNN, we selected the following configurations for building the hyper-class classifiers for each of the 6 clusters that we previously identified: {C0:L44, C1:L1, C2:L1, C3:L1, C4:L44, C5:L44, Clustifier:L44}. These configurations are highlighted with an asterisk (*) in Table 1. We report the accuracy and complexity results of the CP-CNN model evaluated on the 10,000 CIFAR100 images in our test set.

Table 2 captures the number of activated hyper-classes (HC) when the confidence threshold T_h is varied in the range 0.5 to 0.95. As illustrated in Table 2, increasing the value of T_h also increases the number of activated hyper-classes. This is expected, because, according to Eq. 1, meeting a larger T_h requires activating a larger number of hyper-class classifiers. Fig. 4 captures the change in accuracy and the increase in computational complexity (FLOP count) as T_h varies over that range. From this figure, it is clear that increasing T_h beyond 0.7 results in negligible (or even zero) gain in CP-CNN accuracy; however, it results in the activation of a larger number of hyper-class classifiers and an increase in computational complexity. This implies that for this particular scenario the best T_h is 0.7.

T_h       #1     #2     #3     #4    #5    #6   Acc. Gain   CC. Save
0.5       9379   619    2      0     0     0    -0.0638     0.3392
0.6       8545   1434   21     0     0     0    -0.0274     0.2996
0.7       7683   2166   150    1     0     0     0.0021     0.2575
0.8       6641   2789   546    24    0     0     0.0023     0.2000
0.9       5198   3294   1228   263   17    0     0.0023     0.1137
0.95      4044   3372   1783   684   115   2     0.0023     0.0368
CC-ideal  10000  0      0      0     0     0    -0.2365     0.3781

Table 2. Number of activated hyper-classes when T_h varies in the range 0.5 to 0.95. Column #n reports the number of test samples for which n hyper-classes were activated. CC-ideal shows the hypothetical ideal scenario in which only one hyper-class is activated for each input sample. Note that the sum of the activated hyper-class counts in each row is 10,000.

Fig. 4 (bottom) also captures the breakdown of the total computational complexity for different values of T_h as it varies in the range (0.5, 0.95). Considering that the evaluation set contains an equal number of images from each class, it is expected that clusters with a larger number of member classes contribute a larger FLOP count.

Figure 4. The relation between computational complexity (CC) and accuracy (Acc) for part of the explored design space. T_h indicates the confusion-sum threshold, and the CC metric is measured in FLOPs.

5. Acknowledgment

This research was supported by the National Science Foundation (NSF Award# 1718538), and in part by the Design Knowledge Company and Air Force Research Lab of the USA.

6. Conclusion

In this paper, we proposed CP-CNN, a novel hierarchical CNN model that reaches a level of accuracy in the range of state-of-the-art solutions with significantly lower computational complexity. The CP-CNN uses a first-stage CNN block (S) to extract class-independent features, utilizes a mid-level clustifier (M) to predict the membership of the input image in one or a few of the possible clusters, and then activates small, hyper-class-specific classifier(s) to classify the input image. We illustrated how an existing model, such as ResNet18, can be translated into a CP-CNN. We reported negligible loss in accuracy while observing up to 30% reduction in the overall computational complexity of the proposed model (depending on the selection of compression and model parameters) compared to the original ResNet18 model, when ResNet18 was translated into its CP-CNN counterpart.

References

  • Abadi et al. (2016) Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, et al. 2016. Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). 265–283.
  • Chen et al. (2014) Tianshi Chen, Zidong Du, et al. 2014. Diannao: A small-footprint high-throughput accelerator for ubiquitous machine-learning. In ACM Sigplan Notices, Vol. 49. ACM, 269–284.
  • Chen et al. (2016) Yu-Hsin Chen, Joel Emer, et al. 2016. Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks. In ACM SIGARCH Computer Architecture News, Vol. 44. IEEE Press, 367–379.
  • Deng et al. (2014) Jia Deng, Nan Ding, Yangqing Jia, Andrea Frome, Kevin Murphy, Samy Bengio, Yuan Li, Hartmut Neven, and Hartwig Adam. 2014. Large-scale object classification using label relation graphs. In European Conference on Computer Vision. Springer, 48–64.
  • Du et al. (2015) Zidong Du, Robert Fasthuber, Tianshi Chen, et al. 2015. ShiDianNao: Shifting vision processing closer to the sensor. In ACM SIGARCH Computer Architecture News, Vol. 43. ACM, 92–104.
  • Faraji et al. (2020) S. Rasoul. Faraji, Pierre Abillama, et al. 2020. HBUCNNA: Hybrid Binary-Unary Convolutional Neural Network Accelerator. In 2020 IEEE International Symposium on Circuits and Systems (ISCAS).
  • Faraji et al. (2019) S Rasoul Faraji, M Hassan Najafi, Li, et al. 2019. Energy-efficient convolutional neural networks with deterministic bit-stream processing. In 2019 Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 1757–1762.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
  • Hosseini et al. (2019) Morteza Hosseini, Mark Horton, et al. 2019. On the complexity reduction of dense layers from O (N 2) to O (NlogN) with cyclic sparsely connected layers. In 2019 56th ACM/IEEE Design Automation Conference (DAC). IEEE, 1–6.
  • LeCun et al. (2015) Yann LeCun et al. 2015. LeNet-5, convolutional neural networks. URL: http://yann.lecun.com/exdb/lenet 20 (2015).
  • Liu et al. (2013) Baoyuan Liu, Fereshteh Sadeghi, Marshall Tappen, Ohad Shamir, and Ce Liu. 2013. Probabilistic label trees for efficient large scale image classification. In Proceedings of the IEEE conference on computer vision and pattern recognition. 843–850.
  • Mirzaeian et al. (2020a) Ali Mirzaeian et al. 2020a. Learning Diverse Latent Representations for Improving the Resilience to Adversarial Attacks. arXiv preprint arXiv:2006.15127.
  • Mirzaeian et al. (2020b) Ali Mirzaeian et al. 2020b. Nesta: Hamming weight compression-based neural proc. engine. In Proceedings of the 25th Asia and South Pacific Design Automation Conference.
  • Mirzaeian et al. (2020) A. Mirzaeian et al. 2020. TCD-NPE: A Re-configurable and Efficient Neural Processing Engine, Powered by Novel Temporal-Carry-deferring MACs. In 2020 International Conference on ReConFigurable Computing and FPGAs (ReConFig).
  • Neshatpour et al. (2018) Katayoun Neshatpour, Farnaz Behnia, Houman Homayoun, and Avesta Sasan. 2018. ICNN: An iterative implementation of convolutional neural networks to enable energy and computational complexity aware dynamic approximation. In 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 551–556.
  • Neshatpour and et al. (2019) K. Neshatpour and et al. 2019. Exploiting Energy-Accuracy Trade-off through Contextual Awareness in Multi-Stage Convolutional Neural Networks. In 20th International Symposium on Quality Electronic Design (ISQED). IEEE, 265–270.
  • Ng et al. (2002) Andrew Y Ng, Michael I Jordan, and Yair Weiss. 2002. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems. 849–856.
  • Panda et al. (2016) Priyadarshini Panda, Abhronil Sengupta, and Kaushik Roy. 2016. Conditional deep learning for energy-efficient and enhanced pattern recognition. In Design, Automation & Test in Europe conf., 2016. IEEE, 475–480.
  • Sanders and Kandrot (2010) Jason Sanders and Edward Kandrot. 2010. CUDA by Example: An Introduction to General-Purpose GPU Programming, Portable Documents. Addison-Wesley Professional.
  • Shi and Malik (2000) Jianbo Shi and Jitendra Malik. 2000. Normalized cuts and image segmentation. Departmental Papers (CIS) (2000), 107.
  • Srivastava and Salakhutdinov (2013) Nitish Srivastava and Ruslan R Salakhutdinov. 2013. Discriminative transfer learning with tree-based priors. In Advances in Neural Information Processing Systems. 2094–2102.
  • Teerapittayanon et al. (2016) Surat Teerapittayanon, Bradley McDanel, and HT Kung. 2016. Branchynet: Fast inference via early exiting from deep neural networks. In Pattern Recognition (ICPR), 2016 23rd int. conf. on. IEEE, 2464–2469.
  • Von Luxburg (2007) Ulrike Von Luxburg. 2007. A tutorial on spectral clustering. Statistics and computing 17, 4 (2007), 395–416.
  • Xiao et al. (2014) Tianjun Xiao, Jiaxing Zhang, et al. 2014. Error-driven incremental learning in deep convolutional neural network for large-scale image classification. In Proceedings of the 22nd ACM international conference on Multimedia. ACM, 177–186.
  • Yan et al. (2015) Zhicheng Yan, Hao Zhang, Robinson Piramuthu, Vignesh Jagadeesh, Dennis DeCoste, Wei Di, and Yizhou Yu. 2015. HD-CNN: hierarchical deep convolutional neural networks for large scale visual recognition. In Proceedings of the IEEE international conference on computer vision. 2740–2748.