Recently, Convolutional Neural Networks (CNNs) have outperformed traditional machine learning models in many computer vision tasks. However, discovering high-performance CNN architectures has required extensive research. For large-scale image classification, state-of-the-art CNNs are moving beyond deep, single-chain-structured layouts (DBLP:journals/corr/SimonyanZ14a, ; 7780459, ).
All these networks are trained on datasets with many classes and are over-represented when they are used on smaller tasks with fewer classes (DBLP:journals/corr/SimonyanZ14a, ). This over-representation translates into large, inefficient models that spend too many weights on redundant features and too many computations on evaluating them. Instead of a single-chain-structured CNN, a hierarchically structured CNN would be more efficient, as it uses only the features needed to represent a specific subset of classes rather than the full set of features used to represent the whole class domain.
So, our objective is to design a network that handles a large number of classes and simultaneously inhibits the over-representation between the classes, with minimal manual intervention and design time. Though 2-level hierarchical strategies are exploited in (yanhd, ; DBLP:journals/corr/abs-1802-05800, ; Zhu2017BCNNBC, ), we found no methods for deploying a model that classifies a specific subset of classes without retraining the network. So, we adopt the filter sensitivity analysis in (DBLP:conf/cvpr/GuoP17, ) and form Impact score class vectors (Iscvs). Iscvs also help in automating the design of the network architecture.
In this paper, we propose an automated way to create computationally efficient and compressed Hierarchically Self Decomposed CNNs (HSD-CNN), based on existing pre-trained models. In the proposed algorithm, classes are organized hierarchically without manual intervention. The automated design flow of an HSD-CNN is a four-step process, detailed in Section 3. The part of an HSD-CNN that discriminates a specific set of classes, called a sub-network, is selected to meet our objective, and the corresponding sub-model can be deployed without retraining, as detailed in Section 4.5.1.
This paper has two main contributions:
Any state-of-the-art CNN can be automatically decomposed hierarchically into multiple levels using class-specific filter analysis (DBLP:conf/cvpr/GuoP17, ). This is the first time a CNN is decomposed and pruned simultaneously based on class-specific filters. The HSD-CNN algorithm allows the discriminating learned features to be organized hierarchically. These features are not limited to just two levels, coarse and fine, as in (yanhd, ; DBLP:journals/corr/abs-1802-05800, ; Zhu2017BCNNBC, ). HSD-CNN has more than two levels, across which classes can be grouped hierarchically.
A sub-network of HSD-CNN can be used to overcome over-represented CNN models. The parts of HSD-CNN corresponding to a specific class domain are adapted as efficient sub-networks. Sub-networks are deployed without retraining in any application scenario where only a subset of classes is used. Results show that sub-networks perform better than the full HSD-CNN in most cases.
2. Recent Literature
With the increasing complexity of architectures, the high computation and memory requirements of these models hinder their deployment on low-power embedded devices. So far, many researchers have focused on pruning individual parameters of powerful models, at the cost of performance degradation. Le Cun et al. (Cun:1990:OBD:109230.109298, ) analytically prune those parameters whose perturbation has the least effect. In (Hassibi:1992:SOD:645753.668069, ), second-order derivatives of the loss function are used to determine the parameters to be pruned. Han et al. (Han:2015:LBW:2969239.2969366, ) achieve impressive memory savings by removing weights with magnitudes smaller than a threshold. In (Anwar:2017:SPD:3051701.3005348, ; 7303876, ), filters and parameters are pruned at different levels of the model using statistical analysis of filters and their feature maps. Methods that prune individual parameters and filters have mostly compressed the model without speeding up inference.
As filters at each layer in CNNs are tensors or matrices (tensor slices), applying low-rank approximation methods (Denton:2014:ELS:2968826.2968968, ; DBLP:journals/corr/JaderbergVZ14, ; 8099688, ; Zhang:2016:AVD:3026801.3026837, ; DBLP:journals/corr/HeZS17, ; DBLP:conf/iccv/WenXWWCL17, ; 8354187, ) to decompose these filters into lightweight layers has been useful in increasing efficiency during inference while simultaneously reducing the number of parameters. The design methodology found in (Denton:2014:ELS:2968826.2968968, ; DBLP:journals/corr/JaderbergVZ14, ; 8099688, ; Zhang:2016:AVD:3026801.3026837, ; DBLP:journals/corr/HeZS17, ; DBLP:conf/iccv/WenXWWCL17, ; 8354187, ) led to the discovery of compact and efficient networks such as MobileNet (DBLP:journals/corr/HowardZCKWWAA17, ), SqueezeNet (DBLP:journals/corr/IandolaMAHDK16, ), and Grouped Convolutions (DBLP:journals/corr/ZhangZLS17, ; Huang_Shichen_VanderMaaten_Weinberger_2018, ).
Very few attempts have been made to exploit category hierarchies (yanhd, ; DBLP:journals/corr/abs-1802-05800, ; Zhu2017BCNNBC, ) in deep CNN models, and they are restricted to two levels. Hierarchical deep CNNs (HD-CNN) (yanhd, ) embed deep CNNs into a two-level category hierarchy. They separate easy classes using a coarse category classifier while distinguishing difficult classes using fine category classifiers. Tree-CNN (DBLP:journals/corr/abs-1802-05800, ) proposes a training method for incremental learning, although the network is limited to a 2-level hierarchy similar to that in (yanhd, ). A similar 2-level hierarchy, along with a branch training strategy, is introduced in Branch-CNN (Zhu2017BCNNBC, ). The branch training strategy balances the strictness of the prior with the freedom to adjust parameters on the output layers to minimize the loss.
Distributed representations in the hidden layers of deep feed-forward neural networks have excellent generalization abilities (DBLP:journals/nature/LeCunBH15, ), though these representations are difficult to analyze, because any particular feature activation depends on the effects of all other units in the same layer of its distributed representation.
Despite the limited understanding of the hidden representations that discriminate the classes (DBLP:journals/nature/LeCunBH15, ), the authors of (DBLP:journals/corr/abs-1711-09784, ) propose mimicking a neural net with a decision tree that makes soft decisions. This method yields a compact decision-tree representation with faster execution. However, they do not explore distilling CNNs into a decision tree, because CNNs hold a lot of information that cannot be represented in the form of an ordinary decision tree without loss. So, we require a method to represent CNNs as a decision tree without losing information.
The authors of (DBLP:journals/corr/abs-1802-00121, ) explain that the logic behind each prediction made by a pre-trained CNN can be quantitatively represented by a decision tree. They also explain that each filter of a layer might represent a specific object part or a group of object parts. However, the authors present no strategy that affects inference computation or model compression. So, pruning based on model interpretability is still a significant challenge in neural networks (NNs).
The authors of (DBLP:conf/cvpr/GuoP17, ) propose a filter sensitivity analysis method to decide the importance of each filter for a specific class. We adopt this method to interpret the trained model of a network that can handle a large number of classes. The proposed network then decomposes itself based on this interpretation, computed using class-specific filter sensitivity analysis. The decomposed network structure resembles a decision tree when viewed as a computational graph.
3. Hierarchically Self Decomposing CNN
Consider a classical image classification setting on a conventional CNN learned through a training dataset over a set of classes . The dataset is composed of samples of images and corresponding labels , where is a 3-dimensional input image and is associated with one of the class labels in .
The goal of the CNN in image classification is to learn a mapping function . Softmax function at the final layer of the network produces a posterior distribution over classes . Then the network minimizes a classification loss function over all samples of dataset to search optimal parameters required for the mapping function.
Surprisingly, any modification to this formulation affects the performance of the assumed image classification setting. For example, deep CNNs (DBLP:journals/corr/SimonyanZ14a, ), with many layers, perform better on large-scale image classification.
Yet, optimizing the loss function in Eq. 1 is troublesome due to the curse of dimensionality, i.e., the increase in complexity of the network and its parameters. Designing new networks for scenarios with a large number of classes is a time-consuming and tedious task. Using large networks for application domains that are restricted to a subset of their classes is overkill, as the network is over-represented in a chain-structured conventional CNN: the learned discriminative features may be biased towards a specific set of classes and may worsen the performance (such as accuracy and inference latency) on the remaining classes.
So, our objective is to design a network that simultaneously handles large classes and inhibits the over-representation between the classes with minimal manual interference and design time. For easier understanding of our objective, it is useful to represent CNN as a directed acyclic graph (DAG), , where is the set of nodes and edges in network graph , respectively. Visual representation of network graph G is in Figure 1.
Let us also represent as the parent node of node , for all nodes except root node in . Root node is the input image and leaf node is the probability score over classes. Also, let us refer to all nodes except the root node and leaf nodes as inner nodes, which represent activation maps.
We also represent the number of edges from the root node to any node as . It signifies that the node is present at layer in network . Also, network at layer may have more than one node with the same or different parent nodes. The layer definition is similar to the definition of depth in graphs.
Node is a 3-dimensional activation map. The output node is the convolution output formed between input activation map and layer filters in edge . Generally, each edge e has 3-dimensional filters which may vary with edges in .
Node is useful in discriminating classes - a subset of classes in dataset . For any inner node , . We also restrict overlapping class subsets, . So, , and .
In a conventional CNN , each node has one child , except for leaf nodes . Number of CONV layers in conventional network from input image to present layer activation map is counted as .
Similarly, in HSD-CNN , each inner node is restricted to at least one child node ( ) and at most two children nodes ( ).
Our algorithm proposes self-decomposition of conventional chain-structured CNN models into a tree-structured CNN layout (HSD-CNN). The proposed algorithm inherently categorizes all classes of into a hierarchical group of subsets in the following four steps:
3.2.1. Impact score class vector
Let us analyze the impact of a channel on a certain class at layer in the network. Doing so directly requires a large number of variables, because channels are inter-dependent on the features of their preceding and succeeding layers. Our proposed method adapts filter sensitivity analysis (DBLP:conf/cvpr/GuoP17, ) to calculate the impact of each channel in layers on each class present in the dataset as a score.
Let the trained conventional CNN be . Assign a weight for each channel at layer in the network , with no changes to other layers. Let the modified network be . Suppose a sample , with corresponding ground truth class , in the training data produces a feature map at layer in network . Its response at the corresponding layer of results in as the weight variable is included in .
Let , an element of leaf node vector in be the probability score corresponding to the class that the sample belongs to. It is calculated from the softmax layer output.
Then the impact score that the channel at layer node in has on class is defined as the ratio of the change in the probability score to the effective change in the weight variable at the corresponding channel.
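The ratio described above can be approximated numerically. The following is a minimal sketch, assuming a toy classifier whose logits are a weighted sum of channel activations; the names `class_probability`, `impact_score`, `class_weights`, and `delta` are hypothetical, and the finite difference stands in for whatever derivative the original method computes.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def class_probability(channel_acts, class_weights, channel_scale, target):
    """Probability of class `target` when channel j is multiplied by channel_scale[j].

    Toy model (assumption): logit k = sum_j scale[j] * act[j] * class_weights[k][j].
    """
    logits = [
        sum(s * a * w for s, a, w in zip(channel_scale, channel_acts, row))
        for row in class_weights
    ]
    return softmax(logits)[target]

def impact_score(channel_acts, class_weights, target, channel, delta=1e-4):
    """Change in the target-class probability per unit change in one channel weight."""
    base = [1.0] * len(channel_acts)          # all channel weights start at 1
    bumped = list(base)
    bumped[channel] += delta                  # perturb only the chosen channel
    p_base = class_probability(channel_acts, class_weights, base, target)
    p_bumped = class_probability(channel_acts, class_weights, bumped, target)
    return (p_bumped - p_base) / delta

acts = [1.0, 0.0]                     # channel 1 is inactive for this sample
weights = [[2.0, 1.0], [0.0, 1.0]]    # rows are classes, columns are channels
score_active = impact_score(acts, weights, target=0, channel=0)
score_inactive = impact_score(acts, weights, target=0, channel=1)
```

In this toy setting an inactive channel has zero impact on the class, while a channel that feeds the target class has a positive impact.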
To make the impact score robust, let us calculate the sum of absolute values of the impact scores produced for each sample whose class label is .
As there are channels to layer node , there are impact scores for each class . Let us represent all the scores at a particular layer for a class as a feature vector, namely the Impact score class vector (Iscv), . Lastly, we normalize the vector by dividing it by its maximum element , i.e., and is the index of the maximum element. This brings the vectors into a convenient range for comparison with other class vectors and mitigates the variation in minimum and maximum values across classes. The calculation of Iscv features is repeated for all classes at each layer in . These Iscvs help in the self-formation of our tree-structured CNN, .
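The aggregation and normalization just described can be sketched as follows. This is an illustrative sketch rather than the authors' implementation; the function name `impact_score_class_vector` and the input layout (one list of per-channel scores per sample) are assumptions.

```python
def impact_score_class_vector(per_sample_scores):
    """Build a normalized Iscv for one class at one layer.

    per_sample_scores: one list of per-channel impact scores for every
    training sample of the class (hypothetical input layout).
    """
    num_channels = len(per_sample_scores[0])
    summed = [0.0] * num_channels
    for scores in per_sample_scores:
        for channel, score in enumerate(scores):
            summed[channel] += abs(score)     # sum of absolute impact scores
    peak = max(summed)
    if peak == 0.0:                           # no channel affects this class
        return summed
    return [value / peak for value in summed] # divide by the maximum element

# Two samples of one class, three channels.
iscv = impact_score_class_vector([[0.5, -1.0, 0.0], [-0.5, 1.0, 0.25]])
```

After normalization the most influential channel always has score 1, so vectors from different classes can be compared on a common scale.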
3.2.2. Formation of HSD-CNN
Generally, a decision tree forms from a supervised algorithm that classifies data based on a hierarchy of rules learned over the training samples. Each internal node in the tree represents an attribute, while each leaf node represents a decision on the input sample. To build a tree, we start with a root node. Similarly, HSD-CNN is formed by self-decomposition of nodes from conventional CNN . starts with a root node, which is an input sample.
Let the layer for a node in be given by .
As in Algorithm 1, HSD-CNN tree graph is initialized with a root node, along with queue . For each node out of queue , select node from at layer along with its child node . As the is a single chain structured CNN, only one and one is available.
The Decompose Node module of the algorithm clusters the classes into either two subsets or one set, as discussed in detail below. For a given node and its Iscvs, Decompose Node results in either one or two nodes , which are attached as child nodes to with an empty edge in . Correspondingly, these nodes are inserted in the queue, and the steps are repeated until the queue is empty. If it results in only one child, then .
For clustering classes into subsets, select vectors of corresponding to classes in . Clustering is performed with Ward’s agglomerative clustering method (doi:10.1080/01621459.1963.10500845, ) in a bottom-up approach. In this approach, each class in initially forms its own cluster. At each step, find the pair of clusters whose merger leads to the minimum increase in total within-cluster variance, and merge that pair into a new cluster. This increase is a weighted squared distance between cluster centers, where the distance between clusters is the squared Euclidean distance between vectors. Merging continues up the hierarchy until we obtain two clusters. The final two clusters represent the two class subsets .
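A minimal pure-Python sketch of this bottom-up merging is given below, assuming each class is described by a single Iscv. The function name `ward_two_clusters` and the dictionary input are hypothetical; the merge cost used is the standard Ward criterion, the size-weighted squared Euclidean distance between cluster centroids.

```python
def ward_two_clusters(vectors):
    """Merge classes bottom-up with Ward's criterion until two clusters remain.

    vectors: dict mapping a class label to its Iscv (list of floats).
    Returns a list of clusters, each a sorted list of class labels.
    """
    # Every class starts as its own cluster: (members, centroid).
    clusters = [([label], list(vec)) for label, vec in vectors.items()]
    while len(clusters) > 2:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                (mi, ci), (mj, cj) = clusters[i], clusters[j]
                ni, nj = len(mi), len(mj)
                dist2 = sum((a - b) ** 2 for a, b in zip(ci, cj))
                # Increase in within-cluster variance if i and j are merged.
                cost = ni * nj / (ni + nj) * dist2
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        (mi, ci), (mj, cj) = clusters[i], clusters[j]
        ni, nj = len(mi), len(mj)
        centroid = [(ni * a + nj * b) / (ni + nj) for a, b in zip(ci, cj)]
        del clusters[j]                       # j > i, so index i stays valid
        clusters[i] = (mi + mj, centroid)
    return [sorted(members) for members, _ in clusters]

toy_iscvs = {
    "plane": [1.0, 0.0], "car": [0.9, 0.1],
    "cat": [0.0, 1.0], "dog": [0.1, 0.9],
}
subsets = ward_two_clusters(toy_iscvs)
```

For larger class counts, a library routine such as SciPy's hierarchical clustering with the Ward method would be the more practical choice; the loop above is only meant to make the merge rule explicit.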
When the above clustering results in either one or cardinal number, both nodes and are merged to form a single node. The number of elements in the set is the cardinal number of class subsets, here.
Though , a 3-dimensional map, has channels, we select only channels for its decomposed nodes , separately. Only those channels are selected that have high impact scores in for the classes to form node . The same is repeated for , though the selected channels may differ.
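The channel selection for one decomposed node can be sketched as a top-k ranking. The source does not state how scores are combined across the classes of a subset, so summing the Iscv entries over the subset is an assumption here, as are the names `select_channels` and `keep`.

```python
def select_channels(iscvs, class_subset, keep):
    """Pick the `keep` channels with the highest combined impact for a class subset.

    iscvs: dict mapping class label to its Iscv at the layer being decomposed.
    Combining classes by summation is an assumption made for illustration.
    """
    num_channels = len(next(iter(iscvs.values())))
    totals = [sum(iscvs[c][ch] for c in class_subset) for ch in range(num_channels)]
    ranked = sorted(range(num_channels), key=lambda ch: totals[ch], reverse=True)
    return sorted(ranked[:keep])              # channel indices in ascending order

layer_iscvs = {
    "plane": [1.0, 0.2, 0.1, 0.6],
    "car":   [0.9, 0.1, 0.3, 1.0],
    "cat":   [0.1, 1.0, 0.8, 0.2],
}
vehicle_channels = select_channels(layer_iscvs, ["plane", "car"], keep=2)
animal_channels = select_channels(layer_iscvs, ["cat"], keep=2)
```

The two child nodes end up with different channel sets, which is the behaviour the paragraph above describes.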
Yet, tree structured CNN formed from Algorithm 1 has un-weighted edges.
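The queue-driven construction of the tree can be sketched as below. This is an illustration under stated assumptions, not Algorithm 1 itself: `build_hsd_tree`, `split_in_half`, and `leaves` are hypothetical names, and `split_in_half` is only a stand-in for the Iscv-based Decompose Node step.

```python
from collections import deque

def build_hsd_tree(classes, num_layers, decompose_node):
    """Grow a tree breadth-first; decompose_node returns one or two class subsets."""
    root = {"layer": 0, "classes": list(classes), "children": []}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if node["layer"] == num_layers:       # final layer reached: leaf node
            continue
        for subset in decompose_node(node["classes"], node["layer"] + 1):
            child = {"layer": node["layer"] + 1, "classes": list(subset), "children": []}
            node["children"].append(child)    # edge carries no weights yet
            queue.append(child)
    return root

def split_in_half(classes, layer):
    """Stand-in for Decompose Node: never create a subset with fewer than two classes."""
    if len(classes) < 4:
        return [classes]                      # single child, no split
    mid = len(classes) // 2
    return [classes[:mid], classes[mid:]]

def leaves(node):
    """Collect leaf nodes in left-to-right order."""
    if not node["children"]:
        return [node]
    return [leaf for child in node["children"] for leaf in leaves(child)]

tree = build_hsd_tree(list(range(8)), num_layers=3, decompose_node=split_in_half)
leaf_nodes = leaves(tree)
```

Each inner node receives one or two children, and the class subsets at the leaves partition the original class set.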
3.2.3. Parameter decomposition - Transferring model
However, designing such networks increases the number of parameters and nodes. Such a network cannot be optimally trained with limited samples. We require a transfer-learning-based approach that gives the newly formed network tree a more suitable initialization.
As in Algorithm 2, for each node , filters are obtained, as depicted in Fig. 3.a, where . All the green channels of node and its corresponding filter channels in edge are omitted, as seen in Figure 3.b. Later, the orange input channel maps of and its corresponding filter parameters in edge are excluded, as in Figure 3.c. The truncated filters form an edge value , as in Figure 3.d. Parameter transferring is repeated for all edges in .
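The truncation of filters along both the output and input channel dimensions can be sketched with nested lists. The layout `filters[out][in]` and the name `transfer_filters` are assumptions for illustration; kernels are replaced by string labels so the selection is easy to follow.

```python
def transfer_filters(parent_filters, kept_out, kept_in):
    """Copy only the kernels that connect kept input channels to kept output channels.

    parent_filters[o][i] holds the kernel from input channel i to output channel o
    in the original chain-structured network (hypothetical layout).
    """
    return [[parent_filters[o][i] for i in kept_in] for o in kept_out]

# Toy layer: 4 output channels x 3 input channels, kernels shown as labels.
original = [[f"k{o}{i}" for i in range(3)] for o in range(4)]
child_edge = transfer_filters(original, kept_out=[0, 3], kept_in=[1, 2])
```

Dropping output channels removes whole filters, while dropping input channels removes the matching slice of every remaining filter.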
Though edges are transferred from , further fine-tuning is required as the node positions and the inputs to leaf nodes differ. So, the HSD-CNN network is further fine-tuned with the training dataset.
We implemented the decomposition of CNN and performed experiments using the PyTorch library (pytorch, ). Training is conducted on an NVIDIA GeForce 1080Ti based workstation. GPU speeds are measured on a GeForce 1080Ti GPU and an Intel(R) Xeon(R) E5-1660 v4 CPU. In training the original network and the decomposed network , we use the stochastic gradient descent (SGD) optimizer (DBLP:journals/corr/Ruder16, ). The learning rate starts at 0.01 and is divided by ten every 50 epochs during training.
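The step schedule described above can be written as a small function. This is a sketch of the stated rule only; the function name and parameter names are hypothetical, and the training code itself is not shown in the source.

```python
def learning_rate(epoch, initial_lr=0.01, drop_every=50, factor=0.1):
    """Learning rate after `epoch` epochs: divided by ten every 50 epochs."""
    return initial_lr * factor ** (epoch // drop_every)

schedule = [learning_rate(epoch) for epoch in (0, 49, 50, 100)]
```

In PyTorch the same rule corresponds to `torch.optim.lr_scheduler.StepLR` with `step_size=50` and `gamma=0.1`.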
We evaluate our proposed approach on different class sizes of datasets, namely CIFAR10, CIFAR100 and CALTECH101.
CIFAR (citeulike:7491128, ): The CIFAR dataset consists of natural images with a resolution of 32×32. CIFAR10 is drawn from 10 classes, whereas CIFAR100 consists of 100 classes. The train and test sets in both CIFAR10 and CIFAR100 contain 50,000 and 10,000 images respectively.
CALTECH101 (Fei-Fei:2007:LGV:1235884.1235969, ): This is another dataset used for validating HSD-CNN for a large number of classes with higher resolution of . It has images containing with classes and one background category. The dataset is split into training images, and testing images.
In the pre-processing step, color distortion is applied to each image. Because the available images are inconsistent in size, we resize all samples to a common size using bilinear interpolation. Further, whitening normalizes the intensity values of the samples.
|DataSet||Accuracy||Compression||SpeedUp||Saved computations||Leaf Nodes|
We evaluated the decomposition algorithm on the VGG16 architecture (DBLP:journals/corr/SimonyanZ14a, ). The VGG16 network mainly consists of 5 max-pooling layers and 13 convolutional (CONV) layers followed by three fully connected (FC) layers. All the CONV layers use 3×3 filters, with batch normalization followed by a ReLU non-linear unit. Though it is possible to calculate impact scores for FC layers, we restrict the calculation to CONV layers only, because we use either 1 FC layer or filtered CONV layer in combination with an adaptive global average pooling layer after the CNN feature maps.
Accuracy: Accuracy compares the top predicted class with the ground-truth class and counts the prediction as correct if both labels are the same.
Accuracy drop is the difference between originally trained model accuracy and the model performance accuracy obtained after hierarchical decomposition algorithm.
Let the number of parameters, operations, and running time for a sample in the original network model be , respectively, and assume a similar representation for the decomposed model. The compression rate, speedup, and saved computations reported in the results tables are computed from these quantities.
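The metric formulas are not recoverable from the text, so the sketch below uses the conventional definitions as an assumption: compression rate and speedup as original-to-decomposed ratios, and saved computations as the percentage of operations removed. The function names and the numbers in the example are hypothetical and are not results from the paper.

```python
def compression_rate(params_original, params_decomposed):
    """How many times fewer parameters the decomposed model has."""
    return params_original / params_decomposed

def speedup_rate(time_original, time_decomposed):
    """How many times faster the decomposed model runs on one sample."""
    return time_original / time_decomposed

def saved_computations(ops_original, ops_decomposed):
    """Percentage of operations removed by the decomposition."""
    return 100.0 * (1.0 - ops_decomposed / ops_original)

# Hypothetical model: 8M parameters and 400M operations before decomposition.
example_compression = compression_rate(8.0e6, 2.0e6)
example_savings = saved_computations(400.0e6, 100.0e6)
```

Under these definitions a compression rate above 1 and a positive savings percentage both indicate that the decomposed model is lighter than the original.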
4.4. Implementation and Experimental Results
We chose one conventional network - VGG16 to test our algorithmic approach and perform varied experiments to demonstrate the effectiveness of our proposed approach.
|Dataset||Parameters||Compression Rate||Computations||Saved Computations Ratio (%)||Accuracy (%)|
|CIFAR10||3.7M||3.98||89M||71.57||93.53(3), 97.3(3), 97.73(4)|
|CIFAR100||3.7M||3.97||89M||71.57||85.85(13), 84.9(10), 71.17(6), 85(11), 88(6), 78(6), 91.25(4), 94.75(4), 87.38(8), 72.81(16), 95.6(6), 92.6(3), 99.5(2), 98.5(2), 89.3(3)|
|CALTECH101||3.7M||3.97||4.36G||71.13||80.1(35), 89.23(9), 91.89(17), 94.44(3), 80.57(12), 100(2), 90.63(3), 96.67(4), 97.18(6), 92.74(11)|
4.4.1. Decomposing VGG16 using CIFAR and CALTECH datasets
Let us first consider the CIFAR10 dataset. We first train the VGG16 network on this dataset with the specifications detailed in Section 4.2. Impact score class vectors are calculated for each layer with respect to each of the 10 classes in CIFAR10, following Section 3.2.1. As there are 13 CONV layers in the chosen VGG network, we cluster only at CONV layers into two subsets for each parent node received from its predecessor CONV layer, respectively. We chose these layers as they are immediately followed by max-pool layers. Based on Algorithm 1 in Section 3.2.2, a new HSD-CNN graph is formed from the Iscvs of VGG16 on CIFAR10. At the final layer, we obtain 3 leaf nodes, as the number of classes in CIFAR10 is only 10. Later, the parameters obtained from the model trained on CIFAR10 are transferred to the newly formed HSD-CNN, following the strategy detailed in Section 3.2.3. Accuracy drop after fine-tuning is , an almost negligible degradation in performance.
Similar experiments are repeated for the other two datasets, with detailed results shown in Table 2. The proposed algorithm results in 15 and 10 leaf nodes at the final layers for CIFAR100 and CALTECH101, respectively, whereas CIFAR10 has only 3 leaf nodes. Those two datasets contain 100 or more classes, so it might have been difficult to discriminate the classes at the decomposition nodes. With fewer classes, the HSD-CNN algorithm yields fewer leaf nodes at the end layer for CIFAR10. We also limit the number of classes discriminated by any node to a minimum of two.
We also observe that there is only accuracy drop for CIFAR100 from its original accuracy. However, there is an improvement in accuracy for CALTECH101 from to . Though CALTECH101 has fewer training samples (~6K) than CIFAR100 (50K), CALTECH101 has 5 fewer leaf nodes. This may imply that forming a wider HSD-CNN network does not always improve the performance. There are, however, other possible reasons: the sizes of the input and activation maps differ ( - CIFAR100, - CALTECH101), and the information in the Iscv vectors might also affect performance.
Experimental results in Table 2 indicate that applying the HSD-CNN algorithm either increases the performance or causes negligible degradation in accuracy. We can further use the formed HSD-CNN to limit over-representation of networks and explore other applications, as detailed in Section 4.5.
Usually, the higher the compression rate, the better the compression algorithm. As we increase the number of decomposed nodes in each layer of HSD-CNN, as in Section 3.2.2, the number of parameters increases gradually. The added parameters and features might be redundant. To address this, we simultaneously prune half the channels of each node in the Decompose Node step of Section 3.2.2. This reduces the parameters of each node to about one-fourth, since both the input and output channels of its filters are halved, and thereby restricts the growth in parameters. As the HSD-CNN for CIFAR10 is relatively thinner than those for CIFAR100 and CALTECH101, there is times of compression. As there are more leaf nodes in large class domain problems, the compression rate in CIFAR100 is just above half of that of CALTECH101. The compression rate may increase further for sub-network applications, explained in Section 4.5.
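The one-fourth figure can be checked with a short calculation. The sketch below counts only convolution weights and ignores biases and batch-normalization parameters, which is a simplifying assumption; `conv_weights` is a hypothetical helper name.

```python
def conv_weights(in_channels, out_channels, kernel_size=3):
    """Number of weights in a CONV layer with square kernels (biases ignored)."""
    return in_channels * out_channels * kernel_size * kernel_size

# A 512-to-512 channel layer, then the same layer with half the channels
# kept on both its input (pruned parent node) and its output (pruned node).
full_layer = conv_weights(512, 512)
pruned_layer = conv_weights(256, 256)
ratio = pruned_layer / full_layer
```

Because a decomposed node may have two children, two pruned branches together still hold about half the weights of the original layer, which is why the parameter count stays bounded as the tree widens.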
We also observe computation savings of and in CIFAR10 and CALTECH101, respectively. However, we note no improvement for CIFAR100 ( ). Similar behaviour is also observed for the speedup rate metric. In summary, we show that our HSD-CNN results in comparable performance for CIFAR10 and CALTECH101. A better improvement may also be obtained for CIFAR100 if the number of leaf nodes formed is limited.
4.4.2. Comparison with state of the art
Algorithms in (yanhd, ; DBLP:journals/corr/abs-1802-05800, ; Zhu2017BCNNBC, ) form a 2-level hierarchy, and there is manual intervention in forming their bottom, finer-level architecture. In contrast, HSD-CNN requires no manual intervention: the network is self-formed, and a hierarchy with more than two levels is established among the object categories. Results from Table 4 also show better performance for our proposal on the CIFAR datasets. As the network design in (yanhd, ; DBLP:journals/corr/abs-1802-05800, ; Zhu2017BCNNBC, ) is not fixed, designing and training new networks is computationally expensive and time-consuming. However, our proposed HSD-CNN algorithm automatically designs a network from any standard network and simultaneously loads suitable pre-initialized parameters.
It is also found that the proposed HSD-CNN performs better than the method in (DBLP:conf/cvpr/GuoP17, ) in accuracy in almost all parameter cases.
First, it is easy to notice that all the classes in the dataset can inherently be represented hierarchically from the HSD-CNN structure layout. In a dataset with no category annotation, classes with similarities can be grouped in one category level, while dissimilar classes lie in different categories. At the same time, it is easy to visualize all the classes and their parent categories in a hierarchical representation. However, HSD-CNN focuses more on how much the calculated features affect the classes, not on the similarity. The hierarchical visualization of CIFAR10 from HSD-CNN is in Figure 4. Plane, Car, and Truck are in one category. As these categories are formed at earlier stages, the early-layer features select the classes for which they have a greater discriminating impact. According to the Iscvs, Horse is also included in the same category.
Second, our HSD-CNN forms more than one path. The structure layout makes it possible to compute the paths in parallel on different cores of CPUs or GPUs. In this way, resources can be efficiently utilized without extra allocation and overhead.
Third, as discussed earlier in Section 1, a conventional CNN produces over-represented features for discriminating classes. Our HSD-CNN proposal modifies the CNN structure so that it can be used for limited subsets of classes, with a feature representation appropriate for discriminating only the classes in the application. A detailed explanation is in Section 4.5.1.
Generally, CNNs are designed and trained for hundreds or thousands of classes. These designs might perform better in applications with large class domains. However, not all applications work with all the classes; many require only a subset. It is tedious to design and train a network again for each such application. Our algorithm helps in retaining the performance without designing and retraining a new CNN every time a different subset of the class domain is used. Experimental results from Table 3 indicate that sub-networks corresponding to leaf nodes result in better accuracy. Also, these sub-networks have nearly compression rate compared to their original network and save computations by . These results are obtained without retraining the sub-networks, and we achieve better performance in almost all cases.
For example, the HSD-CNN for the CIFAR100 dataset is trained for 100 classes. Choose any subset of 20 classes from these 100. Mark all the paths corresponding to these 20 classes in HSD-CNN, and form a sub-graph containing all these paths with all the corresponding edges included. The obtained sub-graph can be deployed directly without any further training. In this way, our HSD-CNN algorithm supports a suitable representation for limited classes, even if the network is originally trained for a large class domain. The sub-network formed has fewer parameters and computations, leading to an increased compression rate, speedup rate, and saved computations ratio.
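The path-marking step can be sketched on a small tree. The dictionaries `children` and `leaf_classes`, the node names, and the function `extract_subnetwork` are hypothetical; the sketch only selects which nodes to keep and does not touch any weights.

```python
def extract_subnetwork(children, leaf_classes, wanted, root="root"):
    """Return the nodes lying on a root-to-leaf path that serves a wanted class.

    children: dict mapping a node to the list of its child nodes.
    leaf_classes: dict mapping each leaf node to the classes it discriminates.
    """
    wanted = set(wanted)
    kept = set()

    def visit(node):
        kids = children.get(node, [])
        if not kids:                                   # leaf node
            useful = bool(wanted & set(leaf_classes.get(node, ())))
        else:
            # Build a list so every child is visited, not just the first useful one.
            useful = any([visit(kid) for kid in kids])
        if useful:
            kept.add(node)
        return useful

    visit(root)
    return kept

toy_children = {"root": ["a", "b"], "a": ["a1", "a2"], "b": ["b1"]}
toy_leaf_classes = {"a1": ["plane", "car"], "a2": ["truck"], "b1": ["cat", "dog"]}
vehicle_subnetwork = extract_subnetwork(toy_children, toy_leaf_classes, {"car", "truck"})
```

Branches that serve none of the requested classes are dropped entirely, which is where the reduction in parameters and computations comes from.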
CIFAR10: As there are subsets for 10 classes, we group all the combinations of classes with the same cardinal number as one category (subsets of classes on the X-axis). We omit the category with cardinal number 1, because calculating accuracy for a single class is not meaningful. We perform direct inference for all these sub-networks without retraining HSD-CNN and visualize the accuracy in Fig. 6. We also omit the category with cardinal number 10 on the X-axis and mark the HSD-CNN accuracy for the full class set as a horizontal dotted line for comparison. We observe that the sub-network accuracy for most combinations of classes is higher than the full class set HSD-CNN accuracy (dotted horizontal line). Where many combinations result in similar accuracy, the shape at the vertical line spreads wider (more points at similar accuracy). Similarly, the thin line in the figure corresponds to those combinations of classes that have extreme accuracy in their group. When the vertical shape is thin, very few combinations of classes result in the corresponding accuracy.
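The number of class combinations in each cardinality category can be counted directly. This is a small arithmetic sketch; the function name `subsets_by_cardinality` is hypothetical, and the default range mirrors the categories kept in the CIFAR10 experiment (cardinal numbers 2 to 9).

```python
from math import comb

def subsets_by_cardinality(num_classes, smallest=2, largest=None):
    """Count class combinations for every cardinal number in [smallest, largest]."""
    if largest is None:
        largest = num_classes - 1     # the full class set is reported separately
    return {k: comb(num_classes, k) for k in range(smallest, largest + 1)}

cifar10_counts = subsets_by_cardinality(10)
total_subnetworks = sum(cifar10_counts.values())
```

For 10 classes the count is small enough to evaluate every combination, whereas the same count for 100 classes grows far too quickly to enumerate, which motivates sampling a fixed number of combinations per cardinal number.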
We perform similar experiments for the CIFAR100 and CALTECH101 datasets. However, there are nearly sub-graphs possible for these subsets, and exhaustive experimentation would require huge memory and a long time. For convenience, we chose 100 combinations of two classes for cardinal number 2, and likewise 100 combinations for each of the other cardinal numbers. As there are nearly 100 categories on the X-axis, it is difficult to visualize all of them on a single graph. So, we show the cardinal numbers that are multiples of 10, along with cardinal number 2, for easier understanding. The results for the other cardinal numbers follow a similar pattern to that observed in Figures 6 and 7. Sub-network performance is better than the full-set HSD-CNN accuracy in most cases for both datasets.
We propose a novel strategy to self-decompose a conventional CNN into a hierarchical tree structure. We adapt class filter sensitivity analysis to calculate impact score class vectors. Iscvs help in decomposing nodes while forming the HSD-CNN architecture. We also strategically initialize the newly formed HSD-CNN with parameters transferred from its original trained model. Experimental results also show that our algorithm performs better than other hierarchical methods. We also discussed sub-network applications of HSD-CNN, which limit the over-representation found in conventional CNNs. However, an increase in direct compression rate and saved computations ratio might speed up the inference time. Suitable modifications can be made to our HSD-CNN proposal to improve the speedup rate and compression rate. As the structure of HSD-CNN resembles a tree, deciding the path along which an input sample is computed should enhance the speedup rate and energy savings, which we intend to explore in future.
-  Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 770–778. IEEE Computer Society, 2016.
-  Zhicheng Yan, Hao Zhang, Robinson Piramuthu, Vignesh Jagadeesh, Dennis DeCoste, Wei Di, and Yizhou Yu. HD-CNN: hierarchical deep convolutional neural networks for large scale visual recognition. In 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015 , pages 2740–2748.
-  Deboleena Roy, Priyadarshini Panda, and Kaushik Roy. Tree-cnn: A deep convolutional neural network for lifelong learning. CoRR, abs/1802.05800, 2018.
-  Xinqi Zhu and Michael Bain. B-CNN: branch convolutional neural network for hierarchical classification. CoRR, abs/1709.09890, 2017.
-  Jia Guo and Miodrag Potkonjak. Pruning convnets online for efficient specialist models. In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops, Honolulu, HI, USA, July 21-26, 2017, pages 430–437. IEEE Computer Society, 2017.
-  Yann Le Cun, John S. Denker, and Sara A. Solla. Optimal brain damage. In Advances in Neural Information Processing Systems 2, pages 598–605. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1990.
-  Babak Hassibi and David G. Stork. Second order derivatives for network pruning: Optimal brain surgeon. In Advances in Neural Information Processing Systems 5, [NIPS Conference], pages 164–171, San Francisco, CA, USA, 1993. Morgan Kaufmann Publishers Inc.
-  Song Han, Jeff Pool, John Tran, and William J. Dally. Learning both weights and connections for efficient neural networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS’15, pages 1135–1143, Cambridge, MA, USA, 2015. MIT Press.
-  Sajid Anwar, Kyuyeon Hwang, and Wonyong Sung. Structured pruning of deep convolutional neural networks. J. Emerg. Technol. Comput. Syst., 13(3):32:1–32:18, February 2017.
-  A. Polyak and L. Wolf. Channel-level acceleration of deep face representations. IEEE Access, 3:2163–2175, October 2015.
-  Emily Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 1, NIPS’14, pages 1269–1277, Cambridge, MA, USA, 2014. MIT Press.
-  Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Speeding up convolutional neural networks with low rank expansions. In Michel François Valstar, Andrew P. French, and Tony P. Pridmore, editors, British Machine Vision Conference, BMVC 2014, Nottingham, UK, September 1-5, 2014. BMVA Press, 2014.
-  X. Dong, J. Huang, Y. Yang, and S. Yan. More is less: A more complicated network with less inference complexity. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1895–1903, July 2017.
-  Xiangyu Zhang, Jianhua Zou, Kaiming He, and Jian Sun. Accelerating very deep convolutional networks for classification and detection. IEEE Trans. Pattern Anal. Mach. Intell., 38(10):1943–1955, October 2016.
-  Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating very deep neural networks. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 1398–1406.
-  Wei Wen, Cong Xu, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Coordinating filters for faster deep neural networks. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 658–666.
-  Q. Huang, K. Zhou, S. You, and U. Neumann. Learning to prune filters in convolutional neural networks. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 709–718, March 2018.
-  Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. CoRR, abs/1704.04861, 2017.
-  Forrest N. Iandola, Matthew W. Moskewicz, Khalid Ashraf, Song Han, William J. Dally, and Kurt Keutzer. Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <1mb model size. CoRR, abs/1602.07360, 2016.
-  Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. CoRR, abs/1707.01083, 2017.
-  Gao Huang, Shichen Liu, Laurens van der Maaten, and Kilian Weinberger. Condensenet: An efficient densenet using learned group convolutions. CVPR 2018, Apr 2018.
-  Yann LeCun, Yoshua Bengio, and Geoffrey E. Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
-  Nicholas Frosst and Geoffrey E. Hinton. Distilling a neural network into a soft decision tree. CoRR, abs/1711.09784, 2017.
-  Quanshi Zhang, Yu Yang, Ying Nian Wu, and Song-Chun Zhu. Interpreting cnns via decision trees. CoRR, abs/1802.00121, 2018.
-  Joe H. Ward Jr. Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58(301):236–244, 1963.
-  Pytorch: Tensors and dynamic neural networks in python with strong gpu acceleration. http://pytorch.org/.
-  Sebastian Ruder. An overview of gradient descent optimization algorithms. CoRR, abs/1609.04747, 2016.
-  Alex Krizhevsky. Learning Multiple Layers of Features from Tiny Images. Master’s thesis, 2009.
-  Li Fei-Fei, Rob Fergus, and Pietro Perona. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. Comput. Vis. Image Underst., 106(1):59–70, April 2007.
-  Yu Cheng, Felix X. Yu, Rogério Schmidt Feris, Sanjiv Kumar, Alok N. Choudhary, and Shih-Fu Chang. An exploration of parameter redundancy in deep networks with circulant projections. In 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, pages 2857–2865.
-  2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015. IEEE Computer Society, 2015.
-  IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017. IEEE Computer Society, 2017.