HSD-CNN: Hierarchically self decomposing CNN architecture using class specific filter sensitivity analysis

11/11/2018 ∙ by K. SaiRam, et al. ∙ ERNET India IIT Kharagpur

Conventional convolutional neural networks (CNNs) are trained on large-domain datasets and are hence typically over-represented and inefficient in limited-class applications. An efficient way to convert such large many-class pre-trained networks into small few-class networks is through a hierarchical decomposition of their feature maps. To this end, we propose an automated framework for such decomposition, Hierarchically Self Decomposing CNNs (HSD-CNN), in four steps. HSD-CNNs are derived automatically using a class-specific filter sensitivity analysis that quantifies the impact of specific features on a class prediction. The decomposed, hierarchical network can be deployed directly to obtain sub-networks for a subset of classes, and these sub-networks are shown to perform better without retraining. Experimental results show that HSD-CNNs generally do not degrade accuracy when the full set of classes is used. When operating on known subsets of classes, however, HSD-CNNs lead to increased accuracy with a much smaller model size and far fewer operations. The HSD-CNN flow is verified on the CIFAR10, CIFAR100 and CALTECH101 datasets. We report accuracies up to 85.6% (94.75%) on scenarios with 13 (4) classes of CIFAR100, using a VGG-16 network pretrained on the full dataset. In this case, the HSD-CNN used requires 3.97× fewer parameters and 3.56× fewer operations than the VGG-16 baseline containing features for all 100 classes.




1. Introduction

Recently, Convolutional Neural Networks (CNNs) have outperformed traditional machine learning models in many computer vision tasks. However, discovering high-performance CNN architectures has required extensive research. As far as the large-scale image classification task is concerned, state-of-the-art CNNs are going beyond deep, single-chain-structured layouts (DBLP:journals/corr/SimonyanZ14a, ; 7780459, ).

All these networks are trained on datasets with many classes and are over-represented when used on smaller tasks with fewer classes (DBLP:journals/corr/SimonyanZ14a, ). This over-representation translates into large and inefficient models that require too many weights to represent redundant features and too many computations to evaluate them. Instead of a single-chain-structured CNN, a hierarchically structured CNN would be more efficient, as it allows using only the features necessary to represent a specific subset of classes, rather than the full set of features used to represent the full class domain.

So, our objective is to design a network that handles a large number of classes and simultaneously inhibits over-representation between the classes, with minimal manual interference and design time. Though 2-level hierarchical strategies are exploited in (yanhd, ; DBLP:journals/corr/abs-1802-05800, ; Zhu2017BCNNBC, ), methods to deploy a model for classifying specific classes without retraining the network are not found. So, we adopt the filter sensitivity analysis in (DBLP:conf/cvpr/GuoP17, ) and form Impact score class vectors (Iscv). Iscvs also help in the automated design of the network architecture.

In this paper, we propose an automated way to create computationally efficient and compressed Hierarchically Self Decomposed CNNs (HSD-CNN), based on existing pre-trained models. In the proposed algorithm, classes are organized hierarchically without manual intervention. The automated design flow of an HSD-CNN is a four-step process, detailed in Section 3. The part of an HSD-CNN that discriminates a specific set of classes, named a sub-network, is selected to achieve our objective, and the corresponding sub-model can be deployed without retraining, as detailed in Section 4.5.1.

This paper has two main contributions:

  • Any state-of-the-art CNN can be automatically decomposed hierarchically into multiple levels using class-specific filter analysis (DBLP:conf/cvpr/GuoP17, ). This is the first time a CNN is decomposed and pruned simultaneously based on class-specific filters. The HSD-CNN algorithm allows the discriminating learned features to be organized hierarchically. These features are not limited to just two levels, coarse and fine, as in (yanhd, ; DBLP:journals/corr/abs-1802-05800, ; Zhu2017BCNNBC, ). An HSD-CNN has more than two levels, over which classes can be grouped hierarchically.

  • A sub-network of an HSD-CNN can be used to overcome over-represented CNN models. The parts of an HSD-CNN that correspond to a specific class domain are adapted as efficient sub-networks. Sub-networks are deployed without retraining in any application scenario where only a subset of classes is used. Results show that sub-networks perform better in most cases by a good margin.

2. Recent Literature

With the increase in architectural complexity, the high computation and memory requirements of these models hinder their deployment on low-power embedded devices. So far, many researchers have focused on pruning individual parameters of powerful models, at the cost of performance degradation. Le Cun et al. (Cun:1990:OBD:109230.109298, ) analytically prune those parameters that have the least effect when perturbed. In (Hassibi:1992:SOD:645753.668069, ), second-order derivatives of the loss function are used to determine the parameters that need to be pruned. Han et al. (Han:2015:LBW:2969239.2969366, ) achieve impressive memory savings by removing weights with magnitudes smaller than a threshold. In (Anwar:2017:SPD:3051701.3005348, ; 7303876, ), filters and parameters are pruned at different levels of the model using statistical analysis of filters and their feature maps. Methods that prune individual parameters and filters have mostly resulted in compressing the model, but not in speeding up inference.

As the filters at each layer of a CNN are tensors or matrices (tensor slices), applying low-rank approximation methods (Denton:2014:ELS:2968826.2968968, ; DBLP:journals/corr/JaderbergVZ14, ; 8099688, ; Zhang:2016:AVD:3026801.3026837, ; DBLP:journals/corr/HeZS17, ; DBLP:conf/iccv/WenXWWCL17, ; 8354187, ) to decompose these filters into lightweight layers has been useful in increasing efficiency during inference while simultaneously reducing the number of parameters. The design methodology found in these works led to the discovery of compact and efficient networks such as MobileNet (DBLP:journals/corr/HowardZCKWWAA17, ), SqueezeNet (DBLP:journals/corr/IandolaMAHDK16, ), and Grouped Convolutions (DBLP:journals/corr/ZhangZLS17, ; Huang_Shichen_VanderMaaten_Weinberger_2018, ).

Very few attempts have been made to exploit category hierarchies (yanhd, ; DBLP:journals/corr/abs-1802-05800, ; Zhu2017BCNNBC, ) in deep CNN models, and they are restricted to two levels. Hierarchical deep CNNs (HD-CNN) (yanhd, ) embed deep CNNs into a two-level category hierarchy. They separate easy classes using a coarse category classifier while distinguishing difficult classes using fine category classifiers. Tree-CNN (DBLP:journals/corr/abs-1802-05800, ) proposes a training method for incremental learning, although the network is limited to a 2-level hierarchy similar to that in (yanhd, ). A similar 2-level hierarchy, along with a branch training strategy, is introduced in Branch-CNN (Zhu2017BCNNBC, ). The branch training strategy balances the strictness of the prior with the freedom to adjust parameters on the output layers to minimize the loss.

Distributed representations in the hidden layers of deep feed-forward neural networks have excellent generalization abilities (DBLP:journals/nature/LeCunBH15, ), though these representations are difficult to analyze, because any particular feature activation depends on the effects of all other units in the same layer of its distributed representation.

Despite the limited understanding of the hidden representations that discriminate the classes (DBLP:journals/nature/LeCunBH15, ), the authors of (DBLP:journals/corr/abs-1711-09784, ) propose mimicking a neural net with a decision tree that makes soft decisions. This method yields a short representation as a decision tree with faster execution. However, they do not explore distilling CNNs into a decision tree, because CNNs hold a lot of information that cannot be represented in the form of a decision tree, and information will be lost if CNNs are represented as a normal decision tree. So, we require a method to represent CNNs as a decision tree without losing information.

The authors of (DBLP:journals/corr/abs-1802-00121, ) explain that the logic behind each prediction made by a pre-trained CNN can be quantitatively represented by a decision tree. They also explain that each filter of a layer might represent a specific object part or a group of object parts. However, the authors present no strategy to influence inference computation or model compression. So, pruning based on model interpretability is still a significant challenge in neural networks (NNs).

The authors of (DBLP:conf/cvpr/GuoP17, ) propose a filter sensitivity analysis method to decide the importance of a filter with respect to a specific class. We adopt this method to interpret the trained model of a network that can handle a large number of classes. The proposed network then decomposes itself based on this interpretation, computed using class-specific filter sensitivity analysis. The decomposed network structure resembles a decision tree when viewed as a computational graph.

3. Hierarchically Self Decomposing CNN

3.1. Preliminaries

Consider a classical image classification setting in which a conventional CNN is learned from a training dataset over a set of classes. The dataset is composed of image samples and their corresponding labels, where each image is a 3-dimensional input and each label is one of the class labels in the class set.

The goal of the CNN in image classification is to learn a mapping function from input images to class labels. The softmax function at the final layer of the network produces a posterior distribution over the classes. The network then minimizes a classification loss function over all samples of the dataset to search for the optimal parameters of the mapping function.
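As a sketch of the objective described above, assuming a cross-entropy classification loss: the symbols $f_\theta$ (network with parameters $\theta$), $x_i$, $y_i$ (the $i$-th image and label), $N$ (number of samples) and $C$ (class set) are illustrative notation introduced here, not necessarily the paper's own.

```latex
p(c \mid x_i; \theta) = \frac{\exp\big(f_\theta(x_i)_c\big)}{\sum_{c' \in C} \exp\big(f_\theta(x_i)_{c'}\big)},
\qquad
\theta^{*} = \arg\min_{\theta} \; -\frac{1}{N} \sum_{i=1}^{N} \log p(y_i \mid x_i; \theta)
```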


Any modification to the formulation in Equation [1] affects the performance of the assumed image classification setting. For example, deep CNNs (DBLP:journals/corr/SimonyanZ14a, ), with many layers, perform better on large-scale image classification.

Yet, optimizing the loss function in Eq. [1] is troublesome due to the curse of dimensionality, i.e., the increase in complexity of the network and its parameters. Designing new networks that cater to a large number of classes is a time-consuming and tedious task. Using large networks for application domains that are restricted to a subset of their classes is overkill, as the network is over-represented in a chain-structured conventional CNN: the learned discriminative features may be biased towards a specific set of classes and may worsen the performance (such as accuracy and inference speed or latency) on the remaining classes.

So, our objective is to design a network that simultaneously handles a large number of classes and inhibits over-representation between the classes, with minimal manual interference and design time. To make this objective easier to understand, it is useful to represent a CNN as a directed acyclic graph (DAG) G, defined by its set of nodes and its set of edges. A visual representation of the network graph G is in Figure 1.

Figure 1. A CNN as a DAG, where the nodes are activation maps and the edges are the filters associated with their layers. Inner nodes are marked in gray, the root node has a thick black border, and leaf nodes are in plain color. In the parent node representation, a node may have a left and a right child node, or a single child node.

Let us also define a parent node for every node in G except the root node. The root node is the input image and a leaf node is the probability score over the classes. Also, let us refer to all nodes except the root node and the leaf nodes as inner nodes, which represent activation maps.

We also define the layer of a node as the number of edges from the root node to that node. A layer of the network may have more than one node, with the same or different parent nodes. This layer definition is similar to the definition of depth in graphs.

Each inner node is a 3-dimensional activation map. An output node is the convolution output formed between its input activation map (the parent node) and the layer filters on the connecting edge. Generally, each edge e holds a set of 3-dimensional filters, whose number may vary from edge to edge.

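Each node is useful in discriminating a subset of the classes in the dataset, and the class subset of any inner node is drawn from the class subset of its parent node. We also restrict overlapping class subsets, so the subsets assigned to the child nodes of a node are disjoint.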

In a conventional CNN, each node has exactly one child, except for the leaf node. The number of CONV layers in a conventional network from the input image to the activation map of the present layer is counted as the layer of that node.

Similarly, in an HSD-CNN, each inner node is restricted to at least one child node and at most two child nodes.

3.2. Algorithm

Our algorithm proposes self-decomposition of conventional chain-structured CNN models into a tree-structured CNN layout (HSD-CNN). The proposed algorithm inherently categorizes the whole class set of the dataset into a hierarchical group of subsets in the following four steps:

3.2.1. Impact score class vector

Let us analyze the impact of a channel on a certain class at a given layer in the network. Examining this directly requires a large number of variables, as the channels are inter-dependent on the features of their predecessor and successor layers. Our proposed method adapts filter sensitivity analysis (DBLP:conf/cvpr/GuoP17, ) to calculate, as a score, the impact of each channel in each layer on each class present in the dataset.

Let a trained conventional CNN be given. Assign a weight variable to each channel at a given layer of this network, with no changes to other layers, and call the result the modified network. Suppose a sample in the training data, with its corresponding ground-truth class, produces a feature map at that layer in the original network. Its response at the corresponding layer of the modified network is that feature map with the weight variables applied.

Let the probability score of the sample be the element of the leaf node vector in the modified network that corresponds to the class the sample belongs to. It is calculated from the softmax layer output.

Then the impact score that a channel at a layer node has on a class is defined as the ratio of the change in the probability score to the effective change in the weight variable of the corresponding channel.

To make the impact score robust, we calculate the sum of the absolute values of the impact scores produced for each sample whose class label is the class in question.

As a layer node has a fixed number of channels, there is one impact score per channel for each class. Let us represent all the scores at a particular layer for a class as a feature vector, namely the Impact score class vector (Iscv). Lastly, we normalize the vector by dividing it by its maximum element. This brings the vectors into a convenient range for comparison with other class vectors and mitigates the variation in minimum and maximum values across classes. The calculation of Iscv features is repeated for all classes at each layer of the network. These Iscvs help in the self-formation of our tree-structured CNN.
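The Iscv computation described above can be sketched on a toy model. This is a minimal illustration, not the authors' implementation: the "network" here is a hypothetical stand-in (channel-wise scaling, one linear layer, softmax), the derivative is approximated by a finite difference with step `eps`, and all function names are assumptions made for this example.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def class_probability(channels, weights, classifier, label):
    """Toy stand-in for a network: scale channels, apply a linear layer, softmax."""
    scaled = [w * a for w, a in zip(weights, channels)]
    logits = [sum(row[k] * scaled[k] for k in range(len(scaled))) for row in classifier]
    return softmax(logits)[label]

def impact_score_class_vectors(samples, labels, classifier, num_classes, eps=1e-4):
    """Sum |d p_class / d w_channel| over the samples of each class, then
    normalize each class vector by its maximum element."""
    num_channels = len(samples[0])
    iscv = [[0.0] * num_channels for _ in range(num_classes)]
    ones = [1.0] * num_channels  # channel weights start at 1 (no change to the model)
    for channels, label in zip(samples, labels):
        base = class_probability(channels, ones, classifier, label)
        for k in range(num_channels):
            perturbed = ones[:]
            perturbed[k] += eps
            changed = class_probability(channels, perturbed, classifier, label)
            iscv[label][k] += abs((changed - base) / eps)
    for c in range(num_classes):
        peak = max(iscv[c])
        if peak > 0:
            iscv[c] = [s / peak for s in iscv[c]]
    return iscv
```

In a real CNN the finite difference would typically be replaced by a gradient obtained through automatic differentiation; the toy model is used here only so the sketch stays self-contained.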

3.2.2. Formation of HSD-CNN

Generally, a decision tree is formed by a supervised algorithm that classifies data based on a hierarchy of rules learned over the training samples. Each internal node in the tree represents an attribute, while each leaf node represents a decision on the input sample. To build a tree, we start with a root node. Similarly, an HSD-CNN is formed by self-decomposition of the nodes of a conventional CNN. It starts with a root node, which is an input sample.

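Result: Nodes and their empty edges in the HSD-CNN graph.
initialization: empty HSD-CNN graph;
add the root node (the input sample) to the HSD-CNN graph;
assign the full class set to the root node;
insert the root node into the queue;
while the queue is not empty do
      n ← queue.pop();
      select the node of the CNN at the layer of n;
      select the child node of that node at the next layer of the CNN;
      if there is no child node then
            continue;
      end if
      select the Iscvs of the classes assigned to n;
      cluster these classes with Decompose Node;
      obtain the left and right class subsets;
      if the left subset is not empty then
            queue.insert(left child node);
            assign the left subset to the left child node;
            select the channels of the left child node;
            attach the left child node to n with an empty edge;
      end if
      if the right subset is not empty then
            queue.insert(right child node);
            assign the right subset to the right child node;
            select the channels of the right child node;
            attach the right child node to n with an empty edge;
      end if
end while
Algorithm 1 Formation of the HSD-CNN layout, given the CNN and its Iscvs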

Let the layer of a node in the HSD-CNN be given by the number of edges from the root node to that node.

As in Algorithm 1, the HSD-CNN tree graph is initialized with a root node, along with a queue. For each node popped from the queue, select the node of the conventional CNN at the same layer, along with its child node. As the conventional CNN is a single-chain-structured CNN, only one such node and one child node are available.

The Decompose Node module of the algorithm clusters the classes into either two subsets or one set, and is discussed in detail below. For a given node and its Iscvs, Decompose Node results in either one or two nodes, which are attached as child nodes to the given node with an empty edge in the HSD-CNN graph. Correspondingly, these nodes are inserted into the queue, and the steps are repeated until the queue is empty. If it results in only one child, that child keeps the whole class subset of its parent.
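The queue-driven formation described above can be sketched as a breadth-first tree construction. This is an illustration under assumptions: `form_hsd_layout`, the node dictionary fields, and the `split_in_half` placeholder (which stands in for Decompose Node and ignores Iscvs entirely) are hypothetical names chosen for this example.

```python
from collections import deque

def form_hsd_layout(all_classes, num_layers, decompose):
    """Build the tree layout breadth-first; edges carry no filters yet."""
    nodes = [{"layer": 0, "parent": None, "classes": list(all_classes)}]
    queue = deque([0])
    while queue:
        parent_id = queue.popleft()
        parent = nodes[parent_id]
        next_layer = parent["layer"] + 1
        if next_layer > num_layers:
            continue  # the chain CNN has no child at this depth
        # decompose returns one class subset (no split) or two (left, right)
        for subset in decompose(parent["classes"], next_layer):
            nodes.append({"layer": next_layer, "parent": parent_id, "classes": subset})
            queue.append(len(nodes) - 1)
    return nodes

def split_in_half(classes, layer):
    """Hypothetical placeholder for Decompose Node: split at even layers only,
    and never create a subset with fewer than two classes."""
    if layer % 2 == 0 and len(classes) >= 4:
        mid = len(classes) // 2
        return [classes[:mid], classes[mid:]]
    return [classes]
```

Calling `form_hsd_layout(range(8), 4, split_in_half)` yields a tree whose four deepest nodes each hold two classes; a real Decompose Node would instead cluster the Iscvs of the classes at each node.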

Figure 2. Visualization of the HSD-CNN graph formed for a specific CNN. The pre-trained CNN in a) is transformed into the graph in b). Dotted circles signify the inclusion of pooling layers in the filters, and also the need for the Decompose Node module. Following Algorithm 1, a new HSD-CNN is formed. In the example shown, the HSD-CNN has three leaf nodes.

For clustering classes into subsets, select the Iscv vectors corresponding to the classes of the node. Clustering is performed with Ward’s agglomerative clustering method (doi:10.1080/01621459.1963.10500845, ) in a bottom-up approach. In this approach, each class initially forms its own cluster. At each step, find the pair of clusters whose merging leads to the minimum increase in total within-cluster variance, and merge that pair into a new cluster. This increase is based on a weighted squared distance between cluster centers. The cluster distances are defined as the squared Euclidean distance between the Iscv vectors. The merging process is continued up the hierarchy until we obtain two clusters. The final two clusters represent the two class subsets.
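A minimal sketch of this bottom-up procedure, assuming the standard form of Ward's merge cost, n_a·n_b/(n_a+n_b) times the squared Euclidean distance between cluster centers. The function name `ward_two_clusters` and the data layout (one vector per class, indexed by position) are assumptions made for this illustration.

```python
def ward_two_clusters(vectors):
    """Merge clusters bottom-up with Ward's criterion until two remain."""
    clusters = [{"members": [i], "center": list(v)} for i, v in enumerate(vectors)]

    def merge_cost(a, b):
        # Increase in within-cluster sum of squares if a and b were merged.
        na, nb = len(a["members"]), len(b["members"])
        sq_dist = sum((x - y) ** 2 for x, y in zip(a["center"], b["center"]))
        return na * nb / (na + nb) * sq_dist

    while len(clusters) > 2:
        i, j = min(
            ((p, q) for p in range(len(clusters)) for q in range(p + 1, len(clusters))),
            key=lambda pq: merge_cost(clusters[pq[0]], clusters[pq[1]]),
        )
        a, b = clusters[i], clusters[j]
        na, nb = len(a["members"]), len(b["members"])
        merged = {
            "members": a["members"] + b["members"],
            # Size-weighted mean of the two centers.
            "center": [(na * x + nb * y) / (na + nb) for x, y in zip(a["center"], b["center"])],
        }
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return [sorted(c["members"]) for c in clusters]
```

The naive pairwise search is cubic in the number of classes, which is acceptable for class counts on the order of a hundred; library implementations of hierarchical clustering would be the usual choice at larger scale.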

When the above clustering yields a subset containing only one class (so that the other subset contains all the remaining classes), both nodes are merged to form a single node. Here, the cardinal number of a class subset is the number of elements in the set.

Figure 3. Parameters from the CNN graph are sequentially transferred to the HSD-CNN. The transfer process for each edge is visualized here.

Though the parent activation map, a 3-dimensional map, has a given number of channels, we select only a subset of these channels for each of its decomposed nodes, separately. Only those channels that have high impact scores in the Iscvs for the classes of a decomposed node are selected to form that node. The same is repeated for the other decomposed node, though the selected channels may differ.
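One way to sketch this channel selection: rank the channels by their impact on the classes of a node and keep the top ones. Summing the Iscvs over the classes of the subset is an assumption made here for illustration (the text only states that channels with high impact scores are kept), and `select_channels` is a hypothetical name.

```python
def select_channels(iscvs, class_subset, keep):
    """Return the indices of the `keep` channels with the highest summed impact."""
    num_channels = len(iscvs[class_subset[0]])
    # Assumed aggregation: add up the per-class impact scores of each channel.
    totals = [sum(iscvs[c][k] for c in class_subset) for k in range(num_channels)]
    ranked = sorted(range(num_channels), key=lambda k: totals[k], reverse=True)
    return sorted(ranked[:keep])
```

For example, with `iscvs = {0: [1.0, 0.2, 0.1, 0.6], 1: [0.9, 0.1, 0.3, 1.0]}`, the call `select_channels(iscvs, [0, 1], 2)` keeps channels 0 and 3.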

Yet, the tree-structured CNN formed by Algorithm 1 has unweighted edges.

3.2.3. Parameter decomposition - Transferring model

However, designing such networks increases the number of parameters and nodes. Such a network cannot be optimally trained with limited samples. We require a transfer-learning-based approach in which the newly formed network tree is given a more suitable initialization.

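Result: Edges of the HSD-CNN with transferred filter parameters.
/* Omit root node */
for each node n in the HSD-CNN do
      select the node at the layer of n in the CNN;
      select the edge entering that node in the CNN;
      F ← filters on that edge; // K 3-d filters
      S_n ← set of channels selected for n;
      keep in F only the filters that produce the channels in S_n;
      S_p ← set of channels selected for the parent of n;
      keep in F only the input channels in S_p;
      assign the truncated F to the edge from the parent of n to n;
end for
Algorithm 2 Transfer edge filter parameters from the CNN to the HSD-CNN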

As in Algorithm 2, for each node, the filters of the corresponding edge are obtained, as depicted in Fig. 3.a. All the green channels of the node and their corresponding filters in the edge are omitted, as seen in Figure 3.b. Later, the orange input channel maps of the parent node and their corresponding filter parameters in the edge are excluded, as in Figure 3.c. The truncated filters form the edge value, as in Figure 3.d. Parameter transfer is repeated for all edges in the HSD-CNN.
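The truncation step can be sketched as index selection along the output and input channel dimensions of a filter bank. This is an illustrative sketch only: the filter bank is modeled as a nested list indexed as `filters[output_channel][input_channel]`, and `transfer_filters` is a hypothetical name.

```python
def transfer_filters(filters, out_channels, in_channels):
    """Keep only the selected output filters and, within each,
    only the slices for the selected input channels."""
    return [[filters[o][i] for i in in_channels] for o in out_channels]

# A toy bank of 4 filters over 4 input channels; each entry stands for one 2-D kernel.
bank = [[f"w{o}{i}" for i in range(4)] for o in range(4)]
truncated = transfer_filters(bank, out_channels=[0, 3], in_channels=[1, 2])
```

Here `out_channels` corresponds to the channels selected for the child node and `in_channels` to those selected for its parent, so the truncated bank has 2 filters of 2 input channels each.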

3.2.4. Training

Though the edges are transferred from the conventional CNN, further fine-tuning is required, as the positions of the nodes and the inputs to the leaf nodes differ. So, the HSD-CNN is further fine-tuned with the training dataset.

4. Evaluation

We implemented the decomposition of the CNN and performed experiments using the PyTorch library (pytorch, ). Training is conducted on an NVIDIA GeForce 1080Ti based workstation. Speeds are measured on a GeForce 1080Ti GPU and an Intel(R) Xeon(R) E5-1660 v4 CPU. In training the original network and the decomposed network, we use the stochastic gradient descent (SGD) optimizer (DBLP:journals/corr/Ruder16, ). The learning rate is reduced by a factor of ten every 50 epochs during training, starting from an initial learning rate of 0.01.
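The step schedule described above (initial rate 0.01, divided by ten every 50 epochs) can be written as a small function. The function and parameter names are illustrative.

```python
def learning_rate(epoch, initial=0.01, step=50, factor=0.1):
    """Step decay: multiply the initial rate by `factor` once per `step` epochs."""
    return initial * factor ** (epoch // step)
```

In PyTorch, `torch.optim.lr_scheduler.StepLR` with `step_size=50` and `gamma=0.1` provides the same kind of schedule; whether the authors used that class is not stated.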

4.1. Datasets

We evaluate our proposed approach on datasets with different numbers of classes, namely CIFAR10, CIFAR100 and CALTECH101.

CIFAR (citeulike:7491128, ): The CIFAR dataset consists of natural images with a resolution of 32×32. CIFAR10 is drawn from 10 classes, whereas CIFAR100 consists of 100 classes. The train and test sets in both CIFAR10 and CIFAR100 contain 50,000 and 10,000 images, respectively.

CALTECH101 (Fei-Fei:2007:LGV:1235884.1235969, ): This is another dataset used for validating HSD-CNN on a large number of classes at a higher resolution. It contains images from 101 classes and one background category. The dataset is split into training images and testing images.

In the pre-processing step, color distortion is applied to each image. As the available images are not consistent in size, we resize all samples to a common size using bilinear interpolation. A further whitening step normalizes the intensity values of the samples.

Dataset | Accuracy (%) | Parameters | Computations | Time (ms)
CIFAR10 | 93.41 | 14.7M | 313M | 0.246
CIFAR100 | 72.07 | 14.7M | 313M | 0.242
CALTECH101 | 76.39 | 14.7M | 15.3G | 14.15
Table 1. Standard baseline statistics for VGG16 on CIFAR10, CIFAR100 and CALTECH101 datasets.

Dataset | Accuracy drop | Compression rate | Speedup rate | Saved computations ratio (%) | Leaf nodes
CIFAR10 | 0.08 | 1.34 | 1.51 | 32.27 | 3
CIFAR100 | 0.85 | 0.37 | 0.97 | -39.94 | 15
CALTECH101 | -2.56 | 0.56 | 1.82 | 1.31 | 10
Table 2. Performance of HSD-CNN algorithm on VGG16 for CIFAR and CALTECH datasets.

4.2. Networks

We applied the decomposition algorithm to the VGG16 architecture (DBLP:journals/corr/SimonyanZ14a, ). The VGG16 network mainly consists of 13 convolutional (CONV) layers and 5 max-pooling layers, followed by three fully connected (FC) layers. All the CONV layers use 3×3 filters and are followed by batch normalization and a ReLU non-linear unit. Though it is possible to calculate impact scores for FC layers, we restrict the calculation to CONV layers only, because we use either one FC layer or a CONV layer in combination with an adaptive global average pooling layer after the CNN feature maps.

4.3. Metrics

Accuracy: Accuracy compares the top predicted class with the ground-truth class and counts a prediction as correct if both labels are the same.

Accuracy drop is the difference between the accuracy of the originally trained model and the accuracy obtained after the hierarchical decomposition algorithm.

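Assume that the number of parameters, the number of operations, and the running time for a sample are known for the original network model, and likewise for the decomposed model. The following metrics are then defined: the compression rate is the ratio of the number of parameters in the original model to that in the decomposed model; the speedup rate is the ratio of the running time of the original model to that of the decomposed model; and the saved computations ratio is the percentage reduction in the number of operations relative to the original model.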
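These metrics can be sketched as simple ratios. The compression rate and saved computations ratio below reproduce the values reported in Tables 1 to 3 (for example 14.7M / 10.98M ≈ 1.34 and (313M − 212M) / 313M ≈ 32.27%); the speedup formula is assumed by analogy, since the decomposed running times are not listed. Function names are illustrative.

```python
def compression_rate(params_original, params_decomposed):
    """How many times fewer parameters the decomposed model has."""
    return params_original / params_decomposed

def speedup_rate(time_original, time_decomposed):
    """Assumed by analogy: ratio of running times, original over decomposed."""
    return time_original / time_decomposed

def saved_computations_ratio(ops_original, ops_decomposed):
    """Percentage of operations saved; negative when the decomposed model costs more."""
    return 100.0 * (ops_original - ops_decomposed) / ops_original
```

A compression rate below 1 or a negative saved computations ratio, as reported for CIFAR100, means the full decomposed tree is larger or costlier than the original chain network.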

4.4. Implementation and Experimental Results

We chose one conventional network, VGG16, to test our algorithmic approach, and performed varied experiments to demonstrate the effectiveness of the proposed approach.

Figure 4. Hierarchical representation of classes for CIFAR10 from HSD-CNN graph.
Dataset | Model | Parameters | Compression rate | Computations | Saved computations ratio (%) | Accuracy (%)
CIFAR10 | HSD-CNN | 10.98M | 1.34 | 212M | 32.27 | 93.33
CIFAR10 | Leaf node sub-networks | 3.7M | 3.98 | 89M | 71.57 | 93.53(3), 97.3(3), 97.73(4)
CIFAR100 | HSD-CNN | 40M | 0.37 | 438M | -39.94 | 71.22
CIFAR100 | Leaf node sub-networks | 3.7M | 3.97 | 89M | 71.57 | 85.85(13), 84.9(10), 71.17(6), 85(11), 88(6), 78(6), 91.25(4), 94.75(4), 87.38(8), 72.81(16), 95.6(6), 92.6(3), 99.5(2), 98.5(2), 89.3(3)
CALTECH101 | HSD-CNN | 26.3M | 0.56 | 15.1G | 1.31 | 78.95
CALTECH101 | Leaf node sub-networks | 3.7M | 3.97 | 4.36G | 71.13 | 80.1(35), 89.23(9), 91.89(17), 94.44(3), 80.57(12), 100(2), 90.63(3), 96.67(4), 97.18(6), 92.74(11)
Table 3. Comparison of HSD-CNN with its leaf node sub-networks on VGG16 for CIFAR and CALTECH datasets. In the format x(y) of Accuracy, x represents the accuracy of a sub-network calculated for y classes.

4.4.1. Decomposing VGG16 using CIFAR and CALTECH datasets

Let us first consider the CIFAR10 dataset. We first train the VGG16 network, with the specifications detailed in Section 4.2, on this dataset. Impact score class vectors are calculated for each layer with respect to each of the 10 classes in CIFAR10, following Section 3.2.1. Although the chosen VGG network has 13 CONV layers, we cluster into two subsets only at those CONV layers that are immediately followed by max-pool layers, for each parent node received from the predecessor CONV layer. Based on Algorithm 1 in Section 3.2.2, a new HSD-CNN graph is formed from the Iscvs of VGG16 on CIFAR10. At the final layer, we obtain 3 leaf nodes, as CIFAR10 has only 10 classes. The parameters of the model trained on CIFAR10 are then transferred to the newly formed HSD-CNN, following the strategy detailed in Section 3.2.3. The accuracy drop after fine-tuning is 0.08%, an almost negligible degradation in performance.

Similar experiments are repeated for the other two datasets, with detailed results shown in Tables 2 and 3. The proposed algorithm results in 15 and 10 leaf nodes at the final layers for the CIFAR100 and CALTECH101 datasets, respectively, whereas CIFAR10 yields only 3 leaf nodes. The other two datasets each contain 100 or more classes, so it might have been difficult to discriminate the classes at the decomposition nodes. With fewer classes, the HSD-CNN algorithm leads to fewer leaf nodes at the end layer for CIFAR10. We also limit the number of classes discriminated by any node to a minimum of two.

We also observe that there is only a 0.85% accuracy drop for CIFAR100 from its original accuracy. For CALTECH101, however, accuracy improves from 76.39% to 78.95%. Though CALTECH101 has fewer training samples (about 6K) than CIFAR100 (50K), its HSD-CNN has 5 fewer leaf nodes. This may imply that forming a wider HSD-CNN network does not always improve performance. There are other possible reasons as well: the sizes of the input and activation maps differ between CIFAR100 and CALTECH101, and the information in the Iscv vectors might also affect performance.

Experimental results in Tables 2 and 3 indicate that applying the HSD-CNN algorithm either increases performance or causes negligible degradation in accuracy. We can further use the formed HSD-CNN to limit the over-representation of networks and to explore other applications, as detailed in Section 4.5.

Usually, the higher the compression rate, the better the compression algorithm. As we increase the number of decomposed nodes in each layer, as in Section 3.2.2, the number of parameters of the HSD-CNN increases gradually. The added parameters and features might be redundant. To address this, we simultaneously prune half the channels of each node in the Decompose Node step of Section 3.2.2. This reduces the parameters of each node to about one-fourth. As we also prune the parameters, our algorithm restricts the increase in parameters. As the HSD-CNN for CIFAR10 is relatively thinner than those for CIFAR100 and CALTECH101, it achieves 1.34 times compression. As there are more leaf nodes in large class domain problems, the compression rate for CIFAR100 is just above half of that for CALTECH101. The compression rate can increase further for sub-network applications, as explained in Section 4.5.
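The effect of pruning half the channels can be checked with a short parameter count. This sketch assumes that both the input and the output channels of a convolution are halved, that the kernel is 3×3, and that bias terms are ignored; the layer width of 512 is only an example.

```python
def conv_params(out_channels, in_channels, kernel=3):
    """Number of weights in a convolution layer, ignoring biases."""
    return out_channels * in_channels * kernel * kernel

full = conv_params(512, 512)    # one node of the original chain
halved = conv_params(256, 256)  # one decomposed node after pruning half the channels
ratio = full / halved
```

Under these assumptions each decomposed node holds one-fourth of the original weights, so a layer split into two such nodes holds about half of them.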

We also observe computation savings of 32.27% and 1.31% for CIFAR10 and CALTECH101, respectively. However, we note no improvement for CIFAR100 (-39.94%). Similar behavior is observed for the speedup rate metric. In summary, we show that our HSD-CNN results in comparable performance for CIFAR10 and CALTECH101. An improvement may also be obtained for CIFAR100 if the number of leaf nodes formed is limited.

Method | CIFAR10 | CIFAR100
HD-CNN (yanhd, ) | - | 67.31
Tree-CNN (DBLP:journals/corr/abs-1802-05800, ) | 86.24 | 60.46
B-CNN (Zhu2017BCNNBC, ) | 88.22 | 64.42
HSD-CNN (ours) | 93.33 | 71.22
Table 4. Comparison of our proposed HSD-CNN with other hierarchical CNN methods (accuracy in %).

4.4.2. Comparison with state of the art

As seen in Table 4, our proposed HSD-CNN performs better than other hierarchical methods (yanhd, ; DBLP:journals/corr/abs-1802-05800, ; Zhu2017BCNNBC, ) on both the CIFAR10 and CIFAR100 datasets.

The algorithms in (yanhd, ; DBLP:journals/corr/abs-1802-05800, ; Zhu2017BCNNBC, ) are formed in a 2-level hierarchy, and forming their bottom, finer-level architecture involves manual interference. In HSD-CNN, by contrast, there is no manual interference: the network is self-formed, and a hierarchy with more than two levels is established between the object categories. As the network design in (yanhd, ; DBLP:journals/corr/abs-1802-05800, ; Zhu2017BCNNBC, ) is not fixed, designing and training new networks is computationally expensive and time-consuming. Our proposed HSD-CNN algorithm, however, automatically derives its design from any standard network and simultaneously loads suitable pre-initialized parameters.

We also find that the proposed HSD-CNN achieves better accuracy than the method in (DBLP:conf/cvpr/GuoP17, ) in almost all parameter cases.

4.5. Application

First, all the classes in the dataset can inherently be represented hierarchically from the HSD-CNN structure layout. In a dataset with no category annotation, classes with similarities can be grouped at one category level, while dissimilar classes lie in different categories. At the same time, it is easy to visualize all the classes and their parent categories in a hierarchical representation. However, HSD-CNN focuses more on how much the calculated features affect the classes than on their similarity. The hierarchical view of CIFAR10 obtained from HSD-CNN is in Figure 4. Plane, Car, and Truck are in one category. As this category is formed at an early stage, it reflects early-layer features that have a strong impact in discriminating these classes. According to the Iscv vectors, Horse is also included in the same category.

Second, our HSD-CNN forms more than one path. This structure layout facilitates computing the paths in parallel on different cores of CPUs or GPUs. In this way, resources can be utilized efficiently without extra allocation and overhead.

Third, as discussed earlier in Section 1, a conventional CNN produces over-represented features for discriminating classes. Our HSD-CNN proposal modifies the CNN structure so that it can be used for limited subsets of classes, with a representation of features appropriate for discriminating only the classes needed in the application. A detailed explanation is given in Section 4.5.1.

Figure 5. Performance of Sub-networks formed from HSD-CNN for CIFAR10.
Figure 6. Performance of Sub-networks formed from HSD-CNN for CIFAR100.

4.5.1. Sub-network

Generally, CNNs are designed and trained for hundreds or thousands of classes. These designs may perform better in applications with large class domains. However, not all applications work with all the classes; many require only a subset. It is tedious to design and train a network again for each such application. Our algorithm helps retain performance without designing and retraining a new CNN every time a different subset of a class domain is used. Experimental results in Table 3 indicate that the sub-networks corresponding to leaf nodes achieve better accuracy. These sub-networks also have a compression rate of nearly 3.97 relative to their original network and save about 71% of the computations. These results are obtained without retraining the sub-networks, and we achieve better performance in almost all cases.

For example, suppose the HSD-CNN for the CIFAR100 dataset is trained on all 100 classes. Choose any subset of 20 of these classes and mark the paths in the HSD-CNN that correspond to them. Then form the sub-graph that contains all these paths together with their edges. The resulting sub-graph can be deployed directly, without any further training. In this way, HSD-CNN provides a suitable representation for a limited set of classes even though the network was originally trained for a large class domain. The sub-network has fewer parameters and computations, which increases the compression rate, the speedup rate, and the saved-computations ratio.
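The path-marking step described above can be sketched as a small tree traversal. This is only an illustration under stated assumptions: the hierarchy `TREE`, the node names `n1` to `n4`, and the function `extract_subgraph` are hypothetical and do not come from the paper, and the grouping of classes other than Plane, Car, Truck, and Horse is invented for the example. A real HSD-CNN node would also carry layers and weights, which are omitted here.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy hierarchy: internal nodes map to their children, and
# names that never appear as keys are leaves (classes). The real HSD-CNN
# tree is derived from Iscv vectors; this layout is only an assumption.
TREE: Dict[str, List[str]] = {
    "root": ["n1", "n2"],
    "n1": ["plane", "car", "truck", "horse"],
    "n2": ["n3", "n4"],
    "n3": ["cat", "dog"],
    "n4": ["bird", "deer", "frog", "ship"],
}


def extract_subgraph(
    tree: Dict[str, List[str]], root: str, wanted: Set[str]
) -> Tuple[Set[str], Set[Tuple[str, str]]]:
    """Keep every node and edge that lies on a root-to-leaf path
    ending in one of the wanted classes."""
    nodes: Set[str] = set()
    edges: Set[Tuple[str, str]] = set()

    def visit(node: str) -> bool:
        children = tree.get(node)
        if children is None:
            # Leaf: keep it only if its class was requested.
            keep = node in wanted
        else:
            keep = False
            for child in children:  # visit all children, no short-circuit
                if visit(child):
                    edges.add((node, child))
                    keep = True
        if keep:
            nodes.add(node)
        return keep

    visit(root)
    return nodes, edges


# Sub-network for a three-class application.
nodes, edges = extract_subgraph(TREE, "root", {"cat", "dog", "plane"})
print(sorted(nodes))
print(sorted(edges))
```

In this toy layout the branch `n4` is dropped entirely, which is the source of the parameter and computation savings: any branch that serves none of the requested classes never needs to be evaluated.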

We also perform experiments on the use of sub-networks of HSD-CNN, as shown in Figures 5, 6, and 7.

CIFAR10: For the subsets of the 10 classes, we group all class combinations with the same cardinality into one category (the subsets-of-classes categories on the X-axis). We omit the category with cardinality 1, because accuracy over a single class is not meaningful. We perform direct inference for all these sub-networks without retraining HSD-CNN and plot the accuracy in Figure 5. We also omit the category with cardinality 10 and instead mark the HSD-CNN accuracy on the full class set as a horizontal dotted line for comparison. We observe that the sub-network accuracy for most class combinations is higher than the full-class-set HSD-CNN accuracy (the dotted horizontal line). Where many combinations yield similar accuracy, the shape around the vertical line spreads wider (more points at that accuracy). Similarly, the thin line in the figure corresponds to the combinations of classes that have extreme accuracy in their group. Where the shape is thin, very few combinations of classes result in the corresponding accuracy.
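The grouping by cardinality can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the functions `subset_accuracy` and `accuracy_by_cardinality` are hypothetical names, the scores are synthetic, and restricting a full score matrix to the subset's columns is an assumed stand-in for running the pruned sub-network, whose structure the paper obtains by removing branches.

```python
from itertools import combinations

import numpy as np


def subset_accuracy(logits: np.ndarray, labels: np.ndarray, subset) -> float:
    """Accuracy over samples whose label is in `subset`, predicting
    only among the classes of that subset."""
    cols = np.asarray(subset)
    mask = np.isin(labels, cols)
    # argmax over the restricted columns, mapped back to class ids
    pred = cols[np.argmax(logits[mask][:, cols], axis=1)]
    return float(np.mean(pred == labels[mask]))


def accuracy_by_cardinality(logits: np.ndarray, labels: np.ndarray, num_classes: int):
    """Map each cardinality k to the accuracies of all k-class subsets.
    Cardinality 1 and the full class set are skipped, as in the text."""
    groups = {}
    for k in range(2, num_classes):
        groups[k] = [
            subset_accuracy(logits, labels, combo)
            for combo in combinations(range(num_classes), k)
        ]
    return groups


# Synthetic stand-in data: 500 samples, 10 classes, mildly informative scores.
rng = np.random.default_rng(0)
labels = np.arange(500) % 10
logits = rng.normal(size=(500, 10))
logits[np.arange(500), labels] += 1.5

groups = accuracy_by_cardinality(logits, labels, 10)
for k, accs in groups.items():
    print(k, len(accs), round(float(np.mean(accs)), 3))
```

Each list in `groups` holds the values that would form one vertical distribution in a plot of accuracy against subset cardinality. On this synthetic data, smaller subsets score higher simply because fewer classes compete; that is a property of the toy setup and says nothing about the paper's measured results.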

Figure 7. Performance of Sub-networks formed from HSD-CNN for CALTECH101.

We perform similar experiments for the CIFAR100 and CALTECH101 datasets. However, the number of possible sub-graphs for these datasets is so large that exhaustive experimentation would require huge amounts of memory and time. For convenience, we choose 100 combinations of two classes for cardinality 2, and likewise 100 combinations for each of the other cardinalities. As there are nearly 100 categories on the X-axis, it is difficult to visualize them all on a single graph. We therefore show only cardinality 2 and the cardinalities that are multiples of 10. The results for the other cardinalities follow the same pattern observed in Figures 6 and 7. Sub-network performance is better than the full-set HSD-CNN accuracy in most cases for both datasets.
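The sampling procedure can be sketched as below. The function name `sample_subsets`, the fixed seed, and the choice to draw distinct combinations uniformly at random are assumptions; the text states only that 100 combinations are chosen per cardinality. The cardinality list covers 2 and the multiples of 10 below 100, matching the categories that are plotted for a 100-class dataset.

```python
import random
from math import comb


def sample_subsets(num_classes, cardinalities, per_cardinality=100, seed=0):
    """Draw up to `per_cardinality` distinct class subsets for each
    requested cardinality, uniformly at random."""
    rng = random.Random(seed)
    sampled = {}
    for k in cardinalities:
        # Never ask for more distinct subsets than exist.
        limit = min(per_cardinality, comb(num_classes, k))
        seen = set()
        while len(seen) < limit:
            seen.add(tuple(sorted(rng.sample(range(num_classes), k))))
        sampled[k] = sorted(seen)
    return sampled


# Cardinality 2 plus the multiples of 10 below the full class count.
cardinalities = [2] + list(range(10, 100, 10))
subsets = sample_subsets(100, cardinalities)

print(comb(100, 2))  # number of two-class subsets of 100 classes
print({k: len(v) for k, v in subsets.items()})
```

Each sampled tuple of class indices would then be turned into a sub-network and evaluated without retraining.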

5. Conclusion

We propose a novel strategy to self-decompose a conventional CNN into a hierarchical tree structure. We adapt class-specific filter sensitivity analysis to calculate impact score class vectors (Iscv). The Iscv vectors guide the decomposition of nodes while forming the HSD-CNN architecture. We also strategically initialize the newly formed HSD-CNN with parameters transferred from the original trained model. Experimental results show that our algorithm performs better than other hierarchical methods. We also discussed sub-network applications of HSD-CNN, which limit the over-representation found in conventional CNNs. A further increase in the direct compression rate and the saved-computations ratio might reduce the inference time. Suitable modifications can be made to HSD-CNN to improve the speedup rate and compression rate. As the structure of HSD-CNN resembles a tree, deciding which path an input sample should follow would further enhance the speedup rate and energy savings, which we intend to explore in the future.