Convolutional neural networks (CNNs) have been broadly applied to various visual tasks due to their superior performance ([vgg], [resnet], [densenet]). However, their heavy computational burden prevents them from running on mobile devices. Some works prune neural networks into smaller ones ([slimming], [pruning1], [pruning2]). In addition, many lightweight network structures ([mobilenet], [mobilenetv2], [shufflenet]) have been proposed to adapt CNNs to computationally limited mobile devices. However, these methods usually require running a whole pre-trained network regardless of the task. For example, suppose a first task requires discriminating cats from dogs and a second task requires discriminating apples from watermelons. With a CNN pre-trained on ImageNet, one must run the whole CNN on each task, which is usually time-consuming and wastes computation.
Our work focuses on a basic question: can we run only parts of a CNN? To achieve this goal, we need a method that dissects the whole network into pieces and reconstructs some of these pieces according to the specific task. The reconstructed CNN should have a smaller computation cost and better performance. Meanwhile, the process that generates this substructure should be quick and easy, so that the technique can be applied on mobile devices and small robots such as cell phones and unmanned aerial vehicles. With this technique, such devices only need to store one complete CNN and some information for the substructure-generating program. When a specific task arrives, the device can generate a smaller substructure almost instantly and run it, rather than running the whole original CNN.
In this paper, we propose a novel and interpretable algorithm to generate these smaller substructures. Our method is inspired by a method for interpreting CNNs. As shown in Figure 1, the original CNN has many channels, but not every channel is useful for discriminating every class. What we need to do is find the channels that are important to each class and combine them for the specific task. This looks similar to previous work on structured network pruning ([slimming], [pruning3], [pruning4]). However, all of these pruning methods need fine-tuning, which is time-consuming and not feasible on mobile devices. These pruning methods also usually lack interpretability, which people greatly need when using CNNs. Therefore, our aim is not to propose a pruning method that makes a CNN smaller, but to find the best channels for each class and combine them for specific tasks. Our method can be used not only on VGG and ResNet, but also on light structures such as MobileNetV2. In addition, we make this process quick and interpretable.
Suppose that we have a pre-trained CNN. We first extract the channels that are important for each class and give each channel a weight that represents its contribution to the final decision. The vector of channel weights for each class is called the channel weights vector (CWV). This process is called 'dissection'. Then, when a specific task arrives, we combine the channel weights vectors (CWVs) of the classes needed. The resulting 0-1 vector is called the combined channel weights vector (CCWV). This process is called 'reconstruction'. Note that a CWV is the channel weights vector of one class (like a single Lego block), while a CCWV is the channel weights vector of several classes, built by combining CWVs (like a combination of Lego blocks).
For the dissection process, our method is inspired by the critical data routine path (CDRP) method proposed in [cdrp], which was designed for interpreting CNNs. This method adds a control gate after the ReLU operation of each channel in the pre-trained CNN, then trains the gates for each image while trying to keep the network's output the same as that of the original network. We use this method to extract the CDRP of each image, and then let the channel weights vector (CWV) of a class be the average of the CDRPs of all images in that class. Note that the CDRP is not a 0-1 vector but a continuous vector. We briefly introduce the CDRP method in Section 3.1.
During the reconstruction process, we try several methods to combine the CWVs of classes, such as the union set, the cross set (intersection) and the difference set. After we obtain a combined channel weights vector (CCWV) for the specific task, we prune the original CNN according to this vector. Note that we do not apply any fine-tuning. The detailed method of obtaining the CCWV is elaborated in Section 3.2.
The channel extraction process is implemented in an interpretable manner. Different parts of a CNN contribute to different classes, and the proposed CWVs assign a quantitative weight to the contribution of every channel to every class. According to our experiments, a single channel of a traditional CNN may contribute to several classes. Previous work by Zhang et al. forces each channel to contribute to only one object part in one class (see [interpretableCNN] for details). However, that method creates a new model and does not explain how traditional CNNs work. We only use traditional CNNs such as VGG and MobileNetV2, which are usually considered uninterpretable. The methods used in our dissection and reconstruction processes (e.g. CDRP, union set) are all easy to understand and interpret.
The main contributions of this paper are as follows:
We propose a method, inspired by the previous work CDRP [cdrp], to generate an interpretable channel weights vector (CWV) for every class.
We propose a novel method to generate efficient subnetworks for different tasks using the channel weights vectors (CWVs).
We study the interpretability of the channel weights vectors (CWVs) and show that our method is one way to adapt deep convolutional neural networks to applications with given knowledge.
2 Related work
2.1 Interpretable manners of CNNs
The interpretability of CNNs has drawn much attention in recent years. Many researchers have used visualization methods to interpret how CNNs work ([visualizing1], [visualizing2], [visualizing3]). Zhou et al. addressed the balance between interpretability and performance and proposed a method to measure interpretability quantitatively [dissection]. Alain and Bengio suggested using linear classifier probes to measure the discrimination power of every layer in CNNs [linear_probes].
Zhang et al. designed an interpretable CNN that forces each channel of the last convolutional layer to be activated by only a single semantic part of a single class [interpretableCNN]. Zhang et al. also used a decision tree based on interpretable CNNs to give the contribution of each part to the final prediction [decision_tree]. The Squeeze-and-Excitation network proposed by Hu et al. adds a bottleneck structure after every layer and automatically assigns weights to every channel [SE]. Li et al. proposed using the SE block and three kinds of loss functions to measure the importance of channels in CNNs [decoupling]. Wang and Su proposed the CDRP method to interpret convolutional neural networks (see [cdrp] for details). Qiu et al. proposed dissecting channels into neurons and extracting neuron paths according to their activation values for adversarial sample detection [path_extraction]. These last two methods are both path extraction methods. The former extracts paths based on channels, while the latter is based on neurons. If we break a CNN into neurons, new hardware would be needed to accelerate the computation. As a result, we use the former method, which treats the channel as the basic building block of a CNN, so that our method runs well on existing devices.
2.2 Network pruning
Network pruning has become very mature and has been used in many areas. Unstructured pruning dissects convolution kernels into neurons and prunes at the neuron level. Han et al. proposed pruning the unimportant connections with small weights in a trained CNN and then fine-tuning [pruning1]. Structured pruning prunes the network at the channel level ([slimming], [pruning3], [pruning4]). In the most typical work, Liu et al. added channel scaling factors after each layer, trained the network together with the scaling factors, then removed the channels with low factor values and fine-tuned [slimming].
Our method differs from network pruning in two ways. We do not allow fine-tuning, whereas network pruning relies on it, and we generate substructures for specific tasks, whereas network pruning targets the complete task.
2.3 Layer skipping
There is also some work on saving resources during prediction. [skipnet] proposed skipping some of the layers when applying a CNN. The authors learned a self-adaptive method that skips some of the middle layers to save computation when making predictions. [branchynet] proposed a method to exit early when using a CNN to make predictions. The authors found that, for most test samples, the feature maps of the front layers alone were sufficient to make a prediction. They therefore designed an algorithm that exits at a front layer if the confidence of the prediction made there is high enough.
However, these works are still designed for the complete task of a CNN, and they operate on the layers of the CNN. Our work is designed to tackle task-specific problems and operates on the channels of the CNN, which is very different from these previous works.
Our method has two stages: 'dissection' and 'reconstruction'. During the dissection process, we use the critical data routine path (CDRP) method to obtain the CDRP of every single image. We then use the average of the CDRPs of all images in one class as the channel weights vector (CWV, a float vector) of that class. During the reconstruction process, we try several methods, such as the union set, cross set and difference set, to obtain the combined channel weights vector (CCWV, a 0-1 vector) of the classes that serve the specific task.
3.1 Dissection process
In this section, we briefly introduce the critical data routine path (CDRP) method. We then use CDRPs to generate the channel weights vector (CWV) for each class. Figure 2 shows the final CWV for a certain class.
Suppose that we have a pre-trained CNN $f_\theta$ with weights $\theta$. Every convolutional layer of the CNN consists of channels. We place a control gate after the ReLU activation of each channel. The control gates can be regarded as channel weights that represent the importance of each channel for a single image. They can also be regarded as a new form of activation (see [cdrp]).
During forward propagation, the output of each channel is simply multiplied by its control gate value. We feed only one image $x$ through the whole CNN at a time and obtain the gated output $f_\theta(x; \lambda)$. The loss in CDRP is inspired by knowledge distillation (see [KD]) and can be written in the following form:
$$\min_{\lambda} \; \mathcal{L}\big(f_\theta(x), f_\theta(x; \lambda)\big) + \gamma \lVert \lambda \rVert_1 \tag{1}$$
Here $\mathcal{L}$ is the cross-entropy loss between the original CNN's output $f_\theta(x)$ and the output after adding the control gates $f_\theta(x; \lambda)$, and $\lambda$ is the vector of all control gate values. The first term encourages the network with control gates to produce the same output as the original network. The second term is the $\ell_1$-norm of the control gates, where $\gamma$ is the weight of the $\ell_1$ loss. This term drives most of the values in the CDRP to zero, which yields a sparser vector.
During backward propagation, the algorithm keeps the weights $\theta$ of the original CNN unchanged (for example, by setting requires_grad=False in PyTorch) and only updates the control gate values by:
$$\lambda \leftarrow \lambda - \eta \, \frac{\partial}{\partial \lambda} \Big( \mathcal{L}\big(f_\theta(x), f_\theta(x; \lambda)\big) + \gamma \lVert \lambda \rVert_1 \Big) \tag{2}$$
where $\eta$ is the learning rate.
When implementing the algorithm, we initialize every element of $\lambda$ to 1 and iterate a fixed number of times for each image using the SGD optimizer with a learning rate of 0.1, momentum of 0.9 and no weight decay. We choose $\gamma$ to balance performance and sparsity in the experiments. We clip the control gate values to [0, 10] after every iteration. Note that we always keep the top-1 prediction the same as that of the original network. If this cannot be achieved, we set all control gates of the image to 1, so that the network output is exactly the same as the original one.
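The gate optimization described above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the "network" is a hypothetical single linear layer `W` applied to gated post-ReLU activations `h`, the gradient is written out by hand instead of using PyTorch autograd, and the default values of `gamma` and `steps` are placeholders, since the values used in the paper are not stated here. The top-1 check is done once at the end, which is a simplification.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def logits(W, h, gates):
    # Toy stand-in for the CNN: a linear layer on gated channel activations.
    return [sum(w * g * a for w, g, a in zip(row, gates, h)) for row in W]

def optimize_gates(W, h, gamma=0.05, steps=30, lr=0.1, momentum=0.9):
    """Learn one gate per channel for a single input (hypothetical toy setting)."""
    n = len(h)
    ones = [1.0] * n
    p = softmax(logits(W, h, ones))      # output of the frozen original network
    top1 = p.index(max(p))
    gates, velocity = ones[:], [0.0] * n
    for _ in range(steps):
        q = softmax(logits(W, h, gates))
        grad = []
        for k in range(n):
            # Cross-entropy term: dCE/dlogit_j = q_j - p_j, chained to gate k.
            g_ce = sum((q[j] - p[j]) * W[j][k] * h[k] for j in range(len(W)))
            # Subgradient of gamma * |gate_k| (gates are kept non-negative).
            g_l1 = gamma if gates[k] > 0 else 0.0
            grad.append(g_ce + g_l1)
        # SGD with momentum, then clip gates to [0, 10].
        velocity = [momentum * v + g for v, g in zip(velocity, grad)]
        gates = [min(10.0, max(0.0, x - lr * v)) for x, v in zip(gates, velocity)]
    q = softmax(logits(W, h, gates))
    if q.index(max(q)) != top1:          # top-1 changed: fall back to all ones
        return ones
    return gates
```

Only the gates are updated; `W` plays the role of the frozen pre-trained weights.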
Using this CDRP algorithm, we obtain a CDRP $\lambda_i$ for each image $x_i$. We then merge the CDRPs of each class and generate a new vector $\mathrm{CWV}_c$ for each class $c$. We use the average of the image CDRPs to represent the CWV of the class:
$$\mathrm{CWV}_c = \frac{1}{N_c} \sum_{x_i \in c} \lambda_i \tag{3}$$
where $N_c$ refers to the total number of images in class $c$.
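The averaging step is simple enough to state in a few lines. The sketch below assumes each CDRP is a plain list of floats with one entry per channel; the function name `class_cwv` is hypothetical.

```python
def class_cwv(cdrps):
    """Average the per-image gate vectors (CDRPs) of one class into its CWV."""
    if not cdrps:
        raise ValueError("need at least one CDRP")
    n = len(cdrps)
    # zip(*cdrps) iterates channel by channel across all images.
    return [sum(channel) / n for channel in zip(*cdrps)]
```

For example, `class_cwv([[0.0, 2.0], [1.0, 0.0]])` averages two images over two channels.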
3.2 Reconstruction process
To generate a task-specific CNN, we wish to obtain a single combined channel weights vector (CCWV) for the $n$ classes needed. The vector should be a 0-1 vector so that it can decide which channels to keep and which to drop. We tried several methods to obtain the CCWV, such as the union set, cross set and difference set. We find that the union set achieves the highest pruning rate while maintaining satisfactory accuracy. The difference set does well in two-class classification tasks but is not easy to apply to tasks with three or more classes. The cross set has the worst performance because it drops too many channels. We only introduce the union set and the difference set in the following sections.
3.2.1 Union set
The most efficient method of obtaining the combined channel weights vector (CCWV) is to take the union over all needed classes. If we want a network to classify several classes, we just need to take the union of the important channels of every class. Figure 3 shows the basic idea of the union set. In the implementation, we use the maximum value among all the $\mathrm{CWV}_c$ for each channel to represent the union set, where $\mathrm{CWV}_c$ is the channel weights vector of class $c$. If we have $C$ classes to classify, we use the following formula to generate the CCWV for these $C$ classes:
$$\mathrm{CCWV} = \mathbb{1}\Big[ \max_{c \in \{1, \dots, C\}} \mathrm{CWV}_c > t \Big] \tag{4}$$
where the maximum, the comparison and the indicator are applied channel-wise.
This way of forming the union comes from fuzzy sets. $t$ is a threshold that balances the sparsity of the CCWV against the accuracy of the subnetworks. Note that the CCWV is a 0-1 vector.
After we obtain the combined channel weights vector, we prune the CNN according to it. If the value of the CCWV for a channel is 1, we keep that channel and its weights. Otherwise, we simply drop that channel and all connections attached to it.
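The union rule and the pruning step above can be sketched as follows. This is an illustration under stated assumptions: CWVs are plain lists of floats, the comparison with the threshold is taken to be strict, and `prune_layer` uses a hypothetical `[out][in]` weight matrix as a stand-in for a convolution kernel whose spatial dimensions are omitted.

```python
def union_ccwv(cwvs, threshold):
    """Fuzzy-set union: per-channel maximum over classes, then binarize."""
    return [1 if max(channel) > threshold else 0 for channel in zip(*cwvs)]

def prune_layer(weight, in_mask, out_mask):
    """Drop pruned output channels (rows) and pruned input channels (columns).

    in_mask is the 0-1 mask of the previous layer's channels and
    out_mask is the 0-1 mask of this layer's channels.
    """
    return [
        [w for w, keep_in in zip(row, in_mask) if keep_in]
        for row, keep_out in zip(weight, out_mask)
        if keep_out
    ]
```

Because a dropped channel disappears both as an output of its own layer and as an input of the next layer, each layer is sliced with two masks. No weights are changed, which matches the requirement that no fine-tuning takes place.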
3.2.2 Difference set
The union set achieves high accuracy, but because it keeps most of the channels needed by every class in the specific task, it still keeps too many channels. According to the semantic meaning of the CWV, the value of the control gate after each channel represents the importance of that channel for the class. If we only have 2 classes to classify, we only need the channels that differ between these two classes to discriminate them. As a result, we can use the part where their CWVs differ, and this is what we call the difference set. Figure 4 shows the basic idea of the difference set.
Suppose that we have two channel weights vectors $\mathrm{CWV}_a$ and $\mathrm{CWV}_b$ for class $a$ and class $b$. We preserve the part where they differ the most:
$$\mathrm{CCWV} = \mathbb{1}\big[\, \lvert \mathrm{CWV}_a - \mathrm{CWV}_b \rvert > t \,\big] \tag{5}$$
Here $t$ is a threshold that balances the sparsity of the CCWV and the accuracy of the subnetworks.
In our experiments, the difference set only achieves performance similar to the union set on two-class classification tasks, and it is not easy to apply to tasks with more than two classes, so we do not show results for the difference set in Section 4.
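For comparison with the union rule, a difference-set mask for two classes could look like the sketch below. It assumes that "most different" means the absolute channel-wise difference exceeds a threshold; the exact rule used in the experiments may differ, and the function name is hypothetical.

```python
def difference_ccwv(cwv_a, cwv_b, threshold):
    """Keep channels whose weights differ strongly between two classes."""
    return [1 if abs(a - b) > threshold else 0 for a, b in zip(cwv_a, cwv_b)]
```

A channel that is equally important to both classes is dropped under this rule, which is why it does not extend naturally to three or more classes.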
In this paper, we use three state-of-the-art CNN models: VGG16, ResNet18 and MobileNetV2. VGG16 represents the classical deep convolutional structure, ResNet18 represents the typical skip-connection structure, and MobileNetV2 represents the light structures used on mobile devices. Experiments show that our method can be applied well to all of these structures.
As datasets, we use Cifar-10, Cifar-100 and ILSVRC-2012. Cifar-10 is a light dataset composed of 60,000 images: 50,000 for training and 10,000 for validation. It is divided into 10 classes, and every class has 5,000 training images and 1,000 validation images. Every image in Cifar-10 is a 32x32 pixel RGB image. This dataset is broadly used in image classification and is considered one of the best-known benchmark datasets in the deep learning field. Cifar-100 is similar to Cifar-10, but it has 100 classes and each class has one tenth as many images as a class in Cifar-10. ILSVRC-2012 is another well-known image dataset, used for the ImageNet challenge. It has about 1.28 million RGB training images in 1,000 classes. The images vary in size, but they are much larger than those in Cifar-10. It also has a validation set of 50,000 images, 50 for each class. We choose 10 classes in ILSVRC-2012 to generate CWVs and test the substructures on the validation set.
For every experiment, we give the accuracy of the original network over all categories of each dataset. For a specific task, the original accuracy is the accuracy of the whole network when only the categories in that task are considered. That is, we mask the probabilities (given by the softmax layer) of all other categories to zero and compare the predicted probabilities of the classes given in the 'Classes' column. The 'Ori acc.' column gives the accuracy of the complete CNNs on the given classes. The 'Acc.' column gives the accuracy of the CNNs pruned by our CCWVs on the given classes. We also define the 'Sparsity' of a network as the number of channels remaining in the pruned CNN divided by the original number of channels.
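The two evaluation quantities defined above can be made concrete with a short sketch. It assumes softmax outputs are given as lists of per-class probabilities and labels as integer class indices; all function names are hypothetical.

```python
def task_prediction(probs, task_classes):
    """Argmax restricted to the task's classes (all other classes masked out)."""
    return max(task_classes, key=lambda c: probs[c])

def task_accuracy(all_probs, labels, task_classes):
    """Fraction of samples whose masked prediction matches the label."""
    hits = sum(
        task_prediction(p, task_classes) == y for p, y in zip(all_probs, labels)
    )
    return hits / len(labels)

def sparsity(ccwv):
    """Remaining channels divided by original channels, as defined in the text."""
    return sum(ccwv) / len(ccwv)
```

Note that under this definition a lower sparsity value means that more channels were removed.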
Note that all of the results shown in our paper used the union set method described in 3.2.1.
4.1 Experiment on Cifar-10
In this experiment, we first pre-trained VGG16, ResNet18 and MobileNetV2 on 10 classes Cifar-10 dataset with SGDR optimizer. The initial learning rate was 0.1. The momentum was 0.9 and the weight decay is
After pre-training, we extracted the CWV of every class for every CNN using the method and settings described in Section 3.1. Note that we only used 100 training images per class to generate the CWV of that class. We then randomly chose some 2-class and 3-class classification tasks and combined their CWVs to obtain CCWVs. When pruning, we set the threshold (introduced in Section 3.2.1 to balance the sparsity and accuracy of the pruned structures) to 0.001 for VGG16 and ResNet18, and to 0.0008 for MobileNetV2. We pruned all layers in VGG16 and ResNet18. MobileNetV2 is already light, so we only pruned the high-level layers with 320 or more channels.
The results for VGG16 and ResNet18 are shown in Table 1. We display some typical class combinations. The plane-truck combination is the easiest pair to distinguish, and the cat-dog pair is the most difficult two-class task in the Cifar-10 dataset. The car-deer pair represents the average discrimination difficulty among all two-class combinations. For VGG16, we lose at most 0.65% accuracy on all three two-class tasks, and we can reduce the number of channels to 33%, 32% and 40%, respectively. We lose at most 0.26% accuracy on all three three-class tasks. Note that the deer-dog-frog combination does not lose any accuracy, with a channel sparsity of 55%. The results for ResNet18 are not as good as those for VGG16, but the accuracy loss is still very small. We need to keep about 53% of the channels for the two-class tasks and 81% for the three-class tasks. Note that the dog-cat task in ResNet18 needs to keep more channels because these two classes are difficult to distinguish.
The results for MobileNetV2 are shown in Table 2. In this table we only display the sparsity of the last two layers. Although we hardly pruned MobileNetV2 apart from the last two layers, we still obtain significant computation savings because the most computation-dense part of MobileNetV2 is its tail. This is mentioned in the original MobileNetV3 article (see [mobilenetv3] for details). The results show that the accuracy loss is still very small. The 320-channel layer can be pruned to 12% on average in both the two-class and three-class tasks. For the last 1280-channel layer, the two-class tasks need less than 50% of the channels and the three-class tasks need a little more than 50%.
4.2 Experiment on Cifar-100
We also ran experiments on the Cifar-100 dataset. During pre-training, we set all parameters to the same values as in the Cifar-10 experiments. The performance of our method on Cifar-100 is very similar to that on Cifar-10 due to the similarity of the two datasets. We therefore show more statistical results in this section.
With 100 classes, there are $\binom{100}{2}$ (4,950) 2-class combinations and $\binom{100}{3}$ (161,700) 3-class combinations. We randomly choose 200 of the 2-class combinations and 200 of the 3-class combinations for testing. When pruning, we set the threshold to 0.01 for VGG16, 0.006 for ResNet18 and 0.02 for MobileNetV2. We compute the average original accuracy, the average accuracy of the pruned networks, the average accuracy loss, and the average sparsity of the pruned networks. The results are shown in Table 3 and Table 4.
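The combination counts and the random choice of tasks can be reproduced with the standard library. The sketch below is illustrative only; the helper `sample_tasks` and its fixed `seed` are assumptions, not the sampling code used for the experiments.

```python
import math
import random
from itertools import combinations

def sample_tasks(num_classes, k, n_tasks, seed=0):
    """Draw n_tasks distinct k-class combinations uniformly at random."""
    rng = random.Random(seed)
    all_tasks = list(combinations(range(num_classes), k))
    return rng.sample(all_tasks, n_tasks)

pairs_total = math.comb(100, 2)     # number of 2-class combinations
triples_total = math.comb(100, 3)   # number of 3-class combinations
tasks_2 = sample_tasks(100, 2, 200)
tasks_3 = sample_tasks(100, 3, 200)
```

Sampling 200 tasks covers about 4% of the 2-class combinations but only about 0.1% of the 3-class combinations, which is worth keeping in mind when reading the averages.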
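Table 3: Average results on 2-class combinations of Cifar-100 (columns: CNNs, Ori acc., Acc., Acc. loss, Sparsity).
Table 4: Average results on 3-class combinations of Cifar-100 (columns: CNNs, Ori acc., Acc., Acc. loss, Sparsity).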
In our experiments, the original accuracies of VGG16, ResNet18 and MobileNetV2 on all classes of the Cifar-100 dataset are 73.8%, 76.4% and 75.41%, respectively. Table 3 displays the results for the 2-class combinations. The average original accuracy is nearly 99%. VGG16 loses 1.04% accuracy on average and ResNet18 loses 1.44%. However, the average sparsity of the ResNet18 2-class substructures is greater than that of VGG16 (that is, more channels remain). This means that ResNet18, which achieves higher accuracy on this dataset, is more difficult to dissect. MobileNetV2 loses only 0.54% accuracy on average, with sparsities of 36% and 12.7% on the last two convolution layers.
Table 4 shows the results for the 3-class combinations. The average accuracy loss is similar to that in the 2-class experiments, but the sparsity of all three CNN structures is higher than in the 2-class tasks. These two experiments show that our method remains effective on Cifar-100: it can efficiently prune the network channel-wise according to the specific task without fine-tuning, which brings CNN structures closer to quick deployment in real applications.
4.3 Experiment on ILSVRC-2012
For the ILSVRC-2012 dataset, we used the pre-trained VGG16 in our experiments. The pre-trained model covers all 1,000 classes in ILSVRC-2012. However, no matter what threshold value we used, the results showed that every category needs nearly 90% of the channels to maintain accuracy. If we combined the CWVs of two categories, the sparsity rose directly to over 90%. This means that we could not obtain a satisfactorily smaller network even when we only needed to classify two categories.
However, the accuracy on any two classes using our reconstructed CNN, defined by the CCWVs and the threshold, was significantly higher than that of the original network. The result is shown in Figure 5.
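Table 5: Maximum, minimum and average accuracy rise for the 2-class and 3-class tasks on ILSVRC-2012 (columns: Classes, Max acc. rise, Min acc. rise, Average acc. rise).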
Although our method did not remove many channels, Figure 5 shows that the accuracies increase significantly. Table 5 shows the maximum, minimum and average accuracy rise for the 2-class and 3-class tasks. The 2-class tasks gain 14.29% accuracy on average, and the 3-class tasks gain 7.43% on average. As for the reason, we think the expressive ability of a single channel is limited in a given network. ILSVRC-2012 has 1,000 classes, and VGG16 cannot devote different channels to different classes during training. It needs more joint information to obtain the discriminative power for 1,000 classes, which means that the VGG16 trained on ILSVRC-2012 cannot be dissected class-wise. However, our method still removes some channels. These channels may be useless or even harmful for specific tasks. Once these channels are removed, the network focuses on the specific task and, reasonably, achieves higher accuracy.
5 Interpretability of class-wise CWVs
5.1 The distribution of channel sparsity
Figure 6 shows the channel sparsity of VGG16 for each category in the Cifar-10 dataset according to its CWV, i.e. the ratio of non-zero values in the CWV. All data come from the experiment in Section 4.1. The columns show that the car and truck need the fewest channels and the bird needs the most. The number of channels needed for each category reflects its feature complexity: the more channels needed, the more complex the category is. We also find that similar categories need similar numbers of channels (see Section 5.2 and Figure 8).
As for the distribution of sparsity among layers, we give one example for explanation. Figure 7 shows the layer-wise channel sparsity of VGG16 on the plane-truck combination of the Cifar-10 dataset. The blue columns show the total number of channels in each layer, and the pink columns show the number of channels remaining in the reconstructed CNN. The figure shows that this 2-class discrimination task needs most of the channels in layers 2 to 7. However, only very few channels remain in layers 8 to 13, and fewer than half of the channels remain in the first layer.
It is widely accepted that the front layers of CNNs extract basic features and the higher layers capture more concrete semantic information. As a result, the higher layers should be more sensitive to different categories than the front layers, and can therefore be dissected category-wise more easily. This explains why the two-class discrimination task needs most of the channels in the front layers and only a small part of the channels in the higher layers. However, the sparsity of the first layer does not follow this rule: fewer than half of its channels remain. The network is restricted to these two classes, and the first layer can roughly estimate the area that the network needs to pay attention to. We regard this phenomenon as a form of knowledge learned through our dissection and reconstruction process.
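A layer-wise view like the one in Figure 7 can be derived from a CCWV by splitting it per layer. The sketch below assumes the CCWV is stored as one flat 0-1 list that concatenates the layers in order, which is an assumption about the data layout rather than something stated in the paper. The channel counts are those of the 13 convolutional layers of the standard VGG16 architecture.

```python
# Output channels of the 13 convolutional layers of standard VGG16.
VGG16_CONV_CHANNELS = [64, 64, 128, 128, 256, 256, 256,
                       512, 512, 512, 512, 512, 512]

def layerwise_remaining(ccwv, layer_sizes):
    """Count the kept channels of each layer in a flat, concatenated CCWV."""
    if sum(layer_sizes) != len(ccwv):
        raise ValueError("CCWV length does not match the layer sizes")
    counts, start = [], 0
    for size in layer_sizes:
        counts.append(sum(ccwv[start:start + size]))
        start += size
    return counts
```

Dividing each count by the corresponding entry of `layer_sizes` gives the per-layer sparsity.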
5.2 The semantic meaning of the CWVs of each classes
The CWVs carry semantic meaning. We draw a heat map of the similarity between the categories of the Cifar-10 dataset using the CWVs of VGG16. We define the similarity of two classes $a$ and $b$ as the size of the cross set of their important channels divided by the size of the union set of their important channels (see formula 6):
$$S(a, b) = \frac{\lvert C_a \cap C_b \rvert}{\lvert C_a \cup C_b \rvert} \tag{6}$$
Here $C_a$ represents the set of channels whose weights are non-zero in the CWV of class $a$. Note that the similarities are relative values, i.e. they only express the differences in similarity among all two-class combinations and do not measure the exact similarity of two categories.
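The similarity measure above is the Jaccard index of the two sets of important channels. A minimal sketch, assuming CWVs are lists of floats and treating exactly zero weights as unimportant (function names are hypothetical):

```python
def important_channels(cwv):
    """Indices of channels with a non-zero weight in the CWV."""
    return {k for k, w in enumerate(cwv) if w != 0}

def class_similarity(cwv_a, cwv_b):
    """Size of the intersection over size of the union of important channels."""
    a, b = important_channels(cwv_a), important_channels(cwv_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

Evaluating `class_similarity` for every pair of class CWVs yields the symmetric matrix that a heat map such as Figure 8 visualizes.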
In Figure 8, the most obvious semantic meaning lies in the last row. The truck is not similar to any other category in the Cifar-10 dataset except the car, because both have four wheels, windows and similar vehicle bodies, and both run on roads or park in parking lots. The category most different from the truck is the bird: trucks run on the ground while birds fly in the sky, and trucks are made of metal while birds are living creatures, so they have almost nothing in common. Another interesting observation from the heat map is that plane, ship and bird share great similarity, which seems surprising. However, when we look at images of these three categories in Cifar-10, we find that these objects usually appear against a blue background (see Figure 9). Planes and birds usually fly in the blue sky, and ships usually appear on the blue ocean. In our opinion, the blue background may contribute to the final prediction of plane, bird and ship, which helps explain why these three kinds of objects share similar channels.
We can also conclude from Figure 6 and Figure 8 that similar categories have similar numbers of important channels. For example, the car and truck share great similarity in the similarity heat map, and both have a channel sparsity of 0.24 in Figure 6. The ship, plane and bird also share great similarity in the heat map, and they have the three highest channel sparsities.
5.3 Application with given knowledge
In this paper, we propose using CWVs for specific tasks. From our point of view, the 'specific task' discussed in this paper is a kind of application with given knowledge. The ways of using deep learning models with given knowledge are currently very limited. Our method offers a new way to apply CNNs to specific tasks with given knowledge, either saving computational resources without losing much accuracy or raising the accuracy on a sub-task of the complete CNN.
In this paper, we proposed a new method to efficiently apply CNNs to specific tasks. We dissect the channels of a CNN class-wise to obtain CWVs, combine the CWVs of several classes to obtain a CCWV, and then reconstruct a CNN for the specific task according to the CCWV. Experiments on VGG16, ResNet18 and MobileNetV2 showed that our method can efficiently reduce the number of channels for 2-class and 3-class tasks on the Cifar datasets without losing much accuracy, and can significantly raise accuracy on 2-class and 3-class tasks on the ILSVRC-2012 dataset. Note that our method does not need any fine-tuning, which makes it possible to apply it on mobile devices. Our analyses showed the strong interpretability of our method and indicate that it is a new way of applying CNNs to tasks with given knowledge.