
An Experimental Study of the Impact of Pre-training on the Pruning of a Convolutional Neural Network

by   Nathan Hubens, et al.

In recent years, deep neural networks have achieved wide success in various application domains. However, they require substantial computational and memory resources, which severely hinders their deployment, notably on mobile devices or for real-time applications. Neural networks usually involve a large number of parameters, which correspond to the weights of the network. Such parameters, obtained through a training process, are decisive for the performance of the network. However, they are also highly redundant. Pruning methods attempt to reduce the size of the parameter set by identifying and removing the irrelevant weights. In this paper, we examine the impact of the training strategy on the pruning efficiency. Two training modalities are considered and compared: (1) fine-tuned and (2) from scratch. The experimental results obtained on four datasets (CIFAR10, CIFAR100, SVHN and Caltech101) and for two different CNNs (VGG16 and MobileNet) demonstrate that a network that has been pre-trained on a large corpus (e.g. ImageNet) and then fine-tuned on a particular dataset can be pruned much more efficiently (up to 80



1 Introduction

Over the last years, Convolutional Neural Networks (CNNs) have exhibited state-of-the-art performance in various computer vision tasks, including image classification and object detection [27, 29]. While their performance continuously improved, such networks kept growing in complexity and depth. This led to increasing requirements for parameter storage and convolution operations, inhibiting the use of deep neural networks in applications with memory or processing-time limitations.

As the majority of the parameters of CNNs are located in the fully-connected layers, recent works have attempted to reduce the memory demands of such networks by removing the fully-connected part of the network and replacing it with an average pooling layer [30, 21]. Such approaches significantly decrease the number of parameters without affecting the accuracy. The convolution layers, on the other hand, are where most of the computation is performed. This computation cost can be reduced by downsampling the feature maps earlier in the network [11] through various pooling mechanisms, or by replacing the convolution operations with factorized convolutions [14, 4].

Nowadays, a widely used approach when training neural networks for specific applications is fine-tuning [31]. It consists of using the weights of a model pre-trained on a generic database as initialization and then training the model on a particular dataset. Fine-tuning makes it possible to obtain good results for particular applications where the training datasets are not sufficiently large. The dataset used for the pre-training is often very large and contains varied examples (e.g. ImageNet [5]), making the network able to extract a large variety of different features. Generally, the network that has been pre-trained on the large dataset also has a very high capacity and can be over-parameterized for the particular dataset. This can lead to redundant or irrelevant weights. Removing such weights then decreases the number of parameters of the network without significantly degrading its accuracy.

In this work, we propose an experimental analysis of the sensitivity of pruning methods with respect to the training strategy. We consider for evaluation: (1) a network that has been pre-trained on a generic dataset (i.e., ImageNet), and then fine-tuned on a target dataset; and (2) a network that has been directly trained on the target dataset from randomly initialized weights. The experimental results obtained on various datasets (CIFAR10, CIFAR100, SVHN and Caltech101) and for two different networks (VGG16 and MobileNet) demonstrate the superiority of the first approach, which relies on pre-training.

The rest of the paper is organized as follows. Section 2 briefly presents an overview of the state-of-the-art pruning methods. Section 3 introduces the retained evaluation methodology, with the adopted pruning technique, datasets and network architectures. The experimental results are presented and discussed in Section 4. Finally, Section 5 concludes the paper and outlines perspectives for future work.

2 Related Work

Recently, there has been a line of work concerning the lottery ticket hypothesis [7, 8, 24]. This hypothesis argues that, in a regular neural network architecture, there exists a sub-network that can be trained to the same level of performance as the original one, as long as it starts from the original initial conditions. This means that a neural network does not require all of its parameters to perform correctly and that an over-parameterized neural network is useful only for finding the "winning ticket". In practice, those sub-networks can be found by training the original network to convergence, removing the unnecessary weights, then resetting the remaining weights to their original values.
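The train, prune and reset procedure described above can be illustrated with a minimal NumPy sketch. It is not the method of [7, 8, 24]: the "training" step is replaced by a random perturbation, and weight magnitude is assumed as the criterion for "unnecessary" weights. The names `w_init`, `w_trained`, `keep_fraction` and `w_ticket` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Original initialization of a toy weight matrix (kept for the later reset).
w_init = rng.normal(size=(4, 4))

# Stand-in for "training to convergence": a random perturbation.
# A real experiment would run an optimizer on a loss instead.
w_trained = w_init + rng.normal(scale=0.5, size=w_init.shape)

# Assumption: keep the half of the weights with the largest trained magnitude.
keep_fraction = 0.5
k = int(round(keep_fraction * w_trained.size))
threshold = np.sort(np.abs(w_trained), axis=None)[-k]
mask = np.abs(w_trained) >= threshold

# Surviving weights are reset to their ORIGINAL initial values;
# pruned weights are set to zero.
w_ticket = np.where(mask, w_init, 0.0)
```

The resulting `w_ticket` is the candidate sub-network that would then be retrained from its original initial conditions.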

Early work on pruning methods dates back to Optimal Brain Damage [18] and Optimal Brain Surgeon [10], where the weights are pruned based on the Hessian of the loss function. More recently, [9] proposed pruning the weights with small magnitude. This kind of pruning, performed on individual weights, is called unstructured pruning [1], as there is no intent to preserve any structure or geometry in the network. Unstructured pruning methods are the most efficient, as they prune weights at the most fine-grained level. However, they lead to sparse weight matrices. Consequently, taking advantage of the pruning results requires dedicated hardware or libraries able to deal efficiently with such sparsity.
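As an illustration of why unstructured pruning yields sparse matrices rather than smaller ones, here is a minimal sketch of magnitude-based pruning on a single weight matrix. The function name `magnitude_prune` and the `sparsity` parameter are hypothetical, and the sketch omits the retraining used in [9].

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the `sparsity` fraction of entries with the smallest magnitude."""
    n_prune = int(sparsity * weights.size)
    if n_prune == 0:
        return weights.copy()
    # Indices (in the flattened array) of the n_prune smallest magnitudes.
    prune_idx = np.argsort(np.abs(weights), axis=None)[:n_prune]
    pruned = weights.copy().ravel()
    pruned[prune_idx] = 0.0
    return pruned.reshape(weights.shape)

w = np.array([[0.9, -0.05, 0.3],
              [0.01, -0.7, 0.02]])
w_sparse = magnitude_prune(w, sparsity=0.5)
# The shape is unchanged; only the number of non-zero entries drops.
```

The pruned matrix keeps its original dimensions, so a dense matrix multiplication costs exactly as much as before unless the sparsity is exploited explicitly.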

To overcome this limitation, so-called structured pruning methods have been introduced. In this case, the pruning operates at the level of the filters, kernels or vectors [3, 12]. One of the most popular structured pruning techniques is filter pruning. Here, at each pruning stage, complete convolution filters are removed and the convolution structure remains unchanged. The filter selection methods can be based on the filter weight norm [20], the average percentage of zeros in the output [15], or the influence of each channel on the final loss [23]. Intrinsic correlation within each layer can also be exploited to remove redundant channels [28]. Pruning can also be combined with other compression techniques such as weight quantization [26], low-rank approximations of weights [6], or knowledge distillation [13, 2] to further reduce the size of the network.

In our work, we have adopted a structured pruning approach, performed at the filter level. As in [20, 22], we use the ℓ1-norm to determine the importance of filters and select those to prune. We decided to adopt an iterative pruning process, where the pruning is performed one layer at a time, and the network is retrained after each pruning phase, to allow it to recover from the loss of some parameters.

3 Methodology

In this section, we describe in detail the methodology we followed for training and pruning a model.

3.1 Pruning method

The pruning method [20] used in our work prunes the filters of a trained network that have the lowest sensitivity to pruning. The sensitivity to pruning of a layer is closely related to the ℓ1-norm of its filters. Figure 1b shows the sensitivity to pruning of each layer of a VGG16 network fine-tuned on the MNIST dataset [17]. The layers that are the most sensitive, i.e. the layers where the accuracy drops the fastest when removing filters, are also the layers containing the most filters with a high ℓ1-norm, as illustrated in Figure 1a.

Based on this observation, we propose to remove the filters with the lowest ℓ1-norm, as they produce feature maps with weaker activations compared to the other filters in that layer.

(a) Filters ranked by ascending ℓ1-norm for VGG16 trained on MNIST. (b) Sensitivity of each convolutional layer pruned individually.
Figure 1: Visualization of the sensitivity to pruning of a VGG-16 trained on MNIST.

More precisely, the procedure of pruning filters from a layer is the following:

  1. Compute the ℓ1-norm of each filter of the layer.

  2. Sort the filters by their ℓ1-norm.

  3. Prune the filters with the smallest ℓ1-norm, together with their corresponding feature maps. The removed feature maps no longer take part in subsequent computations, so their corresponding kernels in the filters of the following convolutional layer are also removed (Figure 2).

  4. Retrain the whole network until new convergence.

Figure 2: Representation of filter pruning of a convolutional layer. The pruned filter, as well as its corresponding feature map and kernels in the following layer, are removed.
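The first three steps of the procedure can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: weights are assumed to be stored as `(out_channels, in_channels, kh, kw)` arrays, the filter norm is taken to be the ℓ1-norm as in [20], and the names `prune_filters` and `n_prune` are hypothetical. Step 4 (retraining) is outside the scope of the sketch.

```python
import numpy as np

def prune_filters(w_cur, b_cur, w_next, n_prune):
    """Remove the n_prune filters of lowest l1-norm from a conv layer.

    w_cur:  (out_ch, in_ch, kh, kw) weights of the layer being pruned
    b_cur:  (out_ch,) biases of that layer
    w_next: (next_out, out_ch, kh, kw) weights of the following conv layer
    """
    # Step 1: l1-norm of each filter = sum of its absolute kernel weights.
    norms = np.abs(w_cur).sum(axis=(1, 2, 3))
    # Step 2: sort the filters by norm (ascending).
    order = np.argsort(norms)
    # Step 3: keep all but the n_prune smallest, preserving original order.
    keep = np.sort(order[n_prune:])
    # Dropping output filters also drops the matching input kernels of the
    # next layer, since the corresponding feature maps no longer exist.
    return w_cur[keep], b_cur[keep], w_next[:, keep], keep

rng = np.random.default_rng(0)
w1 = rng.normal(size=(8, 3, 3, 3))
w1[5] *= 1e-3                      # make filter 5 clearly the weakest
b1 = np.zeros(8)
w2 = rng.normal(size=(16, 8, 3, 3))

w1_p, b1_p, w2_p, kept = prune_filters(w1, b1, w2, n_prune=1)
```

Unlike unstructured pruning, the result is a pair of smaller dense tensors, so no sparse-matrix support is needed to benefit from the reduction.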

3.2 Datasets

We have carried out our experiments on four different datasets, which vary in number of classes, image resolution and content. As target datasets, we use CIFAR-10 and CIFAR-100 [16], consisting of RGB images of 32×32 pixels, labelled over 10 and 100 classes respectively. We also retained SVHN [25], a real-world image dataset for recognizing digits and numbers in natural scene images of size 32×32. Finally, we have considered the Caltech101 corpus [19], consisting of pictures of objects belonging to 101 categories, of size approximately 300×200 pixels.

3.3 Network architectures

In order to validate our results, we have adopted two well-known network architectures. The first one is VGG16 [27]. Here, we have replaced the original fully-connected layers by a Global Average Pooling layer and 2 narrow fully-connected layers. In this way, most parameters are contained in the convolutional layers. The network thus consists of 13 convolutional layers and 2 fully-connected layers.

The second network retained is MobileNetV1 [14], specifically designed to achieve efficiency both in parameter number and in computation complexity. MobileNet uses a factorized form of convolutions called Depthwise Separable Convolutions. The MobileNet architecture used in our experiments thus consists of one standard convolution layer acting on the input image, 13 depthwise separable convolutions, and finally a global average pooling and 2 fully connected layers.

For both architectures, Network-A refers in the experiments to the network pre-trained on ImageNet and fine-tuned on the target dataset, while Network-B refers to the same network trained on the target dataset from scratch.

4 Experimental results

In this section, we compare the pruning efficiency of both fine-tuned and trained from scratch networks. Our models are first trained until convergence, using the Adam optimizer, with an initial learning rate of and a step decay scheduling, reducing the learning rate by a factor every epochs. They are then tested on the validation set to get the baseline accuracy.

After each pruning phase, a retraining is performed during epochs, with the lowest learning rate reached during baseline training. We also monitor the accuracy on the validation set, and proceed to another pruning phase if it has not dropped by more than from the baseline accuracy.
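The prune, retrain and check cycle described above can be sketched as a small control loop. This is only an illustration: the callbacks `prune_step`, `retrain` and `evaluate` are hypothetical placeholders, and the `max_drop` value in the toy run is arbitrary, not the tolerance used in the experiments. The sketch simply stops once the accuracy has dropped too far.

```python
from typing import Callable, List

def iterative_pruning(
    prune_step: Callable[[], bool],
    retrain: Callable[[], None],
    evaluate: Callable[[], float],
    baseline_acc: float,
    max_drop: float,
) -> List[float]:
    """Alternate pruning and retraining while accuracy stays within max_drop.

    prune_step returns False when nothing is left to prune.
    Returns the validation accuracy recorded after each retraining phase.
    """
    history = []
    while prune_step():
        retrain()
        acc = evaluate()
        history.append(acc)
        if baseline_acc - acc > max_drop:
            break  # accuracy dropped too far: stop pruning
    return history

# Toy run with simulated accuracies (no real model involved).
simulated = iter([0.89, 0.88, 0.87, 0.86])
state = {"phases": 0}

def toy_prune() -> bool:
    state["phases"] += 1
    return state["phases"] <= 4

history = iterative_pruning(
    toy_prune, lambda: None, lambda: next(simulated),
    baseline_acc=0.90, max_drop=0.025,
)
```

In this toy run the loop ends after the third phase, the first one whose accuracy falls more than `max_drop` below the baseline.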

4.1 VGG16

To select the layer to prune and decide how many filters to remove, we conduct a sensitivity analysis on the network. As shown in Figure 3a, most of the low-norm filters are contained in the later layers. This suggests that those layers will be less sensitive to pruning than the others.

On the other hand, Figure 3b shows that Network-B has a more even distribution of filter norms, which suggests that it is more sensitive to pruning than Network-A, specifically in the later layers.

(a) Filters ranked by ascending ℓ1-norm for Network-A (b) Filters ranked by ascending ℓ1-norm for Network-B
Figure 3: Visualization of the importance of filters of VGG-16 trained on CIFAR-10. Filters are ranked by ℓ1-norm.

This observation is confirmed by the results of pruning, summarized in Table 1, which shows that indeed, most of the pruning of Network-A can be performed in the later layers while pruning of Network-B is more evenly distributed.

| Layer type | #Params | Network-A | Network-B |
|---|---|---|---|
| Conv 1 | 1,792 | 1,792 | 1,792 |
| Conv 2 | 36,928 | 36,928 | 36,928 |
| Conv 3 | 73,856 | 73,856 | 73,856 |
| Conv 4 | 147,584 | 147,584 | 129,136 |
| Conv 5 | 295,168 | 295,168 | 161,440 |
| Conv 6 | 590,080 | 590,080 | 230,560 |
| Conv 7 | 590,080 | 590,080 | 230,560 |
| Conv 8 | 1,180,160 | 442,560 | 553,344 |
| Conv 9 | 2,359,808 | 331,968 | 1,327,488 |
| Conv 10 | 2,359,808 | 331,968 | 1,327,488 |
| Conv 11 | 2,359,808 | 221,312 | 1,327,488 |
| Conv 12 | 2,359,808 | 147,584 | 1,327,488 |
| Conv 13 | 2,359,808 | 147,584 | 1,327,488 |
| Linear | 262,656 | 66,048 | 197,120 |
| Linear | 5,130 | 5,130 | 5,130 |
| Total | 14.98M | 3.43M | 8.26M |

Table 1: Parameters remaining for each layer after the pruning of VGG16, trained on CIFAR10.
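The #Params column of Table 1 can be reproduced with a short calculation, assuming 3×3 convolution kernels with biases and the usual VGG16 channel widths. The fully-connected sizes (512 to 512, then 512 to 10) are an assumption inferred from the table values rather than stated in the text.

```python
# (in_channels, out_channels) of the 13 conv layers of VGG16 (3x3 kernels).
VGG16_CONVS = [
    (3, 64), (64, 64),
    (64, 128), (128, 128),
    (128, 256), (256, 256), (256, 256),
    (256, 512), (512, 512), (512, 512),
    (512, 512), (512, 512), (512, 512),
]
# Assumed (in_features, out_features) of the two narrow linear layers.
LINEARS = [(512, 512), (512, 10)]

def conv_params(c_in: int, c_out: int, k: int = 3) -> int:
    """Weights plus one bias per output filter."""
    return k * k * c_in * c_out + c_out

def linear_params(n_in: int, n_out: int) -> int:
    """Weights plus one bias per output unit."""
    return n_in * n_out + n_out

conv_counts = [conv_params(c_in, c_out) for c_in, c_out in VGG16_CONVS]
linear_counts = [linear_params(n_in, n_out) for n_in, n_out in LINEARS]
total = sum(conv_counts) + sum(linear_counts)

print(conv_counts)
print(linear_counts)
print(f"total: {total / 1e6:.2f}M")
```

Under these assumptions the per-layer counts match the baseline column, and the five 512-to-512 convolutions alone account for most of the total.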

Table 2 summarizes the results on all the tested datasets, and shows the resulting number of parameters, their corresponding storage size, as well as the number of floating-point operations (FLOPs) needed for an input image to traverse the whole network at test time. We can observe that, for all the tested datasets, pruning is more effective in terms of parameters removed for Network-A than for Network-B.

A second observation is that a lower number of parameters does not necessarily lead to a reduction in FLOPs. This phenomenon can be explained by the fact that, while most of the parameters are contained in the later layers, most of the operations are performed in the first ones, where the resolutions of the activation maps are higher. For Network-A, most of the pruning is performed in the later layers. In contrast, for Network-B, the pruning is more distributed throughout the network. Thus, Network-A has the fewest parameters but Network-B often has the fewest FLOPs.
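The mismatch between where the parameters sit and where the operations are performed can be checked with a rough count. The sketch below assumes a 32×32 input, 3×3 kernels with "same" padding, and a 2×2 max-pooling after conv layers 2, 4, 7, 10 and 13 (the standard VGG16 layout). It counts multiply-accumulates for the convolutions only and ignores biases, so it is an approximation and not the FLOP counter used for Table 2.

```python
# (in_channels, out_channels, feature-map side) for the 13 conv layers,
# under the assumptions stated above.
LAYERS = [
    (3, 64, 32), (64, 64, 32),
    (64, 128, 16), (128, 128, 16),
    (128, 256, 8), (256, 256, 8), (256, 256, 8),
    (256, 512, 4), (512, 512, 4), (512, 512, 4),
    (512, 512, 2), (512, 512, 2), (512, 512, 2),
]

# Weights per layer, and multiply-accumulates per layer (one kernel
# application per output position and output channel).
weights = [9 * c_in * c_out for c_in, c_out, _ in LAYERS]
macs = [9 * c_in * c_out * side * side for c_in, c_out, side in LAYERS]

late = slice(7, 13)  # conv layers 8 to 13
param_share_late = sum(weights[late]) / sum(weights)
mac_share_late = sum(macs[late]) / sum(macs)

print(f"conv 8-13 hold {param_share_late:.0%} of conv weights")
print(f"conv 8-13 perform {mac_share_late:.0%} of conv multiply-accumulates")
```

Under these assumptions, conv layers 8 to 13 hold roughly 88% of the convolutional weights but perform only about 39% of the multiply-accumulates, which is consistent with pruning concentrated in the later layers removing many parameters but comparatively few operations.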

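| Dataset | Network | Params (M) | FLOPs (M) | Size (MB) |
|---|---|---|---|---|
| CIFAR10 | Baseline | 14.98 | 627.48 | 57.22 |
| CIFAR10 | A-pruned | 3.43 | 421.20 | 13.15 |
| CIFAR10 | B-pruned | 8.26 | 397.85 | 31.57 |
| CIFAR100 | Baseline | 15.03 | 627.57 | 57.39 |
| CIFAR100 | A-pruned | 8.26 | 503.47 | 31.57 |
| CIFAR100 | B-pruned | 8.69 | 444.56 | 33.24 |
| SVHN | Baseline | 14.98 | 627.48 | 57.22 |
| SVHN | A-pruned | 3.15 | 414.12 | 12.10 |
| SVHN | B-pruned | 4.08 | 311.04 | 15.61 |
| Caltech101 | Baseline | 15.03 | 30,720.99 | 57.40 |
| Caltech101 | A-pruned | 8.44 | 25,402.11 | 32.28 |
| Caltech101 | B-pruned | 13.59 | 30,171.17 | 51.93 |

Table 2: Results of the pruning on VGG16 for the four studied datasets.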
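To make the comparison in Table 2 easier to read, the relative reductions can be computed directly from the reported values. The snippet below only re-expresses the table entries as percentages; the names `TABLE_2` and `reduction_pct` are hypothetical.

```python
# (params in M, FLOPs in M), copied from Table 2 (VGG16).
TABLE_2 = {
    "CIFAR10":    {"Baseline": (14.98, 627.48),
                   "A-pruned": (3.43, 421.20), "B-pruned": (8.26, 397.85)},
    "CIFAR100":   {"Baseline": (15.03, 627.57),
                   "A-pruned": (8.26, 503.47), "B-pruned": (8.69, 444.56)},
    "SVHN":       {"Baseline": (14.98, 627.48),
                   "A-pruned": (3.15, 414.12), "B-pruned": (4.08, 311.04)},
    "Caltech101": {"Baseline": (15.03, 30720.99),
                   "A-pruned": (8.44, 25402.11), "B-pruned": (13.59, 30171.17)},
}

def reduction_pct(baseline: float, pruned: float) -> float:
    """Percentage removed relative to the baseline."""
    return 100.0 * (1.0 - pruned / baseline)

reductions = {}
for dataset, rows in TABLE_2.items():
    base_params, base_flops = rows["Baseline"]
    reductions[dataset] = {
        name: (reduction_pct(base_params, p), reduction_pct(base_flops, f))
        for name, (p, f) in rows.items() if name != "Baseline"
    }

for dataset, rows in reductions.items():
    for name, (dp, df) in rows.items():
        print(f"{dataset:10s} {name}: params -{dp:.1f}%, FLOPs -{df:.1f}%")
```

On CIFAR10, for instance, this gives a parameter reduction of about 77% for Network-A against about 45% for Network-B, while the FLOP reduction is slightly larger for Network-B (about 37% against about 33%).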
(a) Filters ranked by ascending ℓ1-norm for Network-A (b) Filters ranked by ascending ℓ1-norm for Network-B
Figure 4: Visualization of the 64 filters in the first convolutional layer of VGG-16 trained on CIFAR-10. Network-A filters have more structure than those of Network-B.

A closer look at the filters of the first convolutional layer (Figure 4) reveals the difference between the learned filters. This difference arises because the pre-training of Network-A helped it find useful filters on the ImageNet database. As Network-B only had access to fewer and smaller images, it could not learn the same kind of filters by itself and had to distribute the feature extraction across the network, which is why its later layers are more sensitive to pruning.

4.2 MobileNet

The Depthwise Separable Convolutions used in MobileNet are composed of two operations. The first is a Depthwise Convolution, which filters each input map independently. The second, called a Pointwise Convolution, combines the results of the previous operation. The Pointwise Convolution is similar to a regular convolution but uses filters of dimensions 1×1. As most of the parameters are contained in the second operation, we decided to apply pruning only to the Pointwise Convolutions.
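The claim that the pointwise operation holds most of the parameters can be checked with a small count. The sketch assumes 3×3 depthwise kernels and ignores biases; the function names are hypothetical, and the 512-to-512 example corresponds to the channel widths of the middle MobileNet blocks.

```python
def standard_conv_weights(c_in: int, c_out: int, k: int = 3) -> int:
    """Regular k x k convolution: every output filter sees every input map."""
    return k * k * c_in * c_out

def depthwise_weights(c_in: int, k: int = 3) -> int:
    """Depthwise step: one k x k kernel per input map."""
    return k * k * c_in

def pointwise_weights(c_in: int, c_out: int) -> int:
    """Pointwise step: a 1 x 1 convolution mixing the input maps."""
    return c_in * c_out

c_in, c_out = 512, 512
dw = depthwise_weights(c_in)
pw = pointwise_weights(c_in, c_out)
std = standard_conv_weights(c_in, c_out)

print(f"standard 3x3 conv:   {std:,}")
print(f"depthwise separable: {dw + pw:,} (depthwise {dw:,} + pointwise {pw:,})")
print(f"pointwise share:     {pw / (dw + pw):.1%}")
```

For this block the pointwise step accounts for more than 98% of the weights (262,144 out of 266,752), which is why restricting the pruning to Pointwise Convolutions still covers almost all of the convolutional parameters.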

(a) Filters ranked by ascending ℓ1-norm for Network-A (b) Filters ranked by ascending ℓ1-norm for Network-B
Figure 5: Visualization of the importance of filters of MobileNet trained on CIFAR-10. Filters are ranked by ℓ1-norm.

We follow the same process as for VGG16 and perform a sensitivity analysis on MobileNet before deciding which layer to prune and how many filters to remove. The difference between the sensitivity of Network-A (Figure 5a) and Network-B (Figure 5b) is less clear than in the case of VGG16, but some useful information can still be extracted. The highest-norm filters of Network-B are still in the later layers, while this is not necessarily the case for Network-A. This again suggests that Network-A can be pruned further in the later layers, which is confirmed in Table 3.

| Layer type | #Params | Network-A | Network-B |
|---|---|---|---|
| Conv 1 | 864 | 567 | 864 |
| Conv 2 | 2,048 | 1,344 | 2,048 |
| Conv 3 | 8,192 | 8,192 | 8,192 |
| Conv 4 | 16,384 | 16,384 | 16,384 |
| Conv 5 | 32,768 | 32,768 | 28,672 |
| Conv 6 | 65,536 | 65,536 | 50,176 |
| Conv 7 | 131,072 | 98,304 | 64,512 |
| Conv 8 | 262,144 | 147,456 | 82,944 |
| Conv 9 | 262,144 | 147,456 | 82,944 |
| Conv 10 | 262,144 | 147,456 | 82,944 |
| Conv 11 | 262,144 | 98,304 | 82,944 |
| Conv 12 | 262,144 | 65,536 | 78,336 |
| Conv 13 | 524,288 | 65,536 | 82,688 |
| Conv 14 | 1,048,576 | 32,768 | 97,280 |
| Linear | 524,800 | 66,048 | 164,352 |
| Linear | 5,130 | 5,130 | 5,130 |
| Total | 3.76M | 1.05M | 0.98M |

Table 3: Parameters remaining for each layer after the pruning of MobileNet, trained on CIFAR10. For clarity, only the first convolution and pointwise convolutions are represented.

The results of the pruning of MobileNet on the different datasets are summarized in Table 4. As was the case for VGG16, Network-A can generally be pruned further than Network-B, although the difference is smaller here; on CIFAR10, Network-B even ends up with slightly fewer parameters (0.98M versus 1.05M).

(a) Filters ranked by ascending ℓ1-norm for Network-A (b) Filters ranked by ascending ℓ1-norm for Network-B
Figure 6: Visualization of the 32 filters in the first convolutional layer of MobileNet trained on CIFAR-10. Filters of Network-A exhibit more structure than the filters of Network-B.
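
| Dataset | Network | Params (M) | FLOPs (M) | Size (MB) |
|---|---|---|---|---|
| CIFAR10 | Baseline | 3.76 | 24.23 | 14.58 |
| CIFAR10 | A-pruned | 1.05 | 13.83 | 4.26 |
| CIFAR10 | B-pruned | 0.98 | 12.26 | 3.99 |
| CIFAR100 | Baseline | 3.80 | 24.32 | 14.76 |
| CIFAR100 | A-pruned | 2.54 | 21.42 | 9.94 |
| CIFAR100 | B-pruned | 3.12 | 21.89 | 12.14 |
| SVHN | Baseline | 3.76 | 24.23 | 14.58 |
| SVHN | A-pruned | 1.50 | 17.66 | 5.99 |
| SVHN | B-pruned | 1.75 | 16.57 | 6.94 |
| Caltech101 | Baseline | 3.81 | 1136.59 | 14.76 |
| Caltech101 | A-pruned | 3.08 | 1078.30 | 11.99 |
| Caltech101 | B-pruned | 3.43 | 1105.83 | 13.32 |

Table 4: Results of the pruning on MobileNet for the four studied datasets.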

Again, looking at the filters of the first convolutional layer reveals the difference between the filters learned by Network-A (Figure 6a) and Network-B (Figure 6b). The filters of Network-A show some structure, whereas those of Network-B do not appear to. Moreover, as suggested by Figure 5a, Network-A has some very low-norm filters that do not seem to extract useful information and can thus be removed.

5 Conclusion and Perspectives

In this paper, we have investigated the impact of the training process on the number of relevant filters involved in a CNN. In particular, we have compared the sensitivity to filter pruning of a fine-tuned network with that of a network trained from scratch.

Experiments have been conducted on four different datasets (CIFAR10, CIFAR100, SVHN, Caltech101) and for two different network architectures (VGG16, MobileNet). Results have shown that a CNN that has been pre-trained and then fine-tuned on a target dataset is less sensitive to pruning and can thus be pruned further than the same network trained from scratch on the target dataset.

This also suggests that the training methodology has a strong impact on the part of the network where features are extracted. Pre-training helps the network to be discriminant in the early layers and thus allows for more pruning in the later layers, where most of the parameters are contained. On the other hand, training a network from randomly initialized weights makes its layers more evenly discriminant, so the pruning has to be more distributed across the layers, resulting in fewer parameters removed but often a bigger reduction in computation.

Further research can be conducted on using this combination of sensitivity analysis and pruning to improve existing CNN architectures or to find new ones that are more efficient. It can also lead to new training strategies targeting either a low number of parameters after pruning, as is the case with fine-tuning, or fewer computations, as is the case for a network trained from scratch.


References

  • [1] S. Anwar, K. Hwang, and W. Sung (2015) Structured pruning of deep convolutional neural networks. CoRR abs/1512.08571.
  • [2] J. Ba and R. Caruana (2014) Do deep nets really need to be deep? In Advances in Neural Information Processing Systems, pp. 2654–2662.
  • [3] J. Cheng, P. Wang, G. Li, Q. Hu, and H. Lu (2018) Recent advances in efficient computation of deep convolutional neural networks. CoRR abs/1802.00939.
  • [4] F. Chollet (2016) Xception: deep learning with depthwise separable convolutions. CoRR abs/1610.02357.
  • [5] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255.
  • [6] E. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus (2014) Exploiting linear structure within convolutional networks for efficient evaluation. CoRR abs/1404.0736.
  • [7] J. Frankle and M. Carbin (2018) The lottery ticket hypothesis: training pruned neural networks. CoRR abs/1803.03635.
  • [8] J. Frankle, G. K. Dziugaite, D. M. Roy, and M. Carbin (2019) The lottery ticket hypothesis at scale. CoRR abs/1903.01611.
  • [9] S. Han, J. Pool, J. Tran, and W. J. Dally (2015) Learning both weights and connections for efficient neural networks. CoRR abs/1506.02626.
  • [10] B. Hassibi, D. G. Stork, G. Wolff, and T. Watanabe (1994) Optimal brain surgeon: extensions and performance comparisons. In Advances in Neural Information Processing Systems 6, pp. 263–270.
  • [11] K. He and J. Sun (2014) Convolutional neural networks at constrained time cost. CoRR abs/1412.1710.
  • [12] Y. He, X. Zhang, and J. Sun (2017) Channel pruning for accelerating very deep neural networks. CoRR abs/1707.06168.
  • [13] G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. In NIPS Deep Learning and Representation Learning Workshop.
  • [14] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam (2017) MobileNets: efficient convolutional neural networks for mobile vision applications. CoRR abs/1704.04861.
  • [15] H. Hu, R. Peng, Y. Tai, and C. Tang (2016) Network trimming: a data-driven neuron pruning approach towards efficient deep architectures. CoRR abs/1607.03250.
  • [16] A. Krizhevsky (2009) Learning multiple layers of features from tiny images.
  • [17] Y. LeCun and C. Cortes (2010) MNIST handwritten digit database.
  • [18] Y. LeCun, J. S. Denker, and S. A. Solla (1990) Optimal brain damage. In Advances in Neural Information Processing Systems 2, D. S. Touretzky (Ed.), pp. 598–605.
  • [19] F. Li, R. Fergus, and P. Perona (2004) Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories. In 2004 Conference on Computer Vision and Pattern Recognition Workshop, p. 178.
  • [20] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf (2016) Pruning filters for efficient convnets. CoRR abs/1608.08710.
  • [21] M. Lin, Q. Chen, and S. Yan (2014) Network in network. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings.
  • [22] Z. Liu, M. Sun, T. Zhou, G. Huang, and T. Darrell (2018) Rethinking the value of network pruning. CoRR abs/1810.05270.
  • [23] P. Molchanov, S. Tyree, T. Karras, T. Aila, and J. Kautz (2016) Pruning convolutional neural networks for resource efficient transfer learning. CoRR abs/1611.06440.
  • [24] A. S. Morcos, H. Yu, M. Paganini, and Y. Tian (2019) One ticket to win them all: generalizing lottery ticket initializations across datasets and optimizers. ArXiv abs/1906.02773.
  • [25] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng (2011) Reading digits in natural images with unsupervised feature learning. In Workshop on Deep Learning and Unsupervised Feature Learning, NeurIPS.
  • [26] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi (2016) XNOR-net: ImageNet classification using binary convolutional neural networks. CoRR abs/1603.05279.
  • [27] K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.
  • [28] X. Suau, L. Zappella, V. Palakkode, and N. Apostoloff (2018) Principal filter analysis for guided network compression. CoRR abs/1807.10585.
  • [29] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich (2015) Going deeper with convolutions. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1–9. ISSN 1063-6919.
  • [30] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna (2016) Rethinking the inception architecture for computer vision. In Conference on Computer Vision and Pattern Recognition, CVPR.
  • [31] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson (2014) How transferable are features in deep neural networks? CoRR abs/1411.1792.