Impact of Fully Connected Layers on Performance of Convolutional Neural Networks for Image Classification

01/21/2019 · by S. H. Shabbeer Basha, et al.

Convolutional Neural Networks (CNNs) have largely removed the need for handcrafted features in domains like computer vision, due to their ability to learn problem-specific features from the raw input data. However, selecting a dataset-specific CNN architecture, which is mostly done by experience or expertise, is a time-consuming and error-prone process. To automate the process of learning a CNN architecture, this letter attempts to find the relationship between the Fully Connected (FC) layers and some characteristics of the datasets. CNN architectures, and recently datasets as well, are categorized as deep, shallow, wide, etc. This letter tries to formalize these terms and to answer the following questions: (i) What is the impact of deeper/shallower architectures on the performance of the CNN w.r.t. FC layers? (ii) How do deeper/wider datasets influence the performance of the CNN w.r.t. FC layers? (iii) Which kind of architecture (deeper/shallower) is better suited to which kind of dataset (deeper/wider)? To address these questions, we have performed experiments with three CNN architectures of different depths. The experiments are conducted by varying the number of FC layers. We use four widely used datasets, namely CIFAR-10, CIFAR-100, Tiny ImageNet, and CRCHistoPhenotypes, to justify our findings in the context of the image classification problem. The source code of this research is available at https://github.com/shabbeersh/Impact-of-FC-layers.

1 Introduction and Related Works

The popularity of Convolutional Neural Networks (CNNs) has grown significantly in recent years across various application domains related to computer vision, including object detection lecun2015deep , segmentation he2017mask , localization hariharan2017object , and many more. However, only a limited amount of work has been carried out to address questions related to CNNs in the context of finding a suitable architecture, using Reinforcement Learning (RL) zoph2016neural ; pham2018efficient or Bayesian Optimization wistuba2017bayesian . In this paper, we investigate some of the factors which affect the performance of a CNN w.r.t. its fully connected (FC) layers in the context of the image classification problem. We also study the possible interrelationship between the presence of FC layers in the CNN, the depth of the CNN, and the depth of the dataset.

Deep neural networks generally provide better results in the fields of machine learning and computer vision compared to handcrafted feature descriptors lecun2015deep . Krizhevsky et al. krizhevsky2012imagenet proposed a CNN called AlexNet consisting of 5 convolutional (Conv) layers and 3 FC layers. The FC layers are placed after all the Conv layers. Zeiler and Fergus zeiler2014visualizing made minimal changes to AlexNet with better hyper-parameter settings in order to generalize it over other datasets. This model, called ZFNet, also has 3 FC layers alongside its convolutional layers. In 2014, Simonyan et al. simonyan2014very further extended the AlexNet model to VGG-16 with 16 learnable layers, including 3 FC layers towards the end of the architecture. Later on, many CNN models were introduced with an increasing number of learnable layers. Szegedy et al. szegedy2015going proposed a 22-layer architecture called GoogLeNet, which has a single FC (output) layer. He et al. he2016deep introduced ResNet with 152 trainable layers, where the last layer is fully connected. However, all of the above CNN architectures were proposed for the large-scale ImageNet dataset deng2009imagenet . Recently, Basha et al. basha2018rccnet proposed a CNN-based classifier called RCCNet, which is responsible for classifying routine colon cancer cells; this compact CNN model also places FC layers among its learnable layers.

In a shallow CNN model, the features generated by the final convolutional layer correspond only to a portion of the input image, as its receptive field does not cover the entire spatial dimension of the image. Thus, a few FC layers are mandatory in such a scenario. Despite their pervasiveness, hyper-parameters such as the number of FC layers and the number of neurons in the FC layers required for a given CNN architecture are not well explored.
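The receptive-field argument above can be made concrete with a small calculation. The sketch below uses the standard receptive-field recurrence; the layer configuration is a hypothetical shallow stack, not one of the architectures from this letter:

```python
def receptive_field(layers):
    """Receptive field (in input pixels) of the last layer in a stack.

    Each layer is a (kernel_size, stride) pair. Uses the recurrence
    r <- r + (k - 1) * jump, where jump is the product of the strides
    of all preceding layers.
    """
    r, jump = 1, 1
    for k, s in layers:
        r += (k - 1) * jump
        jump *= s
    return r

# Hypothetical shallow stack: two 3x3 convs, a 2x2/stride-2 pool, one more 3x3 conv.
shallow = [(3, 1), (3, 1), (2, 2), (3, 1)]
print(receptive_field(shallow))  # 10 -- far smaller than a 32x32 input
```

Because each output feature here sees only a 10x10 patch of the input, at least one FC layer is needed to mix information across the whole image.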

Figure 1: The illustration of the effect of deeper/wider datasets and the depth of the CNN (i.e., the number of convolutional layers) on the number of FC layers. A typical plain CNN architecture has Convolutional (learnable), Max-pooling (non-learnable), and FC (learnable) layers.

In a typical deep neural network, the FC layers comprise most of the parameters of the network. AlexNet has roughly 60 million parameters, out of which about 58 million correspond to the FC layers krizhevsky2012imagenet . Similarly, VGGNet has a total of about 138 million parameters, out of which roughly 123 million are from the FC layers simonyan2014very . This huge number of trainable parameters in the FC layers is required to fit complex nonlinear discriminant functions in the feature space into which the input data elements are mapped. However, this large number of parameters may result in over-fitting the classifier. To reduce the amount of over-fitting, Xu et al. xu2018overfitting proposed a CNN architecture called SparseConnect, where the connections between FC layers are sparsified.
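As a sanity check on these counts, the FC-layer parameter total follows directly from the layer widths (weights plus biases per layer). The head sizes below are the standard AlexNet and VGG-16 ones, not figures taken from this letter:

```python
def fc_params(sizes):
    """Trainable parameters (weights + biases) in a chain of FC layers.

    sizes[0] is the flattened input dimension; each later entry is the
    width of one FC layer.
    """
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))

# AlexNet head: 6*6*256 = 9216 flattened features -> 4096 -> 4096 -> 1000 classes.
print(fc_params([9216, 4096, 4096, 1000]))   # 58,631,144 (~58.6M)
# VGG-16 head: 7*7*512 = 25088 -> 4096 -> 4096 -> 1000.
print(fc_params([25088, 4096, 4096, 1000]))  # 123,642,856 (~123.6M)
```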

The effect of deep or shallow networks on different kinds of datasets is well explored in the literature, with the aim of studying the behavioural interrelationship between the depth of the dataset and the CNN mhaskar2016deep ; mhaskar2016learning . Mhaskar et al. mhaskar2016deep extended the framework of their previous work mhaskar2016learning to investigate when deep networks are better than shallow networks using directed acyclic graphs (DAGs). Montufar et al. montufar2014number performed a study to “find the complexity of the functions computable by deep neural networks with linear activations”.

To the best of our knowledge, no effort has been made in the literature to analyze the impact of FC layers in CNNs for image classification. In this paper, we investigate the impact of FC layers on the performance of the CNN model with a rigorous analysis from various aspects. In brief, the contributions of this paper are summarized as follows:

  • We perform a systematic study to observe the effect of deeper/shallower architectures on the performance of CNNs with a varying number of FC layers, in the context of image classification.

  • We observe the effect of deeper/wider datasets on the performance of CNNs with a varying number of FC layers.

  • We generalize one important finding of Bansal et al. bansal2017s for choosing a deeper or shallower architecture based on the depth of the dataset. In bansal2017s , they reported this in the context of face recognition, whereas we have made a rigorous study to generalize the observation over different kinds of datasets.

  • To empirically justify our findings, we have conducted experiments on different types (i.e., natural and bio-medical images) of image datasets: CIFAR-10 and CIFAR-100 krizhevsky2009learning , Tiny ImageNet tinyimagenet , and CRCHistoPhenotypes sirinukunwattana2016locality .

Next, we describe the deep and shallow CNN architectures developed to conduct the experiments in Section 2. The experimental setup, including training details, datasets, etc., is discussed in Section 3. Section 4 presents a detailed study of the observations made in this paper. Finally, Section 5 concludes the paper.

Input: Image dimension
[layer 1] Conv., S=, P=; ReLU; BN;
[layer 2] Conv., S=, P=; ReLU; BN;
[layer 3] Pool., S=, P=;
[layer 4] Conv., S=, P=; ReLU;
[layer 5] Conv., S=, P=; ReLU;
[layer 6] Conv., S=, P=; ReLU;
[layer 7] Flatten; 43264;
Output: (FC layer) Predicted Class Scores
Table 1: The CNN-1 architecture with 5 convolutional layers. S, P, and BN denote stride, padding, and batch normalization, respectively. The output (FC) layer has 10, 100, 200, and 4 nodes in the case of the CIFAR-10, CIFAR-100, Tiny ImageNet, and CRCHistoPhenotypes datasets, respectively.

2 Developed CNN Architectures

The main objective of this paper is to analyze the impact on performance of the number of FC layers and the number of neurons present in the FC layers of a CNN. The interdependency between the characteristics of the datasets and of the networks is explored w.r.t. the FC layers, as shown in Fig. 1. In order to conduct a rigorous experimental study, we have implemented three CNN models with varying depth in terms of the number of convolutional (Conv) layers before the fully connected (FC) layers. These models are termed CNN-1, CNN-2, and CNN-3, having 5, 10, and 13 convolutional layers, respectively.

Deep and Shallow CNNs: As per the published literature ba2014deep ; montufar2014number , a neural network is referred to as shallow if it has a single fully connected (hidden) layer, whereas a deep neural network consists of convolutional layers, pooling layers, and FC layers. However, in this paper, we consider a CNN model A as deep/shallow compared to another CNN model B if the number of trainable layers in A is more/less than in B, respectively.

2.1 CNN-1 Architecture

AlexNet krizhevsky2012imagenet is a well-known CNN architecture, which won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012 russakovsky2015imagenet with a huge performance gain compared to the best results of that time obtained using handcrafted features. The original AlexNet architecture was proposed for high-resolution ImageNet images; we made minimal changes to the model to fit low-resolution images, and we name the resulting model CNN-1. The input images are up-sampled in the case of the CRCHistoPhenotypes sirinukunwattana2016locality , CIFAR-10, and CIFAR-100 krizhevsky2009learning datasets, whereas they are down-sampled in the case of the Tiny ImageNet tinyimagenet dataset. The first convolutional layer produces a feature map by applying its filters to the input; it is followed by a second convolutional layer, which produces a further feature map by convolving its filters over the output of the first. The remaining layers of the CNN-1 model are similar to the AlexNet architecture proposed in krizhevsky2012imagenet . The CNN-1 model with a single FC layer (i.e., only the output FC layer) has a different number of trainable parameters for each dataset, because the different number of classes in each dataset leads to a varying number of trainable parameters in the output FC layer. The detailed specifications of the CNN-1 model are given in Table 1.

Input: Image dimension
[layer 1] Conv., S=, P=; ReLU; BN; DP
[layer 2] Conv., S=, P=; ReLU; BN;
[layer 3] Pool., S=, P=;
[layer 4] Conv., S=, P=; ReLU; BN; DP
[layer 5] Conv., S=, P=; ReLU; BN;
[layer 6] Pool., S=, P=;
[layer 7] Conv., S=, P=; ReLU; BN; DP
[layer 8] Conv., S=, P=; ReLU; BN;
[layer 9] Pool., S=, P=;
[layer 10] Conv., S=, P=; ReLU; BN; DP
[layer 11] Conv., S=, P=; ReLU; BN;
[layer 12] Pool., S=, P=;
[layer 13] Conv., S=, P=; ReLU; BN; DP
[layer 14] Conv., S=, P=; ReLU; BN;
[layer 15] Pool., S=, P=;
[layer 16] Flatten; 512;
Output: (FC layer) Predicted Class Scores
Table 2: The CNN-2 model with 10 convolutional layers. S, P, BN, and DP denote the stride, padding, batch normalization, and dropout, respectively. The output layer has 10, 100, 200, and 4 nodes in the case of the CIFAR-10, CIFAR-100, Tiny ImageNet, and CRCHistoPhenotypes datasets, respectively.
Figure 2: (a) A few sample images from the CIFAR-10/100 dataset krizhevsky2009learning . (b) Random sample images from the Tiny ImageNet dataset tinyimagenet . (c) Example images from the CRCHistoPhenotypes dataset sirinukunwattana2016locality , where each row represents images from one category.

2.2 CNN-2 Architecture

Another CNN model is designed based on the CIFAR-VGG liu2015very model by removing some of its layers. We name this model CNN-2. There are six blocks in this model; each of the first five blocks has two consecutive convolutional layers followed by a max-pooling layer. The sixth block is the FC (output) layer, which generates the class scores. The input to this model is an image of dimension 32×32×3. To meet this requirement, images of the Tiny ImageNet dataset are down-sampled from 64×64 to 32×32. The number of learnable parameters of the CNN-2 architecture differs across the CIFAR-10, CIFAR-100, Tiny ImageNet, and CRCHistoPhenotypes datasets due to their different numbers of classes. The CNN-2 model specifications are given in Table 2.
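The 512-dimensional flatten in Table 2 is consistent with this block structure: a 32×32 input halved by five 2×2/stride-2 poolings leaves a 1×1 spatial map. The channel count of the last block (512) is inferred from the flatten size rather than stated explicitly in the text:

```python
def flatten_size(spatial, n_pools, channels):
    """Flattened feature size after repeated 2x2/stride-2 max-pooling."""
    for _ in range(n_pools):
        spatial //= 2  # each pooling halves the spatial resolution
    return spatial * spatial * channels

# CNN-2: 32x32 input, five conv-conv-pool blocks, 512 channels in the last block.
print(flatten_size(32, 5, 512))  # 512, matching "Flatten; 512" in Table 2
```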

2.3 CNN-3 Architecture

Most of the popular CNN models, such as AlexNet krizhevsky2012imagenet , VGG-16 simonyan2014very , GoogLeNet szegedy2015going , and many more, were proposed for the high-dimensional image dataset ImageNet deng2009imagenet . On the other hand, low-dimensional image datasets like CIFAR-10/100 have rarely benefited from such CNNs. Liu et al. liu2015very introduced the CIFAR-VGG architecture, a deep CNN with 13 convolutional layers proposed for the low-resolution CIFAR-10 dataset. We utilize this CIFAR-VGG model as the third deep neural network to observe the impact of FC layers in CNNs, and we name it CNN-3 in this paper. The input to this model is an image of dimension 32×32×3. To meet this requirement, images of the Tiny ImageNet dataset are down-sampled from 64×64 to 32×32. The CNN-3 architecture with a single FC (output) layer has a different number of trainable parameters for each of the CIFAR-10 krizhevsky2009learning , CIFAR-100 krizhevsky2009learning , Tiny ImageNet tinyimagenet , and CRCHistoPhenotypes sirinukunwattana2016locality datasets, again due to the differing numbers of classes.

3 Experimental Setup

This section describes the experimental setup including the training details, datasets used for the experiments, and the evaluation criteria to judge the performance of the CNN models.

3.1 Training details

The classification experiments are conducted on different modalities of image datasets to provide empirical justification for our findings. The learning rate starts from a small initial value and is decreased by a fixed factor after every few epochs. The Rectified Linear Unit (ReLU) non-linearity krizhevsky2012imagenet is used as the activation function after every Conv and FC layer (except the output FC layer) in all the CNN models discussed in Section 2. Batch Normalization (BN) ioffe2015batch is employed after each Conv and FC layer, except the final FC layer, in the CNN-2 and CNN-3 architectures; in the case of CNN-1, Batch Normalization is used only with the first two layers, as indicated in Table 1. To reduce the amount of over-fitting, we use a popular regularization method called Dropout (DP) srivastava2014dropout after some Batch Normalization layers, as summarized in Table 2 for CNN-2. For CNN-3, the Dropout layers are used as per the CIFAR-VGG model liu2015very . In order to isolate the impact of fully connected (FC) layers on the performance of the CNN, any added FC layer is followed by ReLU and Batch Normalization by default. Along with dropout, data augmentation techniques such as rotation, horizontal flip, and vertical flip are also applied to reduce over-fitting. The implemented CNN architectures are trained using the Stochastic Gradient Descent (SGD) optimizer with momentum.
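The step-decay schedule described above can be sketched as follows; the exact initial learning rate, decay factor, and interval did not survive in this copy of the letter, so the values 0.01, 0.5, and 20 epochs below are placeholders for illustration only:

```python
def step_lr(initial_lr, epoch, drop_every, factor):
    """Step decay: multiply the learning rate by `factor` every `drop_every` epochs."""
    return initial_lr * factor ** (epoch // drop_every)

# Placeholder hyper-parameters (the letter's actual values are not recoverable here).
for epoch in (0, 20, 40):
    print(epoch, step_lr(0.01, epoch, drop_every=20, factor=0.5))
```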

3.2 Evaluation criteria

To evaluate the performance of the developed CNN models (i.e., CNN-1, CNN-2, and CNN-3), we have considered the classification accuracy as the performance evaluation metric.

3.3 Datasets

To obtain the empirical observations addressed in this paper, we have conducted experiments on different modalities of datasets: CIFAR-10 krizhevsky2009learning , CIFAR-100 krizhevsky2009learning , and Tiny ImageNet tinyimagenet (natural image datasets), and CRCHistoPhenotypes sirinukunwattana2016locality (a medical image dataset).

3.3.1 CIFAR-10 krizhevsky2009learning

CIFAR-10 krizhevsky2009learning is a very popular tiny-image dataset consisting of 10 categories of images, where each class has 6,000 images. The dimension of each image is 32×32×3. To train the deep neural networks, we use the training set (i.e., 50,000 images) of the CIFAR-10 dataset, and the remaining data (i.e., 10,000 images) are utilized to validate the performance of the models. A few sample images from the CIFAR-10 dataset are shown in Fig. 2(a).

3.3.2 CIFAR-100 krizhevsky2009learning

The CIFAR-100 krizhevsky2009learning dataset is similar to CIFAR-10, except that CIFAR-100 has 100 classes. In our experimental setting, 50,000 images are used to train the CNN models and the remaining 10,000 images are used to validate the performance of the models. As in CIFAR-10, the dimension of each image is 32×32×3. Sample images are shown in Fig. 2(a).

3.3.3 Tiny ImageNet tinyimagenet

The Tiny ImageNet dataset tinyimagenet consists of a subset of ImageNet deng2009imagenet images. This dataset has a total of 200 classes. In our setting, we use 80,000 images for training and 20,000 images for validating the performance of the models. The dimension of each image is 64×64×3. Example images of the Tiny ImageNet dataset are portrayed in Fig. 2(b).

3.3.4 CRCHistoPhenotypes sirinukunwattana2016locality

In order to generalize the observations reported in this paper, we use a medical image dataset of routine colon cancer nuclei cells called “CRCHistoPhenotypes” sirinukunwattana2016locality , which is publicly available111https://warwick.ac.uk/fac/sci/dcs/research/tia/data/crchistolabelednucleihe. This colon cancer dataset consists of a total of 22,444 nuclei patches belonging to four different classes, namely ‘Epithelial’, ‘Inflammatory’, ‘Fibroblast’, and ‘Miscellaneous’. In total, there are 7,722 images in the ‘Epithelial’ class, 5,712 images in the ‘Fibroblast’ class, 6,971 images in the ‘Inflammatory’ class, and the remaining 2,039 images in the ‘Miscellaneous’ class. The dimension of each patch is 32×32×3. For training the CNN models, 80% of the data (17,955 images) is utilized, and the remaining data (4,489 images) are used to validate the performance of the models. Sample images are displayed in Fig. 2(c).

Deeper vs Wider datasets bansal2017s : For any two datasets with roughly the same number of images, one dataset is said to be deeper bansal2017s than the other if it has more images per class in the training set. The dataset with fewer images per class (i.e., more classes for the same amount of data) in the training set is called the wider dataset. For example, CIFAR-10 and CIFAR-100 krizhevsky2009learning both have 50,000 images in the training set. CIFAR-10 is the deeper dataset since it has 5,000 images per class in the training set. On the other hand, CIFAR-100 is the wider dataset because it has only 500 images per class.
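Under this definition, the depth of a dataset reduces to a single ratio, training images per class, as the following sketch (using the figures from Table 5) shows:

```python
def deeper_and_wider(datasets):
    """Given {name: (train_images, num_classes)}, return the (deeper, wider)
    dataset names, ranked by training images per class."""
    per_class = {name: tr / c for name, (tr, c) in datasets.items()}
    return max(per_class, key=per_class.get), min(per_class, key=per_class.get)

# CIFAR-10 and CIFAR-100 both have 50,000 training images (Table 5).
deeper, wider = deeper_and_wider({"CIFAR-10": (50_000, 10),
                                  "CIFAR-100": (50_000, 100)})
print(deeper, wider)  # CIFAR-10 CIFAR-100 (5000 vs. 500 images per class)
```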

4 Results and Analysis

We have conducted extensive experiments to identify useful practices in deep learning for the usage of Convolutional Neural Networks (CNNs). The three CNN models discussed in Section 2 are implemented to perform the experiments on the publicly available CIFAR-10/100, Tiny ImageNet, and CRCHistoPhenotypes datasets. The results are reported in terms of classification accuracy.

4.1 Impact of FC layers on the performance of the CNN model w.r.t. the depth of the CNN

To observe the effect of deeper/shallower architectures (i.e., the number of convolutional layers) on the FC layers, the CNN models are initially trained with a single FC (output) layer. Then another FC layer is added before the output (FC) layer to observe the gain/loss in performance due to the addition of the new FC layer. The number of nodes for the newly added FC layer is chosen starting from the number of classes and then over all powers of 2 greater than the number of classes, up to 4096. For example, in the case of the CIFAR-10 dataset krizhevsky2009learning , the experiments are conducted by varying the number of nodes in the newly added FC layer over {10, 16, 32, 64, …, 4096}. In the next step, one more FC layer is added before the recently added FC layer, and its number of nodes is varied from the value for which the best performance was obtained in the previous setting up to 4096. For instance, once the best-performing width for two FC layers is found, we observe the performance of the model with a third FC layer whose width is varied over the remaining candidate values. Details such as the number of FC layers, the number of nodes in each FC layer, and the best classification accuracies obtained for the CIFAR-10 dataset using the three CNN models are shown in Table 3. It is clearly observed from Table 3 that the deeper architectures (i.e., CNN-2 with 10 convolutional layers and CNN-3 with 13 convolutional layers) require relatively fewer FC layers, and fewer nodes in the FC layers, compared to the shallow architecture (i.e., CNN-1 with 5 convolutional layers) for the CIFAR-10 dataset. In order to generalize this observation, we computed the results by varying the number of FC layers over the other datasets and report the best performance in Table 4. From Table 4, we notice similar findings: the deeper architectures do not require many FC layers, whereas the shallow architecture requires more FC layers in order to obtain better performance on any dataset. The reason for this behavior is related to the type of features learned by the Conv layers. In general, a CNN architecture learns hierarchical features from raw images. Zeiler and Fergus zeiler2014visualizing have shown that the early layers learn low-level features, whereas the deeper layers learn high-level (problem-specific) features. This means that the final layer of a shallow architecture produces less abstract features than that of a deeper architecture; thus, more FC layers are needed for the shallow architecture than for the deeper one.
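The node-count search described above (the number of classes itself, then every power of two above it, capped at 4096 as in Table 4) can be sketched as:

```python
def candidate_widths(num_classes, max_width=4096):
    """Candidate node counts for a newly added FC layer: the number of
    classes itself, then every power of two above it up to max_width."""
    widths = [num_classes]
    power = 1
    while power <= max_width:
        if power > num_classes:
            widths.append(power)
        power *= 2
    return widths

print(candidate_widths(10))   # [10, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096]
print(candidate_widths(200))  # [200, 256, 512, 1024, 2048, 4096]
```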

S.No.  CNN-1                              CNN-2                              CNN-3
1      Single FC (output) layer (44.29)   Single FC (output) layer (91.46)   Single FC (output) layer (92.05)
2      (88.67)                            (91.14)                            (91.03)
3      (88.72)                            (91.58)                            (91.77)
4      (88.93)                            (91.99)                            (92.02)
5      (89.72)                            (91.82)                            (91.8)
6      (89.2)                             (91.86)                            (89.2)
7      (89.23)                            (92.02)                            (89.23)
8      (88.95)                            (90.98)                            (91.78)
9      (89.56)                            (91.54)                            (92.22)
10     (87.4)                             (91.27)                            (91.59)
11     (86.27)                            (87.51)                            (90.68)
12     (89.35)                            (91.97)                            (91.27)
13     (89.71)                            (91.92)                            (91.43)
14     (89.79)                            (91.53)                            (91.94)
15     (89.88)                            (91.95)                            -
16     (90)                               (92.29)                            -
17     (90.28)                            (91.64)                            -
18     (90.59)                            -                                  -
19     (90.77)                            -                                  -
20     (90.74)                            -                                  -

Table 3: The effect of the depth of the CNN models (i.e., CNN-1, CNN-2, and CNN-3) on the FC layers for the CIFAR-10 dataset. The best and second-best accuracies are highlighted in bold and italic, respectively. For example, the CNN-2 model produces its best accuracy of 92.29% with three FC layers (having 4096, 256, and 10 nodes) and its second-best accuracy of 92.02% with two FC layers.
S.No.  Dataset             CNN-1 (5 Conv layers)        CNN-2 (10 Conv layers)     CNN-3 (13 Conv layers)
1      CIFAR-10            4096x4096x64x10 (90.77)      4096x256x10 (92.29)        (92.22)
2      CIFAR-100           4096x4096x2048x100 (69.21)   4096x100 (62.28)           Single FC (output) layer (66.98)
3      Tiny ImageNet       4096x4096x1024x200 (50.1)    512x200, 1024x200 (41.84)  Single FC (output) layer (40.27)
4      CRCHistoPhenotypes  2048x256x4 (82.53)           512x4 (84.89)              (84.94)
Table 4: The best validation accuracies obtained over the CIFAR-10, CIFAR-100, Tiny ImageNet, and CRCHistoPhenotypes datasets using the three CNN models (i.e., CNN-1, CNN-2, and CNN-3). The results are presented in terms of the FC layer structure and validation accuracy.

4.2 Effect of FC layers on the performance of the CNN model w.r.t. to different types of datasets

We have used two kinds of datasets (deeper and wider) to analyze the effect on the FC layers. Table 5 presents characteristics of the four datasets discussed in Section 3.3: the average number of images per class in the training set (N), the number of classes (C), and the numbers of training (Tr) and validation (Va) images. From Fig. 3, we can observe that the shallow architecture CNN-1 (less deep than CNN-2 and CNN-3) requires more nodes in the FC layers for wider datasets than for deeper datasets. On the other hand, the deeper architecture CNN-3 requires fewer nodes in the FC layers for wider datasets than for deeper datasets. A deeper CNN model such as CNN-3, with 13 convolutional layers, has more trainable parameters in its Conv layers; thus, a deeper dataset is required to learn the parameters of the network. In contrast, a shallow architecture such as CNN-1, with 5 convolutional layers, has fewer parameters, for which a wider dataset is more suitable to train the model.

Figure 3: The effect of deeper/wider datasets on the FC layers of a CNN. The deeper architecture CNN-3 requires relatively few nodes in the FC layers for the wider datasets compared to the deeper datasets. In contrast, the shallow architecture CNN-1 requires a relatively large number of nodes in the FC layers for wider datasets compared to deeper datasets.
Dataset N C Tr Va
CIFAR-10 5000 10 50,000 10,000
CIFAR-100 500 100 50,000 10,000
Tiny ImageNet 500 200 80,000 20,000
CRCHistoPhenotypes 4489 4 17955 4489
Table 5: The characteristics of the CIFAR-10, CIFAR-100, Tiny ImageNet, and CRCHistoPhenotypes datasets. Here, N represents the average number of images per class in the training set, C represents the number of classes in a dataset, and Tr and Va are the numbers of images in the training and validation sets, respectively.

4.3 Deeper vs. Shallower Architectures, Which are better and when?

Bansal et al. bansal2017s reported that deeper architectures are preferred over shallow architectures when training CNN models with deeper datasets, whereas for wider datasets, shallow architectures perform better than deeper ones. However, their observation is specific to the face recognition problem bansal2017s . In this paper, we have made a rigorous study to generalize this finding by conducting extensive experiments on different modalities of datasets: the CIFAR-10, CIFAR-100, and Tiny ImageNet datasets contain natural images, while the CRCHistoPhenotypes dataset contains medical images. The results obtained through these experiments clearly indicate that deeper architectures are preferable to shallow architectures when training a CNN model on deeper datasets. In contrast, for wider datasets, shallow architectures perform better than deeper CNN models.

From Table 4, we can observe that training the deeper architectures CNN-2 and CNN-3 on deeper datasets produces 92.29% and 92.22% validation accuracy in the case of the CIFAR-10 dataset, and 84.89% and 84.94% in the case of the CRCHistoPhenotypes dataset, respectively. In contrast, we obtained 90.77% and 82.53% validation accuracy for the CIFAR-10 and CRCHistoPhenotypes datasets when the shallow architecture is trained on these deeper datasets. On the other hand, for the wider datasets CIFAR-100 and Tiny ImageNet, better performance is obtained using the shallow architecture (CNN-1): Table 4 shows that CNN-1 gives a validation accuracy of 69.21% for CIFAR-100 and 50.1% for the Tiny ImageNet dataset, whereas the CNN-1 model performs relatively poorly on the deeper datasets.

This observation is very useful when choosing a CNN architecture to train for a given dataset. The generalization of this finding makes intuitive sense: deeper/shallower architectures have more/fewer learnable parameters in a typical CNN model, and therefore require more/fewer images per class for training.

5 Conclusion

In this paper, we have analyzed the effect of certain design decisions concerning the FC layers of CNNs for image classification. Careful selection of these decisions not only improves the performance of the CNN models but also reduces the time required to choose among different architectures, such as deeper and shallower ones. This paper concludes with the following guidelines, which can be adopted while designing deep/shallow convolutional neural networks to obtain better performance.

  • Shallow architectures require a large number of nodes in the FC layers to obtain better performance. On the other hand, deeper architectures require fewer nodes in the FC layers, irrespective of the type of dataset.

  • Shallow models require a larger number of nodes in the FC layers, as well as more FC layers, for wider datasets compared to deeper datasets, and vice versa.

  • Deeper architectures perform better than shallow architectures on deeper datasets. In contrast, shallow architectures perform better than deeper architectures on wider datasets. These observations can help the deep learning community when deciding between deep and shallow CNN architectures.

Acknowledgment

This work is supported in part by Science and Engineering Research Board (SERB), Govt. of India, Grant No. ECR/2017/000082. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the GeForce Titan XP GPU used for this research.

References


  • (1) Y. LeCun, Y. Bengio, G. Hinton, Deep learning, nature 521 (7553) (2015) 436.
  • (2) K. He, G. Gkioxari, P. Dollár, R. Girshick, Mask r-cnn, in: Computer Vision (ICCV), 2017 IEEE International Conference on, IEEE, 2017, pp. 2980–2988.
  • (3) B. Hariharan, P. Arbelaez, R. Girshick, J. Malik, Object instance segmentation and fine-grained localization using hypercolumns, IEEE transactions on pattern analysis and machine intelligence 39 (4) (2017) 627–639.
  • (4) B. Zoph, Q. V. Le, Neural architecture search with reinforcement learning, arXiv preprint arXiv:1611.01578.
  • (5) H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, J. Dean, Efficient neural architecture search via parameter sharing, arXiv preprint arXiv:1802.03268.
  • (6) M. Wistuba, Bayesian optimization combined with successive halving for neural network architecture optimization., in: AutoML@ PKDD/ECML, 2017, pp. 2–11.
  • (7) A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deep convolutional neural networks, in: Advances in neural information processing systems, 2012, pp. 1097–1105.
  • (8) M. D. Zeiler, R. Fergus, Visualizing and understanding convolutional networks, in: European conference on computer vision, Springer, 2014, pp. 818–833.
  • (9) K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556.
  • (10) C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, Going deeper with convolutions, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1–9.

  • (11) K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  • (12) J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, Imagenet: A large-scale hierarchical image database, in: Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, Ieee, 2009, pp. 248–255.
  • (13) S. S. Basha, S. Ghosh, K. K. Babu, S. R. Dubey, V. Pulabaigari, S. Mukherjee, Rccnet: An efficient convolutional neural network for histological routine colon cancer nuclei classification, in: 2018 15th International Conference on Control, Automation, Robotics and Vision (ICARCV), IEEE, 2018, pp. 1222–1227.
  • (14) Q. Xu, M. Zhang, Z. Gu, G. Pan, Overfitting remedy by sparsifying regularization on fully-connected layers of cnns, Neurocomputing 328 (2019) 69–74.
  • (15) H. N. Mhaskar, T. Poggio, Deep vs. shallow networks: An approximation theory perspective, Analysis and Applications 14 (06) (2016) 829–848.
  • (16) H. Mhaskar, Q. Liao, T. Poggio, Learning functions: when is deep better than shallow, arXiv preprint arXiv:1603.00988.
  • (17) G. F. Montufar, R. Pascanu, K. Cho, Y. Bengio, On the number of linear regions of deep neural networks, in: Advances in neural information processing systems, 2014, pp. 2924–2932.
  • (18) A. Bansal, C. Castillo, R. Ranjan, R. Chellappa, The do’s and don’ts for cnn-based face verification, in: 2017 IEEE International Conference on Computer Vision Workshop (ICCVW), IEEE, 2017, pp. 2545–2554.
  • (19) A. Krizhevsky, G. Hinton, Learning multiple layers of features from tiny images, Tech. rep., Citeseer (2009).
  • (20) J. Wu, Q. Zhang, G. Xu, Tiny imagenet challenge, cs231n, Stanford University.
  • (21) K. Sirinukunwattana, S. E. A. Raza, Y.-W. Tsang, D. R. Snead, I. A. Cree, N. M. Rajpoot, Locality sensitive deep learning for detection and classification of nuclei in routine colon cancer histology images, IEEE transactions on medical imaging 35 (5) (2016) 1196–1206.
  • (22) J. Ba, R. Caruana, Do deep nets really need to be deep?, in: Advances in neural information processing systems, 2014, pp. 2654–2662.
  • (23) O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al., Imagenet large scale visual recognition challenge, International Journal of Computer Vision 115 (3) (2015) 211–252.
  • (24) S. Liu, W. Deng, Very deep convolutional neural network based image classification using small training sample size, in: Pattern Recognition (ACPR), 2015 3rd IAPR Asian Conference on, IEEE, 2015, pp. 730–734.
  • (25) S. Ioffe, C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, in: International Conference on Machine Learning, 2015, pp. 448–456.
  • (26) N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: a simple way to prevent neural networks from overfitting, The Journal of Machine Learning Research 15 (1) (2014) 1929–1958.