I Introduction
Deep learning's success in almost every field reflects its ability to solve complex problems. The tradeoff between model complexity and accuracy is an important area of deep learning research. Very complex models with millions of parameters [8, 9] have proved to be state-of-the-art solutions for many vision and natural language problems. A common way to measure the performance or generalizability of a deep learning model is to test it on a discriminative validation/test set that represents the variation of samples in the corresponding problem. Learning a very complex model requires high computing power and a huge dataset, so it is important to understand the optimal complexity for a problem in order to reduce the computational burden. In a recent work, Zhang et al. [6] proved that a simple two-layer neural network with $2n + d$ parameters can represent any function on $n$ samples in $d$ dimensions. It is interesting to see that a simple multilayer perceptron with ReLU activation can fit a dataset of random labels with zero training error but poor generalizability. Mere memorization of the data is to blame for the poor performance on a test set. Preventing the network from memorizing data samples in an inefficient, random, high-dimensional space is an important model design paradigm. At the same time, it is acceptable for a model to memorize the data in an efficient hyperspace that represents the original data distribution. Learning the original data distribution addresses the poor validation set performance of a model. Designing an optimal network with a minimum number of parameters will reduce computation costs and improve performance.
How do we design a model that best fits the original data distribution? In this work, we present the importance of deep compositional feature space design with an optimal number of parameters. We show that the rate of feature space size reduction matters irrespective of the number of network parameters. We also define a strategy for relating the required number of parameters to a particular feature space representation.
I-A Contribution
Feature space representability: There exists an optimal number of nonlinear transformations for representing a feature space of a particular size without losing discriminative information. Convolution is a linear operation that projects data from one space to another; with a nonlinear activation, the final output of the convolution transformation becomes nonlinear. Transforming features from one space to another incurs a loss of information, which can be understood using the singular value decomposition. We trained multiple networks with the same number of parameters but different rates of feature space reduction on the CIFAR10 dataset to support our points. The following observations have a strong impact on learning and generalization performance:

The rate of reduction of the feature space size with respect to the number of convolution operations.

Convolution vs max pooling for feature space reduction.

An optimal number of model parameters: There exists an optimal number of parameters for a model to achieve high generalization accuracy. The optimal number of parameters depends on the compositional design and the feature space reduction rate.
Implicit regularizer: The learning rate decay policy acts as an implicit regularizer, boosting model performance and speeding up learning.
Note: the term feature space size here refers to the width and height of a feature map, not to the number of feature maps.
I-B Related Work
Zhang et al. (2017) [6] studied the representational power of a network with respect to the training sample size and showed that a deep model can memorize any dataset with random labels, which does not imply generalization. They also found that explicit regularization alone cannot prevent poor generalization performance. Bartlett (1998) [1] showed that the VC dimension is not the relevant measure of generalization performance for neural networks; rather, the norm of the network weights is the more relevant measure. He defined a fat-shattering dimension for error estimation that depends on parameter magnitude. Maass (1995) [5] established VC dimension bounds for neural networks with various activation functions for generalization analysis. Krogh et al. (1992) [4] showed that weight decay suppresses irrelevant components of the weight vector and reduces the effect of noise, hence lowering generalization error. Hardt et al. (2016) [7] introduced an upper bound on the generalization error of a model trained with stochastic gradient descent for convex and nonconvex optimization problems. Neyshabur et al. (2015) [3] showed that, apart from adding weight decay or implicit regularization, increasing the network size improves generalization performance for a model learned with stochastic gradient descent. They also asserted that, with a very large number of hidden units, a weight-decay-regularized network can be treated as a convex neural net for optimization. On the effectiveness of deep vs shallow networks, Mhaskar et al. (2016) [10] showed that the VC dimension and the fat-shattering dimension are smaller for deep networks than for shallow networks. They argued for the benefits of compositional function design for scalability and shift invariance in image and text data. None of the above works discusses the effect of the loss of information when data is projected from one space to another. We show here that information loss affects the generalization error of both deep and shallow networks.
II Feature space design in deep neural networks
II-A Information loss
The standard convolution operation used in deep networks is linear. When features are projected from one space to another using a convolution/pooling operation, there is always a loss of information. The information loss depends on the dimension and capacity of the projected space. Loss of information can be understood from the singular value decomposition (SVD). Let X denote the input map of a convolutional layer:
(1)
(2)
Equation (1) refers to the special case where the input matrix for the SVD is Hermitian and positive definite. When data is projected from its original dimension onto a lower-dimensional low-rank approximation, information is lost. A 2D convolution of a single feature map with a single kernel can be interpreted as projecting data from one space to another. Significant information is lost when the convolution stride is greater than 1. Another way to calculate the information retrieved after convolution is to use the mutual information between two signals. For two independent signals X and Y, the mutual information I(X, Y) = 0, and it is maximal if X = Y (fully correlated). For X and Y we define the information loss as follows:
(3)
The mutual information between the original data and the projected data (convolution/pooling) is given as follows:
(4)
For jointly normal random vectors X and Y, the mutual information can be calculated as follows:
$$I(X, Y) = \frac{1}{2}\log\frac{\det(\Sigma_X)\,\det(\Sigma_Y)}{\det(\Sigma)} \qquad (5)$$
where $\Sigma_X$, $\Sigma_Y$ and $\Sigma$ are the covariance matrices of X, Y and the joint vector (X, Y), respectively. Since we use batch normalization and input whitening, it is reasonable to assume that the input and output of a convolution are normal. In addition, convolutional kernels are initialized as normal variables. If the convolution input X is sampled from a normal distribution and the kernel K is also normally distributed, then the output Y is likewise treated as normally distributed. Using eq. (3) and eq. (5) we can calculate the information loss for convolution/pooling.
Note: convolution of an image with a separable 2D kernel is equivalent to two successive 1D convolutions, one with a column kernel and one with a row kernel.
II-B Compositional Design of Convolutional Layer
A single convolution layer followed by max pooling results in high information loss. Increasing the number of convolution kernels in a layer does not necessarily solve the problem. Before reducing the feature space size, it is important to project the features into high-dimensional hyperspaces using multiple convolution operations. It is also important to use a nonlinearity and batch normalization [13] after each convolution operation to achieve highly uncorrelated hyperspace projections. This stacking design best resembles the representation power of a compositional function. In addition, the VC dimension of a compositional design is smaller than that of a shallow design [10]. With the composition of multiple convolution operations, the receptive field grows in polynomial order. Compositional design is inspired by the visual cortex, where receptive fields increase for higher visual areas. It has been established [12] that visual cortex receptive fields are larger for a simple scene and smaller for a complex scene. A compositional design has both smaller and larger receptive fields, which capture information related to simple and complex objects and lead to a decrease in information loss.
II-C Convolution vs Max Pooling
Claim 1.1: Strided convolution can replace max/avg pooling with better generalization performance.
Feature space reduction using max pooling is a very crude projection into another hyperspace. The max pooling operation is a lossy nonlinear transformation. Translational invariance, one of the advantages of max pooling, can be represented equally well by an affine transformation, which is achieved with strided convolution. Information loss in strided convolution is lower than in hard nonlinear max pooling. Fig. 1 illustrates the model design with strided convolution and max pooling. Table II supports our claim.
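To make the contrast concrete, the following minimal sketch (not the implementation used in the experiments) halves the size of a 1D signal in both ways. The function names, the kernel weights and the absence of padding are illustrative assumptions. It checks that a strided convolution is a linear map of its input, whereas max pooling is not.

```python
def strided_conv1d(x, kernel, stride=2):
    """Valid (no padding) 1D cross-correlation with the given stride."""
    k = len(kernel)
    return [
        sum(w * x[i + j] for j, w in enumerate(kernel))
        for i in range(0, len(x) - k + 1, stride)
    ]


def max_pool1d(x, size=2, stride=2):
    """Max pooling over windows of `size` taken every `stride` steps (no padding)."""
    return [max(x[i:i + size]) for i in range(0, len(x) - size + 1, stride)]


def add(u, v):
    return [p + q for p, q in zip(u, v)]


def close(u, v, tol=1e-12):
    return len(u) == len(v) and all(abs(p - q) < tol for p, q in zip(u, v))


a = [1.0, -2.0, 3.0, 0.5, -1.0, 4.0]
b = [-1.0, 2.5, 0.0, 1.0, 2.0, -3.0]
kernel = [0.5, 0.5]  # hypothetical learned weights

# Linearity test: f(a + b) == f(a) + f(b)?
conv_of_sum = strided_conv1d(add(a, b), kernel)
sum_of_conv = add(strided_conv1d(a, kernel), strided_conv1d(b, kernel))
pool_of_sum = max_pool1d(add(a, b))
sum_of_pool = add(max_pool1d(a), max_pool1d(b))

conv_is_linear = close(conv_of_sum, sum_of_conv)
pool_is_linear = close(pool_of_sum, sum_of_pool)
print(conv_is_linear, pool_is_linear)
```

Both operations reduce six samples to three, but only the strided convolution has weights that can be learned to decide what is kept.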
II-D Rate of Reduction
Claim 1.2: At least one convolutional operation is needed before reducing the feature space size of the model.
To minimize the loss of information, the data must be projected into multiple nonlinear hyperspaces for improved generalization. At least one convolution without stride is preferable for the first layer of the model, and at most 4 convolution operations without a residual connection are preferable for intermediate layers; with residual connections, more convolutional operations can be added. It is also noteworthy that residual connections only facilitate the training of deeper models.
III Generalization requires feature space analysis
III-A VC dimension and Fat Shattering dimension
A function $f$ is considered a Lipschitz function if it satisfies the following condition for a smallest constant $L$:
$$|f(x_1) - f(x_2)| \le L\,|x_1 - x_2| \qquad (8)$$
In deep networks, a nonlinearity such as ReLU is a Lipschitz function with $L = 1$, and sigmoid with $L = 1/4$.
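As a quick numerical check of these constants, the sketch below estimates the Lipschitz constant of each activation as the largest slope between adjacent points on a grid. The grid range, the resolution and the helper name `empirical_lipschitz` are assumptions made for illustration; the estimate is a lower bound that approaches the true constant as the grid is refined.

```python
import math


def relu(x):
    return max(0.0, x)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def empirical_lipschitz(f, lo=-6.0, hi=6.0, steps=2400):
    """Largest slope |f(b) - f(a)| / |b - a| over adjacent grid points."""
    h = (hi - lo) / steps
    xs = [lo + i * h for i in range(steps + 1)]
    return max(abs(f(b) - f(a)) / (b - a) for a, b in zip(xs, xs[1:]))


relu_const = empirical_lipschitz(relu)
sigmoid_const = empirical_lipschitz(sigmoid)
print(relu_const, sigmoid_const)
```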
The VC dimension of a neural network class H, as a function of its number of layers and inputs and with ReLU activation functions, is given as follows [5]:
(9)
The significance of VC dimension analysis for deep convolutional neural network training is marginal. The theoretical complexity in terms of the total number of parameters of a convolutional model is a very loose upper bound for use in generalization analysis. It has been established that implicit and explicit regularization improve generalization. The fat-shattering dimension of a neural network class G, which bounds the number of points shattered by G in terms of its number of layers and inputs, is given as follows [2] for some constants:
(10)
The fat-shattering dimension is a better bound than the VC dimension for learning algorithms, as it suggests that minimizing the values of the network parameters is important for better generalization. The values of the model parameters can be minimized using a regularizer, which is achieved by adding an extra penalty term to the cost function using the respective norms.
(11)
III-B Implicit and explicit regularization
The most widely used and effective explicit regularizers are data augmentation, data balancing, dropout, the L1 regularizer and the L2 regularizer.
Data Augmentation: Given the increasing capacity of deep networks and the scarcity of sufficiently discriminative labeled data, it is useful to generate deformed versions of the original training examples using affine transformations such as rotation, translation, shearing, mirroring and random cropping. In addition, color space transformations such as RGB to Lab or HSV, as well as cropping and resizing, have proved effective in reducing overfitting.
Data Balancing: For a dataset with imbalanced classes, learning tends to overfit the over-represented class, which results in poor generalization performance. Sample balancing methods such as uniform sampling and stratified sampling resample the data into balanced mini-batches according to a given probability distribution over the classes.
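One simple way to realize uniform class sampling is sketched below: pick a class uniformly at random, then pick one of its samples. The function `balanced_batches`, the 9:1 toy dataset and the sampling-with-replacement choice are assumptions for illustration, not the sampler used in the experiments.

```python
import random
from collections import Counter, defaultdict


def balanced_batches(labels, batch_size, num_batches, seed=0):
    """Yield lists of sample indices whose class frequencies are roughly uniform."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    classes = sorted(by_class)
    for _ in range(num_batches):
        # Pick a class uniformly, then a sample of that class (with replacement).
        yield [rng.choice(by_class[rng.choice(classes)]) for _ in range(batch_size)]


labels = [0] * 900 + [1] * 100  # hypothetical 9:1 imbalanced dataset
counts = Counter()
for batch in balanced_batches(labels, batch_size=64, num_batches=50):
    counts.update(labels[i] for i in batch)

minority_share = counts[1] / sum(counts.values())
print(round(minority_share, 3))
```

With this scheme the minority class makes up about half of every mini-batch instead of a tenth, at the cost of repeating minority samples more often.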
Dropout: Randomly dropping layer activations realizes an ensemble of many functions for that layer, which helps reduce overfitting.
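A minimal sketch of the mechanism follows. It uses the "inverted dropout" variant, which rescales the surviving activations during training so that nothing needs to change at evaluation time; this variant and the function signature are assumptions chosen for illustration.

```python
import random


def dropout(activations, drop_prob, rng, training=True):
    """Inverted dropout: zero each unit with probability drop_prob, rescale survivors."""
    if not training or drop_prob == 0.0:
        return list(activations)
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]


rng = random.Random(0)
acts = [1.0] * 10000

train_out = dropout(acts, drop_prob=0.5, rng=rng)
eval_out = dropout(acts, drop_prob=0.5, rng=rng, training=False)

# The expected activation is preserved during training.
mean_train = sum(train_out) / len(train_out)
print(round(mean_train, 2))
```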
L1 & L2 regularizer: The L1 regularizer encourages sparsity by minimizing the L1 norm of the model weights. The L2 regularizer penalizes model complexity and leads to small weights. Combining L1 and L2 regularization encourages sparsity with small weights. The effectiveness of each regularizer depends on the application; in general, L2 regularization works well.
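The penalty terms can be sketched as follows. The helper names and the coefficients `l1` and `l2` are hypothetical, and the L2 term is written as a plain sum of squares; some formulations include an extra factor of 1/2.

```python
def l1_penalty(weights):
    """Sum of absolute weight values (L1 norm)."""
    return sum(abs(w) for w in weights)


def l2_penalty(weights):
    """Sum of squared weight values (squared L2 norm)."""
    return sum(w * w for w in weights)


def regularized_cost(data_loss, weights, l1=0.0, l2=0.0):
    """Data loss plus weighted L1 and L2 penalties on the model weights."""
    return data_loss + l1 * l1_penalty(weights) + l2 * l2_penalty(weights)


weights = [0.5, -1.5, 0.0, 2.0]
cost = regularized_cost(data_loss=1.0, weights=weights, l1=0.01, l2=0.001)
print(cost)
```

Setting both coefficients to nonzero values gives the combined L1 and L2 penalty mentioned above.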
Normalization: Batch Normalization (BN) [13] is one of the de facto implicit regularizers for faster learning and better generalization in feedforward deep convolutional models. BN normalizes the layer inputs to a distribution with zero mean and unit variance. A model with BN no longer needs a bias term. For recurrent neural networks, layer normalization [14] has proved to be more efficient than batch normalization. Another useful implicit regularizer is Local Response Normalization (LRN). LRN [16] is inspired by the lateral inhibition of an excited neuron: unbounded activations are normalized using the values in a local window. For applications such as person re-identification [15], LRN outperforms BN. An experimental analysis of the effectiveness of batch normalization and dropout is given in Table III.
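The sketch below applies training-mode batch normalization to a batch of scalar activations and illustrates why a preceding bias term becomes redundant: a constant offset is removed when the batch mean is subtracted. The scalar setting, the `gamma`, `beta` and `eps` defaults and the omission of running statistics are simplifying assumptions.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize scalar activations with the batch mean and (biased) batch variance."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]


pre_activations = [2.0, 4.0, 6.0, 8.0]
with_bias = [x + 3.0 for x in pre_activations]  # same values plus a constant bias

out = batch_norm(pre_activations)
out_with_bias = batch_norm(with_bias)

out_mean = sum(out) / len(out)
out_var = sum((x - out_mean) ** 2 for x in out) / len(out)
print(out_mean, out_var)
```

The learned shift `beta` plays the role of the bias after normalization.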
III-C Feature Space Analysis
Claim 1.3: The rate of reduction of the feature space size with respect to the number of convolutional layers plays an important role in generalization.
A detailed analysis of the impact of feature space design on representing features without losing significant information is given in Section II. Efficient feature space design improves generalization performance by a fair margin. Table I supports our claim.
III-D Learning Rate Decay
Claim 1.4: Learning rate decay policy acts as an implicit regularizer for deep model learning.
The learning rate decay policy plays a major role in learning parameters that generalize, with faster convergence. As learning progresses, exploration in the local neighborhood becomes more important in order to avoid oscillation and ill-conditioning.
The Taylor series approximation of the cost function:
(12)
where the two quantities involved are the gradient and the Hessian of the cost function. When the second-order (Hessian) term is large, the cost increases; this effect is known as ill-conditioning and is a common problem in deep learning training. Learning rate decay alleviates the effect of ill-conditioning, leading to exploration of the low-cost region. In practice, polynomial decay works very well in comparison to step decay or inverse decay. Step decay needs more supervision and is better avoided.
From Table IV we can see that polynomial decay performs better than step decay.
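The effect of ill-conditioning can be illustrated on a toy problem. The sketch assumes a quadratic cost $J(\theta) = \frac{1}{2}\theta^{T}H\theta$ with a diagonal Hessian, for which the second-order Taylor expansion of a gradient step is exact: $J(\theta - \epsilon g) = J(\theta) - \epsilon g^{T}g + \frac{1}{2}\epsilon^{2} g^{T}Hg$. The Hessian values, the starting point and the two learning rates are arbitrary choices for illustration.

```python
def cost(theta, h_diag):
    """Quadratic cost 0.5 * theta^T H theta with diagonal Hessian h_diag."""
    return 0.5 * sum(h * t * t for h, t in zip(h_diag, theta))


def gradient(theta, h_diag):
    return [h * t for h, t in zip(h_diag, theta)]


def taylor_terms(theta, h_diag, lr):
    """First-order decrease and second-order increase predicted for one gradient step."""
    g = gradient(theta, h_diag)
    first = lr * sum(gi * gi for gi in g)
    second = 0.5 * lr ** 2 * sum(h * gi * gi for h, gi in zip(h_diag, g))
    return first, second


def cost_after_step(theta, h_diag, lr):
    g = gradient(theta, h_diag)
    return cost([t - lr * gi for t, gi in zip(theta, g)], h_diag)


h_diag = [1.0, 100.0]  # poorly conditioned: curvatures differ by a factor of 100
theta = [1.0, 1.0]

base = cost(theta, h_diag)
large_lr_cost = cost_after_step(theta, h_diag, lr=0.05)
small_lr_cost = cost_after_step(theta, h_diag, lr=0.005)
first, second = taylor_terms(theta, h_diag, lr=0.05)
print(base, large_lr_cost, small_lr_cost)
```

With the larger learning rate the curvature term exceeds the first-order decrease and the cost rises; after the learning rate has decayed by a factor of ten, the same step lowers the cost.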
III-E Optimal Number of Parameters
For deep models, the impact of the VC dimension is marginal, while the fat-shattering dimension plays an important role in regularization. Feature space design comes into play when determining the optimal number of parameters. An inefficient shallow model may underfit the data, and an excessively deep model may overfit it; both result in poor generalization performance. As discussed in Section II, feature space design plays a major role in obtaining a powerful representation on uncorrelated hyperspaces with reduced information loss.
IV Experiment Setup
Platform Details: All our experiments were carried out on a Linux server with 128GB RAM, a Xeon E5-4667 v4 processor, and two Nvidia K80 GPUs.
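Table I: Performance of each design on the CIFAR10 validation/test set.

| Model | #params (K) | test_accuracy (%) |
|---|---|---|
| design 1 | 20173 | 89.4 |
| design 2 | 20173 | 86.8 |
| design 3 | 20025 | 87.9 |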
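Table II: Strided convolution vs max pooling for feature space size reduction.

| Model | #params (K) | test_accuracy (%) |
|---|---|---|
| design 1_conv | 20948 | 91.7 |
| design 1 (max_pooling) | 20173 | 89.4 |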
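Table III: Effect of dropout and batch normalization.

| Model | dropout | batch_norm | test_accuracy (%) |
|---|---|---|---|
| design 1_conv | yes | yes | 91.7 |
| design 1_conv | yes | no | 88.2 |
| design 1_conv | no | yes | 90.1 |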
Table IV: Effect of the learning rate decay policy.

| Model | policy | test_accuracy (%) |
|---|---|---|
| design 1_conv | polynomial | 91.7 |
| design 1_conv | step | 90.1 |
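Table V: Effect of using a strided convolution in the first layer.

| Model | first_layer_stride | test_accuracy (%) |
|---|---|---|
| design 1_conv | no | 91.7 |
| design 1_conv_stride | yes | 89.4 |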
Table VI: Optimal design vs deeper design.

| Model | #params (K) | test_accuracy (%) |
|---|---|---|
| design 1_conv | 20948 | 91.7 |
| design 4 | 21573 | 89.3 |
Table VII: Model designs used in the experiments.

| block | design 1 | design 1_conv | design 2 | design 3 | design 4 |
|---|---|---|---|---|---|
| input | 28 x 28 x 3 | 28 x 28 x 3 | 28 x 28 x 3 | 28 x 28 x 3 | 28 x 28 x 3 |
| block1 | 1 x conv3x3, 1, 64 | 1 x conv3x3, 1, 64 | 1 x conv3x3, 1, 64 | conv3x3, 64 | 2 x conv3x3, 1, 64 |
| block2 | max_pool | 1 x conv3x3, 2, 64 | max_pool | max_pool | 1 x conv3x3, 2, 64 |
| block2_1 | | | | | 1 x conv1x1, 2, 128 |
| block3 | 2 x conv3x3, 1, 128 | 2 x conv3x3, 1, 128 | 1 x conv3x3, 1, 128 | 1 x conv3x3, 1, 128 | 3 x conv3x3, 1, 128 |
| block3_1 | | | max_pool | | block2_1 + block3 |
| block3_2 | | | 1 x conv3x3, 1, 128 | | 3 x conv3x3, 1, 128 |
| block4 | max_pool | 1 x conv3x3, 2, 128 | max_pool | max_pool | 1 x conv3x3, 2, 128 |
| block5 | 4 x conv3x3, 1, 256 | 4 x conv3x3, 1, 256 | 4 x conv3x3, 1, 256 | 4 x conv3x3, 1, 256 | 4 x conv3x3, 1, 256 |
| block6 | max_pool | 1 x conv3x3, 2, 256 | max_pool | max_pool | 1 x conv3x3, 2, 256 |
| block7 | 1 x conv1x1, 1, 4096 | 1 x conv1x1, 1, 4096 | 1 x conv1x1, 1, 4096 | 1 x conv1x1, 1, 4096 | 1 x conv1x1, 1, 4096 |
| block7_1 | dropout | dropout | dropout | dropout | dropout |
| block8 | 1 x conv1x1, 1, 4096 | 1 x conv1x1, 1, 4096 | 1 x conv1x1, 1, 4096 | 1 x conv1x1, 1, 4096 | 1 x conv1x1, 1, 4096 |
| block8_1 | dropout | dropout | dropout | dropout | dropout |
| block9 | 1 x conv1x1, 1, 10 | 1 x conv1x1, 1, 10 | 1 x conv1x1, 1, 10 | 1 x conv1x1, 1, 10 | 1 x conv1x1, 1, 10 |
Dataset: To validate our claims we used the CIFAR10 [17] image classification dataset. It has 10 object classes and is divided into two splits for training and validation. The training split has 50000 images and the validation split has 10000 images. Each image is $32 \times 32$ pixels with RGB color channels.
Preprocessing: For training, a randomly cropped patch of size $28 \times 28$ is used. Each patch is flipped left/right and up/down based on the result of a coin flip. In addition, we adjust the image color by scaling its values into a fixed range and changing its hue, contrast, and saturation. Each image (training/validation) is standardized by subtracting its mean and dividing by its standard deviation.
For evaluation, the central crop of each image is selected and resized using bilinear interpolation.
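The geometric part of this pipeline can be sketched on a single-channel image stored as a nested list. The helper names, the 28-pixel crop size and the toy image are assumptions for illustration; the color adjustments and the bilinear resize are omitted, and the experiments themselves used the TEFLA/TensorFlow pipeline rather than this code.

```python
import math
import random


def random_crop(img, size, rng):
    """Crop a size x size patch at a uniformly random position."""
    top = rng.randint(0, len(img) - size)
    left = rng.randint(0, len(img[0]) - size)
    return [row[left:left + size] for row in img[top:top + size]]


def central_crop(img, size):
    top = (len(img) - size) // 2
    left = (len(img[0]) - size) // 2
    return [row[left:left + size] for row in img[top:top + size]]


def random_flip(img, rng):
    """Flip left/right and up/down, each decided by an independent coin flip."""
    if rng.random() < 0.5:
        img = [row[::-1] for row in img]
    if rng.random() < 0.5:
        img = img[::-1]
    return img


def standardize(img):
    """Subtract the per-image mean and divide by the per-image standard deviation."""
    pixels = [p for row in img for p in row]
    mean = sum(pixels) / len(pixels)
    std = math.sqrt(sum((p - mean) ** 2 for p in pixels) / len(pixels))
    std = max(std, 1e-8)  # guard against constant images
    return [[(p - mean) / std for p in row] for row in img]


rng = random.Random(0)
image = [[float((r * 32 + c) % 256) for c in range(32)] for r in range(32)]  # toy 32 x 32 image

train_patch = standardize(random_flip(random_crop(image, 28, rng), rng))
eval_patch = standardize(central_crop(image, 28))
print(len(train_patch), len(train_patch[0]))
```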
Framework: We used TEFLA [18], a Python framework developed on top of TensorFlow [19], for all experiments described in this work.
Model: Table VII details the model designs for the different experiments. Entries follow the convention ($n$ x conv, $s$, $k$), where $n$ is the number of convolutions in the compositional design, $s$ is the convolution stride and $k$ is the number of kernels in each convolution layer. In all designs used in this work, each convolutional layer is followed by batch normalization and a nonlinearity (ReLU in our experiments).
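The parameter counts reported for the designs can be cross-checked from Table VII with a short sketch. It assumes bias-free convolutions and two batch-normalization parameters per output channel after every convolution, including the last one; these are assumptions about the implementation rather than details stated in the text. Pooling, dropout and stride do not contribute parameters, so stride is left out of the layer description.

```python
def conv_params(in_ch, out_ch, kernel):
    """Weights of a bias-free convolution plus 2 batch-norm parameters per output channel."""
    return in_ch * out_ch * kernel * kernel + 2 * out_ch


def count_params(blocks, in_ch=3):
    """blocks: list of (n, kernel, out_ch), i.e. n stacked kernel x kernel convolutions."""
    total = 0
    for n, kernel, out_ch in blocks:
        for _ in range(n):
            total += conv_params(in_ch, out_ch, kernel)
            in_ch = out_ch
    return total


head = [(1, 1, 4096), (1, 1, 4096), (1, 1, 10)]  # block7 to block9

designs = {
    "design 1": [(1, 3, 64), (2, 3, 128), (4, 3, 256)] + head,
    "design 1_conv": [(1, 3, 64), (1, 3, 64), (2, 3, 128), (1, 3, 128),
                      (4, 3, 256), (1, 3, 256)] + head,
    "design 3": [(1, 3, 64), (1, 3, 128), (4, 3, 256)] + head,
}

params_k = {name: count_params(blocks) // 1000 for name, blocks in designs.items()}
print(params_k)
```

Under these assumptions the counts come to 20173K, 20948K and 20025K, which agrees with the values listed for design 1, design 1_conv and design 3.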
IV-A Results Analysis
Table I shows the performance of each model on the CIFAR10 validation/test set. For design 1 and design 2, which have the same number of parameters, generalization performance varies significantly, underlining the importance of feature space size and of minimizing information loss.
Table II shows the advantage of strided convolution over max pooling for feature space size reduction. A significant performance gain is observed when using convolution instead of max pooling, implying that the information loss for max pooling is higher than for strided convolution. For design 1_conv, if we use strided convolution in the first layer instead of the second, there is a significant drop in generalization performance even though the number of parameters remains the same (Table V).
Table III shows the importance of dropout and batch normalization for generalization. The effect of batch normalization is larger than that of dropout.
Table IV supports our claim that the learning rate decay policy also acts as an implicit regularizer that improves generalization performance. Polynomial decay is very robust and requires minimal supervision, yielding better generalization performance.
Performance does not always improve with more depth; an optimal design performs better than a deeper design, as the experimental results for the two designs in Table VI show. From this, we conclude that there exists an optimal number of parameters for generalization.
V Conclusion
In this work, a detailed analysis of the generalization performance tradeoff of deep models is presented. We showed that compositional feature space design together with implicit and explicit regularization plays an important role in achieving better performance. As a traditional measure of model complexity, the VC dimension does not give much information, but fat-shattering dimension analysis has an indirect effect on generalization. Our experiments showed that the optimal model satisfies the compositional design criteria and has an optimal number of parameters. We conclude with the claim that the combination of compositional feature space design, explicit and implicit regularization, and efficient optimization algorithms gives the best generalization performance for any dataset.
References
 [1] Peter L Bartlett. The Sample Complexity of Pattern Classification with Neural Networks  The Size of the Weights is More Important than the Size of the Network. IEEE Trans. Information Theory, 1998.
 [2] Bartlett, P. L. For valid generalization, the size of the weights is more important than the size of the network. Advances in neural information processing systems, 134–140, 1997.
 [3] Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. In search of the real inductive bias: On the role of implicit regularization in deep learning. CoRR, abs/1412.6614, 2014.
 [4] Krogh, A. and Hertz, J. A. A simple weight decay can improve generalization. In Proc. NIPS, pp. 950–957, 1992.
 [5] Maass, W. Vapnik-Chervonenkis dimension of neural nets. The handbook of brain theory and neural networks, pp. 1000–1003, 1995.
 [6] Zhang, C., Bengio, S., Hardt, M., Recht, B., Vinyals, O. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.
 [7] Moritz Hardt, Benjamin Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. In ICML, 2016.
 [8] Szegedy, C., Ioffe, S., Vanhoucke, V., Alemi, A. Inception-v4, Inception-ResNet and the impact of residual connections on learning. arXiv preprint arXiv:1602.07261, 2016.
 [9] He, K., Zhang, X., Ren, S., Sun, J. Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.
 [10] Mhaskar, H., Liao, Q., Poggio, T. Learning functions: When is deep better than shallow. arXiv preprint arXiv:1603.00988, 2016.
 [11] Ioffe, S., Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
 [12] Trappenberg, T. P., Rolls, E. T., Stringer, S. M. Effective size of receptive fields of inferior temporal visual cortex neurons in natural scenes. Advances in neural information processing systems, 1, 293–300, 2002.
 [13] Ioffe, S., Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
 [14] Ba, J. L., Kiros, J. R., Hinton, G. E. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
 [15] Varior, R. R., Haloi, M., Wang, G. Gated siamese convolutional neural network architecture for human re-identification. In European Conference on Computer Vision (pp. 791–808). Springer International Publishing, 2016.
 [16] Krizhevsky, A., Sutskever, I., Hinton, G. E. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems (pp. 1097–1105), 2012.
 [17] Krizhevsky, A., Hinton, G. Learning multiple layers of features from tiny images. Technical report, Department of Computer Science, University of Toronto, 2009
 [18] https://github.com/n3011/tefla
 [19] Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., … Ghemawat, S. (2016). Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467.
VI Appendix
VI-A Effective receptive field
(13)  
where the quantities involved are the distance between two adjacent features in a feature map, the convolution stride, the kernel size and the receptive field size.
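A commonly used recurrence over these four quantities can be sketched as follows; it is given here as an assumption and may differ in form from equation (13). With jump $j$ (distance between adjacent features), stride $s$, kernel size $k$ and receptive field size $r$, each layer updates $r \leftarrow r + (k - 1)\,j$ and then $j \leftarrow j\,s$, starting from $r = 1$ and $j = 1$ at the input.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride). Returns (receptive_field, jump) after each layer."""
    r, j = 1, 1
    trace = []
    for k, s in layers:
        r = r + (k - 1) * j  # growth depends on the jump before this layer
        j = j * s            # stride multiplies the distance between adjacent features
        trace.append((r, j))
    return trace


# First five convolutions of a design with 3x3 kernels and strides 1, 2, 1, 1, 2
# (the pattern of block1 to block4 in the strided-convolution design).
trace = receptive_field([(3, 1), (3, 2), (3, 1), (3, 1), (3, 2)])
print(trace)
```

Two stacked 3x3 convolutions with stride 1 already cover a 5x5 region, which is the sense in which composition enlarges the receptive field before any reduction in feature space size.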
VI-B Learning rate decay policy
Commonly used learning rate decay policies are given below:
(14)  
(15)  
(16)  
(17)  
(18)  
(19) 
In these policies, two constants control the shape of the decay; the remaining quantities are the initial learning rate, the learning rate at the current iteration, the maximum number of iterations for learning, and the step size at which the learning rate changes under the step policy.
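For reference, widely used forms of four such policies are sketched below. These are assumed textbook-style definitions and are not guaranteed to match equations (14) to (19) term by term; the parameter names `gamma`, `power`, `step_size` and `max_iter` are illustrative.

```python
def polynomial_decay(base_lr, iteration, max_iter, power=1.0):
    """Decays smoothly to zero at max_iter."""
    return base_lr * (1.0 - iteration / max_iter) ** power


def step_decay(base_lr, iteration, step_size, gamma=0.1):
    """Multiplies the rate by gamma every step_size iterations."""
    return base_lr * gamma ** (iteration // step_size)


def inverse_decay(base_lr, iteration, gamma=1e-4, power=0.75):
    return base_lr * (1.0 + gamma * iteration) ** (-power)


def exponential_decay(base_lr, iteration, gamma=0.999):
    return base_lr * gamma ** iteration


max_iter = 100
poly_schedule = [polynomial_decay(0.1, t, max_iter, power=2.0) for t in range(max_iter + 1)]
step_schedule = [step_decay(0.1, t, step_size=40) for t in range(max_iter + 1)]
inv_schedule = [inverse_decay(0.1, t) for t in range(max_iter + 1)]
print(poly_schedule[50], step_schedule[50], inv_schedule[50])
```

The step policy needs a hand-chosen `step_size`, whereas the polynomial policy only needs the total number of iterations.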
VI-C Mutual information for normal random variables
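For a normal random vector $X \sim \mathcal{N}(\mu_X, \Sigma_X)$ of dimension $n$, the entropy of X is calculated as follows:
$$H(X) = \frac{1}{2}\log\left((2\pi e)^{n}\det(\Sigma_X)\right)$$
For another normal random vector $Y \sim \mathcal{N}(\mu_Y, \Sigma_Y)$ that is jointly normal with X, the mutual information between X and Y can be calculated as follows:
$$I(X, Y) = H(X) + H(Y) - H(X, Y) = \frac{1}{2}\log\frac{\det(\Sigma_X)\,\det(\Sigma_Y)}{\det(\Sigma)}$$
where $\Sigma$ is the covariance matrix of the joint vector (X, Y).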
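The standard closed forms for the normal case can be checked numerically in the simplest setting of two scalar, jointly normal variables. The sketch below is an illustration under that assumption; the function names and the correlation value are hypothetical, and entropies are in nats.

```python
import math


def gaussian_entropy(cov_det, dim=1):
    """Differential entropy of a dim-dimensional normal with covariance determinant cov_det."""
    return 0.5 * math.log((2.0 * math.pi * math.e) ** dim * cov_det)


def gaussian_mutual_information(var_x, var_y, cov_xy):
    """Mutual information of two jointly normal scalars from their (co)variances."""
    det_joint = var_x * var_y - cov_xy ** 2  # determinant of the 2x2 joint covariance
    return 0.5 * math.log(var_x * var_y / det_joint)


rho = 0.9  # correlation between X and Y, both with unit variance

mi = gaussian_mutual_information(1.0, 1.0, rho)
mi_from_entropies = (
    gaussian_entropy(1.0) + gaussian_entropy(1.0) - gaussian_entropy(1.0 - rho ** 2, dim=2)
)
mi_independent = gaussian_mutual_information(1.0, 1.0, 0.0)
print(mi, mi_from_entropies, mi_independent)
```

The mutual information is zero for independent variables and grows without bound as the correlation approaches 1, matching the qualitative description in Section II.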