
Deep Learning: Generalization Requires Deep Compositional Feature Space Design

by Mrinal Haloi, et al.

Generalization error defines the discriminability and the representation power of a deep model. In this work, we claim that feature space design using deep compositional functions plays a significant role in generalization, along with explicit and implicit regularization. We support these claims with several image classification experiments. We show that the information loss due to convolution and max pooling can be reduced with the compositional design, improving generalization performance. We also show that learning rate decay acts as an implicit regularizer in deep model training.





I Introduction

The massive success of deep learning in almost every field reflects its ability to solve complex problems. The trade-off between model complexity and accuracy is an important area of deep learning research. Very complex models with millions of parameters [8, 9] have proved to be the state-of-the-art solution for many vision and natural language problems. A common way to measure the performance or generalizability of a deep learning model is to test it on a discriminative validation/test set that represents the variation of samples for the problem. Learning a very complex model requires high computing power and a huge dataset, so it is important to understand the optimal complexity required for a problem in order to reduce the computational burden. In recent work, Zhang et al. [6] proved that a simple two-layer neural network with $2n + d$ parameters can represent any function on a sample of size $n$ in $d$ dimensions. It is interesting to see that a simple multilayer perceptron with ReLU activation can fit a dataset of random labels with zero training error, but with poor generalizability. Mere memorization of the data is to blame for the poor performance on a test set. Preventing the network from memorizing data samples in an inefficient, random high-dimensional space is an important model design paradigm. At the same time, it is acceptable for a model to memorize the data in an efficient hyperspace that represents the original data distribution. Learning the original data distribution solves the poor validation set performance of a model. Designing an optimal network with a minimum number of parameters will reduce computation costs and improve performance.
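The memorization result can be made concrete with a small NumPy sketch. It follows one standard construction for a two-layer ReLU network that interpolates arbitrary labels: project the samples onto a random direction, interleave the hidden biases between the sorted projections, and solve the resulting triangular system. The function names, the toy sizes and the random data are illustrative assumptions, not taken from the experiments in this work.

```python
import numpy as np

def fit_two_layer_relu(Z, y, seed=0):
    """Fit f(z) = sum_j w[j] * relu(a.z - b[j]) so that f(Z[i]) == y[i] exactly."""
    rng = np.random.default_rng(seed)
    n, d = Z.shape
    a = rng.normal(size=d)            # d input weights shared by all hidden units
    x = Z @ a                         # 1-D projections, distinct with probability 1
    order = np.argsort(x)
    xs = x[order]
    # Interleave the biases: b[0] < xs[0] < b[1] < xs[1] < ... < b[n-1] < xs[n-1]
    b = np.empty(n)
    b[0] = xs[0] - 1.0
    b[1:] = 0.5 * (xs[:-1] + xs[1:])
    # A[i, j] = relu(xs[i] - b[j]) is lower triangular with a positive diagonal
    A = np.maximum(xs[:, None] - b[None, :], 0.0)
    w = np.linalg.solve(A, y[order])  # n output weights
    return a, b, w

def predict(Z, a, b, w):
    return np.maximum((Z @ a)[:, None] - b[None, :], 0.0) @ w

rng = np.random.default_rng(1)
n, d = 20, 5
Z = rng.normal(size=(n, d))
y = rng.integers(0, 10, size=n).astype(float)   # random labels
a, b, w = fit_two_layer_relu(Z, y)
max_error = np.max(np.abs(predict(Z, a, b, w) - y))
num_params = a.size + b.size + w.size            # equals 2n + d
```

The fit is exact on the training sample by construction, which says nothing about behaviour on unseen data; that gap is the subject of the rest of this work.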

How do we design a model that best fits the original data distribution? In this work, we present the importance of deep compositional feature space design with an optimal number of parameters. We show that the rate of feature space size reduction matters irrespective of the number of network parameters. We also define a strategy that relates the number of parameters required to a particular feature space representation.

I-A Contribution

Feature space representability:

There exists an optimal number of nonlinear transformations for representing a feature space of a particular size without losing discriminative information. Convolution is a linear operation that projects data from one space to another; applying a nonlinear activation makes the final output of the convolution transformation nonlinear. Information is lost whenever features are transformed from one space to another, and this loss can be understood using singular value decomposition. To support our points, we trained multiple networks on the CIFAR10 dataset with the same number of parameters but different rates of feature space reduction. The following observations have a strong impact on learning and generalization performance:

  • The rate of reduction of the feature space size with respect to the number of convolution operations.

  • Convolution vs. max pooling for feature space reduction.

An optimal number of model parameters: There exists an optimal number of model parameters for a model to achieve high generalization accuracy. This number depends on the compositional design and the feature space reduction rate.
Implicit regularizer: The learning rate decay policy acts as an implicit regularizer, boosting model performance and speeding up learning.
Note: here the term feature space size refers to the width and height of a feature map, not the number of feature maps.

I-B Related Work

Zhang et al. (2017) [6] studied the representational power of a network with respect to the training sample size and showed that a deep model can memorize any dataset with random labels, which does not imply generalization. Their findings also indicate that explicit regularization alone cannot prevent poor generalization performance. Bartlett (1998) [1] showed that the VC dimension is not relevant for measuring the generalization performance of neural networks; rather, the norm of the network weights is the more relevant measure. That work defined a fat-shattering dimension for error estimation that depends on parameter magnitude. Maass (1995) established VC dimension bounds for neural networks with various activation functions for generalization analysis. Krogh et al. (1992) showed that weight decay suppresses irrelevant components of the weight vector and reduces noise, hence lowering the generalization error. Hardt et al. (2016) introduced an upper bound on the generalization error of a model trained with stochastic gradient descent for convex and non-convex optimization problems. Neyshabur et al. (2015) [3] showed that, apart from adding weight decay or implicit regularization, increasing the network size improves generalization performance for a model learned with stochastic gradient descent. They also asserted that with a very high number of hidden units, a weight-decay-regularized network can be considered a convex neural net for optimization. On the effectiveness of deep versus shallow networks, Mhaskar et al. (2016) [10] showed that the VC dimension and the fat-shattering dimension are smaller for deep networks than for shallow networks. They argued for the benefits of compositional function design for scalability and shift invariance in image and text data. None of the above works discusses the prominent effect of the information lost when data is projected from one space to another. We show here that this information loss affects the generalization error of both deep and shallow networks.

II Feature Space Design in Deep Neural Networks

II-A Information Loss

The standard convolution operation used in deep networks is linear. When the feature space is projected from one space to another using a convolution or pooling operation, there is always a loss of information. The information loss depends on the dimension and capacity of the projected space, and can be understood through singular value decomposition (SVD). If X denotes the input map of a convolutional layer, equation (1) refers to the special case where the input matrix for the SVD is Hermitian and positive definite. When data is projected from its original dimension to a lower-dimensional, low-rank approximation, information is lost. A 2-D convolution of a single feature map with a single kernel can be interpreted as projecting data from one space to another. Significant information is lost when the convolution stride is greater than 1. Another way to calculate the information retrieved after convolution is to use the mutual information between two signals. For two independent signals X and Y, the mutual information I(X, Y) = 0, and it is maximal if X = Y (fully correlated). For X and Y we define the information loss as follows:


Mutual information between the original data and projected data (convolution/pooling) is given as follows


For normal random vectors X and Y mutual information can be calculated as follows:


Convolution of an image with a separable 2-D kernel is equivalent to two successive 1-D convolutions, one along each axis.


The covariance matrices in this expression are those of the respective variables. Since we use batch normalization and input whitening, it is safe to assume that the input and output of a convolution are normally distributed. In addition, the convolutional kernels are initialized as normal variables. If the convolution input X is sampled from a normal distribution and the kernel K is also normally distributed, the output Y is also taken to be normally distributed. Using eq. 3 and eq. 5, we can calculate the information loss for convolution/pooling.
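The two quantities this section relies on can be sketched numerically. The first function returns the fraction of squared Frobenius norm discarded by the best rank-r approximation obtained from the SVD. The second returns the mutual information of two jointly Gaussian vectors from their joint covariance, using the standard closed form. Both function names and the toy inputs are illustrative assumptions, and this is not the exact information-loss measure defined in this work.

```python
import numpy as np

def svd_truncation_loss(X, rank):
    """Fraction of squared Frobenius norm lost by the best rank-`rank` approximation."""
    s = np.linalg.svd(np.asarray(X, dtype=float), compute_uv=False)
    return float(np.sum(s[rank:] ** 2) / np.sum(s ** 2))

def gaussian_mutual_information(cov_joint, dim_x):
    """I(X, Y) in nats for jointly Gaussian (X, Y).

    cov_joint is the covariance of the stacked vector [X; Y];
    the first dim_x rows/columns belong to X.
    """
    cov_joint = np.asarray(cov_joint, dtype=float)
    cov_x = cov_joint[:dim_x, :dim_x]
    cov_y = cov_joint[dim_x:, dim_x:]
    _, logdet_x = np.linalg.slogdet(cov_x)
    _, logdet_y = np.linalg.slogdet(cov_y)
    _, logdet_xy = np.linalg.slogdet(cov_joint)
    return 0.5 * (logdet_x + logdet_y - logdet_xy)

# Toy checks: a diagonal matrix and a pair of scalar signals with correlation 0.8
loss_rank1 = svd_truncation_loss(np.diag([3.0, 4.0]), rank=1)
mi_correlated = gaussian_mutual_information([[1.0, 0.8], [0.8, 1.0]], dim_x=1)
mi_independent = gaussian_mutual_information(np.eye(2), dim_x=1)
```

For independent signals the mutual information is zero, and it grows without bound as the correlation approaches one, matching the qualitative description above.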

II-B Compositional Design of Convolutional Layer

A single convolution layer followed by max pooling results in high information loss. Increasing the number of convolution kernels in a layer does not necessarily solve the problem. Before reducing the feature space size, it is important to project the features into high-dimensional hyperspaces using multiple convolution operations. It is also important to apply a non-linearity and batch normalization [13] after each convolution operation to achieve highly uncorrelated hyperspace projections. This stacked design best reflects the representation power of a compositional function. In addition, the VC dimension of a compositional design is smaller than that of a shallow design [10]. With the composition of multiple convolution operations, the receptive field grows in polynomial order. The compositional design is inspired by the visual cortex, where receptive fields increase in size for higher visual areas. It has been established [12] that visual cortex receptive fields are larger for a simple scene and smaller for a complex scene. A compositional design has both smaller and larger receptive fields, which capture information related to simple and complex objects and thereby decrease information loss.


The above equations ((6), (7)) are only valid for convolution stride 1.

II-C Convolution vs Max Pooling

Claim 1.1 Strided convolution replaces max/avg pooling with better generalization performance.

Feature space reduction using max pooling is a very crude projection into another hyperspace. The max pooling operation is a lossy non-linear transformation. Translational invariance, one of the advantages of max pooling, can be represented equally well by an affine transformation, which strided convolution provides. The information loss in strided convolution is lower than in hard non-linear max pooling. Fig. 1 illustrates model designs with strided convolution and max pooling.
Table II supports our claims.

Fig. 1: Left: A convolutional block design with compositional convolutional operations (conv-n-a means convolution followed by normalization and non-linearity). Middle: design with max-pool for downsampling. Right: convolution for downsampling

II-D Rate of Reduction

Claim 1.2 At least one convolutional operation is needed before reducing the feature space size of the model.

To minimize the loss of information, the data must be projected into multiple non-linear hyperspaces, which improves generalization. At least one convolution without stride is preferable for the first layer of the model, and at most 4 convolution operations without residual connections for intermediate layers; with residual connections, further convolutional operations can be added. It is also noteworthy that residual connections only facilitate the training of deeper models.

III Generalization Requires Feature Space Analysis

III-A VC Dimension and Fat-Shattering Dimension

A function $f$ is called a Lipschitz function if it satisfies the following condition for a smallest constant $L$:

$$|f(x) - f(y)| \le L\,|x - y| \quad \text{for all } x, y$$

For deep networks, a non-linearity such as ReLU is a Lipschitz function with $L = 1$, and the sigmoid with $L = 1/4$.

VC dimension for a neural network class H with layers, inputs and ReLU activation function is given as follows [5]


The significance of VC dimension analysis for deep convolutional neural network training is marginal. The theoretical complexity in terms of the total number of parameters of a convolutional model is a very high upper bound to consider in generalization analysis. It has been established that implicit and explicit regularization improves generalization. Fat Shattering dimension of a neural network class G with layers and inputs is given as follows [2] for some constant and ; number of points -shattered by G


The fat-shattering dimension is a better bound than the VC dimension for learning algorithms, as it suggests that minimizing the values of the network parameters is important for better generalization. The values of the model parameters can be minimized using a regularizer, achieved by adding an extra penalty term to the cost function using the respective norms.


III-B Implicit and Explicit Regularization

The most widely used and effective explicit regularizers are data augmentation, data balancing, dropout, and L1 and L2 regularization.

Data Augmentation: Because of the increasing capacity of deep networks and the scarcity of sufficiently discriminative labeled data, it is useful to generate deformed versions of the original training examples using affine transformations such as rotation, translation, shearing, mirroring and random cropping. In addition, color space transformations such as RGB to Lab or HSV, as well as cropping and resizing, have proved effective in reducing overfitting.

Data Balancing:

For a dataset with biased sample classes, learning tends to overfit the class with high bias, resulting in poor generalization performance. Sample balancing methods such as uniform sampling and stratified sampling resample the data to form balanced mini-batches according to a given probability distribution over the classes.

Dropout: Randomly dropping layer activations realizes an ensemble of many functions for that layer, which helps reduce overfitting.

L1 & L2 regularizer: The L1 regularizer encourages sparsity by minimizing the L1 norm of the model weights. The L2 regularizer penalizes model complexity, leading to small weights. Combining L1 and L2 regularization encourages sparsity with small weights. The effectiveness of each regularizer depends on the application; in general, L2 regularization works well.

Normalization: Batch Normalization (BN) [13] is one of the de facto implicit regularizers for faster learning and better generalization of feedforward deep convolutional models. BN normalizes the layer inputs to a zero-mean, unit-variance distribution. A model with BN no longer needs a bias term. For recurrent neural networks, layer normalization [14] has proved more efficient than batch normalization. Another useful implicit regularizer is Local Response Normalization (LRN). LRN [16] is inspired by the lateral inhibition of an excited neuron: unbounded activations are normalized using the values in a local window. For applications such as person re-identification [15], LRN outperforms BN.

An experimental analysis of the effectiveness of batch normalization and dropout is given in Table III.
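The remark that BN removes the need for a bias term can be checked with a short NumPy sketch. It implements only the training-time forward pass (batch statistics, no running averages), and the function name, shapes and random data are illustrative assumptions. Adding a per-feature bias before normalization is cancelled by the mean subtraction, so the output is unchanged.

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each feature (column) over the batch axis, then scale and shift."""
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
pre_activation = rng.normal(loc=3.0, scale=2.0, size=(64, 8))  # batch of 64, 8 features
bias = rng.normal(size=(1, 8))                                  # hypothetical layer bias

out_plain = batch_norm(pre_activation)
out_biased = batch_norm(pre_activation + bias)  # the bias is removed by mean subtraction
```

The learnable offset `beta` plays the role the bias would otherwise have, which is why the preceding layer can drop it.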

III-C Feature Space Analysis

Claim 1.3: The rate of reduction of the feature space size with respect to the number of convolutional layers plays an important role in generalization.
Section II gives a detailed analysis of the impact of feature space design on representing features without losing significant information. Efficient feature space design improves generalization performance by a fair margin. Table I supports our claim.

III-D Learning Rate Decay

Claim 1.4: The learning rate decay policy acts as an implicit regularizer for deep model learning.
The learning rate decay policy plays a major role in learning generalized parameters with faster convergence. As learning progresses, exploration in the local neighborhood becomes more important for avoiding oscillation and ill-conditioning. The second-order Taylor series approximation of the cost function $J$ after a gradient step with learning rate $\epsilon$ is:

$$J(\theta - \epsilon g) \approx J(\theta) - \epsilon\, g^{\top} g + \frac{1}{2}\epsilon^{2}\, g^{\top} H g$$

where $g$ and $H$ are the gradient and the Hessian of the cost function. When the value of $\frac{1}{2}\epsilon^{2} g^{\top} H g$ is large, the cost increases; this effect is known as ill-conditioning and is a common problem in deep learning training. Learning rate decay alleviates the effect of ill-conditioning, leading to exploration of the low-cost region. In practice, polynomial decay works very well in comparison to step decay or inverse decay. Step decay needs more supervision and is better avoided.

From Table IV we can see that polynomial decay performs much better than step decay.

III-E Optimal Number of Parameters

In deep models the impact of the VC dimension is marginal, while the fat-shattering dimension plays an important role in regularization. Feature space design comes into play when determining the optimal number of parameters. An inefficient shallow model or an excessively deep model may underfit or overfit the data, resulting in poor generalization performance. As discussed in Section II, feature space design plays a major role in obtaining powerful representations on uncorrelated hyperspaces, reducing information loss.

Claim 1.5: The optimal number of parameters is the number of parameters of an optimal model designed using feature space analysis.
Tables VI and I give an experimental validation of this claim.

IV Experiment Setup

Platform Details: All our experiments were carried out on a Linux server with 128GB RAM, Xeon E5-4667 v4 processor, and two Nvidia K80 GPUs.

| Model | #params (K) | test_accuracy (%) |
|---|---|---|
| design 1 | 20173 | 89.4 |
| design 2 | 20173 | 86.8 |
| design 3 | 20025 | 87.9 |

TABLE I: Results of 3 main designs

| Model | #params (K) | test_accuracy (%) |
|---|---|---|
| design 1_conv | 20948 | 91.7 |
| design 1 (max_pooling) | 20173 | 89.4 |

TABLE II: Convolution vs max pooling

| Model | dropout | batch_norm | test_accuracy (%) |
|---|---|---|---|
| design 1_conv | yes | yes | 91.7 |
| design 1_conv | yes | no | 88.2 |
| design 1_conv | no | yes | 90.1 |

TABLE III: Results of explicit regularization

| Model | policy | test_accuracy (%) |
|---|---|---|
| design 1_conv | polynomial | 91.7 |
| design 1_conv | step | 90.1 |

TABLE IV: Results of learning rate decay policy

| Model | first_layer_stride | test_accuracy (%) |
|---|---|---|
| design 1_conv | no | 91.7 |
| design 1_conv_stride | yes | 89.4 |

TABLE V: Results of rate of reduction

| Model | #params (K) | test_accuracy (%) |
|---|---|---|
| design 1_conv | 20948 | 91.7 |
| design 4 | 21573 | 89.3 |

TABLE VI: Results of optimal design vs deeper design

| block | design 1 | design 1_conv | design 2 | design 3 | design 4 |
|---|---|---|---|---|---|
| input | 28 x 28 x 3 | 28 x 28 x 3 | 28 x 28 x 3 | 28 x 28 x 3 | 28 x 28 x 3 |
| block1 | 1 x conv3x3, 1, 64 | 1 x conv3x3, 1, 64 | 1 x conv3x3, 1, 64 | conv3x3, 64 | 2 x conv3x3, 1, 64 |
| block2 | max_pool | 1 x conv3x3, 2, 64 | max_pool | max_pool | 1 x conv3x3, 2, 64 |
| block2_1 | - | - | - | - | 1 x conv1x1, 2, 128 |
| block3 | 2 x conv3x3, 1, 128 | 2 x conv3x3, 1, 128 | 1 x conv3x3, 1, 128 | 1 x conv3x3, 1, 128 | 3 x conv3x3, 1, 128 |
| block3_1 | - | - | max_pool | - | block2_1 + block3 |
| block3_2 | - | - | 1 x conv3x3, 1, 128 | - | 3 x conv3x3, 1, 128 |
| block4 | max_pool | 1 x conv3x3, 2, 128 | max_pool | max_pool | 1 x conv3x3, 2, 128 |
| block5 | 4 x conv3x3, 1, 256 | 4 x conv3x3, 1, 256 | 4 x conv3x3, 1, 256 | 4 x conv3x3, 1, 256 | 4 x conv3x3, 1, 256 |
| block6 | max_pool | 1 x conv3x3, 2, 256 | max_pool | max_pool | 1 x conv3x3, 2, 256 |
| block7 | 1 x conv1x1, 1, 4096 | 1 x conv1x1, 1, 4096 | 1 x conv1x1, 1, 4096 | 1 x conv1x1, 1, 4096 | 1 x conv1x1, 1, 4096 |
| block7_1 | dropout | dropout | dropout | dropout | dropout |
| block8 | 1 x conv1x1, 1, 4096 | 1 x conv1x1, 1, 4096 | 1 x conv1x1, 1, 4096 | 1 x conv1x1, 1, 4096 | 1 x conv1x1, 1, 4096 |
| block8_1 | dropout | dropout | dropout | dropout | dropout |
| block9 | 1 x conv1x1, 1, 10 | 1 x conv1x1, 1, 10 | 1 x conv1x1, 1, 10 | 1 x conv1x1, 1, 10 | 1 x conv1x1, 1, 10 |

TABLE VII: Network design

Dataset: To validate our claims we used the CIFAR10 [17] image classification dataset. It has 10 object classes and is divided into two splits for training and validation. The training split has 50000 images and the validation split has 10000 images. Each image is $32 \times 32$ pixels with RGB color channels.

Preprocessing: For training, a randomly cropped patch of size $28 \times 28$ (the input size listed in Table VII) is used. Each patch is flipped left/right and up/down based on coin flips. In addition, we adjust the image color by scaling its values into a fixed range and changing its hue, contrast, and saturation. Each image (training/validation) is standardized by subtracting its mean and dividing by its standard deviation.

For evaluation, the central crop of each image is selected and resized using bilinear interpolation.
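The geometric part of this pipeline can be sketched in NumPy as follows. The sketch covers the random crop, the coin-flip mirroring, per-image standardization and the central crop for evaluation; the color adjustments and the bilinear resize are left out. The function names and the 28-pixel crop default (read from the input row of Table VII) are assumptions, not the TEFLA implementation.

```python
import numpy as np

def standardize(img):
    """Per-image standardization: zero mean and unit standard deviation."""
    img = img.astype(np.float64)
    return (img - img.mean()) / max(img.std(), 1e-8)

def train_preprocess(img, rng, crop=28):
    """Random crop, then independent left/right and up/down flips."""
    h, w, _ = img.shape
    top = rng.integers(0, h - crop + 1)
    left = rng.integers(0, w - crop + 1)
    patch = img[top:top + crop, left:left + crop]
    if rng.random() < 0.5:
        patch = patch[:, ::-1]   # left/right flip
    if rng.random() < 0.5:
        patch = patch[::-1, :]   # up/down flip
    return standardize(patch)

def eval_preprocess(img, crop=28):
    """Deterministic central crop for evaluation."""
    h, w, _ = img.shape
    top, left = (h - crop) // 2, (w - crop) // 2
    return standardize(img[top:top + crop, left:left + crop])

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(32, 32, 3)).astype(np.uint8)  # stand-in for a CIFAR10 image
train_patch = train_preprocess(image, rng)
eval_patch = eval_preprocess(image)
```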

Framework: We used TEFLA [18], a Python framework built on top of TensorFlow [19], for all experiments described in this work.

Model: Table VII details the model design for the different experiments. Each entry follows the convention (n x conv, s, c), where n is the number of convolutions in the compositional design, s is the convolution stride and c is the number of kernels in each convolution layer; the kernel size is given after conv. In all designs used in this work, each convolutional layer is followed by batch normalization and a non-linearity (ReLU in our experiments).
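The parameter counts reported in Tables I and II can be cross-checked directly from the layer specifications in Table VII. The sketch below assumes bias-free convolutions and a batch normalization scale and offset for every convolution except the final 10-way layer. These assumptions are not stated in the text; they were chosen because they reproduce the reported counts for designs 1, 1_conv, 2 and 3. All names are illustrative.

```python
def stack(n, k, c_in, c_out):
    """n stacked k x k convolutions; only the first one changes the channel count."""
    return [(k, c_in if i == 0 else c_out, c_out) for i in range(n)]

# Blocks 7 to 9 of Table VII, shared by every design
HEAD = stack(1, 1, 256, 4096) + stack(1, 1, 4096, 4096) + stack(1, 1, 4096, 10)

# Convolutions as (kernel_size, in_channels, out_channels); pooling has no parameters
DESIGNS = {
    "design 1": stack(1, 3, 3, 64) + stack(2, 3, 64, 128) + stack(4, 3, 128, 256),
    "design 1_conv": (stack(1, 3, 3, 64) + stack(1, 3, 64, 64)
                      + stack(2, 3, 64, 128) + stack(1, 3, 128, 128)
                      + stack(4, 3, 128, 256) + stack(1, 3, 256, 256)),
    "design 2": (stack(1, 3, 3, 64) + stack(1, 3, 64, 128)
                 + stack(1, 3, 128, 128) + stack(4, 3, 128, 256)),
    "design 3": stack(1, 3, 3, 64) + stack(1, 3, 64, 128) + stack(4, 3, 128, 256),
}

def count_params(body):
    layers = body + HEAD
    total = 0
    for i, (k, c_in, c_out) in enumerate(layers):
        total += k * k * c_in * c_out      # convolution weights, no bias (assumed)
        if i < len(layers) - 1:
            total += 2 * c_out             # batch-norm scale and offset (assumed)
    return total

params_in_thousands = {name: count_params(body) // 1000 for name, body in DESIGNS.items()}
```

Under these assumptions, design 1 and design 2 contain exactly the same convolutions and differ only in where the feature map is downsampled, which is the comparison Table I is built on.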

IV-A Results Analysis

Table I shows the performance of each model on the CIFAR10 validation/test set. For design 1 and design 2, which have the same number of parameters, generalization performance varies significantly, underlining the importance of feature space size and of minimizing information loss.

Table II shows the advantage of strided convolution over max pooling for feature space size reduction. A significant performance gain is observed when using convolution instead of max pooling, implying that the information loss of max pooling is higher than that of strided convolution. For design 1_conv, if we use strided convolution in the first layer instead of the second, there is a significant drop in generalization performance even though the number of parameters remains the same (Table V).

Table III shows the importance of dropout and batch normalization for generalization. The effect of batch normalization is larger than that of dropout.

Table IV supports our claim that the learning rate decay policy also acts as an implicit regularizer, improving generalization performance. Polynomial decay is very robust and requires minimal supervision, yielding better generalization performance.

Performance does not always improve with depth: an optimal design performs better than a deeper design, as the experimental results for two designs in Table VI show. From this, we conclude that there exists an optimal number of parameters for generalization.

V Conclusion

In this work, a detailed analysis of the generalization performance trade-off of deep models is presented. We showed that compositional feature space design, together with implicit and explicit regularization, plays an important role in achieving better performance. In terms of model complexity, the traditional measure, the VC dimension, does not give much information, but fat-shattering dimension analysis has an indirect effect on generalization. Our experiments showed that the optimal model satisfies the compositional design criteria and has the optimal number of parameters. We conclude with the claim that combining compositional feature space design with explicit and implicit regularization and efficient optimization algorithms gives the best generalization performance for any dataset.


VI Appendix

VI-A Effective Receptive Field


where is the distance between two adjacent feature maps; is the convolution stride, is the kernel size and is the receptive fields size.
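A minimal sketch of how these quantities are usually combined, assuming the common recurrence r_out = r_in + (k - 1) * j_in and j_out = j_in * s, where j is the distance (in input pixels) between adjacent features, s the stride, k the kernel size and r the receptive field size. The symbols and the function name are assumptions; the original equations may use different notation.

```python
def receptive_field(layers):
    """layers: iterable of (kernel_size, stride) pairs, first layer first.

    Returns (receptive_field_size, jump) measured in input pixels.
    """
    r, j = 1, 1                # a single input pixel sees itself; adjacent pixels are 1 apart
    for k, s in layers:
        r = r + (k - 1) * j    # each layer widens the field by (k - 1) jumps
        j = j * s              # striding spreads adjacent features further apart
    return r, j

three_plain_convs = receptive_field([(3, 1), (3, 1), (3, 1)])
with_strided_conv = receptive_field([(3, 1), (3, 2), (3, 1)])
```

With stride 1 throughout, three 3x3 convolutions cover a 7x7 input window; placing a stride-2 convolution in the middle widens the window to 9x9 because later layers step over two input pixels at a time.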

VI-B Learning Rate Decay Policy

Commonly used learning rate decay policies are given below:


where and are two constants; is the initial learning rate, is the learning for (current iteration). is the maximum number of iteration for learning and is the step for changing learning rate for step policy.
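The three policies compared in this work can be sketched as below, assuming the forms commonly used in deep learning solvers: a base rate scaled by (1 - t / T)^power for polynomial decay, by gamma^floor(t / step) for step decay, and by (1 + gamma * t)^(-power) for inverse decay. The function names and the default constants are hypothetical and are not the values used in the experiments.

```python
def polynomial_decay(base_lr, iteration, max_iter, power=0.9):
    """Decays smoothly from base_lr at iteration 0 to zero at max_iter."""
    return base_lr * (1.0 - iteration / max_iter) ** power

def step_decay(base_lr, iteration, step, gamma=0.1):
    """Multiplies the rate by gamma once every `step` iterations."""
    return base_lr * gamma ** (iteration // step)

def inverse_decay(base_lr, iteration, gamma=1e-4, power=0.75):
    """Decays as an inverse power of the iteration count."""
    return base_lr * (1.0 + gamma * iteration) ** (-power)

schedule = [polynomial_decay(0.1, t, max_iter=100) for t in range(0, 101, 25)]
```

Polynomial decay needs only the total number of iterations, whereas step decay requires choosing when each drop happens, which is the extra supervision noted in Section III-D.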

VI-C Mutual Information for Normal Random Variables

For a normal random variable X, the entropy of X is calculated as follows

For another normal random variable Y, the mutual information between X and Y can be calculated as follows:
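For reference, the standard closed forms for Gaussian vectors are sketched below. The notation is assumed rather than taken from the original: $n$ is the dimension of X, $\Sigma_X$ and $\Sigma_Y$ are the covariance matrices of X and Y, and $\Sigma$ is the covariance matrix of the stacked vector (X, Y).

```latex
\begin{align*}
h(X) &= \frac{1}{2}\log\!\left((2\pi e)^{n}\,\det\Sigma_X\right) \\
I(X, Y) &= h(X) + h(Y) - h(X, Y)
         = \frac{1}{2}\log\frac{\det\Sigma_X\,\det\Sigma_Y}{\det\Sigma}
\end{align*}
```

When X and Y are independent, $\Sigma$ is block diagonal, so $\det\Sigma = \det\Sigma_X \det\Sigma_Y$ and the mutual information is zero, consistent with Section II-A.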