
Smaller Models, Better Generalization

Reducing network complexity has become a major research focus in recent years with the advent of mobile technology. Convolutional Neural Networks that perform various vision tasks without large memory overheads are the need of the hour. This paper presents a qualitative and quantitative analysis of reducing network complexity using an upper bound on the Vapnik-Chervonenkis dimension, pruning, and quantization. We observe a general trend of improved accuracies as we quantize the models. We propose a novel loss function that helps achieve considerable sparsity at accuracies comparable to those of dense models. We compare various regularizations prevalent in the literature and show the superiority of our method in achieving sparser models that generalize well.




1 Introduction

Deep Neural Networks have been very successful in a wide variety of tasks. They have been applied to image classification [22, 13, 33], text analytics [28, 15], handwriting generation [8], image captioning [20], automatic game playing [27, 32], speech recognition [12], machine translation [4, 38] and many others. LeCun et al. [23] and Schmidhuber [31] provide extensive reviews of deep learning and its applications.

The representational power of a neural network increases with its depth, as is evident from architectures like Highway Networks [36] (32 layers and 1.25M parameters) and ResNet [14] (110 layers and 1.7M parameters). Such a large number of weights presents a challenge in terms of storage capacity, memory bandwidth and representational redundancy. For example, the widely used AlexNet Caffemodel is over 200MB, and the VGG-16 Caffemodel is over 500MB. With the advent of mobile technologies and IoT devices, the need for faster and more accurate computing has arisen. Sparse matrix multiplications and convolutions are a lot faster than their dense counterparts. Furthermore, a sparse model with few parameters gains an advantage in terms of better generalization ability, thereby preventing overfitting. The effect of various regularizers on CNNs (Convolutional Neural Networks) is studied in [6].

In this paper we introduce a novel loss function to achieve sparsity by minimizing a convex upper bound on the Vapnik-Chervonenkis (VC) dimension. We first derive an upper bound on the VC dimension of the classifier layer of a neural network, and then apply this bound to the intermediate layers of the network, in conjunction with the weight-decay (L2-norm) regularization bound. This result provides us with a novel error functional to optimize over with backpropagation for training neural network architectures, modified from the traditional learning rules.

This learning rule adapts the model weights to minimize both the empirical error on the training data and the VC dimension of the neural network. By including a term that minimizes the VC dimension, we aim to achieve sparser neural networks, which allow us to remove a large number of synapses and neurons without any penalty on empirical performance.

Finally, we demonstrate the consistent effectiveness of the learning rule across a variety of learning algorithms on various datasets spanning several learning task domains. We see that the data dependent rule promotes higher test set accuracies and faster convergence, and achieves smaller models across architectures such as Feedforward (Fully Connected) Neural Networks (FNNs) and Convolutional Neural Networks (CNNs), confirming our hypothesis that the algorithm indeed controls model complexity while improving generalization performance.

The rest of the paper is organised as follows: in Section 2 we provide a brief overview of recent relevant work on complexity control and generalization in deep neural networks; in Section 3 we provide the derivation of our learning rule and proofs of the theoretical bounds; Section 4 describes the effect of quantization on the VC dimension bound of the network; and in Section 5 we describe our experimental setup and methodology, along with qualitative and quantitative analyses of our experiments.

2 Related Works

Compression of deep nets has been widely studied, with network pruning and quantization being the methods of choice. Researchers have used weight and neuron removal to instigate sparsity. [11] used iterative deletion of weights and neurons to achieve sparsity. [42, 30] used group sparse regularization on weights to incorporate sparsity. [37] used iterative sparsification based on neural correlations. [25] used optimal brain damage to enforce sparsity. [34] removed redundant neurons based on the saliency of two weight sets. [41] used second order Taylor information to prune neurons. [2] pruned the net using a sparse matrix transformation, keeping the layer input and output close to those of the original unpruned model. [35] used a bimodal regularizer to enforce sparsity, and [3] merged pairs of neurons with high correlations.
A rich body of literature exists on quantizing the models as well. [10] built on their earlier model by adding quantization and Huffman coding. [24] used weight binarization and quantized the learned representations in each layer to achieve the same. [29] binarized both the weights and the inputs to the convolutional layers. [26] proposed a cluster based quantization method to convert pre-trained full precision weights to ternary weights with minimal loss in accuracy. [17] quantized weights and activations, and incorporated quantized gradients with 6 bits in their training.

3 Sparsifying Neural Networks through Pruning

In this section we derive an upper bound on the VC dimension. This proof is an extension of the one in [18]. Vapnik [40] showed that the VC dimension γ of fat margin hyperplane classifiers with margin d ≥ d_min satisfies

γ ≤ 1 + min(R²/d_min², n),

where R denotes the radius of the smallest sphere enclosing all the training samples.

Let us consider a dataset with M samples and n features; the individual samples are denoted by x_i, i = 1, 2, ..., M. We first consider the case of a linearly separable dataset. By definition, there exists a hyperplane, parameterized by a weight vector w and a bias term b, with positive margin that can classify these points with zero error. We can always choose a suitable scaling of w and b; for all further discussion we assume that this is the case. The samples are assumed to lie in a high dimension; this assumption is reasonable because the samples either inherently have a large number of features, and are thus linearly separable owing to Cover’s theorem [7], or they have been transformed from the input space to a high dimensional space by a nonlinear transformation. The case when the samples are linearly separable in a small dimension is not interesting, as such cases are of a trivial nature. Thus we have,


Let us consider the problem of minimizing this fraction, which amounts to minimizing the upper bound on the VC dimension.


Since both the numerator and denominator are positive quantities, we can alternatively write (3) as:


We simplify the value of this fraction to attain a tractable convex bound in terms of the weights of the network.


With proper scaling of the weights and bias, we can write the minimum distance of a correctly classified point from the hyperplane as follows.


Using (7), we convert (6) to the following optimization problem.


Since for any two positive numbers the following inequality holds:


Applying the inequality (9) to (8), we obtain the following upper bound on the fraction


For a separating hyperplane that passes through the data, the maximum distance of a point from the plane can be bounded in terms of the radius of the data. Thus we can extend the bound using the radius of the dataset as:


Using the bound derived in (11), we can write (10) as:


For positive numbers, the following inequality holds,


Using (13) in (12), we have the following bound


Finally, we arrive at a convex and differentiable version of the bound on the VC dimension, which can be minimized using stochastic gradient descent and used in conjunction with various architectures. The following bound acts as a data dependent regularizer when used alongside the loss function minimization. Here we present the effectiveness of the bound in reducing the number of connections in the network.


3.1 A bound on neural networks

We now use the bound (15) in the context of a multi-layer feedforward neural network. Consider a neural network with multiple hidden layers for a multiclass classification problem with C classes. Let the number of neurons in the penultimate layer be denoted by n, and let their outputs be denoted by z; let the corresponding connecting weights for the classifier layer be denoted by w_c and the biases at the output by b_c, for c = 1, ..., C. One may view the outputs of this layer as a map φ from the input x to z, i.e. z = φ(x). The score for input pattern x_i at output c is then given by w_c · z_i + b_c. For the purposes of this paper, we use the multiclass hinge loss following the work of Tang [39], which reports the superiority of the hinge loss over the softmax loss. Applying the bound (15) to the classification layer of the neural network leads us to the following optimization problem:
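As a concrete illustration, a minimal numpy sketch of one common (Crammer-Singer style) form of the multiclass hinge loss follows; this is an assumption for illustration, not the authors' exact implementation, and the margin of 1 is a placeholder:

```python
import numpy as np

def multiclass_hinge_loss(scores, labels):
    """Crammer-Singer style multiclass hinge loss: penalize any class
    whose score comes within a margin of 1 of the true class's score."""
    n = scores.shape[0]
    correct = scores[np.arange(n), labels][:, None]     # true-class scores
    margins = np.maximum(0.0, scores - correct + 1.0)   # margin violations
    margins[np.arange(n), labels] = 0.0                 # no penalty for the true class
    return margins.sum(axis=1).mean()
```

In training, this term would be summed with the data dependent regularizer derived above.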


3.2 Application of the bound on hidden layers

The great advantage of this bound is that it can be applied to the pre-activations across all layers of the net. When applied to the pre-activations, it is interpreted as a data dependent regularizer that forces the pre-activations to be close to zero. For ReLU activation functions, this in turn enforces sparsity at the neuronal level in the intermediate layers. In principle, during back-propagation this is tantamount to solving a least squares problem for each neuron where the targets are all zero. Consider a feedforward architecture with multiple hidden layers. For an intermediate layer, let the activations of its neurons be computed from the layer's weights and biases, and let the target for each pre-activation of each sample be zero. The application of (15) to the pre-activations with the ReLU activation function is then equivalent to the following minimization problem.


With the application of the VC bound (17) to all the layers, the final minimization problem can be derived from (16) as:


4 Trade-off between margin and error: Role of quantization

Consider a binary classification problem with M samples, where sample i is denoted by x_i and its corresponding label by y_i ∈ {−1, +1}. Let us define a fat margin hyperplane classifier f(x) = w · x + b, where w denotes the weights and b the bias term. Let w_q denote the quantized weights and b_q the quantized bias term. Without loss of generality, we can consider hyperplanes passing through the origin. To see that this is possible, we augment the coordinates of all samples with an additional feature whose value is always 1, and assume that the weight vector is correspondingly (n + 1)-dimensional; the classifier then becomes f(x) = u · x for the augmented weight vector u. Following the above notation, the quantized version of the vector u is denoted u_q.

Theorem 1.

Consider a full precision and a quantized fat margin classifier with upper bounds h and h_q on their respective VC dimensions. If the quantization does not increase the norm of the augmented weight vector, then the quantized classifier has the smaller VC bound (h_q ≤ h).


Given a set of linearly separable data points and the two fat margin classifiers, the former with full precision and the latter with quantized weights: if the predicted label assigned to each data point by the two classifiers is the same, which implies that the two classifiers have the same accuracy, then the difference in the scores for each sample, multiplied by its class label, must be positive.


It can be easily shown that (19) is true if,

The condition (21) translates to the requirement that, when we assign a smaller number of mantissa bits to the weights, the error introduced by the reduction in fraction bits is sufficiently small. The argument for this condition comes from the fact that if (21) holds, then the sign of each quantized score remains the same as that of its full precision counterpart. Since quantization then does not flip the sign of any individual term, (21) preserves the sign of the sum given by eq. (19). This implies,


Analogously to eq. (15), where we define the bound h for the full precision classifier, the quantized counterpart h_q can be defined for the quantized classifier. Now, using eq. (22) and eq. (23), we have,


Thus, by introducing quantization, one can reduce the complexity of the classifier. This is also evident from the fact that the size of the hypothesis class reduces as the precision is reduced. ∎
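To make the role of fraction bits concrete, the following numpy sketch shows a fixed-point quantizer (an illustrative scheme under our assumptions; the paper's exact quantizer may differ). With f fraction bits, rounding to the nearest grid point keeps the per-weight error below 2^-(f+1), which is the kind of smallness condition that keeps the signs in (19) unchanged:

```python
import numpy as np

def quantize_fixed_point(w, total_bits=8, frac_bits=4):
    """Round weights to a signed fixed-point grid with `frac_bits`
    fractional bits, clipping to the representable range."""
    scale = 2.0 ** frac_bits
    # largest representable magnitude for a signed fixed-point value
    max_val = (2.0 ** (total_bits - 1) - 1) / scale
    q = np.round(w * scale) / scale
    return np.clip(q, -max_val, max_val)
```

For example, with 2 fraction bits the grid spacing is 0.25, so any weight moves by at most 0.125 unless it falls outside the clipping range.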

5 Empirical Analysis and Observations

We determine the effectiveness of network pruning and quantization on various network architectures, namely Convolutional Neural Networks (CNNs) and fully-connected neural networks (FNNs), using various data independent regularizers, such as the L1 and L2 norms on the weights and dropout, and the proposed data dependent regularizer (15).

5.1 Setup and Notation

All our experiments were run on a GPU cluster with NVIDIA Tesla K40 GPUs. The CNN experiments were implemented using the Caffe library [19], the FNN experiments using Tensorflow [1], and the quantization of FNNs using Matlab [9].
Hyperparameter settings: The two main hyperparameters in our experiments are the regularization coefficients described in Section 3; both were tuned over a multiplicative grid of values. Other parameters, such as the dropout rate, were kept at their default values for the densely connected nets and quick-net. The learning rate was tuned over two values. For CNNs the learning rate was decayed by a constant factor at fixed intervals of iterations, whereas for FNNs the learning rate was decreased after every epoch (one complete pass over the data). The total number of iterations was fixed for CNNs, as was the number of epochs for FNNs.
The notation used in the experimental results is given in Table 1.

Symbol Meaning
H    Hinge loss
W2   L2 weight regularization
W1   L1 weight regularization
LCL  LCNN applied only on last layer
LCA  LCNN applied on all layers
D    Dropout
BN   Batch normalization
Table 1: Tabular representation of notation.

5.2 Network Pruning

To analyse the efficacy of our regularizer in attaining sparsity, we prune the network after training has finished. First, we select a minimum weight threshold. Then we compute the absolute values of the weights in each layer, and divide the interval between the maximum absolute weight in each layer and the minimum threshold into 50 (for FNNs) or 100 (for CNNs) steps. Finally, we loop over these steps and prune the weights whose absolute magnitude falls below the step value.
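The threshold sweep described above can be sketched as follows (a simplified numpy illustration; `t_min` is a placeholder for the paper's unstated minimum threshold, and the accuracy evaluation performed at each step is omitted):

```python
import numpy as np

def pruning_sweep(layers, t_min=1e-3, n_steps=50):
    """For each candidate threshold between t_min and the per-layer
    maximum |weight|, zero out weights below the threshold and record
    the resulting compression ratio (total weights / nonzero weights).
    In the full procedure, accuracy is evaluated at each step so the
    best threshold can be selected."""
    results = []
    for step in range(1, n_steps + 1):
        pruned, total, nonzero = [], 0, 0
        for W in layers:
            # interpolate the threshold between t_min and max |W|
            t = t_min + (np.abs(W).max() - t_min) * step / n_steps
            Wp = np.where(np.abs(W) < t, 0.0, W)
            pruned.append(Wp)
            total += W.size
            nonzero += np.count_nonzero(Wp)
        results.append((pruned, total / max(nonzero, 1)))
    return results
```

Each entry of `results` pairs the pruned weight matrices with the compression ratio they achieve.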

5.2.1 CNNs: Datasets

Our first set of experiments is performed on an image classification task using CNNs. Table 2 describes the standard image classification dataset used in the pruning and quantization experiments.

name features classes train size val size test size
Cifar 10 [21] 32×32×3 10 50000 5000 5000
Table 2: Dataset used for CNN experiments

5.2.2 CNNs: Experiments

We studied the effect of pruning and quantization on two CNN architectures, namely Caffe quick-net [19] and the Caffe implementation of densely connected convolutional nets [16] with 40 layers. We studied various regularizations and found that data dependent regularization achieves the maximum sparsity, and thus the maximum compression ratio, compared to its contemporary regularizations.

Table 3 shows the compression ratios achieved when we prune the trained model: weight regularization achieves the best compression, followed by our data dependent regularizer. Table 4 shows the accuracies; our data dependent regularizer reaches the best accuracy in the pool while maintaining sparsity. We compare the effect of pruning and quantization on the various regularizers visually using 2-dimensional tSNE plots of the final layer of the densely connected CNNs; Figure 2 shows the results. We observe that the data dependent regularizer leads to more compact clusters, thus achieving better generalization on the Cifar 10 dataset. The plots for the pruned and quantized networks are visually similar to their unpruned counterparts, yet on closer inspection one finds that some classes, such as Automobile, Horse, Cat and Airplane, form clusters with better compactness and separability than in the unpruned and un-quantized models.

Pruning Quantization
S 1.41 1.28
S + LCA 1.29 1.07
S + W 6.95 6.03
S + W + BN 1.92 2.33
S + W + BN + LCA 1.33 1.93
S + W + BN + LCA + D 1.16 2.20
S + W + D 3.20 1.46
S + W + D + BN 1.56 2.48
S + W + D + LCA 3.77 1.53
S + W + D + LCL 2.65 1.05
S + W + LCA 1.89 1.04
S + W + LCL 3.95 1.08
Table 3: Compression ratio for cifar 10 quick net model
Original acc Pruned acc Quantization acc
S 0.73 0.72 0.71
S + LCA 0.77 0.77 0.75
S + W 0.77 0.76 0.76
S + W + BN 0.80 0.79 0.77
S + W + BN + LCA 0.78 0.77 0.73
S + W + BN + LCA + D 0.79 0.78 0.78
S + W + D 0.77 0.76 0.76
S + W + D + BN 0.79 0.78 0.79
S + W + D + LCA 0.74 0.73 0.79
S + W + D + LCL 0.78 0.73 0.77
S + W + LCA 0.77 0.76 0.78
S + W + LCL 0.79 0.78 0.79
Table 4: Accuracies for cifar 10 quick net model

Table 5 shows the accuracies on Cifar 10 before and after pruning and quantization on densely connected CNNs [16]. We observe that our regularization performs equally well when used in conjunction with dropout and a weight regularizer.

Original acc Pruned Acc Quantization acc
H +D 0.901 0.887 0.901
H +W +D 0.924 0.913 0.923
H +W +LCL +D 0.924 0.914 0.920
H + W 0.900 0.886 0.897
H + W + LCL 0.895 0.888 0.889
H +W1 0.866 0.853 0.857
H +W1 +LCL 0.857 0.847 0.854
Table 5: Accuracies for cifar 10 densely connected CNN

Figure 1 shows the accuracies of the various algorithms against the total number of bits after we perform the first round of pruning. We see that our regularizer (H + W + LCL) has the best accuracies among all the algorithms and is quite robust to changes in the total number of bits.

Figure 1: Accuracies for various algorithms after pruning and then quantizing the number of bits
(a) H + W2
(b) H + W2 + P + Q
(c) H + W2 + LCL
(d) H + W2 + LCL + P + Q
Figure 2: tSNE visualization of the last layer in densenet [16] for 50 random test samples from each class of Cifar 10 under various regularizations. The notation in the figures corresponds to H = Hinge loss, W or W2 = L2 weight regularization, LCL = data dependent regularizer applied on the last layer only, P = pruning applied, Q = quantization applied. We observe that in both cases, figures (1(c)) and (1(d)) show better clustering than figures (1(a)) and (1(b)).
Unpruned Pruned
a9a 0.826 0.848 0.849 0.848 0.848 0.849 0.818 0.842 0.840 0.839 0.842 0.840
acoustic 0.781 0.781 0.781 0.778 0.773 0.781 0.779 0.779 0.779 0.778 0.769 0.779
connect-4 0.815 0.820 0.819 0.812 0.813 0.819 0.809 0.810 0.810 0.809 0.805 0.810
dna 0.851 0.941 0.954 0.938 0.941 0.953 0.845 0.938 0.950 0.930 0.938 0.944
ijcnn 0.968 0.964 0.974 0.965 0.964 0.974 0.962 0.955 0.967 0.956 0.955 0.967
mnist 0.968 0.968 0.938 0.947 0.940 0.933 0.959 0.959 0.929 0.940 0.937 0.930
protein 0.617 0.676 0.685 0.667 0.676 0.685 0.614 0.668 0.677 0.658 0.668 0.677
seismic 0.737 0.740 0.741 0.738 0.740 0.741 0.729 0.736 0.741 0.738 0.736 0.741
w8a 0.988 0.988 0.988 0.984 0.982 0.988 0.979 0.981 0.979 0.974 0.972 0.979
webspam uni 0.985 0.985 0.985 0.984 0.971 0.985 0.978 0.978 0.978 0.975 0.963 0.978
Table 6: Accuracies for various methods for 1 hidden layer FNN
Unpruned Pruned
a9a 0.831 0.849 0.845 0.843 0.847 0.841 0.827 0.841 0.840 0.834 0.839 0.831
acoustic 0.777 0.775 0.779 0.776 0.775 0.777 0.777 0.766 0.775 0.771 0.766 0.767
connect-4 0.804 0.817 0.816 0.816 0.821 0.824 0.798 0.813 0.811 0.808 0.815 0.823
dna 0.812 0.938 0.957 0.906 0.938 0.895 0.803 0.930 0.954 0.898 0.930 0.886
ijcnn 0.982 0.980 0.979 0.972 0.980 0.979 0.977 0.972 0.973 0.967 0.972 0.974
mnist 0.953 0.957 0.958 0.953 0.959 0.943 0.948 0.955 0.957 0.944 0.952 0.939
protein 0.596 0.664 0.670 0.605 0.664 0.604 0.587 0.658 0.670 0.599 0.658 0.597
seismic 0.744 0.743 0.746 0.725 0.738 0.738 0.743 0.740 0.746 0.725 0.733 0.731
w8a 0.986 0.985 0.986 0.972 0.975 0.987 0.978 0.976 0.978 0.970 0.970 0.977
webspam uni 0.986 0.983 0.986 0.969 0.983 0.981 0.979 0.978 0.979 0.965 0.978 0.978
Table 7: Accuracies for various methods for 2 hidden layer FNN
Unpruned Pruned
a9a 0.832 0.847 0.845 0.845 0.847 0.840 0.822 0.838 0.836 0.840 0.838 0.831
acoustic 0.779 0.775 0.779 0.775 0.775 0.771 0.777 0.772 0.777 0.769 0.772 0.767
connect-4 0.815 0.820 0.816 0.816 0.817 0.813 0.805 0.811 0.816 0.808 0.808 0.808
dna 0.761 0.934 0.957 0.903 0.932 0.856 0.756 0.931 0.950 0.902 0.922 0.852
ijcnn 0.980 0.982 0.981 0.977 0.982 0.977 0.972 0.973 0.974 0.975 0.973 0.971
mnist 0.958 0.960 0.961 0.954 0.955 0.945 0.955 0.954 0.960 0.951 0.946 0.944
protein 0.621 0.656 0.675 0.657 0.668 0.627 0.614 0.648 0.668 0.651 0.662 0.617
seismic 0.736 0.745 0.742 0.728 0.727 0.739 0.736 0.742 0.735 0.728 0.722 0.735
w8a 0.970 0.981 0.980 0.973 0.972 0.982 0.970 0.972 0.971 0.970 0.970 0.975
webspam uni 0.979 0.979 0.979 0.979 0.979 0.979 0.973 0.970 0.973 0.973 0.970 0.973
Table 8: Accuracies for various methods for 3 hidden layer FNN
(a) accuracy FNN1
(b) loss FNN1
(c) margin FNN1
(d) accuracy FNN2
(e) loss FNN2
(f) margin FNN2
(g) accuracy FNN3
(h) loss FNN3
(i) margin FNN3
Figure 3: Effect of quantization on accuracy, margin and loss function for 1, 2 and 3 hidden layer FNNs on the dataset 'dna'. Here we see that even on decreasing the total number of bits (applying brute force to determine the number of fraction bits using a 1% error tolerance), the accuracy does not significantly decrease even when the total number of bits is as low as 4. In a peculiar observation, we see that in all cases, at some value of the total number of bits, the accuracy increases slightly compared to full precision; this value differs across regularizers.

Similar results were obtained for the Cifar100 and MNIST datasets; these can be found in the supplementary section.

5.2.3 FNNs: Datasets

We use 10 datasets from the LIBSVM website [5] to demonstrate the effectiveness of our method compared to other methods. The datasets vary in the number of features, classes, and training set sizes, thus covering a wide variety of applications of neural networks.

name features classes train size val size test size
a9a 122 2 26049 6512 16281
acoustic 50 3 63058 15765 19705
connect-4 126 3 40534 13512 13511
dna 180 3 1400 600 1186
ijcnn 22 2 35000 14990 91701
mnist 778 10 47999 12001 10000
protein 357 3 14895 2871 6621
seismic 50 3 63060 15763 19705
w8a 300 2 39800 9949 14951
webspam uni 254 2 210000 70001 69999
Table 9: Datasets used for FNN experiments adopted from [5]

5.2.4 FNNs: Experiments

In this set of experiments we show the individual effects of pruning and quantization on a wide range of regularizers prevalent in the neural network domain. We also test the efficacy of our regularizer in achieving sparsity across neural network sizes ranging from 1 to 3 hidden layers. The number of neurons in each layer was set to 50.
Tables 6-8 show the accuracies obtained on the datasets for the unpruned and pruned networks. We vary the number of hidden layers from 1 to 3 and evaluate the test set accuracies. We find that for the 1 hidden layer FNN, weight regularization and regularization with the data dependent term have the highest accuracies on 7 out of 10 datasets; for the pruned networks this regularization also has the best performance. Similar observations can be made for networks with two and three hidden layers.
Tables 10-12 demonstrate the compression ratios for the individual networks. We observe that the regularizers with the data dependent term outperform the others on 9 out of 10 datasets for the network with 1 hidden layer, on 7 out of 10 for networks with two hidden layers and on 8 out of 10 for networks with 3 hidden layers. The compression ratios range from 1.0 to 5063 with pruning alone.

Here the compression ratio is defined as the ratio of the total number of weights in the network to the number of nonzero weights after pruning.

a9a 2.1 133.0 99.2 2.4 133.0 99.2
acoustic 1.2 1.2 1.2 1.0 1.5 1.2
connect-4 1.5 2.3 5.1 1.1 2.3 5.1
dna 1.8 167.3 262.9 19.5 167.3 85.2
ijcnn 1.5 7.0 3.2 1.6 7.0 3.2
mnist 2.3 2.3 2.3 1.8 3.0 6.5
protein 1.8 37.1 35.3 4.6 37.1 35.3
seismic 1.1 1.4 1.3 1.0 1.4 1.3
w8a 3.1 75.0 3.1 1377.5 1515.2 3.1
webspam uni 1.3 1.3 1.3 2.2 2.8 1.3
Table 10: Compression ratios for various methods for 1 hidden layer FNN
a9a 1.3 157.1 72.1 3.2 204.7 23.6
acoustic 1.0 2.0 2.0 1.4 2.0 2.7
connect-4 1.4 1.9 5.0 1.6 3.8 4.4
dna 1.4 55.7 172.8 4.0 55.7 2.7
ijcnn 1.6 7.1 8.1 2.0 7.1 6.4
mnist 1.4 2.9 2.4 1.5 2.9 8.1
protein 1.3 51.4 35.8 1.3 51.4 2.4
seismic 1.1 2.1 1.5 1.0 2.1 6.1
w8a 4.4 4.5 4.4 2212.8 2528.9 74.7
webspam uni 1.6 4.7 1.6 1.8 4.7 4.7
Table 11: Compression ratios for various methods for 2 hidden layer FNN
a9a 1.4 354.7 283.8 2.2 354.7 36.0
acoustic 1.2 3.4 1.2 1.5 3.4 5.9
connect-4 1.5 2.1 3.1 1.5 3.4 8.7
dna 1.3 3.6 596.0 4.4 18.2 2.9
ijcnn 2.1 13.5 12.5 1.8 13.5 9.0
mnist 1.3 3.6 3.1 1.6 3.4 6.2
protein 1.3 6.5 39.4 1.7 46.9 2.4
seismic 1.0 4.8 10.0 1.0 82.1 9.5
w8a 1265.8 2531.5 2531.5 4050.4 5063.0 22.0
webspam uni 1.9 9.4 1.9 1.7 9.4 2.9
Table 12: Compression ratios for various methods for 3 hidden layer FNN

5.3 FNN: Quantization

Figure 3 shows the effect of quantization on the generalization abilities of neural networks. We performed quantization on the trained networks. We show the accuracy, margin and loss for multiple regularizers as the total number of bits is varied from 16 to 2. For every value of the total number of bits, the number of fraction bits was varied from 3 to 15, and the value yielding the best test set accuracy was selected. We observe that for the 1 hidden layer network, the regularizer with the data dependent term, despite having the highest accuracy to start with, is the least robust, as its accuracy tapers off quickly as the total number of bits decreases, whereas the regularizer based on minimizing the VC bound is the most robust. For the other networks, our proposed data dependent regularizer performs comparably to the other regularizers. One peculiar observation in Figure 3 is a peak in accuracy at a certain bit value. One possible explanation is that quantization noise may allow the network to reach a better minimum, thus achieving higher accuracy than its full precision counterpart.
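The brute-force choice of fraction bits can be sketched as below (a numpy illustration; `evaluate` is a hypothetical stand-in for the test-set accuracy computation, which we do not reproduce here):

```python
import numpy as np

def best_fraction_bits(w, evaluate, total_bits):
    """Try every fraction-bit split for a given total bit budget and keep
    the one whose quantized weights score highest under `evaluate`
    (e.g. test-set accuracy)."""
    best_f, best_score = None, -np.inf
    for f in range(3, 16):               # fraction bits varied from 3 to 15
        scale = 2.0 ** f
        max_val = (2.0 ** (total_bits - 1) - 1) / scale
        wq = np.clip(np.round(w * scale) / scale, -max_val, max_val)
        score = evaluate(wq)
        if score > best_score:
            best_f, best_score = f, score
    return best_f, best_score
```

Repeating this search for each total-bit budget from 16 down to 2 yields the curves of Figure 3.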

6 Conclusion and Discussion

This paper extends the ideas of minimal complexity machines [18] and learns the weights of a neural network by minimizing the empirical error and an upper bound on the VC dimension. An added advantage of using such a bound is the reduction in model complexity. We observe that pruning and then quantizing the models helps to achieve comparable or better sparsity in terms of weights, and allows for better generalization abilities.

We proposed a theoretical framework to reduce the model complexity of neural networks, and ran multiple experiments on various benchmark datasets. These benchmarks offer diversity in terms of the number of samples and the number of features. The results consistently demonstrate that our data dependent regularizer generalizes better than conventional regularization of CNNs and FNNs.

The approach presented in this paper is generic, and can be adapted to many other settings and architectures. In our experiments we used a single global hyperparameter for the data dependent term; this could be further improved by using separate hyperparameters for individual layers.


  • [1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al. (2016) Tensorflow: large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467. Cited by: §5.1.
  • [2] A. Aghasi, N. Nguyen, and J. Romberg (2016) Net-trim: a layer-wise convex pruning of deep neural networks. arXiv preprint arXiv:1611.05162. Cited by: §2.
  • [3] M. Babaeizadeh, P. Smaragdis, and R. H. Campbell (2016) NoiseOut: a simple way to prune neural networks. arXiv preprint arXiv:1611.06211. Cited by: §2.
  • [4] D. Bahdanau, K. Cho, and Y. Bengio (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Cited by: §1.
  • [5] C. Chang and C. Lin (2011) LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST) 2 (3), pp. 27. Cited by: §5.2.3, Table 9.
  • [6] M. D. Collins and P. Kohli (2014) Memory bounded deep convolutional networks. arXiv preprint arXiv:1412.1442. Cited by: §1.
  • [7] T. M. Cover (1968) Capacity problems for linear machines. pp. 283–289. Cited by: §3.
  • [8] A. Graves (2013) Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850. Cited by: §1.
  • [9] Matlab User's Guide (1998) The MathWorks, Inc., Natick, MA 5, pp. 333. Cited by: §5.1.
  • [10] S. Han, H. Mao, and W. J. Dally (2015) Deep compression: compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149. Cited by: §2.
  • [11] S. Han, J. Pool, J. Tran, and W. Dally (2015) Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pp. 1135–1143. Cited by: §2.
  • [12] A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, et al. (2014) Deep speech: scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567. Cited by: §1.
  • [13] K. He, X. Zhang, S. Ren, and J. Sun (2015) Delving deep into rectifiers: surpassing human-level performance on imagenet classification. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1026–1034. Cited by: §1.
  • [14] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: §1.
  • [15] E. H. Huang, R. Socher, C. D. Manning, and A. Y. Ng (2012) Improving word representations via global context and multiple word prototypes. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1, pp. 873–882. Cited by: §1.
  • [16] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten (2016) Densely connected convolutional networks. arXiv preprint arXiv:1608.06993. Cited by: Figure 2, §5.2.2, §5.2.2.
  • [17] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio (2016) Quantized neural networks: training neural networks with low precision weights and activations. arXiv preprint arXiv:1609.07061. Cited by: §2.
  • [18] Jayadeva (2015) Learning a hyperplane classifier by minimizing an exact bound on the VC dimension. Neurocomputing 149, pp. 683–689. Cited by: §3, §6.
  • [19] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell (2014) Caffe: convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM international conference on Multimedia, pp. 675–678. Cited by: §5.1, §5.2.2.
  • [20] A. Karpathy and L. Fei-Fei (2015) Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3128–3137. Cited by: §1.
  • [21] A. Krizhevsky and G. Hinton (2009) Learning multiple layers of features from tiny images. Cited by: Table 2.
  • [22] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §1.
  • [23] Y. LeCun, Y. Bengio, and G. Hinton (2015) Deep learning. Nature 521 (7553), pp. 436–444. Cited by: §1.
  • [24] Z. Lin, M. Courbariaux, R. Memisevic, and Y. Bengio (2015) Neural networks with few multiplications. arXiv preprint arXiv:1510.03009. Cited by: §2.
  • [25] C. Liu, Z. Zhang, and D. Wang (2014) Pruning deep neural networks by optimal brain damage.. In INTERSPEECH, pp. 1092–1095. Cited by: §2.
  • [26] N. Mellempudi, A. Kundu, D. Das, D. Mudigere, and B. Kaul (2017) Mixed low-precision deep learning inference using dynamic fixed point. arXiv preprint arXiv:1701.08978. Cited by: §2.
  • [27] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller (2013) Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602. Cited by: §1.
  • [28] J. Pennington, R. Socher, and C. D. Manning (2014) Glove: global vectors for word representation.. In EMNLP, Vol. 14, pp. 1532–1543. Cited by: §1.
  • [29] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi (2016) Xnor-net: imagenet classification using binary convolutional neural networks. In European Conference on Computer Vision, pp. 525–542. Cited by: §2.
  • [30] S. Scardapane, D. Comminiello, A. Hussain, and A. Uncini (2016) Group sparse regularization for deep neural networks. arXiv preprint arXiv:1607.00485. Cited by: §2.
  • [31] J. Schmidhuber (2015) Deep learning in neural networks: an overview. Neural networks 61, pp. 85–117. Cited by: §1.
  • [32] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. (2016) Mastering the game of go with deep neural networks and tree search. Nature 529 (7587), pp. 484–489. Cited by: §1.
  • [33] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §1.
  • [34] S. Srinivas and R. V. Babu (2015) Data-free parameter pruning for deep neural networks. arXiv preprint arXiv:1507.06149. Cited by: §2.
  • [35] S. Srinivas, A. Subramanya, and R. V. Babu (2016) Training sparse neural networks. arXiv preprint arXiv:1611.06694. Cited by: §2.
  • [36] R. K. Srivastava, K. Greff, and J. Schmidhuber (2015) Training very deep networks. In Advances in neural information processing systems, pp. 2377–2385. Cited by: §1.
  • [37] Y. Sun, X. Wang, and X. Tang (2016) Sparsifying neural network connections for face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4856–4864. Cited by: §2.
  • [38] I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pp. 3104–3112. Cited by: §1.
  • [39] Y. Tang (2013) Deep learning using linear support vector machines. arXiv preprint arXiv:1306.0239. Cited by: §3.1.
  • [40] V. Vapnik (1998) Statistical learning theory. Wiley. External Links: ISBN 978-0-471-03003-4 Cited by: §3.
  • [41] N. Wolfe, A. Sharma, L. Drude, and B. Raj (2017) The incredible shrinking neural network: new perspectives on learning representations through the lens of pruning. arXiv preprint arXiv:1701.04465. Cited by: §2.
  • [42] H. Zhou, J. M. Alvarez, and F. Porikli (2016) Less is more: towards compact cnns. In European Conference on Computer Vision, pp. 662–677. Cited by: §2.