Deep Neural Networks have been very successful in a variety of tasks. They have been applied to image classification [22, 13, 33], text analytics [28, 15], handwriting generation, image captioning, automatic game playing [27, 32], speech recognition, machine translation [4, 38] and many others. Bengio et al. and Schmidhuber provide extensive reviews of deep learning and its applications.
The representational power of a neural network increases with its depth, as is evident from architectures like Highway Networks (32 layers, 1.25M parameters) and ResNet (110 layers, 1.7M parameters). Such a large number of weights presents a challenge in terms of storage capacity, memory bandwidth and representational redundancy. For example, the widely used AlexNet Caffemodel is over 200MB, and the VGG-16 Caffemodel is over 500MB. With the advent of mobile technologies and IoT devices, the need for fast and accurate computing on small devices has arisen. Sparse matrix multiplications and convolutions are considerably faster than their dense counterparts. Furthermore, a sparse model with few parameters has the advantage of better generalization ability, thereby preventing overfitting. The effect of various regularizers on CNNs (Convolutional Neural Networks) is studied in .
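To make the speed and storage claim concrete, here is a small, self-contained sketch (illustrative only, not from the paper's experiments) comparing a dense weight matrix with its sparse representation using SciPy; the matrix sizes and the 95% sparsity level are arbitrary choices:

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)

# A dense weight matrix in which 95% of the entries have been pruned to zero.
dense_w = rng.standard_normal((512, 512))
dense_w[rng.random((512, 512)) < 0.95] = 0.0

sparse_w = sparse.csr_matrix(dense_w)  # compressed sparse row storage
x = rng.standard_normal((512, 64))

# Both products are numerically identical; the sparse one touches only the
# ~5% surviving weights and needs far less memory for the matrix itself.
y_dense = dense_w @ x
y_sparse = sparse_w @ x
print(np.allclose(y_dense, y_sparse))  # True
```

The CSR format stores only the nonzero entries and their indices, which is where both the storage and the multiplication savings come from.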
In this paper we introduce a novel loss function to achieve sparsity by minimizing a convex upper bound on the Vapnik-Chervonenkis (VC) dimension. We first derive an upper bound on the VC dimension of the classifier layer of a neural network, and then apply this bound to the intermediate layers of the network, in conjunction with weight-decay and norm regularization bounds. This provides a novel error functional that can be optimized with backpropagation for training neural network architectures, modified from the traditional learning rules.
This learning rule adapts the model weights to minimize both the empirical error on the training data and the VC dimension of the neural network. By including a term that minimizes the VC dimension, we aim to learn sparser neural networks, which allow us to remove a large number of synapses and neurons without any penalty on empirical performance.
Finally, we demonstrate the consistent effectiveness of the learning rule across a variety of learning algorithms on datasets spanning several task domains. We find that the data dependent rule promotes higher test set accuracies, faster convergence, and smaller models across architectures such as feedforward (fully connected) neural networks (FNNs) and convolutional neural networks (CNNs), confirming our hypothesis that the algorithm indeed controls model complexity while improving generalization performance.
The rest of the paper is organised as follows: Section 2 provides a brief overview of recent relevant work on complexity control and generalization in deep neural networks; Section 3 provides the derivation of our learning rule and proofs of the theoretical bounds; Section 4 describes the effect of quantization on the VC dimension bound of the network; Section 5 describes our experimental setup and methodology, along with qualitative and quantitative analyses of our experiments.
2 Related Works
Compression of deep networks has been widely studied; network pruning and quantization are the methods of choice. Researchers have used removal of weights and neurons to instigate sparsity.  used iterative deletion of weights and neurons to achieve sparsity. [42, 30] used group sparse regularization on weights to induce sparsity.  used iterative sparsification based on neural correlations.  used optimal brain damage to enforce sparsity.  removed redundant neurons based on the saliency of two weight sets.  used second-order Taylor information to prune neurons.  pruned the network using a sparse matrix transformation, keeping each layer's input and output close to those of the original unpruned model.  used a bimodal regularizer to enforce sparsity, and  merged pairs of neurons with high correlation.
A rich body of literature exists on quantizing models as well.  build their model on top of their earlier work by adding quantization and Huffman coding.  used weight binarization and quantization of the learned representations in each layer to achieve the same.  binarized both the weights and the inputs to the convolutional layers.  proposed a cluster-based quantization method to convert pre-trained full precision weights to ternary weights with minimal loss in accuracy.  quantized weights and activations, and incorporated gradients quantized to 6 bits in their training.
3 Sparsifying Neural Networks through Pruning
The VC dimension $\gamma$ of fat margin hyperplane classifiers with margin $d \geq d_{\min}$ satisfies
$$\gamma \leq 1 + \min\left(\frac{R^2}{d_{\min}^2},\, n\right),$$
where $R$ denotes the radius of the smallest sphere enclosing all the training samples.
Let us consider a dataset with $M$ samples and $n$ features. The individual samples are denoted by $x^i \in \mathbb{R}^n$, and $R$ denotes the radius of the smallest sphere enclosing all the training samples. We first consider the case of a linearly separable dataset. By definition, there exists a hyperplane, parameterized by a weight vector $w$ and a bias term $b$, with positive margin $d$ that can classify these points with zero error. We can always choose a value $d_{\min} \leq d$; for all further discussion we assume that this is the case. The samples are assumed to lie in a high dimensional space; this assumption is reasonable because the samples either inherently have a large number of features and are thus linearly separable, owing to Cover's theorem, or they have been transformed from the input space to a high dimensional space by a nonlinear transformation. The case where the samples are linearly separable in a small dimension is not interesting, as such problems are of a trivial nature. Thus we have,
Let us consider the problem of minimizing the fraction $\frac{R^2}{d_{\min}^2}$ as minimizing the upper bound on the VC dimension.
Since both the numerator and the denominator are positive quantities, with $R > 0$ and $d_{\min} > 0$, we can alternatively write (3) as:
We simplify the value of the fraction $\frac{R^2}{d_{\min}^2}$ to attain a tractable convex bound in terms of the weights of the network.
Without loss of generality, by proper scaling of $w$ and $b$, we can write the minimum distance of a correctly classified point from the hyperplane as:
Since, for two positive numbers $a$ and $b$, the following inequality holds:
For a separating hyperplane that passes through the data, the maximum distance of a point from the plane is greater than the radius of the dataset. Thus we can extend the bound on the radius of the dataset as:
For positive numbers $a_1, \ldots, a_k$, the following inequality holds:
Finally, we arrive at a convex and differentiable version of the bound on the VC dimension, which can be minimized using stochastic gradient descent and used in conjunction with various architectures. The following bound acts as a data dependent regularizer when used alongside the loss function. Here we present the effectiveness of the bound for reducing the number of connections in the network.
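As an illustration of how such a bound can act as a data dependent regularizer, the following sketch (our own minimal NumPy rendering; the names `C`, `w`, `b` and the exact functional form are assumptions, not the paper's precise objective) adds a term that drives the classifier-layer scores towards zero alongside a hinge loss:

```python
import numpy as np

def data_dependent_penalty(scores):
    """Convex surrogate for the VC-dimension bound: pushes the classifier
    layer's pre-activation scores towards zero, shrinking the quantity
    playing the role of R^2 / d_min^2 in the derivation above."""
    return np.mean(scores ** 2)

def hinge_loss(scores, y):
    """Binary hinge loss with labels y in {-1, +1}."""
    return np.mean(np.maximum(0.0, 1.0 - y * scores))

rng = np.random.default_rng(1)
X = rng.standard_normal((32, 10))          # a toy mini-batch
y = np.sign(rng.standard_normal(32))       # random +/-1 labels
w, b = rng.standard_normal(10), 0.0
C = 0.1                                    # hypothetical trade-off hyperparameter

scores = X @ w + b
total = hinge_loss(scores, y) + C * data_dependent_penalty(scores)
```

Because both terms are convex and differentiable in `w` and `b`, the combined objective can be minimized with ordinary stochastic gradient descent.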
3.1 A bound on the neural network
We now use the bound (15) in the context of a multi-layer feedforward neural network. Consider a network with multiple hidden layers for the problem of multiclass classification with $C$ classes. Let the number of neurons in the penultimate layer be denoted by $n_L$, and let their outputs be denoted by $z$; let the corresponding connecting weights of the classifier layer be denoted by $w^1, \ldots, w^C$, with biases $b^1, \ldots, b^C$ at the output. One may view the outputs of this layer as a map $\phi$ from the input $x$ to $\mathbb{R}^{n_L}$, i.e. $z = \phi(x)$. The score for class $j$ on input pattern $x^i$ is given by $s^i_j = (w^j)^\top \phi(x^i) + b^j$. For the purposes of this paper, we use the multiclass hinge loss following the work of Tang et al., where the authors state the superiority of hinge loss over softmax loss. Applying the bound (15) to the classification layer of the neural network leads us to the following optimization problem:
3.2 Application of the bound to hidden layers
A great advantage of this bound is that it can be applied to the pre-activations across all layers of the network. When applied to the pre-activations, our data dependent regularizer forces the pre-activations of each layer to be close to zero, and thus enforces sparsity at the neuronal level in the intermediate layers. In principle, during back-propagation this is tantamount to solving a least squares problem for each neuron in which the targets are all zero. Consider a feedforward architecture with $L$ hidden layers. For an intermediate layer $l$, let the activations of the layer with $n_l$ neurons be $h^l$. Let $W^l$ be the weights of the layer going from $l-1$ to $l$ and $b^l$ be the corresponding biases. Let us assume that the target for each pre-activation of each sample is zero. Hence, the application of (15) to the pre-activations with the ReLU activation function is equivalent to the following minimization problem.
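A minimal sketch of this idea (our own illustration; the layer widths and names are hypothetical) accumulates the squared pre-activations of every layer as a penalty during the forward pass:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward_with_penalty(x, weights, biases):
    """Forward pass through a small FNN that also accumulates the data
    dependent penalty: the mean squared pre-activation of every layer
    (i.e. a least squares term whose targets are all zero)."""
    penalty = 0.0
    h = x
    for W, b in zip(weights, biases):
        z = h @ W + b              # pre-activation of this layer
        penalty += np.mean(z ** 2)  # push pre-activations towards zero
        h = relu(z)
    return h, penalty

rng = np.random.default_rng(2)
sizes = [10, 50, 50, 5]  # hypothetical layer widths
Ws = [rng.standard_normal((a, b)) * 0.1 for a, b in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(b) for b in sizes[1:]]

out, pen = forward_with_penalty(rng.standard_normal((8, 10)), Ws, bs)
```

Adding `pen` (scaled by a hyperparameter) to the task loss drives many pre-activations, and hence neurons, towards zero, which is what enables the pruning studied later.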
4 Trade-off between margin and error: Role of quantization
Consider a binary classification problem with $M$ samples, where sample $i$ is denoted by $x^i$ and its corresponding label by $y^i \in \{-1, +1\}$. Let us define a fat margin hyperplane classifier $f(x) = w^\top x + b$, where $w$ denotes the weights and $b$ the bias term. Let $w_q$ be the quantized weights and $b_q$ be the quantized bias term. Without loss of generality, we can consider hyperplanes passing through the origin. To see that this is possible, we augment the co-ordinates of all samples with an additional dimension or feature whose value is always 1; we correspondingly assume that the weight vector is $(n+1)$-dimensional. The classifier then becomes $f(x) = \tilde{w}^\top \tilde{x}$. Following the above notation, the quantized version of a vector $u$ is denoted by $u_q$.
Consider a full precision and a quantized fat margin classifier with upper bounds on their VC dimensions denoted by $\gamma$ and $\gamma_q$ respectively. If the two classifiers assign the same label to every data point, then the quantized classifier has the smaller VC bound ($\gamma_q \leq \gamma$).
Consider a set of linearly separable data points and two fat margin classifiers, the former with full precision weights and the latter with quantized weights. If the predicted label assigned to each data point by the two classifiers is the same, which implies that the two classifiers have the same accuracy, then the difference in the scores for each sample, multiplied by its class label, must be positive.
It can easily be shown that (19) holds if:
Condition (21) translates to the fact that we assign a small enough number of mantissa bits to the weights, so that the change incurred during the reduction in fraction bits stays below the stated quantity. The argument for this condition comes from the fact that if (21) holds, then the sign of each score remains the same as that of the full precision classifier. Since quantization does not allow flipping of the signs of the individual terms, (21) ensures the same sign of the sum given by eq. (19). This implies,
Thus, by introducing quantization, one can reduce the complexity of the classifier. This is also evident from the fact that the size of the hypothesis class reduces as the precision is reduced. ∎
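The kind of fixed-point quantization discussed above can be sketched as follows (an illustrative NumPy implementation under our own conventions for the sign, integer and fraction bits; the paper's exact scheme may differ):

```python
import numpy as np

def quantize_fixed_point(w, total_bits=8, frac_bits=4):
    """Round weights to a signed fixed-point grid with `frac_bits`
    fractional bits, and clip to the representable range."""
    scale = 2.0 ** frac_bits
    qmax = 2.0 ** (total_bits - 1) - 1           # largest integer code
    q = np.clip(np.round(w * scale), -qmax - 1, qmax)
    return q / scale

w = np.array([0.7312, -1.25, 0.0501, 3.9])
wq = quantize_fixed_point(w, total_bits=8, frac_bits=4)
# wq == [0.75, -1.25, 0.0625, 3.875]
```

Each weight moves to the nearest multiple of $2^{-\text{frac\_bits}}$, so the per-weight error is at most half a grid step; keeping that error small enough is exactly what condition (21) asks for.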
5 Empirical Analysis and Observations
We evaluate the effectiveness of network pruning and quantization on various network architectures, namely convolutional neural networks (CNNs) and fully connected neural networks (FNNs), using data independent regularizers such as the L1 and L2 norms on weights and dropout, as well as the proposed data dependent regularizer (15).
5.1 Setup and Notation
All our experiments were run on a GPU cluster with NVIDIA Tesla K40 GPUs. The CNN experiments were implemented using the Caffe library, the FNN experiments using TensorFlow, and the quantization of FNNs using MATLAB.
Hyperparameters: The two main hyperparameters in our experiments are the two regularization coefficients described in Section 3. They were tuned over a range of values in multiplicative steps. The other parameters, such as the dropout rate, were kept at their default values for the densely connected nets and quick net. The learning rate was tuned over two values. For CNNs the learning rate was multiplied by a decay factor after a fixed number of iterations, whereas for FNNs the learning rate was decreased according to a fixed schedule after every epoch (one complete pass over the data). The total number of iterations for CNNs and the number of epochs for FNNs were kept fixed.
The notation used in the experimental results is as follows:

| Notation | Meaning |
| --- | --- |
| LCL | LCNN applied only on the last layer |
| LCA | LCNN applied on all layers |
5.2 Network Pruning
To analyse the efficacy of our regularizer in attaining sparsity, we prune the network after training has finished. First, we select a minimum weight threshold. Then we compute the absolute values of the weights in each layer, and divide the interval between this minimum threshold and the maximum absolute weight in each layer into 50 (for FNNs) or 100 (for CNNs) steps. Finally, we loop over these steps and prune the weights whose absolute magnitude falls below each step value.
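The threshold sweep described above can be sketched as follows (an illustrative implementation; the threshold value and weight distribution are placeholders for the elided settings):

```python
import numpy as np

def prune_sweep(weights, min_threshold=1e-3, steps=50):
    """For each candidate threshold between `min_threshold` and the largest
    absolute weight, zero out weights below the threshold and report the
    surviving (nonzero) fraction at that threshold."""
    absw = np.abs(weights)
    thresholds = np.linspace(min_threshold, absw.max(), steps)
    results = []
    for t in thresholds:
        pruned = np.where(absw < t, 0.0, weights)
        results.append((t, np.count_nonzero(pruned) / weights.size))
    return results

rng = np.random.default_rng(3)
w = rng.standard_normal(10_000) * 0.05   # toy stand-in for trained weights
sweep = prune_sweep(w, min_threshold=1e-3, steps=50)
```

The surviving fraction decreases monotonically as the threshold grows; in the experiments, the largest threshold that preserves accuracy determines the reported compression ratio.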
5.2.1 CNNs: Datasets
Our first set of experiments is performed on an image classification task using CNNs. Table 2 describes the standard image classification dataset used in the pruning and quantization experiments.
| name | features | classes | train size | val size | test size |
| --- | --- | --- | --- | --- | --- |
| Cifar 10 | 32×32×3 | 10 | 50000 | 5000 | 5000 |
5.2.2 CNNs: Experiments
We studied the effect of pruning and quantization on two CNN architectures, namely Caffe quick net and the Caffe implementation of densely connected convolutional networks with 40 layers. We studied various regularizers and found that data dependent regularization achieves the greatest sparsity, and thus the highest compression ratio, compared to contemporary regularizations.
Table 3 shows the compression ratios achieved when the trained model is pruned: weight regularization achieves the best compression, followed by our data dependent regularizer. Table 4 shows the accuracies; our data dependent regularizer reaches the best accuracy in the pool while maintaining sparsity. We compare the effect of pruning and quantization under the various regularizers visually, using two-dimensional t-SNE plots of the final layer of the densely connected CNNs; Figure 2 shows the results. We observe that the data dependent regularizer produces compact clusters, achieving better generalization on the Cifar 10 dataset. The plots for the pruned and quantized networks are visually similar to the originals, yet on closer inspection one finds that some classes, such as Automobile, Horse, Cat and Airplane, form clusters with better compactness and separability than their unpruned and unquantized counterparts.
| Regularizers |  |  |
| --- | --- | --- |
| S + LCA | 1.29 | 1.07 |
| S + W | 6.95 | 6.03 |
| S + W + BN | 1.92 | 2.33 |
| S + W + BN + LCA | 1.33 | 1.93 |
| S + W + BN + LCA + D | 1.16 | 2.20 |
| S + W + D | 3.20 | 1.46 |
| S + W + D + BN | 1.56 | 2.48 |
| S + W + D + LCA | 3.77 | 1.53 |
| S + W + D + LCL | 2.65 | 1.05 |
| S + W + LCA | 1.89 | 1.04 |
| S + W + LCL | 3.95 | 1.08 |
| Regularizers | Original acc | Pruned acc | Quantized acc |
| --- | --- | --- | --- |
| S + LCA | 0.77 | 0.77 | 0.75 |
| S + W | 0.77 | 0.76 | 0.76 |
| S + W + BN | 0.80 | 0.79 | 0.77 |
| S + W + BN + LCA | 0.78 | 0.77 | 0.73 |
| S + W + BN + LCA + D | 0.79 | 0.78 | 0.78 |
| S + W + D | 0.77 | 0.76 | 0.76 |
| S + W + D + BN | 0.79 | 0.78 | 0.79 |
| S + W + D + LCA | 0.74 | 0.73 | 0.79 |
| S + W + D + LCL | 0.78 | 0.73 | 0.77 |
| S + W + LCA | 0.77 | 0.76 | 0.78 |
| S + W + LCL | 0.79 | 0.78 | 0.79 |
Table 5 shows the accuracies on Cifar10 before and after pruning and quantization for the densely connected CNNs. We observe that our regularization performs equally well when used in conjunction with dropout and a weight regularizer.
| Regularizers | Original acc | Pruned acc | Quantized acc |
| --- | --- | --- | --- |
| H + W + D | 0.924 | 0.913 | 0.923 |
| H + W + LCL + D | 0.924 | 0.914 | 0.920 |
| H + W | 0.900 | 0.886 | 0.897 |
| H + W + LCL | 0.895 | 0.888 | 0.889 |
| H + W1 + LCL | 0.857 | 0.847 | 0.854 |
Figure 1 shows the accuracies of the various algorithms against the total number of bits after the first round of pruning. Our regularizer (H + W + LCL) attains the best accuracies among all the algorithms and is quite robust to changes in the total number of bits.
Similar results were obtained for the Cifar100 and MNIST datasets; these can be found in the supplementary section.
5.2.3 FNNs: Datasets
We use 10 datasets from the LIBSVM website to demonstrate the effectiveness of our method compared to other methods. The datasets vary in the number of features, classes, and training set sizes, thus covering a wide variety of applications of neural networks.
(Table 9 reports, for each dataset: name, features, classes, train size, validation size, and test size.)
5.2.4 FNNs: Experiments
In this set of experiments we show the individual effects of pruning and quantization on a wide range of regularizers prevalent in the neural network domain. We also test the efficacy of our regularizer in achieving sparsity across neural network sizes ranging from 1 to 3 hidden layers. The number of neurons in each layer was set to 50.
Tables 6-8 show the accuracies obtained on the datasets for the unpruned and pruned networks. We vary the number of hidden layers from 1 to 3 and evaluate the test set accuracies. We find that for a 1 hidden layer FNN, weight regularization and regularization with the data dependent term attain the highest accuracies on 7 out of 10 datasets, whereas for the pruned network the data dependent regularization performs best. Similar observations can be made for networks with two and three hidden layers, where the same regularization attains the best accuracies.
Tables 10-12 report the compression ratios for the individual networks. We observe that regularizers with the data dependent term outperform the others on 9 out of 10 datasets for the network with 1 hidden layer, 7 out of 10 for networks with two hidden layers, and 8 out of 10 for networks with 3 hidden layers. The compression ratios range from 1.0 to 5063 with pruning alone.
Figure 3 shows the effect of quantization on the generalization ability of the networks. We performed quantization on the trained network, and show the accuracy, margin and loss for multiple regularizers as the total number of bits is varied from 16 down to 2. For every value of the total number of bits, the number of fraction bits was varied from 3 to 15, and the setting yielding the best test set accuracy was selected. We observe that for the 1 hidden layer network, the regularizer with the data dependent term, despite having the highest accuracy to start with, is the least robust, as its accuracy tapers off quickly as the total number of bits decreases, whereas the regularizer based on minimizing the VC bound is the most robust. For the other networks, our proposed data dependent regularizer performs comparably to the other regularizers. One peculiar observation in Figure 3 is a peak in accuracy at a certain bit value. A possible explanation is that quantization noise may allow the network to reach a better minimum, thus achieving higher accuracy than its full precision counterpart.
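The selection of fraction bits for each total bit budget can be sketched as a simple grid search (a toy stand-in using a linear classifier and training accuracy; the paper selects the setting by test set accuracy of full networks):

```python
import numpy as np

def quantize(w, total_bits, frac_bits):
    """Signed fixed-point quantization with clipping (same convention
    as the earlier sketch; an assumption, not the paper's exact scheme)."""
    scale = 2.0 ** frac_bits
    qmax = 2.0 ** (total_bits - 1) - 1
    return np.clip(np.round(w * scale), -qmax - 1, qmax) / scale

def best_fraction_bits(w, X, y, total_bits):
    """For a fixed total bit budget, try every split between integer and
    fraction bits and keep the one with the best accuracy of the linear
    classifier sign(X @ w_q)."""
    best = (None, -1.0)
    for frac in range(1, total_bits):
        wq = quantize(w, total_bits, frac)
        acc = np.mean(np.sign(X @ wq) == y)
        if acc > best[1]:
            best = (frac, acc)
    return best

rng = np.random.default_rng(4)
w_true = rng.standard_normal(20)
X = rng.standard_normal((200, 20))
y = np.sign(X @ w_true)                 # labels from the full precision model

frac, acc = best_fraction_bits(w_true, X, y, total_bits=8)
```

More fraction bits give finer resolution but a smaller representable range, so the best split balances rounding error against clipping error, which is why an intermediate value typically wins.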
6 Conclusion and Discussion
This paper extends the ideas of minimal complexity machines and learns the weights of a neural network by minimizing the empirical error and an upper bound on the VC dimension. An added advantage of using such a bound is a reduction in model complexity. We observe that pruning and then quantizing the models achieves comparable or better sparsity in terms of weights, and allows for better generalization.
We proposed a theoretical framework to reduce the model complexity of neural networks, and ran multiple experiments on various benchmark datasets. These benchmarks offer diversity in the number of samples and features. The results demonstrate that our data dependent regularizer generalizes better than conventional CNNs and FNNs.
The approach presented in this paper is generic, and can be adapted to many other settings and architectures. In our experiments we used a global hyperparameter for the data dependent term; this could be further improved by using separate hyperparameters for individual layers.
-  (2016) TensorFlow: large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467. Cited by: §5.1.
-  (2016) Net-trim: a layer-wise convex pruning of deep neural networks. arXiv preprint arXiv:1611.05162. Cited by: §2.
-  (2016) NoiseOut: a simple way to prune neural networks. arXiv preprint arXiv:1611.06211. Cited by: §2.
-  (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Cited by: §1.
-  (2011) LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST) 2 (3), pp. 27. Cited by: §5.2.3, Table 9.
-  (2014) Memory bounded deep convolutional networks. arXiv preprint arXiv:1412.1442. Cited by: §1.
-  (1968) Capacity problems for linear machines. pp. 283–289. Cited by: §3.
-  (2013) Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850. Cited by: §1.
-  (1998) MATLAB. The MathWorks, Inc., Natick, MA. Cited by: §5.1.
-  (2015) Deep compression: compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149. Cited by: §2.
-  (2015) Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pp. 1135–1143. Cited by: §2.
-  (2014) Deep speech: scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567. Cited by: §1.
-  (2015) Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1026–1034. Cited by: §1.
-  (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: §1.
-  (2012) Improving word representations via global context and multiple word prototypes. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1, pp. 873–882. Cited by: §1.
-  (2016) Densely connected convolutional networks. arXiv preprint arXiv:1608.06993. Cited by: Figure 2, §5.2.2, §5.2.2.
-  (2016) Quantized neural networks: training neural networks with low precision weights and activations. arXiv preprint arXiv:1609.07061. Cited by: §2.
-  (2015) Learning a hyperplane classifier by minimizing an exact bound on the VC dimension. Neurocomputing 149, pp. 683–689. Cited by: §3, §6.
-  (2014) Caffe: convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM international conference on Multimedia, pp. 675–678. Cited by: §5.1, §5.2.2.
-  (2015) Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3128–3137. Cited by: §1.
-  (2009) Learning multiple layers of features from tiny images. Cited by: Table 2.
-  (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §1.
-  (2015) Deep learning. Nature 521 (7553), pp. 436–444. Cited by: §1.
-  (2015) Neural networks with few multiplications. arXiv preprint arXiv:1510.03009. Cited by: §2.
-  (2014) Pruning deep neural networks by optimal brain damage. In INTERSPEECH, pp. 1092–1095. Cited by: §2.
-  (2017) Mixed low-precision deep learning inference using dynamic fixed point. arXiv preprint arXiv:1701.08978. Cited by: §2.
-  (2013) Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602. Cited by: §1.
-  (2014) GloVe: global vectors for word representation. In EMNLP, Vol. 14, pp. 1532–1543. Cited by: §1.
-  (2016) Xnor-net: imagenet classification using binary convolutional neural networks. In European Conference on Computer Vision, pp. 525–542. Cited by: §2.
-  (2016) Group sparse regularization for deep neural networks. arXiv preprint arXiv:1607.00485. Cited by: §2.
-  (2015) Deep learning in neural networks: an overview. Neural networks 61, pp. 85–117. Cited by: §1.
-  (2016) Mastering the game of go with deep neural networks and tree search. Nature 529 (7587), pp. 484–489. Cited by: §1.
-  (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §1.
-  (2015) Data-free parameter pruning for deep neural networks. arXiv preprint arXiv:1507.06149. Cited by: §2.
-  (2016) Training sparse neural networks. arXiv preprint arXiv:1611.06694. Cited by: §2.
-  (2015) Training very deep networks. In Advances in neural information processing systems, pp. 2377–2385. Cited by: §1.
-  (2016) Sparsifying neural network connections for face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4856–4864. Cited by: §2.
-  (2014) Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pp. 3104–3112. Cited by: §1.
-  (2013) Deep learning using linear support vector machines. arXiv preprint arXiv:1306.0239. Cited by: §3.1.
-  (1998) Statistical learning theory. Wiley. Cited by: §3.
-  (2017) The incredible shrinking neural network: new perspectives on learning representations through the lens of pruning. arXiv preprint arXiv:1701.04465. Cited by: §2.
-  (2016) Less is more: towards compact cnns. In European Conference on Computer Vision, pp. 662–677. Cited by: §2.