Neural networks (NNs) in machine learning systems are critical drivers of new technologies such as image processing and speech recognition. Modern NNs are gigantic in size with millions of parameters, such as the ones described in AlexNet [1], Overfeat [2] and ResNet [3]. They therefore require an enormous amount of memory and silicon processing during usage. Optimizing a network to improve performance typically involves making it deeper and adding more parameters [4, 5, 6], which further exacerbates the problem of large storage complexity. While the convolutional (conv) layers in these networks do feature extraction, there are usually fully connected layers at the end performing classification. We shall henceforth refer to these layers as connected layers (CLs), of which fully connected layers (FCLs) are a special case. Owing to their high density of connections, the majority of network parameters are concentrated in FCLs. For example, the FCLs in AlexNet account for 95.7% of the network parameters [7].
We shall refer to the spaces between CLs as CL junctions (or simply junctions), which are occupied by connections, or weights. Given the trend in modern NNs, we raise the question: “How necessary is it to have FCLs?” or, in other words, “What if most of the junction connections never existed? Would the resulting sparsely connected layers (SCLs), when trained and tested, still give competitive performance?” As an example, consider a network with 2 CLs of 100 neurons each, where the junction between them has 1000 weights instead of the expected 10,000. This is a sparse network with a connection density of 10%. Given such a sparse architecture, a natural question to ask is “How can the existing 1000 weights be best distributed so that network performance is maximized?”
In this regard, the present work makes the following contributions. In Section II, we formalize the concept of sparsity, or its opposite measure density, and explore its effects on different network types. We show that CL parameters are largely redundant and a network pre-defined to be sparse before starting training does not result in any performance degradation. For certain network architectures, this leads to CL parameter reduction by a factor of more than 450, or an overall parameter reduction by a factor of more than 20. In Section II-D, we discuss techniques to distribute connections across junctions when given an overall network density. Finally, in Section III, we formalize pre-defined sparse connectivity patterns using adjacency matrices and introduce the scatter metric. Our results show that scatter is a quick and useful indicator of how good a sparse network is.
II Pre-defined Sparsity
As an example of the footprint of modern NNs, AlexNet has a weight size of 234 MB and requires 635 million arithmetic operations for feedforward processing alone [7]. It has been shown that NNs, particularly their FCLs, have an excess of parameters and tend to overfit to the training data [8], resulting in inferior performance on test data. The following paragraph describes several previous works that have attempted to reduce parameters in NNs.
Dropout (deletion) of random neurons [9] trains multiple differently configured networks, which are finally combined to regain the original full size network. [10] randomly forced the same value on collections of weights, but acknowledged that “a significant number of nodes [get] disconnected from neighboring layers.” Other sparsifying techniques such as pruning and quantization [11, 12, 13, 14] first train the complete network, and then perform further computations to delete parameters. [15] used low rank matrices to impose structure on network parameters. [16] proposed a regularizer to reduce parameters in the network, but acknowledged that this increased training complexity. In general, all these architectures deal with FCLs at some point during their usage cycle and therefore do not permanently solve the NN parameter explosion problem.
II-A Our Methodology
Our attempt to simplify NNs is to pre-define the level of sparsity, or connection density, in a network prior to the start of training. This means that our network always has fewer connections than its FCL counterpart; the weights which are absent never make an appearance during training or inference. In our notation, a NN will have $L$ junctions, i.e. $L+1$ layers, with $\{N_1, N_2, \ldots, N_{L+1}\}$ being the number of neurons in each layer. $N_i$ and $N_{i+1}$ are respectively the number of neurons in the earlier (left) and later (right) layers of junction $i$. Every left neuron has a fixed number of edges going from it to the right, and every right neuron has a fixed number of edges coming into it from the left. These numbers are defined as fan-out ($fo_i$) and fan-in ($fi_i$), respectively. For conventional FCLs, $fo_i = N_{i+1}$ and $fi_i = N_i$. We propose SCLs where $fo_i < N_{i+1}$ and $fi_i < N_i$, such that $N_i \times fo_i = N_{i+1} \times fi_i = W_i$, the number of weights in junction $i$. Having a fixed $fo_i$ and $fi_i$ ensures that all neurons in a junction contribute equally and none of them get disconnected, since that would lead to a loss of information. The connection density in junction $i$ is given as $W_i / (N_i N_{i+1})$ and the overall CL connection density is defined as $\left(\sum_{i=1}^{L} W_i\right) / \left(\sum_{i=1}^{L} N_i N_{i+1}\right)$.
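As a rough illustration of the fixed fan-out and fan-in constraint described above, the following sketch builds a random 0/1 connection mask for a single junction and reports its density. The function names `sparse_junction_mask` and `junction_density`, their arguments, and the construction itself (a round-robin pattern followed by random relabeling of neurons) are assumptions made for this example, not the procedure used in the experiments.

```python
import numpy as np


def sparse_junction_mask(n_left, n_right, fan_out, seed=None):
    """Return a 0/1 mask of shape (n_right, n_left) with fixed fan-out and fan-in.

    Hypothetical helper: every left neuron (column) gets `fan_out` ones and
    every right neuron (row) gets n_left * fan_out / n_right ones.
    """
    if not 1 <= fan_out <= n_right:
        raise ValueError("fan_out must be between 1 and n_right")
    if (n_left * fan_out) % n_right != 0:
        raise ValueError("n_left * fan_out must be divisible by n_right")

    rng = np.random.default_rng(seed)
    n_weights = n_left * fan_out

    # Give each left neuron `fan_out` edge stubs and deal the stubs out to the
    # right neurons in round-robin order. The stubs of one left neuron land on
    # distinct right neurons because fan_out <= n_right.
    stubs = np.arange(n_weights)
    left_idx = stubs // fan_out
    right_idx = stubs % n_right

    mask = np.zeros((n_right, n_left), dtype=np.int8)
    mask[right_idx, left_idx] = 1

    # Randomly relabel neurons on both sides; row and column sums are unchanged.
    mask = mask[rng.permutation(n_right)][:, rng.permutation(n_left)]
    return mask


def junction_density(mask):
    """Fraction of possible connections that are present in the junction."""
    return mask.sum() / mask.size


# 100 left and 100 right neurons with fan-out 10: 1000 of 10,000 weights.
mask = sparse_junction_mask(n_left=100, n_right=100, fan_out=10, seed=0)
print(junction_density(mask))
```

Rows of the mask index right neurons and columns index left neurons, so row sums give the fan-in and column sums give the fan-out.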
An earlier work of ours [17] proposed a hardware architecture that leverages pre-defined sparsity to speed up training, while another of our earlier works [18] built on that by exploring the construction of junction connection patterns optimized for hardware implementation. However, a complete analysis of methods to pre-define connections, the possible gains on different kinds of modern deep NNs, and a test of the limits of pre-defined sparsity via a metric quantifying its goodness has been lacking. [19] introduced a metric based on eigenvalues, but ran limited tests on MNIST.
The following subsections analyze our method of pre-defined sparsity in more detail. We experimented with networks operating on the CIFAR image classification dataset, the MNIST handwritten digit recognition dataset, and Morse code symbol classification, a new dataset we have designed and described in [20] (see footnote 1).
Footnote 1: L2 regularization is used wherever applicable for the MNIST networks. For Morse, the difference with and without regularization is negligible, while for CIFAR, the accuracy results differ by about 1%.
II-B Network Experiments
Table I
|Net|Junction fan-outs|CL Junction Densities (%)|Overall CL Density (%)|
We used the original CIFAR10 and CIFAR100 datasets without data augmentation. Our network has 6 conv layers with number of filters equal to
. Each has window size 3x3. The outputs are batch-normalized before applying ReLU non-linearity. A max-pooling layer of pool size 2x2 succeeds every pair of conv layers. This structure finally results in a layer of 4096 neurons, which is followed by the CLs. We used the Adam optimizer, ReLU-activated hidden layers and softmax output layer – choices which we maintained for all networks unless otherwise specified.
Our results in Section II-D indicate that later CL junctions (i.e. closer to the outputs) should be denser than earlier ones (i.e. closer to the inputs). Moreover, since most CL networks have a tapering structure where $N_i$ monotonically decreases as $i$ increases, more parameter savings can be achieved by making earlier layers less dense. Accordingly, we did a grid search and picked CL junction densities as given in Table I. The phrase ‘conv+2CLs’ denotes 2 CL junctions corresponding to a CL neuron configuration of for CIFAR10, for CIFAR100 (see footnote 2), and $(3136, 784, 10)$ for MNIST (see Section II-B2). For ‘conv+3CLs’, an additional 256-neuron layer precedes the output. ‘MNIST CL’ and ‘Morse CL’ refer to the CL only networks described subsequently, for which we have only shown some of the more important configurations in Table I.
Footnote 2: Powers of 2 are used for ease of testing sparsity. The extra output neurons have a ‘false’ ground truth labeling and thus do not impact the final classification accuracy.
As an example, consider the first network in ‘CIFAR10 conv+2CLs’ which has . This means that the individual junction densities are and , to give an overall CL density of . In other words, while FCLs would have been 100% dense with weights, the SCLs use weights, which is 457 times less. Note that weights in the sparse junction are distributed as fixed, but randomly generated patterns, with the constraints of fixed fan-in and fan-out.
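The density arithmetic in this example can be checked with a few lines of code. The sketch below assumes, purely for illustration, a CL neuron configuration of (4096, 512, 16) with junction fan-outs of (1, 1). These values are not taken from the text above; they were chosen because the 4096-neuron layer is stated earlier and because they reproduce the quoted factor of 457. The function name `cl_densities` is likewise hypothetical.

```python
def cl_densities(neurons, fan_outs):
    """Return per-junction densities, overall density, and weight counts.

    neurons:  neurons per CL, e.g. (N_1, N_2, N_3)
    fan_outs: fan-out of each junction, one entry per pair of adjacent layers
    """
    sparse_weights = [n_left * fo for n_left, fo in zip(neurons, fan_outs)]
    full_weights = [n_l * n_r for n_l, n_r in zip(neurons, neurons[1:])]
    junction = [s / f for s, f in zip(sparse_weights, full_weights)]
    overall = sum(sparse_weights) / sum(full_weights)
    return junction, overall, sum(sparse_weights), sum(full_weights)


# Assumed configuration (illustrative only, see the note above).
junction, overall, n_sparse, n_full = cl_densities((4096, 512, 16), (1, 1))
print([f"{100 * d:.2f}%" for d in junction])     # per-junction densities
print(f"{100 * overall:.2f}%")                    # overall CL density
print(n_full, n_sparse, round(n_full / n_sparse))  # FCL vs. SCL weight counts
```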
Figure 1 shows the results for CIFAR. Subfigures (a), (b), (d) and (e) show classification performance on validation data as the network is trained for 30 epochs (note that the final accuracies stayed almost constant after 20 epochs). The different lines correspond to different overall CL densities. Subfigures (c) and (f) show the best validation accuracies after 1, 5 and 30 epochs for the different CL densities. We see that the final accuracies (the numbers at the top of each column) show negligible performance degradation for these extremely low levels of density, not to mention some cases where SCLs outperform FCLs. These results point to the promise of sparsity. Also notice from subfigures (c) and (f) that SCLs generally start training quicker than FCLs, as evidenced by their higher accuracies after 1 epoch of training. See Appendix Section V-C for more discussion.
We used 2 different kinds of networks when experimenting on MNIST (no data augmentation). The first was ‘conv+2CLs’ – 2 conv layers having 32 and 64 filters of size 5x5 each, alternating with 2x2 max pooling layers. This results in a layer of 3136 neurons, which is followed by 2 CLs having 784 and 10 neurons, i.e. 2 junctions overall. Fig. 2(a) and (b) show the results. Due to the simplicity of the overall network, performance starts degrading at higher densities compared to CIFAR. However, a network with CL density 2.35% still matches FCLs in performance. Note that the total number of weights (conv+SCLs) is 0.11M for this network, which is only 4.37% of the original (conv+FCLs).
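The 4.37% figure can be roughly reproduced by counting weights. The sketch below assumes a single-channel input, ignores biases, and applies the quoted 2.35% CL density directly to the FCL weight count. These are simplifying assumptions for illustration, so the result only approximately matches the numbers in the text.

```python
# Conv weights: 5x5 filters, 1 -> 32 channels, then 32 -> 64 channels
# (single-channel input and no biases are assumptions of this sketch).
conv_weights = 5 * 5 * 1 * 32 + 5 * 5 * 32 * 64

# FCL weights for the CL neuron configuration (3136, 784, 10).
fcl_weights = 3136 * 784 + 784 * 10

# SCL weights at the overall CL density quoted in the text.
cl_density = 0.0235
scl_weights = cl_density * fcl_weights

total_sparse = conv_weights + scl_weights
total_full = conv_weights + fcl_weights
percent_of_original = 100 * total_sparse / total_full

print(f"conv+SCLs: {total_sparse / 1e6:.2f}M weights")
print(f"fraction of conv+FCLs: {percent_of_original:.2f}%")
```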
The second was a family of networks with only CLs, either having a single junction with a neuron configuration of $(784, 10)$, or 2 junctions configured as $(784, x, 10)$, where $x$ varies. The results are shown in Fig. 2(c), which offers two insights. Firstly, performance drops off at higher densities for CL only MNIST networks as compared to the one with conv layers. However, half the parameters can still be dropped without appreciable performance degradation. This aspect is further discussed in Section II-C. Secondly, large SCLs perform better than small FCLs with a similar number of parameters. Considering the black-circled points as an example, performance drops when switching from 224 hidden neurons at 12.5% density to 112 at 25% to 56 at 50% to 28 at 100%, even though all these networks have a similar number of parameters. So increasing the number of hidden neurons is desirable, albeit with diminishing returns.
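The black-circled comparison relies on the four networks having about the same number of weights. This can be checked quickly, assuming 784 inputs, 10 outputs, and the same density in both junctions (an idealization that ignores the requirement for integer fan-outs). The helper name `cl_weights` is hypothetical.

```python
from fractions import Fraction


def cl_weights(hidden, density, n_in=784, n_out=10):
    """Weights in a 2-junction CL-only network with equal junction densities."""
    full = n_in * hidden + hidden * n_out
    return full * density


configs = [
    (224, Fraction(1, 8)),  # 12.5% density
    (112, Fraction(1, 4)),  # 25% density
    (56, Fraction(1, 2)),   # 50% density
    (28, Fraction(1, 1)),   # 100% density, i.e. FCLs
]
counts = [cl_weights(hidden, density) for hidden, density in configs]
print([int(c) for c in counts])
```

Halving the hidden layer while doubling the density leaves the weight count unchanged, so any accuracy difference between these points comes from how the weights are arranged rather than how many there are.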
The Morse code dataset presents a harder challenge for sparsity [20]. It only has 64-valued inputs (as compared to 784 for MNIST and 3072 for CIFAR), so each input neuron encodes a significant amount of information. The outputs are Morse codewords and there are 64 classes. Distinctions between inputs belonging to different classes are small. For example, the input pattern for the Morse codeword ‘. . . . .’ can be easily confused with the codeword ‘. . . . -’. As a result, performance degrades quickly as connections are removed. Our network had 64 input and output neurons and 1024 hidden layer neurons, i.e. 3 CLs and 2 junctions, trained using stochastic gradient descent. The results are shown in Fig. 3(a). As with MNIST CL only, 50% density can be achieved with negligible degradation in accuracy.
II-C Analyzing the Results of Pre-defined Sparsity
Table II
|Net|CLs / Total|Conv Params|Conv Ops|FC CL Params|Sparse CL Params|Overall Param %|Overall Op %|
|MNIST CL ()|2/2|0|0|0.178|0.089|50|50|
Our results indicate that for deep networks having several conv layers, there is severe redundancy in the CLs. As a result, they can be made extremely sparse without hampering network performance, which leads to significant memory savings. If the network only has CLs, the amount of density reduction achievable without performance degradation is smaller. This can be explained using the argument of relative importance. For a network which extensively extracts features and processes its raw input data via conv filters, the input to the CLs can already substantially discriminate between inputs belonging to different classes. As a result, the importance of the CLs’ functioning is less as compared to a network where they process the raw inputs.
The computational savings from sparsifying CLs, however, are not as large because the conv layers dominate the computational complexity. Other types of NNs, such as restricted Boltzmann machines, have higher prominence of CLs than CNNs and would thus benefit more from our approach. Table II shows the overall memory and computational gains obtained from pre-defining CLs to be sparse for our networks. The number of SCL parameters (params) is calculated by taking the minimum overall CL density at which there is no accuracy loss. Note that the number of operations (ops) for CLs is nearly the same as their number of parameters, and is hence not explicitly shown.
II-D Distributing Individual Junction Densities
Note that the Morse code network has symmetric junctions since each will have $64 \times 1024 = 65{,}536$ weights to give a total of 131,072 FCL weights. Consider an example where an overall density of 50% (i.e. 65,536 total SCL weights) is desired. This can be achieved in multiple ways, such as making both junctions 50% dense, i.e. 32,768 weights in each. Here we explore whether individual junction densities contribute equally to network performance.
Figures 3(b) and (c) sweep junction 1 and 2 connectivity densities on the x-axis such that the resulting overall density is fixed at 25% for (b) and 50% for (c). The black vertical line denotes where the densities are equal. Note that peak performance in both cases is achieved to the left of the black line, such as in (c) where junction 2 is 75% dense and junction 1 is 25% dense. This suggests that later junctions need more connections than earlier ones. See Appendix Section V-A for more details.
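To make the trade-off concrete, the sketch below enumerates the junction density pairs available to the Morse network at a fixed overall density, assuming the (64, 1024, 64) neuron configuration described earlier and integer fan-in and fan-out in both junctions. The function name `density_pairs` is an assumption for this example.

```python
from fractions import Fraction


def density_pairs(neurons, overall_density):
    """List (junction 1 density, junction 2 density) pairs for a 2-junction net.

    Only pairs with integer fan-out and fan-in in both junctions are kept.
    """
    n1, n2, n3 = neurons
    target = overall_density * (n1 * n2 + n2 * n3)  # total SCL weights
    pairs = []
    for fo1 in range(1, n2 + 1):
        w1 = n1 * fo1
        if w1 % n2 != 0:              # fan-in of junction 1 must be an integer
            continue
        w2 = target - w1
        if w2 <= 0 or w2 % n2 != 0:   # fan-out of junction 2 must be an integer
            continue
        fo2 = int(w2 // n2)
        if fo2 > n3 or (n2 * fo2) % n3 != 0:  # fan-in of junction 2
            continue
        pairs.append((Fraction(fo1, n2), Fraction(fo2, n3)))
    return pairs


pairs = density_pairs((64, 1024, 64), Fraction(1, 2))
for d1, d2 in pairs[:3]:
    print(f"junction 1: {float(d1):.4f}, junction 2: {float(d2):.4f}")
```

Because the two junctions have the same number of possible weights, the overall density here is simply the mean of the two junction densities, so a 25% dense junction 1 must be paired with a 75% dense junction 2 to reach 50% overall.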
III Connectivity Patterns
We now introduce adjacency matrices to describe junction connection patterns. Let $A_i \in \{0,1\}^{N_{i+1} \times N_i}$ be the (simplified) adjacency matrix of junction $i$, such that element $[A_i]_{j,k}$ indicates whether there is a connection between the $j$th right neuron and $k$th left neuron. $A_i$ will have $fi_i$ 1’s on each row and $fo_i$ 1’s on each column. These adjacency matrices can be multiplied to yield the effective connection pattern between any 2 junctions $X$ and $Y$, i.e. $A_{X:Y} = A_Y A_{Y-1} \cdots A_X$, where element $[A_{X:Y}]_{j,k}$ denotes the number of paths from the $k$th neuron in layer $X$ to the $j$th neuron in layer $Y+1$. For the special case where $X = 1$ and $Y = L$ (total number of junctions), we obtain the input-output adjacency matrix $A_{1:L}$. As a simple example, consider the network shown in Fig. 4 where and , which implies that . $A_1$ and $A_2$ are adjacency matrices of single junctions. We obtain the input-output adjacency matrix $A_{1:2} = A_2 A_1$, equivalent , and equivalent . Note that this equivalent junction 1:2 is only an abstract concept that aids visualizing how neurons connect from the inputs to the outputs. It has no relation to the overall network density.
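The following sketch illustrates multiplying junction adjacency matrices to count input-output paths. The small network and its connection patterns are hypothetical, chosen only so that the matrices are easy to inspect; they are not meant to reproduce Fig. 4.

```python
import numpy as np

# Hypothetical 3-layer network with 8, 4 and 4 neurons.
# Rows index right neurons, columns index left neurons.

# Junction 1: right neuron j connects to left neurons 2j and 2j+1
# (fan-out 1, fan-in 2).
A1 = np.zeros((4, 8), dtype=int)
for j in range(4):
    A1[j, [2 * j, 2 * j + 1]] = 1

# Junction 2: right neuron j connects to left neurons j and (j+1) mod 4
# (fan-out 2, fan-in 2).
A2 = np.zeros((4, 4), dtype=int)
for j in range(4):
    A2[j, [j, (j + 1) % 4]] = 1

# Entry (j, k) of the product counts the paths from input k to output j.
A12 = A2 @ A1
print(A12)
```

In this example every row of the product sums to 4 (the product of the two fan-ins) and every column sums to 2 (the product of the two fan-outs).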
We now attempt to characterize the quality of a sparse connection pattern, i.e. we try to find the best possible way to connect neurons to optimize performance. Since sparsity gives good performance, we hypothesize that there exists redundancy / correlated information between neurons. Intuitively, we assume that left neurons of a junction can be grouped into windows depending on the dimensionality of the left layer output. For example, the input layer in an MNIST CL only network would have 2D windows, each of which might correspond to a fraction of the image, as shown in Fig. 5(a). When outputs from a CL have an additional dimension for features, such as in CIFAR or the MNIST conv network, each window is a cuboid capturing fractions of both spatial extent and features, as shown in Fig. 5(b). Given such windows, we will try to maximize the number of left windows to which each right neuron connects, the idea being that each right neuron should get some information from all portions of the left layer in order to capture a global view. To realize the importance of this, consider the MNIST output neuron representing digit 2. Let’s say the sparse connection pattern is such that when the connections to output 2 are traced back to the input layer, they all come from the top half of the image. This would be undesirable since the top half of an image of a 2 can be mistaken for a 3. A good sparse connection pattern will try to avoid such scenarios by spreading the connections to any right neuron across as many input windows as possible. The problem can also be mirrored so that every left neuron connects to as many different right windows as possible. This ensures that local information from left neurons is spread to different parts of the right layer. The grouping of right windows will depend on the dimensionality of the input to the right layer.
The window size is chosen to be the minimum possible such that the ideal number of connections from or to it remains integral. The example from Fig. 4 is reproduced in Fig. 6. Since $fi_1 = 2$, the inputs must be grouped into 2 windows so that ideally 1 connection from each reaches every hidden neuron. If instead the inputs are grouped into 4 windows, the ideal number would be half of a connection, which is not achievable. In order to achieve the minimum window size, we let the number of left windows be $fi_i$ and the number of right windows be $fo_i$. So in junction $i$, the number of neurons in each left and right window is $N_i / fi_i$ and $N_{i+1} / fo_i$, respectively. Then we construct the left- and right-window adjacency matrices by summing up entries of $A_i$, as shown in Fig. 5(c). The window adjacency matrices describe connectivity between windows and neurons on the opposite side. Ideally, every window adjacency matrix for a single junction should be the all 1s matrix, which signifies exactly 1 connection from every window to every neuron on the opposite side. Note that these matrices can also be constructed for multiple junctions by multiplying matrices for individual junctions. See Appendix Section V-B for more discussion.
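The construction of window adjacency matrices can be sketched as follows. This example assumes that windows are contiguous blocks of a flattened layer (a 1D simplification of the 2D and 3D windows in Fig. 5), and the function names and the example junction are hypothetical.

```python
import numpy as np


def left_window_adjacency(A, n_left_windows):
    """Sum columns of A within each left window.

    A has shape (n_right, n_left). The result has shape
    (n_right, n_left_windows) and counts connections from each left window
    to each right neuron.
    """
    n_right, n_left = A.shape
    size = n_left // n_left_windows
    return A.reshape(n_right, n_left_windows, size).sum(axis=2)


def right_window_adjacency(A, n_right_windows):
    """Sum rows of A within each right window.

    The result has shape (n_right_windows, n_left) and counts connections
    from each left neuron to each right window.
    """
    n_right, n_left = A.shape
    size = n_right // n_right_windows
    return A.reshape(n_right_windows, size, n_left).sum(axis=1)


# Hypothetical junction with 8 left and 4 right neurons, fan-in 2, fan-out 1:
# right neuron j connects to left neurons j and j + 4.
A = np.zeros((4, 8), dtype=int)
for j in range(4):
    A[j, [j, j + 4]] = 1

left_w = left_window_adjacency(A, n_left_windows=2)     # 2 windows of 4 neurons
right_w = right_window_adjacency(A, n_right_windows=1)  # 1 window of 4 neurons
print(left_w)
print(right_w)
```

With this pattern each right neuron draws one connection from each of the two left windows, so both window adjacency matrices are all 1s, which is the ideal case described above.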
Scatter is a proxy for the performance of a NN. It is useful because it can be computed in a fraction of a second and used to predict how good or bad a sparse network is without spending time training it. To compute scatter, we count the number of entries greater than or equal to 1 in the window adjacency matrix. If a particular window gets more than its fair share of connections to a neuron on the opposite side, then it is depriving some other window from getting its fair share. This should not be encouraged, so we treat entries greater than 1 the same as 1. Scatter is the average of the count, i.e. for junction :
Subscripts and denote forward (left windows to right neurons) and backward (right neurons to left windows), indicating the direction of data flow. As an example, we consider in Fig. 6, which has a scatter value . The other scatter values can be computed similarly to form the scatter vector, where the final 2 values correspond to junction 1:2. Notice that will be all 1s for FCLs, which is the ideal case. Incorporating sparsity leads to reduced values. The final scatter metric is the minimum value in , i.e. 0.75 for Fig. 6. Our experiments indicate that any low value in leads to bad performance, so we picked the critical minimum value.
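Following the counting rule described above, scatter for a single window adjacency matrix can be sketched as the fraction of its entries that are at least 1, and the final metric as the minimum over all such values. The function names `scatter` and `scatter_metric` and the two example matrices are assumptions made for illustration.

```python
import numpy as np


def scatter(window_adjacency):
    """Fraction of entries >= 1; entries above 1 count the same as 1."""
    W = np.asarray(window_adjacency)
    return np.count_nonzero(W >= 1) / W.size


def scatter_metric(scatter_values):
    """Final metric: the minimum over all scatter values of a network."""
    return min(scatter_values)


# Ideal case: every window reaches every neuron on the opposite side once.
ideal = np.ones((4, 2), dtype=int)

# Clustered case: each neuron draws both of its connections from one window,
# leaving the other window unconnected to it.
clustered = np.array([[2, 0], [2, 0], [0, 2], [0, 2]])

print(scatter(ideal), scatter(clustered))
```

Both example matrices describe the same number of connections. Only their spread across windows differs, and that spread is what the metric is designed to capture.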
III-B Analysis and Results of Scatter
We ran experiments to evaluate scatter using a) the Morse CL only network with , b) an MNIST CL only network with neuron configuration and , and c) the ‘conv+2CLs’ CIFAR10 network with . We found that high scatter indicates good performance and the correlation is stronger for networks where CLs have more importance, i.e. CL only networks as opposed to conv. This is shown in the performance vs. scatter plots in Fig. 7, where (a) and (b) show the performance predicting ability of scatter better than (c). Note that the random connection patterns used so far have the highest scatter and occur as the rightmost points in each subfigure. The other points are obtained by specifically planning connections. We found that when 1 junction was planned to give corresponding high values in , it invariably led to low values for another junction, leading to a low . This explains why random patterns generally perform well.
is shown alongside each point. When is equal for different connection patterns, the next minimum value in needs to be considered to differentiate the networks, and so on. Considering the Morse results, the leftmost 3 points all have , but the number of occurrences of in is 3 for the lowest point (8% accuracy), 2 for the second lowest (12% accuracy) and 1 for the highest point (46% accuracy). For the MNIST results, both the leftmost points have a single minimum value of in , but the lower has two occurrences of while the upper has one.
We draw several insights from these results. Firstly, although we defined as a single value for convenience, there may arise cases when other (non-minimum) elements in are important. Secondly, perhaps contrary to intuition, the concept of windows and scatter is important for all CLs, not simply the first. As shown in Fig. 7(a), a network with performs as poorly as a network with . Thirdly, scatter is a sufficient indicator of performance, but not a necessary one. A network with a high value will perform well, but a network with a slightly lower than another cannot be conclusively dismissed as being worse. But if a network has multiple low values in , it should be rejected. Finally, carefully choosing which neurons to group in a window will increase the predictive power of scatter. A priori knowledge of the dataset will lead to better window choices.
IV Conclusion and Future Work
This paper discusses the merits of pre-defining sparsity in CLs of neural networks, which leads to significant reduction in parameters and, in several cases, computational complexity as well, without performance loss. In general, the smaller the fraction of CLs in a network, the more redundancy there exists in their parameters. If we can achieve similar results (i.e., 0.2% density) on Alexnet for example, we would obtain 95% reduction in overall parameters. Coupled with hardware acceleration designed for pre-defined sparse networks, we believe our approach will lead to more aggressive exploration of network structure. Network connectivity can be guided by the scatter metric, which is closely related to performance, and by optimally distributing connections across junctions. Future work would involve extension to conv layers and trying to reduce their operational complexity.
V-A More on Distributing Individual Junction Densities
Section II-D showed that when overall CL density is fixed, it is desirable to make junction 2 denser than junction 1. It is also interesting to note, however, that performance falls off more sharply when junction 1 density is reduced to the bare minimum as compared to treating junction 2 similarly. This is not shown in Fig. 3 due to space constraints. We found that when junction 1 had the minimum possible density and junction 2 had the maximum possible while still satisfying the fixed overall, the accuracy was about 36% for both subfigures (b) and (c). When the densities were flipped, the accuracies were 67% for subfigure (b) and 75% for (c) in Figure 3.
V-B Dense cases of Window Adjacency Matrices
As stated in Section III-A, window output matrices for several junctions can be constructed by multiplying the individual matrices for each component junction. Consider the Morse network as described in Section III-B. Note that and . Thus, for the equivalent junction 1:2 which has left neurons and right neurons, we have and . So in this case the number of neurons in each window will be rounded up to 1, and both the ideal window adjacency matrices and will be all 16’s matrices since the ideal number of connections from each window to a neuron on the opposite side is . This is a result of the network having sufficient density so that several paths exist from every input neuron to every output neuron.
V-C Possible reasons for SCLs converging faster than FCLs
Training a neural network is essentially an exercise in finding the minimum of the cost function, which is a function of all the network parameters. The graph for cost as a function of parameters may have saddle points which masquerade as minima. It could also be poorly conditioned, wherein the gradients of cost with respect to two different parameters have widely different magnitudes, making simultaneous optimization difficult. These effects are non-idealities, and training the network often takes more time because of the length of the trajectory needed to overcome them and arrive at the optimum point. The probability of encountering these non-idealities increases as the number of network parameters increases, i.e. fewer parameters lead to a higher ratio of minima to saddle points, which can make the network converge faster. We hypothesize that SCLs train faster than FCLs due to the former having fewer parameters.
[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Proc. NIPS, 2012, pp. 1097–1105.
[2] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, “Overfeat: Integrated recognition, localization and detection using convolutional networks,” in arXiv:1312.6229, 2013.
[3] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. CVPR, June 2016, pp. 770–778.
[4] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in Proc. ICLR, 2015.
[5] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proc. CVPR, 2015, pp. 1–9.
[6] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten, “Densely connected convolutional networks,” in arXiv:1608.06993, 2016.
[7] C. Zhang, D. Wu, J. Sun, G. Sun, G. Luo, and J. Cong, “Energy-efficient CNN implementation on a deeply pipelined FPGA cluster,” in Proc. ISLPED. ACM, 2016, pp. 326–331.
[8] M. Denil, B. Shakibi, L. Dinh, M. Ranzato, and N. D. Freitas, “Predicting parameters in deep learning,” in Proc. NIPS, 2013, pp. 2148–2156.
[9] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, pp. 1929–1958, 2014.
[10] W. Chen, J. T. Wilson, S. Tyree, K. Q. Weinberger, and Y. Chen, “Compressing neural networks with the hashing trick,” in Proc. ICML. JMLR.org, 2015, pp. 2285–2294.
[11] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding,” in Proc. ICLR, 2016.
[12] S. Han, J. Pool, J. Tran, and W. Dally, “Learning both weights and connections for efficient neural network,” in Proc. NIPS, 2015, pp. 1135–1143.
[13] X. Zhou, S. Li, K. Qin, K. Li, F. Tang, S. Hu, S. Liu, and Z. Lin, “Deep adaptive network: An efficient deep neural network with sparse binary connections,” in arXiv:1604.06154, 2016.
[14] Y. Gong, L. Liu, M. Yang, and L. D. Bourdev, “Compressing deep convolutional networks using vector quantization,” in arXiv:1412.6115, 2014.
[15] V. Sindhwani, T. Sainath, and S. Kumar, “Structured transforms for small-footprint deep learning,” in Proc. NIPS. Curran Associates, Inc., 2015, pp. 3088–3096.
[16] S. Srinivas, A. Subramanya, and R. V. Babu, “Training sparse neural networks,” in arXiv:1611.06694, 2016.
[17] S. Dey, Y. Shao, K. M. Chugg, and P. A. Beerel, “Accelerating training of deep neural networks via sparse edge processing,” in Proc. ICANN. Springer, 2017, pp. 273–280.
[18] S. Dey, P. A. Beerel, and K. M. Chugg, “Interleaver design for deep neural networks,” in Proc. Asilomar Conference on Signals, Systems and Computers. IEEE, 2017.
[19] A. Bourely, J. P. Boueri, and K. Choromonski, “Sparse neural network topologies,” in arXiv:1706.05683, 2017.
[20] S. Dey, “Morse code dataset for artificial neural networks,” Oct 2017. [Online]. Available: https://cobaltfolly.wordpress.com/2017/10/15/morse-code-dataset-for-artificial-neural-networks/