1 Introduction
Deep neural networks excel at many tasks but usually suffer from two problems: 1) they are cumbersome and difficult to deploy on embedded devices (Crowley et al., 2018), and 2) they tend to overfit the training data (Hawkins, 2004). Several approaches have been proposed to overcome each of them. DropOut (Srivastava et al., 2014), DropConnect (Wan et al., 2013) and weight decay (Krogh & Hertz, 1992) are a few examples of techniques that try to overcome overfitting. DropOut drops a random subset of neurons, and all the edges connected to them, during the training phase, while DropConnect drops a random subset of edges during the training phase. The weight decay method uses regularization to enforce uniformity among weights. However, all of these approaches induce sparsity only during the training phase, and testing is performed on the averaged network, which is not sparse. Hence, there is no reduction in memory usage.
Pruning methods try to compress a deep neural network to make it more memory efficient. A simple yet popular technique, which uses hard thresholding to prune the weights close to zero, has received significant attention (Reed, 1993). Weight elimination (Weigend et al., 1991), the group LASSO regularizer (Scardapane et al., 2017) and the group $\ell_{1/2}$ regularizer (Li et al., 2018) are examples of pruning techniques. Even though all these approaches have shown a lot of promise in sparsifying the network, they are computationally taxing, since all the weights of the network must be optimized over the training data.
In this work, we first show empirically that regularization results in a paradigm where the out degree of a neuron is representative of its importance. Then, we propose a random sparse neural network that trains far fewer parameters than a fully connected deep neural network without much degradation in test performance. Our approach uses feature importance to order the inputs in the first layer, and inputs with lower importance are allotted fewer trainable parameters. This sparsity structure is maintained across all layers by using spatially-coupled sparse constructions (inspired by spatially-coupled LDPC codes) to maintain block sparsity. Our proposed architecture is pruned before training and avoids overfitting. Hence, it is both computationally efficient and memory efficient.
2 Proposed Sparse Construction
In most learning problems, features are transformed into a space that separates the data well to aid the learning task. However, traditional neural networks do not take advantage of the inherent ordering of the features that such an importance measure provides. Our proposed architecture is a feedforward neural network designed to use this side information in the input features to allocate the trainable parameters efficiently. The first step in our model is to transform and rank the input features based on their importance. Two transformations that suit our model well are principal component analysis (PCA) (Wold et al., 1987) and random forest (RF) feature importance (Liaw et al., 2002). After applying one of these feature transformation methods to the input features, we obtain a transformed input whose elements are ordered in decreasing order of the importance measure specific to the transformation. In the next section, we first define the spatially-coupled (SC) layer and then show how to use feature importance information in an SC layer.
2.1 Spatially-Coupled (SC) Layer
First, we define the random sparse (RSP) layer, which is the building block of an SC layer. Let $\mathbf{y}^{(l)}$ denote the output vector of the $l$th layer of the neural network. Also assume that $\mathbf{W}^{(l)}$ and $\mathbf{b}^{(l)}$ denote the weights and biases at layer $l$, respectively. The output of the RSP layer is given by
$$\mathbf{y}^{(l)} = f\left(\left(\mathbf{W}^{(l)} \odot \mathbf{A}^{(l)}\right)\mathbf{y}^{(l-1)} + \mathbf{b}^{(l)}\right), \tag{1}$$
where $f$ is the activation function and $\odot$ denotes the elementwise product. Although the mathematical formulation in (1) is similar to that of a traditional fully connected (FC) layer, the fundamental difference is that $\mathbf{A}^{(l)}$ is the binary adjacency matrix of a random bipartite graph (Tanner graph), as opposed to the all-ones matrix in the FC case. More specifically, the RSP construction imposes sparsity in the bipartite graph between layers, in which each edge denotes a weight between two neurons in layers $l-1$ and $l$, in the same way that low-density parity-check (LDPC) codes are constructed (Gallager, 1962). Assume that $d_i^{(l-1)}$ denotes the out degree of the $i$th neuron at layer $l-1$, $[n]$ denotes the set of integers $\{1, \dots, n\}$, and $N_l$ represents the number of neurons at layer $l$. Then, the RSP layer is constructed as follows.
1. Pick a degree distribution $\Lambda_\theta$ parameterized by the set of parameters $\theta$.
2. Draw $N_{l-1}$ i.i.d. samples for the degrees $d_i^{(l-1)}$, $i \in [N_{l-1}]$, from the chosen degree distribution $\Lambda_\theta$.
3. For each neuron $i$ in layer $l-1$, uniformly pick $d_i^{(l-1)}$ unique neurons, denoted by $\mathcal{S}_i \subseteq [N_l]$, from layer $l$ and connect $i$ to all neurons in $\mathcal{S}_i$.
Notice that, unlike in an FC layer, not every neuron in layer $l-1$ is connected to every neuron in layer $l$. In other words, the out degree of the $i$th neuron at layer $l-1$ is much less than the number of nodes in layer $l$, i.e. $d_i^{(l-1)} \ll N_l$.
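The three construction steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `rsp_mask`, the row/column convention of the mask, and the toy layer sizes are assumptions made here, and a left-regular degree distribution (every neuron has the same out degree) stands in for a general degree distribution.

```python
import random


def rsp_mask(n_in, n_out, degrees, rng):
    """Build the binary adjacency mask of a random sparse (RSP) layer.

    Assumed convention: mask[j][i] == 1 iff neuron i of the previous
    layer is connected to neuron j of the next layer. degrees[i] is the
    out degree of neuron i of the previous layer.
    """
    if len(degrees) != n_in:
        raise ValueError("need exactly one degree per input neuron")
    mask = [[0] * n_in for _ in range(n_out)]
    for i, d in enumerate(degrees):
        if d > n_out:
            raise ValueError("out degree cannot exceed the next layer's size")
        # Uniformly pick d distinct neurons in the next layer.
        for j in rng.sample(range(n_out), d):
            mask[j][i] = 1
    return mask


# Toy example (sizes are illustrative): left-regular degrees, 3 edges per neuron.
rng = random.Random(0)
n_in, n_out, degree = 12, 12, 3
mask = rsp_mask(n_in, n_out, [degree] * n_in, rng)

# Out degree of each previous-layer neuron is a column sum of the mask.
out_degrees = [sum(mask[j][i] for j in range(n_out)) for i in range(n_in)]
n_edges = sum(map(sum, mask))
```

In a forward pass, such a mask would multiply the weight matrix elementwise, so only the weights on sampled edges are trainable.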
Given the construction of an RSP layer, we now define the SC layer. Here, the neurons in each layer are partitioned into blocks of equal size, and the neurons in a block of layer $l-1$ are randomly connected (locally) to neurons in a few adjacent blocks of layer $l$. The number of adjacent blocks that participate in the local connections at layer $l$ is called the receptive field of the layer and is denoted by $w$ (see Figure 1). More specifically, the SC layer is constructed as follows:
1. Consider a window of $w$ adjacent blocks from layer $l$ and a single block from layer $l-1$, and construct an RSP layer locally between them.
2. Repeat step 1 for each of the blocks and choose a random instance of RSP for each of those windows.
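The block-wise construction can be sketched as follows. This is a hedged illustration under stated assumptions, not the authors' code: the function name `sc_mask`, the equal layer widths, and in particular the window placement (the window for block `b` starts at block `b` and is shifted back at the boundary so that it always contains `w` blocks) are choices made here for concreteness.

```python
import random


def sc_mask(n_blocks, block_size, block_degrees, w, rng):
    """Binary mask of a spatially-coupled layer between two equal-width layers.

    mask[j][i] == 1 iff neuron i of the previous layer connects to neuron j
    of the next layer. Every neuron in block b of the previous layer gets
    block_degrees[b] edges into a window of w adjacent next-layer blocks.
    """
    if w > n_blocks:
        raise ValueError("receptive field cannot exceed the number of blocks")
    if max(block_degrees) > w * block_size:
        raise ValueError("degree cannot exceed the number of neurons in a window")
    n = n_blocks * block_size
    mask = [[0] * n for _ in range(n)]
    for b in range(n_blocks):
        # Assumed window placement: start at block b, clipped at the boundary.
        start = min(b, n_blocks - w)
        window = range(start * block_size, (start + w) * block_size)
        for i in range(b * block_size, (b + 1) * block_size):
            # Local RSP construction: left-regular degree within the block.
            for j in rng.sample(window, block_degrees[b]):
                mask[j][i] = 1
    return mask


# Toy example (illustrative sizes): 4 blocks of 3 neurons, receptive field 2,
# with higher degrees assigned to the earlier (more important) blocks.
rng = random.Random(0)
n_blocks, block_size, w = 4, 3, 2
block_degrees = [5, 3, 2, 1]
mask = sc_mask(n_blocks, block_size, block_degrees, w, rng)
```

Because every edge stays inside a window of adjacent blocks, the resulting mask is block-banded, which is the block sparsity the construction is meant to preserve.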
We choose a simple left-regular degree distribution within each block. As pointed out before, we allocate resources (trainable parameters) according to the importance of the features. The degree of each neuron is equivalent to the amount of resources allocated to that neuron. As our input is ordered by an importance measure, we allocate high degrees to the blocks with more important features and low degrees to the blocks with less important features. By construction, the intermediate layers also have the same ordering of feature importance, and hence at each layer we allocate degrees proportional to the importance measure.
3 Experiments and Discussion
We trained spatially-coupled neural networks for a classification problem on the Fashion-MNIST dataset (Xiao et al., 2017). The Fashion-MNIST dataset consists of 70000 samples of 28×28 grayscale images, each associated with a label from 10 classes. We vectorized the input samples into vectors of length 784, transformed them (in the case of PCA), and reordered the features in decreasing order of importance.
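The reordering step can be illustrated with a small sketch. It assumes that a per-feature importance score is already available (for example, from a random forest); the helper name `order_by_importance` and the toy numbers are hypothetical. With PCA, the projected components are conventionally already sorted by explained variance, so this explicit sort matters mainly for score-based rankings.

```python
def order_by_importance(samples, importance):
    """Reorder feature columns in decreasing order of importance.

    samples: list of feature vectors (one list per sample).
    importance: one score per feature; higher means more important.
    Returns the reordered samples and the column permutation used.
    """
    order = sorted(range(len(importance)), key=lambda k: importance[k], reverse=True)
    reordered = [[row[k] for k in order] for row in samples]
    return reordered, order


# Toy data: 2 samples with 3 features and made-up importance scores.
samples = [[1, 2, 3], [4, 5, 6]]
importance = [0.1, 0.6, 0.3]
reordered, order = order_by_importance(samples, importance)
print(order)      # [1, 2, 0]
print(reordered)  # [[2, 3, 1], [5, 6, 4]]
```

The same permutation has to be applied to the training and the test inputs so that block positions carry the same meaning in both.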
In all of our experiments, we deployed neural networks with 5 hidden layers of 784 neurons each and an output layer with 10 neurons. We used the sigmoid activation function and cross-entropy loss with regularization and regularization parameter of
except for the FC architecture. In the FC architecture, we used DropOut with a keep probability of 0.5. We also compared our method with an RSP construction with a left-regular degree distribution, in which the degree of each node is set to 53. We form our proposed SC neural network with 8 blocks in all layers. The RSP construction of each block is left-regular. The out degrees of the neurons in the blocks are set to {98, 130, 98, 49, 20, 10, 10, 10}. We note that the RSP and SC constructions have approximately 93% fewer parameters than the fully connected case.
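The 93% figure can be checked with a short calculation. The sketch below counts only the weights between two consecutive 784-neuron layers and ignores biases and the output layer, which is an assumption made here for simplicity; the block size of 98 follows from splitting 784 neurons into 8 equal blocks.

```python
n = 784                      # neurons per hidden layer
n_blocks = 8
block_size = n // n_blocks   # 98 neurons per block

# Weights between two consecutive 784-neuron layers.
fc_weights = n * n                            # fully connected: 614656
rsp_weights = n * 53                          # left-regular RSP, degree 53: 41552
sc_degrees = [98, 130, 98, 49, 20, 10, 10, 10]
sc_weights = block_size * sum(sc_degrees)     # 98 * 425 = 41650

rsp_saving = 1 - rsp_weights / fc_weights
sc_saving = 1 - sc_weights / fc_weights
print(f"RSP: {rsp_saving:.1%}, SC: {sc_saving:.1%}")  # RSP: 93.2%, SC: 93.2%
```

The average SC out degree is 425 / 8 ≈ 53.1, so the RSP baseline with degree 53 and the SC network have nearly the same parameter budget, which makes the accuracy comparison between them a comparison of how the budget is allocated.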
The findings reported in Figure 2 support the proposed framework of allocating more resources to features of higher importance. We trained an FC neural network classifier (with the same architecture as described above) with regularization, using separate regularization coefficients for the input layer weights and for the other weights. PCA was used to sort the features in descending order of importance. Figure 2 plots the number of edges with absolute weight greater than 0.1 (a proxy for contributing edges) emerging from each input neuron. This count decays rapidly as the importance of the features goes down, which validates our choice of SC graph constructions.

| Feature Importance | Input Ordering | NN Construction | Accuracy |
|---|---|---|---|
| PCA | Descending | SC | 87.18% |
| PCA | Descending | RSP | 84.33% |
| PCA | Descending | FC | 86.78% |
| PCA | Ascending | SC | 10.00% |
| PCA | Ascending | RSP | 84.33% |
| PCA | Ascending | FC | 86.78% |
| RF | Descending | SC | 86.40% |
| RF | Descending | RSP | 84.54% |
| RF | Descending | FC | 86.06% |
| RF | Ascending | SC | 85.26% |
| RF | Ascending | RSP | 84.54% |
| RF | Ascending | FC | 86.06% |
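The edge count plotted in Figure 2 can be computed from a trained input-layer weight matrix along the following lines. This is a sketch under assumptions: the helper name `contributing_out_edges`, the row/column convention, and the toy weights are illustrative and are not taken from the experiments.

```python
def contributing_out_edges(weights, threshold=0.1):
    """Count, per input neuron, the outgoing edges with |weight| > threshold.

    Assumed convention: weights[j][i] is the weight from input neuron i
    to neuron j of the first hidden layer.
    """
    n_in = len(weights[0])
    return [
        sum(1 for row in weights if abs(row[i]) > threshold)
        for i in range(n_in)
    ]


# Toy weight matrix: 3 hidden neurons (rows) by 3 input neurons (columns).
weights = [
    [0.50, 0.05, 0.00],
    [-0.30, 0.20, 0.01],
    [0.11, -0.09, 0.10],
]
counts = contributing_out_edges(weights)
print(counts)  # [3, 1, 0]
```

With inputs sorted by importance, a decaying sequence of counts like this one is the pattern that motivates assigning higher out degrees to the leading blocks.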
Table 1 summarizes the results of the classification task. When the input to the models is PCA features in descending order of importance, the spatially-coupled neural network shows the best accuracy, 87.18%. Comparing SC with the FC neural network shows that FC, even with regularization and DropOut, tends to overfit the data because of the huge number of parameters in the model. On the other hand, naively adding sparsity to the model, as in the RSP network with a left-regular degree distribution, degrades the performance. By allocating the trainable parameters efficiently, we can reach high accuracy with very sparse networks. The same pattern holds for random forest (RF) feature importance, where SC outperforms the other two methods.
To show that the ordering of features is crucial in the SC construction, we repeated the experiments with the order of the features reversed, i.e. assigning high degrees to less important features and vice versa. In the reversed PCA case, SC performs very poorly, while the other methods maintain their performance because they are permutation invariant. In the reversed RF case, SC is very close to the best performance. The difference between the two cases, which causes the drastic change in performance, is the quantization of feature importance. PCA yields a small number of very important features (a few high-variance features) and a large number of unimportant features (many low-variance features), so assigning a very low degree to all of the high-variance features degrades the accuracy substantially. In contrast, the RF ordering tends to give many equally important features and some less important features. Therefore, not all of the important features are diminished, and some of them still receive a high degree in the SC model.
One important property of the SC construction is that it preserves the feature importance ordering throughout the network. We validated this empirically by measuring the feature importance at each layer. Besides giving better interpretability than FC neural networks, this property has a useful application: at every layer, we can prune the lower blocks after training while maintaining the overall accuracy. This can lead to a highly sparse structure that reduces the model complexity by more than 95% with approximately the same performance. An avenue for future work is how to learn this class of transformations, which respect the network structure, using fully connected layers.
References
 Crowley et al. (2018) Crowley, E. J., Turner, J., Storkey, A., and O’Boyle, M. Pruning neural networks: is it time to nip it in the bud? arXiv preprint arXiv:1810.04622, 2018.
 Gallager (1962) Gallager, R. Low-density parity-check codes. IRE Transactions on information theory, 8(1):21–28, 1962.
 Hawkins (2004) Hawkins, D. M. The problem of overfitting. Journal of chemical information and computer sciences, 44(1):1–12, 2004.
 Krogh & Hertz (1992) Krogh, A. and Hertz, J. A. A simple weight decay can improve generalization. In Advances in neural information processing systems, pp. 950–957, 1992.
 Li et al. (2018) Li, F., Zurada, J. M., and Wu, W. Smooth group l1/2 regularization for input layer of feedforward neural networks. Neurocomputing, 314:109–119, 2018.
 Liaw et al. (2002) Liaw, A., Wiener, M., et al. Classification and regression by randomForest. R news, 2(3):18–22, 2002.
 Reed (1993) Reed, R. Pruning algorithms-a survey. IEEE transactions on Neural Networks, 4(5):740–747, 1993.
 Scardapane et al. (2017) Scardapane, S., Comminiello, D., Hussain, A., and Uncini, A. Group sparse regularization for deep neural networks. Neurocomputing, 241:81–89, 2017.
 Srivastava et al. (2014) Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
 Wan et al. (2013) Wan, L., Zeiler, M., Zhang, S., Le Cun, Y., and Fergus, R. Regularization of neural networks using dropconnect. In International Conference on Machine Learning, pp. 1058–1066, 2013.
 Weigend et al. (1991) Weigend, A. S., Rumelhart, D. E., and Huberman, B. A. Generalization by weight-elimination with application to forecasting. In Advances in neural information processing systems, pp. 875–882, 1991.
 Wold et al. (1987) Wold, S., Esbensen, K., and Geladi, P. Principal component analysis. Chemometrics and intelligent laboratory systems, 2(1–3):37–52, 1987.
 Xiao et al. (2017) Xiao, H., Rasul, K., and Vollgraf, R. Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms. arXiv e-prints, art. arXiv:1708.07747, Aug 2017.