Topological Insights in Sparse Neural Networks

Sparse neural networks are an effective approach to reducing the resource requirements for deploying deep neural networks. Recently, the concept of adaptive sparse connectivity has emerged to allow training sparse neural networks from scratch by optimizing the sparse structure during training. However, comparing different sparse topologies and determining how sparse topologies evolve during training, especially when sparse structure optimization is involved, remain challenging open questions. This comparison becomes increasingly complex as the number of possible topological comparisons increases exponentially with the size of networks. In this work, we introduce an approach to understand and compare sparse neural network topologies from the perspective of graph theory. We first propose Neural Network Sparse Topology Distance (NNSTD) to measure the distance between different sparse neural networks. Further, we demonstrate that sparse neural networks can outperform over-parameterized models in terms of performance, even without any further structure optimization. Finally, by quantifying and comparing their topological evolutionary processes, we show that adaptive sparse connectivity can always unveil a plenitude of sparse sub-networks with very different topologies which outperform the dense model. The latter findings complement the Lottery Ticket Hypothesis by showing that there is a much more efficient and robust way to find "winning tickets". Altogether, our results start enabling a better theoretical understanding of sparse neural networks, and demonstrate the utility of using graph theory to analyze them.



1 Introduction

Deep neural networks have led to promising breakthroughs in various applications. While the performance of deep neural networks is improving, the size of these usually over-parameterized models has been increasing tremendously. The cost of training and deploying state-of-the-art models, especially pre-trained models like BERT [4], is very large.

Sparse neural networks are an effective approach to address these challenges. Discovering a small, sparse, and well-performing sub-network of a dense network can significantly reduce the parameter count (e.g. for memory efficiency), along with the floating-point operations. Over the past decade, many methods have been proposed to obtain sparse neural networks, including but not limited to magnitude pruning [10, 9], Bayesian statistics [27, 22], regularization [23], and reinforcement learning [19]. Given a pre-trained model, these methods can efficiently discover a sparse sub-network with competitive performance. While some works aim to provide analysis of sparse neural networks [6, 34, 7, 21], they mainly focus on how to empirically improve training performance or to what extent the initialization and the final sparse structure contribute to the performance. Sparsity-inducing techniques (sparsity being the proportion of neural network weights that are zero-valued) essentially uncover the optimal sparse topologies (sub-networks) that, once initialized in the right way, can reach predictive performance similar to that of dense networks, as shown by the Lottery Ticket Hypothesis [6]. Such sub-networks are named “winning lottery tickets” and can be obtained from pre-trained dense models, which makes them inefficient during the training phase.

Recently, many works have emerged to achieve both training efficiency and inference efficiency, based on adaptive sparse connectivity [25, 28, 20, 3, 5]. Such networks are initialized with a sparse topology and can maintain a fixed sparsity level throughout training. Instead of only optimizing the model parameters, i.e. the weight values (a continuous optimization problem), in this case the sparse topology is also optimized (a combinatorial optimization problem) during training according to some criteria in order to fit the data distribution. In [5], it is shown that such metaheuristic approaches always lead to very well-performing sparse topologies, even if they are based on a random process, without the need for a pre-trained model and a lucky initialization as done in [6]. While it has been shown empirically that both approaches, i.e. winning lottery tickets and adaptive sparse connectivity, find very well-performing sparse topologies, we generally lack an understanding of them. Questions such as “How different are these well-performing sparse topologies?”, “Can very different sparse topologies lead to the same performance?”, and “Are there many local sparse topological optima which can offer sufficient performance (similar in a way to the local optima of the continuous weight optimization problem)?” are still unanswered.

In this paper, we are studying these questions in order to start enabling a better theoretical understanding of sparse neural networks and to unveil high gain future research directions. Concretely, our contributions are:

  • We propose the first metric which can measure the distance between two sparse neural network topologies (our code is available online), and we name it Neural Network Sparse Topology Distance (NNSTD). For this, we treat the sparse network as a large neural graph. In NNSTD, we take inspiration from graph theory and Graph Edit Distance (GED) [31], which cannot be applied directly because two different neural graphs may represent very similar networks, since hidden neurons are interchangeable.

  • Using NNSTD, we demonstrate that there exist many very different well-performing sparse topologies which can achieve the same performance.

  • In addition, with the help of our proposed distance metric, we confirm and complement the findings from [5] by quantifying how different the sparse, yet similarly performing, topologies obtained with adaptive sparse connectivity are. This implies that there exist many well-performing local sparse topological optima.

2 Related Work

2.1 Sparse Neural Networks

2.1.1 Sparse Neural Networks for Inference Efficiency.

Since being proposed, the motivation of sparse neural networks has been to reduce the cost associated with the deployment of deep neural networks (inference efficiency) and to gain better generalization [1, 11, 16]. Up to now, a variety of methods have been proposed to obtain inference efficiency by compressing a dense network into a sparse one. Among them, pruning is certainly the most effective. A method which iteratively alternates pruning and retraining was introduced by Han et al. [10]. This method can reduce the number of connections of AlexNet and VGG-16 on ImageNet by 9× to 13× without loss of accuracy. Further, Narang et al. [29] applied pruning to recurrent neural networks while getting rid of the retraining process. At the same time, it is shown in [35] that, with the same number of parameters, pruned models (large-sparse) have better generalization ability than small-dense models. A grow-and-prune (GP) training method was proposed in [2], in which the network growth phase slightly improves the performance. While unstructured sparse neural networks achieve better performance, they are difficult to run efficiently on parallel processors because of the limited support for sparse operations. Compared with fine-grained pruning, coarse-grained (filter/channel) pruning is more desirable in practical applications as it is more amenable to hardware acceleration [12, 13].

2.1.2 Sparse Neural Networks for Training Efficiency

Recently, more and more works attempt to obtain memory and computational efficiency for the training phase. This can be naturally achieved by training sparse neural networks directly. However, while training them with a fixed sparse topology can lead to good performance [24], it is hard to find an optimal sparse topology to fit the data distribution before training. This problem was addressed by introducing the adaptive sparse connectivity concept through its first instantiation, the Sparse Evolutionary Training (SET) algorithm [26, 25]. SET is a straightforward strategy that starts from random sparse networks and can achieve good performance based on magnitude weight pruning and regrowing after each training epoch. Further, Dynamic Sparse Reparameterization (DSR) [28] introduced across-layer weight redistribution to allocate more weights to the layers that contribute more to the loss decrease. By utilizing the momentum information to guide the weight regrowth and across-layer redistribution, Sparse Momentum [3] can improve the classification accuracy for various models. However, the performance improvement comes at the cost of updating and storing the momentum of every individual weight of the model. Very recently, instead of using the momentum, The Rigged Lottery [5] grows the zero-weights with the highest magnitude gradients to eliminate the extra floating-point operations required by Sparse Momentum. Liu et al. [20] trained intrinsically sparse recurrent neural networks (RNNs) that usually achieve better performance than dense models. Lee et al. [17] introduced single-shot network pruning (SNIP), which can discover a sparse network before training based on a connection sensitivity criterion. Trained in the standard way, the sparse pruned network can have good performance. Instead of using connection sensitivity, GraSP [32] prunes connections whose removal causes the least decrease in the gradient norm, resulting in better performance than SNIP in the extreme sparsity situation.

2.1.3 Interpretation and Analysis of Sparse Neural Networks

Some works aim to interpret and analyze sparse neural networks. Frankle & Carbin [6] proposed the Lottery Ticket Hypothesis and showed that the dense structure contains sparse sub-networks that are able to reach the same accuracy when they are trained with the same initialization. Zhou et al. [34] further claimed that the sign of the “lucky” initialization is the key to guaranteeing the good performance of “winning lottery tickets”. Liu et al. [21] reevaluated the value of network pruning techniques. They showed that training a small pruned model from scratch can reach the same or even better performance than conventional network pruning and that, for small pruned models, the pruned architecture itself is more crucial than the learned weights. Moreover, magnitude pruning [35] can achieve better performance than regularization [23] and variational dropout [27] on large-scale tasks [7].

2.2 Sparse Evolutionary Training

Sparse Evolutionary Training (SET) [25] is an effective algorithm that allows training sparse neural networks from scratch with a fixed number of parameters. Instead of starting from a highly over-parameterized dense network, the network topology is initialized as a sparse Erdős-Rényi graph [8], a graph where each edge is chosen randomly with a fixed probability, independently from every other edge. Given that the random initialization may not always guarantee good performance, adaptive sparse connectivity is utilized to optimize the sparse topology during training. Concretely, a fraction $\zeta$ of the connections with the smallest magnitude are pruned and an equal number of novel connections are re-grown after each training epoch. This adaptive sparse connectivity (pruning-and-regrowing) technique is capable of guaranteeing a constant sparsity level during the whole learning process and also of improving the generalization ability. More precisely, at the beginning of the training, the connection ($W^k_{ij}$) between neuron $h^{k-1}_j$ and $h^k_i$ exists with the probability:

$$p(W^k_{ij}) = \frac{\epsilon (n^k + n^{k-1})}{n^k n^{k-1}}$$

where $n^k$ and $n^{k-1}$ are the number of neurons of layer $h^k$ and $h^{k-1}$, respectively; $\epsilon$ is a parameter determining the sparsity level. The smaller $\epsilon$ is, the more sparse the network is. By doing this, the sparsity level of layer $k$ is given by $1 - \frac{\epsilon (n^k + n^{k-1})}{n^k n^{k-1}}$. The connections between the two layers are collected in a sparse weight matrix $W^k \in \mathbb{R}^{n^{k-1} \times n^k}$. Compared with fully-connected layers whose number of connections is $n^k n^{k-1}$, the SET sparse layers only have, in expectation, $\epsilon (n^k + n^{k-1})$ connections, which can significantly alleviate the pressure of the expensive memory footprint. Among all possible adaptive sparse connectivity techniques, in this paper we make use of SET for two reasons: (1) its natural simplicity and computational efficiency, and (2) the fact that the regrowth of new connections is purely random, which favors an unbiased study of the evolved sparse topologies.
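The initialization and the prune-and-regrow step described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the SET reference implementation: the function names, the dictionary-based weight storage, the Gaussian initialization scale of 0.1, and the layer sizes in the usage lines are all hypothetical choices, and pruning is done purely by absolute magnitude as the text states.

```python
import random

def init_sparse_layer(n_in, n_out, epsilon, rng):
    """Erdos-Renyi mask: each connection exists independently with
    probability epsilon * (n_in + n_out) / (n_in * n_out)."""
    p = min(1.0, epsilon * (n_in + n_out) / (n_in * n_out))
    weights = {}
    for i in range(n_in):
        for j in range(n_out):
            if rng.random() < p:
                weights[(i, j)] = rng.gauss(0.0, 0.1)  # assumed init scale
    return weights

def prune_and_regrow(weights, n_in, n_out, zeta, rng):
    """Remove the fraction zeta of smallest-magnitude connections, then
    add the same number of connections at random empty positions."""
    n_prune = int(zeta * len(weights))
    by_magnitude = sorted(weights, key=lambda edge: abs(weights[edge]))
    for edge in by_magnitude[:n_prune]:
        del weights[edge]
    grown = 0
    while grown < n_prune:
        edge = (rng.randrange(n_in), rng.randrange(n_out))
        if edge not in weights:  # only grow where no connection exists
            weights[edge] = rng.gauss(0.0, 0.1)
            grown += 1
    return weights

# Usage with illustrative sizes: about epsilon * (n_in + n_out) = 1000
# connections are expected out of 10000 possible ones.
rng = random.Random(0)
layer = init_sparse_layer(100, 100, epsilon=5, rng=rng)
n_before = len(layer)
prune_and_regrow(layer, 100, 100, zeta=0.2, rng=rng)
print(n_before, len(layer))  # the connection count is unchanged
```

Because exactly as many connections are regrown as were pruned, the number of parameters stays fixed across epochs, which is the property the text relies on.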

3 Neural Network Sparse Topology Distance

In this section, we introduce our proposed method, NNSTD, to measure the topological distance between two sparse neural networks. The term sparse topology used in this paper refers to the graph underlying a sparsely connected neural network, in which each neuron represents a vertex and each existing connection (weight) represents an edge. Existing metrics to measure the distance between two graphs are not always applicable to artificial neural network topologies. The main difficulty is that two different graph topologies may represent similar neural networks, since hidden neurons are interchangeable. All graph similarity metrics consider either labeled or unlabeled nodes to compute the similarity. With neural networks, input and output layers are labeled (each of their neurons corresponds to a concrete data feature or class, respectively), whereas hidden layers are unlabeled. In particular, we take multilayer perceptron networks (MLPs) as the default.

The inspiration comes from Graph Edit Distance (GED) [31], a well-known graph distance metric. Considering two graphs $g_1$ and $g_2$, it measures the minimum cost required to transform $g_1$ into a graph isomorphic to $g_2$. Formally, the graph edit distance is calculated as follows.

$$GED(g_1, g_2) = \min_{p \in P(g_1, g_2)} c(p)$$

where $p$ represents a sequence of transformations from $g_1$ into a graph isomorphic to $g_2$, and $c(p)$ represents the total cost of such a transformation. $P(g_1, g_2)$ represents all possible transformations. This large panel of possibilities makes computing the GED an NP-hard problem when a subset of the nodes in the graphs are unlabeled (e.g. hidden neurons are interchangeable).
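To make the cost of unlabeled hidden neurons concrete, the following sketch computes a restricted form of the edit distance for a single sparse layer by brute force. It is an illustration under assumptions, not a general GED solver: only edge insertions and deletions are counted (unit cost each), both layers have the same number of hidden neurons, and the names `edit_cost` and `brute_force_layer_ged` are hypothetical.

```python
from itertools import permutations

def edit_cost(edges_a, edges_b):
    """Number of edge insertions plus deletions (unit cost each)."""
    return len(edges_a ^ edges_b)

def brute_force_layer_ged(edges_a, edges_b, n_hidden):
    """Minimum edit cost over every relabeling of the hidden neurons of b.
    Edges are (input_index, hidden_index) pairs; inputs are labeled."""
    best = None
    for perm in permutations(range(n_hidden)):
        relabeled = {(i, perm[h]) for (i, h) in edges_b}
        cost = edit_cost(edges_a, relabeled)
        if best is None or cost < best:
            best = cost
    return best

# net_b has the same wiring as net_a with its hidden neurons shuffled.
net_a = {(0, 0), (1, 0), (2, 1), (3, 2)}
net_b = {(0, 2), (1, 2), (2, 0), (3, 1)}
print(edit_cost(net_a, net_b))                 # 8: labels taken at face value
print(brute_force_layer_ged(net_a, net_b, 3))  # 0: identical up to relabeling
```

The search visits all `n_hidden!` relabelings, which is already infeasible for a layer with a few dozen neurons and motivates a cheaper matching strategy.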

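Figure 1: NNSTD metric illustration.

Function NED(G1, G2):
        return |(G1 \ G2) ∪ (G2 \ G1)| / |G1 ∪ G2|;
Function CompareLayers(L1, L2):
        for neuron n1 in L1 do
               for neuron n2 in L2 do
                      NNSTD_matrix[n1, n2] = NED(inputs of n1, inputs of n2);
               end for
        end for
        neuron_assignment, normalized_cost = Hungarian(NNSTD_matrix);
        return neuron_assignment, normalized_cost/size(L2);
Function CompareNetworks(N1, N2):
        for layer l in [1, L] do
               reorder the inputs of N2[l] with the neuron_assignment of layer l-1;
               neuron_assignment, cost[l] = CompareLayers(N1[l], N2[l]);
        end for
        return cost;
Algorithm 1 Neural Network Sparse Topology Distance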

The proposed NNSTD metric is presented in Algorithm 1 and discussed next. A graphical example is also provided in Figure 1. As an example, two neural networks $N_1$ and $N_2$ are considered. For each hidden neuron $n$, a tree graph is constructed based on all direct inputs to this neuron, and these input neurons are collected in a set $G$. Per layer, for all possible pairs of neurons between the two networks, the Normalized Edit Distance (NED) is calculated between their input neurons, as defined in the second line of Algorithm 1. NED takes the value 1 if the two compared neurons have no input neurons in common, and 0 if they have exactly the same neurons as input. To reduce the complexity of the search space, we take a greedy approach: for any current layer, we consider that the neurons of the previous layer are labeled (as they have already been matched by the proposed distance metric when the previous layer was under scrutiny), and that adding or deleting inputs has the same cost. For instance, for the neurons compared in Figure 1, one input neuron is shared out of two different inputs considered, thus the distance between them is 0.5. The NNSTD matrix is solved using the Hungarian method to find the neuron (credit) assignment which minimizes the total cost, shown underlined in Figure 1. The aggregated costs divided by the size of $L_2$ give the distance between the first layer of $N_1$ and $N_2$. To compare the next layer using the same method, the current layer must be fixed. Therefore, the assignment solving the NNSTD matrix is saved to reorder the first layer of $N_2$. Finally, an NNSTD value of 0 between two sparse layers (or two sparse networks) shows that the two layers are exactly the same, while a value of 1 (the maximum possible) shows that the two layers are completely different.
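The procedure described above can be sketched in dependency-free Python as follows. Several points are assumptions rather than the authors' implementation: NED is taken to be the size of the symmetric difference of the two input sets divided by the size of their union (which gives 0, 1, and 0.5 in the cases discussed in the text), compared layers are assumed to have equal width, every layer passed in is treated as hidden, two neurons without any inputs are given distance 0, and the assignment step uses a standard O(n³) Hungarian routine written here for self-containment.

```python
def ned(inputs_a, inputs_b):
    """Normalized edit distance between two sets of input-neuron indices."""
    union = inputs_a | inputs_b
    if not union:
        return 0.0  # assumption: two neurons without inputs are identical
    return len(inputs_a ^ inputs_b) / len(union)

def hungarian(cost):
    """Minimum-cost assignment for a square matrix; result[i] is the
    column matched to row i (potentials-based Hungarian method)."""
    n = len(cost)
    inf = float("inf")
    u, v = [0.0] * (n + 1), [0.0] * (n + 1)
    p, way = [0] * (n + 1), [0] * (n + 1)
    for i in range(1, n + 1):
        p[0], j0 = i, 0
        minv, used = [inf] * (n + 1), [False] * (n + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, n + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(n + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0 != 0:  # augment along the alternating path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    result = [0] * n
    for j in range(1, n + 1):
        result[p[j] - 1] = j - 1
    return result

def compare_layers(layer_a, layer_b):
    """Each layer is a list of sets; layer[j] holds the inputs of neuron j."""
    cost = [[ned(a, b) for b in layer_b] for a in layer_a]
    assignment = hungarian(cost)
    total = sum(cost[i][j] for i, j in enumerate(assignment))
    return assignment, total / len(layer_b)

def compare_networks(net_a, net_b):
    """Greedy layer-by-layer distance; returns one value per layer."""
    distances, relabel = [], None
    for layer_a, layer_b in zip(net_a, net_b):
        if relabel is not None:  # rename inputs using the previous matching
            layer_b = [{relabel[i] for i in inputs} for inputs in layer_b]
        assignment, dist = compare_layers(layer_a, layer_b)
        relabel = {j: i for i, j in enumerate(assignment)}
        distances.append(dist)
    return distances

# net_b is net_a with the hidden neurons of both layers shuffled.
net_a = [[{0, 1}, {2}, {1, 3}], [{0, 2}, {1}, {0, 1}]]
net_b = [[{2}, {1, 3}, {0, 1}], [{0}, {0, 2}, {1, 2}]]
print(ned({0, 1}, {0}))                # 0.5: one shared input out of two
print(compare_networks(net_a, net_b))  # [0.0, 0.0]
```

The greedy relabeling is what keeps the cost polynomial: each layer needs one cost matrix and one assignment problem instead of a search over all neuron permutations.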

4 Experimental Results

In this section, we study the performance of the proposed NNSTD metric and the sparse neural network properties on two datasets, Fashion-MNIST [33] and CIFAR-10 [15], in a step-wise fashion. We begin in Section 4.2 by showing that sparse neural networks can match the performance of the fully-connected counterpart, even without topology optimization. Next, in Section 4.3 we first validate NNSTD and then we apply it to show that adaptive sparse connectivity can find many well-performing very different sub-networks. Finally, we verify that adaptive sparse connectivity indeed optimizes the sparse topology during training in Section 4.4.

4.1 Experimental Setup

For the sake of simplicity, the models we use are MLPs with the SReLU activation function [14], as it has been shown to provide better performance for SET-MLP [25]. For both datasets, we use 20% of the training data as the validation set, and the test accuracy is computed with the model that achieves the highest validation accuracy during training.
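The evaluation protocol above amounts to a hold-out split plus checkpoint selection by validation accuracy, which can be sketched as below. The helper names and the accuracy values in `history` are made up for illustration; the only figure taken from outside the text is the 60,000-example size of the Fashion-MNIST training set.

```python
import random

def split_train_validation(indices, validation_fraction, rng):
    """Shuffle example indices and hold out a fraction for validation."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    n_val = int(validation_fraction * len(shuffled))
    return shuffled[n_val:], shuffled[:n_val]

def select_test_accuracy(history):
    """history holds one (validation_accuracy, test_accuracy) pair per
    epoch; report the test accuracy at the best validation epoch."""
    _, test_at_best = max(history, key=lambda pair: pair[0])
    return test_at_best

train_idx, val_idx = split_train_validation(range(60000), 0.2, random.Random(0))
history = [(0.80, 0.79), (0.86, 0.85), (0.85, 0.87)]  # made-up numbers
print(len(train_idx), len(val_idx))   # 48000 12000
print(select_test_accuracy(history))  # 0.85
```

Note that the reported value is 0.85 rather than the highest test accuracy in `history`, since the test set plays no part in choosing the checkpoint.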

For Fashion-MNIST, we choose a three-layer MLP as our basic model, containing 784 hidden neurons in each layer. We set the batch size to 128. The optimizer is stochastic gradient descent (SGD) with Nesterov momentum. We train these sparse models for 200 epochs with a learning rate of 0.01, a Nesterov momentum of 0.9, and a weight decay of 1e-6.

The network used for CIFAR-10 consists of two hidden layers with 1000 hidden neurons. We use standard data augmentations (horizontal flip, random rotation, and random crop with reflective padding). We set the batch size to 128. We train the sparse models for 1000 epochs using stochastic gradient descent with a learning rate of 0.01, a Nesterov momentum of 0.9, and a weight decay of 1e-6.

4.2 The Performance of Sparse Neural Networks

We first verify that randomly initialized sparse neural networks are able to reach performance competitive with that of dense networks, even without any further topology optimization.

Figure 2: Test accuracy of MLPs with various density levels. (a) Test accuracy with three-layer MLPs on Fashion-MNIST. (b) Test accuracy with two-layer MLPs on CIFAR-10. SET-MLP refers to the networks trained with adaptive sparse connectivity associated with SET. Fix-MLP refers to the networks trained without sparse topology optimization. The dashed lines represent the dense MLPs. Note that each line is the average of 8 trials and the standard deviation is very small.

For Fashion-MNIST, we train a group of sparse networks at 12 different density levels. For each density level, we initialize two sparse networks with two different random seeds as root networks. For each root network, we generate a new network by randomly changing 1% of the connections. We perform this generating operation 3 times, each new network being generated from the previous generation, to have 4 networks in total (including the root network) for each random seed. Thus, the number of networks for each density level is 8 and the total number of sparse networks for Fashion-MNIST is 96. We train these sparse networks without any sparse topology optimization for 200 epochs, and refer to them as Fix-MLP. To evaluate the effectiveness of sparse connectivity optimization, we also train the same networks with the sparse connectivity optimization proposed in SET [25] for 200 epochs, and refer to them as SET-MLP. The hyper-parameter of SET, the pruning rate, is set to 0.2. Besides this, we choose two fully-connected MLPs as the baseline.
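The generation of each family of related networks can be sketched as follows. This is a hypothetical reading of "randomly changing 1% of the connections": a fraction of the existing connections is removed and the same number is added at positions that were empty in the parent, so the density is preserved. The function names, the edge-set representation, and the layer sizes are illustrative assumptions, and the loop assumes the layer is sparse enough to have free positions.

```python
import random

def rewire(edges, n_in, n_out, fraction, rng):
    """Return a copy of `edges` in which `fraction` of the connections
    are moved to positions that were empty in the parent network."""
    edges = set(edges)
    n_change = max(1, round(fraction * len(edges)))
    removed = rng.sample(sorted(edges), n_change)
    new_edges = edges - set(removed)
    while len(new_edges) < len(edges):
        candidate = (rng.randrange(n_in), rng.randrange(n_out))
        if candidate not in edges:  # only positions unused by the parent
            new_edges.add(candidate)
    return new_edges

def make_family(root, n_in, n_out, generations, fraction, rng):
    """Root network plus `generations` successively rewired descendants."""
    family = [root]
    for _ in range(generations):
        family.append(rewire(family[-1], n_in, n_out, fraction, rng))
    return family

rng = random.Random(0)
root = set()
while len(root) < 1000:  # random root with 10% density on a 100x100 layer
    root.add((rng.randrange(100), rng.randrange(100)))
family = make_family(root, 100, 100, generations=3, fraction=0.01, rng=rng)
print([len(net) for net in family])  # four networks with equal density
```

Each descendant differs from its parent in exactly 1% of the connections, so differences accumulate over generations while the number of connections stays constant.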

The experimental results are given in Figure 2(a). We can see that, as long as the density level is larger than 20%, both Fix-MLP and SET-MLP can reach accuracy similar to that of the dense MLP. While decreasing the density level gradually decreases the performance of sparse networks, sparse MLPs still reach the dense accuracy with only 0.6% of the parameters. Compared with Fix-MLP, the networks trained with SET are able to achieve slightly better performance.

For CIFAR-10, we train two-layer MLPs at 9 different density levels. We use the same strategy as for Fashion-MNIST to generate 72 networks in total, 8 for each density level. All networks are trained with and without adaptive sparse connectivity for 900 epochs. The two-layer dense MLP is chosen as the baseline.

The results are illustrated in Figure 2(b). We can observe that Fix-MLP consistently reaches the performance of the fully-connected counterpart when the percentage of parameters is larger than 20%. More surprisingly, SET-MLP can significantly improve the accuracy with the help of adaptive sparse connectivity. With only 5% of the parameters, SET-MLP can outperform the dense counterpart.

4.3 Topological Distance between Sparse Neural Networks

4.3.1 Evaluation of Neural Network Sparse Topology Distance.

In this part, we evaluate our proposed NNSTD metric by measuring the initial topological distance between three-layer MLPs on Fashion-MNIST before training. We first measure the topology distance between networks with the same density. We initialize one sparse network with a density level of 0.6%. Then, we generate 9 networks by iteratively changing 1% of the connections from the previous generation step. By doing this, the density of these networks is the same, whereas the topologies vary slightly. Therefore, we expect the topological distance of each generation from the root network to increase gradually as the generations accumulate, while remaining bounded by a small value. The distance measured by our method is illustrated in Figure 3(a). We can see that the result is consistently in line with our hypothesis. Starting with a value close to zero, the distance increases as the topological difference accumulates, but the maximum distance is still very small, around 0.2.
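A back-of-the-envelope estimate shows why a value of this order is plausible. The derivation below is a sketch under assumptions that are not stated in the text: NED is the size of the symmetric difference of two input sets divided by the size of their union, a neuron has $k$ inputs of which a fraction $q$ has been replaced by inputs it did not have before, no replaced input is ever restored, and each neuron is matched with its own descendant. The Hungarian matching can only lower the resulting value.

```latex
% Replaced inputs contribute qk removed plus qk added elements to the
% symmetric difference; the union grows from k to (1+q)k elements.
\[
\mathrm{NED}(q) \approx \frac{|G_1 \,\triangle\, G_2|}{|G_1 \cup G_2|}
  = \frac{2qk}{(1+q)\,k} = \frac{2q}{1+q},
\qquad
\mathrm{NED}(0.01) \approx 0.020,
\qquad
\mathrm{NED}(0.09) \approx 0.165 .
\]
```

With one generation ($q = 0.01$) the estimate is close to zero, and after nine generations ($q = 0.09$) it is of the same order as the maximum distance of about 0.2 reported above.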

Figure 3: Evaluation of the proposed NNSTD metric. (a) Same density level: the sparse topology distance among 10 networks generated by randomly changing 1% of the connections at the same density level of 0.6%; the axis labels index these gradually changed networks. (b) Different density levels: the sparse topology distance among networks generated with different density levels.

Further, we also evaluate NNSTD on sparse networks with different density levels. We use the same 96 sparse and two dense networks generated in Section 4.2, whose performance is given in Figure 2(a). Concretely, for each density level, we choose 8 networks generated by two different random seeds. For each density level in the plot, the first four networks are generated with one random seed and the latter four networks are generated with another one. We hypothesize that the distance among networks with the same density should differ from the distance among networks with different densities. The distance among networks with different densities can be very large, since the density varies over a large range, from 0.1% to 100%. Furthermore, the topological distance increases as the density difference increases, since more cost is required to match the difference between the number of connections. We show the initial topology distance in Figure 3(b). We can see that the distance among different density levels can be much larger than among networks with the same density, up to 1. The more similar the density levels are, the more similar the topologies are. As expected, the distance between networks with the same density generated with different random seeds is very large. This makes sense because, when initialized with different random seeds, two sparse connectivities between two layers can be totally different. We only plot the distance of the first layer, as all layers are initialized in the same way.

4.3.2 Evolutionary Optimization Process Visualization.

Herein, we visualize the evolutionary optimization process of the sparse topology learned by adaptive sparse connectivity associated with SET on Fashion-MNIST and CIFAR-10.

Figure 4: Topological distance dynamics of 10 networks optimized by adaptive sparse connectivity with three-layer SET-MLP on Fashion-MNIST, shown at (a) epoch 0, (b) epoch 10, (c) epoch 30, (d) epoch 50, (e) epoch 100, and (f) epoch 190. The initial networks (epoch 0) have the same density level, with a tiny percentage (1%) of topological difference from each other. The axis labels index the different networks.

First, we study how the topologies of networks initialized with very similar structures change when they are optimized by adaptive sparse connectivity. We choose the same 10 networks generated for Fashion-MNIST in Figure 3(a) and train them with SET for 200 epochs. All the hyper-parameters are the same as in Section 4.2. We apply NNSTD to measure the pairwise topological distance among these 10 networks at several epochs during training.

Dataset         W0     W1     W2     W3     W4     W5     W6     W7     W8     W9
Fashion-MNIST   87.48  87.53  87.41  87.54  88.01  87.58  87.34  87.70  87.77  88.02
CIFAR-10        65.46  65.62  65.26  65.46  65.00  65.57  65.61  64.92  64.86  65.58
Table 1: The test accuracy (in percent) of the networks used for the evolutionary optimization process of adaptive sparse connectivity in Section 4.3.2.

It can be observed in Figure 4 that, while the networks are initialized similarly, the topological distance between them gradually increases from 0 to 0.6. This means that similar initial topologies gradually evolve into very different topologies while training with adaptive sparse connectivity. It is worth noting that while these networks end up with very different topologies, they achieve very similar test accuracy, as shown in Table 1. This phenomenon shows that there are many sparse topologies obtained by adaptive sparse connectivity that can achieve good performance. This result can be treated as a complement to the Lottery Ticket Hypothesis, which claims that, with a “lucky” initialization, there are sub-networks yielding an equal or even better test accuracy than the original network. We empirically demonstrate that many sub-networks with good performance can be found by adaptive sparse connectivity, even without the “lucky” initialization. Besides, Figure 5 depicts the comparison between the initial and the final topological distance among the 96 networks used in Figure 2(a). We can see that the distance among different networks also increases after the training process, to varying degrees.

Figure 5: Heatmap representing the topological distance between the first layer of the 96 three-layer SET-MLP networks on Fashion-MNIST. (a) The initial distance. (b) The final distance.

Second, we conduct a controlled experiment to study the evolutionary trajectory of networks with very different topologies. We train 10 two-layer SET-MLPs on CIFAR-10 for 900 epochs. All the hyperparameters of these 10 networks are the same except for random seeds. The density level that we choose for this experiment is 0.7%. With this setup, all the networks have very different topologies even with the same density level. The topologies are optimized by adaptive sparse connectivity (prune-and-regrow strategy) during training with a pruning rate of 20% and the weights are optimized by momentum SGD with a learning rate of 0.01.

Figure 6: Heatmap representing the topological distance between the first layer of the two-layer SET-MLP networks on CIFAR-10. (a) refers to the distance before training. (b) refers to the distance after training. (c) represents the topological distance evolution during training for the first network. The axis labels index the different networks.

The distance between different networks before training is very large, as they are generated with different random seeds (Figure 6(a)), and the expectation is that these networks will also end up with very different topologies after the training process. This is clearly reflected in Figure 6(b).

We are also interested in how the topology evolves within one network trained with SET. Is the difference between the final topology and the original topology big or small? To answer this question, we visualize the optimization process of the sparse topology during training within one network. We save the topologies obtained every 100 epochs and use the proposed method to compare them with each other. The result is illustrated in Figure 6(c). We can see that the topological distance gradually increases from 0 to a large value, around 0.8. This means that, initialized with a random sparse topology, the network evolves towards a totally different topology during training.

In all cases, after training, the topologies end up with very different sparse configurations, while at the same time all of them have very similar performance, as shown in Table 1. We highlight that this phenomenon is in line with the Fashion-MNIST results, which confirms our observation that there is a plenitude of sparse topologies obtained by adaptive sparse connectivity which achieve very good performance.

4.4 Combinatorial Optimization of Sparse Neural Networks

Figure 7: Average test accuracy convergence of the SET network (yellow) and the average test accuracy of the retrained networks: starting from SET weights values (brown) and starting from random weights values (vivid cyan). Each line is the average of 10 trials.

Although sparse networks with a fixed topology are able to reach performance similar to that of dense models, randomly initialized sparse networks cannot always guarantee good performance, especially when the sparsity is very high, as shown in Figure 2. One effective way to optimize the sparse topology is adaptive sparse connectivity, a technique based on connection pruning followed by connection regrowing, which has shown good performance in previous works [25, 28, 3, 5]. Essentially, the learning process of the above-mentioned techniques based on adaptive sparse connectivity is a combinatorial optimization problem (over model parameters and sparse topologies). The good performance achieved by these techniques cannot be attributed solely to the sparse topologies, nor to their initialization [28].

Here, we want to further analyze whether the topologies optimized by adaptive sparse connectivity contribute to better test accuracy or not. We hypothesize that the test accuracy of the optimized topologies should continuously improve until they converge. To test our hypothesis, we first initialize 10 two-layer MLPs with an extremely low density level (0.5%) under different random seeds and then train them using SET with a pruning rate of 0.2 for 900 epochs on CIFAR-10. We save the sparse networks every 100 epochs and retrain these networks for another 1000 epochs with randomly re-initialized weights. Besides this, as a sanity check of the effectiveness of the combinatorial optimization, we also retrain the saved networks for 1000 epochs starting from the weights learned by SET.

Figure 7 plots the learning curves of SET and the averaged test accuracy of the retrained networks. We observe that the test accuracy of the randomly re-initialized networks consistently increases with the number of SET training epochs after which the topology was saved. This behavior highlights that adaptive sparse connectivity indeed helps the sparse topology to evolve towards an optimal one. It also seems that the topology learns faster at the beginning. However, the retrained networks that start from randomly initialized weights no longer match the performance of SET after about 400 epochs, which indicates that both the weight optimization and the topology optimization are crucial to the performance of sparse neural networks. Compared with random re-initialization, training further with the original weights significantly improves the performance. This phenomenon gives a good indication of the behavior of sparse neural networks. It may also point to directions for future research on sparse neural connectivity optimization, which, however, are beyond the scope of this paper.

5 Conclusion

In this work, we propose NNSTD, the first method based on graph theory that can compare different sparse neural network topologies. Using this method, we obtain novel insights into sparse neural networks by visualizing the topological optimization process of Sparse Evolutionary Training (SET). We demonstrate that randomly initialized sparse neural networks can be a good substitute for over-parameterized dense networks when there are no particularly high requirements for accuracy. Additionally, we show that there are many low-dimensional structures (sparse neural networks) that always achieve very good accuracy (better than dense networks), and that adaptive sparse connectivity is an effective technique to find them.

In light of these new insights, we suggest that, instead of devoting all resources to training over-parameterized models, intrinsically sparse networks with topological optimizers can be an alternative approach. Our results demonstrate that randomly initialized sparse neural networks with adaptive sparse connectivity offer benefits not only in terms of computational and memory costs, but also in terms of the principal performance criteria for neural networks, e.g. accuracy for classification tasks.

In the future, we intend to investigate larger datasets, such as ImageNet [30], while also considering other types of sparse neural networks and other techniques to train sparse networks from scratch. We also intend to invest more effort in developing hardware-friendly methods to induce sparsity.


This research has been partly funded by the NWO EDIC project.


  • [1] Y. Chauvin (1989) A back-propagation algorithm with optimal use of hidden units. In Advances in neural information processing systems, pp. 519–526. Cited by: §2.1.1.
  • [2] X. Dai, H. Yin, and N. K. Jha (2018) Grow and prune compact, fast, and accurate LSTMs. arXiv preprint arXiv:1805.11797. Cited by: §2.1.1.
  • [3] T. Dettmers and L. Zettlemoyer (2019) Sparse networks from scratch: faster training without losing performance. arXiv preprint arXiv:1907.04840. Cited by: §1, §2.1.2, §4.4.
  • [4] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §1.
  • [5] U. Evci, T. Gale, J. Menick, P. S. Castro, and E. Elsen (2019) Rigging the lottery: making all tickets winners. arXiv preprint arXiv:1911.11134. Cited by: 3rd item, §1, §2.1.2, §4.4.
  • [6] J. Frankle and M. Carbin (2018) The lottery ticket hypothesis: finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635. Cited by: §1, §1, §2.1.3.
  • [7] T. Gale, E. Elsen, and S. Hooker (2019) The state of sparsity in deep neural networks. arXiv preprint arXiv:1902.09574. Cited by: §1, §2.1.3.
  • [8] E. N. Gilbert (1959) Random graphs. The Annals of Mathematical Statistics 30 (4), pp. 1141–1144. Cited by: §2.2.
  • [9] Y. Guo, A. Yao, and Y. Chen (2016) Dynamic network surgery for efficient DNNs. In Advances in neural information processing systems, pp. 1379–1387. Cited by: §1.
  • [10] S. Han, J. Pool, J. Tran, and W. Dally (2015) Learning both weights and connections for efficient neural network. In Advances in neural information processing systems, pp. 1135–1143. Cited by: §1, §2.1.1.
  • [11] B. Hassibi and D. G. Stork (1993) Second order derivatives for network pruning: optimal brain surgeon. In Advances in neural information processing systems, pp. 164–171. Cited by: §2.1.1.
  • [12] Y. He, G. Kang, X. Dong, Y. Fu, and Y. Yang (2018) Soft filter pruning for accelerating deep convolutional neural networks. arXiv preprint arXiv:1808.06866. Cited by: §2.1.1.
  • [13] Y. He, P. Liu, Z. Wang, Z. Hu, and Y. Yang (2019) Filter pruning via geometric median for deep convolutional neural networks acceleration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4340–4349. Cited by: §2.1.1.
  • [14] X. Jin, C. Xu, J. Feng, Y. Wei, J. Xiong, and S. Yan (2016) Deep learning with s-shaped rectified linear activation units. In Thirtieth AAAI Conference on Artificial Intelligence. Cited by: §4.1.
  • [15] A. Krizhevsky, G. Hinton, et al. (2009) Learning multiple layers of features from tiny images. Technical report Citeseer. Cited by: §4.
  • [16] Y. LeCun, J. S. Denker, and S. A. Solla (1990) Optimal brain damage. In Advances in neural information processing systems, pp. 598–605. Cited by: §2.1.1.
  • [17] N. Lee, T. Ajanthan, and P. H. Torr (2018) SNIP: single-shot network pruning based on connection sensitivity. arXiv preprint arXiv:1810.02340. Cited by: §2.1.2.
  • [18] Y. Li, J. Yosinski, J. Clune, H. Lipson, and J. E. Hopcroft (2015) Convergent learning: do different neural networks learn the same representations?. In FE@ NIPS, pp. 196–212. Cited by: 1st item.
  • [19] J. Lin, Y. Rao, J. Lu, and J. Zhou (2017) Runtime neural pruning. In Advances in Neural Information Processing Systems, pp. 2181–2191. Cited by: §1.
  • [20] S. Liu, D. C. Mocanu, and M. Pechenizkiy (2019) Intrinsically sparse long short-term memory networks. arXiv preprint arXiv:1901.09208. Cited by: §1, §2.1.2.
  • [21] Z. Liu, M. Sun, T. Zhou, G. Huang, and T. Darrell (2018) Rethinking the value of network pruning. arXiv preprint arXiv:1810.05270. Cited by: §1, §2.1.3.
  • [22] C. Louizos, K. Ullrich, and M. Welling (2017) Bayesian compression for deep learning. In Advances in Neural Information Processing Systems, pp. 3288–3298. Cited by: §1.
  • [23] C. Louizos, M. Welling, and D. P. Kingma (2017) Learning sparse neural networks through regularization. arXiv preprint arXiv:1712.01312. Cited by: §1, §2.1.3.
  • [24] D. C. Mocanu, E. Mocanu, P. H. Nguyen, M. Gibescu, and A. Liotta (2016) A topological insight into restricted Boltzmann machines. Machine Learning 104 (2), pp. 243–270. Cited by: §2.1.2.
  • [25] D. C. Mocanu, E. Mocanu, P. Stone, P. H. Nguyen, M. Gibescu, and A. Liotta (2018) Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science. Nature communications 9 (1), pp. 2383. Cited by: §1, §2.1.2, §2.2, §4.1, §4.2, §4.4.
  • [26] D. C. Mocanu (2017) Network computations in artificial intelligence. Ph.D. Thesis, Technische Universiteit Eindhoven. External Links: ISBN 978-90-386-4305-2 Cited by: §2.1.2.
  • [27] D. Molchanov, A. Ashukha, and D. Vetrov (2017) Variational dropout sparsifies deep neural networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2498–2507. Cited by: §1, §2.1.3.
  • [28] H. Mostafa and X. Wang (2019) Parameter efficient training of deep convolutional neural networks by dynamic sparse reparameterization. arXiv preprint arXiv:1902.05967. Cited by: §1, §2.1.2, §4.4.
  • [29] S. Narang, E. Elsen, G. Diamos, and S. Sengupta (2017) Exploring sparsity in recurrent neural networks. arXiv preprint arXiv:1704.05119. Cited by: §2.1.1.
  • [30] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. (2015) ImageNet large scale visual recognition challenge. International journal of computer vision 115 (3), pp. 211–252. Cited by: §5.
  • [31] A. Sanfeliu and K. Fu (1983-05) A distance measure between attributed relational graphs for pattern recognition. IEEE Transactions on Systems, Man, and Cybernetics SMC-13 (3), pp. 353–362. Cited by: 1st item, §3.
  • [32] C. Wang, G. Zhang, and R. Grosse (2020) Picking winning tickets before training by preserving gradient flow. arXiv preprint arXiv:2002.07376. Cited by: §2.1.2.
  • [33] H. Xiao, K. Rasul, and R. Vollgraf (2017) Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747. Cited by: §4.
  • [34] H. Zhou, J. Lan, R. Liu, and J. Yosinski (2019) Deconstructing lottery tickets: zeros, signs, and the supermask. arXiv preprint arXiv:1905.01067. Cited by: §1, §2.1.3.
  • [35] M. Zhu and S. Gupta (2017) To prune, or not to prune: exploring the efficacy of pruning for model compression. arXiv preprint arXiv:1710.01878. Cited by: §2.1.1, §2.1.3.