
# Deep Rewiring: Training very sparse deep networks

Neuromorphic hardware tends to limit the connectivity of the deep networks that can run on it. Generic hardware and software implementations of deep learning also run more efficiently on sparse networks. Several methods exist for pruning the connections of a neural network after it has been trained without connectivity constraints. We present an algorithm, DEEP R, that enables us to directly train a sparsely connected neural network. DEEP R automatically rewires the network during supervised training so that connections are placed where they are most needed for the task, while their total number remains strictly bounded at all times. We demonstrate that DEEP R can be used to train very sparse feedforward and recurrent neural networks on standard benchmark tasks with just a minor loss in performance. DEEP R is based on a rigorous theoretical foundation that views rewiring as stochastic sampling of network configurations from a posterior.


## 1 Introduction

Network connectivity is one of the main determinants for whether a neural network can be efficiently implemented in hardware or simulated in software. For example, it is mentioned in Jouppi et al. (2017) that in Google’s tensor processing units (TPUs), weights do not normally fit in on-chip memory for neural network applications despite the small 8 bit weight precision on TPUs. Memory is also the bottleneck in terms of energy consumption in TPUs and FPGAs (Han et al., 2017; Iandola et al., 2016). For example, for an implementation of a long short term memory network (LSTM), memory reference consumes more than two orders of magnitude more energy than ALU operations (Han et al., 2017). The situation is even more critical in neuromorphic hardware, where either hard upper bounds on network connectivity are unavoidable (Schemmel et al., 2010; Merolla et al., 2014) or fast on-chip memory of local processing cores is severely limited, for example the MByte local memory of cores in the SpiNNaker system (Furber et al., 2014). This implementation bottleneck will become even more severe in future applications of deep learning when the number of neurons in layers will increase, causing a quadratic growth in the number of connections between them.

Evolution has apparently faced a similar problem when evolving large neuronal systems such as the human brain, given that the brain volume is dominated by white matter, i.e., by connections between neurons. The solution found by evolution is convincing. Synaptic connectivity in the brain is highly dynamic in the sense that new synapses are constantly rewired, especially during learning (Holtmaat et al., 2005; Stettler et al., 2006; Attardo et al., 2015; Chambers & Rumpel, 2017). In other words, rewiring is an integral part of the learning algorithms in the brain, rather than a separate process.

We are not aware of previous methods for simultaneous training and rewiring in artificial neural networks that stay within a strict bound on the total number of connections throughout the learning process. There are however several heuristic methods for pruning a larger network (Han et al., 2015b, a; Collins & Kohli, 2014; Yang et al., 2015; Srinivas & Babu, 2015), that is, the network is first trained to convergence, and network connections and/or neurons are pruned only subsequently. These methods are useful for downloading a trained network onto neuromorphic hardware, but not for on-chip training. A number of methods have been proposed that are capable of reducing connectivity during training (Collins & Kohli, 2014; Jin et al., 2016; Narang et al., 2017). However, these algorithms usually start out with full connectivity. Hence, besides reducing computational demands only partially, they cannot be applied when computational resources (such as memory) are bounded throughout training.

Inspired by experimental findings on rewiring in the brain, we propose in this article deep rewiring (DEEP R), an algorithm that makes it possible to train deep neural networks under strict connectivity constraints. In contrast to many previous pruning approaches that were based on heuristic arguments, DEEP R is embedded in a thorough theoretical framework. DEEP R is conceptually different from standard gradient descent algorithms in two respects. First, each connection has a predefined sign. Specifically, we assign to each connection $k$ a connection parameter $\theta_k$ and a constant sign $s_k \in \{-1, 1\}$. For non-negative $\theta_k$, the corresponding network weight is given by $w_k = s_k \theta_k$. In standard backprop, when the absolute value of a weight is moved through $0$, it becomes a weight with the opposite sign. In contrast, in DEEP R a connection vanishes in this case ($w_k = 0$), and a randomly drawn other connection is tried out by the algorithm. Second, in DEEP R, gradient descent is combined with a random walk in parameter space (de Freitas et al., 2000; Welling & Teh, 2011). This modification leads to important functional differences. In fact, our theoretical analysis shows that DEEP R jointly samples network weights and the network architecture (i.e., network connectivity) from the posterior distribution, that is, the distribution that combines the data likelihood and a specific connectivity prior in a Bayes optimal manner. As a result, the algorithm continues to rewire connections even when the performance has converged. We show that this feature enables DEEP R to adapt the network connectivity structure online when the task demands are drifting.

We show on several benchmark tasks that with DEEP R, the connectivity of several deep architectures — fully connected deep networks, convolutional nets, and recurrent networks (LSTMs) — can be constrained to be extremely sparse throughout training with a marginal drop in performance. In one example, a standard feed forward network trained on the MNIST dataset, we achieved good performance with % of the connectivity of the fully connected counterpart. We show that DEEP R reaches a similar performance level as state-of-the-art pruning algorithms where training starts with the full connectivity matrix. If the target connectivity is very sparse (a few percent of the full connectivity), DEEP R outperformed these pruning algorithms.

## 2 Rewiring in deep neural networks

Stochastic gradient descent (SGD) and its modern variants (Kingma & Ba, 2014; Tieleman & Hinton, 2012), implemented through the error backpropagation algorithm, form the dominant learning paradigm of contemporary deep learning applications. For a given list of network inputs $X$ and target network outputs $Y^*$, gradient descent iteratively moves the parameter vector $\theta$ in the direction of the negative gradient of an error function $E_{X,Y^*}(\theta)$ such that a local minimum of $E_{X,Y^*}(\theta)$ is eventually reached.

A more general view on neural network training is provided by a probabilistic interpretation of the learning problem (Bishop, 2006; Neal, 1992). In this probabilistic learning framework, the deterministic network output is interpreted as defining a probability distribution $p_{\mathcal{N}}(Y \mid X, \theta)$ over outputs $Y$ for the given input $X$ and the given network parameters $\theta$. The goal of training is then to find parameters that maximize the likelihood $p_{\mathcal{N}}(Y^* \mid X, \theta)$ of the training targets under this model (maximum likelihood learning). Training can again be performed by gradient descent on an equivalent error function that is usually given by the negative log-likelihood $E_{X,Y^*}(\theta) = -\log p_{\mathcal{N}}(Y^* \mid X, \theta)$.

Going one step further in this reasoning, a full Bayesian treatment adds prior beliefs about the network parameters through a prior distribution $p_{\mathcal{S}}(\theta)$ over parameter values (we term this distribution the structural prior for reasons that will become clear below), and the training goal is formulated via the posterior distribution over parameters $p^*(\theta \mid X, Y^*)$. The training goal that we consider in this article is to produce sample parameter vectors which have a high probability under the posterior distribution $p^*(\theta \mid X, Y^*)$. More generally, we are interested in a target distribution $p^*(\theta) \propto p^*(\theta \mid X, Y^*)^{1/T}$ that is a tempered version of the posterior, where $T > 0$ is a temperature parameter. For $T = 1$ we recover the posterior distribution, for $T > 1$ the peaks of the posterior are flattened, and for $T < 1$ the distribution is sharpened, leading to higher probabilities for parameter settings with better performance.
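The effect of the temperature can be illustrated on a toy discrete distribution. The following is a minimal sketch, not part of the original method: the function name `temper` and the three-point "posterior" are hypothetical and only serve to show how raising probabilities to the power $1/T$ and renormalizing flattens or sharpens a distribution.

```python
def temper(probs, T):
    """Return the distribution proportional to probs ** (1 / T)."""
    if T <= 0:
        raise ValueError("temperature must be positive")
    powered = [p ** (1.0 / T) for p in probs]
    Z = sum(powered)  # normalizing constant
    return [p / Z for p in powered]


# Toy posterior over three hypothetical parameter settings.
posterior = [0.7, 0.2, 0.1]

flat = temper(posterior, 2.0)    # T > 1: peaks are flattened
same = temper(posterior, 1.0)    # T = 1: posterior is recovered
sharp = temper(posterior, 0.5)   # T < 1: distribution is sharpened
```

With $T = 2$ the largest probability drops below 0.7, and with $T = 0.5$ it rises above 0.7, while each result still sums to one.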

This training goal was explored by Welling & Teh (2011), Chen et al. (2016), and Kappel et al. (2015), where it was shown that gradient descent in combination with stochastic weight updates performs Markov Chain Monte Carlo (MCMC) sampling from the posterior distribution. In this paper we extend these results by (a) allowing the algorithm also to sample the network structure, and (b) including a hard posterior constraint on the total number of connections during the sampling process. We define the training goal as follows:

$$\text{produce samples } \theta \text{ with high probability in } \quad p^*(\theta) = \begin{cases} 0 & \text{if } \theta \text{ violates the constraint} \\ \frac{1}{Z}\, p^*(\theta \mid X, Y^*)^{\frac{1}{T}} & \text{otherwise,} \end{cases} \tag{1}$$

where $Z$ is a normalizing constant. The emerging learning dynamics jointly samples from a posterior distribution over network parameters and constrained network architectures. In the next section we introduce the algorithm and in Section 4 we discuss the theoretical guarantees.

### The DEEP R algorithm:

In many situations, network connectivity is strictly limited during training, for instance because of hardware memory limitations. Then the limiting factor for a training algorithm is the maximal connectivity ever needed during training. DEEP R guarantees such a hard limit. DEEP R achieves the learning goal (1) on network configurations, that is, it not only samples the network weights and biases, but also the connectivity under the given constraints. This is achieved by introducing the following mapping from network parameters $\theta$ to network weights $w$:

A connection parameter $\theta_k$ and a constant sign $s_k \in \{-1, 1\}$ are assigned to each connection $k$. If $\theta_k$ is negative, we say that the connection $k$ is dormant, and the corresponding weight is $w_k = 0$. Otherwise, the connection is considered active, and the corresponding weight is $w_k = s_k \theta_k$. Hence, each $\theta_k$ encodes (a) whether the connection is active in the network, and (b) the weight of the connection if it is active. Note that we use here a single index $k$ for each connection/weight instead of the more usual double index that defines the sending and receiving neuron. This connection-centric indexing is more natural for our rewiring algorithms, where the focus is on the connections rather than the neurons. Using this mapping, sampling from the posterior over $\theta$ is equivalent to sampling from the posterior over network configurations, that is, the network connectivity structure and the network weights.
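The mapping from connection parameters to weights can be written in a few lines. This is a minimal sketch; the function name `weights_from_parameters` and the symbols `theta` (connection parameter) and `sign` (fixed sign) are illustrative choices, not taken from a reference implementation.

```python
def weights_from_parameters(theta, sign):
    """Map parameters to weights: sign * theta if theta >= 0 (active), else 0 (dormant)."""
    return [s * t if t >= 0 else 0.0 for t, s in zip(theta, sign)]


theta = [0.5, -0.2, 0.0, 1.5]   # one parameter per connection
sign = [1, -1, 1, -1]           # fixed signs, never changed by learning

w = weights_from_parameters(theta, sign)
active = [t >= 0 for t in theta]  # a parameter of exactly 0 still counts as active
```

A single number per connection therefore carries both pieces of information: its sign test says whether the connection exists, and its magnitude gives the weight.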

DEEP R is defined in Algorithm 1. Gradient updates are performed only on parameters of active connections (line 3). The derivatives of the error function can be computed in the usual way, most commonly with the backpropagation algorithm. Since we consider only classification problems in this article, we used the cross-entropy error for the experiments. The third term in line 3 ($-\eta\alpha$) is an $\ell_1$ regularization term, but other regularizers could be used as well.

A conceptual difference to gradient descent is introduced via the last term in line 3. Here, noise is added to the update, where the temperature parameter $T$ controls the amount of noise and $\nu_k$ is sampled from a zero-mean Gaussian of unit variance independently for each parameter and each update step. The last term alone would implement a random walk in parameter space. Hence, the whole line 3 of the algorithm implements a combination of gradient descent on the regularized error function with a random walk. Our theoretical analysis shows that this random walk behavior has an important functional consequence; see Section 4 for a discussion of the theoretical properties of DEEP R.

The rewiring aspect of the algorithm is captured in lines 4 and 6–9 of Algorithm 1. Whenever a parameter $\theta_k$ becomes smaller than $0$, the connection is set dormant, i.e., it is deleted from the network and no longer considered for updates (line 4). For each connection that was set to the dormant state, a new connection $k'$ is chosen randomly from the uniform distribution over dormant connections, $k'$ is activated, and its parameter is initialized to $0$. This rewiring strategy ensures that (a) exactly $K$ connections are active at any time during training (one initializes the network with $K$ active connections), and (b) dormant connections incur no computational cost except for drawing connections to be activated. Note that for sparse networks, it is efficient to keep only a list of active connections and none for the dormant connections. Then, one can efficiently draw connections from the whole set of possible connections and reject those that are already active.
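The update and rewiring steps described above can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the names `deep_r_step`, `toy_gradients`, `eta` (learning rate), `alpha` (regularization coefficient), and the quadratic toy error are assumptions made for the example. The update form (gradient step, constant $\ell_1$ decrement, Gaussian noise scaled by $\sqrt{2\eta T}$) is assumed from the description of line 3. Only active connections are stored, and new connections are drawn from the whole index range with rejection of active ones.

```python
import math
import random


def toy_gradients(theta, sign, target):
    """dE/dtheta_k for the toy error E = 0.5 * sum_k (sign_k * theta_k - target_k)^2."""
    return {k: sign[k] * (sign[k] * t - target[k]) for k, t in theta.items()}


def deep_r_step(theta, grads, K, n_total, eta, alpha, T, rng):
    """One DEEP R iteration; theta maps active connection index -> parameter."""
    # Noisy gradient step with l1 term, applied to active connections only.
    for k in list(theta):
        noise = math.sqrt(2.0 * eta * T) * rng.gauss(0.0, 1.0)
        theta[k] += -eta * grads[k] - eta * alpha + noise
        if theta[k] < 0.0:
            del theta[k]  # negative parameter: the connection becomes dormant
    # Replenish: sample from all connections and reject already active ones.
    while len(theta) < K:
        candidate = rng.randrange(n_total)
        if candidate not in theta:
            theta[candidate] = 0.0  # newly activated connections start at zero
    return theta


rng = random.Random(0)
n_total, K = 50, 5  # 50 possible connections, 5 allowed to be active
sign = [rng.choice((-1, 1)) for _ in range(n_total)]
# Hypothetical task: only the first K connections have a nonzero target weight.
target = [sign[k] * 1.0 if k < K else 0.0 for k in range(n_total)]
theta = {k: 0.0 for k in rng.sample(range(n_total), K)}

active_counts = []
for _ in range(2000):
    grads = toy_gradients(theta, sign, target)
    deep_r_step(theta, grads, K, n_total, eta=0.05, alpha=1e-3, T=1e-3, rng=rng)
    active_counts.append(len(theta))
```

Storing only the active set in a dictionary keeps memory proportional to `K` rather than `n_total`, which is the point of the rejection-sampling note above. The rejection loop assumes `K <= n_total`.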

## 3 Experiments

### Rewiring in fully connected and in convolutional networks:

We first tested the performance of DEEP R on MNIST and CIFAR-10. For MNIST, we considered a fully connected feed-forward network used in Han et al. (2015b) to benchmark pruning algorithms. It has two hidden layers of and

neurons respectively and a 10-fold softmax output layer. On the CIFAR-10 dataset, we used a convolutional neural network (CNN) with two convolutional layers followed by two fully connected layers. For reproducibility, the network architecture and all parameters of this CNN were taken from the official TensorFlow tutorial. On CIFAR-10, we used a decreasing learning rate and a cooling schedule to reduce the temperature parameter $T$ over iterations (see Appendix A for details on all experiments).

For each task, we performed four training sessions. First, we trained a network with DEEP R. In the CNN, the first convolutional layer was kept fully connected while we allowed rewiring of the second convolutional layer. Second, we tested another algorithm, soft-DEEP R, which is a simplified version of DEEP R that does however not guarantee a strict connectivity constraint (see Section 4 for a description). Third, we trained a network in the standard manner without any rewiring or pruning to obtain a baseline performance. Finally, we trained a network with a connectivity that was randomly chosen before training and kept fixed during the optimization. The connectivity was however not completely random. Rather each layer received a number of connections that was the same as the number found by soft-DEEP R. The performance of this network is expected to be much better than a network where all layers are treated equally.

Fig. 1 shows the performance of these algorithms on MNIST (panel A) and on CIFAR-10 (panel B). DEEP R reaches a classification accuracy of % when constrained to % connectivity. To evaluate precisely the accuracy that is reachable with % connectivity, we did an additional experiment where we doubled the number of training epochs. DEEP R reached a classification accuracy of (less than % drop in comparison to the fully connected baseline). Training on fixed random connectivity performed surprisingly well for connectivities around %, possibly due to the large redundancy in the MNIST images. Soft-DEEP R does not guarantee a strict upper bound on the network connectivity. When considering the maximum connectivity ever seen during training, soft-DEEP R performed consistently worse than DEEP R for networks where this maximum connectivity was low. On CIFAR-10, the classification accuracy of DEEP R was 84.1 % at a connectivity level of 5 %. The performance of DEEP R at 20 % connectivity was close to the performance of the fully connected network.

To study the rewiring properties of DEEP R, we monitored the number of newly activated connections per iteration (i.e., connections that changed their status from dormant to active in that iteration). We found that after an initial transient, the number of newly activated connections converged to a stable value and remained stable even after network performance had converged, see Appendix B.

### Rewiring in recurrent neural networks:

In order to test the generality of our rewiring approach, we also considered the training of recurrent neural networks with backpropagation through time (BPTT). Recurrent networks are quite different from their feed forward counterparts in terms of their dynamics. In particular, they are potentially unstable due to recurrent loops in inference and training signals. As a test bed, we considered an LSTM network trained on the TIMIT data set. In our rewiring algorithms, all connections were potentially available for rewiring, including connections to gating units. From the TIMIT audio data, MFCC coefficients and their temporal derivatives were computed and fed into a bi-directional LSTM with a single recurrent layer of 200 cells followed by a softmax to generate the phoneme likelihood (Graves & Schmidhuber, 2005), see Appendix A.

As a first baseline, we considered a fully connected LSTM with standard BPTT without regularization as the training algorithm. This algorithm performed similarly to the one described in Greff et al. (2017). It turned out however that performance could be significantly improved by including a regularizer in the training objective. We therefore considered the same setup with regularization (cross-validated). This setup achieved a phoneme error rate of %. We note that better results have been reported in the literature using the CTC cost function and deeper networks (Graves et al., 2013). For the sake of easy comparison however, we stuck here to the much simpler setup with a medium-sized network and the standard cross-entropy error function.

We found that connectivity can be reduced significantly in this setup with our algorithms, see Fig. 2. Both algorithms, DEEP R and soft-DEEP R, performed even slightly better than the fully connected baseline at connectivities around %, probably due to generalization issues. DEEP R outperformed soft-DEEP R at very low connectivities and it outperformed BPTT with fixed random connectivity consistently at any connectivity level considered.

### Comparison to algorithms that cannot be run on very sparse networks:

We wondered how much performance is lost when a strict connectivity constraint has to be taken into account during training as compared to pruning algorithms that only achieve sparse networks after training. To this end, we compared the performance of DEEP R and soft-DEEP R to recently proposed pruning algorithms: $\ell_1$-shrinkage (Tibshirani, 1996; Collins & Kohli, 2014) and the pruning algorithm proposed by Han et al. (2015b). $\ell_1$-shrinkage uses simple $\ell_1$-norm regularization and finds network solutions with a connectivity that is comparable to the state of the art (Collins & Kohli, 2014; Yu et al., 2012). We chose this one since it is relatively close to DEEP R with the difference that it does not implement rewiring. The pruning algorithm from Han et al. (2015b) is more complex and uses a projection of network weights on a constraint. Both algorithms prune connections starting from the fully connected network. The hyper-parameters such as learning rate, layer size, and weight decay coefficients were kept the same in all experiments. We validated by an extensive parameter search that these settings were good settings for the comparison algorithms, see Appendix A.

Results for the same setups as considered above (MNIST, CIFAR-10, TIMIT) are shown in Fig. 3. Despite the strict connectivity constraints, DEEP R and soft-DEEP R performed slightly better than the unconstrained pruning algorithms on CIFAR-10 and TIMIT at all connectivity levels considered. On MNIST, pruning was slightly better for larger connectivities. On MNIST and TIMIT, pruning and $\ell_1$-shrinkage failed completely for very low connectivities while rewiring with DEEP R or soft-DEEP R still produced reasonable networks in this case.

One interesting observation can be made for the error rate evolution of the LSTM on TIMIT (Fig. 3D). Here, both $\ell_1$-shrinkage and pruning induced large sudden increases of the error rate, possibly due to instabilities induced by parameter changes in the recurrent network. In contrast, we observed only small glitches of this type in DEEP R. This indicates that sparsification of network connectivity is harder in recurrent networks due to potential instabilities, and that DEEP R is better suited to avoid such instabilities. The reason for this advantage of DEEP R is however not clear.

### Transfer learning is supported by DEEP R:

If the temperature parameter $T$ is kept constant during training, the proposed rewiring algorithms do not converge to a static solution but continuously explore the posterior distribution of network configurations. As a consequence, rewiring is expected to adapt to changes in the task in an online manner. If the task demands change in an online learning setup, one may hope that a transfer of invariant aspects of the tasks occurs such that these aspects can be utilized for faster convergence on later tasks (transfer learning). To verify this hypothesis, we performed one experiment on the MNIST dataset where the class to which each output neuron should respond was changed after each training epoch (class-shuffled MNIST task).

Fig. 4A shows the performance of a network trained with DEEP R in the class-shuffled MNIST task. One can observe that performance recovered after each shuffling of the target classes. More importantly, we found a clear trend of increasing classification accuracy even across shuffles. This indicates a form of transfer learning in the network such that information about the previous tasks (i.e., the previous class-shuffled MNIST instances) was preserved in the network and utilized in the following instances.

We hypothesized that the reason for this transfer was that early layers developed features that were invariant to the target shuffling and did not need to be re-learned in later task instances. To verify this hypothesis, we computed the following two quantities. First, in order to quantify the speed of parameter dynamics in different layers, we computed the correlation between the layer weight matrices of two subsequent training epochs (Fig. 4B). Second, in order to quantify the speed of change of network dynamics in different layers, we computed the correlation between the neuron outputs of a layer in subsequent epochs (Fig. 4C). We found that the correlation between weights and between layer outputs increased across training epochs and was significantly larger in early layers. This supports the hypothesis that early network layers learned features invariant to the shuffled coding convention of the output layer.
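The layer-wise comparison of weight matrices across epochs can be sketched as below. The text does not state which correlation measure was used; this sketch assumes the Pearson correlation over flattened matrices, and the helper names and the two small example matrices are hypothetical.

```python
import math


def flatten(matrix):
    """Flatten a weight matrix given as a list of rows."""
    return [x for row in matrix for x in row]


def pearson(a, b):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Hypothetical weights of one layer after two subsequent epochs.
w_epoch1 = [[0.2, 0.0], [-0.5, 0.1]]
w_epoch2 = [[0.25, 0.0], [-0.45, 0.0]]

r = pearson(flatten(w_epoch1), flatten(w_epoch2))
```

A value of `r` close to 1 means the layer changed little between epochs. Applying the same function to recorded neuron outputs instead of weights gives the second quantity.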

## 4 Convergence properties of DEEP R and soft-DEEP R

The theoretical analysis of DEEP R is somewhat involved due to the implemented hard constraints. We therefore first introduce and discuss here another algorithm, soft-DEEP R, for which the theoretical treatment of convergence is more straightforward. In contrast to standard gradient-based algorithms, this convergence is not a convergence to a particular parameter vector, but a convergence to the target distribution over network configurations.

### Convergence properties of soft-DEEP R:

The soft-DEEP R algorithm is given in Algorithm 2. Note that the updates for active connections are the same as for DEEP R (line 3). Also the mapping from parameters to weights is the same as in DEEP R. The main conceptual difference to DEEP R is that connection parameters continue their random walk when dormant (line 7). Due to this random walk, connections will be re-activated at random times when they cross zero. Therefore, soft-DEEP R does not impose a hard constraint on network connectivity but rather uses the $\ell_1$-norm regularization to impose a soft constraint.

Since dormant connections have to be simulated, this algorithm is computationally inefficient for sparse networks. An approximation could be used where silent connections are re-activated at a constant rate, leading to an algorithm very similar to DEEP R. DEEP R adds to that the additional feature of a strict connectivity constraint.

The central result for soft-DEEP R has been proven in the context of spiking neural networks by Kappel et al. (2015) in order to understand rewiring in the brain from a functional perspective. The same theory however also applies to standard deep neural networks. To be able to apply standard mathematical tools, we consider parameter dynamics in continuous time. In particular, consider the following stochastic differential equation (SDE):

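$$d\theta_k = \beta \left. \frac{\partial}{\partial \theta_k} \log p^*(\theta \mid X, Y^*) \right|_{\theta^t} dt + \sqrt{2 \beta T}\, dW_k, \tag{2}$$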

where $\beta$ is the equivalent to the learning rate and $\left. \frac{\partial}{\partial \theta_k} \log p^*(\theta \mid X, Y^*) \right|_{\theta^t}$ denotes the gradient of the log parameter posterior evaluated at the parameter vector $\theta^t$ at time $t$. The term $dW_k$ denotes the infinitesimal updates of a standard Wiener process. This SDE describes gradient ascent on the log posterior combined with a random walk in parameter space. We show in Appendix C that the unique stationary distribution of this parameter dynamics is given by

$$p^*(\theta) = \frac{1}{Z}\, p^*(\theta \mid X, Y^*)^{\frac{1}{T}}. \tag{3}$$
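The relation between Eq. (2) and Eq. (3) can be checked numerically on a one-dimensional toy problem. The sketch below is an illustration under stated assumptions, not the proof of Appendix C: it discretizes the SDE with a simple Euler–Maruyama scheme and assumes a standard Gaussian as a stand-in posterior, so that $\log p^*(\theta \mid X, Y^*) = -\theta^2/2$ up to a constant. Raising this density to the power $1/T$ gives a Gaussian with variance $T$, which the sample variance should approximate. The function name, step size, and number of steps are arbitrary choices.

```python
import math
import random


def langevin_samples(grad_log_post, T, beta=1.0, dt=0.01, steps=200_000, seed=0):
    """Euler-Maruyama discretization of Eq. (2) for a single parameter."""
    rng = random.Random(seed)
    theta = 0.0
    samples = []
    for _ in range(steps):
        drift = beta * grad_log_post(theta) * dt
        noise = math.sqrt(2.0 * beta * T * dt) * rng.gauss(0.0, 1.0)
        theta += drift + noise
        samples.append(theta)
    return samples


T = 0.5
# Toy posterior: standard Gaussian, so the gradient of the log posterior is -theta.
samples = langevin_samples(lambda th: -th, T)[20_000:]  # drop burn-in

mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```

For this toy case `var` should come out close to `T`, up to sampling noise and a small bias from the finite step size.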

Since we considered classification tasks in this article, we interpret the network output as a multinomial distribution over class labels. Then, the derivative of the log likelihood is equivalent to the derivative of the negative cross-entropy error. Together with an $\ell_1$ regularization term for the prior, and after discretization of time, we obtain the update of line 3 in Algorithm 2 for non-negative parameters. For negative parameters, the first term in Eq. (2) vanishes since the network weight is constant zero there. This leads to the update in line 7. Note that we introduced a reflecting boundary at a lower bound $\theta_{\min} < 0$ in the practical algorithm to avoid divergence of parameters (line 8).
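A single soft-DEEP R iteration, as described in this section, can be sketched as follows. Algorithm 2 itself is not reproduced in this text, so the details are assumptions: the names `eta`, `alpha`, and `theta_min`, the noise scale $\sqrt{2\eta T}$, the implementation of the boundary as a mirror reflection, and the quadratic toy error with all signs set to $+1$ are illustrative only.

```python
import math
import random


def soft_deep_r_step(theta, grads, eta, alpha, T, theta_min, rng):
    """One soft-DEEP R iteration over all parameters, active and dormant."""
    for k in range(len(theta)):
        noise = math.sqrt(2.0 * eta * T) * rng.gauss(0.0, 1.0)
        if theta[k] >= 0.0:
            # Active connection: gradient step, l1 term, and noise.
            theta[k] += -eta * grads[k] - eta * alpha + noise
        else:
            # Dormant connection: random walk only, the weight stays zero.
            theta[k] += noise
        if theta[k] < theta_min:
            # Assumed reflecting boundary: mirror the parameter at theta_min.
            theta[k] = 2.0 * theta_min - theta[k]
    return theta


rng = random.Random(1)
n = 20
# Hypothetical task: only the first three connections have a nonzero target.
target = [1.0 if k < 3 else 0.0 for k in range(n)]
theta = [0.0] * n

connectivity = []
for _ in range(1000):
    # Toy error E = 0.5 * sum_k (w_k - target_k)^2 with w_k = theta_k if active.
    grads = [t - target[k] if t >= 0.0 else 0.0 for k, t in enumerate(theta)]
    soft_deep_r_step(theta, grads, eta=0.05, alpha=0.05, T=1e-3, theta_min=-0.5, rng=rng)
    connectivity.append(sum(t >= 0.0 for t in theta))
```

Unlike in DEEP R, every parameter is visited in each iteration and the number of active connections recorded in `connectivity` is free to vary, which reflects the soft constraint and the higher computational cost for sparse networks.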

### Convergence properties of DEEP R:

A detailed analysis of the stochastic process that underlies the algorithm is provided in Appendix D. Here we summarize the main findings. Each iteration of DEEP R in Algorithm 1 consists of two parts: In the first part (lines 2-5) all connections that are currently active are advanced, while keeping the other parameters at 0. In the second part (lines 6-9) the connections that became dormant during the first step are randomly replenished.

To describe the connectivity constraint over connections we introduce the binary constraint vector $c$, which represents the set of active connections, i.e., element $c_k$ of $c$ is $1$ if connection $k$ is allowed to be active and $0$ otherwise. In Theorem 2 of Appendix D, we link DEEP R to a compound Markov chain operator that simultaneously updates the parameters $\theta$ according to the soft-DEEP R dynamics under the constraint $c$ and the constraint vector $c$ itself. The stationary distribution of this Markov chain is given by the joint probability

$$p^*(\theta, c) \propto p^*(\theta)\, \mathcal{C}(\theta, c)\, p_{\mathcal{C}}(c), \tag{4}$$

where $\mathcal{C}(\theta, c)$ is a binary function that indicates compatibility of $\theta$ with the constraint $c$, and $p^*(\theta)$ is the tempered posterior of Eq. (3), which is left stationary by soft-DEEP R in the absence of constraints. $p_{\mathcal{C}}(c)$ in Eq. (4) is a uniform prior over all connectivity constraints with exactly $K$ synapses that are allowed to be active. By marginalizing over $c$, we obtain that the posterior distribution of DEEP R is identical to that of soft-DEEP R if the constraint on the connectivity is fulfilled. By marginalizing over $\theta$, we obtain that the probabilities of sampling a network architecture (i.e., a connectivity constraint $c$) with DEEP R and with soft-DEEP R are proportional to one another. The only difference is that DEEP R exclusively visits architectures with $K$ active connections (see equation (39) in Appendix D for details).

In other words, DEEP R solves a constrained optimization problem by sampling parameter vectors with high performance within the space of constrained connectivities. The algorithm will therefore spend most time in network configurations where the connectivity supports the desired network function, so that connections with large support under the objective function (1) are kept active with high probability, while other connections are randomly tested and discarded if found not useful.

## 5 Discussion

Conclusions: We have presented a method for modifying backprop and backprop-through-time so that not only the weights of connections, but also the connectivity graph is simultaneously optimized during training. This can be achieved while always staying within a given bound on the total number of connections. When the absolute value of a weight is moved by backprop through $0$, it becomes a weight with the opposite sign. In contrast, in DEEP R a connection vanishes in this case (more precisely: becomes dormant), and a randomly drawn other connection is tried out by the algorithm. This setup requires that, like in neurobiology, the sign of a weight does not change during learning. Another essential ingredient of DEEP R is that it superimposes the gradient-driven dynamics of each weight with a random walk. This feature can be viewed as another inspiration from neurobiology (Mongillo et al., 2017). An important property of DEEP R is that, in spite of its stochastic ingredient, its overall learning dynamics remains theoretically tractable: not as gradient descent in the usual sense, but as convergence to a stationary distribution of network configurations which assigns the largest probabilities to the best-performing network configurations. An automatic benefit of this ongoing stochastic parameter dynamics is that the training process immediately adjusts to changes in the task, while simultaneously transferring previously gained competences of the network (see Fig. 4).

### Acknowledgements

Written under partial support by the Human Brain Project of the European Union, and the Austrian Science Fund (FWF): I 3251-N33. We thank Franz Pernkopf and Matthias Zöhrer for useful comments regarding the TIMIT experiment.

## Appendix A Methods

Implementations of DEEP R are freely available at github.com/guillaumeBellec/deep_rewiring.

### Choosing hyper-parameters for DEEP R:

The learning rate is defined for each task independently (see task descriptions below). Considering that the number of active connections is given as a constraint, the remaining hyper-parameters are the regularization coefficient $\alpha$ and the temperature $T$. We found that the performance of DEEP R does not depend strongly on the temperature $T$. The choice of $\alpha$, however, has to be made more carefully. For each dataset there was an ideal value of $\alpha$: values one order of magnitude higher or lower typically led to a substantial loss of accuracy.

In MNIST, 96.3% accuracy under the constraint of 1% connectivity was achieved with and chosen so that . In TIMIT, and (higher values of could improve the performance slightly but it did not seem very significant). In CIFAR-10 a different was assigned to each connectivity matrix. To reach 84.1% accuracy with 5% connectivity we used in each layer from input to output . The temperature is initialized with and decays with the learning rate (see paragraph of the methods about CIFAR-10).

### Choosing hyper-parameters for soft-DEEP R:

The main difference between soft-DEEP R and DEEP R is that the connectivity is not given as a global constraint. This is a considerable drawback if one has a strict constraint due to hardware limitations, but it is also an advantage if one simply wants to generate very sparse network solutions without having a clear idea of the connectivities that are reachable for the task and architecture considered.

In any case, the performance depends on the choice of hyper-parameters , and , but also, unlike in DEEP R, these hyper-parameters have inter-dependent relationships that one cannot ignore (as for DEEP R, the learning rate is defined for each task independently). The reason why soft-DEEP R depends more on the temperature is that the rate of re-activation of connections is driven by the amplitude of the noise, whereas they are decoupled in DEEP R. To summarize the results of an exhaustive parameter search, we found that should ideally be slightly below . In general high leads to high performance but it also defines an approximate lower bound on the smallest reachable connectivity. This lower bound can be estimated by computing analytically the stationary distribution under rough approximations and the assumption that the gradient of the likelihood is zero. If is the targeted lower connectivity bound, one needs .

For MNIST we used and for all data points in Fig. 1 panel A and a range of values of to span different ranges of connectivity lower bounds. In TIMIT and CIFAR-10 we used a simpler strategy which led to a similar outcome: we fixed the relationships: and we varied only to produce the solutions shown in Fig. 1 panel B and Fig. 2.

### Re-implementing pruning and ℓ1-shrinkage:

To implement ℓ1-shrinkage (Tibshirani, 1996; Collins & Kohli, 2014), we applied the ℓ1-shrinkage operator after each gradient descent iteration. The performance of the algorithm is evaluated for different varying on a logarithmic scale to privilege a sparse connectivity or a high accuracy. For instance for MNIST in Figure 3.A we used of the form with going from to . The optimal parameter was .
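The shrinkage step described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the common soft-thresholding form of the ℓ1-shrinkage operator and a threshold equal to learning rate times regularization coefficient. The names `l1_shrink`, `sgd_step_with_shrinkage`, `lr` and `alpha` are hypothetical.

```python
import numpy as np

def l1_shrink(w, threshold):
    """Soft-thresholding: move each weight toward zero by `threshold`, clamping at zero."""
    return np.sign(w) * np.maximum(np.abs(w) - threshold, 0.0)

def sgd_step_with_shrinkage(w, grad, lr, alpha):
    """One plain gradient step followed by the shrinkage operator.

    Using lr * alpha as the threshold is an assumption of this sketch.
    """
    return l1_shrink(w - lr * grad, lr * alpha)

w = np.array([0.30, -0.02, 0.01, -0.50])
grad = np.zeros_like(w)  # a zero gradient isolates the effect of the shrinkage
w_new = sgd_step_with_shrinkage(w, grad, lr=0.1, alpha=0.5)
n_active = np.count_nonzero(w_new)  # weights below the threshold were set exactly to zero
```

Weights whose magnitude falls below the threshold become exactly zero, which is what makes the resulting connectivity sparse rather than merely small.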

We implemented the pruning described in Han et al. (2015b). This algorithm uses several phases: training - pruning - training, or one can also add another pruning iteration: training - pruning - training - pruning - training. We went for the latter because it increased performance. Each "training" phase is a complete training of the neural network with -regularization (footnote 1: to be fair to the other algorithms, we did not allocate three times more training time to pruning; each "training" phase was performed for a third of the total number of epochs, which was chosen much larger than necessary). At each "pruning" phase, the standard deviation of weights within a weight matrix is computed and all active weights with absolute values smaller than are pruned ( is called the quality parameter). Grid search is used to optimize the -regularization coefficient and quality parameter. The results for MNIST are reported in Figure 5.
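A pruning phase of this kind can be sketched as below. This is a hedged illustration only: the threshold is assumed to be the quality parameter times the standard deviation, and the standard deviation is assumed to be taken over the currently active weights of the matrix. The names `prune_by_std`, `mask` and `quality` are hypothetical.

```python
import numpy as np

def prune_by_std(w, mask, quality):
    """Deactivate active weights whose magnitude is below quality * std.

    Assumption of this sketch: the standard deviation is computed over
    the currently active weights of the matrix.
    """
    threshold = quality * np.std(w[mask])
    new_mask = mask & (np.abs(w) >= threshold)
    return w * new_mask, new_mask, threshold

rng = np.random.default_rng(0)
w = rng.standard_normal((100, 50))
mask = np.ones_like(w, dtype=bool)

# training - pruning - training - pruning - training: two pruning phases,
# with the (omitted) retraining of the surviving weights in between
w, mask1, thr1 = prune_by_std(w, mask, quality=1.0)
w, mask2, thr2 = prune_by_std(w, mask1, quality=1.0)
```

Because each phase only removes entries from the mask, the connectivity can only decrease from one pruning phase to the next.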

### MNIST:

We used a standard feed-forward network architecture with two hidden layers with neurons each and rectified linear activation functions followed by a 10-fold softmax output. For all algorithms we used a learning rate of 0.05 and a batch size of 10 with standard stochastic gradient descent. Learning stopped after 10 epochs. All reported performances in this article are based on the classification error on the MNIST test set.

### CIFAR-10:

The official tutorial for convolutional networks of TensorFlow (version 1.3: www.tensorflow.org/tutorials/deep_cnn) is used as a reference implementation. Its performance out-of-the-box provides the fully connected baseline. We used the values given in the tutorial for the hyper-parameters in all algorithms. In particular the layer-specific weight decay coefficients that interact with our algorithms were chosen from the tutorial for DEEP R, soft-DEEP R, pruning, and ℓ1-shrinkage.

In the fully connected baseline implementation, standard stochastic gradient descent was used with a decreasing learning rate initialized to and decayed by a factor every epochs. Training was performed for one million iterations for all algorithms. For soft-DEEP R, which includes a temperature parameter, keeping a high temperature as the weight decays was increasing the rate of re-activation of connections. Even if intermediate solutions were rather sparse and efficient the solutions after convergence were always dense. Therefore, the weight decay was accompanied by annealing of the temperature . This was done by setting the temperature to be proportional to the decaying . This annealing was used for DEEP R and soft-DEEP R.
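The coupling between temperature and learning rate described above can be sketched as follows. All numerical values are placeholders, not the values used in the experiments, and the step-decay schedule and function names are assumptions of this sketch.

```python
import math

def step_decay(initial_lr, decay_factor, decay_every, epoch):
    """Learning rate multiplied by decay_factor once every decay_every epochs."""
    return initial_lr * decay_factor ** (epoch // decay_every)

def annealed_temperature(initial_temperature, initial_lr, current_lr):
    """Temperature kept proportional to the decaying learning rate."""
    return initial_temperature * current_lr / initial_lr

# Placeholder hyper-parameters, chosen only for illustration.
initial_lr, decay_factor, decay_every = 0.1, 0.1, 350
initial_temperature = 1e-5

schedule = []
for epoch in (0, 349, 350, 700):
    lr = step_decay(initial_lr, decay_factor, decay_every, epoch)
    schedule.append((epoch, lr, annealed_temperature(initial_temperature, initial_lr, lr)))
```

With this coupling the ratio of temperature to learning rate stays constant, so the noise amplitude shrinks together with the gradient steps instead of dominating them late in training.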

### TIMIT:

The TIMIT dataset was preprocessed and the LSTM architecture was chosen to reproduce the results from Greff et al. (2017). Input time series were formed by 12 MFCC coefficients and the log energy computed over each time frame. The inputs were then expanded with their first and second temporal derivatives. There are 61 different phonemes annotated in the TIMIT dataset. To report an error rate that is comparable to the literature, we performed a standard grouping of the phonemes to generate 39 output classes (Lee & Hon, 1989; Graves et al., 2013; Greff et al., 2017). As usual, the dialect-specific sentences were excluded (SA files). The phoneme error rate was computed as the proportion of misclassified frames.

A validation set and early stopping were necessary to train a network with dense connectivity matrix on TIMIT because the performance was sometimes unstable and dropped suddenly during training, as seen in Fig. 3D for ℓ1-shrinkage. Therefore a validation set was defined by randomly selecting 5% of the training utterances. All algorithms were trained for 40 epochs and the reported test error rate is the one at minimal validation error.

To accelerate the training in comparison to the reference from Greff et al. (2017) we used mini-batches of size 32 and the ADAM optimizer (Kingma & Ba, 2014). This was also an opportunity to test the performance of DEEP R and soft-DEEP R with such a variant of gradient descent. The learning rate was set to and we kept the default momentum parameters of ADAM, yet we found that changing the parameter (as defined in Kingma & Ba (2014)) from to improved the stability of fully connected networks during training in this recurrent setup. As we could not find a reference that implemented ℓ1-shrinkage in combination with ADAM, we simply applied the shrinkage operator after each iteration of ADAM, which might not be the ideal choice in theory. It worked well in practice, as the minimal error rate was reached with this setup. The same type of regularization in combination with ADAM was used for DEEP R and soft-DEEP R, which led to very sparse and efficient network solutions.

### Initialization of connectivity matrices:

We found that the performance of the networks depended strongly on the initial connectivity. Therefore, we used the following heuristics to generate the initial connectivity for DEEP R, soft-DEEP R and the control setup with fixed connectivity.

First, for the connectivity matrix of each individual layer, the zero entries were chosen with uniform probability. Second, for a given connectivity constraint we found that the learning time increased and the performance dropped if the initial connectivity matrices were not chosen carefully. Typically the performance dropped drastically if the output layer was initialized to be very sparse. Yet in most networks the number of parameters is dominated by large connectivity matrices to hidden layers. A basic rule of thumb that worked in our cases was to give an equal number of active connections to the large and intermediate weight matrices, whereas smaller ones - typically output layers - should be densely connected.
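The first step of this heuristic, choosing the zero entries of each connectivity matrix uniformly at random, can be sketched as below. The layer shapes and per-layer connectivities are placeholders that merely follow the rule of thumb (sparse large matrices, dense output layer); they are not the values used in the experiments, and `random_connectivity` is a hypothetical helper.

```python
import numpy as np

def random_connectivity(shape, connectivity, rng):
    """Binary mask with round(connectivity * size) active entries at uniformly random positions."""
    size = int(np.prod(shape))
    n_active = int(round(connectivity * size))
    flat = np.zeros(size, dtype=bool)
    flat[rng.choice(size, size=n_active, replace=False)] = True
    return flat.reshape(shape)

rng = np.random.default_rng(0)

# Placeholder shapes and connectivities, for illustration only.
layers = [((784, 300), 0.01), ((300, 100), 0.03), ((100, 10), 0.30)]
masks = [random_connectivity(shape, p, rng) for shape, p in layers]

n_active_total = sum(int(m.sum()) for m in masks)
n_params_total = sum(m.size for m in masks)
global_connectivity = n_active_total / n_params_total
```

Sampling positions without replacement fixes the number of active connections per layer exactly, which is convenient when the total is a hard constraint.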

We suggest two approaches to refine this guess: One can either look at the statistics of the connectivity matrices after convergence of DEEP R or soft-DEEP R, or, if possible, the second alternative is to initialize once soft-DEEP R with a dense matrix and observe the connectivity matrix after convergence. In our experiments the connectivities after convergence were coherent with the rule of thumb described above and we did not need to pursue intensive search for ideal initial connectivity matrices.

For MNIST, the number of parameters in each layer was 235k, 30k and 1k from input to output. Using our rule of thumb, for a given global connectivity , the layers were respectively initialized with connectivity , and .

For CIFAR-10, the baseline network had two convolutional layers with filters of shapes and respectively, followed by two fully connected layers with weight matrices of shape and . The last layer was then projected into a softmax over 10 output classes. The numbers of parameters per connectivity matrix were therefore 5k, 102k, 885k, 74k and 2k from input to output. The connectivity matrices were initialized with connectivity and where is approximately the resulting global connectivity.

For TIMIT, the connection matrix from the input to the hidden layer was of size , the recurrent matrix had size and the size of the output matrix was . These three connectivity matrices were respectively initialized with a connectivity of , and where is approximately the resulting global connectivity.

### Initialization of weight matrices:

For CIFAR-10 the initialization of matrix coefficients was given by the reference implementation. For MNIST and TIMIT, the weight matrices were initialized with where is the number of afferent neurons, samples from a centered gaussian with unit variance and is a binary connectivity matrix.

It would not be good to initialize the parameters of all dormant connections to zero in soft-DEEP R. After a single noisy iteration, half of them would become active which would fail to initialize the network with a sparse connectivity matrix. To balance out this problem we initialized the parameters of dormant connections uniformly between the clipping value and zero in soft-DEEP R.

### Parameters for Figure 4

The experiment provided in Figure 4 is a variant of our MNIST experiment where the target labels were shuffled after every training epoch. To make the generalization capability of DEEP R over a small number of epochs visible, we enhanced the noise exploration by setting the batch size to 1 so that the connectivity matrices were updated at every time step. We also used a larger network with 400 neurons in each hidden layer. The remaining parameters were similar to those used previously: the connectivity was constrained to and the connectivity matrices were initialized with respective connectivities: , , and . The parameters of DEEP R were set to , and .

## Appendix B Rewiring during training on MNIST

Fig. 6 shows the rewiring behavior of DEEP R per network layer for the feed-forward neural network trained on MNIST and the training run indicated by the small gray box around the green dot in Fig. 1A. Since it takes some iterations until the weights of connections that do not contribute to a reduction of the error are driven to 0, the number of newly established connections in layer is small for all layers initially. After this initial transient, the number of newly activated connections stabilized to a value that is proportional to the total number of potential connections in the layer (Fig. 1B). DEEP R continued to rewire connections even late in the training process.

## Appendix C Details to: Convergence properties of soft-DEEP R

Here we provide additional details on the convergence properties of the soft-DEEP R parameter update provided in Algorithm 2. We reiterate here Eq. (2):

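$$d\theta_k = \beta \left.\frac{\partial}{\partial\theta_k}\log p^*(\theta|X,Y^*)\right|_{\theta^t} dt + \sqrt{2\beta T}\, dW_k. \tag{5}$$

Discrete time updates can be recovered from the set of SDEs (5) by integration over a short time period $\Delta t$:

$$\Delta\theta_k = \eta \frac{\partial}{\partial\theta_k}\log p^*(\theta|X,Y^*) + \sqrt{2\eta T}\,\nu_k, \tag{6}$$

where the learning rate is given by $\eta = \beta\,\Delta t$.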
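The discrete update of Eq. (6) can be tried out on a toy problem. The sketch below is an illustration under stated assumptions, not part of the original experiments: the "posterior" is a one-dimensional Gaussian $N(0, \sigma^2)$, for which the tempered distribution (the posterior raised to the power $1/T$) is again Gaussian with variance $T\sigma^2$. The helper names are hypothetical.

```python
import numpy as np

def langevin_step(theta, grad_log_post, eta, temperature, rng):
    """One discrete update: gradient of the log posterior plus scaled Gaussian noise."""
    noise = rng.standard_normal(theta.shape)
    return theta + eta * grad_log_post(theta) + np.sqrt(2.0 * eta * temperature) * noise

def grad_log_post(theta, sigma=1.0):
    """Gradient of log N(0, sigma^2), the toy posterior assumed in this sketch."""
    return -theta / sigma**2

rng = np.random.default_rng(0)
temperature, eta = 0.5, 0.01

theta = np.zeros(20000)  # 20000 independent chains, all started at 0
for _ in range(2000):
    theta = langevin_step(theta, grad_log_post, eta, temperature, rng)

# Expected to lie close to temperature * sigma^2, up to discretization and sampling error.
empirical_var = theta.var()
```

Lowering the temperature concentrates the samples around the mode of the posterior, while a higher temperature spreads them out, which is the behavior the tempered target distribution is meant to capture.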

We prove that the stochastic parameter dynamics Eq. (5) converges to the target distribution given in Eq. (3). The proof is analogous to the derivation given in Kappel et al. (2015, 2017). We reiterate the proof here for the special case of supervised learning. The fundamental property of the synaptic sampling dynamics Eq. (5) is formalized in Theorem 1 and proven below. Before we state the theorem, we briefly discuss its statement in simple terms. Consider some initial parameter setting . Over time, the parameters change according to the dynamics (5). Since the dynamics include a noise term, the exact value of the parameters at some time cannot be determined. However, it is possible to describe the exact distribution of parameters for each time . We denote this distribution by , where the “FP” subscript stands for “Fokker-Planck” since the evolution of this distribution is described by the Fokker-Planck equation (7) given below. Note that we make the dependence of this distribution on time explicit in this notation. It can be shown that for the dynamics (7), converges to a well-defined and unique stationary distribution in the limit of large . To prove the convergence to the stationary distribution we show that it is kept invariant by the set of SDEs Eq. (5) and that it can be reached from any initial condition.

We now state Theorem 1 formally. To simplify notation we drop in the following the explicit time dependence of the parameters .

###### Theorem 1.

Let be a strictly positive, continuous probability distribution over parameters , twice continuously differentiable with respect to , and let . Then the set of stochastic differential equations Eq. (5) leaves the distribution (3) invariant. Furthermore, is the unique stationary distribution of the sampling dynamics.

###### Proof.

The stochastic differential equation Eq. (5) translates into a Fokker-Planck equation (Gardiner, 2004) that describes the evolution of the distribution over parameters

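$$\frac{\partial}{\partial t} p_{FP}(\theta,t) = \sum_k -\frac{\partial}{\partial\theta_k}\left(\left(\beta\frac{\partial}{\partial\theta_k}\log p^*(\theta|X,Y^*)\right) p_{FP}(\theta,t)\right) + \frac{\partial^2}{\partial\theta_k^2}\left(\beta T\, p_{FP}(\theta,t)\right), \tag{7}$$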

where denotes the distribution over network parameters at time . To show that leaves the distribution invariant, we have to show that (i.e., does not change) if we set to . Plugging in the presumed stationary distribution for on the right hand side of Eq. (7), one obtains

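$$\begin{aligned}
\frac{\partial}{\partial t} p_{FP}(\theta,t) &= \sum_k -\frac{\partial}{\partial\theta_k}\left(\beta\frac{\partial}{\partial\theta_k}\log p^*(\theta|X,Y^*)\, p^*(\theta)\right) + \frac{\partial^2}{\partial\theta_k^2}\left(\beta T\, p^*(\theta)\right) \\
&= \sum_k -\frac{\partial}{\partial\theta_k}\left(\beta\, p^*(\theta)\frac{\partial}{\partial\theta_k}\log p^*(\theta|X,Y^*)\right) + \frac{\partial}{\partial\theta_k}\left(\beta T\frac{\partial}{\partial\theta_k} p^*(\theta)\right) \\
&= \sum_k -\frac{\partial}{\partial\theta_k}\left(\beta\, p^*(\theta)\frac{\partial}{\partial\theta_k}\log p^*(\theta|X,Y^*)\right) + \frac{\partial}{\partial\theta_k}\left(\beta T\, p^*(\theta)\frac{\partial}{\partial\theta_k}\log p^*(\theta)\right),
\end{aligned}$$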

which by inserting $p^*(\theta) = \frac{1}{Z}\, p^*(\theta|X,Y^*)^{\frac{1}{T}}$, with normalizing constant $Z$, becomes

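$$\begin{aligned}
\frac{\partial}{\partial t} p_{FP}(\theta,t) &= \frac{1}{Z}\sum_k -\frac{\partial}{\partial\theta_k}\left(\beta\, p^*(\theta)\frac{\partial}{\partial\theta_k}\log p^*(\theta|X,Y^*)\right) \\
&\quad + \frac{\partial}{\partial\theta_k}\left(\beta T\, p^*(\theta)\frac{1}{T}\frac{\partial}{\partial\theta_k}\log p^*(\theta|X,Y^*)\right) = \sum_k 0 = 0.
\end{aligned}$$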

This proves that is a stationary distribution of the parameter sampling dynamics Eq. (5). Since is positive by construction, the Markov process of the SDEs (5) is ergodic and the stationary distribution is unique (see Section 5.3.3. and 3.7.2 in Gardiner (2004)).

The unique stationary distribution of Eq. (7) is given by , i.e., is the only solution for which becomes , which completes the proof. ∎

The updates of the soft-DEEP R algorithm (Algorithm 2) can be written as

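$$\Delta\theta_k = \begin{cases} \sqrt{2T\eta}\,\nu_k & \text{if } \theta_k < 0 \text{ (dormant connection)} \\ -\eta\frac{\partial}{\partial\theta_k}E_{X,Y^*}(\theta) - \eta\alpha + \sqrt{2T\eta}\,\nu_k & \text{otherwise.} \end{cases} \tag{8}$$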

Eq. (8) is a special case of the general discrete parameter dynamics (6). To see this we apply Bayes’ rule to expand the derivative of the log posterior into the sum of the derivatives of the prior and the likelihood:

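$$\frac{\partial}{\partial\theta_k}\log p^*(\theta|X,Y^*) = \frac{\partial}{\partial\theta_k}\log p_S(\theta) + \frac{\partial}{\partial\theta_k}\log p_N(Y^*|X,\theta),$$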

such that we can rewrite Eq. (6)

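$$\Delta\theta_k = \eta\left(\frac{\partial}{\partial\theta_k}\log p_S(\theta) + \frac{\partial}{\partial\theta_k}\log p_N(Y^*|X,\theta)\right) + \sqrt{2\eta T}\,\nu_k. \tag{9}$$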

To include automatic network rewiring in our deep learning model we adopt the approach described in Kappel et al. (2015). Instead of using the network parameters directly to determine the synaptic weights of network , we apply a nonlinear transformation to each connection , given by the function

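$$w_k = f(\theta_k) = s_k \frac{1}{\gamma}\log\left(1+\exp(\gamma s_k\theta_k)\right), \tag{10}$$

where $s_k$ is a parameter that determines the sign of the connection weight and $\gamma$ is a constant parameter that determines the smoothness of the mapping. In the limit of large $\gamma$, Eq. (10) converges to the rectified linear function

$$w_k = f(\theta_k) = \begin{cases} 0 & \text{if } \theta_k < 0 \text{ (dormant connection)} \\ s_k\theta_k & \text{else (active connection),} \end{cases} \tag{11}$$

such that all connections with $\theta_k < 0$ are not functional.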

Using this, the gradient of the log-likelihood function in Eq. (9) can be written as which for our choice of , Eqs. (10), becomes

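$$\frac{\partial}{\partial\theta_k}\log p_N(Y^*|X,\theta) = -\sigma(\gamma s_k\theta_k)\, s_k \frac{\partial}{\partial\theta_k}E_{X,Y^*}(\theta), \tag{12}$$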

where $\sigma$ denotes the sigmoid function. The error gradient $\frac{\partial}{\partial\theta_k}E_{X,Y^*}(\theta)$ can be computed using standard error backpropagation (Neal, 1992; Rumelhart et al., 1985).

Theorem 1 requires that Eq. (12) is twice differentiable, which is true for any finite value of $\gamma$. In our simulations we used the limiting case of large $\gamma$ such that dormant connections are actually mapped to zero weight. In this limit, one approaches the simple expression

 (13)

Thus, the gradient (13) vanishes for dormant connections ($\theta_k < 0$). Therefore changes of dormant connections are independent of the error gradient.

This leads to the parameter updates of the soft-DEEP R algorithm given by Eq. (8). The term $\sqrt{2T\eta}\,\nu_k$ results from the diffusion term $\sqrt{2\beta T}\,dW_k$ integrated over the time step $\Delta t$, where $\nu_k$ is a Gaussian random variable with zero mean and unit variance. The term $-\eta\alpha$ results from the exponential prior distribution $p_S(\theta)$ (the ℓ1-regularization). Note that this prior is not differentiable at 0. In (8) we approximate the gradient by assuming it to be zero at 0 and below. Thus, parameters on the negative axis are only driven by a random walk and parameter values might therefore diverge to $-\infty$. To fix this problem we introduced a reflecting boundary at (parameters were clipped at this value). Another potential solution would be to use a different prior distribution that also affects the negative axis; however, we found that Eq. (8) produces very good results in practice.
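Eq. (8), the clipping of dormant parameters and the rectified mapping of Eq. (11) can be sketched together as follows. This is a minimal NumPy illustration, not the released implementation: `theta_min` is a hypothetical name for the (negative) clipping value, `grad_E` stands for an error gradient obtained elsewhere by backpropagation, and all numerical values are placeholders.

```python
import numpy as np

def soft_deep_r_step(theta, grad_E, eta, alpha, temperature, theta_min, rng):
    """One update of all parameters following Eq. (8), then clipping at theta_min."""
    noise = np.sqrt(2.0 * temperature * eta) * rng.standard_normal(theta.shape)
    active = theta >= 0
    # Dormant connections (theta < 0) receive no gradient and no prior term.
    drift = np.where(active, -eta * grad_E - eta * alpha, 0.0)
    return np.maximum(theta + drift + noise, theta_min)

def weights_from_parameters(theta, sign):
    """Rectified mapping of Eq. (11): dormant connections get zero weight."""
    return sign * np.maximum(theta, 0.0)

rng = np.random.default_rng(0)
theta = rng.uniform(-1.0, 0.2, size=1000)      # mostly dormant parameters (placeholder init)
sign = rng.choice([-1.0, 1.0], size=1000)      # fixed sign of each potential connection
grad_E = rng.standard_normal(1000)             # stand-in for a backpropagated error gradient

for _ in range(100):
    theta = soft_deep_r_step(theta, grad_E, eta=0.05, alpha=0.1,
                             temperature=1e-3, theta_min=-1.0, rng=rng)

w = weights_from_parameters(theta, sign)
connectivity = np.mean(theta >= 0)             # fraction of active connections
```

In this variant the number of active connections is not fixed: dormant parameters perform a random walk and become active again whenever they cross zero, so the connectivity emerges from the balance between noise and regularization.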

## Appendix D Analysis of convergence of the DEEP R algorithm

Here we provide additional details to the convergence properties of the DEEP R algorithm. To do so we formulate the algorithm in terms of a Markov chain that evolves the parameters and the connectivity constraints (listed in Algorithm 3). Each application of the Markov transition operators corresponds to one iteration of the DEEP R algorithm. We show that the distribution of parameters and network connectivities over the iterations of DEEP R converges to the stationary distribution Eq. (4) that jointly realizes parameter vectors and admissible connectivity constraints.

Each iteration of DEEP R corresponds to two update steps, which we formally describe in Algorithm 3 using the Markov transition operators and and the binary constraint vector over all connections of the network with elements , where represents an active connection . is a constraint on the dynamics, i.e., all connections for which have to be dormant in the evolution of the parameters. The transition operators are conditional probability distributions from which in each iteration new samples for and are drawn for given previous values and .

1. Parameter update: The transition operator updates all parameters for which (active connections) and leaves the parameters at their current value for (dormant connections). The update of active connections is realized by advancing the SDE (2) for an arbitrary time step (line 3 of Algorithm 3).

2. Connectivity update: for all parameters that are dormant, set and randomly select an element which is currently 0 and set it to 1. This corresponds to line 3 of Algorithm 3 and is realized by drawing a new from .
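The two update steps above can be sketched as a single function. This is a hedged NumPy illustration rather than the released implementation: the boolean array `active` plays the role of the constraint vector, `grad_E` stands for a backpropagated error gradient, and setting the parameter of a newly activated connection to 0 is an assumption of this sketch.

```python
import numpy as np

def deep_r_step(theta, active, grad_E, eta, alpha, temperature, rng):
    """One DEEP R iteration: parameter update, then connectivity update."""
    theta, active = theta.copy(), active.copy()
    n_target = int(active.sum())  # number of active connections to maintain
    noise = np.sqrt(2.0 * temperature * eta) * rng.standard_normal(theta.shape)

    # 1. Parameter update, applied to active connections only.
    theta[active] += (-eta * grad_E - eta * alpha + noise)[active]
    # Active connections whose parameter became negative turn dormant.
    active &= theta >= 0

    # 2. Connectivity update: activate as many dormant connections as were lost.
    n_missing = n_target - int(active.sum())
    if n_missing > 0:
        chosen = rng.choice(np.flatnonzero(~active), size=n_missing, replace=False)
        active[chosen] = True
        theta[chosen] = 0.0  # assumption: new connections start at zero
    return theta, active

rng = np.random.default_rng(0)
n_potential, n_active = 100, 10
active = np.zeros(n_potential, dtype=bool)
active[rng.choice(n_potential, size=n_active, replace=False)] = True
theta = np.where(active, 0.1, -1.0)  # placeholder initial values

for _ in range(50):
    grad_E = rng.standard_normal(n_potential)  # stand-in for a real error gradient
    theta, active = deep_r_step(theta, active, grad_E, eta=0.05,
                                alpha=0.1, temperature=1e-3, rng=rng)
```

The count of active connections is restored at the end of every iteration, which is how the hard bound on the total number of connections is respected throughout training.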

The constraint imposed by on is formalized through the deterministic binary function which is if the parameters are compatible with the constraint vector and otherwise. This is expressed as (with denoting the Boolean implication):

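$$C(\theta,c) = \begin{cases} 1 & \text{if for all } k,\ 1 \le k \le K:\ c_k = 0 \Rightarrow \theta_k < 0 \\ 0 & \text{else.} \end{cases} \tag{14}$$

The constraint is fulfilled if all connections with $c_k = 0$ are dormant ($\theta_k < 0$).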

Note that the transition operator depends only on the parameter vector . It samples a new with uniform probability among the constraint vectors that are compatible with the current set of parameters . We write the number of possible vectors that are compatible with as , given by the binomial coefficient (the number of possible selections that fulfill the constraint of new active connections)

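$$\mu(\theta) = \sum_{c\in\chi} C(\theta,c) = \binom{M - |\theta \ge 0|}{K - |\theta \ge 0|}, \quad \text{with } \chi = \left\{\xi \in \{0,1\}^M \,\middle|\, |\xi| = K\right\}, \tag{15}$$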

where denotes the number of non-zero elements in and is the set of all binary vectors with exactly elements of value . Using this we can define the operator as:

 (16)

where denotes the vectorized Kronecker delta function, with and 0 else. Note that Eq. (16) assigns non-zero probability only to vectors that are zero for elements for which is true (assured by the second term). In addition vectors have to fulfill . Therefore, sampling from this operator randomly introduces new connections for the number of missing ones in . This process models the connectivity update of Algorithm 3.
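The count in Eq. (15) and the uniform sampling of a compatible constraint vector can be checked by brute force on a tiny example. The sketch below is illustrative only; the function names and the example values are hypothetical.

```python
from itertools import combinations
from math import comb

import numpy as np

def is_compatible(theta, c):
    """C(theta, c) of Eq. (14): every connection with c_k = 0 must have theta_k < 0."""
    return bool(np.all((c == 1) | (theta < 0)))

def count_compatible(theta, K):
    """Enumerate all binary vectors with exactly K ones and count the compatible ones."""
    M = len(theta)
    count = 0
    for ones in combinations(range(M), K):
        c = np.zeros(M, dtype=int)
        c[list(ones)] = 1
        count += is_compatible(theta, c)
    return count

def sample_constraint(theta, K, rng):
    """Uniform draw among compatible vectors: keep non-negative entries, fill up at random."""
    c = (theta >= 0).astype(int)
    dormant = np.flatnonzero(c == 0)
    c[rng.choice(dormant, size=K - int(c.sum()), replace=False)] = 1
    return c

theta = np.array([0.4, -0.1, 0.2, -0.3, -0.7, -0.2])  # M = 6, two non-negative entries
K = 4
n_nonneg = int(np.sum(theta >= 0))

mu_enumerated = count_compatible(theta, K)
mu_closed_form = comb(len(theta) - n_nonneg, K - n_nonneg)
c_new = sample_constraint(theta, K, np.random.default_rng(0))
```

Both counts agree because a compatible vector must contain every non-negative entry, leaving only the choice of the remaining active positions among the dormant ones.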

The transition operator in Eq. (34) evolves the parameter vector under the constraint , i.e., it produces parameters confined to the connectivity constraint. By construction this operator has a stationary distribution that is given by the following Lemma.

###### Lemma 1.

Let be the transition operator of the Markov chain over which is defined, as the integration of the SDE written in Eq. (2) over an interval for active connections (), and as the identity for the remaining dormant connections (). Then it leaves the following distribution invariant

$$p^*(\theta|c) = \frac{1}{p^*(\theta_{\notin c} < 0)}\, p^*(\theta)\, C(\theta,c), \tag{17}$$

where denotes the truncation of the vector to the active connections (), thus is the probability that all connections outside of are dormant according to the posterior, and is the posterior (see Theorem 1).

The proof is divided into two sub proofs. First we show that the distribution defined as with a normalization constant, is left invariant by , second we will show that this normalization constant has to be equal to . In coherence with the notation we will use verbally that is an element of if .

###### Proof.

To show that the distribution defined as is left invariant, we will show directly that . To do so we will show that both and factorizes in terms that depend only on or on and thus we will be able to separate the integral over as the product of two simpler integrals.

We first study the distribution . Before factorizing, one has to notice a strong property of this distribution. Let’s partition the tempered posterior distribution over the cases when the constraint is satisfied or not

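$$\begin{aligned}
p^*(\theta'|c) &= \frac{1}{L(c)}\, p^*(\theta')\, C(c,\theta) && (18) \\
&= \frac{1}{L(c)} \left[ p^*(\theta', C(c,\theta)=1) + p^*(\theta', C(c,\theta)=0) \right] C(c,\theta) && (19)
\end{aligned}$$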

when we multiply individually the first and the second term with , can be replaced by its binary value and the second term is always null. It remains that

$$p^*(\theta'|c) = \frac{1}{L(c)}\, p^*(\theta', C(c,\theta)=1) \tag{20}$$

seeing that one can rewrite the condition as the condition on the sign of the random variable (note that in this inequality is a deterministic constant and is a random variable)

$$p^*(\theta'|c) = \frac{1}{L(c)}\, p^*(\theta', \theta'_{\notin c} < 0) \tag{21}$$

We can factorize the conditioned posterior as . But when the dormant parameters are negative , the active parameters do not depend on the actual value of the dormant parameters , so we can simplify the conditions of the first factor further to obtain

$$p^*(\theta'|c) = \frac{1}{L(c)}\, p^*(\theta'_{\in c} \mid \theta'_{\notin c} < 0)\, p^*(\theta'_{\notin c}, \theta'_{\notin c} < 0) \tag{22}$$

We now study the operator . It factorizes similarly because it is built out of two independent operations: one that integrates the SDE over the active connections and one that applies identity to the dormant ones. Moreover all the terms in the SDE which evolve the active parameters are independent of the dormant ones as long as we know they are dormant. Thus, the operator splits in two

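$$T_\theta(\theta|\theta',c) = T_\theta(\theta_{\in c}|\theta'_{\in c},c)\; T_\theta(\theta_{\notin c}|\theta'_{\notin c},c) \tag{23}$$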

To finally separate the integration over as a product of two integrals we need to make sure that all the factor depend only on the variable or only on . This might not seem obvious but even the conditioned probability is a function of because in the conditioning , refers to the random variable and not to a specific value over which we integrate. As a result the double integral is equal to the product of the two integrals

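$$\begin{aligned}
\int_{\theta'} T_\theta(\theta|\theta',c)\, p^*(\theta'|c)\, d\theta' = \frac{1}{L(c)} &\int_{\theta'_{\in c}} T_\theta(\theta_{\in c}|\theta'_{\in c},c)\, p^*(\theta'_{\in c} \mid \theta'_{\notin c} < 0)\, d\theta'_{\in c} \\
&\int_{\theta'_{\notin c}} T_\theta(\theta_{\notin c}|\theta'_{\notin c},c)\, p^*(\theta'_{\notin c}, \theta'_{\notin c} < 0)\, d\theta'_{\notin c}
\end{aligned} \tag{25}$$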

We can now study the two integrals separately. The second integral over the parameters is simpler because by construction the operator is the identity

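$$\int_{\theta'_{\notin c}} T_\theta(\theta_{\notin c}|\theta'_{\notin c},c)\, p^*(\theta'_{\notin c}, \theta'_{\notin c} < 0)\, d\theta'_{\notin c} = p^*(\theta_{\notin c}, \theta_{\notin c} < 0) \tag{26}$$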

There is more to say about the first integral over the active connections . The operator integrates over the active parameters the same SDE as before with the difference that the network is reduced to a sparse architecture where only the parameters are active. We want to find the relationship between the stationary distribution of this new operator and that is written in the integral which is defined in equation (3) as the tempered posterior of the dense network. In fact, the tempered posterior of the dense network marginalized and conditioned over the dormant connections is equal to the stationary distribution of (i.e. of the SDE in the sparse network). To prove this, we detail in the following paragraph that the drift in the SDE evolving the sparse network is given by the log-posterior of the dense network condition on and using Theorem 1, we will conclude that is the stationary distribution of .

We write the prior and the likelihood of the sparse network as function of the prior and the likelihood with of the dense network. The likelihood in the sparse network is defined as previously with the exception that the dormant connections are given zero-weight so it is equal to . The difference between the prior that defines soft-DEEP R and the prior of DEEP R remains in the presence of the constraint. When considering the sparse network defined by the constraint is satisfied and the prior of soft-DEEP R marginalized over the dormant connections is the prior of the sparse network with defined as before. As this prior is connection-specific ( independent of ), this implies that is independent of the dormant connection, and the prior is equal to . Thus, we can write the posterior of the sparse network which is by definition proportional to the product . Looking back to the definition of the posterior of the dense network this product is actually proportional to posterior of the dense network conditioned on the negativity of dormant connections . The posterior of the sparse network is therefore proportional to the conditioned posterior of the dense network but as they both normalize to they are actually equal. Writing down the new SDE, the diffusion term remains unchanged, and the drift term is given by the gradient of the log-posterior . Applying Theorem 1 to this new SDE, we now confirm that the tempered and conditioned posterior of the dense network is left invariant by the SDE evolving the sparse network. As is the integration for a given of this SDE, it also leaves invariant. This yields

 ∫θ′∈cTθ<