I Introduction
Training deep neural networks requires both powerful hardware and a significant amount of time. Long training times are a major bottleneck in deep learning research, as researchers typically design and test new architectures for a specific problem iteratively. While a lot of research has been dedicated to accelerating inference, we investigate training because (1) accelerating training can speed up research iteration, (2) evolutionary algorithms for DNN architecture exploration are increasingly being used as an alternative to domain expertise [DBLP:journals/corr/abs171109846], and (3) network training is moving to edge devices [DBLP:journals/corr/abs190604312]. Unfortunately, the memory requirements of dense, convolutional, and recurrent layers grow quadratically with layer information bandwidth, which we define as the number of independent signals passing through that layer. In other words, doubling the number of layer inputs and outputs quadruples the size of the layer. This causes the majority of networks to be memory-bound, making DNN training impractical without batching, a method where training is performed on multiple inputs at a time and updates are aggregated per batch. While batching alleviates the pressure on DRAM bandwidth, it can decrease model accuracy [small_batch_size], especially when scaling training on large clusters [imagenet_15_minutes]. Furthermore, for larger models stored in off-chip memory, memory accesses become the dominant energy cost [Han2015DeepCC], complicating online training on battery-powered devices.
Conventional dense and convolutional layers do not allow the user to tune layer size independently of the number of layer inputs and outputs. In this work, we seek a method to decouple the information bandwidth from layer expressivity. Such a method would allow us to (1) speed up training by storing networks in on-chip memory, (2) remove the memory bottleneck and the need for batching, (3) train more efficiently on distributed systems, and (4) reduce the energy consumption due to the excessive compute and storage requirements of modern DNNs, potentially allowing us to move training to edge devices. Several works have proposed a priori structured sparsity [DBLP:journals/corr/abs171108757, closnets] or weight sharing [circnn] to allow training simpler but ‘wider’ models.
A priori sparsity, where the sparse network topology is selected before training has started, is a promising approach that allows the user to finely and transparently tune the ratio of information bandwidth to memory requirements. If the topology is structured, efficient software or hardware implementations can be built to accelerate processing with dense network performance. However, before custom architectures or low-level kernels can be built, a general theory of why certain topologies perform well or underperform is needed. To the best of our knowledge, no work yet tackles the question of the existence of a ‘best’ topology for sparse neural network training. This paper provides an answer on how a topology should be selected.
Our contributions are as follows:

We propose a sparse cascade architecture that can replace dense or convolutional layers without affecting the rest of the network architecture.

We develop a sparse neural network initialization scheme that allows us to train very deep sparse networks without suffering from the vanishing gradient effect.

We evaluate several topologies on a matrix reconstruction task and show that the choice of topology has a strong effect on attainable network accuracy.

In order to evaluate topologies independently of a dataset, we develop a data-free heuristic for predicting the expressiveness of a given sparse network.

From the heuristic, we derive requirements that make a good topology, and settle on a single family of sparse networks.
II Related Work
We classify methods that arrive at a sparse network into those that enforce sparsity before, during, or after training. The following is a brief description of each class.
Enforcing sparsity after training: In this class of methods, certain weights are zeroed out after training has finished. This approach has the benefit of first discovering the baseline model accuracy, allowing the training mechanism to evaluate the accuracy-to-size tradeoff. Since training is performed using the dense model, only inference can benefit from these post-training pruning methods. Among the early pruning methods are Optimal Brain Damage [NIPS1989_250] and Optimal Brain Surgeon [NIPS1992_647], where the authors remove weights based on the second derivative of the loss w.r.t. each weight. The insight here is that removing a weight causes some perturbation in the network, and by picking weights with the smallest second derivative of the loss, the effect of the perturbation on the network functionality is minimized. DeepCompression [Han2015DeepCC, DBLP:journals/corr/HanPTD15] uses a similar approach, but replaces the Hessian-based metric with weight magnitudes. The authors show that high (>95%) sparsity can be achieved as long as networks are fine-tuned after pruning to restore performance. Alternatively, in [liu2015sparse] the authors decompose convolutional layers into a set of per-channel basis kernels, which are applied to input feature maps, and a sparse kernel matrix that mixes outputs of the basis kernels into the output feature maps. However, all of these methods lead to unstructured sparsity that is difficult to take advantage of. Structured sparsity, where some assumptions can be made on the structure of sparse kernels, has been explored as a way to improve the execution efficiency of sparse structures on GPUs and CPUs. In [DBLP:journals/corr/AnwarHS15], the authors use particle filters to prune whole channels or kernels. Similarly, [Kadetotad] explore structured intra-layer sparsity, where instead of individual weights, small blocks of weights are pruned.
Enforcing sparsity during training: Instead of pruning after training, pruning can also be applied during training. This has the benefit of potentially reducing the computational load of the training phase. However, the device performing training must still be able to store the whole dense model at the beginning of training. L1 regularization or L1 weight decay is known to cause sparsity during training since, unlike in the case of L2 regularization, all weights are equally incentivized to approach zero. However, L1 weight decay often causes a decrease in accuracy, and the sparsity is unstructured. In [DBLP:journals/corr/WenWWCL16], the authors use Group Lasso [group_lasso] regularization to enforce the sparsity of more coarse-grained structures instead of individual weights.
Enforcing sparsity before training: Model size can also be reduced before training has started. We focus on layer-level methods and not architecture-level approaches, as they are orthogonal. The majority of works reducing the size of layers before training have focused either on a priori sparsity or weight reuse. On the weight reuse side, HashedNets [hashnets] use a hash function to group multiple weights and have them share and train a single value. CirCNN [circnn] uses block-circulant matrices for storing weights, where elements are shared in a predictable manner and Fourier transforms are used for inference, reducing the computational complexity of both inference and training.
On the a priori sparsity side, several topologies have been proposed in the literature. Deep Expander Networks (X-Nets) [DBLP:journals/corr/abs171108757] replace dense layers with sparse layers with the expander graph topology. The authors give guarantees of each input neuron being connected to each output neuron within a logarithmic number of layers. Similarly, RadiX-Nets [radixnets] build on X-Nets but use the radix topology instead of graph expanders. Alternatively, ClosNets [closnets] replace a single dense layer with a cascade of three sparse layers with the Clos topology. The Clos topology guarantees full connectivity, and has a tunable parameter for the path diversity between all inputs and outputs. While deep expander networks grow in depth with the number of neurons per layer, ClosNets grow in width.
None of the above a priori sparse network works give a definitive answer to which topology maximizes performance per weight. In this work we aim to answer that question.
III Approach
The number of parameters in a neural network layer is determined by the number of input and output neurons (in the case of fully-connected networks), or the number of input and output channels (in the case of convolutional networks). This prevents decoupling the network bandwidth (i.e., the number of inputs or outputs of a certain layer) from the parameter count. We propose that sparsifying layers can allow users to train wider networks without the quadratic growth in network size. Our approach is to replace each fully-connected layer with a cascade of sparsely-connected layers, as shown in Figure 1. The cascade topology and depth are selected at design time so that the number of parameters in the cascade is lower than in the original dense layer. Hidden layer neurons in the cascade have a linear activation function, while the output neurons use the activation of the original network. The cascade needs to have certain properties such as connectivity between all cascade input-output pairs, a small parameter count, and hardware efficiency. In Section VII we explore the topology requirements further.
Overall, a network layer is decomposed into a cascade of layers with sparsity . For a dense layer, the forward pass complexity of a single input vector is . For the sparse cascade, the same complexity is . As long as , the sparse cascade has fewer parameters than the original dense layer, and the input can be processed more efficiently.
Similarly, a priori pruning can be applied to convolutional networks. In conventional CNN layers, each input channel is connected to each output channel by a convolution. For a filter of size , input and output channels of size , the convolutional layer has parameters, and uses operations per input. Since the number of input and output channels and directly control both the information bandwidth and the size of the network, we propose to disentangle the number of input/output features and the number of convolutional filters. We adopt the architecture of MobileNets [Howard2017] and break up convolutional layers into depthwise and pointwise convolutions, as seen in Figure 1(a). In the original MobileNets, the majority of parameters belong to the pointwise convolutions, which are simply dense neural networks applied to each ‘pixel’ individually. We propose to prune only the pointwise convolutions in the same manner we prune dense layers.
The depthwise convolution applies individual spatial convolutions (4 in Figure 1(a)) to each of the input channels. The intermediate layer now has feature maps. The pointwise convolution now applies layers of sparse convolutions that perform a superposition of feature maps, and results in output maps. The first stage requires parameters (notice that does not factor here), and operations. The second stage applies layers of sparse convolutions, which can be interpreted as scaling and adding up specific sets of channels to form the final output channels. The second stage uses parameters and operations, where is the sparsity of convolutions. Overall, the original convolutional layer requires parameters, while the proposed a priori sparse convolutional layer requires parameters.
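The parameter savings of replacing a dense layer with a sparse cascade can be illustrated with a short counting sketch. It assumes that sparsity denotes the fraction of weights removed from each layer and that all cascade layers are square; the function names and the example sizes are hypothetical and chosen only for illustration.

```python
def dense_params(n_in: int, n_out: int) -> int:
    """Number of weights in a fully-connected layer (biases ignored)."""
    return n_in * n_out


def cascade_params(width: int, num_layers: int, sparsity: float) -> int:
    """Number of weights in a cascade of square sparse layers.

    Assumption: `sparsity` is the fraction of weights removed from each
    width x width layer, so every layer keeps width * width * (1 - sparsity).
    """
    kept_per_layer = round(width * width * (1.0 - sparsity))
    return num_layers * kept_per_layer


if __name__ == "__main__":
    width = 1024
    print(dense_params(width, width))         # 1048576
    print(cascade_params(width, 3, 15 / 16))  # 196608
```

Under these assumptions, the cascade holds fewer weights than the dense layer whenever the number of cascade layers multiplied by the per-layer density is below one.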
IV Initializing A Priori Sparse Neural Networks
Initialization strongly affects the training accuracy of deep neural networks, but is often not given as much attention as network architecture. If a network is suboptimally initialized, it may train slower than a correctly initialized one or may not train at all. This problem has been extensively researched in the context of deep neural networks [pmlrv9glorot10a, DBLP:journals/corr/HeZR015] [orthogonal_rnns]. Our proposed approach replaces fully-connected layers with cascades of sparse layers. As such, the new network may be several times deeper than the original one, and may not train as efficiently. This problem is further compounded by the fact that our networks are a priori sparse. Our tests show that deep sparse networks initialized with common initialization schemes like Xavier initialization [pmlrv9glorot10a] completely fail to learn. By observing the activation and error values of deep sparse networks, we deduce that these networks suffer from the vanishing gradient problem, i.e., with each successive layer, the variance of both the activations and the errors drops exponentially. In this section, we develop a new initialization scheme for a priori sparse networks that alleviates the vanishing gradient problem by taking layer sparsity into account.
IV-A Sparse Xavier initialization
In Appendix A, we briefly cover the original derivation of the Xavier initialization. Here we generalize it to apply to sparse networks as well. We construct a sparse layer from a matrix by multiplying it elementwise with the mask with sparsity . For a random topology, each element of the mask is set as , where is the Bernoulli distribution. For a layer with input and output neurons, an output neuron’s activation variance depends on the variance of each input neuron, each weight connected to it, and the number of input neurons (Appendix Equation 21). For sparse networks, each output neuron is on average only connected to neurons, hence we update Equation 21 as:
(1) 
Updating the Xavier initialization to take sparsity into account, we write our sparse initialization as:
(2) 
We test the new initialization on the MNIST dataset with networks of different sparsities and depths (Figure 1(b)). We train randomly-connected networks with 256 neurons in the hidden layers, 1 to 20 hidden layers, and with sparsities between 0 and . Using the sparse Xavier initialization, we are able to train deep sparse networks. For very sparse networks (sparsity of 63/64 and higher), there often exists no path between certain inputs and outputs, limiting trainability. A better, non-random topology with the same number of parameters may, however, be able to train.
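The idea behind the sparse initialization can be sketched in a few lines of Python. This is a minimal illustration and not the exact scheme of Equation 2: it assumes that sparsity is the fraction of zeroed weights, that the mask is drawn from a Bernoulli distribution, and that the usual Xavier variance 2 / (n_in + n_out) is divided by the layer density to compensate for the reduced number of connections. The function name and its arguments are hypothetical.

```python
import math
import random


def sparse_xavier_init(n_in, n_out, sparsity, seed=0):
    """Return (weights, mask) for a randomly connected sparse layer.

    Assumption: each output neuron sees on average n_in * (1 - sparsity)
    inputs, so the Xavier variance is scaled up by 1 / (1 - sparsity).
    """
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must lie in [0, 1)")
    rng = random.Random(seed)
    density = 1.0 - sparsity
    variance = 2.0 / ((n_in + n_out) * density)
    # A uniform distribution on [-limit, limit] has variance limit**2 / 3.
    limit = math.sqrt(3.0 * variance)
    mask = [[1 if rng.random() < density else 0 for _ in range(n_in)]
            for _ in range(n_out)]
    weights = [[rng.uniform(-limit, limit) * m for m in row] for row in mask]
    return weights, mask


if __name__ == "__main__":
    weights, mask = sparse_xavier_init(256, 256, sparsity=15 / 16)
    kept = sum(sum(row) for row in mask)
    print(f"kept {kept} of {256 * 256} weights")
```

With sparsity set to 0 the sketch reduces to the standard uniform Xavier initialization.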
V Topology Exploration
The choice of topology has a strong impact on both the accuracy and the parallelizability of a sparse network. In this section we aim to (1) answer how two topologies can be compared, and (2) create a metric for evaluating a topology independently of a task, given that we can assume nothing about training data beforehand. We first devise a task that allows us to experimentally evaluate a topology. We choose a matrix reconstruction problem where an original matrix is reconstructed as a product of sparse matrices with adjacency matrices as:
(3) 
The matrix must be random, so that the network cannot exploit any regularities within it. The topology we seek should perform well independently of the matrix structure. Though topologies derived for a specific task may perform better on that task, the general topology should achieve the best results on average across all tasks. A number of loss functions can be used for this evaluation, but we restrict ourselves to L2 loss for now. To gain intuition into the problem, we use the reconstruction loss in Equation 3 on a number of common topologies.
Figure 3 illustrates the impact of topology choice on overall network accuracy. Our reasoning is that as topologies can underperform on an unseen task, there exists a ‘best’ topology, one that on average achieves optimal performance. In this and the following section we aim to find that topology.
Though we may arrive at an optimal topology purely through evolution, this approach has several issues: (1) evaluating a topology on random matrices is inherently noisy, and we would have to reconstruct many random matrices in order to compare the fitness of two different topologies, (2) the evolved topology is only a point solution, and we would have to rerun the process for a topology with a different number of inputs, outputs, parameters, or layers, and (3) the evolved topologies tell us nothing of the underlying reasons for why a certain topology is underperforming. Therefore, we aim to develop a heuristic that can accurately predict the quality of a topology, so that by analyzing the heuristic we can arrive at the root cause of why certain topologies are underperforming, and produce conclusions on how to construct better performing ones.
V-A L0 constraint satisfaction
We revisit Equation 3 and define the sparse decomposition as:
(4) 
From here we can write an individual element of matrix from Equation 4 by looking at paths between input and output as:
(5) 
where is the set of all paths from input neuron to output neuron . According to Equation 5, in order to satisfy , at least one edge in must be set to a specific value. Given an edge with weight that exists on some path between nodes and :
(6) 
Due to the difficulty of analyzing the quality of topologies using an L2 loss, one approach is to use L0 loss. Here, we task the matrix decomposition with perfectly reconstructing as many individual elements of as possible. The networks are still trained using SGD with L1 loss. After training converges, we can count the number of elements in where for some arbitrarily small . As cannot take multiple values, it can only satisfy a single constraint. Therefore, if is the number of edges in the network, and is the number of satisfiable constraints, we can say that . At best, the topology can solve as many constraints as it has edges. As we will show in Section VII, this is not possible to achieve without several modifications to the sparse network.
Equation 6 allows us to reframe the problem of finding the minimal L0 reconstruction loss as a bipartite maximum matching problem. First, we create a set of input-output constraints , and a set of weights . An edge is connected to all constraints where . The problem of finding the largest number of elements in that can be perfectly reconstructed by becomes a bipartite maximum matching problem, and can be solved in polynomial time, e.g., with the Hopcroft-Karp algorithm. However, we find that the maximum number of constraints matched with edges is not a good heuristic for the performance of a topology. Constraint satisfaction counting heuristics fail to account for topology fairness, and rate equally topologies that balance the satisfaction of all input-output pairs and topologies that greedily satisfy only a subset of input-output pairs.
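The matching formulation can be sketched as follows. The code is an illustration under stated assumptions rather than the implementation used in our experiments: the topology is given as one list of (source, destination) edges per layer, an edge is linked to every input-output pair that has a path crossing it, and the matching is found with the simple augmenting-path (Kuhn) algorithm instead of a faster method. All function names are hypothetical.

```python
from itertools import product


def edge_constraints(layers, n_inputs):
    """Map every edge to the (input, output) pairs whose paths cross it.

    layers[k] is a list of (src, dst) pairs between layer k and layer k + 1.
    """
    # Forward sweep: the set of cascade inputs that reach each node.
    reach_in = [{i: {i} for i in range(n_inputs)}]
    for edges in layers:
        nxt = {}
        for src, dst in edges:
            nxt.setdefault(dst, set()).update(reach_in[-1].get(src, set()))
        reach_in.append(nxt)
    # Backward sweep: the set of cascade outputs reachable from each node.
    reach_out = [{j: {j} for j in reach_in[-1]}]
    for edges in reversed(layers):
        prev = {}
        for src, dst in edges:
            prev.setdefault(src, set()).update(reach_out[0].get(dst, set()))
        reach_out.insert(0, prev)
    return {
        (layer, src, dst): set(product(reach_in[layer].get(src, set()),
                                       reach_out[layer + 1].get(dst, set())))
        for layer, edges in enumerate(layers)
        for src, dst in edges
    }


def max_matching(adjacency):
    """Size of a maximum matching between edges and constraints."""
    owner = {}  # constraint -> edge currently assigned to it

    def try_assign(edge, seen):
        for constraint in adjacency[edge]:
            if constraint in seen:
                continue
            seen.add(constraint)
            # Take a free constraint, or re-route its current owner.
            if constraint not in owner or try_assign(owner[constraint], seen):
                owner[constraint] = edge
                return True
        return False

    return sum(try_assign(edge, set()) for edge in adjacency)


if __name__ == "__main__":
    # Two-layer butterfly on 4 neurons: layer b links i to i and i XOR 2**b.
    butterfly = [[(i, j) for i in range(4) for j in (i, i ^ (1 << b))]
                 for b in range(2)]
    adjacency = edge_constraints(butterfly, n_inputs=4)
    print(len(adjacency), max_matching(adjacency))
```

For this small butterfly every edge can be matched to its own constraint, which shows the upper bound being reached; as noted above, the count alone says nothing about how evenly the matched constraints are spread over input-output pairs.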
VI Graph Controllability
In this section, we present a continuous method of evaluating the capability of a topology to reconstruct a certain graph. While averaging reconstruction results of many random matrices may provide an estimate of a topology’s quality, we develop a data-free approach that (1) evaluates a topology in a single pass, and (2) is not sensitive to the randomness of generated matrices.
VI-A Neurons need only learn input ratios
In this work we focus on networks that use ReLU activations between sparse cascades, and linear activations inside them. Both ReLU and linear functions are homogeneous, i.e., for constants . Take a neuron with activation and inputs with activations and weights . Activation can be written as:
(7) 
where is a homogeneous function. We can extract the magnitude and normalize the vector :
(8) 
Since the neurons are using homogeneous activation functions, we can shift the job of learning neuron magnitude to the next layer. Now the neuron is only tasked with learning the input ratios.
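The argument can be checked numerically with a small sketch. It assumes a ReLU neuron and uses the Euclidean norm as the extracted magnitude; the choice of norm and the helper names are assumptions made for illustration only.

```python
import math


def relu(x: float) -> float:
    return max(0.0, x)


def neuron(weights, inputs, activation=relu):
    """Activation of a single neuron with the given input weights."""
    return activation(sum(w * a for w, a in zip(weights, inputs)))


def split_magnitude(weights):
    """Return (magnitude, normalized weights) of a weight vector."""
    magnitude = math.sqrt(sum(w * w for w in weights))
    return magnitude, [w / magnitude for w in weights]


if __name__ == "__main__":
    weights = [0.5, -1.5, 2.0]
    inputs = [1.0, 0.2, 0.7]
    magnitude, ratios = split_magnitude(weights)
    original = neuron(weights, inputs)
    # Homogeneity: f(c * x) = c * f(x) for c >= 0, so the magnitude can be
    # applied after the activation, e.g. folded into the next layer's weight.
    rescaled = magnitude * neuron(ratios, inputs)
    print(math.isclose(original, rescaled))
```

The neuron itself only needs the normalized weights, i.e. the input ratios, while the scalar magnitude can be absorbed by whichever weight consumes the neuron’s output.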
In the previous section we have used L0 loss to measure the number of constraints a network can solve. Here we see that for constraints that exist when reconstructing a matrix , of those constraints are magnitude constraints, and are ratio constraints. In other words, a neuron with inputs has ratio and 1 magnitude constraint. In Appendix B we give a practical way of eliminating magnitude constraints and only measuring the number of ratio constraints.
VI-B Neuron control
We define controllability of a certain neuron w.r.t. inputs and as the ability of an optimizer to set the ratio of to at . We give an inductive definition on three graph primitives: (1) individual input neurons, (2) a neuron with two inputs and one output, and (3) a neuron with one input and two outputs, and show how controllability propagates through them. We show how any sparse neural network can be decomposed into these primitives and how control can be calculated on neurons with larger numbers of inputs and outputs.
Definition VI.1.
For a neuron connected to a set of inputs , we define the controllability of input relative to at as . If , the optimizer can set the ratio at neuron to any value, without impacting any other ratios already set in the network.
Lemma VI.1.
For inputs and any neuron , controllability is bounded as:
(9) 
This is understandable since the optimizer can only set the ratio of inputs and at neuron to a single value, hence controllability of to plus controllability of to cannot be greater than 1.
Lemma VI.2.
For input neurons , controllability of ratio at neuron is:
(10) 
This is obvious since the optimizer has no control over network inputs, and we will use this lemma as the base case of our induction.
Lemma VI.3.
For a neuron connected to a set of inputs , total controllability of is limited as:
(11) 
This lemma is a direct result of the limit on the number of ratios established in Section VI-A.
We now analyze two graph primitives that show how control propagates through the graph.
Theorem VI.4.
Control aggregation: For a neuron with two input neurons and connected with weights and , where and are the sets of inputs directly or indirectly connected to neurons and , the controllability is:
(12) 
where
(13) 
Intuitively, neuron inherits the controllabilities of inputs and , and if at least one weight is tunable, can additionally control the ratio between the two input neurons. This allows it to set an additional ratio between any of the inputs in . If the loss function is quadratic, instead of using this additional ratio to solve a single constraint, the optimizer may want to partially satisfy multiple constraints. Hence, we allow added controllability to have a value in . Notice that : (1) abides by Lemma VI.1, and (2) the optimizer can tune all individual values in . In Corollary C.0.1 we extend this theorem to the case where has multiple inputs.
Theorem VI.5.
Control fan-out: For a neuron and two output neurons and connected with constant connections such that , and controllabilities and abide by:
(14) 
In other words, neuron ’s control is split across outputs. The optimizer chooses how best to split this control, i.e., it does not have to be fair. In Appendix D, we show how graphs can be decomposed so that we can apply Theorems VI.4 and VI.5.
Corollary VI.5.1.
For a neuron with a set of output neurons connected with constant connections, output neuron controllabilities abide by:
(15) 
We can reframe Equation 15 using trainable ratios:
(16)  
(17) 
VI-C Training controllability
Finally, we decompose the sparse cascade’s topology so that we can apply Theorems VI.4 and VI.5, in order to write out the equations for the controllability of each output neuron with respect to each input neuron.
Take a cascade with layers. Each layer has sparsely connected neurons, with being the number of cascade inputs, and being the number of cascade outputs. We define the layer controllability tensor as a tensor where the element represents neuron ’s control over ratio . The added controllability tensor represents the controllability added by the tunable connections between layer and , as per Theorem VI.4. The element represents the additional control over provided by the edge between and . Each tensor abides by Equation 28. The ratio tensor represents the control split from Equation 16 with representing the portion of controllability passed on to .
We can propagate controllability through the network as:
(18) 
where the operation is defined as:
(19) 
The controllability tensor , added controllability tensor and ratio tensor still have to abide by constraints in Lemmas VI.1, VI.2, and Equations 13, 16.
If is the number of cascade inputs, we define the controllability loss as:
(20) 
i.e., if a certain neuron output has control over ratios, that neuron’s loss is 0. We can now minimize this loss to discover how much control each cascade output has over the cascade inputs.
VII Deriving Better Topologies
In this section, we analyze the results of the controllability heuristic and answer (1) how topologies can be improved on the matrix reconstruction task, and (2) if there exists a definitive answer to what topology performs the best.
VII-A The need for skip connections
Similar to skip connections used in ResNet networks [DBLP:journals/corr/HeZRS15], we propose that skip connections can significantly improve the performance of sparse cascades, though for a different reason. As mentioned in Section VI-A, if a network uses homogeneous activation functions, each neuron only needs to learn the ratio of inputs at that neuron, as the job of learning the magnitude can be shifted to the layer above. Since the number of learnable ratios at a neuron is one less than the number of inputs of the same neuron, that neuron does not need to have all of its connections trainable. One of the connections can have a constant value of 1, and the network performance will not be impacted. This effect is particularly noticeable in butterfly networks, where each neuron only has 2 inputs and 2 outputs, hence 50% of connections are wasted. We replace one connection per neuron with a skip connection valued at 1, and show the resulting performance in Figure 4. Note that skip connections significantly improve the performance of butterfly and hypercube networks.
VII-B The need for input-output pair equality
While skip connections help topologies achieve similar performance (see hypercube and butterfly in Figure 4), some topologies still outperform or underperform others. We turn to our controllability heuristic for an explanation of this behavior. We train our controllability network from Section VI-C with topologies from Figure 4. The trained network produces , the controllability tensor of the last layer of the network. This is a 3D tensor where specifies the optimizer’s control of input ratio at output neuron . Since we optimize for the number of ratios set, and not the specific configuration, we sum in the second dimension with . We plot the resulting matrices in Figure 5 (a-d).
The total controllability a network achieves is equal to the sum of , and can be interpreted as the total ‘brightness’ of Figures 5 (a-d). With skip connections and a topology that does not oversaturate certain input-output pairs, we can guarantee that the number of controllable ratios is equal to the number of trainable edges in the graph. Notice that the hypercube and torus figures have significant variance, while Clos and butterfly networks are smooth. This means Clos and butterfly networks do not prioritize any specific input-output pair. Since tori and hypercubes are examples of small-world networks [small_world_networks], these networks find it easier to satisfy closer input-output pairs. Even though these networks are trained with L2 loss, where outlier pairs are incentivized to approach the mean, torus and hypercube networks show significant variance. This is important since, when given a matrix which is to be reconstructed, permuting columns or rows in a targeted way may improve reconstruction accuracy. Ideally, the position of an input within the topology should not impact the performance of the network. In Figure 5, we give four examples: hypercubes, whose controllability matrix has high mean and high variance; Clos, with high mean and low variance; butterfly, with low mean and low variance; and torus, with low mean and high variance.
VII-C The need for shallowness
[saxe] explore the dynamics of training deep linear networks, and show that deep linear networks have at most a constant slowdown in convergence compared to shallow linear networks. This effect (1) may still be detrimental to training, and (2) to the best of our knowledge, has not been studied for the case of sparse deep linear networks. Hence, when choosing between two otherwise equivalent topologies (i.e., topologies with the same controllability mean and variance), we should choose the shallower one. Furthermore, observing Figure 3, we see that after a certain depth, butterfly, torus, and hypercube networks lose performance with depth, despite gaining parameters. This is likely due to an issue with initializing very sparse deep networks, as sparse random initializations may be more vulnerable to noise compared to dense networks. On the other hand, constant-depth topologies such as Clos (with depth 3) and low-rank (with depth 2) eventually outperform all other variable-depth topologies. Similarly, our ideal topology should have constant depth.
VII-D The need for high information bandwidth
In Figure 4 we notice that for small parameter counts, the Clos topology is outperformed by both butterflies and hypercubes. By analyzing the controllability matrices of low-parameter Clos networks, we see that this behavior stems from the limited information bandwidth of Clos networks. In Figure 8(a), we show an example of a Clos network that underperforms due to limited bandwidth.
VII-E One topology to rule them all
We evaluate different topologies against the above criteria, namely: (1) a topology should use skip connections, (2) the controllability matrix of a topology should have no variance, (3) the topology depth should not change with the number of parameters, and (4) the topology should have high information bandwidth, independent of the number of parameters. All of the above topologies can satisfy criterion (1) given skip connections; however, only Clos and butterfly satisfy criterion (2). Since butterfly, hypercube, and torus topologies grow in depth as the parameter budget grows, while Clos grows in width, only Clos satisfies criterion (3). However, while butterfly, hypercube, and torus satisfy criterion (4), Clos does not. Hence, we propose a topology we call parallel butterfly, which satisfies all of these criteria. Parallel butterfly consists of butterfly networks of maximum depth , where is a metaparameter selected ahead of time. All networks are connected to the same inputs, and sum their outputs. With a small number of parameters at its disposal, the parallel butterfly sets , and grows in depth up to layers. Afterwards, it grows in width by increasing the parameter . In Figure 4 we show that parallel butterfly outperforms all other topologies. In Appendix Figure 8(b), we give an example of a parallel butterfly topology where .
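The growth rule of the parallel butterfly can be sketched as below. This is a simplified illustration under several assumptions: the layer width is a power of two, every butterfly layer has two edges per neuron (skip connections are not treated separately), and the budget is expressed as a number of edges. The names butterfly_layers, plan_parallel_butterfly, and count_paths are hypothetical helpers and not part of any existing library.

```python
def butterfly_layers(n, depth):
    """Edge lists of a butterfly on n = 2**k neurons.

    Layer b links neuron i to itself and to i XOR 2**(b mod k).
    """
    k = n.bit_length() - 1
    if n < 2 or n != 1 << k:
        raise ValueError("n must be a power of two, at least 2")
    return [[(i, j) for i in range(n) for j in (i, i ^ (1 << (b % k)))]
            for b in range(depth)]


def plan_parallel_butterfly(n, edge_budget, max_depth):
    """Choose (num_networks, depth) for a given edge budget.

    A single butterfly first grows in depth up to max_depth; any further
    budget is spent on additional butterflies that share the same inputs.
    """
    edges_per_layer = 2 * n
    affordable_layers = max(1, edge_budget // edges_per_layer)
    if affordable_layers <= max_depth:
        return 1, affordable_layers
    return affordable_layers // max_depth, max_depth


def count_paths(layers, n):
    """paths[i][j] is the number of paths from input i to output j."""
    paths = [[int(i == j) for j in range(n)] for i in range(n)]
    for edges in layers:
        nxt = [[0] * n for _ in range(n)]
        for src, dst in edges:
            for i in range(n):
                nxt[i][dst] += paths[i][src]
        paths = nxt
    return paths


if __name__ == "__main__":
    n, max_depth = 8, 3
    for budget in (16, 48, 96):
        print(budget, plan_parallel_butterfly(n, budget, max_depth))
    full = count_paths(butterfly_layers(n, max_depth), n)
    print(all(p == 1 for row in full for p in row))
```

A butterfly of depth log2(n) connects every input to every output through exactly one path, so stacking further butterflies in parallel adds path diversity without adding depth.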
VIII Sparse Convolutional Neural Networks
While in previous sections we have focused on linear layers, and it is trivial to extend these changes to GRUs and LSTMs, here we show how a priori sparsity can be applied to convolutional layers. The core idea is that any convolution of input and output channels where can be decomposed into two separate operations: a depthwise convolution followed by a pointwise convolution. The depthwise convolution uses kernels of size , one applied on each channel independently. The pointwise convolution ‘mixes’ these convolved channels together into output channels by applying filters, each one only looking at a column of values. The pointwise convolutions can therefore be viewed as sliding a dense network across each column of pixels of the input image. The depthwise convolution has and the pointwise convolution has parameters, hence we expect the majority of parameters to exist in the pointwise convolution.
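The parameter split between the two stages can be illustrated with the standard counts for a depthwise separable convolution. The sketch assumes square kernels of side k, one depthwise kernel per input channel, and no biases; the symbol names k, c_in, and c_out are chosen here for illustration.

```python
def standard_conv_params(k: int, c_in: int, c_out: int) -> int:
    """Weights of a regular convolution: one k x k kernel per channel pair."""
    return k * k * c_in * c_out


def depthwise_params(k: int, c_in: int) -> int:
    """Weights of the depthwise stage: one k x k kernel per input channel."""
    return k * k * c_in


def pointwise_params(c_in: int, c_out: int) -> int:
    """Weights of the pointwise (1 x 1) stage: a dense c_in x c_out mixing."""
    return c_in * c_out


if __name__ == "__main__":
    k, c_in, c_out = 3, 256, 256
    dw = depthwise_params(k, c_in)
    pw = pointwise_params(c_in, c_out)
    print(standard_conv_params(k, c_in, c_out))  # 589824
    print(dw, pw)                                # 2304 65536
    print(f"pointwise share: {pw / (dw + pw):.3f}")
```

In this example the pointwise stage holds well over 90% of the weights of the separable layer, which is why pruning it is where most of the savings are to be found.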
We can replace the dense pointwise convolution with a cascade of a priori sparse pointwise convolutions, as illustrated in Figure 1(a). In Figure 6 we show the training accuracy per epoch of MobileNet v2 [DBLP:journals/corr/abs180104381] and four different ClosNet configurations with and fewer parameters. The reduction in parameters (red line) actually increases accuracy compared to the original MobileNet, which is likely due to the MobileNet overfitting.
IX Conclusion
In this work, we have explored accelerating DNN training by pruning networks ahead of time. We proposed replacing dense and convolutional layers with sparse cascades whose topologies are selected ahead of time. We presented an a priori sparse neural network initialization scheme that allows us to train very deep networks without the vanishing gradient problem. Since networks are pruned before the model has seen any training data, we investigated topologies that maximize accuracy over any domain. We have developed a data-free heuristic that can evaluate the sparse network’s control of outputs with respect to inputs, allowing us to assess the expressiveness of a given topology. We have extracted several requirements that make for a good topology, such as the need for skip connections, information bandwidth, shallowness, and input-output pair equality. Finally, we have proposed a topology we call parallel butterfly as the ideal topology for training a priori sparse networks, and have experimentally shown that it outperforms other considered topologies.
References
Appendix A Initializing Deep Linear Neural Networks
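In [pmlrv9glorot10a], the authors propose that the difficulty of training deep neural networks lies in their initialization. They observe that for the common weight initialization of $W_{ij} \sim U\left[-\frac{1}{\sqrt{n}}, \frac{1}{\sqrt{n}}\right]$, where $n$ is the number of layer inputs, the variance of activations decreases as the signal progresses through the layers. Similarly, the variance of the gradients is the highest at the last layer, and decreases as gradients are backpropagated towards the input layer. We briefly cover a derivation of the Xavier initialization here.
Given a layer with $n_{in}$ input and $n_{out}$ output neurons, and a uniform elementwise variance $\mathrm{Var}[a^{i}]$ of $i$th layer activations, $\mathrm{Var}[W^{i}]$ of $i$th layer weights, and $\mathrm{Var}[g^{i}]$ of $i$th layer gradients, we can calculate the variance of the next / previous layer’s activations and gradients as:
$$\mathrm{Var}[a^{i+1}] = n_{in} \, \mathrm{Var}[W^{i}] \, \mathrm{Var}[a^{i}], \qquad \mathrm{Var}[g^{i}] = n_{out} \, \mathrm{Var}[W^{i}] \, \mathrm{Var}[g^{i+1}] \tag{21}$$
In order to maintain the variance across layers, the activation and gradient variances of adjacent layers should be equal, which requires:
$$n_{in} \, \mathrm{Var}[W^{i}] = 1, \qquad n_{out} \, \mathrm{Var}[W^{i}] = 1 \tag{22}$$
For non-square weight matrices, the authors compromise and set the weight variance as:
$$\mathrm{Var}[W^{i}] = \frac{2}{n_{in} + n_{out}} \tag{23}$$
If the weight matrix is initialized with a uniform distribution $U[-r, r]$, the distribution variance can be calculated as:
$$\mathrm{Var}[W^{i}] = \frac{(r - (-r))^{2}}{12} = \frac{r^{2}}{3} \tag{24}$$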
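As a numerical sanity check of this derivation, the sketch below samples a weight matrix from a uniform distribution on $[-r, r]$ and compares its empirical variance with the compromise variance $2 / (n_{in} + n_{out})$. It assumes the standard Xavier bound $r = \sqrt{6 / (n_{in} + n_{out})}$, which follows from equating $r^{2}/3$ with that compromise variance; the function names and the layer sizes are ours and purely illustrative.

```python
import math
import random


def xavier_uniform_bound(n_in, n_out):
    """Bound r such that U[-r, r] has variance 2 / (n_in + n_out)."""
    return math.sqrt(6.0 / (n_in + n_out))


def xavier_uniform(n_in, n_out, rng):
    """Sample an n_out x n_in weight matrix (list of rows) from U[-r, r]."""
    r = xavier_uniform_bound(n_in, n_out)
    return [[rng.uniform(-r, r) for _ in range(n_in)] for _ in range(n_out)]


def empirical_variance(matrix):
    """Population variance over all entries of the matrix."""
    values = [w for row in matrix for w in row]
    mean = sum(values) / len(values)
    return sum((w - mean) ** 2 for w in values) / len(values)


rng = random.Random(0)  # fixed seed so the check is repeatable
weights = xavier_uniform(n_in=200, n_out=100, rng=rng)
target = 2.0 / (200 + 100)
print(empirical_variance(weights), target)
```

With 20,000 sampled weights, the empirical variance should land within a few percent of the target.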
Appendix B Measuring the number of solvable ratio constraints
On a practical note, one way to test how many ratios a network can learn is to append a ‘diagonal layer’ to the end of the network (i.e., a new layer with a single neuron attached to each output), as seen in Figure 7. The diagonal layer is a diagonal matrix whose only trainable elements are on the main diagonal, and all other values are 0. When training a network, this diagonal layer can only learn magnitudes, and not ratios between signals, because each neuron only has one input and cannot ‘mix’ any signals. This gives us an easy way of measuring the number of ratios a network can correctly express: we train a network with L1 loss until it converges, and then count the number of constraints the network has satisfied. These constraints can be ratio constraints or magnitude constraints. If we have output neurons, we know that the last layer will have satisfied all magnitude constraints, one per output neuron. Hence, the number of ratios the network can satisfy is the number of satisfied constraints minus the number of output neurons. For example, the network in Figure 7 (right, though true for left too) can satisfy three out of the four absolute constraints. Two of those are magnitude constraints, meaning it can only satisfy one ratio constraint. That ratio is calculated at neuron , so either neuron or can get a correct ratio of inputs, but not both. Of course, with L2 loss, the network will settle for a solution that doesn’t satisfy either, but picks some middle ground.
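The counting step can be sketched as follows. This is our own interpretation, not the authors' code: we assume each ‘absolute constraint’ corresponds to one entry of the target linear map, that a constraint counts as satisfied when the learned end-to-end matrix matches that entry within a tolerance `tol`, and that one magnitude constraint per output neuron is always satisfied by the diagonal layer. The function name and the example matrices are hypothetical.

```python
def count_satisfied_ratios(target, learned, tol=1e-3):
    """Return (satisfied_constraints, satisfied_ratios).

    `target` and `learned` are n_out x n_in matrices given as lists of rows.
    Ratios are what remains after subtracting one magnitude constraint
    per output neuron from the satisfied constraints.
    """
    n_out = len(target)
    satisfied = sum(
        1
        for t_row, l_row in zip(target, learned)
        for t, l in zip(t_row, l_row)
        if abs(t - l) <= tol
    )
    return satisfied, max(0, satisfied - n_out)


# Two outputs, four absolute constraints; the learned map misses one entry.
target = [[1.0, 2.0], [3.0, 4.0]]
learned = [[1.0, 2.0], [3.0, 5.0]]
print(count_satisfied_ratios(target, learned))
```

In this toy case three of the four constraints are met, two of them are magnitude constraints, and one ratio is satisfied, matching the worked example in the text.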
Appendix C Controllability Corollaries
Corollary C.0.1.
For a neuron with a set of input neurons connected with trainable and constant connections, controllability is:
(27) 
where
(28) 
Notice that as at least one connection is constant, the network can make full use of all trainable connections.
Appendix D Decomposing graphs
We briefly show how graphs can be decomposed so that we can apply Theorems VI.4 and VI.5. In Figure 8 we see a 3-1-2 graph decomposed into a graph where each neuron has at most 2 inputs or outputs. We can apply Theorem VI.4 to subgraph and , and Theorem VI.5 to subgraph . Since neurons and only have one input, their controllability is identical to that of and , respectively.