Experiments on Properties of Hidden Structures of Sparse Neural Networks

07/27/2021 ∙ by Julian Stier, et al. ∙ Universität Passau

Sparsity in the structure of Neural Networks can lead to less energy consumption, less memory usage, faster computation times on convenient hardware, and automated machine learning. If sparsity gives rise to certain kinds of structure, it can explain automatically obtained features during learning. We provide insights into experiments in which we show how sparsity can be achieved through prior initialization, pruning, and during learning, and answer questions on the relationship between the structure of Neural Networks and their performance. This includes the first work on inducing priors from network theory into Recurrent Neural Networks and an architectural performance prediction during a Neural Architecture Search. Within our experiments, we show how magnitude class blinded pruning with re-training achieves 97.5% accuracy after compression, which is 0.5 points more than without compression, that magnitude class uniform pruning is significantly inferior to it, and how a genetic search enhanced with performance prediction achieves 82.4% accuracy. Further, performance prediction for Recurrent Networks learning the Reber grammar shows an R^2 of up to 0.81 given only structural information.


1 Introduction

Understanding the structure of deep neural networks promises advances across many open problems such as energy-efficient hardware, computation times, and domain-specific performance improvements. The structure is coupled with sparsity on different levels of the neural architecture, and if there is no sparsity, then there is also no structure: a single-hidden-layer neural network is capable of universal approximation kidger2020universal, but as soon as there exists a deeper structure, sparsity naturally occurs.

Clearly, the structure between the input domain and the first hidden layer is tightly coupled with the structure within the data – correlations between the underlying random variables, such as the spatial correlation of images or correlation in windows of time series data. In theory and with perfectly fitting functions, that should be all there is, but in practice neural architectures have grown deeper and deeper, and hidden structures seem to have an effect when neural networks are not just measured by their goodness of fit but also, e.g., by hardware efficiency or robustness benamor2021robustness. Assuming such hidden structures exist for the better, we wonder how we can automatically find them, how they can be controlled during learning, and whether we can exploit given knowledge about them.

We give our definition for sparse neural networks and show experiments on automatic methods to obtain hidden structures: pruning, neural architecture search, and prior initialization. With structural performance prediction, we also show experiments on exploiting structural information to speed up neural architecture search methods.

Our contributions comprise a PyTorch tool called deepstruct (http://github.com/innvariant/deepstruct) which provides models and tools for Sparse Neural Networks, a genetic neural architecture search enhanced with structural performance prediction, a comparison of magnitude-based pruning on feed-forward and recurrent networks, an original correlation analysis on recurrent networks with different biologically plausible structural priors from social network theory, and performance prediction results on these recurrent networks. Details on the experiments and code for reproducibility can be found at github.com/innvariant/sparsity-experiments-2021.

2 Sparse Neural Networks

Sparse Neural Networks (SNNs) are deep neural networks with a low proportion of connectivity with respect to all possible connections.

Sparsity

Given a vector $v \in \mathbb{R}^n$ with $n \in \mathbb{N}$, its sparsity is $s(v) = \frac{\lVert v \rVert_0}{n}$, given the cardinality function $\lVert \cdot \rVert_0$ (of which the 0 refers to the case $p = 0$ of an $L_p$-norm) and the size $n$ of the vector. Density is defined as its complement with $d(v) = 1 - s(v)$. The definition extends naturally to tensors and simply provides the proportion of non-zero elements in a tensor compared to the total number of its elements. A tensor can be considered as sparse as soon as its sparsity is below a given threshold value, e.g., $s(v) < 0.5$ – as soon as more than 50% of its elements are zero.
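The definition translates directly into code. A minimal sketch, assuming PyTorch and following the definition above (sparsity as the proportion of non-zero elements, density as its complement):

    import torch

    def sparsity(t: torch.Tensor) -> float:
        # proportion of non-zero elements, as defined above
        return t.count_nonzero().item() / t.numel()

    def density(t: torch.Tensor) -> float:
        # complement of sparsity
        return 1.0 - sparsity(t)

    w = torch.tensor([0.0, 0.3, 0.0, -1.2])  # small illustrative weight vector
    assert sparsity(w) == 0.5  # two of four elements are non-zero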

What is the motivation for sparsity at all? First, more sparsity implies a lower number of parameters which is desirable if the approximation and generalization capabilities are not heavily affected. In theory, it also implies a lower number of computations. From a technical perspective, sparse structures could lead to specialized hardware. Further, sparsity means that there is space for compression that can affect the overall model memory footprint. Memory requirements are an important aspect for limited capacity devices such as in mobile deployment. In the feature transformation layers, sparsity explains data dependencies and provides room for explainability.

Neural Networks

A neural network is a function $f: \mathbb{R}^{n_0} \rightarrow \mathbb{R}^{n_L}$ composed of non-linear transformation layers $h_l = \sigma_l(W_{l-1,l}\, h_{l-1})$, extended with transformations for skip-layer connections such that $h_l = \sigma_l\left(\sum_{k<l} W_{k,l}\, h_k\right)$, with $\sigma_l$ being the activation of layer $l$, being e.g. $\tanh$ or $\max(x, 0)$. $W_{k,l}$ describes the weights from layer $k$ to layer $l$ for a network with layers $0 \leq k < l \leq L$. The input to the function is $h_0 = x$ from the input domain $\mathbb{R}^{n_0}$. Consecutive sizes of weight matrices need to be aligned and define the layer sizes $n_1, \ldots, n_{L-1}$. The final weight matrices map to the output domain with $h_L \in \mathbb{R}^{n_L}$.

Given the weights of a neural network $\mathcal{N}$ as a set of grouped vectors $W = \{w^{(1)}, \ldots, w^{(m)}\}$, we overload $s$ such that we obtain the sparsity of a neural network as $s(\mathcal{N})$. A Sparse Neural Network is a neural network with low sparsity, e.g. $s(\mathcal{N}) \leq 0.5$. The set of grouped vectors could, e.g., be all neurons with their weights from all possible incoming connections.

Sparse Neural Networks (SNN) naturally provide a directed acyclic graph associated with them. Neurons defined in the layers $h_1, \ldots, h_L$ translate to vertices, and weighted connections from a neuron in layer $k$ to a neuron in layer $l$ translate to edges between the according vertices. The resulting graph structure contains reduced information about the original model. Technically, the graph can be reflected in an SNN with masks $M_{k,l}$ associated in a multiplicative manner through the Hadamard product with each weight matrix $W_{k,l}$, and we get $h_l = \sigma_l\left(\sum_{k<l} (M_{k,l} \odot W_{k,l})\, h_k\right)$. This mask can be either obtained through learned weights by, e.g., pruning or regularization, or induced by prior design.
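How such a multiplicative mask can be realized is sketched below; this is an illustrative PyTorch layer, not deepstruct's actual API:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MaskedLinear(nn.Module):
        # linear layer whose weight is multiplied elementwise (Hadamard product) with a fixed 0/1 mask
        def __init__(self, in_features, out_features, mask):
            super().__init__()
            self.linear = nn.Linear(in_features, out_features)
            # buffer: moves and is saved with the module, but is not trained
            self.register_buffer("mask", mask.float())

        def forward(self, x):
            return F.linear(x, self.linear.weight * self.mask, self.linear.bias)

    mask = (torch.rand(20, 10) > 0.5).float()  # hypothetical sparse structure
    layer = MaskedLinear(10, 20, mask)
    out = layer(torch.randn(4, 10))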

All deep feed-forward neural networks are sparse by design as they lack residual connections. Even ResNets and DenseNets contain sparsity as they employ convolutions, which are sparse by design he2016deep; zhang2018residual. Techniques such as pooling compute order statistics over their input and thus must either be excluded or be understood as a structural singularity.

Sparse Recurrent Neural Networks (SRNN)

Recurrent Neural Networks additionally have recurrent connections which unfold over time. These recurrent connections are initialized as hidden states $h_0$. At any sequence step $t$, $h_t = \sigma(W_{ih}\, x_t + W_{hh}\, h_{t-1})$, with $W_{ih}$ and $W_{hh}$ being input-to-hidden weights and hidden-to-hidden weights, respectively. We refer to $x_t$ as the input at sequence step $t$ and to $h_{t-1}$ as the hidden state value from the previous step.

Similar to SNNs, SRNNs also consist of non-linear transformation layers extended with skip-layer connections such that $h_t^l = \sigma_l\left(\sum_{k<l} W_{k,l}\, h_t^k + W_{hh}^l\, h_{t-1}^l\right)$, with $h_t^l$ denoting the state of layer $l$ at sequence step $t$.

Figure 1: A simple Recurrent Neural Network unrolled over sequence steps. Here, $W_{ih}$ represents input-to-hidden weights, $W_{hh}$ represents hidden-to-hidden weights, and $W_{ho}$ represents hidden-to-output weights.

In SRNNs, directed acyclic graphs are not associated with recurrent connections, only with consecutive transformation layers. Therefore, to reflect a graph in an SRNN, only input-to-hidden weights are multiplied with masks through the Hadamard product, such that the hidden state update can be formulated as $h_t = \sigma\left((M \odot W_{ih})\, x_t + W_{hh}\, h_{t-1}\right)$.
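A sketch of imposing such a mask on the input-to-hidden weights of a vanilla PyTorch RNN; the parameter name weight_ih_l0 follows PyTorch's convention, and the random mask only stands in for a structural prior:

    import torch
    import torch.nn as nn

    rnn = nn.RNN(input_size=8, hidden_size=16, num_layers=1, batch_first=True)
    mask_ih = (torch.rand_like(rnn.weight_ih_l0) > 0.5).float()  # placeholder for a structural prior

    with torch.no_grad():
        rnn.weight_ih_l0.mul_(mask_ih)  # Hadamard product; hidden-to-hidden weights stay dense

    # to keep masked entries at zero during training, re-apply the mask after every optimizer step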

Achieving Sparsity

Sparsity refers to a structural property of Neural Networks which can be desirable due to reasons such as model size or hardware acceleration. There exist multiple ways to achieve sparsity, e.g., through regularization, pruning, constraints, or by prior initialization.

Regularization

affects the optimization objective such that not only a target loss but also a parameter norm is minimized. As such, regularization takes effect during training and can force weights to be of small magnitude. Under sufficient conditions, e.g., with an L1-norm and rectified linear units as activation functions, sparsity in the trained network can be achieved in an end-to-end fashion during learning.
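As a minimal sketch of this idea (assuming a standard PyTorch training loop; the penalty weight lam is illustrative):

    import torch

    def l1_penalty(model, lam=1e-4):
        # sum of absolute weight values, added to the task loss to push weights towards zero
        return lam * sum(p.abs().sum() for p in model.parameters())

    # inside a training loop (model, criterion, inputs, targets assumed to exist):
    #   loss = criterion(model(inputs), targets) + l1_penalty(model)
    #   loss.backward(); optimizer.step()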

Pruning refers to removing elements of the network outside of the end-to-end training phase. Based on a selection criterion such as the magnitude of a weight, one or multiple weights can be set to zero. A pruning scheme decides on which sets the criterion is applied and how often the pruning is repeated. Sparsity is enforced based on this selection criterion and pruning scheme.

Prior design is a constraint on the overall search space of the optimization procedure. More generally, giving prior structure to a neural network restricts the hypothesis space of all possible obtainable functions during learning to a smaller space. Convolutions, as feed-forward layers with local spatial restrictions, can be understood as such a prior design.

Pruning Neural Networks

Pruning is a top-down method to derive a model from an originally larger one by iteratively removing elements of the neural network. The motivation to prune is manifold: 1) finding high-performing network structures can be faster in comparison to other search methods such as grid- or random-search, 2) pruning can improve metrics such as error, generalization, fault tolerance, or robustness, 3) it can reduce computational costs in terms of speed, memory requirements, and energy efficiency, and 4) it can support the interpretation of neural networks.

Pruning consists of a selection method and a strategy. The selection method decides which elements to choose based on a criterion, e.g., the magnitude of a weight. The strategy applies the selection method repeatedly on a model until some stopping criterion is reached, e.g., until a certain number of iterations has been conducted.

One-shot or single pruning refers to applying the pruning method once. After pruning, often a certain number of re-training cycles are conducted. Fixed-size pruning refers to selecting a fixed number of elements based on the ranking obtained through the pruning selection method; in each step, the same number of elements is removed. Relative or percentage pruning refers to selecting a percentage of the remaining elements to be pruned; since the percentage refers to the remaining elements, fewer elements are removed in later iterations, which counteracts sudden decays of performance. Bucket pruning holds a bucket value which is filled by, e.g., the weight magnitude or the saliency measure of the pruning selection method, and as many elements as the bucket can hold are removed per step.
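A sketch of an iterative relative (percentage) magnitude pruning step under these notions, assuming PyTorch; the surrounding loop and the retrain helper are illustrative placeholders:

    import torch

    def relative_magnitude_prune(weights: torch.Tensor, mask: torch.Tensor, fraction: float):
        # zero out the given fraction of the *remaining* (unmasked) weights with smallest magnitude
        remaining = weights[mask.bool()].abs()
        if remaining.numel() == 0:
            return mask
        threshold = torch.quantile(remaining, fraction)
        return mask * (weights.abs() > threshold).float()

    # iterative scheme: prune, re-train for a few epochs, repeat until a stopping criterion is met
    #   mask = torch.ones_like(layer.weight)
    #   for step in range(num_prune_steps):
    #       mask = relative_magnitude_prune(layer.weight.data, mask, fraction=0.1)
    #       layer.weight.data.mul_(mask)
    #       retrain(model, epochs=2)  # hypothetical helper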

A naive method for pruning is the random selection of components. Differences can be made by defining the granularity, e.g., whether to prune weights, neurons, or even channels or layers. Random pruning often serves as a baseline for pruning methods to show their general effectiveness, and it has been shown in various articles that most magnitude- and error-based methods outperform their random baseline, see Figure 4(a). Usually, models drop in performance after pruning but recover within a re-training phase of a few epochs.

For magnitude-based pruning, good explanations can be found in marchisio2018prunet; han2015learning and in recent surveys such as liang2021pruning; gale2019state. Class-blinded pruning selects weights based on their magnitude regardless of their class, i.e., their layer; class-uniform pruning selects the same amount of weights from each class; and class-distributed pruning selects elements in proportion to the standard deviation of weight magnitudes in the respective class.
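To illustrate the difference between the first two selection schemes, a sketch with one global magnitude threshold (class-blinded) versus one threshold per layer (class-uniform, realized here as the same fraction per layer); layer grouping is simplified:

    import torch

    def class_blinded_masks(layers, fraction):
        # one global magnitude threshold across all layers (classes)
        all_weights = torch.cat([l.weight.data.abs().flatten() for l in layers])
        threshold = torch.quantile(all_weights, fraction)
        return [(l.weight.data.abs() > threshold).float() for l in layers]

    def class_uniform_masks(layers, fraction):
        # the same fraction is pruned from each layer individually
        masks = []
        for l in layers:
            threshold = torch.quantile(l.weight.data.abs().flatten(), fraction)
            masks.append((l.weight.data.abs() > threshold).float())
        return masks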

Prior Design

Restricting the search space of the neural network architecture fosters faster convergence, and domain-specific performance improvements can be achieved. Convolutions with kernels applied over spatially related inputs are a good example of such a prior. Similarly, realizations such as the MLP-Mixer tolstikhin2021mlp show that even multi-layer perceptrons with additional imposed structure and final poolings can achieve state-of-the-art performance.

3 Related Work

Pruning

Pruning dates as far back as 1988, when Sietsma et al. sietsma1988neural provided the first error-based approach to pruning hidden units from a neural network. Notable contributions of this time have been Optimal Brain Damage and its modification Optimal Brain Surgeon lecun1990optimal; hassibi1993optimal, besides techniques such as Skeletonization mozer1989skeletonization, which introduced estimating the sensitivity of the error, fault tolerance segee1991fault, and improved generalization thodberg1991improving. Magnitude-based approaches followed with weight-elimination weigend1991generalization; weigend1991back. There have also been several specializations of pruning, such as sankar1991optimal. Reed reed1993pruning provides a well-known survey on pruning algorithms up until then.

Recent work on pruning has been conducted by Han et al. han2015learning, using different magnitude-based pruning methods for deep neural networks in the context of compression, or by Dong et al. dong2017learning. A survey by Liang et al. liang2021pruning provides extensive insights into pruning and quantization for deep neural networks. In 2018, the authors of the Lottery Ticket Hypothesis reported on finding sparse subnetworks by iteratively pruning, re-setting to the original weight initialization, and training from scratch to a comparable performance frankle2018lottery. We collected over 300 articles on pruning just up until 2019.

First pruning experiments on Recurrent Neural Networks were performed by Lee et al. giles1994pruning. Han et al. han2017adaptive proposed recurrent self-organising neural networks, adding or pruning hidden neurons based on their competitiveness during training. A Baidu research group narang2017exploring could reduce the network size by 8x while maintaining near-original performance. In 2019, Zhang et al. zhang2019one proposed one-shot pruning for Recurrent Neural Networks using the recurrent Jacobian spectrum. Using this technique, the authors confirmed that their network, even with 95% sparsity, performed better than fully dense networks.

Sparse Training

Besides $L_1$- or $L_2$-regularizations, which can lead to real zeros with appropriate activation functions such as ReLUs, there are also training methods outside the optimization objective to enforce sparsity, such as Sparse Evolutionary Training (SET) by Mocanu et al. mocanu2018scalable, Dynamic Sparse Reparameterization (DSR) mostafa2019parameter, or the Rigged Lottery (RigL) evci2020rigging. For RNNs, there exists Selfish Sparse RNN Training liu2021selfish.

Neural Architecture Search

On Neural Architecture Search (NAS), there are two notable recent surveys by Elsken et al. elsken2019neural and Wistuba et al. wistuba2019survey, providing an overview and dividing NAS into the definition of a search space, a search strategy over this space, and the performance estimation strategy. Differentiable architecture search liu2018darts is a notable method for finding sparse neural networks on a high-level graph-based search space by allowing the search to choose among paths in a categorical and differentiable manner. While there exist hundreds of variations in the definition of search spaces and methods, the field recently came up with benchmarks and comparable metrics ying2019bench.

Structural Performance Prediction

Structural Performance Prediction refers to using structural features of a neural network to predict a performance estimate without any or with only partial training. Such performance prediction was already conducted by Baker et al. baker2018accelerating on two structurally simple features, namely the total number of weights and the number of layers, but they mostly focused on prediction based on hyperparameters and time-series information. Klein et al. also did performance prediction based on time-series information; they conducted “learning curve prediction with Bayesian Neural Networks” klein2016learning. In stier2019structural, more extensive graph properties of randomly induced structures were used to predict the performance of neural networks for image classification. A related work on a “genetic programming approach to design convolutional neural network architectures” wendlinger2021evofficient included an acceleration study for accuracy prediction based on path depth, breadth-first-search level width, layer output height and channels, and connection type counts. The performance prediction during a NAS yields a 1.69x speed-up. Similar to stier2019structural, in benamor2021robustness we used structural properties to predict the robustness of recurrent neural networks.

4 Experiments

We conducted four experiments: First, pruning feed-forward neural networks to investigate the effect of different pruning methods, namely random pruning, magnitude class-blinded, magnitude class-uniform, magnitude class-distributed, and Optimal Brain Damage lecun1990optimal. Second, pruning recurrent neural networks to investigate whether we observe similar compression rates and to have a baseline comparison for recurrent models in the subsequent experiment. Third, inducing random graphs as structural priors into recurrent neural networks, motivated by the observation that biological neural networks are also connected like small-world networks hilgetag2016brain. And fourth, conducting a genetic neural architecture search with architectural performance prediction – a complex search space with a strategy that can be accelerated when having information about properties of sparse genotypes.

4.1 Pruning Feed-Forward Networks

On MNIST lecun1998gradient we used two different feed-forward architectures with rectified linear units and 100 or 300-100 neurons in the hidden layers, trained for up to 200 epochs with cross-entropy, a batch size of 64, and learning rates of 0.01 and 0.0001. With the five pruning methods, we conducted several repeated experiments with iterative fixed-size pruning or iterative relative pruning of the number of weights.
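For illustration, a sketch of the larger of the two architectures under these settings (layer sizes from the text, the remaining details are assumptions):

    import torch.nn as nn

    # 784-300-100-10 feed-forward network with rectified linear units for MNIST
    model = nn.Sequential(
        nn.Flatten(),
        nn.Linear(28 * 28, 300), nn.ReLU(),
        nn.Linear(300, 100), nn.ReLU(),
        nn.Linear(100, 10),
    )
    criterion = nn.CrossEntropyLoss()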

We found that magnitude class blinded pruning clearly outperforms the other methods, and Optimal Brain Damage, surprisingly, performs nearly the same, although it uses second-order derivative information for pruning. Pruning in general can dramatically reduce overfitting in the examined network and can even outperform other regularization techniques.

(a) Five pruning methods on feed-forward networks. Choosing weight magnitude over random selections clearly has advantages. Optimal Brain Damage is expensive and worthwhile in a low-parameter regime.
(b) Pruning both input-to-hidden and hidden-to-hidden weights on a recurrent neural network.
Figure 4: Pruning performance in feed-forward networks (Figure 4(a)) and recurrent networks (Figure 4(b)).

4.2 Pruning Recurrent Networks

Figure 8: The red right arrow resembles pruning of hidden-to-hidden weights, the red up-arrow pruning of input-to-hidden weights. Figure 8(a) shows pruning both, Figure 8(b) only i2h weights, and Figure 8(c) only h2h weights.

In this experiment, we prune input-to-hidden (“i2h”) and hidden-to-hidden (“h2h”) weights individually and simultaneously, as depicted in Figure 8, on a pre-trained base recurrent model for the Reber grammar reber1976implicit, trained for 50 epochs.

Figure 9: Flow diagram to generate Reber grammar sequences.

One of the first uses of the Reber grammar was in the paper that introduced the LSTM hochreiter1997long. A true Reber grammar sequence follows the flow-chart shown in Figure 9. For example, BPTVPXTTVVE is a true Reber grammar sequence, while BTTVPXTVPSE is a false one. The base recurrent model consists of an embedding layer that accepts input in the form of ASCII values of each character in the input Reber sequence. Three recurrent layers of 50 neurons follow the embedding layer, and a linear layer predicts the final scores. Tanh and ReLU are used as non-linearities. The models are trained with a learning rate of 0.001 and a batch size of 32.

From the Reber grammar, we generated 25000 sequences, out of which 12500 are true Reber grammar sequences, and the remaining are false. The dataset resembles a binary classification task in which a model has to predict whether a sequence is in the Reber grammar or not. Logically, a baseline performance from random guessing is an accuracy of 50%. We then split this dataset into a train-test split of 75%-25%, with 18750 sequences in the training set and 6250 sequences in the test set.

The threshold based on which we prune weights is calculated from the percentage of weights to prune. For example, to prune 10% of the weights of a given layer, the threshold is the 10th percentile of all absolute weights in that layer. In our experiment, we go from 10% to 100% while incrementing by 10% after each round.
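A sketch of this percentile-based threshold applied per recurrent layer (the network dimensions are illustrative, and the helpers are not the exact experiment code):

    import torch
    import torch.nn as nn

    def magnitude_mask(w: torch.Tensor, percent: float) -> torch.Tensor:
        # keep only entries above the given percentile of |w|
        threshold = torch.quantile(w.abs().flatten(), percent)
        return (w.abs() > threshold).float()

    rnn = nn.RNN(input_size=7, hidden_size=50, num_layers=3)  # sizes for illustration
    with torch.no_grad():
        for l in range(rnn.num_layers):
            w_ih = getattr(rnn, f"weight_ih_l{l}")
            w_hh = getattr(rnn, f"weight_hh_l{l}")
            w_ih.mul_(magnitude_mask(w_ih, 0.10))  # prune 10% of i2h weights
            w_hh.mul_(magnitude_mask(w_hh, 0.10))  # prune 10% of h2h weights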

We considered LSTM, GRU, and vanilla RNNs as architectures for comparison. Base models were trained separately for each to get the base performance. Then, we pruned i2h and h2h weights, both simultaneously and individually. Based on these results, we can identify the effect pruning has on the performance of RNNs and the amount of re-training required to regain the original performance.

The base models for RNN_ReLU, LSTM, and GRU achieve perfect accuracies of 100% on the test set within the first two epochs. RNN_Tanh achieved 90% after six epochs and showed a drop in accuracy down to 50% between epochs three and five, which we observed over multiple repetitions of the experiment.

Pruning both i2h and h2h weights simultaneously, about 80% of weights in RNN_Tanh, 70% of weights in RNN_ReLU, 60% of weights in LSTM, and 80% of weights in GRU can be safely reduced, as can be observed in Figure 4(b). After pruning above the safe threshold, we re-trained each pruned model and found that it only takes one epoch to regain the original performance. Pruning 100% of the weights, the model never recovers.

(a) Pruning only input-to-hidden weights.
(b) Pruning only hidden-to-hidden weights.
Figure 12: Accuracies of RNN_Tanh, RNN_ReLU, LSTM, and GRU after applying iterative magnitude percentage pruning on a common base model.

Pruning only i2h weights, results showed that we can safely prune about 70% for RNN_Tanh, RNN_ReLU, and LSTM. For GRU, we can prune 80% of i2h weights without noticing a significant reduction in performance, see Figure 12(a). As in the case of pruning both i2h and h2h weights simultaneously, the pruned models still recover after only one re-training epoch with up to 90% of i2h weights pruned in RNN_ReLU, LSTM, and GRU. RNN_Tanh takes about two re-training epochs to recover after 90% i2h weight pruning. Finally, as expected, the pruned models never recover with 100% of i2h weights pruned.

Subsequently, we prune only h2h weights of each recurrent layer in our trained base model. Results showed that we could safely prune about 70% of h2h weights for RNN_ReLU and LSTM, and 80% of h2h weights for RNN_Tanh and GRU, see Figure 12(b). Like pruning only i2h weights, models still recover after one re-training epoch with up to 90% of h2h weights pruned. Pruning 100% of h2h weights, RNN_Tanh and RNN_ReLU never recover, but GRU and LSTM still regain the original performance with just one re-training epoch.

4.3 Random Structural Priors for Recurrent Neural Networks

Another method besides pruning to induce sparsity in a recurrent network is to apply prior structures by design. We use random structures that are generated by converting random graphs into neural architectures, similar to stier2019structural. For this, we begin with a random graph and calculate the layer index of each vertex. A layer index is obtained recursively by $\mathrm{layer}(v) = 0$ if $v$ has no predecessors, and $\mathrm{layer}(v) = 1 + \max_{u \in \mathrm{pred}(v)} \mathrm{layer}(u)$ otherwise. This layer indexing helps to identify the layer of a neural architecture a vertex belongs to.
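A sketch of this layer indexing on a NetworkX DAG, equivalent to a longest-path-from-source index:

    import networkx as nx

    def layer_index(dag: nx.DiGraph) -> dict:
        # layer(v) = 0 for vertices without predecessors, else 1 + max over predecessors
        index = {}
        for v in nx.topological_sort(dag):
            preds = list(dag.predecessors(v))
            index[v] = 0 if not preds else 1 + max(index[u] for u in preds)
        return index

    dag = nx.DiGraph([(0, 2), (1, 2), (2, 3), (1, 3)])  # toy example
    print(layer_index(dag))  # {0: 0, 1: 0, 2: 1, 3: 2}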

Such a graph is used to generate randomly structured ANNs by embedding it between an input and an output layer, as in Figure 16(b). RNNs can be understood as a sequence of neural networks, in which a network model at sequence step $t$ accepts outputs from a model at sequence step $t-1$. Introducing recurrent connections as in Figure 16(c) provides us with Sparse RNNs with random structure.

(a) Initial graph
(b) Sparse Neural Network with induced structural prior.
(c) A sparse RNN with a structural prior based on the initial graph of five vertices.
Figure 16: We select one directed version of the graph from Figure 16(a), compute its topological ordering based on the described layer indexing, and embed it into a neural network as a structural prior as shown in Figure 16(b). This randomly structured ANN can then be converted into a randomly structured RNN by introducing recurrent connections, as in Figure 16(c).

We generate 100 connected Watts–Strogatz watts1998collective and 100 Barabási–Albert barabasi1999emergence graphs using the graph generators provided by NetworkX. The graphs are transformed into recurrent networks and trained on the Reber grammar dataset. Analogously to the pruning experiments, this experiment is conducted with an RNN with Tanh nonlinearity, an RNN with ReLU nonlinearity, LSTM, and GRU. Figure 21 shows the performance differences between Watts–Strogatz and Barabási–Albert graphs.
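A sketch of generating the two graph families with NetworkX; the generator parameters shown here are illustrative, since the text does not list the exact settings:

    import networkx as nx

    n = 30  # number of vertices (illustrative)
    watts_strogatz = [nx.connected_watts_strogatz_graph(n, k=4, p=0.3, seed=s) for s in range(100)]
    barabasi_albert = [nx.barabasi_albert_graph(n, m=2, seed=s) for s in range(100)]

    # each undirected graph is then directed, layer-indexed as described above, and embedded
    # between an input and an output layer to obtain a sparse recurrent architecture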

Figure 21: The performance differences between Watts–Strogatz and Barabási–Albert graphs for (a) RNN with Tanh nonlinearity, (b) RNN with ReLU nonlinearity, (c) LSTM, and (d) GRU.

To identify essential graph properties that correlate with the performance, we calculated the Pearson correlation of each graph property to its corresponding performance results. Table 1 shows the Pearson correlation between test accuracy and different graph properties.

Correlation with test accuracy
Property RNN_Tanh RNN_ReLU LSTM GRU
layers 0.25 0.30 0.28 0.34
nodes 0.40 0.44 0.44 0.49
edges 0.38 0.43 0.42 0.49
source_nodes 0.35 0.47 0.57 0.74
diameter -0.23 -0.27 -0.32 -0.20
density 0.29 0.15 0.29 0.34
average_shortest_path_length -0.27 -0.25 -0.36 -0.23
eccentricity_var -0.22 -0.24 -0.30 -0.21
degree_var -0.28 -0.26 -0.39 -0.58
closeness_var -0.46 -0.39 -0.51 -0.67
nodes_betweenness_var -0.49 -0.41 -0.56 -0.52
edge_betweenness_var -0.34 -0.30 -0.44 -0.26

Table 1: Pearson correlation between the test accuracy of an architecture and different graph properties.
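A sketch of this correlation analysis, assuming the graph properties and test accuracies are collected in a pandas DataFrame (file and column names are illustrative):

    import pandas as pd

    # one row per trained random structure, e.g. columns
    # ["nodes", "edges", "closeness_var", ..., "test_accuracy"]
    df = pd.read_csv("random_rnn_results.csv")  # hypothetical file

    correlations = df.corr(method="pearson")["test_accuracy"].drop("test_accuracy")
    print(correlations.sort_values())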

Based on this correlation, we found closeness_var, nodes_betweenness_var, and the number of nodes to be essential properties for randomly structured RNN_Tanh. For randomly structured RNN_ReLU, the essential properties are the number of nodes, the number of edges, the number of source nodes, and nodes_betweenness_var. In the case of randomly structured LSTM, we found six essential properties, i.e., the number of nodes, the number of edges, the number of source nodes, closeness_var, nodes_betweenness_var, and edge_betweenness_var. Similarly, we found six essential properties for randomly structured GRU, namely, the number of nodes, the number of edges, the number of source nodes, degree_var, closeness_var, and nodes_betweenness_var.

By storing the graph properties and their corresponding performance during the training of randomly structured recurrent networks, we create a small dataset of 200 rows for each RNN variant. We then train three different regression algorithms, namely Bayesian Ridge, Random Forest, and AdaBoost, on this dataset and report an R-squared value for each.
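A sketch of this prediction setup with scikit-learn, under the same illustrative meta-dataset as above:

    import pandas as pd
    from sklearn.ensemble import AdaBoostRegressor, RandomForestRegressor
    from sklearn.linear_model import BayesianRidge
    from sklearn.metrics import r2_score
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("random_rnn_results.csv")  # hypothetical file, see the sketch above
    X, y = df.drop(columns=["test_accuracy"]), df["test_accuracy"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

    for name, reg in [("BayesianRidge", BayesianRidge()),
                      ("RandomForest", RandomForestRegressor()),
                      ("AdaBoost", AdaBoostRegressor())]:
        reg.fit(X_train, y_train)
        print(name, r2_score(y_test, reg.predict(X_test)))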

Performance of RNN_Tanh was best predicted with Bayesian Ridge (BR) regression with an $R^2$ of 0.47919, while Random Forest (RF) achieved 0.43163 and AdaBoost (AB) 0.35698. All regressors have an $R^2$ below 0.5, from which we conclude only a weak fit and predictability based on the used structural features. For RNN_ReLU, RF was best with an $R^2$ of 0.61504, followed by AB with 0.53469 and BR with 0.36075. Structural features on LSTM predicted performance with an $R^2$ of 0.59514 for AB, 0.57933 for RF, and 0.37206 for BR. We found a moderate fit for random forests, similar to stier2019structural. GRU accuracies were predicted with an $R^2$ of 0.78313 for AB and 0.67224 for BR, while RF achieved an $R^2$ of 0.87635. This indicates a strong fit and good predictability, which we interpret carefully as potentially coming from a skewed underlying distribution of the overall dataset of Sparse Neural Networks, but also as an indication of possibly strong predictability in larger settings in which structural properties have even more impact.

4.4 Architectural Performance Prediction in Neural Architecture Search

We investigated a Genetic Neural Architecture Search to find correlations between structural properties of sparse priors and the performance of neural networks on a coarser level, and analysed the predictive capabilities for performances of Sparse Neural Networks when only architectural information is available.

A genetic search is a population-based search with individuals represented in a graph-based genotypical encoding such as in Figure 24(a). Genotypical operations such as crossover between two genotypes distinguish the genetic from an evolutionary search. We employed the three operations selection, mutation, and crossover.

Our analysis of the genetic search included performance prediction on purely genotypical information. In wendlinger2021evofficient, Wendlinger et al. investigated performance prediction for Neural Architecture Search on the search space described by Suganuma et al. in CGP-CNN suganuma2017genetic. We followed a more open search space based on labelled directed acyclic graphs as in irwin2019graph and used a genetic approach that mixes re-sampling, mutation, and crossover instead of a purely evolutionary approach with just point mutations. This has the benefit of a more diverse search space through more operations and a looser genetic encoding scheme.

Figure 24: Figure 24(a) shows an exemplary graph from our search space with a mutation through an inserted sub-graph of depth two. Figure 24(b) shows the correlations between structural properties and the maximum validation accuracy.

Our search space is based on directed acyclic graphs (DAG) and follows Irwin-Harris irwin2019graph to represent CNN architectures as depicted in Figure 24(a). Each vertex of the DAG is labelled with an operation: a convolution, max or average pooling, a summation, a concatenation, or a skip connection. A convolution can have one of a fixed set of kernel sizes.

In total, five genotypical operations were used on the search space: two mutations and three crossover operations. The first mutation remaps vertex operations in a genotype, e.g. it could replace a max_pool in the left genotype of Figure 24(a) with an avg_pool. Figure 24(a) depicts the result of the second mutation operation, which inserts a smaller sub-graph #S into a randomly selected edge. The first crossover operation considers the longest simple sub-paths and swaps them between both parent genotypes. Bridge-crossover searches for bridges in both parent genotypes and swaps the succeeding child graphs after both found bridges. A third crossover operation searches for matching layers with feature maps; the number of feature maps within all vertices of both genotypes for matching layers is averaged.
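As an illustration of the first mutation, a sketch that remaps a vertex operation on a NetworkX genotype; the operation vocabulary and the attribute name are assumptions:

    import random
    import networkx as nx

    OPERATIONS = ["conv", "max_pool", "avg_pool", "sum", "concat", "skip"]  # illustrative vocabulary

    def mutate_remap_operation(genotype: nx.DiGraph, rng=random) -> nx.DiGraph:
        # pick a random vertex and replace its operation label with a different one
        mutant = genotype.copy()
        v = rng.choice(list(mutant.nodes))
        current = mutant.nodes[v].get("operation", "conv")
        mutant.nodes[v]["operation"] = rng.choice([op for op in OPERATIONS if op != current])
        return mutant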

We use a minimum depth of 6 and a maximum of 12 for all DAGs. Due to our final choice of mutation operation that increases the depth size with every generation, we set the minimum and maximum depths for the random search to 10 and 36. The hyperparameter search uses a population size of 30, a mutation probability of 0.5, a crossover probability of 0, a probability of removing the worst candidates of 0.1, and an architectural estimation fraction of 0.5.

Architectural Performance Prediction

The experiment is conducted on cifar10 krizhevsky2009learning, and the meta-dataset to investigate architectural performance prediction consists of 56 features with a total of 2,472 data points; we split it into 70% training and 30% testing. The resulting meta-dataset constitutes a new supervised learning task containing graph-based features and the estimated performance of each candidate evaluated on cifar10. Three categories, namely layer masses, path lengths, and remaining graph properties, make up the meta-dataset.

The layer mass is the number of channels of the current vertex times the sum of all channels of preceding vertices with an incoming connection to the current vertex. The average and standard deviation of this layer mass are used as graph-based features in the meta-dataset. For path lengths, we consider the shortest, longest, and expected random walk length from a source to a target vertex. Again, we take the average and standard deviation of these properties over all vertices in a graph and obtain 36 features over all possible vertex operations. Further, we include the depth, edge betweenness, eccentricity, closeness, average neighborhood degree, and vertex degree as graph properties in the meta-dataset features.
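A sketch of the layer-mass feature on a NetworkX DAG whose vertices carry a (hypothetical) channels attribute:

    import statistics
    import networkx as nx

    def layer_masses(dag: nx.DiGraph) -> list:
        # channels(v) times the sum of channels of all direct predecessors of v
        masses = []
        for v in dag.nodes:
            incoming = sum(dag.nodes[u]["channels"] for u in dag.predecessors(v))
            masses.append(dag.nodes[v]["channels"] * incoming)
        return masses

    dag = nx.DiGraph([(0, 1), (0, 2), (1, 3), (2, 3)])  # toy architecture graph
    nx.set_node_attributes(dag, {0: 3, 1: 32, 2: 64, 3: 128}, name="channels")
    masses = layer_masses(dag)
    features = (statistics.mean(masses), statistics.pstdev(masses))  # mean and std as meta-features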

Six of the ten most important features relate to standard deviations of path lengths regarding convolutional blocks or pooling layers. Combining this result with the correlations between features and the maximum accuracy, which are positive for all these six features, it seems that an even distribution of pooling and convolution layers benefits the performance of an architecture. This assumption is further backed by the observation that handcrafted models like DenseNet use pooling vertices to connect architectural cells. Mean centrality is also included in the ten most important features and shows a negative correlation to the maximum accuracy. Centrality is an inverted measurement of the farness of one vertex to other vertices in the network. Thus, we interpret these findings as evidence that deeper architectures perform on average better than shallower ones and that varying path lengths might support a similar effect as model ensembles. Compare Table 1 and Figure 24(b) for correlations between graph properties and accuracy estimations across different experiments.

5 Discussion & Conclusion & Future Work

We presented experimental results on different methods for optimizing structures of neural networks: pruning, neural architecture search, and prior initialization. These methods unite under joint questions of how structure influences the performance of neural networks given data. Structural performance prediction is an emerging method to exploit this fact to speed up search procedures or to control bias towards desirable properties such as low memory or energy consumption or computational speed for specialized hardware.

We compared five pruning techniques for feed-forward networks, namely random pruning, magnitude class blinded, magnitude class uniform, magnitude class distributed, and Optimal Brain Damage. Out of these five, random pruning immediately showed an accuracy drop, while the other four performed consistently near the original performance for compression rates of over 60%. In the end, magnitude class blinded outperformed the remaining four.

While applying pruning to recurrent networks, we found models to perform consistently well for up to 60% pruning. All the pruned models regain the original performance in just one to two epochs of re-training. This means these recurrent networks can achieve the same results as a dense recurrent network with almost 60% fewer weights. Contrary to our expectations, LSTM and GRU recovered even after 100% pruning of hidden-to-hidden weights. In LSTM, this might be due to a separate cell state that acts as long-term memory.

Our experiment with random structural priors for recurrent networks aimed to find essential graph properties and use them for performance prediction. Similar to the results of stier2019structural, three of the essential features are the number of edges, vertices, and source nodes. Although the construction and properties of Watts–Strogatz and Barabási–Albert random graphs are different, recurrent networks based on these two performed equally well with RNN_Tanh and RNN_ReLU. Barabási–Albert-based recurrent networks perform better than Watts–Strogatz-based ones with LSTM and GRU.

Correlation analyses between structural properties and the performance of an untrained network reveal themselves to be difficult – after all, the mere structure is, at first sight, and ignoring the structural dependencies on the input feature space, independent of the data. All the more promising is it if such a relationship between the structure of models and application domains can be found. The idea that structures contain relevant information implies that architectural priors or search strategies over architectures can be heavily biased and influenced. To what extent this bias takes shape is difficult to understand, and we hope to foster more research towards its impact.

Paul Häusner and Jerome Würf contributed to this work during their research at the University of Passau, and we thank them for their contributions and valuable discussions.
