(Artificial) deep neural networks are nowadays very popular algorithms that aim to imitate processes inside the human brain. They are trained by examples and have proven very effective for pattern recognition problems of various kinds. Despite an avalanche of applications, much remains unknown about neural networks from the mathematical point of view: much is done by guessing and by reusing whatever worked well in similar problems.
One important open problem is determining the best architecture for a neural network. For a layered network this means determining the number of neurons and the number of layers. Among the main approaches to automatic architecture search are the following (see also the surveys).
Empirical/statistical methods, which choose the weights according to the effect they have on the model's performance.
Pruning methods, which start with a larger-than-necessary multilayer network and then remove neurons that contribute little to the solution. There are several ways to decide which neurons are not needed, see e.g. [18, 16, 6, 11, 5]. Pruning does not increase the fault tolerance of the system. A known disadvantage is that one usually does not know a priori how large the original network should be; also, starting with a large network can make trimming the unnecessary units excessively costly.
Constructive methods, which start with a small initial network and then incrementally add new hidden neurons and/or hidden layers until some prescribed error requirement is reached or no performance improvement can be observed; see e.g. the surveys (on adding a neuron, adding a layer) and the papers [14, 1, 22, 17, 19, 9, 8, 21]. A known disadvantage of these methods is that the sizes of the obtained multilayer networks are reasonable but rarely "optimal".
Our growing architecture algorithm combines ideas of both pruning and constructive algorithms. We also extend our domain to a more general one that includes layered networks as a particular case: for us an architecture is an arbitrary oriented graph with some weights (along with some biases and an activation function), so there may be no layered structure in such a network. We compare our optimized network with a large number of networks with standard architectures and show that for the same error we can have a significantly smaller complexity.
In recent years, we have seen the ever-increasing efficiency of neural networks. At the same time, their complexity is growing. Here we measure the complexity of a neural network by the number of weights, the values of which are selected in the learning process. Those who do not work directly with neural networks usually expect the complexity of a network to be tens, hundreds, at most thousands. In reality, the complexity of modern neural networks is much higher. Thus, for the standard MNIST handwritten digit classification problem, the number of learnable parameters in the best networks is in the hundreds of thousands and millions, while there are only 60,000 training examples (small square 28×28 pixel grayscale images of handwritten single digits between 0 and 9).
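The notion of complexity used here can be made concrete for the familiar layered case. The sketch below (our own illustration, not code from the paper) counts the learnable parameters, weights and biases, of a fully connected layered network from its layer sizes; already a modest 784–300–10 MNIST-style network has 238,510 of them.

```cpp
#include <cstddef>
#include <vector>

// Count the learnable parameters (weights and biases) of a fully
// connected layered network, given its layer sizes with the input
// layer first: each neuron of layer i+1 has sizes[i] incoming
// weights plus one bias.
std::size_t countParameters(const std::vector<std::size_t>& sizes) {
    std::size_t total = 0;
    for (std::size_t i = 0; i + 1 < sizes.size(); ++i)
        total += (sizes[i] + 1) * sizes[i + 1];
    return total;
}
```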
Minimization of the network complexity is the goal of our work.
Most of our computations are realized in C++ instead of a conventional package (e.g. TensorFlow/Python). This is because layered architectures are the main objective of such specialized packages, and dealing with non-layered ones presents difficulties that outweigh the packages' conveniences.
The structure of the paper is as follows. In section 2, we introduce our notation. In section 3, we describe our "architecture growing" algorithm. In section 4, we present the results of our experiments with the brightness prediction problem and compare our algorithm with polynomial regression and standard neural networks. In section 5, we make a similar comparison on the image approximation problem. In section 6, we summarize our results.
2. Preliminaries. Architecture and complexity
Let $f(x, w)$ be a function that represents a feedforward neural network, where $x$ is an input vector and $w$ is a vector of learnable parameters (weights). For regression tasks we minimize the target function
$$E(w) = \sum_{i} \big(f(x_i, w) - y_i\big)^2, \qquad (1)$$
where $(x_i, y_i)$ runs over the training examples.
We do not consider convolutional neural networks: all training vectors have a fixed length. The training dataset is represented in the form of a matrix, each line of which first contains the value $y_i$ and then the coordinates of the vector $x_i$.
Fully connected networks are layered networks in which each neuron receives the values of all neurons from the previous layer. Our algorithm allows networks of an even more general type, where every neuron receives all input values and also the values of all previous neurons. We shall call such networks maximally fully connected. By setting appropriate weights to zero, one can realize layered fully connected and not fully connected networks as special cases of such a general network. Each neuron is a function of the form
$$\sigma(w_0 + w_1 u_1 + \dots + w_k u_k),$$
where $u_1, \dots, u_k$ are the neuron's input arguments, $w_1, \dots, w_k$ are the corresponding weights ($w_0$ is the bias), and $\sigma$ is some activation function. In this paper we consider some of the most common ones; see details in section 4.3.
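A maximally fully connected network can be evaluated by a single pass over the neurons in their fixed order. The sketch below is our own illustration under assumptions not fixed by the text: tanh as the activation and the last neuron's value taken as the network output.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Forward pass of a "maximally fully connected" network:
// neuron j sees all inputs x[0..n-1] and the values of all
// previous neurons v[0..j-1]. weights[j] holds the bias first,
// then one weight per available argument. The last neuron's
// value is taken as the network output (an assumption).
double forward(const std::vector<double>& x,
               const std::vector<std::vector<double>>& weights) {
    std::vector<double> v;              // values of already computed neurons
    for (const auto& wj : weights) {
        double s = wj[0];               // bias
        std::size_t k = 1;
        for (double xi : x) s += wj[k++] * xi;  // all inputs
        for (double vi : v) s += wj[k++] * vi;  // all previous neurons
        v.push_back(std::tanh(s));      // assumed activation
    }
    return v.back();
}
```

Neuron $j$ (0-based) thus has $1 + n + j$ weights, which is exactly how the complexity of such a network grows when a neuron is added.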
Hardware specification: Intel(R) Pentium(R) CPU G4500 @ 3.50GHz with 32 GB memory, and Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz with 16 GB memory.
Software specification: Visual Studio 2019 Community (C++), Python 3.9.5, PyCharm 2019 Community Edition, Numpy 1.19.5, TensorFlow 2.5.0.
3. Architecture growing algorithm
The rough idea is to remove redundant connections from the neural network while possible, and then to add a new neuron to its beginning, with training runs in between. Then again remove some connections while possible, and when it is not possible, add a neuron, and so on. We first describe the elements of the algorithm and then put it all together at the end of the section.
3.1. Connection removal procedure
Find three connections with the least (w.r.t. their absolute value) weights.
Create three different networks by removing one of these three connections in each case. In every case start the learning process for the resulting network to minimize (1) within a specified training time.
Choose from the three reduced optimized networks the one with the smallest error and optimize it within a further specified training time.
While removing a connection, it may turn out that this connection is the only input connection of some neuron, i.e. the neuron works by the formula
$$\sigma(w_0 + w_1 u),$$
where $u$ is its single input. In this case, we remove such a neuron and approximate its action by a linear function
$$a u + b,$$
where $a$ and $b$ are chosen to minimize the deviation of $\sigma(w_0 + w_1 u)$ from $a u + b$ on $[m, M]$, where $[m, M]$ is the interval of values taken by the parameter $u$ over the whole training matrix.
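As an illustration of this replacement, the sketch below (our own, not the paper's code) fits $a u + b$ to $\sigma(w_0 + w_1 u)$ over $[m, M]$ by sampling the interval and solving the least-squares normal equations; the paper minimizes the deviation over the interval, for which least squares is only a simple stand-in.

```cpp
#include <cmath>
#include <utility>

// Fit a*u + b to sigma(w0 + w1*u) over [m, M] by sampling N points
// and solving the least-squares normal equations. The paper minimizes
// the deviation over the interval; least squares is a simple stand-in.
std::pair<double, double> linearizeNeuron(double w0, double w1,
                                          double m, double M,
                                          double (*sigma)(double),
                                          int N = 100) {
    double su = 0, suu = 0, sy = 0, suy = 0;
    for (int i = 0; i < N; ++i) {
        double u = m + (M - m) * i / (N - 1);
        double y = sigma(w0 + w1 * u);
        su += u; suu += u * u; sy += y; suy += u * y;
    }
    double a = (N * suy - su * sy) / (N * suu - su * su);
    double b = (sy - a * su) / N;
    return {a, b};
}
```

For example, tanh near 0 is almost linear with slope 1, so linearizing such a neuron over a small interval around 0 recovers $a \approx 1$, $b \approx 0$.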
While removing a connection, it may turn out that the value of some neuron does not participate anymore in further calculations. We remove such a neuron.
Repeat the "one connection removal procedure" while the error increases by no more than $(1+\varepsilon)$ times, where $\varepsilon$ is small enough; a specific small value of $\varepsilon$ was fixed in our experiments.
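The selection of removal candidates can be sketched as follows (our own illustration; the network's connections are flattened into a single weight vector):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Indices of the k connections with the smallest |weight| --
// the removal candidates of the connection removal step.
std::vector<std::size_t> removalCandidates(const std::vector<double>& w,
                                           std::size_t k = 3) {
    std::vector<std::size_t> idx(w.size());
    for (std::size_t i = 0; i < idx.size(); ++i) idx[i] = i;
    std::partial_sort(idx.begin(), idx.begin() + std::min(k, idx.size()),
                      idx.end(),
                      [&](std::size_t a, std::size_t b) {
                          return std::fabs(w[a]) < std::fabs(w[b]);
                      });
    idx.resize(std::min(k, idx.size()));
    return idx;
}
```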
One might expect that by removing connections in a neural network we increase its error. However, as we discovered, this is frequently not the case if the original network was not trained well enough: in such cases the quality of the "reduced" network can be much improved by further optimization. Our experiments showed that in the beginning, when connections start to be removed, the error of the network almost does not grow, until we reach a "saturation" point where any attempt to remove a further connection results in a noticeable increase of the error. The boundary value was chosen based on the results of these experiments.
3.2. Adding a neuron procedure
In the course of the algorithm, we run the connection removal procedure until the error increase is too large. After that we do the following.
Add one extra neuron to the very beginning of the neural network and connect it with all input parameters and with all other neurons of the original network.
Set all the weights of all new connections equal to 0. Thus, the computations in this new neural network go exactly by the same algorithm as in the original one, and the error remains the same.
Retrain the new network within the specified training time. The error of the network decreases.
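The invariance of the output under this zero-weight insertion is easy to check on a toy maximally fully connected network; the sketch below is our own illustration, with tanh as an assumed activation.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Network as a list of neurons; net[j] = {bias, then one weight per
// input and per previous neuron}. eval() returns the last neuron's value.
using Net = std::vector<std::vector<double>>;

double eval(const std::vector<double>& x, const Net& net) {
    std::vector<double> v;
    for (const auto& wj : net) {
        double s = wj[0];
        std::size_t k = 1;
        for (double xi : x) s += wj[k++] * xi;
        for (double vi : v) s += wj[k++] * vi;
        v.push_back(std::tanh(s));
    }
    return v.back();
}

// Add a neuron at the very beginning, connected to all inputs and to
// all other neurons, with every new weight set to 0: the new neuron's
// value is multiplied by 0 everywhere, so the output is unchanged.
Net addNeuron(const Net& net, std::size_t nInputs) {
    Net grown;
    grown.push_back(std::vector<double>(1 + nInputs, 0.0)); // new first neuron
    for (const auto& wj : net) {
        std::vector<double> w = wj;
        // the new neuron appears right after the inputs in the argument list
        w.insert(w.begin() + 1 + nInputs, 0.0);
        grown.push_back(w);
    }
    return grown;
}
```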
3.3. Architecture growing algorithm
We start with an arbitrary architecture and then execute the following procedure.
In fig. 1, the starting point of the algorithm is represented by the blue point on the right. Removing connections, we move from right to left: the complexity decreases while the error slightly increases. At some moment, removing a connection leads to a sudden jump in error (the almost vertical arrow). Once this happens we stop removing connections and add one more neuron. From the new starting point we continue the process of removing connections, moving again from right to left until we reach the next jump in error, and then the process repeats.
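Putting the pieces together, the alternation can be sketched as a loop over abstract callbacks (our own outline; the training steps, acceptance logic and the value of eps are placeholders, since the paper's specific time limits and threshold are not given here):

```cpp
#include <functional>

// High-level loop of the architecture growing algorithm, with the
// network manipulations abstracted as callbacks. A trial removal is
// kept while the error grows by no more than a (1+eps) factor;
// otherwise it is undone and a neuron is added instead.
struct GrowingLoop {
    std::function<double()> tryRemoveConnection;  // trial removal + retraining
    std::function<void()>   acceptRemoval;        // commit the reduced network
    std::function<void()>   rejectRemoval;        // undo the trial removal
    std::function<double()> addNeuronAndRetrain;  // grow, retrain, new error
    double eps = 0.01;                            // placeholder threshold

    double run(double error, int steps) {
        for (int s = 0; s < steps; ++s) {
            double e = tryRemoveConnection();
            if (e <= error * (1 + eps)) {
                acceptRemoval();
                error = e;
            } else {
                rejectRemoval();
                error = addNeuronAndRetrain();
            }
        }
        return error;
    }
};
```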
4. Comparison with other approaches on brightness prediction problem
To illustrate our idea, consider the brightness prediction problem: predict the brightness of an image point knowing the brightness of several previous points.
4.1. Neural network built using our approach
The data is kept in a table whose rows and columns correspond to the $x$ and $y$ coordinates of a point and whose values are the brightness numbers (0 to 255 for standard 8-bit grayscale).
The previous points are ordered as shown in table 1. The current point is marked there, and the first column and the first row give the coordinates of the points relative to it. The other numbers in the table indicate the order in which the points are considered.
To normalize the numbers, we subtract from all values the brightness of the previous (left) point and divide all values by the maximal brightness. So, building the forecast by $k$ points, there will be only $k-1$ input parameters in the neural network. For the experiments we choose a grayscale graphics file. For a given number of previous points, the resulting training file looks as in table 2.
4.2. For comparison: linear and polynomial approximation
As a starting point, we compare the performance of our optimized networks with the results obtained using linear/polynomial regressions:
$$f(x) = P_d(x, w),$$
where $P_d$ is a polynomial in the coordinates of $x$ of degree $d$, and $w$ is the vector of its coefficients (weights). Table 3 contains the mean square errors (over all points of the image) of such approximations with polynomials of degrees 1, 2, and 3.
| number of points | error (deg. 1) | complexity | error (deg. 2) | complexity | error (deg. 3) | complexity |
Here the number of points used for the approximation is larger by one than the number of input parameters of the network. The complexity of the approximation is the number of its parameters, which are the coefficients of the corresponding polynomial; the errors are given in table 3.
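The complexity of such an approximation, i.e. the number of coefficients of a polynomial of degree $d$ in $n$ variables, is the binomial coefficient $\binom{n+d}{d}$; a small sketch:

```cpp
#include <cstdint>

// Number of coefficients of a polynomial of degree d in n variables:
// the binomial coefficient C(n + d, d), computed with exact integer
// arithmetic (each intermediate product is divisible by i).
std::uint64_t polyParams(unsigned n, unsigned d) {
    std::uint64_t r = 1;
    for (unsigned i = 1; i <= d; ++i)
        r = r * (n + i) / i;
    return r;
}
```

For example, a cubic polynomial in 9 variables has 220 coefficients.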
To see the error in the original units and before normalization, one takes the values of the errors from table 3 and undoes the normalization.
4.3. For comparison: neural networks with some fixed architectures
Here we compare our optimized networks with networks having some standard architectures. Specifically, we consider a large number of 3-layered networks and of maximally fully connected networks, with various numbers of neurons in the hidden layers.
The choice of the activation function is a part of the architecture. We consider the most popular ones and, through many computations, choose the one that is the most efficient. The considered activation functions are shown in fig. 3 (their graphs).
All but the first three functions are odd. One of the functions was found to be the most effective.
Note that since the minimization process of the neural network is stochastic, repeating the experiment several times gives different results. Thus, for the same architecture, we repeat the experiment several times and choose the best result. Fig. 4 shows the performance of standard networks for a fixed number of input points.
The envelope from below (the piecewise linear line) in fig. 4 interpolates our data and allows us to estimate the smallest expected error for a given complexity or, vice versa, the lowest complexity for a given error. We use it to compare the best achievable results for networks of different complexities. This approach makes sense in particular because the results are of a stochastic nature.
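The lower envelope can be computed by a single sweep over the experimental points (our own sketch):

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// From a cloud of (complexity, error) results, keep the points of the
// lower envelope: sorted by complexity, a point survives only if its
// error is smaller than every error seen at lower complexity. Linear
// interpolation between the survivors approximates the smallest
// expected error for a given complexity.
std::vector<std::pair<double, double>>
lowerEnvelope(std::vector<std::pair<double, double>> pts) {
    std::sort(pts.begin(), pts.end());
    std::vector<std::pair<double, double>> env;
    for (const auto& p : pts)
        if (env.empty() || p.second < env.back().second)
            env.push_back(p);
    return env;
}
```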
We obtained similar results through a large number of experiments with various numbers of input points for 3-layer and maximally fully connected networks.
4.4. Our algorithm
On the same training matrices, we optimize the architectures using our algorithm.
Table 4 allows us to compare the lowest achievable complexities for our networks and the standard ones. For example, in the last line of the table, among our experiments there is an optimized-architecture network whose complexity and error are listed; the best complexity of the standard networks with the same error is then obtained from the envelope shown in fig. 4. We see that, for the same error, the complexity of the optimized network is significantly less than that of the standard networks.
| number of points | error | best complexity of standard networks for this error | complexity of optimized networks |
A different type of comparison is given in table 5. For a given number of points, it shows the best quality (i.e. the smallest error) achieved by standard networks and by our optimized networks.
| number of points | standard networks: smallest achieved error | corresponding complexity | our optimized networks: smallest achieved error | corresponding complexity |
The network architectures constructed by our algorithm can be represented as graphs: neurons correspond to vertices, and connections between neurons correspond to edges. For better visualization, the input values and the edges outgoing from them are not shown. Below are a few of the obtained graphs in "circular" form, where the vertices of the graph are located at the vertices of a regular polygon.
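The circular placement itself is straightforward (our own sketch):

```cpp
#include <cmath>
#include <utility>
#include <vector>

// Place n graph vertices at the corners of a regular n-gon of radius r
// centred at the origin -- the "circular" drawing used for the grown
// architectures.
std::vector<std::pair<double, double>> circularLayout(int n, double r = 1.0) {
    const double pi = std::acos(-1.0);
    std::vector<std::pair<double, double>> pos;
    for (int i = 0; i < n; ++i) {
        double a = 2 * pi * i / n;
        pos.push_back({r * std::cos(a), r * std::sin(a)});
    }
    return pos;
}
```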
5. Comparison with other approaches on image approximation problem
Here we look into approximating a black and white graphics file by a function of two input variables (the coordinates of a point). We consider 3-layered architectures with two inputs, some number of neurons in the first layer, some number of neurons in the second layer, and one output neuron.
We run the learning process in TensorFlow/Keras and then optimize these networks using our approach. Fig. 10 shows the graph of one of the networks optimized by us.
A comparison of the described networks can be seen in table 6. To compare complexities we use the same approach as in section 4, i.e. we construct an envelope from below to interpolate the discrete data obtained from the computations, see fig. 9. One can see that, for the same error, our optimized networks offer significantly smaller complexities.
| given error | best complexity of standard architectures | best complexity of our optimized networks |
We propose a new kind of automatic architecture search algorithm. The algorithm alternates pruning connections and adding neurons, and it does not restrict itself to layered networks only. Instead, we search for an architecture among arbitrary oriented graphs with weights (along with biases and an activation function), so there may be no layered structure in the resulting network. The goal is to minimize the complexity while staying within a given error.
We begin with any standard architecture and create our optimized one by pruning and letting it grow, pruning and again letting it grow, and so on.
For large networks, where the number of connections counts in the hundreds, the complexity of the optimized network is several times smaller than that of a standard network with the same error. For small networks, where the number of connections counts in the tens, the complexity also decreases. Here by standard networks we mean the best results obtained with 3-layer and maximally fully connected networks.
The algorithm can be sped up, for example, by not considering every connection while pruning.
-  T. Ash. Dynamic node creation in backpropagation networks. Connection Science, 1(4):365–375, 1989.
-  Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V. Le. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8697–8710, 2018.
-  P.G. Benardos and G.-C. Vosniakos. Optimizing feedforward artificial neural network architecture. Engineering Applications of Artificial Intelligence, 20(3):365–382, 2007.
-  P.G. Benardos and G.-C. Vosniakos. Prediction of surface roughness in CNC face milling using neural networks and Taguchi's design of experiments. Robotics and Computer-Integrated Manufacturing, 18(5-6):343–354, 2002.
-  Giovanna Castellano, Anna Maria Fanelli, and Marcello Pelillo. An iterative pruning algorithm for feedforward neural networks. IEEE Transactions on Neural Networks, 8(3):519–531, 1997.
-  E.D. Karnin. A simple procedure for pruning back-propagation trained neural networks. IEEE Transactions on Neural Networks, 1(2):239–242, 1990.
-  J. Koza and J. F. Rice. Genetic generation of both the weights and architecture for a neural network. In IJCNN-91-Seattle International Joint Conference on Neural Networks, volume 2, pages 397–404, 1991.
-  Tin-Yau Kwok and Dit-Yan Yeung. Objective functions for training new hidden units in constructive neural networks. IEEE Transactions on Neural Networks, 8(5):1131–1148, 1997.
-  Tin-Yau Kwok and Dit-Yan Yeung. Bayesian regularization in constructive neural networks. In International Conference on Artificial Neural Networks, pages 557–562. Springer, 1996.
-  Tin-Yau Kwok and Dit-Yan Yeung. Constructive algorithms for structure learning in feedforward neural networks for regression problems. IEEE Transactions on Neural Networks, 8(3):630–645, 1997.
-  Yann LeCun, John S Denker, and Sara A Solla. Optimal brain damage. In Advances in neural information processing systems, pages 598–605, 1990.
-  Tsu-Chang Lee. Structure level adaptation for artificial neural networks, volume 133. Springer Science & Business Media, 2012.
-  M. Wistuba, A. Rawat, and T. Pedapati. A survey on neural architecture search. arXiv:1905.01392 [cs.LG], 2019.
-  L. Ma and K. Khorasani. A new strategy for adaptively constructing multilayer feedforward neural networks. Neurocomputing, 51:361–385, 2003.
-  R. Miikkulainen, J. Liang, E. Meyerson, A. Rawal, D. Fink, O. Francon, B. Raju, H. Shahrzad, A. Navruzyan, N. Duffy, and B. Hodjat. Ch.15 - evolving deep neural networks. In Artificial Intelligence in the Age of Neural Networks and Brain Computing, pages 293–312. 2019.
-  Michael C Mozer and Paul Smolensky. Skeletonization: A technique for trimming the fat from a network via relevance assessment. In Advances in neural information processing systems, pages 107–115, 1989.
-  Lutz Prechelt. Investigation of the cascor family of learning algorithms. Neural Networks, 10(5):885–896, 1997.
-  Russell Reed. Pruning algorithms – a survey. IEEE Transactions on Neural Networks, 4(5):740–747, 1993.
-  S.E. Fahlman and C. Lebiere. The cascade-correlation learning architecture. Advances in Neural Information Processing Systems, 2, 1990.
-  Bruce E Segee and Michael J Carter. Fault tolerance of pruned multilayer networks. In IJCNN-91-Seattle International Joint Conference on Neural Networks, volume 2, pages 447–452. IEEE, 1991.
-  A.E. Shaw, D. Hunter, F.N. Iandola, and S. Sidhu. Squeezenas: Fast neural architecture search for faster semantic segmentation. 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), pages 2014–2024, 2019.
-  Wei Weng and Khashayar Khorasani. An adaptive structure neural networks with application to eeg automatic seizure detection. Neural Networks, 9(7):1223–1240, 1996.
-  Bichen Wu, Xiaoliang Dai, Peizhao Zhang, Yanghan Wang, Fei Sun, Yiming Wu, Yuandong Tian, Peter Vajda, Yangqing Jia, and Kurt Keutzer. Fbnet: Hardware-aware efficient convnet design via differentiable neural architecture search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10734–10742, 2019.