1. Introduction
(Artificial) deep neural networks are nowadays very popular algorithms inspired by the processes inside a human brain. They learn from examples and have proven very effective for pattern recognition problems of different kinds. Despite an avalanche of applications, much remains unknown about neural networks from the mathematical point of view. Much is done by guessing and by reusing whatever worked well in similar problems.
One important open problem is determining the best architecture for a neural network. For a layered network this means determining the number of neurons and the number of layers. The main approaches to automatic architecture search are the following; see also the survey [13].

Empirical/statistical methods that choose the weights according to the effect they have on the model’s performance, see, e.g., [4].

Evolutionary algorithms that start by selecting parent networks, then proceed with recombination and mutation, selecting the best resulting networks. The algorithms then repeat, assigning the best networks as new parents. See, e.g., [3, 7, 15].

Pruning methods that start with a larger than necessary multilayer network and then remove neurons that contribute little to the solution. There are several ways to decide which neurons are not needed, see, e.g., [18, 16, 6, 11, 5]. Pruning does not increase the fault tolerance of the system [20]. A known disadvantage is that one usually does not know a priori how large the original network should be. Also, starting with a large network makes trimming the unnecessary units excessively costly.

Constructive methods that start with an initial network of a small size and incrementally add new hidden neurons and/or hidden layers until a prescribed error requirement is reached or no performance improvement is observed; see, e.g., the surveys [10] (add a neuron, add a layer) and [12], and the papers [14, 1, 22, 17, 19, 9, 8, 21]. A known disadvantage of these methods is that the sizes of the obtained multilayer networks are reasonable but rarely “optimal”.
Our growing-architecture algorithm combines ideas of both pruning and constructive algorithms. We also extend our domain to a more general one (which includes layered networks as a particular case). For us an architecture is an arbitrary oriented graph with some weights (along with some biases and an activation function), so there may be no layered structure in such a network. We compare our optimized networks with a large number of networks with standard architectures and show that for the same error we can have a significantly smaller complexity.
In recent years, we have seen the ever-increasing efficiency of neural networks. At the same time, their complexity is growing. Here we measure the complexity of a neural network by the number of weights whose values are selected in the learning process. Those who do not work directly with neural networks usually expect the complexity of a network to be tens, hundreds, at most thousands. In reality, the complexity of modern neural networks is much higher. Thus, for the standard MNIST handwritten digit classification problem, the number of learnable parameters in the best networks is in the hundreds of thousands and millions, while there are only 60,000 training examples (small square 28×28 pixel grayscale images of handwritten single digits between 0 and 9). Minimization of the network complexity is the goal of our work.
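To make the scale of such complexities concrete, here is a small sketch that counts the learnable parameters (weights and biases) of a fully connected layered network; the 784-128-10 layer sizes are a hypothetical but typical MNIST classifier, not taken from any particular cited work.

```python
# Count learnable parameters (weights + biases) of a fully connected
# layered network; layer_sizes lists neurons per layer, inputs first.
def complexity(layer_sizes):
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix + bias vector
    return total

# A modest MNIST classifier: 28*28 = 784 inputs, one hidden layer of
# 128 neurons, 10 outputs -- already over a hundred thousand weights.
print(complexity([784, 128, 10]))  # 101770
```

Even this small hypothetical network exceeds the number of training examples in its parameter count only a few hidden layers later, which illustrates why complexity minimization matters.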
Most of our computations are implemented in C++ rather than in a conventional package (e.g. TensorFlow/Python). This is because layered architectures are the main objective of such specialized packages, and dealing with non-layered ones presents difficulties that outweigh their conveniences.
The structure of the paper is as follows. In section 2, we introduce our notation. In section 3, we describe our “architecture growing” algorithm. In section 4, we present the results of our experiments with the brightness prediction problem and compare our algorithm with polynomial regression and standard neural networks. In section 5, we make a similar comparison on the image approximation problem. In section 6, we summarize our results.
2. Preliminaries. Architecture and complexity
Let $F(x, w)$ be a function that represents a feedforward neural network, where $x \in \mathbb{R}^n$ is an input vector and $w \in \mathbb{R}^m$ is a vector of learnable parameters (weights). For regression tasks, given training pairs $(x_i, y_i)$, $i = 1, \dots, N$, we minimize the target function

(1)   $E(w) = \frac{1}{N} \sum_{i=1}^{N} \bigl( F(x_i, w) - y_i \bigr)^2.$
We do not consider convolutional neural networks: all training vectors have a fixed length. The training dataset is represented in the form of a matrix, each row of which first contains the value $y_i$ and then the coordinates of the vector $x_i$.
Fully connected networks are layered networks where each neuron receives the values of all neurons from the previous layer. Our algorithm allows networks of an even more general type, where every neuron receives all input values and also the values of all previous neurons. We shall call such networks maximally fully connected. By setting some weights to zero, one can obtain layered fully and not fully connected networks as special cases of a network of this general type. Each neuron is a function of its inputs of the form
$y = \sigma\bigl( b + \sum_i w_i x_i \bigr),$
where $x_i$ are the neuron’s input arguments, $w_i$ are the corresponding weights, $b$ is a bias, and $\sigma$ is some activation function. In this paper we consider some of the most common ones; see details in section 4.3.
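A forward pass through a maximally fully connected network can be sketched as follows; the `forward` helper and its argument layout are illustrative names, and `tanh` stands in for the unspecified activation $\sigma$ (the paper compares several in section 4.3).

```python
import math

# Forward pass of a "maximally fully connected" network: neuron i sees
# all n inputs and the outputs of all previous neurons.  weights[i]
# holds the n + i input weights of neuron i, biases[i] its bias.  The
# last neuron's value is taken as the network output.
def forward(x, weights, biases, sigma=math.tanh):
    values = list(x)
    for w_i, b_i in zip(weights, biases):
        z = b_i + sum(w * v for w, v in zip(w_i, values))
        values.append(sigma(z))
    return values[-1]
```

Any layered network arises as a special case by zeroing the weights of the connections that skip layers.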
Hardware specification: Intel(R) Pentium(R) CPU G4500 @ 3.50 GHz, 32 GB memory, and Intel(R) Core(TM) i7-8565U CPU @ 1.80 GHz, 16 GB memory.
Software specification: Visual Studio 2019 Community (C++), Python 3.9.5, PyCharm 2019 Community Edition, NumPy 1.19.5, TensorFlow 2.5.0.
3. Architecture growing algorithm
The rough idea is to remove redundant connections from the neural network while possible, and then to add a new neuron at its beginning, running training processes in between. Then we again remove connections while possible, and when it is no longer possible, add a neuron, and so on. We first describe the elements of the algorithm and then put it all together at the end of the section.
3.1. Connection removal procedure

Find the three connections with the least (in absolute value) weights.

Create three different networks by removing one of these three connections in each case. In each case, start the learning process for the resulting network to minimize (1) within a specified training time.

Choose, from the three reduced optimized networks, the one with the smallest error and optimize it further with additional training time.

While removing a connection, it may turn out that this connection is the only input connection of some neuron, i.e. the neuron works by the formula
$y = \sigma(b + w x).$
In this case, we remove such a neuron and approximate its action by the linear function
$\ell(x) = \alpha x + \beta,$
where we choose $\alpha$ and $\beta$ so as to minimize the deviation
$\max_{x \in [x_{\min}, x_{\max}]} \bigl| \sigma(b + w x) - (\alpha x + \beta) \bigr|,$
where $[x_{\min}, x_{\max}]$ is the interval of values taken by the parameter $x$ over the whole training matrix.

While removing a connection, it may also turn out that the value of some neuron no longer participates in further calculations. We remove such a neuron as well.
Repeat the “one connection removal procedure” while the error increases by no more than a factor of $1 + \varepsilon$, where $\varepsilon > 0$ is small enough; the value of $\varepsilon$ used in our experiments was chosen empirically (see Remark 3.1).
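One step of the removal procedure above can be sketched as follows. The network interface (`net.connections`, `net.remove`) and the helpers `train` and `error` are hypothetical placeholders for the actual optimization and evaluation routines; this is a sketch of the control flow, not of the paper's C++ implementation.

```python
import copy

# One step of the connection-removal procedure (section 3.1).
# net.connections is a list of (connection_id, weight) pairs;
# net.remove(conn_id) deletes a connection; train(net, time) and
# error(net) stand in for the optimization and evaluation routines.
def remove_one_connection(net, train, error, t_short, t_long):
    # 1. find the three connections with the smallest |weight|
    candidates = sorted(net.connections, key=lambda c: abs(c[1]))[:3]
    # 2. try removing each candidate, retraining briefly
    trials = []
    for conn_id, _ in candidates:
        trial = copy.deepcopy(net)
        trial.remove(conn_id)
        train(trial, t_short)
        trials.append(trial)
    # 3. keep the reduced network with the smallest error; train it longer
    best = min(trials, key=error)
    train(best, t_long)
    return best
```

The outer loop would call this step repeatedly, stopping once the error grows past the chosen threshold.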
Remark 3.1.
One might expect that by removing connections in a neural network we increase its error. However, as we discovered, this is frequently not the case if the original network was not trained well enough: in such cases the quality of the “reduced” network can be much improved by further optimization. Our experiments showed that in the beginning, when connections start to be removed, the error of the network hardly grows, until we reach a “saturation” point where any attempt to remove a further connection results in a noticeable increase of the error. The threshold was chosen based on the results of these experiments.
3.2. Adding a neuron procedure
In the course of the algorithm, we run the connection removal procedure until the error increase becomes too large. After that we do the following.

Add one extra neuron at the very beginning of the neural network and connect it to all input parameters and to all other neurons of the original network.

Set the weights of all new connections equal to 0. Thus, the computations in this new neural network follow exactly the same algorithm as in the original one, and the error remains the same.

Retrain the new network with additional training time. The error of the network decreases.
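The key invariant of this procedure, that zero-weight connections leave the network function unchanged, can be checked on a toy example. The single original neuron below and the choice of `tanh` as activation are illustrative assumptions.

```python
import math

# Adding a neuron with all new connection weights set to 0 leaves the
# network function unchanged: every existing neuron adds 0 * new_value
# to its weighted sum.  Sketch with one original neuron
# y = tanh(b + w1*x1 + w2*x2).
def original(x1, x2, w1=0.7, w2=-0.3, b=0.1):
    return math.tanh(b + w1 * x1 + w2 * x2)

def grown(x1, x2, w1=0.7, w2=-0.3, b=0.1):
    new = math.tanh(0.0 + 0.0 * x1 + 0.0 * x2)  # new neuron, zero weights
    # the original neuron now also receives `new`, with weight 0
    return math.tanh(b + w1 * x1 + w2 * x2 + 0.0 * new)

print(grown(1.0, 2.0) == original(1.0, 2.0))  # True
```

Subsequent retraining can then move these weights away from zero, so the error can only decrease relative to the pre-growth network.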
3.3. Architecture growing algorithm
We start with an arbitrary architecture and then execute the following procedure.

Remove all redundant connections as described in section 3.1.

If the complexity of the network reaches the preset limit, end the procedure.

Add a neuron as described in section 3.2.

Return to step (1).
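The four steps above can be put together as a simple loop; `prune_until_jump`, `add_neuron`, and `complexity` are placeholders for the procedures of sections 3.1 and 3.2, named here only for the sketch.

```python
# The outer loop of the architecture-growing algorithm (section 3.3).
def grow_architecture(net, prune_until_jump, add_neuron, complexity, limit):
    while True:
        net = prune_until_jump(net)   # step 1: remove redundant connections
        if complexity(net) >= limit:  # step 2: stop at the preset limit
            return net
        net = add_neuron(net)         # step 3: add a neuron, retrain
        # step 4: loop back to pruning
```

Each pass around the loop produces one of the curves visible in the error-versus-complexity graphs discussed next.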
Fig. 1 and fig. 2 show two typical graphs of the dependence of the error on the complexity obtained by executing our algorithm.
On fig. 1, the starting point of the algorithm is represented by the blue point at the beginning of the first curve. Removing connections, we move along the curve from right to left: the complexity decreases while the error slightly increases. At some moment, removing a connection leads to a sudden jump in the error (along the near-vertical arrow). Once this happens, we stop removing connections and add one more neuron. This moves us to the start of the next curve. From there we continue the process of removing connections, moving along that curve, until we reach the next jump in the error, and then the process repeats.
4. Comparison with other approaches on brightness prediction problem
To illustrate our idea, consider the problem of predicting the brightness of an image point from the brightness of several previous points.
4.1. Neural network built using our approach
The data is kept in a table, where the rows and columns correspond to the coordinates of the points and the values are the brightness numbers.
The previous points are ordered as shown in table 1. Here $Y$ is the current point; the first column and the first row give the coordinates of the points relative to $Y$. The other numbers in the table indicate the order in which the points are considered.
       -3   -2   -1    0    1    2    3
 -3              15   13   16
 -2         10    7    5    8   11
 -1    14    6    2    1    3    9   17
  0    12    4    0    Y
To normalize the numbers, we subtract from all values the brightness of the previous (left) point and divide all values by a fixed constant. So, building the forecast by $k$ points, there are only $k-1$ input parameters in the neural network. For the experiments we chose a single grayscale graphics file. The resulting training file looks as in table 2.
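The construction of one normalized training row can be sketched as follows; the divisor 255 (8-bit brightness) is an assumption of this sketch, since the exact normalization constant is not restated here.

```python
# Build one normalized training row for the brightness-prediction task:
# subtract the brightness of the previous (left) point from every value,
# then rescale.  The divisor 255 (8-bit brightness) is an assumption.
# After subtraction the previous point itself becomes 0, so k context
# points yield only k-1 network inputs.
def make_row(target, context):
    prev = context[0]  # the immediately preceding (left) point
    y = (target - prev) / 255.0
    xs = [(c - prev) / 255.0 for c in context[1:]]
    return [y] + xs  # row format: target value first, then inputs

print(make_row(130, [120, 110, 140]))  # roughly [0.039, -0.039, 0.078]
```

This matches the training-matrix format of section 2, where each row stores the target value first and the input coordinates after it.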
4.2. For comparison: linear and polynomial approximation
As a starting point, we compare the performance of our optimized networks with the results obtained using linear/polynomial regressions:
$F(x, w) = P_d(x),$
where $P_d$ is a polynomial of degree $d$ in the coordinates of $x$, and $w$ is the vector of its coefficients (weights). Table 3 contains the mean square errors (over all points of the image) of such approximations with polynomials of degrees 1, 2, and 3.
                    degree 1               degree 2               degree 3
number of points    Error      Complexity  Error      Complexity  Error      Complexity
3                   0.00466    3           0.00465    6           0.00462    10
4                   0.00417    4           0.00416    10          0.00381    20
5                   0.00402    5           0.00399    15          0.00354    35
6                   0.00366    6           0.00366    21          0.00359    56
8                   0.00359    8           0.00359    36          0.00350    120
10                  0.00336    10          0.00325    55          0.00274    220
12                  0.00333    12          0.00322    78          0.00269    364
18                  0.00331    18          0.00318    171         ?          1140
Here the number of points used for the approximation is larger by one than the number of input parameters of the network, and the complexity is the number of learnable parameters. For example, in the cubic approximation of the brightness of a point by the previous points, the parameters are the coefficients of the corresponding cubic polynomial.
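The complexity column of table 3 is simply the number of monomials of a polynomial of degree $d$ in $k$ variables, i.e. the binomial coefficient $\binom{k+d}{d}$; a quick sketch confirms this against the table values.

```python
from math import comb

# Number of coefficients of a polynomial of degree d in k variables,
# i.e. the "Complexity" entries of table 3 (k = points - 1 inputs).
def poly_complexity(k, d):
    return comb(k + d, d)

# 3 points -> 2 inputs: degrees 1, 2, 3 give 3, 6, 10 parameters;
# 12 points -> 11 inputs, degree 3 gives 364; both match table 3.
print([poly_complexity(2, d) for d in (1, 2, 3)], poly_complexity(11, 3))
```

The same formula reproduces every complexity entry of table 3, including the 1140 parameters of the uncomputed degree-3 fit for 18 points.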
To see the error in the original units, before normalization, one takes a value of the error from table 3 and rescales it back to the original brightness range.
4.3. For comparison: neural networks with some fixed architectures
Here we compare our optimized networks with networks having some standard architectures. Specifically, we consider a large number of 3-layered networks and of maximally fully connected networks, with various numbers of neurons in the hidden layers.
The choice of the activation function is a part of the architecture. We consider the most popular ones and, through many computations, choose the one that is the most efficient. The following activation functions were considered (see their graphs in fig. 3).
All but the first three functions are odd. One of these functions was found to be the most effective.
Note that since the minimization process for a neural network is stochastic, repeating the experiment several times gives different results. Thus, for each architecture, we repeat the experiment several times and choose the best result. Fig. 4 shows the performance results of the standard networks for a fixed number of input points.
The lower envelope (the piecewise linear line) in fig. 4 interpolates our data and allows us to estimate the smallest expected error for a given complexity or, vice versa, the lowest complexity for a given error. We use it to compare the best achievable results for networks of different complexities. This approach makes sense in particular because the results are stochastic in nature.
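The breakpoints of such a lower envelope are just the Pareto-optimal experiment points, those not beaten in both complexity and error by any other point; the sketch below computes them (the data points are made up for illustration).

```python
# Lower envelope of (complexity, error) experiment points: keep only
# the Pareto-optimal ones.  Linear interpolation between them gives a
# piecewise linear curve like the one in fig. 4.
def lower_envelope(points):
    best, min_err = [], float("inf")
    for c, e in sorted(points):  # scan by increasing complexity
        if e < min_err:          # strictly better than anything cheaper
            best.append((c, e))
            min_err = e
    return best

pts = [(20, 0.010), (30, 0.008), (30, 0.009), (50, 0.009), (60, 0.005)]
print(lower_envelope(pts))  # [(20, 0.01), (30, 0.008), (60, 0.005)]
```

Reading off the envelope at a given error then yields the “best complexity of standard networks for this error” used in table 4.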
We obtained similar results through a large number of experiments with various numbers of input points, for both 3-layered and maximally fully connected networks.
4.4. Our algorithm
On the same training matrices, we optimize the architectures using our algorithm.
Table 4 allows us to compare the lowest achievable complexities for our networks and for the standard ones. For example, in the last line: with 11 input points, among our experiments there is an optimized-architecture network with complexity 128; the corresponding error is 0.00251157. The best complexity of a standard network with the same error, 312, is then obtained from the envelope shown in fig. 4. We see that, for the same error, the complexity of the optimized network is significantly less than that of the standard network.
Number      Error        Best complexity of standard     Complexity of
of points                networks for this error         optimized networks
4           0.00358090   23                              21
            0.00345647   29                              22
            0.00329252   61                              50
            0.00324596   114                             100
            0.00323453   179                             119
5           0.00323960   22                              18
            0.00290990   52                              50
            0.00282196   329                             100
6           0.00316931   27                              21
            0.00286380   56                              50
            0.00277786   180                             100
            0.00276770   359                             118
7           0.00268228   200                             103
            0.00266749   300                             116
            0.00266493   374                             123
8           0.00266364   100                             66
            0.00259725   162                             104
10          0.00263869   100                             67
            0.00258670   150                             94
            0.00254739   205                             123
11          0.00265986   100                             52
            0.00256666   200                             92
            0.00251747   300                             121
            0.00251157   312                             128
A different type of comparison is given in table 5. For a given number of points, it shows the best quality achieved (i.e. the minimum of the error (1)) by standard networks and by our optimized networks.
                     standard networks              our optimized networks
number of            smallest         corresponding smallest         corresponding
previous points      achieved error   complexity    achieved error   complexity
5                    0.00323453       180           0.003245946      78
6                    0.00282148       346           0.002768079      415
7                    0.00276767       361           0.002721608      339
8                    0.00266489       376           0.002628534      261
9                    0.00257263       277           0.00259757       101
11                   0.00252952       232           0.002547398      124
12                   0.00251157       313           0.002490456      173
The network architectures constructed by our algorithm can be represented as graphs: neurons correspond to vertices, and connections between neurons correspond to edges. For better visualization, the input values and the edges outgoing from them are not shown. Below are a few of the obtained graphs in “circular” form, where the vertices of the graph are located at the vertices of a regular polygon.
5. Comparison with other approaches on image approximation problem
Here we look into approximating a black and white graphics file by a function of two input variables, the coordinates of a point. We consider 3-layered architectures with two inputs, some number of neurons in the first layer, some number in the second layer, and one output neuron. We run the learning process on TensorFlow/Keras and then optimize these networks using our approach. Fig. 10 shows the graph of one of the networks optimized by our algorithm. A comparison of the described networks can be seen in table 6. To compare complexities, we use the same approach as in section 4, i.e. we construct a lower envelope to interpolate the discrete data obtained from the computations; see fig. 9. One can see that, for the same error, our optimized networks offer significantly smaller complexities.
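The complexity of these 3-layered image-approximation networks can be counted directly from the layer sizes; the helper name and the sample sizes below are illustrative.

```python
# Parameter count of a 3-layered network with 2 inputs, n1 neurons in
# the first hidden layer, n2 in the second, and 1 output neuron; every
# weight and bias is a learnable parameter.
def complexity_3layer(n1, n2):
    return (2 * n1 + n1) + (n1 * n2 + n2) + (n2 * 1 + 1)

print(complexity_3layer(10, 10))  # 151
```

Complexities in the range of table 6 thus correspond to hidden layers of a few to a few dozen neurons each.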
Given      Best complexity of        Best complexity of
error      standard architectures    our optimized networks
0.014      67                        48
0.013      87                        63
0.011      117                       99
0.010      123                       118
0.009      145                       138
0.008      182                       169
0.007      237                       214
0.006      308                       292
0.005      408                       371
0.0045     474                       410
0.0040     590                       449
0.0035     768                       488
0.0030     987                       527
0.00225    1765                      562
0.0020     —                         606
0.0012     —                         847
0.0010     —                         921
0.0007     —                         1158
0.0006138  —                         1274
6. Conclusions
We propose a new kind of automatic architecture search algorithm. The algorithm alternates pruning connections with adding neurons, and it is not restricted to layered networks. Instead, we search for an architecture among arbitrary oriented graphs with weights (along with biases and an activation function), so the resulting network may have no layered structure at all. The goal is to minimize the complexity while staying within a given error.
We begin with any standard architecture and create our optimized one by pruning and letting the network grow, pruning and again letting it grow, and so on.
For large networks, where the number of connections counts in the hundreds, the complexity of the optimized network is several times smaller than that of a standard network with the same error (up to about three times in our experiments, see table 6). For small networks, where the number of connections counts in the tens, the decrease in complexity is more modest. Here by standard networks we mean the best results obtained with 3-layered and maximally fully connected networks.
The algorithm can be sped up, for example, by not considering every connection while pruning.
References

[1] T. Ash. Dynamic node creation in backpropagation networks. Connection Science, 1(4):365–375, 1989.
[2] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8697–8710, 2018.
[3] P. G. Benardos and G.-C. Vosniakos. Optimizing feedforward artificial neural network architecture. Engineering Applications of Artificial Intelligence, 20(3):365–382, 2007.
[4] P. G. Benardos and G.-C. Vosniakos. Prediction of surface roughness in CNC face milling using neural networks and Taguchi’s design of experiments. Robotics and Computer-Integrated Manufacturing, 18(5–6):343–354, 2002.
[5] G. Castellano, A. M. Fanelli, and M. Pelillo. An iterative pruning algorithm for feedforward neural networks. IEEE Transactions on Neural Networks, 8(3):519–531, 1997.
[6] E. D. Karnin. A simple procedure for pruning back-propagation trained neural networks. IEEE Transactions on Neural Networks, 1(2):239–242, 1990.
[7] J. Koza and J. F. Rice. Genetic generation of both the weights and architecture for a neural network. In IJCNN-91-Seattle International Joint Conference on Neural Networks, volume 2, pages 397–404, 1991.
[8] T.-Y. Kwok and D.-Y. Yeung. Objective functions for training new hidden units in constructive neural networks. IEEE Transactions on Neural Networks, 8(5):1131–1148, 1997.
[9] T.-Y. Kwok and D.-Y. Yeung. Bayesian regularization in constructive neural networks. In International Conference on Artificial Neural Networks, pages 557–562. Springer, 1996.
[10] T.-Y. Kwok and D.-Y. Yeung. Constructive algorithms for structure learning in feedforward neural networks for regression problems. IEEE Transactions on Neural Networks, 8(3):630–645, 1997.
[11] Y. LeCun, J. S. Denker, and S. A. Solla. Optimal brain damage. In Advances in Neural Information Processing Systems, pages 598–605, 1990.
[12] T.-C. Lee. Structure Level Adaptation for Artificial Neural Networks, volume 133. Springer Science & Business Media, 2012.
[13] M. Wistuba, A. Rawat, and T. Pedapati. A survey on neural architecture search. arXiv:1905.01392 [cs.LG], 2019.
[14] L. Ma and K. Khorasani. A new strategy for adaptively constructing multilayer feedforward neural networks. Neurocomputing, 51:361–385, 2003.
[15] R. Miikkulainen, J. Liang, E. Meyerson, A. Rawal, D. Fink, O. Francon, B. Raju, H. Shahrzad, A. Navruzyan, N. Duffy, and B. Hodjat. Chapter 15: Evolving deep neural networks. In Artificial Intelligence in the Age of Neural Networks and Brain Computing, pages 293–312, 2019.
[16] M. C. Mozer and P. Smolensky. Skeletonization: A technique for trimming the fat from a network via relevance assessment. In Advances in Neural Information Processing Systems, pages 107–115, 1989.
[17] L. Prechelt. Investigation of the CasCor family of learning algorithms. Neural Networks, 10(5):885–896, 1997.
[18] R. Reed. Pruning algorithms – a survey. IEEE Transactions on Neural Networks, 4(5):740–747, 1993.
[19] S. E. Fahlman and C. Lebiere. The cascade-correlation learning architecture. Advances in Neural Information Processing Systems, 2, 1990.
[20] B. E. Segee and M. J. Carter. Fault tolerance of pruned multilayer networks. In IJCNN-91-Seattle International Joint Conference on Neural Networks, volume 2, pages 447–452. IEEE, 1991.
[21] A. E. Shaw, D. Hunter, F. N. Iandola, and S. Sidhu. SqueezeNAS: Fast neural architecture search for faster semantic segmentation. In 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), pages 2014–2024, 2019.
[22] W. Weng and K. Khorasani. An adaptive structure neural networks with application to EEG automatic seizure detection. Neural Networks, 9(7):1223–1240, 1996.
[23] B. Wu, X. Dai, P. Zhang, Y. Wang, F. Sun, Y. Wu, Y. Tian, P. Vajda, Y. Jia, and K. Keutzer. FBNet: Hardware-aware efficient ConvNet design via differentiable neural architecture search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10734–10742, 2019.