Over the last decade, machine learning algorithms have made vast progress in a variety of fields. In particular, a general approach called deep neural networks (DNN), with multiple hidden layers, has enabled machine learning algorithms to perform at an acceptable level in many areas, in some cases outperforming human accuracy. Such progress has, in no small measure, become possible due to modern hardware computational capabilities, which enable the training of large DNNs on immense amounts of data.
On the other hand, even though large models perform very well on complex tasks, we cannot rely endlessly on an infinite increase in computational resources and dataset size. Training a large neural network is an energy-, time- and memory-demanding task. Recently, researchers have started questioning the energy consumption of machine learning algorithms and their carbon footprint. It is therefore worthwhile to develop a strategy built around models that have a constrained number of parameters, sufficient for a certain task, and that can be trained fast, rather than chasing higher accuracy by enlarging the number of parameters and using more complex hardware.
The universal approximation theorem claims that a feed-forward artificial neural network with a single hidden layer can approximate any continuous well-behaved function of an arbitrary number of variables with any accuracy. The conditions are: a sufficient number of neurons in the hidden layer, and a correct weight selection. The above-mentioned theorem for the arbitrary-width case was originally proved by Cybenko and Hornik and later extended to the arbitrary-depth case (DNN).
In this paper we take a deeper look at the practical application of Cybenko's theorem, in order to train a neural network in which all hidden neurons are used efficiently. We therefore pay attention to the two following aspects: the number of neurons and correct weight selection.
The number of neurons in a hidden layer is a quite straightforward parameter that became trendy with the availability of multi-threaded parallel computing on GPUs. Models with a vast number of trainable parameters are not devoid of logic, as they generalize better and can act as so-called 'universal learners'. For example, GPT-3, having 175 billion parameters, is a perfect example of a universal learner. Thus, the community has been experimenting with model architectures, increasing the width or depth of neural networks. Issues such as the vanishing gradient [17, 18] were resolved by applying methods including second-order Hessian-free optimization, training schedules using greedy layer-wise training [34, 15, 39], layer-size-dependent initialization, such as Xavier and Kaiming, and skip connections.
Even though we can make arbitrarily large models produce good predictions, expanding the number of trainable parameters towards infinity would not be the best option for achieving computational sustainability on tasks of lower complexity. The community has already been trying to address this problem, and several solutions have appeared. For example, the widely used ReLU activation function saturates only in one dimension, which helps with the vanishing gradient problem but results in so-called 'dying neurons'; modified activation functions such as Leaky ReLU, adaptive convolutional ReLU, Swish, Antirectifier, and many others were proposed to solve the problem of this 'neural graveyard'. Resource-efficient solutions, such as pooling operations, LightLayers, and depth-wise separable convolutions, were developed to reduce the complexity of models.
Correct weight selection, at first sight, depends on training parameters such as the loss function, number of epochs, learning rate, etc. However, to train the neural network competently, these weights have to be initialized stochastically. There are several ways to initialize weights, mainly aimed at avoiding vanishing gradients. Nevertheless, stochastic weight initialization can result in neuron redundancy, when different neurons are trained in a similar manner. This is not crucial if the neural network is excessively large; however, in computationally sustainable models, neuron redundancy and 'neural graveyards' are undesirable. Moreover, there are numerous applications where a memory-efficient model is required (e.g., autonomous devices such as sensors, detectors, and mobile or portable devices). Such devices require memory- and performance-efficient solutions that learn spontaneously and improve from experience. In this case, adding excessive parameters to the model can be rather questionable for the model's application.
Therefore, once we consider each neuron of the model as an individual learner, the neural network can be seen as an ensemble. It is known that for ensembles, diversity of learners is desirable to some extent. Thus, we can assume that diversity between neurons, or reinforced diversification during training, can be beneficial for the model.
In this paper we foremost explore how the diversity between neurons evolves during training and, as a following step, suggest methods for diversification of the neurons during model training. This is especially relevant in resource-constrained models, where neuron redundancy means reducing the number of predictors. Additionally, we show how weight pre-initialization can affect neural network training in the early steps.
2 Our Approach
Let us start with the term negative correlation (NC) learning, a simple yet elegant technique to diversify individual base models in an ensemble and reduce their correlations. Ambiguity decomposition of the loss function raises the possibility of controlling the trade-off between bias, variance, and covariance using the strength parameter, in order to reduce covariance. In its turn, the concept of NC learning originates from the bias-variance decomposition [3, 20] of ensemble learning. In this case, bias is the output shift from the true value, and variance is the measure of ensemble ambiguity, which simply means dispersion around the mean output value.
As first demonstrated by Krogh and Vedelsby, the quadratic error of the ensemble prediction is always less than or equal to the weighted average of the quadratic errors of the individual estimators of the ensemble:
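The equation itself was lost in extraction; what the text describes is the standard ambiguity decomposition, which in the usual notation (ensemble output $\bar{f}=\sum_i w_i f_i$, target $t$, member weights $w_i \ge 0$ with $\sum_i w_i = 1$) reads:

```latex
(\bar{f} - t)^2 \;=\; \sum_i w_i (f_i - t)^2 \;-\; \sum_i w_i (f_i - \bar{f})^2
```

Since the second (ambiguity) term is non-negative, the ensemble's quadratic error never exceeds the weighted average of the members' quadratic errors.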
Later, Brown demonstrated the decomposition of the ensemble error into three components (bias, variance, and covariance) and showed the connection between ambiguity and covariance:
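The decomposition itself was dropped from the extraction; in its standard form for an ensemble of $M$ estimators it reads:

```latex
\mathbb{E}\big[(\bar{f} - t)^2\big]
  \;=\; \overline{\mathrm{bias}}^{\,2}
  \;+\; \frac{1}{M}\,\overline{\mathrm{var}}
  \;+\; \Big(1 - \frac{1}{M}\Big)\,\overline{\mathrm{covar}}
```

where the bars denote averages over the ensemble members; the covariance term is what NC learning explicitly drives down.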
The ensemble ambiguity is nothing less than the variance of the weighted ensemble around the weighted mean. Therefore, higher ambiguity, i.e., decorrelation between the ensemble outputs, is desirable up to some measure.
Our first trial was to decorrelate neurons in the hidden layer by penalizing the difference between the mean weight of the neurons and each individual neuron:
where λ is the regularization strength parameter and n is the number of neurons in a layer.
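The formula itself did not survive extraction; as a minimal sketch, assuming the diversity measure is the mean deviation of each neuron's weight vector from the layer mean, and that the penalty enters the loss reciprocally (as the later remark on optimizing the 'reciprocal diversity measure' suggests), the idea can be written as:

```python
import numpy as np

def nc_diversity(W):
    """Diversity of a hidden layer: mean absolute deviation of each
    neuron's weight vector from the layer-mean weight vector.
    W has shape (n_neurons, n_inputs)."""
    return float(np.mean(np.abs(W - W.mean(axis=0))))

def nc_penalty(W, lam=1e-2):
    """Penalty added to the training loss; it grows as neurons collapse
    toward the mean, so minimising it reinforces decorrelation.
    lam is the regularization strength parameter."""
    return lam / (nc_diversity(W) + 1e-12)
```

Identical neurons yield zero diversity and therefore a very large penalty, while well-spread neurons are barely penalized.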
However, it is likely more profitable to compare not only single weights but weight matrices, or, e.g., kernels in convolutional neural networks (CNN), as trainable kernels represent patterns. Thus, the second way to define diversity is comparing neurons by cosine similarity:
where w_i are the weights of individual neurons and D is the diversity measure.
In this technique we compare each weight in the layer and define a diversity measure. However, such an expression has quadratic complexity, which would oppose the idea of the current work, as our intent is the fast and efficient training of resource-constrained neural networks.
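A sketch of this pairwise variant, under the assumption that the diversity measure is one minus the mean off-diagonal cosine similarity; comparing every neuron with every other is what gives the method its quadratic cost:

```python
import numpy as np

def pairwise_cos_diversity(W):
    """Diversity as 1 minus the mean pairwise cosine similarity over all
    distinct neuron pairs, O(n^2) in the number of neurons n.
    W has shape (n_neurons, n_inputs)."""
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)   # unit-norm rows
    S = Wn @ Wn.T                                       # cosine similarity matrix
    n = W.shape[0]
    mean_off = (S.sum() - np.trace(S)) / (n * (n - 1))  # mean off-diagonal entry
    return 1.0 - float(mean_off)
```

Parallel (redundant) neurons give a diversity of 0, mutually orthogonal neurons a diversity of 1.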
Therefore, combining the first two approaches, we introduce and explore another method to define diversity in neural networks:
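One plausible reading of this combined method, sketched below: each neuron is compared by cosine similarity against the layer's mean weight vector only, so pattern comparison is kept while the pairwise loop, and with it the quadratic cost, disappears:

```python
import numpy as np

def cos_to_mean_diversity(W):
    """Diversity as 1 minus the mean cosine similarity between each
    neuron's weight vector and the layer-mean weight vector, O(n)."""
    m = W.mean(axis=0)
    m = m / (np.linalg.norm(m) + 1e-12)                 # unit-norm mean vector
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)   # unit-norm rows
    return 1.0 - float(np.mean(Wn @ m))
```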
After observing the training process and the evolution of the diversity measure in the models, we explored the possibility of weight pre-optimization using diversification. In this case, we used Kaiming weight initialization, with further optimization to enlarge the diversity between the weights while keeping the mean and standard deviation of the weight matrix close to their initial values:
where L is the loss, μ₀ is the initial weight mean, μₖ is the weight mean at training step k, σ₀ is the standard deviation of the initial weight array, and σₖ is the standard deviation of the weight array at training step k.
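A sketch of this pre-initialisation objective, assuming the diversity term enters reciprocally and the mean/std constraints are quadratic penalties (the exact functional form of Eq. 6 was lost in extraction):

```python
import numpy as np

def diversity(W):
    # mean absolute deviation of each neuron's weights from the layer mean
    return float(np.mean(np.abs(W - W.mean(axis=0))))

def preinit_loss(W, mu0, sigma0, lam=1e-2):
    """Loss for iterative weight pre-optimisation: reward diversity while
    pinning the weight mean and standard deviation to their initial
    (Kaiming-sampled) values mu0 and sigma0. lam is an assumed weighting."""
    stat = (W.mean() - mu0) ** 2 + (W.std() - sigma0) ** 2
    return lam / (diversity(W) + 1e-12) + stat
```

Minimising this loss by gradient steps on W spreads the neurons apart while leaving the layer's first two weight statistics essentially untouched.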
3 Experiments
We performed initial experiments using a DNN in order to study the diversity evolution during model training and demonstrate the effectiveness of the proposed diversification mechanisms.
The experiments were performed on the publicly available benchmark dataset Fashion-MNIST. This dataset was chosen as it is suitable for DNN training and has higher variance than the traditional hand-written digit dataset MNIST. We implemented a one-hidden-layer neural network with 16, 32, 64, 128, and 256 neurons in the hidden layer (see Table 1), using the PyTorch library. Otherwise, we used standard parameters for training, including the Adam optimizer with a learning rate of 0.01 and a cross-entropy loss function with penalization terms (Eq. 3-5):
where T is the training set, p is the true distribution, q is the predicted distribution, p(x) is the probability of event x estimated from the training set, and D is the diversity measure, obtained using Eq. 3, 4, or 5.
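Putting the pieces together, a minimal sketch of the training objective as described (cross entropy plus a diversification term; the reciprocal form of the penalty is an assumption, and in the actual training these operations run on PyTorch tensors):

```python
import numpy as np

def cross_entropy(q, y):
    """Mean negative log-likelihood of the true classes.
    q: (n_samples, n_classes) predicted probabilities; y: true class indices."""
    return float(-np.mean(np.log(q[np.arange(len(y)), y] + 1e-12)))

def total_loss(q, y, W, lam=1e-2):
    """Cross entropy plus a diversification penalty on the hidden-layer
    weights W, here using the Eq. 3-style mean-deviation diversity."""
    div = np.mean(np.abs(W - W.mean(axis=0)))
    return cross_entropy(q, y) + lam / (div + 1e-12)
```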
4 Results and Discussion
4.1 Evolving Diversity and Symmetry Breaking
During model training, one can notice sub-optimal accuracy stagnation for several epochs; this can be associated with the existence of local minima on the loss-function surface [2, 35], and with symmetry in the neural network layer, which is shown to be a critical point, especially for small neural networks [1, 37]. We found that the model naturally tends to decrease the correlation between neurons; however, when the model converges to a local minimum with sub-optimal accuracy, the similarity between the neurons rises until the optimization process surpasses the local minimum and the accuracy increases (see Figure 1). This correlates with the existence of symmetry in the weights: once the weights are symmetrical (correlated) and the number of neurons is constrained, the overall output of the model is likely to be inefficient.
4.2 Negative Correlation Learning
Table 1: Number of Neurons vs. Test Accuracy, %.
The experiment above inspired us to study ways to decorrelate neurons in the hidden layer and thus break the symmetry that can appear during the learning process. As discussed earlier, we consider the output of a neural network as the output of an ensemble. Thus, first, we applied simple NC learning to the individual neurons rather than to an ensemble of classifiers. The logic behind this experiment is rather comprehensible: once the model has a constrained number of parameters to generalize the data, higher variance should help to eliminate redundant neurons, and the overall prediction should be more accurate. As can be seen from Figure 2, the decorrelation mechanism helps to avoid local minima at the early stage of model learning. Nevertheless, decorrelation using NC learning generally did not result in higher overall accuracy. We attribute this to several factors, such as Kaiming weight initialization, which helps to avoid vanishing gradients, and the Adam optimizer, a replacement optimization algorithm that can handle sparse gradients on noisy data and is thus able to efficiently overcome local minima due to adaptive learning rates for each parameter. Even though these widely used techniques deal with the above-mentioned problem of neuron redundancy, our proposed method can help at the early stages of model training.
Moreover, with an increasing number of neurons, the influence of decorrelation diminishes. This can be explained by the fact that an excessively large NN performs well on low-variance data, since not every neuron is needed for a good prediction. However, in the present work we consider computationally sustainable DNNs, where all the neurons are forced to contribute to the prediction; on the other hand, for complex data, a larger number of neurons would be needed to generalize the dataset. Therefore, for more sophisticated problems, neuron diversification may be efficient for a larger number of neurons. In the present case, we performed further experiments on the model with 64 neurons in the hidden layer, which we consider sufficient for the given dataset. All the models were trained 10 times to calculate the mean and standard deviation. Table 2 shows the average testing accuracy over the first 10 epochs for the DNN with 64 neurons in the hidden layer trained using negative correlation learning (Eq. 3).
Table 2: Train Acc., %; Test Acc., %; Test Acc. STD.
4.3 Pairwise Cosine Similarity Diversification
It has to be noted that, unlike in the original NC-learning study, where a universal diversification strength parameter was found for ensembles of all sizes, in our case the value of λ depends on the size of the hidden layer and should rather be considered per neuron. On the other hand, it is loss-dependent: ideally, it has to be of the same or one order of magnitude smaller than the output of the loss function during training; otherwise, the reciprocal diversity measure, rather than the model loss (e.g., cross entropy), will be optimized. Thus, the reader has to consider optimizing the value of λ for each particular neural network and loss function. The optimal λ can be approximately estimated as:
where n is the number of neurons in the hidden layer and the remaining factor is the order of magnitude of the loss function.
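The estimate itself did not survive extraction; one plausible reading, consistent with the text (same order as the loss, considered per neuron), is the hypothetical helper below:

```python
def estimate_lambda(loss_magnitude, n_neurons):
    """Hypothetical per-neuron estimate of the diversification strength:
    of the order of the loss value, divided across the n_neurons of the
    hidden layer. The exact dropped formula is an assumption."""
    return loss_magnitude / n_neurons
```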
In addition to NC learning, we introduced a diversity measure based on the cosine similarity between neurons (Eq. 4). This technique seems promising for several reasons: first, rather than mean values, we compare patterns, which can be useful for more complex models, such as CNNs or transformers; moreover, here each neuron is compared with every other one, so the model is intended to be more robust. Nevertheless, at least for DNNs, the results were comparable with NC learning (see Table 3); additionally, this method has quadratic complexity, which opposes our initial aim of training small models faster and more efficiently.
Table 3: Train Acc., %; Test Acc., %; Test Acc. STD.
4.4 Reaching Linear Complexity
To enable our diversification method to compare patterns while avoiding quadratic complexity, we combined the first concept of NC learning with the second one and implemented a diversity measure based on penalization of the cosine similarity between each neuron in the hidden layer and the layer's mean neuron (Eq. 5). The overhead of the algorithm (see Table 4) is comparable with regularization. Moreover, it has shown the highest accuracy gain among the three.
Table 4: Train Acc., %; Test Acc., %; Test Acc. STD.
4.5 Iterative Diversified Weight Initialization
However, it can be noticed that occasionally during training the model does not behave exactly as expected, creating outlying learning curves. This is most likely associated with stochastic weight initialization; in our case, Kaiming initialization is used. Kaiming initialization is widely used for neural networks with ReLU activation functions and is related to the nonlinearity of the ReLU activation function, which makes it non-differentiable at zero. The weights in this case are initialized stochastically with a variance that depends on the number of neurons:
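The variance formula itself was dropped; for ReLU networks, Kaiming initialization draws zero-mean Gaussian weights with variance 2/n_in, where n_in is the number of input connections of the layer (fan-in). A minimal sketch:

```python
import numpy as np

def kaiming_normal(n_out, n_in, seed=0):
    """Kaiming (He) normal initialization for ReLU layers: zero-mean
    Gaussian weights with variance 2 / n_in (fan-in mode), which keeps
    the activation variance stable through ReLU nonlinearities."""
    rng = np.random.default_rng(seed)
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))
```

In PyTorch this corresponds to `torch.nn.init.kaiming_normal_` with its default fan-in mode.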
It is fair to suggest that the correlation between the initialized weights can play a significant role in the model learning process. Indeed, in Figure 1 it is clearly seen that the model gained most of its accuracy while reducing the correlation between neurons during the first few epochs. However, the aim of weight initialization is to prevent layer activation outputs from exploding or vanishing during the course of a forward pass through a deep neural network. Usually, weights are initialized stochastically with small numbers to avoid vanishing gradients, especially if saturating activation functions such as sigmoid or tanh are used. Thus, to obtain stochastically initialized, yet decorrelated, weights, we introduced iteratively diversified weight initialization, using a custom loss function based on Eq. 6. The logic behind such initialization is to enlarge the diversity between the weights and at the same time keep the weight mean and weight standard deviation close to those originally initialized using Kaiming initialization.
Table 5: Train Acc., %; Test Acc., %; Test Acc. STD.
5 Conclusions
In this paper we showed how to explore and tame the diversity of neurons in the hidden layer. We studied how the correlation between neurons evolves during training and what its effect on prediction accuracy is. It appears that once the model has converged to a local minimum on the loss landscape, the correlation between the neurons increases up to the point when the optimization process overcomes the local minimum. Thus, we introduced three methods to dynamically reinforce diversification and thus decorrelate a neural network layer. The concept of negative correlation suggested by Brown was reviewed and expanded: instead of decorrelating individual neural networks in an ensemble, we diversified the neurons in the hidden layer, using three techniques: negative correlation learning, pairwise cosine similarity, and cosine similarity around the mean.
The first technique originates from neural network ensembles and shows decent performance in our example using a DNN; however, for more sophisticated models, such as CNNs and transformers, the second and third techniques are likely to be more advantageous, as they can compare patterns. Additionally, to reach correct weight selection, we introduced iterative weight optimization using weight diversification. It was shown that such techniques are suitable for the fast training of small models and notably affect their accuracy at the early stage, which is a small yet important step towards the development of a strategy for energy-efficient training of neural networks.
Our future plans for neural network diversification primarily consist in applying the above-described diversification techniques in more sophisticated models, in order to explore the possibility of improving training speed and reducing the number of training parameters. Popular architectures such as transformers can benefit from individual head diversification in the multi-head attention block, as the multiple heads are intended to learn various representations. Furthermore, we are planning to explore more pattern-oriented techniques for defining diversity between neurons, to enable efficient application of diversification in CNNs.
This research is supported by the Czech Ministry of Education, Youth and Sports from the Czech Operational Programme Research, Development, and Education, under grant agreement No. CZ.02.1.01/0.0/0.0/15003/0000421, and by the Czech Science Foundation (GAČR 18-18080S).
- (2020) Symmetry and critical points for a model shallow neural network.
- (2007) Avoiding local minima in feedforward neural networks by simultaneous learning. In AI 2007: Advances in Artificial Intelligence, M. A. Orgun and J. Thornton (Eds.), Berlin, Heidelberg, pp. 100–109.
- (2021) When does diversity help generalization in classification ensembles?
- (2004) Diversity in neural network ensembles. Technical report.
- (2020) Language models are few-shot learners.
- (2017) Xception: deep learning with depthwise separable convolutions.
- (2011) Flexible, high performance convolutional neural networks for image classification. In Twenty-Second International Joint Conference on Artificial Intelligence.
- (1989) Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems 2 (4), pp. 303–314.
- (2020) Adaptive convolutional ReLUs. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 3914–3921.
- (2010) Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256.
- (2011) Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 315–323.
- (1990) Neural network ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence 12 (10), pp. 993–1001.
- (2015) Deep residual learning for image recognition.
- (2015) Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1026–1034.
- (2006) A fast learning algorithm for deep belief nets. Neural Computation 18 (7), pp. 1527–1554.
- (2006) Reducing the dimensionality of data with neural networks. Science 313 (5786), pp. 504–507.
- Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. In A Field Guide to Dynamical Recurrent Neural Networks, S. C. Kremer and J. F. Kolen (Eds.).
- (1998) The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 6 (02), pp. 107–116.
- (1991) Approximation capabilities of multilayer feedforward networks. Neural Networks 4 (2), pp. 251–257.
- (2019) Averaging weights leads to wider optima and better generalization.
- (2021) LightLayers: parameter efficient dense and convolutional layers for image classification.
- (2017) Adam: a method for stochastic optimization.
- (1995) Neural network ensembles, cross validation, and active learning. Advances in Neural Information Processing Systems 7, pp. 231.
- (2019) Quantifying the carbon emissions of machine learning. arXiv preprint arXiv:1910.09700.
- (2010) MNIST handwritten digit database. http://yann.lecun.com/exdb/mnist/
- (2020) Dying ReLU and initialization: theory and numerical examples. Communications in Computational Physics 28 (5), pp. 1671–1706.
- (2017) The expressive power of neural networks: a view from the width.
- (2019) Deep learning for fast adaptive beamforming. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1333–1337.
- (2011) Learning recurrent neural networks with Hessian-free optimization. In ICML.
- (2007) Massive threading: using GPUs to increase the performance of digital forensics tools. Digital Investigation 4, pp. 73–81.
- (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pp. 8024–8035.
- (2017) Searching for activation functions. arXiv preprint arXiv:1710.05941.
- (2010) Evaluation of pooling operations in convolutional architectures for object recognition. In International Conference on Artificial Neural Networks, pp. 92–101.
- (1992) Learning to control fast-weight memories: an alternative to dynamic recurrent networks. Neural Computation 4 (1), pp. 131–139.
- (2017) Local minima in training of neural networks.
- (2014) Going deeper with convolutions.
- (2020) Inverse problems, deep learning, and symmetry breaking.
- (1996) Generalization error of ensemble estimators. In Proceedings of International Conference on Neural Networks (ICNN'96), Vol. 1, pp. 90–95.
- (2008) Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, pp. 1096–1103.
- (2017) Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747.
- (2015) Empirical evaluation of rectified activations in convolutional network. arXiv preprint arXiv:1505.00853.