I Introduction
Deep Neural Networks (DNNs) have driven the most remarkable recent progress in artificial intelligence: algorithms now reach human-comparable (or even superhuman) performance in tasks such as computer vision and Natural Language Processing [5]. Nonetheless, it is still unclear how neural networks encode knowledge about a specific task: interpretability has thus become increasingly popular [12] as the key to understanding the enabling factors behind DNNs' remarkable performance. Complex Network Theory (CNT) [1] is a branch of mathematics that investigates complex systems, from the human brain to networks of computers, by modeling and then simulating their dynamics through graphs whose nodes are entities and whose edges are relationships [2], [3], [14]. We address the problem of characterizing a Deep Neural Network and its training phase with (ad-hoc) CNT metrics. In this paper we formally show that this is both possible and insightful. As a network becomes progressively better at a target task (e.g., image classification), we analyse the dynamics of CNT metrics to spot trends that generalize across different settings. We focus on Fully Connected (FC) and Convolutional Neural Networks (CNNs)
[9] with Rectified Linear Unit (ReLU) [6] activations, thus representing a step towards extending the spectrum of applications of CNT to general-purpose Deep Learning.

II Related Works
Recent works have addressed the problem of analysing neural networks as directed graphs within Complex Network Theory. In [17], the authors use CNT metrics to distill information from Deep Belief Networks: these generative models, which differ from feed-forward neural networks in that the learning phase is generally unsupervised, are studied through the lens of CNT by turning their architectures into a stack of equivalent Restricted Boltzmann Machines. An application of CNT metrics to feed-forward neural networks is described in [19], where the analysis focuses on the emergence of motifs, i.e., connections that present both an interesting geometric shape and strong values of the corresponding Link-Weights. CNT has also been used to assess the parallel processing capability of network architectures [13]. These are, to the best of our knowledge, the works that model DNNs through CNT metrics. On the other hand, the evaluation of how Link-Weights (widely referred to simply as 'weights' in DNN theory) vary during the training phase has been notably addressed in [4], and many other works followed [15]. In view of our findings, our work also connects to the literature on the dynamics, and drift, of Link-Weights [16], although our approach, which extends the analysis to 'hierarchical' metrics such as Nodes Strength and Layers Fluctuation, does not involve the evaluation of quantities related to Information Theory. From the point of view of replicability and open-sourcing code for this kind of analysis, Latora et al.
[1] provide an efficient implementation of a wide range of tools to study generic Complex Networks. Our work is both a theoretical and an engineering effort to translate these metrics to Deep Learning, especially when it comes to CNNs, where exploiting weight sharing and the symmetry of Convolution considerably reduces the computational time.

III Contributions
Our contribution is threefold: by extending the CNT analysis to DNNs, we answer two core research questions (RQ1, RQ2); furthermore, we provide an efficient implementation of our methods (C1), which we expect will benefit and accelerate the progress of the field.
(RQ1) CNT metrics can analyse DNNs. We introduce a formal framework to analyse the Link-Weights, Nodes Strength, and Layers Fluctuation of FC and Convolutional ReLU networks.
(RQ2) CNT makes the DNN training dynamics emerge. We train populations of networks from which we distill trends for Link-Weights, Nodes Strength, and Layers Fluctuation, with the accuracy on the test set as the control parameter. Our framework is computationally efficient and can analyse CNNs with thousands of neurons on non-trivial tasks such as CIFAR10 [8]. In this sense, we show that our framework is effective at characterising both populations of neural networks trained with different initializations (ensemble analysis) and single instances (individual analysis).
(C1) We open-source the code to study DNN dynamics through CNT metrics; the code, along with the relative instructions and supplementary experiments, can be downloaded from https://github.com/EmanueleLM/CNTDNNs. We provide, under a permissive license, an efficient implementation of our tools to accelerate the research of people interested in the topic.
IV Methodology
In this section we introduce the CNT notation for DNNs. We then describe the CNT metrics that we use to capture a network's learning dynamics. We finally discuss how to specifically analyse ReLU FC networks and CNNs.
IV-A Neural Networks as Complex Networks
W.l.o.g., we consider a supervised classification task, where an algorithm learns to assign an input vector of real numbers $x \in \mathbb{R}^{d}$ to a categorical output from a discrete set of target classes $\mathcal{C}$. To introduce the notation, we consider a generic neural network with $L$ hidden layers. Within each layer $\ell$, every neuron is connected through a weighted link to all the neurons in the next layer $\ell+1$. The output of a layer is the product of an affine transformation of the form $z^{(\ell)} = W^{(\ell)} x^{(\ell)} + b^{(\ell)}$, followed by a non-linear activation function $\phi$. The output of the last layer is a vector of real numbers $\hat{y}$, from which the $\arg\max$ operator extracts the predicted class. We will refer to the input and output vectors respectively as $x$ and $\hat{y}$, while $W^{(\ell)}$ and $b^{(\ell)}$ in the previous formula refer to the parameters of a neural network layer $\ell$. For an FC layer, $W^{(\ell)}$ is a matrix of size $n_{\ell+1} \times n_{\ell}$ and $b^{(\ell)}$ is a vector of size $n_{\ell+1}$. In a CNN layer, $W^{(\ell)}$ is a multi-dimensional tensor employed to compute the convolution with the layer's input. A graphic overview is sketched in Figure 1 (left). We use networks where each activation function in the hidden layers is a Rectified Linear Unit or ReLU (i.e., $\phi(z) = \max(0, z)$), except for the last layer, where a softmax is applied and interpreted as the probability that the input belongs to one of the output classes, namely $\mathrm{softmax}(z)_i = e^{z_i} / \sum_{j=1}^{|\mathcal{C}|} e^{z_j}$, $|\mathcal{C}|$ being the cardinality of the target classes set.

Within the framework of CNT, a neural network is a directed bipartite graph $G = (V, E)$. Each vertex is a neuron that belongs to a neural network layer $\ell$. The intensity of a connection, denoted as "weight" in both CNT and DNNs, is a real number assigned to the edge that connects two neurons. Finally, $w^{(\ell)}_{ij}$ is the weight that connects neuron $i$ from layer $\ell$ to neuron $j$ from layer $\ell+1$.
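The mapping from a network's parameters to the corresponding weighted, directed graph can be sketched as follows. This is a minimal illustration with NumPy; the function and variable names are ours, not taken from the paper's released code.

```python
import numpy as np

def to_edge_list(weights):
    """Turn the weight matrices of an FC network into a directed, weighted
    edge list. weights[l][i, j] is the link from neuron i of layer l to
    neuron j of layer l+1; each edge is a ((l, i), (l+1, j), w) triple."""
    edges = []
    for l, W in enumerate(weights):
        n_in, n_out = W.shape
        for i in range(n_in):
            for j in range(n_out):
                edges.append(((l, i), (l + 1, j), float(W[i, j])))
    return edges

# A toy 3-4-2 network with normally initialized parameters.
rng = np.random.default_rng(0)
layers = [rng.normal(0, 0.05, size=(3, 4)), rng.normal(0, 0.05, size=(4, 2))]
edges = to_edge_list(layers)
print(len(edges))  # 3*4 + 4*2 = 20 edges
```

From this edge list, any standard graph library (or the tools of [1]) can compute the CNT metrics introduced below.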
IV-B Metrics for Neural Networks as Complex Networks
We propose three CNT metrics to study the learning dynamics of DNNs. For each metric, we report a synopsis followed by an in-depth analysis (a sketch of how the metrics map to a neural network is reported in Figure 1). We also report a few relevant use cases that each metric supports within our tool.
Link-Weights. The dynamics of the parameters in each layer reflect the evolution of the network's training phase.
Nodes Strength. It quantifies how strongly the signal is propagated through each neuron, revealing salient flows and weak connections.
Layers Fluctuation. It extends to DNNs the notion of Nodes Fluctuation, so that it is possible to measure the Strength disparity at the level of the network's hidden layers.
IV-B1 Link-Weights Dynamics
As the network undergoes the training phase, we investigate weight dynamics in terms of mean and variance in each layer. Given a neural network layer $\ell$ with weights $w^{(\ell)}_{ij}$ over its set of links $E_\ell$, we define:

$$\mu^{(\ell)} = \frac{1}{|E_\ell|} \sum_{i,j} w^{(\ell)}_{ij} \quad (1)$$

$$\sigma^{2\,(\ell)} = \frac{1}{|E_\ell|} \sum_{i,j} \left( w^{(\ell)}_{ij} - \mu^{(\ell)} \right)^2 \quad (2)$$
1.1) Use cases: The evolution of the mean and variance through the training steps gives significant insight into learning effectiveness and stability. An underrated yet common issue arises when the weight norm does not grow: this is often a symptom of model over-regularization [20]. On the other hand, it is well known that when the weights grow too much, one may incur overfitting problems, which are mitigated by regularization techniques.
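Tracking the per-layer mean and variance over training snapshots can be sketched in a few lines. This is a minimal, self-contained example (the snapshot data is hypothetical):

```python
import numpy as np

def layer_stats(W):
    """Per-layer mean and variance of the Link-Weights (cf. Eqs. 1-2)."""
    w = np.asarray(W, dtype=float).ravel()
    mu = float(w.mean())
    var = float(((w - mu) ** 2).mean())
    return mu, var

# Hypothetical training snapshots of one layer's weight matrix:
snapshots = [np.array([[1., 3.], [5., 7.]]) * s for s in (0.1, 0.5, 1.0)]
history = [layer_stats(W) for W in snapshots]
print(history[-1])  # (4.0, 5.0) for the last snapshot
```

Plotting `history` against the model's test accuracy (the control parameter of Section V) reproduces the Link-Weights dynamics analysis.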
IV-B2 Nodes Strength
The strength $s_i$ of a neuron $i$ is the sum of the weights of the edges incident in $i$. Neural network graphs are directed, hence two components contribute to the Node Strength: the sum of the weights of the outgoing edges, $s^{out}_i$, and the sum of the weights of the ingoing links, $s^{in}_i$:

$$s_i = s^{in}_i + s^{out}_i = \sum_j w_{ji} + \sum_k w_{ik} \quad (3)$$
2.1) Use cases: When a neuron has an anomalously high value of Node Strength, it is propagating a stronger signal than the other neurons in the layer: this is a hint that either the network is not able to propagate the signal uniformly through all the neurons, or that the neuron is in charge of transmitting a relevant portion of the information for the task. On the other hand, a node that propagates a weak or null signal may be pruned (hence reducing the network's complexity), as it does not contribute significantly to the output of the layer. A viable approach might be extreme value theory [18], as it is important to quantify the probability of a particular value occurring before deciding either to prune a neuron or to act on the other neurons of the layer.
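For an FC layer, both strength components follow directly from the adjacent weight matrices. A minimal sketch (our own convention for matrix orientation, stated in the docstring):

```python
import numpy as np

def node_strength(W_in, W_out):
    """Node Strength of each neuron in a hidden layer (cf. Eq. 3): sum of
    ingoing plus outgoing link weights.
    W_in[i, j]: link from neuron i of the previous layer to neuron j here;
    W_out[j, k]: link from neuron j here to neuron k of the next layer."""
    W_in, W_out = np.asarray(W_in), np.asarray(W_out)
    s_in = W_in.sum(axis=0)    # ingoing strength per neuron
    s_out = W_out.sum(axis=1)  # outgoing strength per neuron
    return s_in + s_out

s = node_strength([[1., 2.], [3., 4.]], [[1., 1., 1.], [2., 2., 2.]])
print(s)  # [ 7. 12.]
```

Neurons whose strength sits in the far tail of the layer's distribution are the pruning (or inspection) candidates discussed above.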
IV-B3 Layers Fluctuation
CNT identifies neural network asymmetries at the level of nodes and links. The standard measure, known as Nodes Disparity [7], is defined for a node $i$ as $Y_i = \sum_j \left( w_{ij} / s_i \right)^2$. Nodes Disparity ranges from $1/k_i$ (with $k_i$ the number of links of node $i$) to $1$, with the maximum value reached when all the weight enters a single link. Conversely, weights that are evenly distributed cause the nodes in the network to have the same, minimum, value of Disparity.

Although Disparity and similar metrics are widely adopted for studying Complex Networks [11], a fundamental problem arises when the weights in the previous equation assume both positive and negative values, as in the case of DNNs: the denominator can be zero, either for very small values of the weights or when the sums of negative and positive values balance exactly. In addition, it is appropriate to introduce a metric that measures the fluctuation of strengths in each layer, as the nodes in a DNN contribute in synergy to the identification of increasingly complex patterns.
We propose a metric to measure the Strength fluctuations in each layer, as a proxy of the complex interactions among nodes at the same depth.
The Layers Fluctuation is defined as:

$$F^{(\ell)} = \sqrt{ \frac{1}{n_\ell} \sum_{i \in \ell} \left( s_i - \bar{s}^{(\ell)} \right)^2 } \quad (4)$$

where $\bar{s}^{(\ell)}$ is computed as the average value of Nodes Strength at layer $\ell$, and $n_\ell$ is the number of nodes in the layer. Please note that, differently from the standard Nodes Fluctuation introduced at the beginning of this section, the Layers Fluctuation formula drops the dependence on each specific node to characterize a layer's dynamics.
The advantage of this metric is that it measures disparity in a way that avoids numerical problems, while still describing networks whose weights can assume any range of values, without being restricted to positive ones only.
3.1) Use cases: Layers Fluctuation can be used to spot bottlenecks in a network, i.e., cases where a layer impedes the information from flowing uninterrupted through the architecture. In the experimental evaluation we show how Layers Fluctuation enables us to spot interesting behaviors of a network when other metrics (including the Link-Weights and Nodes Strength) are not sufficient.
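One plausible reading of the Layers Fluctuation is the root-mean-square deviation of the Node Strengths from their layer average; we sketch that reading below (the paper's released code remains the authoritative definition). Note that, unlike Nodes Disparity, there is no per-node denominator, so mixed-sign weights cause no division by zero.

```python
import numpy as np

def layer_fluctuation(strengths):
    """RMS deviation of the Node Strengths from the layer average:
    a per-layer summary that drops the dependence on individual nodes."""
    s = np.asarray(strengths, dtype=float)
    return float(np.sqrt(np.mean((s - s.mean()) ** 2)))

print(layer_fluctuation([1.0, 3.0]))       # 1.0: strengths spread around mean 2
print(layer_fluctuation([2.0, 2.0, 2.0]))  # 0.0: perfectly even layer
```

A layer whose fluctuation stays near zero while its neighbours' grow is a candidate bottleneck in the sense of the use case above.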
IV-C DNNs Setup for CNT Analysis
We take a number of precautions before setting up the experiments, so that interpreting the dynamics of the CNT parameters is straightforward. The DNNs' hidden layers are activated through ReLU functions, which are universal approximators [6]. We also normalize each input variable of the data between 0 and 1 before it is fed into the network, so that the weights, which can be either positive or negative, determine through the ReLU function the sign of each hidden neuron. In this way, CNT metrics capture the network dynamics, and trends are easier to interpret as their sign depends on the parameters of the network. To further improve the interpretability of the results, CNT metrics can be extracted neglecting the output layer (softmax activation) and replacing it, after the training phase, with a linear activation: we note that this operation is legitimate only for the last layer [7], where the neuron that receives the weights with the greatest magnitude is selected as the placeholder of the output class.
V Experimental Results
In this section, we provide the results of our framework. We initially describe the neural architectures that we employ, their parameters, and the data that serve as test bed. We then detail results for a population of diverse neural networks (ensemble analysis), and we end the Section with the case of a single network analysis (individual analysis).
V-A Experimental Outline
V-A1 Settings
In the ensemble analysis, the parameters of FC networks and CNNs are initially sampled from either normal or uniform distribution families, with the initial variance/support of the distribution set as reported in Table I (for reasons of space, in the main paper we discuss the results for the networks whose parameters are initialized with normal distributions, while the analysis with uniform initialization is reported in the Supplement, available in the repository linked in Section III). The topology of each network is fixed a priori: 4 layers for the ensemble analysis (4 dense layers for FCs, 2 Convolutional + 2 dense layers for CNNs), and 6 layers for the individual analysis (4 Convolutional + 2 dense layers). We perform separate experiments where we vary the number of neurons/nodes in the hidden layers, so that we study 2 families of models with different orders of magnitude of parameters.

V-A2 Training
We independently train hundreds to thousands of neural networks (for each category we have introduced), which are then clustered in bins based on the accuracy of each model on the task's test set; more details on the experimental setup are provided in Table I.A. Accuracy ranges from random-guess level to the maximum reached, partitioned into bins of equal size, each containing a minimum number of networks. Networks segmented in this way are obtained through a mix of learning and early stopping. On the other hand, identifying classes of networks whose accuracies saturate is very hard and out of the scope of this work. Finding networks that naturally stop at a specific level of accuracy is hard for the following reasons: (i) when network topology and task are fixed, it is difficult to obtain several networks that cover the whole (low/high accuracy) performance spectrum; (ii) when incorporating networks with different topologies, one incurs unsustainable computational costs and difficulties in homogenizing the ensemble analysis, especially for Nodes Strength and Layers Fluctuation, where different numbers of neurons/links pose issues when comparing varying architectures. As a final note, while we report the results for a single round of experiments, we calculate and plot the metrics multiple times, randomizing the composition of the populations involved, so that the analysis is statistically sound.
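The binning of trained networks by test accuracy can be sketched as follows. The bin origin and width here are our assumptions for illustration (random-guess level on a 10-class task, equal widths), not values taken from the paper's setup.

```python
def accuracy_bin(acc, lo=0.1, width=0.1):
    """Index of the accuracy bin a trained network falls into.
    lo and width are hypothetical: lo is the random-guess accuracy
    of a balanced 10-class task, width the (equal) bin size."""
    return int((acc - lo) // width)

# Four hypothetical trained networks, sorted into bins:
accs = [0.12, 0.34, 0.55, 0.97]
bins = [accuracy_bin(a) for a in accs]
print(bins)  # [0, 2, 4, 8]
```

Each bin must then be checked for a minimum population size before the per-bin CNT metrics are computed.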
V-A3 CNT Analysis
We choose MNIST [10] as the test bed for the ensemble analysis (although we could extend the ensemble analysis to increasingly complex tasks, e.g. CIFAR10, without further adjustments to the framework, we leave this as future work: this non-trivial extension is beyond the scope of this paper and involves a massive amount of computational resources). We normalize the datasets before training so that each input dimension spans from 0 to 1, while the DNN weights are unbounded in $\mathbb{R}$. As a complement to the ensemble analysis, and as an extension of the work in [19], we use our framework to study the dynamics of individual DNNs on CIFAR10 [8], where we compute CNT metrics for different snapshots (at different levels of accuracy) of a single neural network. Patterns that are specific to that instance are hence local. The details on both the ensemble and local approaches are reported in Table I.B.

V-A4 Results Visualization
In the ensemble analysis, we compute and plot the CNT metrics as Probability Distribution Functions (PDFs) or, alternatively, as error bars. In each Figure, the population of lowest-performing networks is plotted in grey, while the best-performing networks' PDF is plotted in blue: we use a color scale (from grey to blue) to plot increasingly performing networks. We note that the same approach is applied to plot the dynamics in the individual analysis.

V-B Results
V-B1 Trends in Ensemble Analysis
When comparing low- vs. high-performing FC networks trained on MNIST, the Link-Weights analysis distills patterns that only partially capture the complexity of the training dynamics, as evidenced in Figure 7. In fact, low-performing networks are hardly distinguishable from their high-performing counterparts. We note that the Link-Weights PDFs tend to flatten out slightly as performance improves, generating heavy tails with stable mean values. The absence of a strong discriminant between low- and high-performing networks becomes even more severe when analysing CNNs (Figure 2, top), thus motivating an analysis that encompasses increasingly complex metrics such as Nodes Strength and Layers Fluctuation. The initial distribution of the Nodes Strength, which in the early stages of learning is characterized by zero mean and small variance, tends to widen its support as training increases the network's accuracy. As sketched in Figure 8, high-performing FC networks exhibit values of Nodes Strength that concentrate around positive or negative values, relegating a contribution close to zero to a restricted number of nodes. Interestingly, in a few hidden layers the Nodes Strength PDF is bimodal (Figure 8 bottom, layers 2 and 3), while in general heavy tails characterize accurate networks. As the variance of the Nodes Strength increases, the Layers Fluctuation tends to flatten, with the PDF becoming multimodal in a few cases as model accuracy increases (Figure 9).
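Heavy tails and asymmetry of the empirical PDFs can be quantified numerically with skewness and excess kurtosis; below is a self-contained sketch using the plain moment estimators (an assumption on our part; any standard statistics library offers equivalent, bias-corrected versions).

```python
import numpy as np

def skew_kurtosis(samples):
    """Sample skewness and excess kurtosis of a metric's empirical PDF:
    asymmetry shows up as non-zero skewness, heavy tails as large kurtosis."""
    x = np.asarray(samples, dtype=float)
    z = (x - x.mean()) / x.std()  # standardized samples
    return float(np.mean(z ** 3)), float(np.mean(z ** 4)) - 3.0

print(skew_kurtosis([-1.0, -1.0, 1.0, 1.0]))  # (0.0, -2.0): symmetric, light tails
```

Applied to the per-bin Nodes Strength samples, these two statistics summarize how the tails evolve with accuracy.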
For FC networks initialized with Normal distributions and support 0.5, we report results from a batch of experiments whose dynamics are worth discussing: the networks with the highest accuracy (Figures 7, 8, 9, top row, blue PDFs) significantly deviate from the behavior of the preceding populations, as they exhibit restrained variance. Similar observations, though targeting different research questions, have been made by Shwartz-Ziv and Tishby [16], who analyse the mean and variance of a network's weights at different phases of learning, with the training epochs as the control parameter (the reader can refer to Figures 4 and 8 in [16]). While so far patterns in the learning dynamics have been observed only for single neural instances, our approach permits appreciating these at a global level. The same analysis conducted on a population of CNNs evidences similar trends for all three metrics we evaluate, as sketched in Figures 2, 3, 4. Highly discriminant, monotonic behaviors of the PDFs are especially evident when analysing the Nodes Strength, where the Convolutional layers (Figure 3, the first 3 plots of the respective rows) are characterized by fat positive tails,
a hint that in a population of networks initialized with that configuration, which we recall is the same for all the networks, very different yet highly accurate models coexist. We complement the qualitative analysis exposed so far with a statistical quantification of the skewness and kurtosis of the PDFs (Figures 10 and 5).

V-B2 Trends in Individual Analysis
We train a 6-layer CNN (4 Convolutional + 2 dense layers) on CIFAR10, reaching different levels of accuracy, and we show that: (i) CNT is both a local and a global method, i.e., it shows trends in the training dynamics for both populations and individual neural networks; (ii) it is not necessary to have models with strong performance to spot salient behaviors. We observe that, in continuity with the ensemble analysis, the Nodes Strength of the Convolutional layers monotonically decreases (Figure 6, second row). For accurate models, we observe the emergence of numerous spikes in the PDFs of the Convolutional layers, a hint that specific input pixels strongly activate a few of the network's receptive fields, while the majority of the weights are less stimulated. Notably, FC layers do not show this motif but tend to assume smoother and more regular shapes (see the previous discussion and Figures 3 and 8). As regards the individual analysis of the Layers Fluctuation (Figure 6, bottom row), we provide an alternative (to the PDFs) yet equivalent visualization method based on error bars. Error bars are useful to visualize Layers Fluctuations when fewer data points are available, as in the case of the individual analysis, where fewer instances (of the same model) can be trained.
As the model reaches its maximum level of accuracy, the Layers Fluctuation grows monotonically, with the exception of the fifth layer, which connects the second dense layer to the third. The result is consistent with what we observe in the ensemble analysis, thus confirming that, across different architectures and tasks, accurate networks generally exhibit stronger Layers Fluctuation.
VI Conclusions
We have presented a new framework to interpret DNNs through CNT. We study the learning phase of Fully Connected (FC) and Convolutional Neural Networks (CNNs) leveraging three metrics: Link-Weights, Nodes Strength, and Layers Fluctuation. Nodes Strength and Layers Fluctuation are ad-hoc metrics that complement the Link-Weights analysis, evidencing previously unobserved behaviours of the networks. With CNT, we pursue two complementary approaches: the former on a population of networks initialized with different strategies (ensemble analysis), the latter on different snapshots of the same instance (individual analysis). As FCs and CNNs become accurate on the learning task, the Probability Density Functions (PDFs) of Nodes Strength and Layers Fluctuation gradually flatten out. This analysis allows us to discriminate between low- and high-performing networks and their learning phases.
As future work, we will extend the CNT framework to different and/or more complex architectures, such as Recurrent Neural Networks and Attention-based models. Another viable extension would encompass diverse tasks, such as NLP and signal processing.
Table I.A: Ensemble Analysis - MNIST Dataset
Analysis     | Network | Max Accuracy | Layers | Scaling Factor | Init Method | Figure
Link-Weights | FC      | -            | 4      | 0.5, 0.05      | normal      | 7
Strength     | FC      | -            | 4      | 0.5, 0.05      | normal      | 8
Fluctuation  | FC      | -            | 4      | 0.5, 0.05      | normal      | 9
Link-Weights | CNN     | -            | 4      | 0.5, 0.05      | uniform     | 2
Strength     | CNN     | -            | 4      | 0.5, 0.05      | uniform     | 3
Fluctuation  | CNN     | -            | 4      | 0.5, 0.05      | uniform     | 4
Table I.B: Individual Analysis - CIFAR10 Dataset
Analysis     | Network | Max Accuracy | Layers | Scaling Factor | Init Method | Figure
Link-Weights | CNN     | -            | 6      | 0.05           | normal      | 6
Strength     | CNN     | -            | 6      | 0.05           | normal      | 6
Fluctuation  | CNN     | -            | 6      | 0.05           | normal      | 6
References
 [1] Boccaletti, S., Latora, V., Moreno, Y., Chavez, M., Hwang, D.: Complex networks: Structure and dynamics. Physics Reports. 424, 175–308 (2006). https://doi.org/10.1016/j.physrep.2005.10.009.
 [2] Chavez, M., Valencia, M., Navarro, V., Latora, V., Martinerie, J.: Functional modularity of background activities in normal and epileptic brain networks. Physical review letters, 104(11), 118701.
[3] Crucitti, P., Latora, V., Marchiori, M.: A topological analysis of the Italian electric power grid. Physica A: Statistical Mechanics and its Applications 338(1-2), 92-97 (2004).
[4] Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS) (2010).
[5] Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press (2016). http://www.deeplearningbook.org.
 [6] Hanin, B.: Universal Function Approximation by Deep Neural Nets with Bounded Width and ReLU Activations. Mathematics. 7, 992 (2019). https://doi.org/10.3390/math7100992.
[7] Katz, G., Huang, D.A., Ibeling, D., Julian, K., Lazarus, C., Lim, R., Shah, P., Thakoor, S., Wu, H., Zeljić, A., Dill, D.L., Kochenderfer, M.J., Barrett, C.: The Marabou Framework for Verification and Analysis of Deep Neural Networks. In: Dillig, I., Tasiran, S. (eds.) Computer Aided Verification, pp. 443-452. Springer International Publishing, Cham (2019). https://doi.org/10.1007/978-3-030-25540-4_26.

[8] Krizhevsky, A., Nair, V., Hinton, G.: CIFAR-10 (Canadian Institute for Advanced Research) (2010). http://www.cs.toronto.edu/kriz/cifar.
[9] LeCun, Y., Bengio, Y.: Convolutional networks for images, speech, and time-series. In: Arbib, M.A. (ed.) The Handbook of Brain Theory and Neural Networks. MIT Press (1995).
[10] LeCun, Y., Cortes, C., Burges, C.J.C.: The MNIST dataset of handwritten digits (images) (1999).
 [11] Lee, S.H., Kim, P.J., Ahn, Y.Y., Jeong, H.: Googling Social Interactions: Web Search Engine Based Social Network Construction. PLoS ONE. 5, e11233 (2010). https://doi.org/10.1371/journal.pone.0011233.
 [12] Montavon, G., Samek, W., Müller, K.R.: Methods for interpreting and understanding deep neural networks. Digital Signal Processing. 73, 1–15 (2018). https://doi.org/10.1016/j.dsp.2017.10.011.
[13] Petri, G., Musslick, S., Dey, B., Ozcimder, K., Turner, D., Ahmed, D., Willke, T., Cohen, J.: Topological limits to the parallel processing capability of network architectures. arXiv:1708.03263 [q-bio] (2021).
[14] Porta, S., Crucitti, P., Latora, V.: The network analysis of urban streets: A dual approach. Physica A: Statistical Mechanics and its Applications 369(2), 853-866 (2006).
[15] Saxe, A.M., McClelland, J.L., Ganguli, S.: Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv:1312.6120 [cond-mat, q-bio, stat] (2014).
[16] Shwartz-Ziv, R., Tishby, N.: Opening the black box of deep neural networks via information. arXiv:1703.00810 [cs] (2017).
[17] Testolin, A., Piccolini, M., Suweis, S.: Deep learning systems as complex networks. Journal of Complex Networks 8(1) (2020). https://doi.org/10.1093/comnet/cnz018.
 [18] Weng, T.W., Zhang, H., Chen, P.Y., Yi, J., Su, D., Gao, Y., Hsieh, C.J., Daniel, L.: Evaluating the Robustness of Neural Networks: An Extreme Value Theory Approach. arXiv:1801.10578 [cs, stat]. (2018).
 [19] Zambra, M., Maritan, A., Testolin, A.: Emergence of Network Motifs in Deep Neural Networks. Entropy. 22, 204 (2020). https://doi.org/10.3390/e22020204.
 [20] Zhang, H., Chen, H., Xiao, C., Gowal, S., Stanforth, R., Li, B., Boning, D., Hsieh, C.J.: Towards Stable and Efficient Training of Verifiably Robust Neural Networks. arXiv:1906.06316 [cs, stat]. (2019).