I. Introduction, Rationale, and Related Work
In recent years, we have witnessed an impressive development of machine learning (ML) methods applied in various fields, including network design and management [1, 2]. In fact, the importance of these methods is growing, and it is expected that machine intelligence will become the basis for the recently postulated knowledge plane [3]. The new plane will be an equal resident of communication and computer networks (COMNETs), on a par with switching, routing, topology planning, security adjustment, etc. This way, network operations will be enriched with automated planning and reaction tools based on ML. Some aspects of this attractive vision are already here, including numerous algorithms which learn and apply the effects of learning, with the most prominent group represented by artificial neural networks (ANNs). They have been used to develop numerous applications, for instance related to image or audio processing, where deep learning enables us to effectively recognize patterns, predict future values, classify behaviors, etc.
Unfortunately, one important element of the network knowledge plane is still missing. When we consider image processing, we can see that there are no problems related to the use of various formats or sizes, blurred pictures and the like. Almost any image can be transformed into a format which can be processed by the assumed ML tool. The case with COMNETs is different. We do not have a universal method of representing COMNET structures comparable to those applied in image or sound processing. In this paper, we propose such a general approach by showing how to universally represent COMNETs with ANNs. In consequence, the output ANN representation can be used as a module for various ML applications.
The key issue with COMNET representation is related to its topology, which is typically modeled with graph theory tools. Some approaches to graph representations based on ANNs, not directly focused on COMNETs, were proposed in [4]. Moreover, the chemistry field has recently experienced similar difficulties with universal ANN mapping of the objects it is interested in. The problem of molecule representation has been successfully overcome using message-passing neural networks (MPNNs) [5, 6]. Our solution is based on the same ANN architecture. Additionally, we extend the idea by applying batch normalization and using modern SELU (Scaled Exponential Linear Unit) activation functions, known for self-normalization and good scaling properties [7]. It should be noted that the use of MPNNs was proposed recently to support routing in data networks [8]. While we focus on the global performance of the network, the authors of [8] deal with the local (next-hop) behavior.

II. Application of MPNNs
First, we elaborate on how the message-passing neural architecture can be used for the universal representation of graphs. We then show how to adapt this representation to COMNETs.

II-A. Message-Passing Neural Architecture
Neural message-passing is an ANN architecture originally developed, in the form presented here, for quantum chemistry [6]. A single message-passing neural network is constructed for any graph. As a consequence, the information contained in the graph (including the topology) is compressed into a vector of real numbers. To define the architecture of MPNNs, the following assumptions are made:
(a) Any structure (a molecule, COMNET, etc.) is represented by a digraph $G = (V, E)$. (b) Each vertex/node $v$ is characterized with an arbitrary set of features (related to topology and also operational aspects), represented by a vector $x_v$. (c) Likewise, each arc $(v, w)$ is characterized with an arbitrary set of features, represented by a vector $e_{vw}$. (d) The state of a vertex (or an arc) is described by $h_v$ (or $h_{vw}$, respectively), i.e., an unknown hidden vector. This concept is analogous to the idea of latent variables present in mixture models or hidden states in hidden Markov models. This state is the subject of learning by the ANN. The trained ANN can then be used to find the hidden state for other graphs.
When providing the graph representation, our goal is to find an ANN-based model for a variable $\hat{y}$ describing the whole structure, or for a vector of individual variables ($\hat{y}_v$ or $\hat{y}_{vw}$) related to each vertex (or arc) separately. Such a result, known as a readout of the MPNN, represents useful information provided by the ML mechanism for potential prediction, classification, etc. To realize this goal, it is necessary to perform the forward pass (inference) with Algorithm 1 shown below.
In MPNNs, the inference consists of three operations: (1) message passing, (2) update, and (3) finding the readout. In iteration $t$, the message-passing operation allows the nodes to exchange information in the form of a vector $m_v^{t+1}$. The update operation encodes this information in the hidden state $h_v^{t+1}$. This process resembles the convergence of a routing protocol, and after a number of iterations (here, indicated by $T$, which is of the order of the average shortest path length), each node holds full information from the whole network. These steps represent convergence to the fixed point of a function. There are two such functions: the message function $M$ and the update function $U$. At this point, sparse node and edge features are compressed into a dense representation of the hidden state vectors. This dense representation is fed to the readout part of the MPNN.
Message passing and update are exercised alternately $T$ times according to the formulas given in lines 7–8 of the Algorithm:

$$m_v^{t+1} = \sum_{w \in N(v)} M(h_v^t, h_w^t, e_{vw}), \qquad h_v^{t+1} = U(h_v^t, m_v^{t+1}).$$

Both $M$ (message function) and $U$ (update function) are trainable ANNs. They can be parametrized in different ways, e.g., $M$ may also depend on the node features. Different functions can also be used for incoming and outgoing messages. The hidden representation is initialized as the zero-padded node feature vector: $h_v^0 = [x_v, 0, \ldots, 0]$. During the learning process, these zeroes are likely to be replaced with some meaningful values. The readout is obtained from another ANN, and generally equals $\hat{y} = R(\{h_v^T : v \in V\})$, i.e., it is found on the basis of the stacked node embeddings $h_v^T$. The simplest readout function preserving basic topological relations between node adjacencies (graph isomorphism) takes the following form:

$$\hat{y} = R\Big(\sum_{v \in V} h_v^T\Big), \tag{1}$$

where $R$ is a function approximated by an ANN. Nevertheless, a simple summation is likely to lose a lot of information, and more sophisticated readouts are used in practice [5, 4].
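To make the inference loop concrete, the following is a minimal NumPy sketch of the forward pass described above. The small random linear/tanh layers stand in for the trained message, update, and readout ANNs; the toy 4-node ring graph, dimensions, and all weights are illustrative assumptions, not the trained model from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy graph: adjacency list of a 4-node ring; 4-dimensional hidden states.
neighbors = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
dim = 4

# Node features x_v (2-dim), zero-padded to initialize h_v^0 = [x_v, 0, 0].
x = rng.normal(size=(4, 2))
h = np.concatenate([x, np.zeros((4, dim - 2))], axis=1)

# Random stand-ins for the trainable message, update, and readout ANNs.
W_msg = rng.normal(size=(dim, dim)) * 0.1
W_upd = rng.normal(size=(2 * dim, dim)) * 0.1
W_out = rng.normal(size=(dim, 1))

T = 3  # number of message-passing iterations
for _ in range(T):
    # Message passing: each node aggregates messages from its neighbors.
    m = np.stack([sum(h[w] @ W_msg for w in neighbors[v]) for v in range(4)])
    # Update: encode the aggregated message into the new hidden state.
    h = np.tanh(np.concatenate([h, m], axis=1) @ W_upd)

# Simplest readout (Eq. (1)): an ANN applied to the sum of node embeddings.
y_hat = (np.tanh(h.sum(axis=0)) @ W_out).item()
print(y_hat)
```

After training, only the weights change; the control flow of the forward pass stays exactly as above.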
II-B. MPNN-based Representation of COMNETs
The approach described in the previous subsection is just one of many possible variants of MPNNs. A broad description of different graph neural architectures is presented in [6]. To select a version appropriate to our case, we have to remember that in COMNETs the interest is often focused on links rather than nodes. Then, the MPNN can be defined such that the hidden representation is associated with a link (or a link and a node) and the messages are sent between connected links (i.e., for a link $(v, w)$, all the links incident with $v$ or $w$). If their direction is important, the messages can be differentiated between the two directions of the arcs.
As concerns the message-passing operation, we decided to follow [4] and use a parametrized affine message function to find $m_v^{t+1}$:

$$m_v^{t+1} = \sum_{w \in N(v)} A_{vw} h_w^t + b_{vw}, \tag{2}$$

where the matrices $A_{vw}$ and vectors $b_{vw}$ are produced by matrix- and vector-valued ANNs with SELU activations [7], parametrized by the node and edge features. We avoid gradient vanishing by using the Gated Recurrent Unit (GRU) [9] for the update during the learning process: $h_v^{t+1} = \mathrm{GRU}(h_v^t, m_v^{t+1})$. The GRU is simpler than the popular LSTM cell, but provides good performance, too [10]. Additionally, to reduce the number of parameters we use weight tying for all steps $t$, i.e., the same message and update functions (and thus the same weights) are reused in every iteration.
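A rough NumPy illustration of this message/update step is sketched below. The affine coefficients $A_{vw}$ are fixed random stand-ins for the feature-conditioned ANN outputs (the bias terms $b_{vw}$ are omitted for brevity), and the hand-written GRU cell follows the standard gate equations of [9]; all sizes and weights are assumed for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 4

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_cell(h, m, W, U, b):
    """Minimal GRU update h' = GRU(h, m); W, U, b hold the z/r/n gates."""
    z = sigmoid(m @ W[0] + h @ U[0] + b[0])        # update gate
    r = sigmoid(m @ W[1] + h @ U[1] + b[1])        # reset gate
    n = np.tanh(m @ W[2] + (r * h) @ U[2] + b[2])  # candidate state
    return (1.0 - z) * h + z * n

# One shared (weight-tied) set of GRU parameters for all iterations t.
W = rng.normal(size=(3, dim, dim)) * 0.1
U = rng.normal(size=(3, dim, dim)) * 0.1
b = np.zeros((3, dim))

# Toy 3-node star graph; per-edge affine matrices A_vw (random stand-ins
# for what the feature-conditioned ANN would produce).
neighbors = {0: [1, 2], 1: [0], 2: [0]}
A = {(v, w): rng.normal(size=(dim, dim)) * 0.1
     for v in neighbors for w in neighbors[v]}
h = rng.normal(size=(3, dim))

for _ in range(3):  # weight tying: the same parameters in every iteration
    # Affine message (Eq. (2)): m_v = sum over neighbors of A_vw h_w.
    m = np.stack([sum(A[(v, w)] @ h[w] for w in neighbors[v])
                  for v in range(3)])
    # GRU update: h_v^{t+1} = GRU(h_v^t, m_v^{t+1}).
    h = np.stack([gru_cell(h[v], m[v], W, U, b) for v in range(3)])
print(h.shape)
```

The gating keeps a weighted mix of the old state and the new candidate, which is what prevents the vanishing-gradient problem over the $T$ unrolled iterations.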
The most complex aspect of our architecture relates to the readout. It is obtained with an ANN consisting of three parts: (1) a graph-level embedding, (2) a batch normalization layer, and (3) an inference layer. The graph-level embedding takes the following form:

$$h_G = \sum_{v \in V} \sigma\big(i(h_v^T, x_v)\big) \odot j(h_v^T), \tag{3}$$
where $i$ and $j$ are the learned functions (ANNs), and $\odot$ represents the Hadamard product of matrices. The ANN $j$ learns to map the node embedding into an additive representation, while the ANN $i$ learns to select the most important nodes. Its value is mapped with the sigmoid function $\sigma$ into the range $(0, 1)$. The result reflects the importance of a given node in each dimension of the hidden representation, which is learned during the training process. The calculation of the embedding value is performed such that only the most important node embeddings get multiplied by a number close to 1 and, in consequence, contribute to the final sum determined by Eq. (3). This effect is known as the attention mechanism [11].

Having the graph-level representation, it is now straightforward to add the final two layers of the ANN. The batch normalization layer improves the training process by learning the average and standard deviation of the elements of $h_G$ and normalizing them, so that finally we obtain zero means and unit standard deviations [12]. The inference layer provides the final value of the estimate $\hat{y}$.

III. Example: Delays in Queuing Networks
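The three readout stages can be sketched as follows. The gate network $i$ and the embedding network $j$ are replaced by random linear stand-ins, and the batch statistics are computed over a pretend two-graph batch; everything here is illustrative, not the trained readout.

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Final node embeddings h_v^T and node features x_v for a toy 5-node graph.
h = rng.normal(size=(5, 4))
x = rng.normal(size=(5, 2))
hx = np.concatenate([h, x], axis=1)

# Random linear stand-ins for the learned functions i (gate) and j (embed).
Wi = rng.normal(size=(6, 4)) * 0.5
Wj = rng.normal(size=(4, 4)) * 0.5

# (1) Graph-level embedding (Eq. (3)): sigmoid(i(h_v, x_v)) selects the
# important nodes (attention); j(h_v) gives the additive representation.
gate = sigmoid(hx @ Wi)            # in (0, 1): per-node importance weights
h_G = (gate * (h @ Wj)).sum(axis=0)

# (2) Batch normalization over a batch of graph embeddings (a pretend
# 2-graph batch; at training time statistics come from the real batch).
batch = np.stack([h_G, h_G * 0.5 + 0.1])
bn = (batch - batch.mean(axis=0)) / (batch.std(axis=0) + 1e-5)

# (3) Inference layer: a final affine map produces the scalar estimate.
w_out, b_out = rng.normal(size=4), 0.0
y_hat = bn[0] @ w_out + b_out
print(y_hat)
```

The gated sum keeps the readout permutation-invariant over nodes while still letting the model weight nodes unequally.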
Here, we show that the proposed approach can be used as a powerful performance evaluation tool providing effective calculation of network parameters. The MPNN model can be used to approximate various network performance indicators (e.g., for traffic prediction, anomaly detection, or delay estimation). We decided to show the efficiency of the MPNN prediction for average delays in Jackson networks of queues [13]. Unlike in this example, for real networks the exact relations are frequently unknown or inherently complex, so in practice the application of ML will considerably simplify network operations. However, in the case of the example presented here, there are a few reasons for using well-known analytical relationships: (a) the example is simple and easy to generalize to different domains; (b) the theoretical formulas for the delay are known, i.e., we can analytically find the values used for training and testing (evaluation) and present a clear interpretation of some aspects of the MPNN used.

III-A. Message-Passing Structure for Queuing Networks
A Jackson network is a network of queues with an arbitrary topology. Traffic can enter and leave the network at each node. The external input intensity at node $i$ is given by $\lambda_i^{(0)}$ and the service rate at that node is denoted as $\mu_i$. Within the network, the traffic is routed according to the routing matrix $R = [r_{ij}]$. The routing is random and independent, i.e., every packet is routed randomly to the neighboring nodes or leaves the network according to $R$. The destination of each packet is independent of the previously taken route. To find the average delay in such a network, we start with the traffic balance equations:

$$\lambda_i = \lambda_i^{(0)} + \sum_j r_{ji} \lambda_j. \tag{4}$$
The solution to the system given by Eq. (4) determines the intensity $\lambda_i$ at each node. Then, the average delay in the network can be computed using Little's law according to the following classical formulas:

$$T = \frac{1}{\sum_i \lambda_i^{(0)}} \sum_i \bar{n}_i, \qquad \bar{n}_i = \frac{\lambda_i}{\mu_i - \lambda_i}, \tag{5}$$

where $\bar{n}_i$ is the average queue length in node $i$.
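Since Eqs. (4) and (5) fully determine the label, the ground-truth delay used for training can be computed directly. A short NumPy example for an assumed 3-node network (the arrival rates, service rates, and routing matrix are made-up illustrative values):

```python
import numpy as np

# Toy 3-node Jackson network: external arrivals, service rates, routing.
lam0 = np.array([1.0, 0.5, 0.2])     # external input intensities
mu = np.array([4.0, 3.0, 2.0])       # service rates
R = np.array([[0.0, 0.3, 0.2],       # r_ij: probability of going i -> j
              [0.1, 0.0, 0.4],       # (rows need not sum to 1: the rest
              [0.2, 0.2, 0.0]])      #  is the probability of leaving)

# Traffic balance (Eq. (4)): lambda = lambda0 + R^T lambda, solved directly.
lam = np.linalg.solve(np.eye(3) - R.T, lam0)
assert np.all(lam < mu), "network must be stable (rho_i < 1)"

# Little's law (Eq. (5)): mean queue lengths, then network-wide delay.
n_bar = lam / (mu - lam)             # average number in each M/M/1 node
T = n_bar.sum() / lam0.sum()         # average time spent in the network
print(lam, T)
```

The same routine, applied to randomly generated topologies, yields the training and evaluation labels.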
Let us now assume we do not know the relation given by Eq. (5), and want to use the proposed MPNN to learn the delay from experimental data. To perform such a task, we assume the following regression problem: (a) Node features: the external traffic intensities $\lambda_i^{(0)}$ and service rates $\mu_i$. (b) Edge features are related to routing only: $e_{ij} = r_{ij}$. (c) The predicted readout $\hat{y}$: the sought network delay $T$.
Since the queuing network can be solved analytically, we can provide some insights into why the MPNN fits this task. First, let us assume that $h_i = (\lambda_i, \ldots)$, i.e., the hidden vector explicitly contains the intensities. Second, Eq. (4) can be transformed into a form analogous to that given in lines 7–8 of Algorithm 1: the message to node $i$ is $\sum_j r_{ji} \lambda_j$, and the update adds the external input $\lambda_i^{(0)}$. In the derivation, we use a fixed-point approximation to the solution of Eq. (4), while the nonlinear relationships given by Eq. (5) are approximated by the readout function. This also explains why, contrary to [6], in Eq. (2) we use a different parametrization of the message function.
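This fixed-point view is easy to verify numerically: iterating Eq. (4) in message-passing style converges to the directly solved intensities (the 3-node data below are assumed toy values):

```python
import numpy as np

lam0 = np.array([1.0, 0.5, 0.2])     # external input intensities
R = np.array([[0.0, 0.3, 0.2],       # routing matrix r_ij
              [0.1, 0.0, 0.4],
              [0.2, 0.2, 0.0]])

# Message passing as fixed-point iteration of Eq. (4): the "message" to
# node i is sum_j r_ji * lambda_j; the "update" adds the external input.
lam = np.zeros(3)                    # zero-initialized "hidden state"
for t in range(100):                 # T iterations of lines 7-8
    m = R.T @ lam                    # message passing
    lam = lam0 + m                   # update

exact = np.linalg.solve(np.eye(3) - R.T, lam0)
print(np.max(np.abs(lam - exact)))
```

Because the routing matrix is substochastic, the iteration is a contraction, which mirrors why a moderate number of message-passing steps $T$ suffices in the MPNN.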
III-B. Random Networks
To obtain a sufficient level of confidence in the training results, the MPNN was trained on a high number of different network topologies. We used three types of random networks. (1) The first is the Erdős–Rényi (ER) random graph model [14], where any two nodes are connected with probability $p$. Since such a construction can produce a disconnected graph, we use the largest connected component of that graph. We set the probability $p$ as a function of the number of nodes $n$, and the training set was created using a range of network sizes. (2) The second model is based on Barabási–Albert (BA) random graphs [15], where the construction begins with some connected nodes. Each new node is randomly connected to existing nodes with connection probabilities proportional to their degrees. This model is known to provide graphs with a long-tail nodal degree distribution. We set the number of nodes as a random integer, and each new node is initially connected to two other nodes, to represent primary and backup connections. (3) The last type of topology we tested was based on non-synthetic real networks containing up to 38 nodes. They are taken from the SNDlib collection [16], including the most popular ones: janos-us, janos-us-ca, cost266, germany50. These topologies were retrieved as provided, and used only to evaluate our model.

As well as topologies, other network parameters were also randomized. The external traffic demands $\lambda_i^{(0)}$ were randomly sampled from a uniform distribution and normalized to add up to 1. This normalization stabilizes training without loss of generality, since any traffic distribution can be normalized in this way by changing the time unit. The routing matrix was generated by assuming an equal probability for each route as well as for leaving the network. In this simple model, nodes of a high degree are more likely to route packets towards the inside of the network. On the other hand, low-degree nodes typically route traffic towards the outside of the network. Using the external demands and the routing matrix, the intensities at each node were computed using Eq. (4). Then, node utilizations $\rho_i = \lambda_i / \mu_i$ were selected randomly from a uniform distribution, and the service rates $\mu_i$ were obtained from them. While the service rate is an independent feature, here we decided to base it on the randomized utilizations, since this way we avoid the instabilities appearing when $\lambda_i \geq \mu_i$.

The example MPNNs were trained on a collection of random graphs. During the training, we used an additional test set for periodic testing to avoid overfitting. The final model was evaluated on a separate set of random networks. All three sets were disjoint and the topologies were randomly selected to emphasize the structure-independent learning ability of the proposed MPNN model. The source code is available at [17].
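The data generation procedure described above can be sketched as follows. The concrete connection probability $p(n)$ and the utilization range are assumptions for illustration, since the exact constants are not restated here.

```python
import numpy as np

rng = np.random.default_rng(3)

def er_largest_component(n, p):
    """Erdős-Rényi graph G(n, p); return the adjacency matrix of its
    largest connected component (the construction may be disconnected)."""
    adj = rng.random((n, n)) < p
    adj = np.triu(adj, 1)
    adj = adj | adj.T                          # undirected
    comps, seen = [], set()
    for s in range(n):                         # label components via DFS
        if s in seen:
            continue
        stack, cur = [s], []
        while stack:
            v = stack.pop()
            if v in seen:
                continue
            seen.add(v)
            cur.append(v)
            stack.extend(np.flatnonzero(adj[v]))
        comps.append(cur)
    nodes = max(comps, key=len)
    return adj[np.ix_(nodes, nodes)]

n = 20
adj = er_largest_component(n, 2 * np.log(n) / n)  # p(n): assumed form
k = adj.shape[0]

# Routing: equal probability for each neighbor and for leaving the network,
# so a node of degree d forwards with probability 1/(d+1) per neighbor.
deg = adj.sum(axis=1)
Rm = adj / (deg + 1.0)[:, None]

# External demands: uniform, normalized to sum to 1 (fixes the time unit).
lam0 = rng.random(k)
lam0 /= lam0.sum()

# Node intensities from Eq. (4); service rates from random utilizations
# rho_i in (0, 1), which avoids the instability at lambda_i >= mu_i.
lam = np.linalg.solve(np.eye(k) - Rm.T, lam0)
rho = rng.uniform(0.1, 0.9, size=k)
mu = lam / rho
print(k, lam0.sum(), np.all(lam < mu))
```

Each such sample (topology, features $\lambda^{(0)}, \mu$, routing $R$) plus the analytical delay from Eq. (5) forms one training example.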
III-C. Discussion of Results
Evaluation of the MPNN models trained on random networks shows that, to some extent, the model is independent of the network topology. The best results were obtained for the BA random graphs in both the training and evaluation sets. Having said that, it should be stressed that the MPNN trained on the BA model does not generalize as well as the one trained with the ER model. The evaluation results presented in Tab. I and extended in [17] show that the change from the BA to the ER model in the evaluation set results in a substantial systematic error (measured as the mean squared error, MSE). However, the Pearson correlation coefficient $\rho$ between predictions and true labels indicates a satisfactory level of dependence.
TABLE I: Evaluation results.

Training set model | Evaluation set model | MSE | $R^2$ | $\rho$
ER | ER | 0.0204 | 0.9802 | 0.9937
ER | BA | 0.1251 | 0.9328 | 0.9745
BA | BA | 0.0075 | 0.9923 | 0.9972
BA | ER | 11.62 | — | 0.8494
ER | SNDlib | 0.0777 | 0.9066 | 0.9752
ER | janos-us | 0.0222 | 0.9434 | 0.9884
ER | janos-us-ca | 0.0454 | 0.9215 | 0.9833
ER | cost266 | 0.0374 | 0.9321 | 0.9861
ER | germany50 | 0.2131 | 0.7161 | 0.9434
ER | ER (larger networks) | 0.1420 | 0.9067 | 0.9636

The table contains the worst limits of the 95% confidence intervals, i.e., lower bounds on $R^2$ and $\rho$, and an upper bound on the MSE (the 2.5th percentile for $R^2$ and $\rho$, and the 97.5th percentile for the MSE).
On the other hand, the model trained on the ER networks performs well on both synthetic networks and — more importantly — on real topologies. The reason seems to be related to the fact that the ER random graphs cover a much wider range of graph distributions than the BA models. This is an important result, since the former is not perceived as reflecting the structure of existing COMNETs.
Training characteristics of the ER model are depicted in Fig. 1. The loss for training and testing is almost the same without any regularization techniques being applied. A similar pattern was observed for the BA model, although the final value of the loss was lower. The visible robustness against overfitting is most likely due to a very simple network architecture, with 16 units in the network embedding layer. Such a small network is capable of generalizing even on larger graphs, i.e., those of sizes never seen during the training process. The results for the larger ER networks and for the germany50 SNDlib network (50 nodes) are not as good as for the synthetic evaluation set or for the smaller SNDlib networks, whose sizes fall in the range of most training samples; however, the model still provides high values of the explained variance $R^2$ and correlation $\rho$.

It is especially notable that these results are promising from the perspective of applying MPNNs to transfer learning. In this case, a network once trained on a large dataset of synthetic random topologies can be used as a network feature extractor. The network embedding layer or the final fully connected layer of the MPNN can be used as a dense representation of the COMNET relevant to one's application. Hence, this representation can be used as an input for any ML model designed for a particular COMNET. This approach substantially simplifies the training of new models or the fine-tuning of models for specific problem domains.
IV. Conclusions and Future Work
The large variety of COMNET sizes and topologies makes it nontrivial to construct a general machine learning model of sparse graph structured data. In this paper, we show how to apply stateoftheart graph ANNs (namely, messagepassing neural architecture) to simplify learning from existing COMNETs. We train the model on random graphs and obtain the ANN structure, which — to some extent — is invariant under COMNET size or topology changes.
Numerical evaluations based on predictions of queuing delays in real COMNET topologies gave the best results when learning was performed with ER random graphs, making them good candidates for training samples in future applications of our approach. These applications include networkwise traffic prediction, anomaly detection, and reinforcement learning for network control. The proposed method can be applied to a variety of management cases relevant to contemporary networks, especially those involving programmability (such as Software Defined Networks, SDN) where the power of machine intelligence can be fully embraced.
Acknowledgments
This work was supported by AGH University of Science and Technology grant, under contract no. 15.11.230.400. The research was also supported in part by PLGrid Infrastructure.
References
 [1] P. V. Klaine et al., "A Survey of Machine Learning Techniques Applied to Self-Organizing Cellular Networks," IEEE Communications Surveys and Tutorials, vol. 19, no. 4, pp. 2392–2431, 2017.
 [2] Z. M. Fadlullah et al., "State-of-the-Art Deep Learning: Evolving Machine Intelligence Toward Tomorrow's Intelligent Network Traffic Control Systems," IEEE Communications Surveys and Tutorials, vol. 19, no. 4, pp. 2432–2455, 2017.
 [3] A. Mestres et al., "Knowledge-Defined Networking," ACM SIGCOMM Computer Communication Review, vol. 47, no. 3, pp. 2–10, Sep. 2017.
 [4] F. Scarselli et al., "The Graph Neural Network Model," IEEE Transactions on Neural Networks, vol. 20, no. 1, pp. 61–80, Jan. 2009.
 [5] S. Kearnes et al., "Molecular Graph Convolutions: Moving Beyond Fingerprints," Journal of Computer-Aided Molecular Design, vol. 30, pp. 595–608, Aug. 2016.
 [6] J. Gilmer et al., "Neural Message Passing for Quantum Chemistry," 2017. [Online]. Available: http://arxiv.org/abs/1704.01212
 [7] G. Klambauer et al., "Self-Normalizing Neural Networks," in Proc. NIPS 2017, Dec. 2017.
 [8] F. Geyer and G. Carle, "Learning and Generating Distributed Routing Protocols Using Graph-Based Deep Learning," in Proc. Big-DAMA 2018, Aug. 2018.
 [9] K. Cho et al., "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation," June 2014. [Online]. Available: http://arxiv.org/abs/1406.1078

 [10] J. Chung et al., "Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling," in Proc. NIPS 2014, Dec. 2014.
 [11] Y. Li et al., "Gated Graph Sequence Neural Networks," in Proc. ICLR'16, Apr. 2016.
 [12] S. Ioffe and C. Szegedy, “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift,” Feb. 2015. [Online]. Available: http://arxiv.org/abs/1502.03167
 [13] F. P. Kelly, Reversibility and Stochastic Networks. Cambridge University Press, 2011.
 [14] P. Erdős and A. Rényi, “On Random Graphs I,” Publicationes Mathematicae Debrecen, vol. 6, pp. 290–297, 1959.
 [15] A.-L. Barabási and R. Albert, "Emergence of Scaling in Random Networks," Science, vol. 286, no. 5439, pp. 509–512, Oct. 1999.
 [16] S. Orlowski et al., “SNDlib 1.0—Survivable Network Design Library,” Networks, vol. 55, no. 3, pp. 276–286, May 2010.
 [17] K. Rusek, “net2vec,” https://github.com/krzysztofrusek/net2vec, 2018.