I Introduction
Recent events have underscored the increasing importance of the Internet to our civilization, necessitating a scientific understanding of this virtual universe [20, 28]. The pandemic induced a drastic increase in Internet usage due to activities such as remote education and work, streaming, social media, online shopping, and video games [4, 6, 12, 34]. However, with this rise of online activity comes a potential surge of problematic Internet usage that is made more urgent by the rising influence of adversarial Internet robots (botnets) on society [5, 22, 19, 1]. Thus, for scientific, economic, and security reasons, observing, analyzing, and modeling the Internet is essential.
The two largest efforts to capture, curate, and share Internet packet traffic data for scientific analysis are the Widely Integrated Distributed Environment (WIDE) project [14] and the Center for Applied Internet Data Analysis (CAIDA) [16]. These data have supported a variety of research projects resulting in hundreds of peer-reviewed publications [21], ranging from characterizing the global state of Internet traffic, to specific studies of the prevalence of peer-to-peer file sharing, to testing prototype software designed to stop the spread of Internet worms. More recently, novel analysis of trillions of network observations enabled by high performance sparse matrix mathematics and interactive supercomputing has reinforced the ubiquity of power-law distributions in these observations and enabled precision measurements of the distribution parameters [18, 35, 25, 24]. The increased accuracy of these measurements allows new underlying generative network models to be explored.
Prior modeling studies primarily relied on collecting data through web crawls that naturally produced connected power-law networks [26, 3, 7, 33]. The foundational preferential attachment (PA) random network model emerged naturally from these observations [32, 17, 10, 9, 37]. PA models rely on the connectivity of a large component to generate subsequent iterations of the model [8, 2, 15, 11]. These large components, and other network cores, have been studied extensively [40, 39, 32, 25]. WIDE, CAIDA, and other Internet observatories provide different vantage points that can see complete streams of traffic (Figure 1), revealing significant rare leaves of connected components as well as entirely unattached links (Figure 2) that create deviations from traditional power-law models (Figure 3).
At a qualitative level, it is suspected that many of these leaves and unattached links are formed by bot traffic. These connections often behave in unpredictable ways and tend to form other links only with similar (bot-like) connections. Looking beyond the noise from these bots, the predictable connections still grow and behave in a manner that satisfies the preferential attachment model. Crucially, the Internet traffic observed in the collection pipeline is distinct from the actual long-term traffic network of the Internet. Rather, the observed traffic is a random subnetwork of the actual network of “who in general feels like talking with whom”. Even though over large time scales big chunks of Internet traffic might look like preferential attachment, the behaviors seen by the data collection methods will be a random subnetwork of the underlying model.
Our approach to understanding this subnetwork is a new model that considers the time-iterative selection of random subnetworks of a larger underlying network. This model accounts for the randomness of the unattached links and improves our understanding of the Internet beyond a preferential attachment setting. This paper introduces and explains the new five-parameter PALU (PA + Leaves + Unattached links) model, presents a basic analysis of the model including its degree distribution and variability, and concludes with results and opportunities for future work.
II Network Observations
The stochastic network structure of Internet traffic is a core property of great interest to Internet stakeholders and network scientists. Of particular interest is the probability distribution p(d), where d is the degree (or count) of one of several network quantities depicted in Figure 1: source packets, source fan-out, packets over a unique source-destination pair (or link), destination fan-in, and destination packets [24]. Amongst the earliest and most widely cited results of virtual Internet topology analysis has been the observation of the power-law relationship p(d) ∝ 1/d^α, with a model exponent α, for large values of d [7, 3, 27]. In our work, network topology refers to the network-theoretic virtual topology of sources and destinations and not the underlying physical topology of the Internet. These early observations demonstrated the importance of a few supernodes in the Internet (see Figure 2) [13]. Measurements of power-laws in Internet data stimulated investigations into a wide range of network phenomena in many domains and laid the foundation for the field of network science [9].
Classification of Internet phenomena is often based on data obtained from crawling the network from a number of starting points [33]. These web crawls naturally sample the supernodes of the network [13], and the resulting distributions are accurately fit at large values of d by single-parameter power-law models. Characterizing a network by a single power-law exponent provides one view of Internet phenomena, but more accurate and complex models are required to understand the diverse topologies seen in streaming samples of the Internet.
At a given time t, N_V consecutive valid packets are aggregated from the traffic into a sparse matrix A_t, where A_t(i, j) is the number of valid packets between source i and destination j [31]. The sum of all the entries in A_t is equal to N_V.
All the network quantities depicted in Figure 1 can be readily computed from A_t using the formulas listed in Table I. An essential step for increasing the accuracy of the statistical measures of Internet traffic is using windows with the same number of valid packets N_V. Using packet windows with the same number of valid packets produces aggregates that are consistent over a wide range of window sizes. While weights are important to study, in this initial work we study the unweighted model. The common weights to study subsequently could be the number of packets or number of bytes sent over a link.
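As a concrete sketch of this aggregation step (ours, not the authors' pipeline; the synthetic packet stream, node count, and window size are illustrative assumptions), a window of N_V packets can be accumulated into a sparse matrix and the Table I aggregates read off directly:

```python
# Sketch: aggregate a window of N_V valid packets into a sparse
# traffic matrix A_t and compute the aggregate network quantities.
# The packet stream here is synthetic and purely illustrative.
import numpy as np
from scipy.sparse import coo_matrix

rng = np.random.default_rng(0)
n_nodes = 1_000
N_V = 2**13  # number of valid packets per window

# Hypothetical (source, destination) pairs for one window of traffic.
src = rng.integers(0, n_nodes, size=N_V)
dst = rng.integers(0, n_nodes, size=N_V)

# A_t(i, j) = number of valid packets from source i to destination j.
# Converting COO -> CSR sums duplicate (i, j) entries into counts.
A = coo_matrix((np.ones(N_V), (src, dst)), shape=(n_nodes, n_nodes)).tocsr()

valid_packets = int(A.sum())                         # sum of all entries
unique_links = A.nnz                                 # nonzero entries
unique_sources = int((A.sum(axis=1) > 0).sum())      # rows with traffic
unique_destinations = int((A.sum(axis=0) > 0).sum()) # columns with traffic
```

Fixing the window size in packets (rather than in time) is what makes these aggregates comparable across windows.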
TABLE I: Formulas for computing aggregates from a sparse network image A_t at time t in both summation and matrix notation. 1 is a column vector of all 1's, ^T is the transpose operation, and | . |_0 is the zero-norm that sets each nonzero value of its argument to 1 [23].

Aggregate property   | Summation notation     | Matrix notation
Valid packets N_V    | Σ_i Σ_j A_t(i,j)       | 1^T A_t 1
Unique links         | Σ_i Σ_j |A_t(i,j)|_0   | 1^T |A_t|_0 1
Unique sources       | Σ_i |Σ_j A_t(i,j)|_0   | 1^T |A_t 1|_0
Unique destinations  | Σ_j |Σ_i A_t(i,j)|_0   | |1^T A_t|_0 1
II-A Logarithmic Pooling
A network quantity d computed from A_t produces a corresponding histogram denoted by n_t(d), with corresponding probability

p_t(d) = n_t(d) / Σ_d n_t(d)

and cumulative probability

P_t(d) = Σ_{d' ≤ d} p_t(d').

Due to the relatively large values of d observed for a single supernode, the measured probability at large d often exhibits large fluctuations. However, the cumulative probability lacks sufficient detail to see variations around specific values of d, so it is typical to pool the differential cumulative probability with logarithmic bins in d,

D_t(d_i) = P_t(d_i) − P_t(d_{i−1}),

where d_i = 2^i [17]. All computed probability distributions use the same binary logarithmic pooling (binning) to allow for consistent statistical comparison across data sets [17, 9].
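The binary logarithmic pooling described above can be sketched as follows (our illustration; the heavy-tailed degree sample is synthetic, not the paper's data):

```python
# Sketch of binary logarithmic pooling (binning): pool a degree
# histogram into bins [2^i, 2^(i+1)) so that sparse, fluctuating
# counts at large degree are smoothed. Degree sample is synthetic.
import numpy as np

rng = np.random.default_rng(1)
degrees = rng.zipf(2.2, size=10_000)       # heavy-tailed degree sample

# Bin edges at powers of two: 1, 2, 4, 8, ...
n_bins = int(np.log2(degrees.max())) + 2
edges = 2.0 ** np.arange(n_bins + 1)
counts, _ = np.histogram(degrees, bins=edges)

# Pooled probability mass per logarithmic bin.
p_pooled = counts / counts.sum()
```

Every distribution in the paper is pooled with the same binary bins, so fitted parameters remain comparable across data sets.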
The corresponding mean and standard deviation of the pooled probability over many different consecutive values of t for a given data set are also computed.

II-B Modified Zipf-Mandelbrot Model
Measurements of these distributions can reveal many properties of network traffic, such as the fraction of nodes with only one connection and the size of the supernode
(1) 
Effective classification of a network with a low-parameter model allows these and many other properties to be summarized and computed efficiently. In the standard Zipf-Mandelbrot model typically used in linguistic contexts, d is a ranking, with d = 1 corresponding to the most popular value [29, 30, 36]. To accurately classify the network data using the full range of d, the Zipf-Mandelbrot model is modified so that d is a measured network quantity instead of a rank index. The inclusion of a second model offset parameter δ allows the model to accurately fit small values of d, in particular d = 1, which has the highest observed probability in these streaming data. The model exponent α has a larger impact on the model at large values of d, while the model offset δ has a larger impact at small values of d, and in particular at d = 1.
The unnormalized modified Zipf-Mandelbrot model is denoted

ρ(d; α, δ) = 1/(d + δ)^α

with corresponding gradient

∂ρ(d; α, δ)/∂δ = −α/(d + δ)^(α+1) = −α ρ(d; α, δ)/(d + δ).

The normalized model probability is given by

p(d; α, δ) = ρ(d; α, δ) / Σ_{d'=1}^{d_max} ρ(d'; α, δ),

where d_max is the largest value of the network quantity d. The cumulative model probability is the sum

C(d; α, δ) = Σ_{d'=1}^{d} p(d'; α, δ).

The corresponding differential cumulative model probability is

D(d_i; α, δ) = C(d_i; α, δ) − C(d_{i−1}; α, δ),

where d_i = 2^i.
Minimizing the differences between the observed and model differential cumulative distributions allows accurate Zipf-Mandelbrot parameters to be selected for a given set of observations. Figure 3 provides a representative sample of the hundreds of fits from [24], illustrating the effectiveness of the Zipf-Mandelbrot model in describing these observations.
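An illustrative parameter fit is sketched below (ours, not the authors' exact procedure; for simplicity it least-squares fits the pointwise probabilities rather than the pooled differential cumulative form, and the "observed" data is itself model-generated):

```python
# Sketch: select modified Zipf-Mandelbrot parameters (alpha, delta)
# by least squares against an observed probability distribution.
import numpy as np
from scipy.optimize import minimize

def zm_prob(d, alpha, delta):
    # Normalized modified Zipf-Mandelbrot: p(d) proportional to 1/(d+delta)^alpha.
    rho = 1.0 / (d + delta) ** alpha
    return rho / rho.sum()

d = np.arange(1, 1001, dtype=float)
observed = zm_prob(d, 2.0, 0.5)      # stand-in for a measured distribution

def loss(theta):
    return np.sum((zm_prob(d, *theta) - observed) ** 2)

res = minimize(loss, x0=[1.5, 1.0], bounds=[(0.5, 4.0), (0.0, 10.0)])
alpha_hat, delta_hat = res.x
```

With exact model data, the optimizer recovers the generating parameters; with real pooled data, the same loss is evaluated over the logarithmic bins instead.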
III PA + Leaves + Unattached (PALU) Model
The effectiveness of the empirical Zipf-Mandelbrot model drives the need to explore new underlying generative network models that can produce these more complex probability distributions. The new model extends the PA model with explicit terms accounting for the leaves attached to the PA core as well as for unattached links.
The model can be conceptualized in two parts: the underlying network and the observed network. The underlying network can be considered as the “true” record of which traffic connections occur and how frequent those connections are. With our current methods, the full image of this network cannot be detected, and our consideration of this underlying network is theoretical. Three main pieces make up this network: the core, which is constructed by preferential attachment; a set of degree-1 nodes called leaves that are adjacent to nodes in the core; and unattached nodes that are not connected to the core and have very little connectivity within their own set.
Looking through a window of a certain size with respect to the number of observed connections (packets), a random subnetwork of the underlying network is witnessed, which we call the observed network. In this particular data set, we observe all traffic that passes from one gateway to another until a given number of connections are observed, as described in Section II. For further details on the collection of the data, see [24]. Conceptually, we can consider selecting this subnetwork by randomly deleting edges and nodes of the underlying network.
In reality these edge connections are directed, since Internet communication is directional; however, for the sake of the model we consider them undirected. Using a directed model has a small impact on the overall degree distribution analysis [7]. We leave this discussion for later in the paper.
A web crawl is more likely to detect nodes in the core, but less likely to detect the leaves and the unattached nodes. The measurements taken by MAWI, CAIDA, and others through trunk-line observations of trillions of connections reveal the leaves and unattached nodes included in this new model [24].
III-A Defining the Model
The PALU model requires parameters of four kinds: the clustering of the unattached nodes, the proportions of nodes in each section of the underlying network, the preferential attachment exponent of the core in the underlying network, and the size of the window sampled from the underlying network.

Let λ define the average degree of the unattached nodes in the underlying network.

Three parameters represent the proportions of nodes in each of the core, the leaves, and the unattached portion of the underlying network, conforming to the relationship that the three proportions sum to 1.

To describe the preferential attachment of the core, we require the parameter α as the exponent of the power-law decay of the degree distribution.

The window size is described by the parameter p as the proportion of the underlying network that is being observed. As the window size increases, p gets closer to 1. Specifically, this parameter is the probability that an edge in the underlying network will appear (be selected) in the observed network.
The model is thus completely determined by five free parameters, with the relationship in (1) determining the sixth. Importantly, for a given network, all parameters other than p should be the same regardless of the window size. As the window size increases, the only parameter that changes is p, as it becomes more likely to see each edge.
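The construction above can be sketched concretely as follows. This is our illustration, not the authors' code; all sizes and parameter values (n_core, n_leaves, n_stars, lam, p) are arbitrary assumptions:

```python
# Minimal PALU-style generator: build an underlying network from a
# preferential-attachment core, degree-1 leaves on the core, and
# unattached stars; then observe a random subnetwork by keeping each
# edge independently with probability p.
import numpy as np

rng = np.random.default_rng(3)
n_core, n_leaves, n_stars = 500, 200, 150
lam, p = 1.5, 0.5

# PA core (m = 1): each new node attaches to an endpoint chosen
# uniformly from all previous edge endpoints, a degree-proportional choice.
edges, endpoints = [(0, 1)], [0, 1]
for v in range(2, n_core):
    u = endpoints[rng.integers(len(endpoints))]
    edges.append((v, u))
    endpoints += [v, u]

next_id = n_core
for _ in range(n_leaves):            # degree-1 leaves on random core nodes
    edges.append((next_id, int(rng.integers(n_core))))
    next_id += 1

for _ in range(n_stars):             # unattached stars, Poisson(lam) leaves each
    center, next_id = next_id, next_id + 1
    for _ in range(rng.poisson(lam)):
        edges.append((center, next_id))
        next_id += 1

# Observed network: Bernoulli(p) sampling of the underlying edges.
observed = [e for e in edges if rng.random() < p]
```

Rerunning the sampling step with larger p models a larger observation window over the same underlying network, which is exactly the invariance described above.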
IV Preliminary Analysis
In the following, we reference the Riemann zeta function, the well-known function given by

ζ(s) = Σ_{n=1}^{∞} 1/n^s.
In this paper, the exponent α was determined experimentally from the log-log plot of the degree distribution in [24]. The zeta function is supported in MATLAB with the built-in function zeta(x), which assists in the probabilistic analysis.
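The same normalizing constant can be reproduced outside MATLAB; the short check below (ours) compares scipy's zeta against a direct partial sum. The exponent value 2.1 is illustrative, not the fitted value from the data:

```python
# Check scipy.special.zeta against a direct partial sum of 1/n^alpha.
import numpy as np
from scipy.special import zeta

alpha = 2.1  # illustrative exponent, not the fitted value
partial = np.sum(1.0 / np.arange(1.0, 200_000.0) ** alpha)

# zeta(alpha, 1) is the Hurwitz zeta at q = 1, i.e., the Riemann zeta.
assert abs(zeta(alpha, 1) - partial) < 1e-3
```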
Given a window size p, the fraction of nodes in the underlying network that we expect to see in our observed network can be expressed in terms of the model parameters.
Our model predicts the following about the observed network, where d is taken to be some integer greater than 1:
The last two approximations are very good when d is large, in which case we have a power-law form whose constant prefactor is independent of d. This provides an effective estimation for α via linear regression in a log-log plot.

IV-A Logarithmic Pooling
In order to consider a selection of nodes over a logarithmically binned degree interval, simply take the cumulative sum of the corresponding estimate over all the values of d within the interval. For example, for the number of nodes of degree between 2^k and 2^(k+1), we have the corresponding summation. Taking logs, where the constant term does not depend on k, this is approximately linear in k.
Importantly, taking intervals of degrees for large d, a log-log plot will have a regression line of slope −(α − 1), and not −α as it would be in the non-interval case. When d is small (e.g., d = 1), the approximation for the number of nodes between degree 2^k and degree 2^(k+1) becomes more delicate, and the right-hand side must be computed by summing up all the terms, carefully approximating the leading terms. This differs from the large-d estimate in that, for large d, we can discard small terms on the right-hand side and safely estimate the sum by an integral. In both Fig. 3 and Fig. 4 our parameters are given for the underlying probability distribution, but we are plotting the differential cumulative distribution, which results in a power-law exponent one unit higher [38].
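The slope shift under pooling can be checked numerically; the following sketch is ours, with an illustrative exponent, and relies only on the standard fact that octave-pooled power-law counts regress with an exponent one unit shallower:

```python
# Numerical check: for idealized power-law counts n_d ~ d^(-alpha),
# the octave-pooled totals regress against the bin index k with slope
# -(alpha - 1), one unit shallower than the pointwise slope -alpha.
import numpy as np

alpha = 2.5
d = np.arange(1, 2**16, dtype=float)
n_d = d ** (-alpha)                    # idealized degree counts

edges = 2.0 ** np.arange(0, 17)        # octave bins [2^k, 2^(k+1))
pooled, _ = np.histogram(d, bins=edges, weights=n_d)

# Fit log2(pooled count) against bin index over the large-d bins,
# where the integral approximation is accurate.
k = np.arange(len(pooled))
slope = np.polyfit(k[6:], np.log2(pooled[6:]), 1)[0]
```

The fitted slope comes out near −1.5 here, i.e., −(α − 1) rather than −α, matching the discussion above.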
IV-B Simplified Degree Distributions
The ratios for distinct degrees can be simplified using the following parameters
(2)  
(3)  
(4) 
where the constants in (2)–(4) do not depend on d.
Three of these constants are proportional to the number of nodes in the core, the unattached portion, and the leaves, respectively. These three constants do depend on the parameter p, related to the window size. The remaining parameter depends on p as well and is related to the clustering. The exponent constant is the same as in the section above. All of the parameters above should be positive.
To fit these parameters, all of which are in terms of the discrete model parameters, we can consider the following procedure.

(a) One first fits (4) to the tail behavior of the degree distribution at large d. A log-log plot will have a linear behavior whose slope and constant term, as discussed in the following section, determine the parameters of (4).

(b) Subsequently, (3) is fit to small values of d, estimating its parameters.

(c) It is then possible to solve for the remaining constant exactly by using (2).
In (b) it is possible to fit the parameters by subtracting the constant term from both sides of (3); summing up both sides then gives a more robust estimate than the pointwise estimates of (3).
After having computed the constants from (4), the corresponding summation computed from the data should roughly equal its predicted value. Computing the two summations on the left-hand side from the data can then be used to approximate λ by numerically solving the above.
The advantage of this methodology is that it presumably reduces the estimate to one with substantially less variance. Estimating the right-hand side analytically would provide roughly accurate estimates for large λ, but for small λ the estimates degrade, as can be seen by expanding its Taylor series. After having estimates of the other parameters, (3) can be used to estimate the remaining one by a linear regression. A benefit of the efficacy of the model is that an arbitrary choice of the proportion parameters would not lead to an inaccurate computation: in choosing these variables arbitrarily, the network would maintain its structural topology for large d, since the fitted parameters mostly determine the structure. The model would be even more accurate for large values of d by (3), and for the very few values of d in between, the chosen values could be modified until the desired results were acquired.
V Derivation of the Estimates
The estimates for the simplified model follow immediately from the corresponding estimates in the refined model. For the derivation, the aim is to count nodes rather than the ratios determined by the proportion variables. We count the number of nodes in the underlying network, and rigorously distinguish the number of core nodes and the number of leaf nodes; the unattached count is carefully defined later. We construct the observed network by beginning with the underlying network and retaining each edge independently with probability p, creating an Erdős–Rényi-style random subnetwork of the underlying network [15].
The core nodes of the underlying network are assumed to be generated according to a preferential attachment model, independent of the other parts of the underlying network. The number of core nodes of the underlying network having degree d follows a power-law distribution proportional to d^(−α). Any degree-D node in the underlying network will have its degree in the observed network distributed according to the binomial distribution Binomial(D, p), so the number of degree-d core nodes in the observed network is well approximated by the corresponding binomial sum.
To estimate the number of core nodes with degree greater than d in our observed network, we sum these estimates, which we in turn approximate as a Riemann integral, at the cost of a negligible error term. The number of leaf nodes in the observed network has a binomial distribution, and the random variable is approximated by its mean [15].
We now return to the rigorous derivation of the unattached nodes. In this part of the model we generate many stars, each of which has a random number of non-central nodes, given by independent identically distributed Poisson random variables with mean λ. The Poisson random variable models a central node together with a large number of potential leaves, each of which chooses to attach to the central node with some sufficiently low probability. With this distribution, the total number of nodes in the unattached portion of the underlying network is a sum of independent Poisson random variables, which is again Poisson.
However, some of these central nodes will be isolated. As isolated nodes cannot be seen by examining traffic between nodes, we remove them from the model. The fact that the model predicts a nonzero number of isolated nodes gives significant evidence for the existence of such nodes in the true picture of the Internet, even though they cannot be observed with our current methods of data collection.
Estimating the degree sequence within our observed network amounts to the following: if we first sample a Poisson(λ) number of trials and then take the sum of that many independent Bernoulli random variables of mean p, we seek the resulting distribution. This is well known to be simply Poisson(λp), as seen from an elementary computation, and this fact additionally motivated our choice of Poisson random variables.
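This thinning fact can be checked by simulation; the Monte Carlo sketch below is ours, with arbitrary values for λ and p:

```python
# Monte Carlo check of Poisson thinning: if N ~ Poisson(lam) and each
# of N trials independently succeeds with probability p, the number of
# successes is distributed as Poisson(lam * p).
import numpy as np

rng = np.random.default_rng(7)
lam, p, trials = 4.0, 0.3, 200_000

N = rng.poisson(lam, size=trials)
successes = rng.binomial(N, p)        # thinned counts

# Poisson(lam * p) has mean and variance both equal to lam * p.
mean_err = abs(successes.mean() - lam * p)
var_err = abs(successes.var() - lam * p)
```

Both the sample mean and sample variance land on λp, the signature of a Poisson law, which is what makes the Poisson choice for star sizes analytically convenient.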
In summary, the unattached portion of our observed subnetwork is distributed as a collection of stars, each of which has a Poisson number of non-central leaves, from which the results follow immediately by independence and by estimating the terms by their means, which is valid in the limit of a large network.
VI Zipf-Mandelbrot Connection
Finally, a one-parameter (unnormalized) approximation of Equation (3) can be used to compare with the empirical Zipf-Mandelbrot distributions. The approximation introduces a single positive number to be fit to the data, and effectively changes the underlying distribution from Poisson to an equally valid geometric distribution; the resulting distribution can then be rescaled.
The above expressions can be aligned with the Zipf-Mandelbrot parameters by setting
resulting in
(5) 
In Figure 4, the degree distribution is shown with a selection of curve families that demonstrate that the PALU model can be made to fit a Zipf-Mandelbrot distribution. For any given power-law exponent α and offset parameter δ, the Zipf-Mandelbrot distribution can be well approximated by Equation (5) by varying the remaining parameter. In general, the model tends towards Zipf-Mandelbrot. In addition, the model has the potential to explain some observations that deviate from the Zipf-Mandelbrot distribution (see Figure 3, upper right).
Connecting back to the original values of λ and p, the fitted quantities can be expressed in terms of the model parameters. Thus, the model described in Equation (5) captures the PA aspect of the model mostly in the first term, and the rest of the model is captured via the remaining parameters through the second term.
VII Conclusions
In this paper, we presented a modified preferential attachment model to better represent new observations of network topology derived from streaming data. From the figures, we found that the PALU model fits the Zipf-Mandelbrot distribution very well given the right parameters.
With the PALU model as a stepping stone, we can continue to examine the space of network models, including comparing the data from [24] to the model itself, extending the analysis done in Fig. 4. Some opportunities for future work include further investigation of the PALU model, such as its applications in other big-data environments and deeper study of its degree distribution and clustering coefficients. The PALU model research can also extend to the case of weighted edges, where potential weights could be the number of packets or number of bytes sent along a link. Exploration of other models is also possible, such as combining preferential attachment with the Erdős–Rényi model and determining whether there is a better-fitting model than the Zipf-Mandelbrot distribution. It would also be worthwhile to explore the existence and importance of isolated nodes, and to analyze aspects of the network by studying the behaviors of a random sampling or a preferential attachment graph. Finally, extrapolating the results of the PALU model to observe and define the large clusters of small disconnected components may also be of interest.
Acknowledgments
The authors wish to acknowledge the following individuals for their contributions and support: Bob Bond, David Clark, Nora Devlin, Alan Edelman, Jeff Gottschalk, Charles Leiserson, Mimi McClure, Sandeep Pisharody, Steve Rejto, Daniela Rus, Douglas Stetson, Allan Vanterpool, Marc Zissman, and the MIT SuperCloud team: Bill Arcand, Bill Bergeron, David Bestor, Chansup Byun, Vijay Gadepally, Michael Houle, Matthew Hubbell, Michael Jones, Anna Klein, Peter Michaleas, Lauren Milechin, Julie Mullen, Andrew Prout, Antonio Rosa, Albert Reuther, Charles Yee.
References
 [1] (2019) Why botnets persist: designing effective technical and policy interventions. Cited by: §I.
 [2] (2000) A random graph model for massive graphs. In Proceedings of the thirty-second annual ACM Symposium on Theory of Computing, pp. 171–180. Cited by: §I.
 [3] (1999) Internet: diameter of the world-wide web. Nature 401 (6749), pp. 130. Cited by: §I, §II.
 [4] (2020) Online and remote learning in higher education institutes: a necessity in light of COVID-19 pandemic. Higher Education Studies 10 (3), pp. 16–25. Cited by: §I.
 [5] (2017) Social media and fake news in the 2016 election. Journal of Economic Perspectives 31 (2), pp. 211–236. Cited by: §I.
 [6] (2020) Exploring the critical challenges and factors influencing the e-learning system usage during COVID-19 pandemic. Education and Information Technologies 25, pp. 5261–5280. Cited by: §I.
 [7] (1999) Emergence of scaling in random networks. Science 286 (5439), pp. 509–512. Cited by: §I, §II, §III.
 [8] (1999) Emergence of scaling in random networks. Science 286 (5439), pp. 509–512. Cited by: §I.
 [9] (2016) Network Science. Cambridge University Press. Cited by: §I, §II-A, §II.
 [10] (2009) Scale-free networks: a decade and beyond. Science 325 (5939), pp. 412–413. Cited by: §I.
 [11] (2020) The iterated local model for social networks. Discrete Applied Mathematics. Cited by: §I.
 [12] (2020) Impact of the COVID-19 pandemic on the Internet latency: a large-scale study. Computer Networks 182, pp. 107495. Cited by: §I.
 [13] (2009) Identifying high cardinality Internet hosts. In INFOCOM 2009, IEEE, pp. 810–818. Cited by: §II, §II.
 [14] (2000) Traffic data repository at the WIDE project. In Proceedings of USENIX 2000 Annual Technical Conference: FREENIX Track, pp. 263–270. Cited by: §I.
 [15] (2006) Complex Graphs and Networks. American Mathematical Society. Cited by: §I, §V, §V.
 [16] (1999) Internet tomography. Nature, Web Matters. Cited by: §I.
 [17] (2009) Power-law distributions in empirical data. SIAM Review 51 (4), pp. 661–703. Cited by: §I, §II-A.
 [18] (2018) Hyperscaling Internet graph analysis with D4M on the MIT SuperCloud. IEEE High Performance Extreme Computing Conference (HPEC). Cited by: §I.
 [19] (2018) Zero botnets: building a global effort to clean up the Internet. Council on Foreign Relations. Cited by: §I.
 [20] (2011) The world’s technological capacity to store, communicate, and compute information. Science, pp. 1200970. Cited by: §I.
 [21] http://www.caida.org/data/publications/. Cited by: §I.
 [22] https://go.neosit.com/botreport2019. Cited by: §I.

 [23] (2003) Measuring sparseness of noisy signals. In 4th International Symposium on Independent Component Analysis and Blind Signal Separation, pp. 125–130. Cited by: TABLE I.
 [24] (2019) New phenomena in large-scale Internet traffic. arXiv preprint cs.NI/1904.04396. Cited by: §I, Fig. 1, Fig. 2, §II-B, §II, §III, §III, §IV, §VII.
 [25] (2018) Mathematics of Big Data: Spreadsheets, Databases, Matrices, and Graphs. MIT Press. Cited by: §I, §I.
 [26] (1994) On the self-similar nature of Ethernet traffic (extended version). IEEE/ACM Transactions on Networking (ToN) 2 (1), pp. 1–15. Cited by: §I.
 [27] (2005) Graphs over time: densification laws, shrinking diameters and possible explanations. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, pp. 177–187. Cited by: §II.
 [28] (2013) A survey of network flow applications. Journal of Network and Computer Applications 36 (2), pp. 567–581. Cited by: §I.
 [29] (1953) An informational theory of the statistical structure of language. Communication Theory 84, pp. 486–502. Cited by: §II-B.
 [30] (2001) Beyond the Zipf–Mandelbrot law in quantitative linguistics. Physica A: Statistical Mechanics and its Applications 300 (3–4), pp. 567–578. Cited by: §II-B.
 [31] (2010) Community structure in time-dependent, multiscale, and multiplex networks. Science 328 (5980), pp. 876–878. Cited by: §II.
 [32] (2001) Clustering and preferential attachment in growing networks. Physical Review E 64 (2), pp. 025102. Cited by: §I.
 [33] (2010) Web crawling. Foundations and Trends® in Information Retrieval 4 (3), pp. 175–246. Cited by: §I, §II.
 [34] (2020) Impact of digital surge during COVID-19 pandemic: a viewpoint on research and practice. International Journal of Information Management 55, pp. 102171. Cited by: §I.

 [35] (2018) Interactive supercomputing on 40,000 cores for machine learning and data analysis. IEEE High Performance Extreme Computing Conference (HPEC). Cited by: §I.
 [36] (2006) Modeling and caching of peer-to-peer traffic. In Network Protocols, 2006. ICNP’06. Proceedings of the 2006 14th IEEE International Conference on, pp. 249–258. Cited by: §II-B.
 [37] (2018) A preferential attachment paradox: how preferential attachment combines with growth to produce networks with log-normal in-degree distributions. Scientific Reports 8 (1), pp. 2811. Cited by: §I.
 [38] (2008) On estimating the exponent of power-law frequency distributions. Ecology 89 (4), pp. 905–912. Cited by: §IV-A.
 [39] (2017) Reliable multifractal characterization of weighted complex networks: algorithms and implications. Scientific reports 7 (1), pp. 1–22. Cited by: §I.
 [40] (2020) Controlling the multifractal generating measures of complex networks. Scientific reports 10 (1), pp. 1–13. Cited by: §I.