Recent events have underscored the increasing importance of the Internet to our civilization, necessitating a scientific understanding of this virtual universe [20, 28]. The pandemic induced a drastic increase in Internet usage due to activities such as remote education and work, streaming, social media, online shopping, and video games [4, 6, 12, 34]. However, with this rise of online activity comes a potential surge of problematic Internet usage that is made more urgent by the rising influence of adversarial Internet robots (botnets) on society [5, 22, 19, 1]. Thus, for scientific, economic, and security reasons, observing, analyzing, and modeling the Internet is essential.
The two largest efforts to capture, curate, and share Internet packet traffic data for scientific analysis are the Widely Integrated Distributed Environment (WIDE) project and the Center for Applied Internet Data Analysis (CAIDA). These data have supported a variety of research projects resulting in hundreds of peer-reviewed publications, ranging from characterizing the global state of Internet traffic, to specific studies of the prevalence of peer-to-peer filesharing, to testing prototype software designed to stop the spread of Internet worms. More recently, novel analysis of trillions of network observations enabled by high performance sparse matrix mathematics and interactive supercomputing has reinforced the ubiquity of power-law distributions in these observations and enabled precision measurements of the distribution parameters [18, 35, 25, 24]. The increased accuracy of these measurements allows new underlying generative network models to be explored.
Prior modeling studies primarily relied on collecting data through web crawls that naturally produced connected power-law networks [26, 3, 7, 33]. The foundational preferential attachment (PA) random network model emerged naturally from these observations [32, 17, 10, 9, 37]. PA models rely on the connectivity of a large component to generate subsequent iterations of the model [8, 2, 15, 11]. These large components, and other network cores, have been studied extensively [40, 39, 32, 25]. WIDE, CAIDA, and other Internet observatories provide different vantage points that can see complete streams of traffic (Figure 1) that reveal significant rare leaves of connected components as well as entirely unattached links (Figure 2) that create deviations from standard power-law models (Figure 3).
At a qualitative level, it is suspected that many of these leaves and unattached links are formed by bot traffic. These connections often behave in unpredictable manners and tend to form other links only with similar (bot-like) connections. Looking beyond the noise from these bots, the predictable connections still grow and behave in a manner that satisfies the preferential attachment model. Crucially, the Internet traffic observed in the collection pipeline is distinct from the actual long-term traffic network of the Internet. Rather, the observed traffic is a random subnetwork of the actual network of "who in general feels like talking with whom". Even though over large time scales big chunks of Internet traffic may look like preferential attachment, the traffic observed by these data collection methods will be a random subnetwork of the underlying model.
Our approach to understanding this subnetwork is a new model that considers the iterative selection over time of random subnetworks of a larger underlying network. This model accounts for the randomness of the unattached links and improves our understanding of the Internet beyond a preferential attachment setting. This paper covers the introduction and explanation of the new five-parameter PALU (PA + Leaves + Unattached links) model, a basic analysis of the model including its degree distribution and variability, and concludes with the results and opportunities for future work.
II Network Observations
The stochastic network structure of Internet traffic is a core property of great interest to Internet stakeholders and network scientists. Of particular interest is the probability distribution p(d), where d is the degree (or count) of one of several network quantities depicted in Figure 1: source packets, source fan-out, packets over a unique source-destination pair (or link), destination fan-in, and destination packets. Amongst the earliest and most widely cited results of virtual Internet topology analysis has been the observation of the power-law relationship
p(d) ∝ 1/d^α
with a model exponent α for large values of d [7, 3, 27]. In our work, network topology refers to the network theoretic virtual topology of sources and destinations and not the underlying physical topology of the Internet. These early observations demonstrated the importance of a few supernodes in the Internet (see Figure 2). Measurements of power-laws in Internet data stimulated investigations into a wide range of network phenomena in many domains and laid the foundation for the field of network science.
Classification of Internet phenomena is often based on data obtained from crawling the network from a number of starting points. These webcrawls naturally sample the supernodes of the network, and their resulting distributions p(d) are accurately fit at large values of d by single-parameter power-law models. Characterizing a network by a single power-law exponent provides one view of Internet phenomena, but more accurate and complex models are required to understand the diverse topologies seen in streaming samples of the Internet.
At a given time t, N_V consecutive valid packets are aggregated from the traffic into a sparse matrix A_t, where A_t(i, j) is the number of valid packets between the source i and destination j. The sum of all the entries in A_t is equal to N_V:
Σ_{i,j} A_t(i, j) = N_V
All the network quantities depicted in Figure 1 can be readily computed from A_t using the formulas listed in Table I. An essential step for increasing the accuracy of the statistical measures of Internet traffic is using windows with the same number of valid packets N_V. Using packet windows with the same number of valid packets produces aggregates that are consistent over a wide range of window sizes. While weights are important to study, in this initial work we are studying the unweighted model. The common weights to study subsequently could be the number of packets or number of bytes sent over a link.
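For illustration, the Table I aggregates can be computed directly from a packet window; the following Python sketch uses a dictionary of counts in place of the sparse matrix A_t (all names here are ours, not from the paper):

```python
from collections import Counter

def aggregate(packets):
    """Compute the Table I network quantities from a window of
    (source, destination) packet tuples.  A dict of counts stands in
    for the sparse traffic matrix A: A[(i, j)] = packets from i to j."""
    A = Counter(packets)
    src_packets, src_fanout = Counter(), Counter()   # row sums of A and |A|_0
    dst_packets, dst_fanin = Counter(), Counter()    # column sums of A and |A|_0
    for (i, j), w in A.items():
        src_packets[i] += w   # total packets sent by source i
        src_fanout[i] += 1    # number of distinct destinations of i
        dst_packets[j] += w   # total packets received by destination j
        dst_fanin[j] += 1     # number of distinct sources of j
    return A, src_packets, src_fanout, dst_fanin, dst_packets

# A window of 6 valid packets.
packets = [("a", "x"), ("a", "x"), ("a", "y"),
           ("b", "x"), ("b", "x"), ("c", "z")]
A, src_pkts, src_fo, dst_fi, dst_pkts = aggregate(packets)
```

The sum of all entries of A equals the window size, which is what makes fixed-packet windows comparable across time.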
TABLE I: Formulas for computing aggregates from a sparse network image A_t at time t in both summation and matrix notation. Here 1 is a column vector of all 1's, ^T is the transpose operation, and | · |_0 is the zero-norm that sets each nonzero value of its argument to 1.
II-A Logarithmic Pooling
A network quantity d computed from A_t produces a corresponding histogram denoted by n_t(d), with corresponding probability
p(d) = n_t(d) / Σ_d n_t(d)
and cumulative probability
P(d_i) = Σ_{d ≥ d_i} p(d)
Due to the relatively large values of d observed due to a single supernode, the measured probability p(d) at large d often exhibits large fluctuations. However, the cumulative probability lacks sufficient detail to see variations around specific values of d, so it is typical to pool the differential cumulative probability
D(d_i) = P(d_i) − P(d_{i+1})
with logarithmic bins in d_i = 2^i. The corresponding mean and standard deviation of D(d_i) over many different consecutive values of t for a given data set are denoted D̄(d_i) and σ(d_i).
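As a concrete sketch (our own helper, not code from the paper), pooling an empirical degree sample into base-2 logarithmic bins looks like:

```python
from collections import Counter

def log_pooled(degrees):
    """Pool an empirical degree sample into logarithmic bins
    [2^i, 2^(i+1)) and return the probability mass D[i] in each bin."""
    n = Counter(degrees)                  # histogram n(d)
    total = sum(n.values())
    D = [0.0] * max(n).bit_length()       # one bin per power of two
    for d, count in n.items():
        D[d.bit_length() - 1] += count / total   # bit_length - 1 == floor(log2 d)
    return D

degrees = [1, 1, 1, 1, 2, 2, 3, 4, 8, 16]
D = log_pooled(degrees)
```

Pooling trades resolution in d for stability: each bin aggregates enough probability mass that the large-d fluctuations visible in the raw p(d) are smoothed out.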
II-B Modified Zipf-Mandelbrot Model
Measurements of D(d_i) can reveal many properties of network traffic, such as the fraction of nodes with only one connection, p(d = 1), and the size of the supernode, max(d). Effective classification of a network with a low-parameter model allows these and many other properties to be summarized and computed efficiently. In the standard Zipf-Mandelbrot model typically used in linguistic contexts, d is a ranking with d = 1 corresponding to the most popular value [29, 30, 36]. To accurately classify the network data using the full range of d, the Zipf-Mandelbrot model is modified so that d is a measured network quantity instead of a rank index:
p(d; α, δ) ∝ 1/(d + δ)^α
The inclusion of a second model offset parameter δ allows the model to accurately fit small values of d, in particular d = 1, which has the highest observed probability in these streaming data. The model exponent α has a larger impact on the model at large values of d, while the model offset δ has a larger impact on the model at small values of d and in particular at d = 1.
The unnormalized modified Zipf-Mandelbrot model is denoted
ρ(d; α, δ) = 1/(d + δ)^α
with corresponding gradient
∂_δ ρ(d; α, δ) = −α/(d + δ)^(α+1) = −α ρ(d; α + 1, δ)
The normalized model probability is given by
p(d; α, δ) = ρ(d; α, δ) / Σ_{d=1}^{d_max} ρ(d; α, δ)
where d_max is the largest value of the network quantity d. The cumulative model probability is the sum
P(d_i; α, δ) = Σ_{d ≥ d_i} p(d; α, δ)
The corresponding differential cumulative model probability is
D(d_i; α, δ) = P(d_i; α, δ) − P(d_{i+1}; α, δ)
Minimizing the difference between the model and observed differential cumulative distributions allows accurate Zipf-Mandelbrot parameters to be selected for a given set of observations. Figure 3 provides a representative sample of the hundreds of fits, illustrating the effectiveness of the Zipf-Mandelbrot model in describing these observations.
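A minimal version of this fitting idea can be sketched as a brute-force grid search (an illustration only; the actual optimizer, grid resolution, and parameter ranges used for the paper's fits may differ). The model p(d) ∝ 1/(d + δ)^α is pooled into base-2 bins and compared against an observed pooled distribution:

```python
def zm_diff_cum(alpha, delta, dmax, nbins):
    """Pooled probability of the modified Zipf-Mandelbrot model
    p(d) ~ 1/(d + delta)**alpha over logarithmic bins [2^i, 2^(i+1))
    for degrees 1..dmax."""
    rho = [1.0 / (d + delta) ** alpha for d in range(1, dmax + 1)]
    norm = sum(rho)
    D = [0.0] * nbins
    for d in range(1, dmax + 1):
        D[d.bit_length() - 1] += rho[d - 1] / norm
    return D

def fit_zm(observed_D, dmax):
    """Grid search over (alpha, delta) minimizing the L1 distance
    between the model and observed pooled distributions."""
    best = None
    for a in range(10, 31):            # alpha in [1.0, 3.0], step 0.1
        for d in range(0, 51):         # delta in [0.0, 5.0], step 0.1
            model = zm_diff_cum(a / 10, d / 10, dmax, len(observed_D))
            err = sum(abs(m - o) for m, o in zip(model, observed_D))
            if best is None or err < best[0]:
                best = (err, a / 10, d / 10)
    return best[1], best[2]
```

For example, a pooled distribution generated with α = 2.0 and δ = 1.0 is recovered exactly by the search, since the true parameters lie on the grid.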
III PA + Leaves + Unattached (PALU) Model
The effectiveness of the empirical Zipf-Mandelbrot model drives the need to explore new underlying generative network models that can produce these more complex probability distributions. The new model extends the PA model with explicit terms accounting for the leaves attached to the PA core as well as unattached links.
The model can be conceptualized in two parts: the underlying network and the observed network. The underlying network can be considered the "true" information of the traffic connections that occur and how frequent these connections are. With our current methods, the full image of this network cannot be detected, and our consideration of this underlying network is theoretical. There are three main pieces that make up this network: the core, which is constructed by preferential attachment; a set of degree-1 nodes, called leaves, that are adjacent to nodes in the core; and unattached nodes that are not connected to the core and have very low connectivity within the set itself.
Looking through a window of a certain size with respect to the number of observed connections (packets), a random subnetwork of the underlying network is witnessed, which we call the observed network. In this particular set of data, we observe all possible data that passes from one gateway to another until a given number of connections is observed, as described in Section II. Conceptually, we can consider selecting this subnetwork by randomly deleting edges and nodes of the underlying network.
In reality these edge connections are directed, since Internet communication is directional; however, for the sake of the model we will consider them undirected. Using a directed model has a small impact on the overall degree distribution analysis. We will leave this discussion for later in the paper.
A webcrawl is more likely to detect nodes in the core, but less likely to detect the leaves and the unattached nodes. The measurements taken by MAWI, CAIDA, and others through trunkline observations of trillions of connections reveal the leaves and unattached nodes included in this new model.
III-A Defining the Model
The PALU model requires four kinds of parameters: the clustering of the unattached nodes, the proportions of nodes in each section of the underlying network, the preferential attachment exponent of the core in the underlying network, and the size of the sampled network relative to the underlying network.
Let λ define the average degree of the unattached nodes in the underlying network.
The parameters p_c, p_l, and p_u represent the proportions of nodes in each of the core, the leaves, and the unattached portion of the underlying network, conforming to the relationship
p_c + p_l + p_u = 1. (1)
To describe the preferential attachment of the core, we require the parameter α as the exponent of the power-law decay of the core degree distribution.
The window size is described by the parameter q, the proportion of the underlying network that is being observed. As the window size increases, q gets closer to 1. Specifically, q is the probability that an edge in the underlying network will appear (be selected) in the observed network.
The model is completely determined by five parameters, and the relationship in (1) determines the sixth (the remaining proportion). Importantly, for a given network, the parameters λ, α, and the proportions should be the same regardless of the window size. As the window size increases, the only parameter that will change is q, as it is more likely to see more edges.
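The generative procedure can be sketched in code. The following Python sketch is a simplified illustration, not the paper's exact construction: it uses raw node counts rather than proportions, a tree-shaped preferential-attachment core, and our own parameter names.

```python
import math
import random

def poisson(lam, rng):
    """Sample a Poisson(lam) variate (Knuth's product-of-uniforms method)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def palu_sample(n_core, n_leaves, n_stars, lam, q, seed=0):
    """Build an underlying PALU network, then observe it through a
    window by keeping each edge independently with probability q."""
    rng = random.Random(seed)
    edges = [(0, 1)]
    targets = [0, 1]                       # each endpoint listed once per edge,
    for v in range(2, n_core):             # so choice() is degree-proportional
        u = rng.choice(targets)
        edges.append((u, v))
        targets += [u, v]
    node = n_core
    for _ in range(n_leaves):              # degree-1 leaves on the core
        edges.append((rng.randrange(n_core), node))
        node += 1
    for _ in range(n_stars):               # unattached stars, Poisson-many leaves
        center = node
        node += 1
        for _ in range(poisson(lam, rng)):
            edges.append((center, node))
            node += 1
    observed = [e for e in edges if rng.random() < q]
    return edges, observed

underlying, observed = palu_sample(1000, 500, 300, 2.0, 0.5, seed=1)
```

The underlying parameters (the core structure, the node counts, and lam) stay fixed across windows; only q changes with the window size, matching the discussion above.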
IV Preliminary Analysis
In the following, we reference the Riemann zeta function, a well-known function given by
ζ(s) = Σ_{n=1}^{∞} 1/n^s
In this paper, α was determined experimentally from the log-log plot of the degree distribution. The zeta function is supported in MATLAB with the built-in function zeta(x), and the value ζ(α) assists in the probabilistic analysis below.
Given a window size q, we can compute the fraction of nodes in the underlying network that we expect to see in our observed network. Our model predicts, for each integer degree d greater than 1, the expected number of observed nodes of degree d contributed by the core, the leaves, and the unattached portion. These approximations are very good when d is large, in which case we have
log n(d) ≈ −α log d + c
where c is a constant independent of d. This provides an effective estimation for α via linear regression in a log d versus log n(d) plot.
IV-A Logarithmic Pooling
In order to consider a selection of nodes over a logarithmically binned degree interval, simply take the cumulative sum of the corresponding estimate for all the values of d within the interval. For example, letting N_k denote the number of nodes of degree between 2^k and 2^(k+1) − 1, we have the following summation:
N_k = Σ_{d=2^k}^{2^(k+1)−1} n(d)
Taking logs, this is approximately
log2 N_k ≈ (1 − α)k + c
where c is some constant that does not depend on k. Importantly, taking intervals of degrees [2^k, 2^(k+1)) for large k, a plot of log2 N_k against k will have a regression-line slope of 1 − α, and not −α as it would be in the non-interval case. When k is small, the approximation for the number of nodes between degree 2^k and degree 2^(k+1) − 1 must include the leaf and unattached contributions, and the right-hand side can be computed by summing up all the terms directly. This differs from the large-k estimate, where we can discard small terms on the right-hand side and safely estimate the sum by an integral. In both Fig. 3 and Fig. 4 our parameters are given for the underlying probability distribution, but we are plotting the differential cumulative distribution, which results in a power-law exponent one unit higher.
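The slope shift under logarithmic binning can be verified numerically. Assuming exact power-law counts n(d) = d^(−α), pooling into bins [2^k, 2^(k+1)) and regressing log2 of the bin totals against k yields a slope near 1 − α rather than −α (the small-k bins deviate slightly from the asymptote, which is why the tolerance below is loose):

```python
import math

def binned_slope(alpha, dmax=2 ** 16):
    """Least-squares slope of log2(bin total) versus bin index k when
    exact power-law counts n(d) = d**-alpha are pooled into
    logarithmic bins [2^k, 2^(k+1))."""
    bins = {}
    for d in range(1, dmax):
        k = d.bit_length() - 1             # floor(log2 d)
        bins[k] = bins.get(k, 0.0) + d ** -alpha
    xs = sorted(bins)
    ys = [math.log2(bins[k]) for k in xs]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

slope = binned_slope(2.5)   # theory: close to 1 - 2.5 = -1.5
```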
IV-B Simplified Degree Distributions
The ratios of node counts for distinct degrees can be simplified using the following parameters, where A, B, C, and μ are constants that do not depend on d. The constants A, B, and C are each proportional to the number of nodes in the core, the unattached portion, and the leaves respectively. These three constants do depend on the parameter q, related to the window size. The parameter μ depends on q as well and is related to the clustering. The exponent α is the same as in the above section. All of the parameters above should be positive.
To fit these parameters, all of which are expressed in terms of the discrete degree data, we can consider the following procedure.
One first fits (4) to the long-term behavior of the degree distribution. A log-log plot will have a linear behavior whose slope is equal to −α and whose constant term determines A, as we will see in the discussion in the following section. This results in fitting α and A.
Subsequently, (3) is fit to small values of d, estimating B and μ.
It is then possible to solve for C exactly by using (2).
Alternatively, it is possible to fit the parameters by subtracting the A term from both sides of (3); summing up both sides then gives a value of roughly B, which is a more robust estimate than the point-wise estimates of (3).
After having computed α and A from (4), the two summations on the left-hand side can be computed from the data; setting them roughly equal to the corresponding model expression allows μ to be approximated by numerically solving the resulting equation.
The advantage of this methodology is that it presumably reduces the estimate to one with substantially less variance. Estimating the right-hand side analytically would provide roughly accurate estimates for large μ, but for small μ the estimates would degrade to roughly μ itself, as can be seen by expanding its Taylor series. After having estimates of μ, A, and α, (3) can be used to estimate B by a linear regression.
A benefit of the model's efficacy is that an arbitrary choice of B and μ would not lead to an inaccurate computation. In choosing these variables arbitrarily, the network would maintain its structural topology for large d, since the information provided in choosing α will mostly determine the structure. The model would be even more accurate for large values of d by (3), and for the very few values of d in between, the chosen values of B and μ could be modified until the desired results were acquired.
V Derivation of the Estimates
The estimates for the simplified model follow immediately from the corresponding estimates in the refined model. For the derivation, the aim is to count nodes rather than work with the ratios determined by the constants above. Let n be the number of nodes in the underlying network. Rigorously, we let n_c be the number of nodes in the core and n_l be the number of leaf nodes in the underlying network; the number of unattached star centers, n_u, will be carefully defined later. We obtain our observed subnetwork by retaining each edge of the underlying network independently with probability q, creating an Erdős–Rényi random subnetwork of the underlying network.
The core nodes of the underlying network are assumed to be generated according to a preferential attachment model, independent of the other parts of the underlying network. The number of core nodes of the underlying network having degree d follows a power-law distribution of the form A d^(−α). A node of degree d in the underlying network will have its degree in the observed network distributed according to the binomial distribution Bin(d, q); thus, the number of degree-d core nodes in the observed network is well-approximated by summing these binomial contributions over the underlying degrees. To estimate the number of core nodes with degree greater than d in our observed network, we sum these estimates, which we in turn approximate as a Riemann integral, at the cost of a negligible error term. The number of leaf nodes in the observed network has a binomial distribution with mean n_l q and variance n_l q(1 − q), and this random variable is approximated by its mean.
We now return to the rigorous derivation of the unattached nodes. In this part of the model we generate n_u stars, each of which has a random number of non-central nodes, where the numbers of non-central nodes are given by independent identically distributed Poisson random variables with mean λ. The Poisson random variable models the central nodes together with a large number of potential leaves, each of which attaches to a central node with some sufficiently low probability. With this distribution, the total number of nodes in the unattached portion of the underlying network is distributed as n_u plus the sum of n_u independent Poisson(λ) variables.
Since the sum of independent Poisson random variables is again Poisson, this simplifies to n_u + Poisson(n_u λ).
However, of these central nodes, roughly n_u e^(−λ) of them will be isolated nodes. As these cannot be seen by examining traffic between nodes, we remove them from this model. That the model produces a non-zero number of isolated nodes gives significant evidence for the existence of such nodes in the true picture of the Internet, even though they cannot be observed with our current methods of data collection.
Estimating the degree sequence within our observed network amounts to the following: if we first sample N according to a Poisson distribution with mean λ and then take the sum of N independent Bernoulli random variables with mean q, we seek the resulting distribution. This is well-known to be simply Poisson with mean λq, as seen from an elementary computation, and this fact additionally motivated our choice of Poisson random variables.
All told, the unattached portion of our observed subnetwork is distributed as n_u stars, each of which has Poisson(λq)-many non-central leaves, from which the results follow immediately by independence and by estimating the terms by their means, which is valid for large values of n_u.
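The thinning fact used here (a Poisson(λ) number of Bernoulli(q) trials sums to a Poisson(λq) variable) can be checked directly from the law of total probability:

```python
import math

def thinned_pmf(k, lam, q, nmax=200):
    """P[X = k] where X is a sum of N independent Bernoulli(q)
    variables and N ~ Poisson(lam), via the law of total probability."""
    total = 0.0
    p_n = math.exp(-lam)                 # p_n = P[N = n], starting at n = 0
    for n in range(nmax):
        if n >= k:
            total += p_n * math.comb(n, k) * q ** k * (1 - q) ** (n - k)
        p_n *= lam / (n + 1)             # Poisson recurrence p_{n+1} = p_n * lam / (n+1)
    return total

def poisson_pmf(k, lam):
    return math.exp(-lam) * lam ** k / math.factorial(k)
```

Because thinning preserves the Poisson family, the observed star sizes remain Poisson with mean λq, which is exactly what makes the observed-network degree estimates tractable.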
VI Zipf-Mandelbrot Connection
Finally, a one-parameter (unnormalized) approximation of Equation (3) can be used to compare with the empirical Zipf-Mandelbrot distributions. The approximation introduces a positive number β to be fit to the data, and it effectively changes the underlying distribution from Poisson to an equally valid geometric distribution. The resulting distribution can be rescaled, and the rescaled expression can be aligned with the Zipf-Mandelbrot parameters α and δ.
In Figure 4, the degree distribution is shown with a selection of curve families that demonstrate that the PALU model can be made to fit a Zipf-Mandelbrot distribution. For any given power-law exponent α and offset parameter δ, the Zipf-Mandelbrot distribution can be well-approximated by Equation (5) by varying β. In general, the model tends towards Zipf-Mandelbrot. In addition, the model has the potential to explain some observations that deviate from the Zipf-Mandelbrot distribution (see Figure 3, upper right).
Connecting back to the original parameters, β can be expressed in terms of λ and q, and combining this with the same expression in Equation (5) gives the correspondence between the two models. Thus, the model described in Equation (5) captures the PA aspect of the model mostly in the first term, and the rest of the model is captured via λ and q through the second term.
VII Conclusion
In this paper, we presented a modified preferential attachment model to better represent new observations of network topology obtained from streaming data. From the figures, we found that the PALU model fits the Zipf-Mandelbrot distribution very well given the right parameters.
With the PALU model as a stepping stone, we can continue to examine the space of network models, including comparing the observed data to the model itself, extending the analysis done in Fig. 4. Opportunities for future work include further investigation of the PALU model, such as its applications in other big-data environments and deeper study of its degree distribution and clustering coefficients. The PALU model can also be extended to the case of weighted edges, where potential weights could be the number of packets or number of bytes sent along a link. Exploration of other models is also possible, such as combining preferential attachment with the Erdős–Rényi model and determining whether there is a better-fitting model than the Zipf-Mandelbrot distribution. It would also be worthwhile to explore the existence and importance of isolated nodes, and to analyze aspects of the network by studying the behavior of random samples of a preferential attachment graph. Finally, extrapolating the results of the PALU model to observe and define the large clusters of small disconnected components may also be of interest.
The authors wish to acknowledge the following individuals for their contributions and support: Bob Bond, David Clark, Nora Devlin, Alan Edelman, Jeff Gottschalk, Charles Leiserson, Mimi McClure, Sandeep Pisharody, Steve Rejto, Daniela Rus, Douglas Stetson, Allan Vanterpool, Marc Zissman, and the MIT SuperCloud team: Bill Arcand, Bill Bergeron, David Bestor, Chansup Byun, Vijay Gadepally, Michael Houle, Matthew Hubbell, Michael Jones, Anna Klein, Peter Michaleas, Lauren Milechin, Julie Mullen, Andrew Prout, Antonio Rosa, Albert Reuther, Charles Yee.
References
- (2019) Why botnets persist: designing effective technical and policy interventions.
- (2000) A random graph model for massive graphs. Proceedings of the Thirty-Second Annual ACM Symposium on Theory of Computing, pp. 171–180.
- (1999) Internet: diameter of the world-wide web. Nature 401 (6749), pp. 130.
- (2020) Online and remote learning in higher education institutes: a necessity in light of COVID-19 pandemic. Higher Education Studies 10 (3), pp. 16–25.
- (2017) Social media and fake news in the 2016 election. Journal of Economic Perspectives 31 (2), pp. 211–236.
- (2020) Exploring the critical challenges and factors influencing the e-learning system usage during COVID-19 pandemic. Education and Information Technologies 25, pp. 5261–5280.
- (1999) Emergence of scaling in random networks. Science 286 (5439), pp. 509–512.
- (1999) Emergence of scaling in random networks. Science 286 (5439), pp. 509–512.
- (2016) Network science. Cambridge University Press.
- (2009) Scale-free networks: a decade and beyond. Science 325 (5939), pp. 412–413.
- (2020) The iterated local model for social networks. Discrete Applied Mathematics.
- (2020) Impact of the COVID-19 pandemic on the internet latency: a large-scale study. Computer Networks 182, pp. 107495.
- (2009) Identifying high cardinality internet hosts. In INFOCOM 2009, IEEE, pp. 810–818.
- (2000) Traffic data repository at the WIDE project. In Proceedings of USENIX 2000 Annual Technical Conference: FREENIX Track, pp. 263–270.
- (2006) Complex graphs and networks. American Mathematical Society.
- (1999) Internet tomography. Nature, Web Matters.
- (2009) Power-law distributions in empirical data. SIAM Review 51 (4), pp. 661–703.
- (2018) Hyperscaling internet graph analysis with D4M on the MIT SuperCloud. IEEE High Performance Extreme Computing Conference (HPEC).
- (2018) Zero botnets: building a global effort to clean up the internet. Council on Foreign Relations.
- (2011) The world’s technological capacity to store, communicate, and compute information. Science, pp. 1200970.
- http://www.caida.org/data/publications/.
- https://go.neosit.com/bot-report-2019.
- Measuring sparseness of noisy signals. 4th International Symposium on Independent Component Analysis and Blind Signal Separation, pp. 125–130.
- (2019) New phenomena in large-scale internet traffic. arXiv preprint cs.NI/1904.04396.
- (2018) Mathematics of big data: spreadsheets, databases, matrices, and graphs. MIT Press.
- (1994) On the self-similar nature of Ethernet traffic (extended version). IEEE/ACM Transactions on Networking 2 (1), pp. 1–15.
- (2005) Graphs over time: densification laws, shrinking diameters and possible explanations. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, pp. 177–187.
- (2013) A survey of network flow applications. Journal of Network and Computer Applications 36 (2), pp. 567–581.
- (1953) An informational theory of the statistical structure of language. Communication Theory 84, pp. 486–502.
- (2001) Beyond the Zipf–Mandelbrot law in quantitative linguistics. Physica A: Statistical Mechanics and its Applications 300 (3–4), pp. 567–578.
- (2010) Community structure in time-dependent, multiscale, and multiplex networks. Science 328 (5980), pp. 876–878.
- (2001) Clustering and preferential attachment in growing networks. Physical Review E 64 (2), pp. 025102.
- (2010) Web crawling. Foundations and Trends in Information Retrieval 4 (3), pp. 175–246.
- (2020) Impact of digital surge during COVID-19 pandemic: a viewpoint on research and practice. International Journal of Information Management 55, pp. 102171.
- (2018) Interactive supercomputing on 40,000 cores for machine learning and data analysis. IEEE High Performance Extreme Computing Conference (HPEC).
- (2006) Modeling and caching of peer-to-peer traffic. In Network Protocols, 2006. ICNP’06, Proceedings of the 2006 14th IEEE International Conference on, pp. 249–258.
- (2018) A preferential attachment paradox: how preferential attachment combines with growth to produce networks with log-normal in-degree distributions. Scientific Reports 8 (1), pp. 2811.
- (2008) On estimating the exponent of power-law frequency distributions. Ecology 89 (4), pp. 905–912.
- (2017) Reliable multi-fractal characterization of weighted complex networks: algorithms and implications. Scientific Reports 7 (1), pp. 1–22.
- (2020) Controlling the multifractal generating measures of complex networks. Scientific Reports 10 (1), pp. 1–13.