Temporal Correlation of Internet Observatories and Outposts

03/19/2022
by   Jeremy Kepner, et al.

The Internet has become a critical component of modern civilization requiring scientific exploration akin to endeavors to understand the land, sea, air, and space environments. Understanding the baseline statistical distributions of traffic is essential to the scientific understanding of the Internet. Correlating data from different Internet observatories and outposts can be a useful tool for gaining insights into these distributions. This work compares observed sources from the largest Internet telescope (the CAIDA darknet telescope) with those from a commercial outpost (the GreyNoise honeyfarm). Neither of these locations actively emits Internet traffic, and each provides distinct observations of unsolicited Internet traffic (primarily botnets and scanners). Newly developed GraphBLAS hypersparse matrices and D4M associative array technologies enable the efficient analysis of these data on significant scales. The CAIDA sources are well approximated by a Zipf-Mandelbrot distribution. Over a 6-month period 70% of the brightest (highest frequency) sources in the CAIDA telescope are consistently detected by coeval observations in the GreyNoise honeyfarm. This overlap drops as the sources dim (reduce frequency) and as the time difference between the observations grows. The probability of seeing a CAIDA source is proportional to the logarithm of the brightness. The temporal correlations are well described by a modified Cauchy distribution. These observations are consistent with a correlated high frequency beam of sources that drifts on a time scale of a month.

I Data Sets

The CAIDA Telescope monitors an Internet darkspace (also referred to as a black hole, Internet sink, or darknet) that is a globally routed /8 network that carries almost no legitimate traffic because there are few allocated addresses in this Internet prefix. After discarding the small amount of legitimate traffic from the incoming packets, the remaining data represent a continuous view of anomalous unsolicited traffic, or Internet background radiation. Almost every computer on the Internet will receive some form of this background traffic. This unsolicited traffic results from a wide range of events, such as backscatter from randomly spoofed sources used in denial-of-service attacks, the automated spread of Internet worms and viruses, scanning of address space by attackers or malware looking for vulnerable targets, and various misconfigurations (e.g. mistyping an IP address). In recent years, traffic destined to darkspace has evolved to include longer-duration, low-intensity events intended to establish and maintain botnets. CAIDA personnel maintain and expand the telescope instrumentation, collecting, curating, archiving, and analyzing the data to enable data access for vetted researchers.

Data collection start time, collection duration, and number of unique sources from the GreyNoise and CAIDA data sets. GreyNoise data was collected for each month. The sharp increases in 2020-03 and 2021-04 are a result of configuration changes. Fixed-size samples of CAIDA packets were selected approximately every 6 weeks on Wednesdays at either noon or midnight. Constant packet, variable time samples simplify the statistical analysis of the heavy-tail distributions commonly found in network traffic quantities [25, 48, 28].

TABLE I: GreyNoise and CAIDA Data Sets.

The CAIDA Telescope monitors the traffic into and out of a set of network addresses providing a natural observation point of network traffic. These data can be viewed as a traffic matrix where each row is a source and each column is a destination. The CAIDA Telescope traffic matrix can be partitioned into four quadrants (see Figure 1). These quadrants represent different flows between nodes internal and external to the set of monitored addresses. Because the CAIDA Telescope network addresses are a darkspace, only the upper left (external → internal) quadrant will have data.
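The quadrant partition can be illustrated with a small sketch. This is only an illustration under stated assumptions: the monitored prefix `10.0.0.0/8` is a hypothetical stand-in (not the telescope's actual prefix), and the names `MONITORED`, `quadrant`, and `packets` are invented for the example.

```python
import ipaddress
from collections import Counter

# Hypothetical monitored prefix, used only for illustration.
MONITORED = ipaddress.ip_network("10.0.0.0/8")

def quadrant(src: str, dst: str) -> str:
    """Label a packet by whether each endpoint lies inside the monitored prefix."""
    s = "internal" if ipaddress.ip_address(src) in MONITORED else "external"
    d = "internal" if ipaddress.ip_address(dst) in MONITORED else "external"
    return f"{s}->{d}"

# Toy packets as (source, destination) pairs; a darkspace never replies,
# so every packet falls in the external->internal quadrant.
packets = [
    ("1.1.1.1", "10.0.0.5"),
    ("2.2.2.2", "10.9.9.9"),
    ("1.1.1.1", "10.0.0.5"),
]
quadrant_counts = Counter(quadrant(s, d) for s, d in packets)
```

A honeyfarm that responds to traffic would additionally populate the `internal->external` quadrant.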

During 2020 over 20,000,000,000,000 unique packets were collected by the CAIDA Telescope. This data set represents the largest ever assembled public corpus of Internet traffic, and is perhaps the largest public collection of streaming network events of any type. Analysis of such a large network data set is computationally challenging [42, 36, 21]. Using the combined resources of the Supercomputing Centers at UC San Diego, Lawrence Berkeley National Laboratory, and MIT, the spatial temporal structure of anonymized source-destination pairs from the CAIDA Telescope data has been analyzed leveraging prior work on massively parallel GraphBLAS and D4M hierarchical hypersparse matrices [35, 31, 32, 52, 18, 26, 30, 29] to reveal a wide range of scaling relations [33]. For this study 5 contiguous subsets of CAIDA Telescope packets were selected and formed into GraphBLAS hypersparse traffic matrices at approximately 6-week intervals (see Table I). Prior work has shown that constant packet, variable time samples simplify the statistical analysis of the heavy-tail distributions commonly found in network traffic quantities [25, 48, 28]. Within each of these packet windows there are 500,000 to 800,000 unique sources.

The GreyNoise honeyfarm consists of hundreds of servers that passively collect packets from hundreds of thousands of IPs seen scanning the Internet every day. GreyNoise servers converse with these sources and analyze and enrich these observations to identify behavior, methods, and intent. The commercial goal of GreyNoise is to analyze and label data on IPs that saturate security tools with noise. This perspective helps analysts ignore irrelevant or harmless activity, creating more time to uncover and investigate true threats. The GreyNoise honeyfarm outpost deduces more information about sources by responding to traffic to determine its nature and so exists in both the upper left (external → internal) quadrant and the lower right (internal → external) quadrant of the corresponding traffic matrix (see Figure 1). For this study, GreyNoise provided data over a 15-month period which has been divided into 1-month windows (see Table I). Within each of these 1-month windows there are 1,000,000 to 14,000,000 unique sources.

Internet data must be handled with care, and CAIDA has pioneered standard trusted data sharing best practices that include [27]

  • Data is made available in curated repositories

  • Using standard anonymization methods where needed: hashing, sampling, and/or simulation

  • Registration with a repository and demonstration of legitimate research need

  • Recipients legally agree to neither repost a corpus nor deanonymize data

  • Recipients can publish analysis and data examples necessary to review research

  • Recipients agree to cite the repository and provide publications back to the repository

  • Repositories can curate enriched products developed by researchers

Within the above trusted sharing framework there are three main ways that subsets of anonymized data from multiple sources can be correlated [50]

  1. If the subset is small and the risk is low, then anonymized data can be sent back to the sources for deanonymization.

  2. If the subset is small, a third common anonymization scheme can be used and the data can be sent back to the sources for them to reanonymize in the common scheme.

  3. For larger sets, an anonymization transformation table provided by the sources allows direct mapping from anonymized data to the common scheme.

For this work, the first approach was used.
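For comparison, the third option (a transformation table into a common scheme) can be sketched as follows. This is a minimal illustration, not the procedure used in this work: the tables, identifiers, and the helper `to_common` are all hypothetical.

```python
# Hypothetical anonymization tables supplied by each data source:
# site-specific anonymized ID -> ID in a shared (common) scheme.
table_a = {"a17": "c1", "a42": "c2", "a99": "c3"}
table_b = {"b03": "c2", "b08": "c3", "b77": "c4"}

def to_common(anonymized_ids, table):
    """Map a site's anonymized IDs into the common scheme, skipping unknown IDs."""
    return {table[x] for x in anonymized_ids if x in table}

sources_a = ["a17", "a42", "a99"]   # anonymized sources seen at site A
sources_b = ["b03", "b08"]          # anonymized sources seen at site B

# Sources observed at both sites, expressed only in the common scheme.
shared = to_common(sources_a, table_a) & to_common(sources_b, table_b)
```

Neither site's original anonymization is exposed to the other; only identifiers in the common scheme are compared.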

II Network Quantities

Fig. 2: Streaming network traffic quantities. Internet traffic streams of valid packets are divided into a variety of quantities for analysis: source packets, source fan-out, unique source-destination pair packets (or links), destination fan-in, and destination packets. All of these quantities can be readily computed from anonymized hypersparse traffic matrices (see Table II).

Streams of interactions between entities are found in many domains. For Internet traffic these interactions are referred to as packets [22]. Figure 2 illustrates essential quantities found in all streaming dynamic networks. These quantities are all computable from anonymized traffic matrices created from the sources and destinations found in packet headers. These sources and destinations are referred to as Internet Protocol (IP) addresses.

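Formulas for computing network quantities from the traffic matrix $\mathbf{A}_t$ at time $t$ in both summation and matrix notation. $\mathbf{1}$ is a column vector of all 1's, $^{\sf T}$ is the transpose operation, and $|\cdot|_0$ is the zero-norm that sets each nonzero value of its argument to 1 [23]. These formulas are unaffected by matrix permutations and will work on anonymized data and are readily computed using GraphBLAS hypersparse matrices or D4M associative arrays [14, 32].

Aggregate Property                       Summation Notation                              Matrix Notation
Valid packets $N_V$                      $\sum_i \sum_j \mathbf{A}_t(i,j)$               $\mathbf{1}^{\sf T} \mathbf{A}_t \mathbf{1}$
Unique links                             $\sum_i \sum_j |\mathbf{A}_t(i,j)|_0$           $\mathbf{1}^{\sf T} |\mathbf{A}_t|_0 \mathbf{1}$
Link packets from $i$ to $j$             $\mathbf{A}_t(i,j)$                             $\mathbf{A}_t$
Max link packets ($d_{\max}$)            $\max_{ij} \mathbf{A}_t(i,j)$                   $\max(\mathbf{A}_t)$
Unique sources                           $\sum_i |\sum_j \mathbf{A}_t(i,j)|_0$           $\mathbf{1}^{\sf T} |\mathbf{A}_t \mathbf{1}|_0$
Packets from source $i$                  $\sum_j \mathbf{A}_t(i,j)$                      $\mathbf{A}_t \mathbf{1}$
Max source packets ($d_{\max}$)          $\max_i \sum_j \mathbf{A}_t(i,j)$               $\max(\mathbf{A}_t \mathbf{1})$
Source fan-out from $i$                  $\sum_j |\mathbf{A}_t(i,j)|_0$                  $|\mathbf{A}_t|_0 \mathbf{1}$
Max source fan-out ($d_{\max}$)          $\max_i \sum_j |\mathbf{A}_t(i,j)|_0$           $\max(|\mathbf{A}_t|_0 \mathbf{1})$
Unique destinations                      $\sum_j |\sum_i \mathbf{A}_t(i,j)|_0$           $|\mathbf{1}^{\sf T} \mathbf{A}_t|_0 \mathbf{1}$
Destination packets to $j$               $\sum_i \mathbf{A}_t(i,j)$                      $\mathbf{1}^{\sf T} \mathbf{A}_t$
Max destination packets ($d_{\max}$)     $\max_j \sum_i \mathbf{A}_t(i,j)$               $\max(\mathbf{1}^{\sf T} \mathbf{A}_t)$
Destination fan-in to $j$                $\sum_i |\mathbf{A}_t(i,j)|_0$                  $\mathbf{1}^{\sf T} |\mathbf{A}_t|_0$
Max destination fan-in ($d_{\max}$)      $\max_j \sum_i |\mathbf{A}_t(i,j)|_0$           $\max(\mathbf{1}^{\sf T} |\mathbf{A}_t|_0)$

TABLE II: Network Quantities from Traffic Matrices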

The network quantities depicted in Figure 2 are computable from anonymized origin-destination matrices that are widely used to represent network traffic [54, 59, 47, 56]. It is common to filter the packets down to a valid set for any particular analysis. Such filters may limit particular sources, destinations, protocols, and time windows. To reduce statistical fluctuations, the streaming data should be partitioned so that for any chosen time window all data sets have the same number of valid packets [26]. At a given time $t$, $N_V$ consecutive valid packets are aggregated from the traffic into a sparse matrix $\mathbf{A}_t$, where $\mathbf{A}_t(i,j)$ is the number of valid packets between the source $i$ and destination $j$. The sum of all the entries in $\mathbf{A}_t$ is equal to $N_V$:

$$\sum_{i,j} \mathbf{A}_t(i,j) = N_V$$

All the network quantities depicted in Figure 2 can be readily computed from $\mathbf{A}_t$ using the formulas listed in Table II. Because matrix operations are generally invariant to permutation (reordering of the rows and columns), these quantities can readily be computed from anonymized data using GraphBLAS hypersparse matrices or D4M associative arrays [14, 32].
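The quantities in Table II can be sketched on a toy traffic matrix. The example below is only an illustration: it stores the matrix as a plain Python dictionary keyed by (source, destination) as a stand-in for a GraphBLAS hypersparse matrix or D4M associative array, and all variable names are assumptions of this sketch.

```python
from collections import Counter, defaultdict

# Toy traffic matrix stored as {(source, destination): packets}.
# Only nonzero entries are stored, as in a hypersparse matrix.
A = Counter({(1, 10): 3, (1, 11): 1, (2, 10): 2})

valid_packets = sum(A.values())      # sum of all entries
unique_links = len(A)                # number of nonzero entries
max_link_packets = max(A.values())   # largest single entry

source_packets = defaultdict(int)    # row sums
source_fanout = defaultdict(int)     # nonzeros per row
dest_packets = defaultdict(int)      # column sums
dest_fanin = defaultdict(int)        # nonzeros per column
for (i, j), n in A.items():
    source_packets[i] += n
    source_fanout[i] += 1
    dest_packets[j] += n
    dest_fanin[j] += 1

unique_sources = len(source_packets)
unique_destinations = len(dest_packets)
max_source_packets = max(source_packets.values())
max_dest_fanin = max(dest_fanin.values())
```

Relabeling the keys (for example by anonymizing the addresses) changes which row or column a value sits in but not any of the aggregate values, which is the permutation invariance noted above.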

Fig. 3: CAIDA Source Packet Degree Distribution. Differential cumulative probability (normalized histogram) of the number (degree) of source packets from each source, using logarithmic bins, for the constant-packet CAIDA samples collected over several months and at different times of day. The observed power-law distribution can be approximated by the two-parameter Zipf-Mandelbrot distribution $p(d) \propto 1/(d+\delta)^{\alpha}$.

Processing the large volumes of data from observatories like the CAIDA Telescope requires additional computational innovations. The advent of GraphBLAS hypersparse hierarchical traffic matrices has enabled the processing of hundreds of billions of packets in minutes [24, 6, 14, 30]. The CAIDA Telescope archives its trillions of collected packets at the supercomputing center at Lawrence Berkeley National Laboratory (LBNL) where the packets are aggregated into CryptoPAN [17] anonymized GraphBLAS traffic matrices of valid contiguous packets. The traffic matrices used in this study are constructed by hierarchically summing these smaller traffic matrices.
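Hierarchical summation of small traffic matrices can be sketched as a pairwise reduction. This is a minimal illustration using Python `Counter` objects in place of GraphBLAS matrices; the function name `hierarchical_sum` and the toy inputs are assumptions of this sketch, not the pipeline used at LBNL.

```python
from collections import Counter

def hierarchical_sum(matrices):
    """Sum sparse matrices pairwise, level by level, until one remains."""
    level = list(matrices)
    while len(level) > 1:
        next_level = []
        for k in range(0, len(level), 2):
            # Add each adjacent pair; a trailing odd matrix passes through.
            next_level.append(sum(level[k:k + 2], Counter()))
        level = next_level
    return level[0] if level else Counter()

# Toy inputs: each Counter maps (source, destination) -> packets.
small = [
    Counter({(1, 2): 1}),
    Counter({(1, 2): 2}),
    Counter({(3, 4): 5}),
    Counter({(1, 2): 1, (3, 4): 1}),
    Counter({(5, 6): 7}),
]
total = hierarchical_sum(small)
```

The pairwise structure means each level can be summed in parallel, which is the property that makes the hierarchical approach attractive at scale.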

Because of the large volume of CAIDA Telescope data, traffic matrices were constructed as hypersparse GraphBLAS matrices with uint32 indices and floating point values, so 3 packets from IPv4 source 1.1.1.1 to IPv4 destination 2.2.2.2 in time window $t$ would be represented as

$$\mathbf{A}_t(16843009, 33686018) = 3$$

The GreyNoise data was smaller and contains additional metadata represented as strings, so the GreyNoise data was represented using D4M associative arrays, which for the aforementioned example would be

$$\mathbf{A}_t(\text{'1.1.1.1'}, \text{'2.2.2.2'}) = 3$$

After the unique sources and packet counts are computed from the CAIDA Telescope GraphBLAS matrices, the reduced results are converted to D4M associative arrays to facilitate correlation with the GreyNoise D4M associative arrays.
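The two index conventions can be illustrated with the standard-library `ipaddress` module. This is a sketch only: the dictionaries below stand in for a GraphBLAS matrix and a D4M associative array, and the names `ipv4_to_uint32`, `numeric_matrix`, and `string_matrix` are invented for the example.

```python
import ipaddress

def ipv4_to_uint32(ip: str) -> int:
    """Convert a dotted-quad IPv4 address to its 32-bit unsigned integer."""
    return int(ipaddress.IPv4Address(ip))

src = ipv4_to_uint32("1.1.1.1")
dst = ipv4_to_uint32("2.2.2.2")

# Integer-indexed entry, in the style of a uint32-indexed sparse matrix.
numeric_matrix = {(src, dst): 3.0}

# String-keyed entry, in the style of an associative array.
string_matrix = {("1.1.1.1", "2.2.2.2"): 3}

# Converting integer indices back to strings lets the two be correlated.
as_strings = {
    (str(ipaddress.IPv4Address(i)), str(ipaddress.IPv4Address(j))): int(v)
    for (i, j), v in numeric_matrix.items()
}
```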

Each network quantity computed from $\mathbf{A}_t$ will produce a distribution of values whose magnitude is often called the degree $d$. The corresponding histogram of the network quantity is denoted by $n_t(d)$. The largest observed value in the distribution is denoted $d_{\max}$. The normalization factor of the distribution is given by

$$\sum_{d=1}^{d_{\max}} n_t(d)$$

with corresponding probability

$$p_t(d) = n_t(d) \Big/ \sum_{d=1}^{d_{\max}} n_t(d)$$

and cumulative probability

$$P_t(d) = \sum_{i=1}^{d} p_t(i)$$

Because of the relatively large values of $d$ observed, the measured probability at large $d$ often exhibits large fluctuations. However, the cumulative probability lacks sufficient detail to see variations around specific values of $d$, so it is typical to pool the differential cumulative probability with logarithmic bins in $d$

$$D_t(d_i) = P_t(d_i) - P_t(d_{i-1})$$

where $d_i = 2^i$ [13]. All computed probability distributions use the same binary logarithmic binning to allow for consistent statistical comparison across data sets [13, 4].
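Binary logarithmic binning can be sketched as follows, assuming bin edges at powers of two so that each degree d is pooled into the smallest bin edge 2^i that is at least d. The function name and the toy degree list are assumptions of this sketch.

```python
from collections import Counter

def differential_cumulative_probability(degrees):
    """Pool a degree histogram into binary logarithmic bins.

    Returns {bin_edge: probability}, where the value at bin edge 2**i
    is the total probability of degrees d with 2**(i-1) < d <= 2**i.
    """
    hist = Counter(degrees)              # histogram n(d)
    total = sum(hist.values())           # normalization factor
    binned = Counter()
    for d, n in hist.items():
        i = (d - 1).bit_length()         # smallest i with 2**i >= d
        binned[2 ** i] += n / total
    return dict(sorted(binned.items()))

# Toy degrees: packets sent by each of six sources.
D = differential_cumulative_probability([1, 1, 2, 3, 4, 8])
```

Using the same bin edges for every data set is what allows the resulting distributions to be compared directly.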

Fig. 4: Peak Correlation. Correlation of CAIDA source IPs with GreyNoise source IPs during the same month as a function of CAIDA source packets $d$. CAIDA sources above a threshold number of packets in the packet window are very likely to appear in the GreyNoise data of the same month. CAIDA sources below that threshold appear with a probability that grows with the logarithm of $d$.

Figure 3 shows the distribution of external → internal source packets for the 5 CAIDA samples. The resulting distribution has the power-law shape frequently observed in network data [38, 16, 2, 3, 1, 5, 43]. The power-law distribution in Figure 3 can be approximated by the two-parameter Zipf-Mandelbrot distribution that is widely seen in network data [25, 28]

$$p(d; \alpha, \delta) \propto 1/(d + \delta)^{\alpha}$$
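A normalized Zipf-Mandelbrot model of the form 1/(d + delta)^alpha can be evaluated with a short sketch. The parameter values below are arbitrary illustrations, not fitted values from this study, and the function name is an assumption.

```python
def zipf_mandelbrot(d_max, alpha, delta):
    """Return probabilities p(d) for d = 1..d_max, with p(d) ~ 1/(d + delta)**alpha."""
    weights = [1.0 / (d + delta) ** alpha for d in range(1, d_max + 1)]
    norm = sum(weights)
    return [w / norm for w in weights]

# Illustrative parameters only (not taken from the paper's fits).
p = zipf_mandelbrot(d_max=1000, alpha=1.5, delta=2.0)
```

The offset delta flattens the distribution at small d, while alpha sets the slope of the power-law tail at large d.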

III Correlation Results

An important objective of correlating observations from different locations is to determine how the observations are similar and different. A first step is to ask what fraction of the CAIDA Telescope sources are also seen in the GreyNoise observations during the same month. Figure 4 plots the fraction of CAIDA sources seen in the GreyNoise data as a function of the number of CAIDA source packets binned logarithmically. Figure 4 shows that bright CAIDA sources above a threshold number of packets are nearly always also seen by the GreyNoise observations during the same month. Likewise, for fainter sources below that threshold, the probability of the CAIDA source being seen in the GreyNoise data can be empirically approximated as proportional to the logarithm of the number of source packets.
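The overlap measurement can be sketched as a set-membership test grouped by logarithmic brightness bin. This is an illustration with toy data; the function name, the dictionary of CAIDA packet counts, and the GreyNoise source set are all hypothetical.

```python
from collections import defaultdict

def overlap_by_brightness(caida_packets, greynoise_sources):
    """Fraction of CAIDA sources also seen in GreyNoise, per binary log bin.

    caida_packets: {source: packets from that source in the CAIDA window}
    greynoise_sources: set of sources seen by GreyNoise in the same month
    """
    seen = defaultdict(int)
    total = defaultdict(int)
    for src, d in caida_packets.items():
        b = 2 ** (d - 1).bit_length()    # smallest power of two >= d
        total[b] += 1
        seen[b] += int(src in greynoise_sources)
    return {b: seen[b] / total[b] for b in sorted(total)}

# Toy data: source labels stand in for (anonymized) IP addresses.
caida = {"a": 1, "b": 3, "c": 4, "d": 1000}
greynoise = {"c", "d"}
fractions = overlap_by_brightness(caida, greynoise)
```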

Fig. 5: Temporal Correlation. Fraction of CAIDA 2020-06-17 sources in a selected source packet range found in GreyNoise sources over a 15-month period. Data are fit to Gaussian, Cauchy, and the modified Cauchy distribution $\beta/(\beta + |t - t_0|^{\alpha})$.

Another important comparison is how the correlations change as the time between CAIDA and GreyNoise measurements increases. Figure 5 shows the correlation of CAIDA sources in a selected source packet range with GreyNoise data over a 15-month span. The correlation between the CAIDA and GreyNoise sources drops quickly and then levels off to a background level. The data have been fit to Gaussian (Normal), Cauchy [55, 37], and modified Cauchy distributions. Specifically, we will refer to the following function as the modified Cauchy distribution

$$\frac{\beta}{\beta + |t - t_0|^{\alpha}}$$

where $t_0$ is the CAIDA measurement time, $t$ is the GreyNoise measurement time, with exponent $\alpha$, and scale factor $\beta$. Setting $\alpha = 2$ and $\beta = \gamma^2$ results in the standard Cauchy distribution (up to normalization)

$$\frac{\gamma^2}{\gamma^2 + (t - t_0)^2}$$

Figure 5 is well approximated by the modified Cauchy distribution.
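The modified Cauchy model can be written as a one-line function, assuming the peak-normalized form beta / (beta + |t - t0|^alpha) with exponent alpha and scale factor beta. The function name and the sample values are assumptions of this sketch.

```python
def modified_cauchy(t, t0, alpha, beta):
    """Peak-normalized modified Cauchy: equals 1 at t == t0 and decays with |t - t0|."""
    return beta / (beta + abs(t - t0) ** alpha)

# With alpha = 2 and beta = gamma**2 this reduces to the usual Cauchy
# (Lorentzian) shape with scale gamma, up to normalization.
gamma = 3.0
value = modified_cauchy(t=4.0, t0=0.0, alpha=2.0, beta=gamma ** 2)
lorentzian = gamma ** 2 / (gamma ** 2 + (4.0 - 0.0) ** 2)
```

Smaller values of alpha give a heavier tail, so the correlation decays more slowly at large time separations.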

Fig. 6: Temporal Correlation and Packet Degree. Fraction of CAIDA sources found in GreyNoise sources over a 15-month period for selected source packet ranges. Black lines show the best-fit modified Cauchy distributions $\beta/(\beta + |t - t_0|^{\alpha})$.

Figure 6 shows the CAIDA GreyNoise temporal correlations for all the CAIDA samples for selected source packet ranges. All the curves are fit to the modified Cauchy distribution $\beta/(\beta + |t - t_0|^{\alpha})$ by generating all distributions over a range of possible exponent $\alpha$ and scale factor $\beta$ values, normalizing to the peak in the data, and then selecting the $\alpha$ and $\beta$ that minimize the norm of the difference from the data. The best-fit scaling exponent $\alpha$ for all source packet ranges is shown in Figure 7. The quantity $1/(\beta + 1)$ provides the relative one month drop from the peak (with time measured in months) and is shown in Figure 8.
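The grid-search fit described above can be sketched as follows. Several choices here are assumptions of the sketch rather than details confirmed by the text: the error measure is the sum of absolute differences, the data (rather than the model) are rescaled to a peak of 1, and the function name and parameter grids are invented for illustration.

```python
def fit_modified_cauchy(times, values, t0, alphas, betas):
    """Grid search for the (alpha, beta) whose curve best matches peak-normalized data."""
    peak = max(values)
    target = [v / peak for v in values]          # normalize data to its peak
    best = None
    for a in alphas:
        for b in betas:
            # Assumed error measure: sum of absolute differences.
            err = sum(
                abs(b / (b + abs(t - t0) ** a) - y)
                for t, y in zip(times, target)
            )
            if best is None or err < best[0]:
                best = (err, a, b)
    return best[1], best[2]

# Synthetic check: data generated with alpha = 1, beta = 4 and a peak of 0.7.
times = list(range(-7, 8))                        # months relative to t0
values = [0.7 * 4.0 / (4.0 + abs(t)) for t in times]
alpha_fit, beta_fit = fit_modified_cauchy(
    times, values, t0=0,
    alphas=[0.5, 1.0, 1.5, 2.0],
    betas=[1.0, 2.0, 4.0, 8.0],
)
```

An exhaustive grid is simple and robust for a two-parameter model; a finer grid or a different norm can be substituted without changing the structure.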

Fig. 7: Modified Cauchy Distribution $\alpha$. Best-fit parameter $\alpha$ from the modified Cauchy distribution models as a function of the number of CAIDA source packets $d$.
Fig. 8: One Month Drop. One month drop from the peak, $1/(\beta + 1)$, as derived from the fit parameter $\beta$ of the modified Cauchy distributions as a function of the number of CAIDA source packets $d$.
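The one month drop follows directly from the modified Cauchy form. As a short worked derivation, assuming the peak-normalized model $\beta/(\beta + |t - t_0|^{\alpha})$ with time measured in months:

```latex
% Value at the peak (t = t_0) and one month away (|t - t_0| = 1):
f(t_0) = \frac{\beta}{\beta + 0} = 1, \qquad
f(t_0 \pm 1) = \frac{\beta}{\beta + 1^{\alpha}} = \frac{\beta}{\beta + 1}

% Relative drop from the peak after one month (independent of \alpha):
1 - \frac{\beta}{\beta + 1} = \frac{1}{\beta + 1}

% Examples: a 50% drop gives \beta = 1; a 20% drop gives \beta = 4.
```

Because $1^{\alpha} = 1$ for any exponent, the one month drop isolates the scale factor, while the exponent controls the decay at longer separations.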

IV Discussion

Understanding the baseline statistical distributions of traffic is essential to the scientific understanding of the Internet. These data lend themselves to a number of observations about the statistical distributions of Internet traffic, the stability of these distributions over time, the correlation of measurements from different locations, and mathematical models of these approximations. Each of these observations provides a basis for predictions for future measurements and for theoretical modeling of the underlying generative processes.

Figure 3 shows that the source packet distributions of the CAIDA samples collected at different times have similar statistical distributions with small variations. Furthermore, the packet distribution is well approximated by a Zipf-Mandelbrot distribution. The temporal consistency of the observations with a stable Zipf-Mandelbrot model agrees with prior observations of the CAIDA Telescope [33], the CAIDA Chicago A & B Internet traffic collection [25, 28], the MAWI Internet traffic collection [25, 28], and other network gateways [34]. These observations have led to the development of new generative models of network traffic that extend prior preferential attachment models with parameters to describe adversarial traffic [15].

Figure 4 shows that the temporal statistical consistency can extend to the correlation of sources seen at separate locations. It also suggests that sources above a certain brightness are very likely to be seen, in contrast to prior observations [49]. This threshold can be expressed either as a number of packets or as a fraction of the total packets in the window. Below this threshold the probability of being seen in both CAIDA and GreyNoise during the same month is approximately proportional to the logarithm of the number of source packets. This purely empirical logarithmic distribution and the role of the empirical threshold value should be tested with additional comparative observations. Perhaps the threshold is connected to the observed scaling of the number of unique sources seen at the CAIDA Telescope and other locations [34, 33]. Likewise the logarithmic distribution could be an interesting target for new theoretical models.

Figures 5 and 6 show the temporal correlation between the GreyNoise and CAIDA sources of various brightness, showing that the temporal statistical consistency extends over significant time. The correlation as a function of time is well approximated by the modified Cauchy distribution. While it is certainly expected that brighter sources seen at one location on the Internet are more likely to be seen in another location at the same time, the simple empirical relation connecting the CAIDA and GreyNoise observations is intriguing. A common geometric interpretation of the Cauchy distribution is the probability of a randomly blinking rotating beam, positioned above the point $t_0$ at a distance $\gamma$ from a wall, hitting a point $t$ on the wall. Such a geometric analogy may represent a possible direction for theoretical exploration.

Figure 7 shows the best-fit exponent $\alpha$ from the modified Cauchy distribution as a function of CAIDA source packets. These observations suggest that 1 is a typical value of $\alpha$. The scale factor $\beta$ is shown in Figure 8 via the expression $1/(\beta + 1)$, which measures the relative 1-month drop off of the modified Cauchy distributions. The typical 1-month drop off is above 20% and increases to 50% over part of the source packet range. These observations suggest that, with $\alpha = 1$ and time measured in months, the modified Cauchy distributions for source packets where the drop off reaches 50% are typically

$$\frac{1}{1 + |t - t_0|}$$

For other values of source packets, where the drop off is near 20%, a typical modified Cauchy distribution would be

$$\frac{4}{4 + |t - t_0|}$$

These empirical observations offer potential starting points for further theoretical observations.

V Conclusions and Future Work

Scientific exploration of the Internet now requires endeavors akin to those used to understand the land, sea, air, and space environments. Understanding the normal statistical behavior of Internet traffic is a critical first step. Comparing observations from different locations on the Internet is an effective means for determining which network quantities vary or change. Using data from the largest Internet telescope (the CAIDA darknet telescope) and a commercial outpost (the GreyNoise honeyfarm), this work explores the correlation of the sources seen using GraphBLAS hypersparse matrices and D4M associative arrays. The CAIDA sources are well approximated by a Zipf-Mandelbrot distribution. Over a 6-month period 70% of the brightest (highest frequency) sources in the CAIDA telescope are consistently detected by coeval observations in the GreyNoise honeyfarm. This overlap drops as the sources dim (reduce frequency) and as the time difference between the observations grows. The probability of seeing a CAIDA source is proportional to the logarithm of the brightness. The temporal correlations are well described by a modified Cauchy distribution. These observations are consistent with a correlated high frequency beam of sources that drifts on time scales of months. Each of these observations provides a basis for predictions for future measurements and for theoretical modeling of the underlying generative processes.

Acknowledgments

The authors wish to acknowledge the following individuals for their contributions and support: Bob Bond, Ronisha Carter, Cary Conrad, Alan Edelman, Tucker Hamilton, Jeff Gottschalk, Nathan Frey, Chris Hill, Mike Kanaan, Tim Kraska, Andrew Morris, Charles Leiserson, Dave Martinez, Mimi McClure, Joseph McDonald, Christian Prothmann, John Radovan, Steve Rejto, Daniela Rus, Allan Vanterpool, Adam Weirman, Matthew Weiss, Marc Zissman.

References