flow_stats
Flow statistics
The efficiency of numerous flow-oriented solutions proposed in the literature strongly depends on traffic characteristics and should therefore be assessed on real traffic traces. For example, for traffic engineering mechanisms based on the distinction between elephant and mice flows, it is extremely important to use realistic distributions of flow length (in packets) and size (in bytes). Credible data is not available in the literature. Numerous works contain only graphs presenting PDFs and CDFs of selected flow parameters, but none of these papers provides reusable data, such as the parameters of distribution mixtures fitted to the analyzed data, or even the sources of the presented plots. In this paper we provide distributions of flows, packets and bytes as functions of flow length and size. We also fit mixtures of distributions to the data and provide the parameters of those distributions, along with source code, in a GitHub repository. The statistics were calculated from real traffic traces comprising 4 billion flows, collected at the Internet-facing interface of a campus network. Such comprehensive analysis and precise parameters enable credible assessment of numerous novel mechanisms proposed by researchers to address resource provisioning, traffic engineering or reliability issues.
Flow-oriented switching and routing have been attracting the attention of researchers for a long time. They can be advantageous in comparison to per-packet switching, especially in terms of traffic engineering, QoS or security. For example, flow-based routing enables multipath and adaptive approaches, which are impossible to achieve in per-packet routing due to routing loop constraints.
For years, flow-aware networking concepts were present only in academia. Recently, thanks to hardware development, it has finally become possible to design flow-oriented switches. These include SDN-based solutions (OpenFlow, P4) and distributed mechanisms such as FAMTAR [1].
The efficiency of numerous solutions proposed in the literature strongly depends on traffic characteristics and should therefore be assessed on real traffic traces. For example, the efficiency of traffic engineering mechanisms exploiting the long-tailed nature of IP flows strictly depends on the elephants-to-mice ratio [2]. Therefore, evaluation of such mechanisms must assume credible flow length and size distributions.
However, we noticed that such data is not available in the literature. Numerous works contain only graphs presenting PDFs and CDFs of selected flow parameters, but none of these papers provides reusable data, such as the parameters of distribution mixtures fitted to the analyzed data, or even the sources of the presented plots. Thus the aim of this paper is to fill that gap and enable credible assessment of the novel algorithms and mechanisms widely proposed by researchers. Detailed flow statistics were calculated from a complete 30-day record of flows originating from a large university network. The following distributions are analyzed and modeled:
Distributions of flows, packets and bytes as a function of flow length (packets).
Distributions of flows, packets and bytes as a function of flow size (bytes).
For all six of these distributions, we present PDF and CDF plots and fitted mixture parameters. To the best of our knowledge, this is the first paper which:
In addition, our statistics are based on billions of flows, which is orders of magnitude more than in previous analyses, most of which comprised tens of millions of flows. We believe that the developed mixture models of flow statistics can be used to evaluate flow-based mechanisms.
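To make the six distributions listed above concrete, the following minimal sketch computes the three CDFs as a function of flow length from a toy list of flows. The function name `length_distributions` and the `(packets, bytes)` tuple layout are assumptions made for illustration; this is not the authors' code. Grouping by the byte count instead of the packet count would give the three distributions as a function of flow size.

```python
from collections import defaultdict
from itertools import accumulate


def length_distributions(flows):
    """Return (lengths, cdf_flows, cdf_packets, cdf_bytes).

    `flows` is an iterable of (packets, octets) tuples, one per flow.
    Each CDF value at index i covers all flows of length <= lengths[i].
    """
    n_flows = defaultdict(int)    # number of flows having a given length
    n_packets = defaultdict(int)  # packets carried by flows of that length
    n_octets = defaultdict(int)   # bytes carried by flows of that length
    for packets, octets in flows:
        n_flows[packets] += 1
        n_packets[packets] += packets
        n_octets[packets] += octets

    lengths = sorted(n_flows)

    def cdf(counts):
        total = sum(counts.values())
        return [c / total for c in accumulate(counts[l] for l in lengths)]

    return lengths, cdf(n_flows), cdf(n_packets), cdf(n_octets)


# Toy input: three mice and one larger flow (values are made up).
toy_flows = [(1, 60), (1, 60), (2, 200), (10, 9000)]
lengths, cdf_flows, cdf_packets, cdf_bytes = length_distributions(toy_flows)
for row in zip(lengths, cdf_flows, cdf_packets, cdf_bytes):
    print(row)
```

Even in this toy input the long flow is a quarter of all flows but carries most of the packets and bytes, which is why the three curves are analyzed separately.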
We collected NetFlow records on the Internet-facing interface of the AGH University of Science and Technology network over a period of 30 consecutive days.
69% of the traffic was generated by dormitories, populated with nearly 8000 students. The remaining 31% was generated by the rest of the university, comprising over 4000 employees. Therefore, the flow statistics presented in this paper can also be useful when modeling residential traffic.
In the case of dormitories, 91% of the traffic was downstream traffic (from the Internet). In the case of the rest of the university, downstream traffic made up 73% of the total traffic. Table I shows detailed traffic shares.
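Table I: Traffic shares [%] ("all" is the share of the total; "down" and "up" split the traffic within each group)
         | Dormitories: all | down  | up    | Rest of the campus: all | down  | up
Flows    | 41.62            | 50.05 | 49.95 | 58.38                   | 50.02 | 49.98
Packets  | 67.72            | 63.88 | 36.12 | 32.28                   | 59.09 | 40.91
Bytes    | 68.66            | 91.41 | 8.59  | 31.34                   | 73.00 | 27.00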
All flows were collected: sampling was disabled, and the inactive timeout was set to 15 seconds. NetFlow records split by the active timeout were merged back with a tool developed by us and available in [3]. Thanks to that, we were able to obtain accurate statistics for long flows.
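The idea of merging split records can be illustrated with a short sketch. It is not the tool from [3]: the `Record` fields, the function name `merge_split_records` and the merging rule (records with the same key are joined when the gap between them is shorter than the 15-second inactive timeout) are assumptions made for illustration only.

```python
from dataclasses import dataclass

INACTIVE_TIMEOUT_MS = 15_000  # inactive timeout stated in the text


@dataclass
class Record:
    key: tuple      # hypothetical flow key, e.g. the 5-tuple
    first_ms: int   # timestamp of the first packet
    last_ms: int    # timestamp of the last packet
    packets: int
    octets: int


def merge_split_records(records, max_gap_ms=INACTIVE_TIMEOUT_MS):
    """Join records of the same key separated by less than max_gap_ms."""
    merged = []
    open_flows = {}  # key -> most recent merged record for that key
    for rec in sorted(records, key=lambda r: r.first_ms):
        cur = open_flows.get(rec.key)
        if cur is not None and rec.first_ms - cur.last_ms < max_gap_ms:
            # Continuation of a flow that the exporter split: extend it.
            cur.last_ms = max(cur.last_ms, rec.last_ms)
            cur.packets += rec.packets
            cur.octets += rec.octets
        else:
            # A new flow (first record for this key, or the gap is too long).
            cur = Record(rec.key, rec.first_ms, rec.last_ms,
                         rec.packets, rec.octets)
            open_flows[rec.key] = cur
            merged.append(cur)
    return merged


# Toy input: flow "A" is exported in two parts, then reappears much later.
toy_records = [
    Record(("A",), 0, 300_000, 100, 100_000),
    Record(("A",), 300_010, 400_000, 50, 50_000),
    Record(("B",), 10, 20, 1, 60),
    Record(("A",), 500_000, 500_100, 3, 180),
]
merged_flows = merge_split_records(toy_records)
for flow in merged_flows:
    print(flow)
```

Without such merging, a long flow would be counted as several shorter ones, which would distort the tail of the length and size distributions.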
In total, we collected over 4 billion flows comprising 317 billion packets. The amount of transmitted data exceeded 275 TB. Table II presents the statistics of the collected flows.
Table II: Statistics of the collected flows
Number of flows       | 4 032 376 751       | flows
Number of packets     | 316 857 594 090     | packets
Number of bytes       | 275 858 498 994 998 | bytes
Average flow length   | 78.578370           | packets
Average flow size     | 68410.894128        | bytes
Average flow duration | 7103.502983         | milliseconds
Average flow rate     | 19467.119600        | bps
Average packet size   | 870.607188          | bytes
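As a quick consistency check, three of the averages in Table II follow directly from the totals in the same table. The variable names below are chosen for illustration. The average duration and rate cannot be derived this way, since they depend on per-flow timestamps.

```python
# Totals taken from Table II.
total_flows = 4_032_376_751
total_packets = 316_857_594_090
total_octets = 275_858_498_994_998

avg_flow_length = total_packets / total_flows   # packets per flow
avg_flow_size = total_octets / total_flows      # bytes per flow
avg_packet_size = total_octets / total_packets  # bytes per packet

print(f"average flow length: {avg_flow_length:.3f} packets")
print(f"average flow size:   {avg_flow_size:.3f} bytes")
print(f"average packet size: {avg_packet_size:.3f} bytes")
```

The computed values agree with the corresponding rows of Table II to within rounding.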
Fitting a probability distribution to a series of observed data is the process of finding a probability distribution and its parameters. However, due to the complexity of Internet traffic (traces are composed of many network applications and reflect differing user behavior), a single distribution cannot be fitted to the collected data. In such a situation, a mixture of distributions can be used. We say that a particular observation $x$ follows a mixture of $K$ component distributions if:
$$f(x) = \sum_{k=1}^{K} w_k f_k(x) \qquad (1)$$
where $f_k$ are the component densities, $w_k \ge 0$ are the mixture weights and $\sum_{k=1}^{K} w_k = 1$. Such a model is a collection of other, well-known distributions called mixture components.
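The mixture definition translates directly into code: the CDF of a mixture is the weighted sum of the component CDFs. The sketch below uses only the Python standard library. The weights and parameters are hypothetical and are not the fitted values from this paper, and the function names are chosen for illustration.

```python
import math


def weibull_cdf(x, shape, scale):
    """CDF of the two-parameter Weibull distribution."""
    if x <= 0:
        return 0.0
    return 1.0 - math.exp(-((x / scale) ** shape))


def genpareto_cdf(x, shape, scale, loc=0.0):
    """CDF of the generalized Pareto distribution."""
    z = (x - loc) / scale
    if z <= 0:
        return 0.0
    if shape == 0:
        return 1.0 - math.exp(-z)
    base = 1.0 + shape * z
    if base <= 0:  # only possible for shape < 0, beyond the upper bound
        return 1.0
    return 1.0 - base ** (-1.0 / shape)


def mixture_cdf(x, components):
    """Weighted sum of component CDFs; components = [(weight, cdf), ...]."""
    return sum(weight * cdf(x) for weight, cdf in components)


# Hypothetical two-component mixture; the weights must sum to 1.
example_components = [
    (0.6, lambda x: weibull_cdf(x, shape=0.5, scale=10.0)),
    (0.4, lambda x: genpareto_cdf(x, shape=0.5, scale=100.0)),
]

for x in (1, 10, 100, 1000):
    print(x, round(mixture_cdf(x, example_components), 4))
```

Because each component CDF is non-decreasing and the weights are non-negative and sum to one, the result is itself a valid CDF.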
Finding mixture components and their weights is not a trivial process. In contrast to fitting a single distribution, where maximum-likelihood estimation (MLE) can be applied to estimate the parameters of a statistical model, a more sophisticated method must be used. One of the most commonly used algorithms for this purpose is the EM (expectation-maximization) algorithm [4]. It repeats two steps in an iterative manner. First, it finds the expected value of the complete-data log-likelihood (the joint log-likelihood of the distributions and their weights) with respect to the unknown data (weights), given the current values of the parameter estimates. Then, it maximizes the expectation computed in the first step.
Mixtures were fitted using the Matlab framework. Therefore, the distribution parameters presented in the tables refer to the parameters of the respective Matlab functions, where Pareto means the Generalized Pareto distribution. Code of the distribution mixtures implemented in the SciPy framework can be found in Jupyter notebooks published by us on GitHub [3].
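The two EM steps can be illustrated on a deliberately simpler model than the one used in this paper. The sketch below fits a mixture of exponential distributions, chosen only because its maximization step has a closed form; Weibull and Pareto components, as fitted here with Matlab, require numerical optimization in that step. All names, the synthetic data and the starting values are assumptions made for illustration.

```python
import math
import random


def em_exponential_mixture(data, means, weights, iterations=100):
    """EM for a mixture of exponentials; returns (weights, means, history)."""
    means, weights = list(means), list(weights)
    n = len(data)
    history = []  # log-likelihood before each parameter update
    for _ in range(iterations):
        # E-step: responsibility of each component for each observation.
        resp = []
        loglik = 0.0
        for x in data:
            dens = [w / m * math.exp(-x / m) for w, m in zip(weights, means)]
            total = sum(dens)
            loglik += math.log(total)
            resp.append([d / total for d in dens])
        history.append(loglik)
        # M-step: closed-form updates that maximize the expected
        # complete-data log-likelihood.
        for k in range(len(means)):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
    return weights, means, history


# Synthetic data: 70% short values (mean 1), 30% long values (mean 20).
rng = random.Random(1)
data = [rng.expovariate(1.0) if rng.random() < 0.7 else rng.expovariate(1 / 20)
        for _ in range(2000)]

sample_mean = sum(data) / len(data)
weights, means, history = em_exponential_mixture(
    data, means=[0.5 * sample_mean, 2.0 * sample_mean], weights=[0.5, 0.5])
print("weights:", weights)
print("means:  ", means)
```

A useful property to check in any EM implementation is that the log-likelihood never decreases between iterations; the `history` list makes that easy to verify.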
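Tables III to VIII: components of the fitted mixtures for the six distributions (the weights and parameter values are not reproduced here; see [3]).
Table III: Uniform, Uniform, Uniform, Uniform, Uniform, Pareto, Weibull
Table IV: Weibull, Weibull, Weibull, Pareto, Weibull
Table V: Pareto, Weibull, Weibull, Weibull, Weibull
Table VI: Weibull, Weibull, Pareto
Table VII: Pareto, Weibull, Weibull, Weibull, Weibull, Weibull
Table VIII: Weibull, Weibull, Weibull, Weibull, Pareto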
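Once the weights and parameters of a mixture are known, synthetic flow lengths or sizes can be drawn from it, for example to drive a simulation. The sketch below picks a component according to its weight and then applies inverse-transform sampling. The parameters are hypothetical and are not the fitted values from the tables, and the function names are chosen for illustration.

```python
import math
import random


def sample_weibull(rng, shape, scale):
    """Inverse-transform sample from a Weibull distribution."""
    u = rng.random()
    return scale * (-math.log(1.0 - u)) ** (1.0 / shape)


def sample_genpareto(rng, shape, scale, loc=0.0):
    """Inverse-transform sample from a generalized Pareto distribution."""
    u = rng.random()
    if shape == 0:
        return loc - scale * math.log(1.0 - u)
    return loc + scale * ((1.0 - u) ** (-shape) - 1.0) / shape


def sample_mixture(rng, components, n):
    """Draw n values; components = [(weight, sampler), ...]."""
    weights = [w for w, _ in components]
    samplers = [s for _, s in components]
    return [rng.choices(samplers, weights=weights)[0](rng) for _ in range(n)]


# Hypothetical mixture: many short flows plus a heavy tail.
example_mixture = [
    (0.8, lambda rng: sample_weibull(rng, shape=0.7, scale=5.0)),
    (0.2, lambda rng: sample_genpareto(rng, shape=0.5, scale=200.0)),
]

rng = random.Random(42)
samples = sample_mixture(rng, example_mixture, 10_000)
print("min:", min(samples), "max:", max(samples))
```

For discrete quantities such as flow length in packets, the continuous samples would still need to be rounded, for instance with `math.ceil`; how to do that is a modeling choice not covered here.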
To our knowledge, the first paper that explored the possibility of exploiting the heavy-tailed nature of IP flows for traffic engineering was [2]. The authors also provided distribution plots obtained from their own traffic measurements (collected in 1997), but without any data.
Article [5] is the most similar to ours. The authors also use NetFlow dumps to calculate flow statistics. They provide CDFs of flows in terms of length, size and duration. Furthermore, they provide fitted distribution parameters, although they fit single distributions, not mixtures, so the fits are not accurate. The authors examine only selected applications (ports): P2P, Web and TCP-big. Therefore, the fitted distributions cannot be used as a model of general Internet traffic.
Another very similar paper is [6] (based on traffic from 2004). It provides graphs of CDFs of flow length, size and duration, including the values of selected graph points.
An interesting analysis was performed in [7]. The authors collected packet headers instead of NetFlow records (although they propose NetFlow records as a data source as a potential direction for future work). They only provide CDFs of flows and total bytes as a function of flow duration, but the article contains an interesting description of the methodology for merging flow records.
Article [8] contains graphs of flow size, duration and rate distributions based on traces collected in 2001-2002. The next paper, [9], presents an update of the same graphs with data collected in 2008.
Campus Internet traffic was analyzed in [10]. It presents CDFs of flows and total bytes in terms of flow size for CAIDA and Budapest University traces, both collected in 2012.
Another paper which shows CDFs of total flow, byte and packet lifetimes is [11]. The analysis is based on traffic observed for 24 hours at Auckland in 2006.
Article [12], although it does not show any distributions, provides a ratio graph showing what share of the traffic in the network results from what fraction of flows. Such information can be useful for the evaluation of mice-elephant mechanisms. No numerical data is provided.
There are also three papers which explore the possibility of estimating flow distributions from sampled traffic: [13] [14] [15]. They contain a detailed analysis of the distribution fitting process, but do not contain the resulting distribution mixture parameters.
Articles [16] [17] [18] [19] [20] [21] [22] [23] show selected distribution graphs, but without providing any data, which makes them of little use for further research. The last article, in addition to the CDFs mentioned in the table, provides CDFs of active flows, flow inter-arrival times and packet inter-arrival times.
Articles [24] and [25], in addition to the classification according to the table, contain figures presenting rate vs. size and bandwidth vs. duration.
There are also many papers which model single services present in the network, not general Internet traffic: [26] [27] [28] [29] [30] [31] [32] [33] [34] [35] [36] [37] [38] [39] [40] [41] [42] [43].
Table IX: Papers providing CDF plots (rows: x axis; columns: y axis, CDF of)
x      | flows                                                             | packets | bytes
length | [2] [5] [6] [17] [18] [19] [21]                                   | [2]     |
size   | [5] [6] [8] [9] [10] [13] [16] [17] [18] [22] [23] [24] [25]      |         | [10]
All these papers provide only graphs and (sometimes) the values of selected points from the distributions. Table IX shows which article provides which distribution plots. None of them provides the values of fitted distribution parameters for general Internet traffic. Most of them focus on the distribution of flows, as such analyses are usually motivated by elephant flow issues.
A separate group of works concerns the fitting of traffic distributions. Most of them address a single service and thus do not provide a single model of general Internet traffic.
In this paper we statistically analyzed real-life traffic traces collected in a campus network. The analyses of length and size were performed for flows, packets and bytes. For each scenario, we proposed mixtures of distributions that accurately fit the collected data. Such data enables credible assessment of numerous novel mechanisms proposed by researchers, for example in resource provisioning, traffic engineering or reliability.
Based on the presented state-of-the-art analysis, it can be concluded that similar, comprehensive statistical analyses of real traffic traces are missing in the literature. As a further study, we plan to extend our research with similar results for the duration and rate distributions.
The authors would like to thank Bogusław Juza for providing NetFlow dumps.
The research was carried out with the support of the project “Intelligent management of traffic in multi-layer Software-Defined Networks” funded by the Polish National Science Centre under project no. 2017/25/B/ST6/02186.
H. Toral-Cruz, A.-S. K. Pathan, and J. C. R. Pacheco, “Accurate modeling of VoIP traffic QoS parameters in current and future networks with multifractal and Markov models,” Mathematical and Computer Modelling, vol. 57, no. 11, pp. 2832-2845, 2013, Information System Security and Performance Modeling and Simulation for Future Mobile Networks.