Flow Length and Size Distributions in Campus Internet Traffic

09/10/2018 ∙ by Piotr Jurkiewicz, et al. ∙ 0

The efficiency of numerous flow-oriented solutions proposed in the literature strongly depends on traffic characteristics and should therefore be assessed using real traffic traces. For example, for traffic engineering mechanisms based on the distinction between elephant and mice flows, it is extremely important to ensure realistic distributions of flow length (in packets) and size (in bytes). Credible data of this kind is not available in the literature. Numerous works contain only graphs presenting PDFs and CDFs of selected flow parameters, but none of them provides reusable data, such as parameters of distribution mixtures fitted to the analyzed data, or even the source data of the presented plots. In this paper we provide distributions of flows, packets and bytes as functions of flow length and size. We also fit mixtures of distributions to the data and provide the parameters of those mixtures, along with source code, in a GitHub repository. The statistics were calculated from real traffic traces comprising 4 billion flows, collected at the Internet-facing interface of a campus network. Such a comprehensive analysis and precise parameters enable credible assessment of numerous novel mechanisms proposed by researchers that address resource provisioning, traffic engineering or reliability issues.


I Introduction

Flow-oriented switching and routing have long attracted the attention of researchers. They can be advantageous compared to per-packet switching, especially in terms of traffic engineering, QoS or security. For example, flow-based routing enables multipath and adaptive approaches, which are impossible to achieve in per-packet routing due to routing loop constraints.

For years, flow-aware networking concepts were confined to academia. Recently, thanks to hardware development, it has finally become possible to design flow-oriented switches. These include SDN-based solutions (OpenFlow, P4) and distributed mechanisms, like FAMTAR [1].

The efficiency of numerous solutions proposed in the literature strongly depends on traffic characteristics and should therefore be assessed using real traffic traces. For example, the efficiency of traffic engineering mechanisms exploiting the long-tailed nature of IP flows strictly depends on the elephants-to-mice ratio [2]. Therefore, evaluation of such mechanisms must be performed assuming credible flow length and size distributions.

However, we noticed that such data is not available in the literature. Numerous works contain only graphs presenting PDFs and CDFs of selected flow parameters, but none of them provides reusable data, such as parameters of distribution mixtures fitted to the analyzed data, or even the source data of the presented plots. The aim of this paper is thus to fill that gap and enable credible assessment of novel algorithms and mechanisms widely proposed by researchers. Detailed flow statistics were calculated from a complete 30-day record of flows originating from a large university network. The following distributions are analyzed and modeled:

  • Flows, packets and bytes distributions in function of flow length (packets).

  • Flows, packets and bytes distributions in function of flow size (bytes).

For all of these 6 distributions, we present PDF and CDF plots and fitted mixtures parameters. To the best of our knowledge, this is the first paper which:

  • Shows graphs of all 6 distributions obtained from the same traffic traces. In the literature only selected distributions were presented (see Table IX).

  • Provides parameters of mixtures fitting these distributions.

  • Makes source data and code publicly available in the repository: [3].

In addition, our statistics are based on billions of flows, which is orders of magnitude more than in previous analyses, which mostly comprised tens of millions of flows. We believe that the developed mixture models of flow statistics can be used to evaluate flow-based mechanisms.
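To make the three length-based distributions listed above concrete, the following minimal Python sketch computes empirical CDFs of flows, packets and bytes as functions of flow length. The function name `length_cdfs` and the toy input are hypothetical and are not taken from the published repository [3].

```python
from collections import Counter
from itertools import accumulate


def length_cdfs(flows):
    """Return (lengths, flows_cdf, packets_cdf, bytes_cdf).

    `flows` is an iterable of (packets, bytes) pairs, one pair per flow.
    Each CDF value is the share of all flows / packets / bytes that
    belongs to flows no longer than the corresponding length.
    """
    n_flows, n_packets, n_bytes = Counter(), Counter(), Counter()
    for packets, size in flows:
        n_flows[packets] += 1
        n_packets[packets] += packets
        n_bytes[packets] += size

    lengths = sorted(n_flows)

    def cdf(counter):
        total = sum(counter.values())
        running = accumulate(counter[length] for length in lengths)
        return [value / total for value in running]

    return lengths, cdf(n_flows), cdf(n_packets), cdf(n_bytes)


# Hypothetical toy input: three 1-packet mice and one 100-packet elephant.
toy = [(1, 60), (1, 60), (1, 80), (100, 150000)]
lengths, f_cdf, p_cdf, b_cdf = length_cdfs(toy)
```

In this toy case the mice make up 75% of flows but only a small share of packets and bytes, which is the kind of skew the three curves are meant to expose. Replacing the grouping key with the flow size in bytes gives the size-based counterparts.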

II Data collection

We collected NetFlow records on the Internet-facing interface of the AGH University of Science and Technology network over a period of 30 consecutive days.

69% of the traffic was generated by dormitories, populated by nearly 8000 students. 31% of the traffic was generated by the rest of the university, comprising over 4000 employees. Therefore, the flow statistics presented in this paper can also be useful when modeling residential traffic.

For the dormitories, 91% of the traffic was downstream traffic (from the Internet). For the rest of the university, downstream traffic made up 73% of total traffic. Table I shows detailed traffic shares.

          Dormitories               Rest of the campus
          all     down    up        all     down    up
Flows     41.62   50.05   49.95     58.38   50.02   49.98
Packets   67.72   63.88   36.12     32.28   59.09   40.91
Bytes     68.66   91.41    8.59     31.34   73.00   27.00
TABLE I: Traffic shares (%)

All flows were collected: sampling was disabled and the inactive timeout was set to 15 seconds. NetFlow records split due to the active timeout were merged back using a tool we developed, available in [3]. Thanks to this, we were able to obtain accurate statistics for long flows.

In total, we collected over 4 billion flows comprising 317 billion packets. The amount of transmitted data was over 275 TB. Table II presents the statistics of the collected flows.
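The idea of merging records that were split by the active timeout can be illustrated with the short Python sketch below. It is not the tool from [3]. The `Record` layout and the merging rule are assumptions made for illustration: a record is treated as a continuation when it has the same flow key and starts less than the inactive timeout after the previous record ended.

```python
from collections import namedtuple

# Hypothetical record layout: key is the flow 5-tuple, times are in milliseconds.
Record = namedtuple("Record", "key first last packets bytes")

INACTIVE_TIMEOUT_MS = 15_000  # the inactive timeout stated in the text


def merge_split_records(records, gap_ms=INACTIVE_TIMEOUT_MS):
    """Merge records of one flow that were exported as separate pieces."""
    merged = []
    latest = {}  # flow key -> index of its most recent entry in `merged`
    for rec in sorted(records, key=lambda r: r.first):
        idx = latest.get(rec.key)
        if idx is not None and rec.first - merged[idx].last < gap_ms:
            # The gap is shorter than the inactive timeout, so the flow
            # never expired by inactivity: treat this record as a continuation.
            prev = merged[idx]
            merged[idx] = Record(prev.key, prev.first, max(prev.last, rec.last),
                                 prev.packets + rec.packets,
                                 prev.bytes + rec.bytes)
        else:
            latest[rec.key] = len(merged)
            merged.append(rec)
    return merged


key = ("10.0.0.1", "192.0.2.7", 6, 40000, 443)  # made-up 5-tuple
pieces = [
    Record(key, 0, 300_000, 1000, 1_000_000),
    Record(key, 300_010, 450_000, 500, 500_000),  # continuation of the first
    Record(key, 900_000, 900_100, 3, 180),        # a new flow, long idle gap
]
flows = merge_split_records(pieces)
```

Here the first two pieces collapse into one flow of 1500 packets, while the third stays separate because it starts long after the inactive timeout would have expired.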

Number of flows          4 032 376 751 flows
Number of packets        316 857 594 090 packets
Number of bytes          275 858 498 994 998 bytes
Average flow length      78.578370 packets
Average flow size        68410.894128 bytes
Average flow duration    7103.502983 milliseconds
Average flow rate        19467.119600 bps
Average packet size      870.607188 bytes
TABLE II: Collected flows statistics
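As a quick consistency check, three of the averages in Table II follow directly from the totals. The snippet below recomputes them; the variable names are ours. The average flow rate is omitted because it cannot be derived from the totals alone.

```python
# Totals copied from Table II.
total_flows = 4_032_376_751
total_packets = 316_857_594_090
total_bytes = 275_858_498_994_998

avg_flow_length = total_packets / total_flows  # packets per flow
avg_flow_size = total_bytes / total_flows      # bytes per flow
avg_packet_size = total_bytes / total_packets  # bytes per packet

print(f"{avg_flow_length:.6f} packets per flow")
print(f"{avg_flow_size:.6f} bytes per flow")
print(f"{avg_packet_size:.6f} bytes per packet")
```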

Fitting a probability distribution to a series of observed data is the process of finding a probability distribution and its parameters. However, due to the complexity of Internet traffic (many network applications contributing to the traces and varied user behavior), a single distribution cannot be fitted to the collected data. In such a situation, a mixture of distributions can be used. We say that the distribution $f$ of a particular observation $x$ is a mixture of $K$ component distributions $f_1, \dots, f_K$ if:

$$f(x) = \sum_{k=1}^{K} w_k f_k(x) \qquad (1)$$

where $w_k \geq 0$ are the mixture weights and $\sum_{k=1}^{K} w_k = 1$. Such a model is a collection of other, well-known distributions called mixture components.
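A mixture density of this kind, a weighted sum of component densities, can be evaluated with a few lines of Python. The sketch below uses Weibull and Generalized Pareto components written out from their textbook formulas. All weights and parameters are made-up placeholders, not the fitted values from the tables.

```python
import math


def wbl_pdf(x, a, b):
    """Weibull density with scale a and shape b (returns 0 for x <= 0)."""
    if x <= 0:
        return 0.0
    z = x / a
    return (b / a) * z ** (b - 1) * math.exp(-(z ** b))


def gp_pdf(x, k, sigma, theta):
    """Generalized Pareto density with shape k > 0, scale sigma, location theta."""
    if x < theta:
        return 0.0
    z = (x - theta) / sigma
    return (1.0 / sigma) * (1.0 + k * z) ** (-1.0 - 1.0 / k)


def mixture_pdf(x, components):
    """Weighted sum of component densities; components = [(weight, pdf, params)]."""
    return sum(w * pdf(x, *params) for w, pdf, params in components)


# Placeholder mixture: the weights must be non-negative and sum to 1.
components = [
    (0.7, wbl_pdf, (2.0, 0.8)),
    (0.3, gp_pdf, (0.5, 10.0, 1.0)),
]
assert abs(sum(w for w, _, _ in components) - 1.0) < 1e-12

density_at_5 = mixture_pdf(5.0, components)
```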

Finding mixture components and their weights is not a trivial process. In contrast to fitting a single distribution, where maximum-likelihood estimation (MLE) can be applied directly to estimate the parameters of the statistical model, a more sophisticated method must be used. One of the most commonly used algorithms for this purpose is the EM algorithm [4]. It repeats two steps iteratively. First, it finds the expected value of the complete-data log-likelihood (the joint log-likelihood involving the distributions and their weights) with respect to the unknown data (weights), given the current parameter estimates. Then it maximizes the expectation computed in the first step.
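The two alternating steps can be illustrated on a deliberately simple case. The sketch below fits a mixture of two exponential distributions, chosen only because their maximization step has a closed form. It is not the Matlab procedure used for the mixtures in this paper, and the synthetic data and starting values are arbitrary assumptions.

```python
import math
import random


def em_two_exponentials(data, w=0.5, rate1=1.0, rate2=0.1, iterations=50):
    """Fit w*Exp(rate1) + (1-w)*Exp(rate2); also return the log-likelihood trace."""
    trace = []
    for _ in range(iterations):
        # E-step: probability that each observation came from component 1,
        # given the current parameter estimates.
        resp = []
        loglik = 0.0
        for x in data:
            p1 = w * rate1 * math.exp(-rate1 * x)
            p2 = (1.0 - w) * rate2 * math.exp(-rate2 * x)
            resp.append(p1 / (p1 + p2))
            loglik += math.log(p1 + p2)
        trace.append(loglik)
        # M-step: closed-form updates maximizing the expected log-likelihood.
        r1 = sum(resp)
        r2 = len(data) - r1
        w = r1 / len(data)
        rate1 = r1 / sum(r * x for r, x in zip(resp, data))
        rate2 = r2 / sum((1.0 - r) * x for r, x in zip(resp, data))
    return w, rate1, rate2, trace


# Synthetic data: 70% short-lived values (rate 2.0), 30% long-lived (rate 0.05).
random.seed(1)
data = [random.expovariate(2.0) for _ in range(700)]
data += [random.expovariate(0.05) for _ in range(300)]

w, rate1, rate2, trace = em_two_exponentials(data)
```

A useful sanity check for any EM implementation is that the log-likelihood recorded in `trace` never decreases from one iteration to the next.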

Mixtures were fitted using the Matlab framework. Therefore, the distribution parameters presented in the tables refer to the parameters of the respective Matlab functions, where Pareto means the Generalized Pareto distribution. Code for the distribution mixtures implemented in the SciPy framework can be found in the Jupyter notebooks we published on GitHub [3].
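Because Matlab and SciPy order and name distribution parameters differently, a small translation step is needed when moving between them. The helper below is a hypothetical sketch, not code from [3]. It assumes the tables follow the argument order of Matlab's `wblpdf(x, a, b)`, `gppdf(x, k, sigma, theta)` and `unifpdf(x, a, b)`, and maps them to the keyword arguments of `scipy.stats.weibull_min`, `scipy.stats.genpareto` and `scipy.stats.uniform`.

```python
def matlab_to_scipy(name, *params):
    """Map Matlab-style parameters to (scipy.stats name, keyword arguments)."""
    if name == "Weibull":
        # Matlab: a = scale, b = shape; SciPy: c = shape, scale = scale.
        a, b = params
        return "weibull_min", {"c": b, "scale": a}
    if name == "Pareto":
        # Generalized Pareto. Matlab: k = shape, sigma = scale, theta = location.
        k, sigma, theta = params
        return "genpareto", {"c": k, "scale": sigma, "loc": theta}
    if name == "Uniform":
        # Matlab gives both endpoints; SciPy expects a start and a width.
        lower, upper = params
        return "uniform", {"loc": lower, "scale": upper - lower}
    raise ValueError(f"unknown component: {name}")


# Placeholder parameters, not values from the tables.
dist_name, kwargs = matlab_to_scipy("Weibull", 2.0, 0.8)
```

With SciPy installed, `getattr(scipy.stats, dist_name)(**kwargs)` would then give a frozen distribution object exposing `pdf`, `cdf` and `rvs`.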

III Length distributions

III-A Flows

Fig. 1: Probability density function of

Components: Uniform, Uniform, Uniform, Uniform, Uniform, Pareto, Weibull
TABLE III: Fitted mixture

III-B Packets

Fig. 2: Probability density function of

Components: Weibull, Weibull, Weibull, Pareto, Weibull
TABLE IV: Fitted mixture

III-C Bytes

Fig. 3: Probability density function of

Components: Pareto, Weibull, Weibull, Weibull, Weibull
TABLE V: Fitted mixture

III-D CDFs

Fig. 4: Cumulative distribution functions of flows, packets and bytes

IV Size distributions

IV-A Flows

Fig. 5: Probability density function of

Components: Weibull, Weibull, Pareto
TABLE VI: Fitted mixture

IV-B Packets

Fig. 6: Probability density function of

Components: Pareto, Weibull, Weibull, Weibull, Weibull, Weibull
TABLE VII: Fitted mixture

IV-C Bytes

Fig. 7: Probability density function of

Components: Weibull, Weibull, Weibull, Weibull, Pareto
TABLE VIII: Fitted mixture

IV-D CDFs

Fig. 8: Cumulative distribution functions of flows, packets and bytes

V Related work

To the best of our knowledge, [2] was the first paper to explore the possibility of exploiting the heavy-tailed nature of IP flows for traffic engineering. The authors also provided distribution plots obtained from their own traffic measurements (collected in 1997), but without any underlying data.

Article [5] is the most similar to ours. The authors also use NetFlow dumps to calculate flow statistics. They provide CDFs of flows in terms of length, size and duration. Furthermore, they provide fitted distribution parameters, although they fit single distributions rather than mixtures, so the fits are not accurate. The authors examine only selected applications (ports): P2P, Web and TCP-big. Therefore, the fitted distributions cannot be used as a model of general Internet traffic.

Another very similar paper is [6] (based on traffic from 2004). It provides graphs of CDFs of flow length, size and duration, including values of selected graph points.

An interesting analysis was performed in [7]. The authors collected packet headers instead of NetFlow records (although they propose using NetFlow records as a data source as a potential direction for future work). They only provide CDFs of flows and total bytes as functions of flow duration, but the article contains an interesting description of a flow record merging methodology.

Article [8] contains graphs of flow size, duration and rate distributions based on traces collected in 2001-2002. A later paper, [9], presents an update of the same graphs with data collected in 2008.

Campus Internet traffic was analyzed in [10]. It presents CDFs of flows and total bytes in terms of flow size for CAIDA and Budapest University traces, both collected in 2012.

Another paper which shows CDFs of total flow, byte and packet lifetimes is [11]. The analysis is based on traffic observed for 24 hours at Auckland in 2006.

Although article [12] does not show any distributions, it provides a graph of what share of network traffic results from what fraction of flows. Such information can be useful for evaluating mice-elephant mechanisms. No numerical data is provided.
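The relation between a fraction of flows and the share of traffic it carries is simple to compute from per-flow sizes. The following Python sketch shows one way to do it; the function name and the toy sizes are hypothetical.

```python
def byte_share_of_top_flows(sizes, fraction):
    """Share of all bytes carried by the largest `fraction` of flows."""
    ordered = sorted(sizes, reverse=True)
    top = max(1, round(fraction * len(ordered)))  # count at least one flow
    return sum(ordered[:top]) / sum(ordered)


# Toy example: nine 100-byte mice and a single 9100-byte elephant.
sizes = [100] * 9 + [9100]
share = byte_share_of_top_flows(sizes, 0.10)
print(f"top 10% of flows carry {share:.0%} of bytes")  # prints 91% here
```

Sweeping `fraction` from 0 to 1 yields the kind of ratio curve described above, which can be used to pick an elephant threshold for a given mechanism.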

There are also three papers that explore the possibility of estimating flow distributions from sampled traffic: [13] [14] [15]. They contain a detailed analysis of the distribution fitting process, but do not contain the resulting distribution mixture parameters.

Articles [16] [17] [18] [19] [20] [21] [22] [23] show selected distribution graphs, but without providing any data, which makes them of little use in further research. The last of these articles, in addition to the CDFs listed in the table, provides CDFs of active flows, flow interarrival times and packet interarrival times.

In articles [24] and [25], in addition to the distributions classified in the table, there are figures presenting rate vs. size and bandwidth vs. duration.

There are also many papers that model single services present in the network rather than general Internet traffic: [26] [27] [28] [29] [30] [31] [32] [33] [34] [35] [36] [37] [38] [39] [40] [41] [42] [43].

y CDF of: flows packets bytes
x
length [2][6][17] [18] [19] [21] [5] [2]
size [8] [9] [6][10] [13] [16][17] [18] [22] [23] [24] [25] [5]
[10]
TABLE IX: Provided distributions

All these papers provide only graphs and (sometimes) values of selected points from the distributions. Table IX shows which article provides which distribution plots. None of them provides values of fitted distribution parameters for general Internet traffic. Most of them focus on flows, as the aim of such an analysis is usually connected with elephant flow issues.

A separate group of works concerns fitting traffic distributions. Most of them address a single service, and thus do not provide a single model for general Internet traffic.

VI Conclusion

In this paper we statistically analyzed real-life traffic traces collected in a campus network. The analyses of length and size were performed for flows, packets and bytes. For each scenario, we proposed a mixture of distributions that accurately fits the collected data. Such data enables credible assessment of numerous novel mechanisms proposed by researchers, for example those concerning resource provisioning, traffic engineering or reliability.
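One way such a mixture model might be used in an evaluation is to draw synthetic flow lengths or sizes from it. The sketch below does this by inverse transform sampling for Weibull and Generalized Pareto components. The weights and parameters are placeholders rather than the fitted values, and the function names are hypothetical.

```python
import math
import random


def sample_weibull(rng, a, b):
    """Inverse-CDF draw from a Weibull with scale a and shape b."""
    return a * (-math.log(1.0 - rng.random())) ** (1.0 / b)


def sample_gen_pareto(rng, k, sigma, theta):
    """Inverse-CDF draw from a Generalized Pareto (shape k > 0, scale, location)."""
    return theta + sigma / k * ((1.0 - rng.random()) ** (-k) - 1.0)


def sample_mixture(rng, components, n):
    """Pick a component by weight for each draw, then sample from it."""
    weights = [w for w, _, _ in components]
    picks = rng.choices(components, weights=weights, k=n)
    return [sampler(rng, *params) for _, sampler, params in picks]


rng = random.Random(42)  # fixed seed so that runs are repeatable
components = [
    (0.7, sample_weibull, (2.0, 0.8)),          # placeholder parameters
    (0.3, sample_gen_pareto, (0.5, 10.0, 1.0)),  # placeholder parameters
]
samples = sample_mixture(rng, components, 1000)
```

For flow lengths, the continuous draws would still need to be rounded to whole packets, and the rounding rule is a modeling choice left open here.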

Based on the provided state-of-the-art analysis, it can be concluded that similarly comprehensive statistical analyses of real traffic traces are missing in the literature. As a further study, we plan to extend our research with similar results for duration and rate distributions.

Acknowledgment

The authors would like to thank Bogusław Juza for providing NetFlow dumps.

The research was carried out with the support of the project “Intelligent management of traffic in multi-layer Software-Defined Networks” funded by the Polish National Science Centre under project no. 2017/25/B/ST6/02186.

References

  • [1] P. Jurkiewicz, R. Wójcik, J. Domżał, and A. Kamisiński, “Testing Implementation of FAMTAR: Adaptive Multipath Routing,” arXiv:1808.03209, 2018. [Online]. Available: http://arxiv.org/abs/1808.03209
  • [2] A. Shaikh, J. Rexford, and K. G. Shin, “Load-sensitive Routing of Long-lived IP Flows,” ACM SIGCOMM Computer Communication Review, vol. 29, no. 4, pp. 215–226, 1999.
  • [3] P. Jurkiewicz. Flow statistics. [Online]. Available: https://github.com/piotrjurkiewicz/flow_stats
  • [4] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum Likelihood from Incomplete Data via the EM Algorithm,” Journal of the Royal Statistical Society. Series B (Methodological), vol. 39, no. 1, pp. 1–38, 1977.
  • [5] M. Pustisek, I. Humar, and J. Bester, “Empirical analysis and modeling of peer-to-peer traffic flows,” in MELECON 2008 - The 14th IEEE Mediterranean Electrotechnical Conference, May 2008, pp. 169–175.
  • [6] M.-S. Kim, Y. J. Won, and J. W. Hong, “Characteristic analysis of internet traffic from the perspective of flows,” Computer Communications, vol. 29, no. 10, pp. 1639–1652, 2006.
  • [7] L. Quan and J. Heidemann, “On the characteristics and reasons of long-lived internet flows,” in Proceedings of the 10th ACM SIGCOMM conference on Internet measurement.   ACM, 2010, pp. 444–450.
  • [8] Y. Zhang, L. Breslau, V. Paxson, and S. Shenker, “On the characteristics and origins of internet flow rates,” in ACM SIGCOMM Computer Communication Review, vol. 32, no. 4.   ACM, 2002, pp. 309–322.
  • [9] F. Qian, A. Gerber, Z. M. Mao, S. Sen, O. Spatscheck, and W. Willinger, “Tcp revisited: A fresh look at tcp in the wild,” in Proceedings of the 9th ACM SIGCOMM Conference on Internet Measurement, ser. IMC ’09.   New York, NY, USA: ACM, 2009, pp. 76–89.
  • [10] P. Megyesi and S. Molnár, “Analysis of elephant users in broadband network traffic,” in Meeting of the European Network of Universities and Companies in Information and Communication Engineering.   Springer, 2013, pp. 37–45.
  • [11] D. Lee and N. Brownlee, “Passive measurement of one-way and two-way flow lifetimes,” ACM SIGCOMM Computer Communication Review, vol. 37, no. 3, pp. 17–28, 2007.
  • [12] C. Estan and G. Varghese, “New Directions in Traffic Measurement and Accounting: Focusing on the Elephants, Ignoring the Mice,” ACM Trans. Comput. Syst., vol. 21, no. 3, pp. 270–313, Aug. 2003.
  • [13] N. Antunes and V. Pipiras, “Estimation of flow distributions from sampled traffic,” ACM Transactions on Modeling and Performance Evaluation of Computing Systems, vol. 1, no. 3, p. 11, 2016.
  • [14] L. Yang, “Sample based estimation of network traffic flow characteristics.” Ph.D. dissertation, University of Michigan, 2009.
  • [15] N. Duffield, C. Lund, and M. Thorup, “Estimating flow distributions from sampled flow statistics,” in Proceedings of the 2003 conference on Applications, technologies, architectures, and protocols for computer communications.   ACM, 2003, pp. 325–336.
  • [16] N. Brownlee and K. C. Claffy, “Understanding Internet traffic streams: dragonflies and tortoises,” IEEE Communications Magazine, vol. 40, no. 10, pp. 110–117, Oct. 2002.
  • [17] B. Ryu, D. Cheney, and H.-W. Braun, “Internet Flow Characterization: Adaptive Timeout Strategy and Statistical Modeling,” in Proc. Passive and Active Measurement workshop, 2001, p. 45.
  • [18] W. Fang and L. Peterson, “Inter-as traffic patterns and their implications,” in Seamless Interconnection for Universal Services. Global Telecommunications Conference. GLOBECOM’99. (Cat. No.99CH37042), vol. 3, Dec. 1999, pp. 1859–1868 vol.3.
  • [19] X. Guan, T. Qin, W. Li, and P. Wang, “Dynamic feature analysis and measurement for large-scale network traffic monitoring,” IEEE Transactions on Information Forensics and Security, vol. 5, no. 4, pp. 905–919, Dec. 2010.
  • [20] T. Limmer and F. Dressler, “Flow-based tcp connection analysis,” in 2009 IEEE 28th International Performance Computing and Communications Conference, Dec. 2009, pp. 376–383.
  • [21] L. Qian and B. E. Carpenter, “A flow-based performance analysis of tcp and tcp applications,” in 2012 18th IEEE International Conference on Networks (ICON), Dec. 2012, pp. 41–45.
  • [22] K. Papagiannaki, N. Taft, and C. Diot, “Impact of flow dynamics on traffic engineering design principles,” in IEEE INFOCOM 2004, vol. 4, March 2004, pp. 2295–2306.
  • [23] T. Benson, A. Akella, and D. A. Maltz, “Network traffic characteristics of data centers in the wild,” in Proceedings of the 10th ACM SIGCOMM Conference on Internet Measurement, ser. IMC ’10.   New York, NY, USA: ACM, 2010, pp. 267–280.
  • [24] K.-C. Lan and J. Heidemann, “On the correlation of internet flow characteristics,” 2003.
  • [25] K.-C. Lan and J. Heidemann, “A measurement study of correlations of internet flow characteristics,” Computer Networks, vol. 50, no. 1, pp. 46–62, 2006.
  • [26] F. Hernandez-Campos, J. Marron, G. Samorodnitsky, and F. D. Smith, “Variable heavy tails in internet traffic,” Performance Evaluation, vol. 58, no. 2-3, pp. 261–284, 2004.
  • [27] V. Ramaswami, K. Jain, R. Jana, and V. Aggarwal, “Modeling heavy tails in traffic sources for network performance evaluation,” in Computational Intelligence, Cyber Security and Computational Models.   Springer, 2014, pp. 23–44.
  • [28] A. B. Downey, “Lognormal and pareto distributions in the internet,” Computer Communications, vol. 28, no. 7, pp. 790–801, 2005.
  • [29] E. Garsva, N. Paulauskas, and G. Grazulevicius, “Packet size distribution tendencies in computer network flows,” in Electrical, Electronic and Information Sciences (eStream), 2015 Open Conference of.   IEEE, 2015, pp. 1–6.
  • [30] E. Casilari, F. J. González, and F. Sandoval, “Modeling of http traffic,” IEEE Communications Letters, vol. 5, no. 6, pp. 272–274, June 2001.
  • [31] R. Pries, Z. Magyari, and P. Tran-Gia, “An http web traffic model based on the top one million visited web pages,” in Proceedings of the 8th Euro-NF Conference on Next Generation Internet NGI 2012, June 2012, pp. 133–139.
  • [32] K.-T. Chen, P. Huang, and C.-L. Lei, “Game traffic analysis: An mmorpg perspective,” Computer Networks, vol. 50, no. 16, pp. 3002 – 3023, 2006.
  • [33] W.-C. Feng, F. Chang, W.-C. Feng, and J. Walpole, “A traffic characterization of popular on-line games,” IEEE/ACM Transactions on Networking, vol. 13, no. 3, pp. 488–500, June 2005.
  • [34] J. K. Seoul, “Measurement and analysis of a massively multiplayer online role playing game traffic,” in Proceedings of Advanced Network Conference (Network Research Workshop), Aug. 2003.
  • [35] J. Färber, “Network game traffic modelling,” in Proceedings of the 1st Workshop on Network and System Support for Games, ser. NetGames ’02.   New York, NY, USA: ACM, 2002, pp. 53–57.
  • [36] I. Drago, M. Mellia, M. M. Munafo, A. Sperotto, R. Sadre, and A. Pras, “Inside Dropbox: Understanding Personal Cloud Storage Services,” in Proceedings of the 2012 Internet Measurement Conference, ser. IMC ’12.   New York, NY, USA: ACM, 2012, pp. 481–494.
  • [37] G. Gonçalves, I. Drago, A. P. C. d. Silva, A. B. Vieira, and J. M. Almeida, “Modeling the dropbox client behavior,” in 2014 IEEE International Conference on Communications (ICC), June 2014, pp. 1332–1337.
  • [38] B. A. Mah, “An empirical model of http network traffic,” in Proceedings of INFOCOM ’97, vol. 2, April 1997, pp. 592–600 vol.2.
  • [39] X. Yang, “Designing traffic profiles for bursty internet traffic,” in Global Telecommunications Conference, 2002. GLOBECOM ’02. IEEE, vol. 3, Nov. 2002, pp. 2149–2154.
  • [40] S. Waldmann, K. Miller, and A. Wolisz, “Traffic model for http-based adaptive streaming,” in 2017 IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), May 2017, pp. 683–688.
  • [41] T. Silva, J. M. Almeida, and D. Guedes, “Live streaming of user generated videos: Workload characterization and content delivery architectures,” Computer Networks, vol. 55, no. 18, pp. 4055 – 4068, 2011, internet-based Content Delivery.
  • [42] E. Veloso, V. Almeida, W. Meira, A. Bestavros, and S. Jin, “A hierarchical characterization of a live streaming media workload,” IEEE/ACM Transactions on Networking, vol. 14, no. 1, pp. 133–146, Feb 2006.
  • [43] H. Toral-Cruz, A.-S. K. Pathan, and J. C. R. Pacheco, “Accurate modeling of voip traffic qos parameters in current and future networks with multifractal and markov models,” Mathematical and Computer Modelling, vol. 57, no. 11, pp. 2832–2845, 2013, information System Security and Performance Modeling and Simulation for Future Mobile Networks.