Understanding how Internet services are used and how they are operating is critical to people's lives. Network Traffic Monitoring and Analysis (NTMA) is central to that task. Applications range from providing a view on network traffic to the detection of anomalies and unknown attacks while feeding systems responsible for usage monitoring and accounting. They collect the historical data needed to support traffic engineering and troubleshooting, helping to plan the network evolution and identify the root cause of problems. It is correct to say that NTMA applications are a cornerstone for guaranteeing that the services supporting our daily lives are always available and operating as expected.
Traffic monitoring and analysis is a complicated task. The massive traffic volumes, the speed of transmission systems, the natural evolution of services and attacks, and the variety of data sources and methods to acquire measurements are just some of the challenges faced by NTMA applications. As the complexity of the network continues to increase, more observation points become available to researchers, potentially allowing heterogeneous data to be collected and evaluated. This trend makes it hard to design scalable and distributed applications and calls for efficient mechanisms for online analysis of large streams of measurements. More than that, as storage prices decrease, it becomes possible to create massive historical datasets for retrospective analysis.
These challenges are precisely the characteristics associated with what has more recently become known as big data, i.e., situations in which the data volume, velocity, veracity and variety are the key challenges to allow the extraction of value from the data. Indeed, traffic monitoring and analysis was one of the first examples of big data sources to emerge, and it poses big data challenges more than ever.
It is thus not a surprise that researchers are resorting to big data technologies to support NTMA applications (e.g., [lee_toward_2013, marchal_big_2014, orsini_bgpstream_2016, wullink_entrada_2016]). Distributed file systems (e.g., the Hadoop Distributed File System, HDFS; see http://hadoop.apache.org/), big data platforms (e.g., Hadoop and Spark), and distributed machine learning and graph processing engines (e.g., MLlib and Apache Giraph) are some examples of technologies that are assisting applications to handle datasets that otherwise would be intractable. However, it is by no means clear whether NTMA applications fully exploit the potential of emerging big data technologies.
We bring together NTMA research and big data. Whereas previous works documented advances on big data research and technologies [zhang_parallel_2016, fahad_survey_2014, tsai_big_2015], methods supporting NTMA (e.g., machine learning for NTMA [nguyen_survey_2008, callado_survey_2009]), or specific NTMA applications [sperotto_overview_2010, valenti_reviewing_2013, hofstede_flow_2014, bhuyan_network_2014], there is a lack of systematic surveys describing how NTMA and big data are being combined to exploit the potential of network data fully.
More concretely, the goal of this survey is to discuss to what extent NTMA researchers are exploiting the potential of big data technologies. We aim to provide network researchers willing to adopt big data approaches for NTMA applications with a broad overview of success cases and pitfalls in previous research efforts, thus illustrating the challenges and opportunities in the use of big data technologies for NTMA applications.
By summarizing recent literature on NTMA, we provide researchers with principled guidelines on the use of big data according to the requirements and purposes of NTMA applications. Ultimately, by cataloging how challenges in NTMA have been faced with big data approaches, we highlight open issues and promising research directions.
I-A Survey methodology
We first identify papers summarizing big data research. We have explored the literature for big data surveys, restricting our focus to the last ten years. This literature has served as a basis for our definition of big data as well as to limit our scope in terms of the considered big data technologies.
Since none of these papers addresses big data for NTMA, we have followed a similar methodology and reviewed papers in the NTMA domain. We survey how NTMA researchers are profiting from big data approaches and technologies. We have focused on the last 5 years of (i) the main system and networking conferences, namely SIGCOMM, NSDI, CoNEXT, and journals (IEEE TON, IEEE TNSM); (ii) new venues targeting analytics and big data for NTMA, namely the BigDama, AnNet, WAIN and NetAI workshops, and their parent conferences (e.g., IM, NOMS and CNSM); (iii) special issues on Big Data Analytics published in this journal (TNSM). We complement the survey by searching on Google Scholar, IEEE Xplore, and the ACM Digital Library. To ensure relevance and limit our scope, we select papers considering publication venue, number of citations, and scope.
The survey is organized as follows: Sect. II introduces concepts of big data and the steps for big data analytics, giving also some background in big data platforms. Sect. III reviews a taxonomy of NTMA applications, illustrating how NTMA relates to the big data challenges. Subsequent sections detail the process of NTMA and discuss how previous work has faced big data challenges in NTMA. Sect. IV focuses on data capture, ingestion, storage and pre-processing, whereas Sect. V overviews big data analytics for NTMA. Finally, Sect. VI describes lessons learned and open issues.
II What is big data?
Many definitions for big data have appeared in the literature. We rely on previous surveys that focused on other aspects of big data but have already documented its definitions.
Hu et al. [hu_toward_2014] argue that “big data means not only a large volume of data but also other features, such as variety and velocity of the data”. They organize definitions in three categories: architectural, comparative, or attributive.
Architectural and comparative definitions are both abstract. Following an architectural definition, one would face a big data problem whenever the dataset is such that “traditional approaches” are incapable of storing and handling it. Similarly, comparative definitions characterize big data by comparing the properties of the data to the capabilities of traditional database tools. Those relying on attributive definitions (e.g., [manyika_big_2011, laney_3d_2001]) describe big data by the salient features of both the data and the process of analyzing the data itself. They advocate what is nowadays widely known by the big data “V’s” – e.g., volume, velocity, variety, veracity, value.
Other surveys targeting big data technologies [zhang_parallel_2016, bajaber_big_2016, hu_toward_2014] and analytics [tsai_big_2015, fahad_survey_2014] share this view. We will stick to the “5-Vs” definition because it provides concrete criteria that characterize the big data challenges. We thus consider a problem to belong to the big data class if datasets are large (i.e., volume) and need to be captured and analyzed at high rates or in real-time (i.e., velocity). The data potentially come from different sources (i.e., variety) that combine (i) structured data such as column-oriented databases; (ii) unstructured data such as server logs; and (iii) semi-structured data such as XML or JSON documents. Moreover, data can be of different quality, with the uncertainty that characterizes the data (i.e., veracity). At last, the analysis of the data brings advantages for users and businesses (i.e., value). That is, new insights are extracted from the original data to increase its value, preferably using automatic methodologies.
We argue next that NTMA shares these characteristics. Hence, NTMA is an example of a big data application that can profit from methodologies developed to face these challenges.
II-B Knowledge discovery in big data
The process of knowledge discovery in databases (KDD) is often attributed to Fayyad et al. [fayyad_data_1996], who summarized it as a sequence of operations over the data: gathering, selection, pre-processing, transformation, mining, evaluation, interpretation, and visualization. From a system perspective, the authors of [tsai_big_2015] reorganized these operations in three groups: input, analysis, and output.
Data input performs the data management process, from the collection of raw data to the delivery of data in suitable formats for subsequent mining. It includes pre-processing, i.e., the initial steps to prepare the data, such as the integration of heterogeneous sources and the cleaning of spurious data.
Data analysis methods receive the prepared data and extract information, i.e., models for classification, hidden patterns, relations, rules, etc. The methods range from statistical modeling and analysis to machine learning and data mining algorithms.
Data output completes the process by converting information into knowledge. It includes steps to measure the information quality, to display information in succinct formats, and to assist analysts with the interpretation of results. This step is not evaluated in this survey and is ignored in the following.
The stages identified by Fayyad et al. can be further grouped into data management and analytics (i.e., data analysis and output). Note that differently from [tsai_big_2015], we use the term data analysis to refer to the algorithms for extracting information from the data, whereas we use data analytics to refer to the whole process, from data analysis to knowledge discovery.
We reproduce and adapt this scheme in Fig. 1 and will use it to characterize NTMA papers according to the several stages of the KDD process.
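The three groups of operations can be pictured as a small pipeline. The following is a minimal sketch in plain Python, not tied to any big data platform; the input format (`host,bytes` lines) and all function names are assumptions made only for illustration.

```python
from collections import Counter


def data_input(raw_lines):
    """Data input: parse raw records and clean spurious ones."""
    records = []
    for line in raw_lines:
        parts = line.strip().split(",")
        if len(parts) != 2 or not parts[1].isdigit():
            continue  # cleaning step: skip malformed records
        records.append((parts[0], int(parts[1])))
    return records


def data_analysis(records):
    """Data analysis: extract information, here total bytes per host."""
    totals = Counter()
    for host, nbytes in records:
        totals[host] += nbytes
    return totals


def data_output(totals, top=1):
    """Data output: present the information succinctly to an analyst."""
    return totals.most_common(top)


# Hypothetical raw measurements, including one spurious line.
raw = ["10.0.0.1,500", "10.0.0.2,1200", "garbage", "10.0.0.1,900"]
top_talkers = data_output(data_analysis(data_input(raw)))
```

In a big data setting each stage would run on a distributed framework, but the separation of concerns between the stages stays the same.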
The KDD process described above applies to any data analytics. However, the characteristics of big data impose fundamental challenges to the methodologies at each step. For example, the data management process will naturally face much more complex tasks with big data, given the volume, velocity, and variety of the data. Data management is, therefore, crucial with big data since it plays a key role in reducing data volume and complexity. Also, it impacts the analysis and the output phases, in particular in terms of speed of the analysis and value of results.
The big data challenges (i.e., the “5-Vs”) call for an approach that considers the whole KDD process in a comprehensive analytics framework. Such a framework should include programming models that allow implementing application logic covering the complete KDD cycle. We will show later that a common practice to cope with big datasets is to resort to parallel and distributed computing frameworks that are still expanding to cover all KDD phases.
II-C Programming models and platforms
We provide a short overview of the most relevant programming models and platforms for handling big data. While surveying the literature, we will give particular emphasis to works that make use of such models and platforms for NTMA.
Following the taxonomies found in [bajaber_big_2016, zhang_parallel_2016], we distinguish four types of big data processing models: (i) general purpose – platforms to process big data that make few assumptions about the data characteristics and the executed algorithms, (ii) SQL-like – platforms focusing on scalable processing of structured and tabular data, (iii) graph processing – platforms focusing on the processing of large graphs, and (iv) stream processing – platforms dealing with large-scale data that continuously arrive at the system in a streaming fashion.
Fig. 2 depicts this taxonomy with examples of systems. MapReduce, Dryad, Flink, and Spark belong to the first type. Hive, HAWQ, Apache Drill, and Tajo belong to the SQL-like type. Pregel and GraphLab follow the graph processing model and, finally, Storm and S4 are examples of the stream processing type.
A comprehensive review of big data programming models and platforms is far beyond the scope of this paper. We refer readers to [bajaber_big_2016, zhang_parallel_2016] for a complete survey.
II-C1 The Hadoop ecosystem
Hadoop is the most widespread solution among the general-purpose big data platforms. Given its importance, we provide some details about its components in Fig. 3, considering Hadoop v2.0 and Spark [big_data_working_group_big_2014].
Hadoop v2.0 consists of the Hadoop kernel, MapReduce and the Hadoop Distributed File System (HDFS). YARN is the default resource manager, providing access to cluster resources to several competing jobs. Other resource managers (e.g., Mesos) and execution engines (e.g., TEZ) can be used too, e.g., for providing resources to Spark.
Spark has been introduced in Hadoop v2.0 onward aiming to solve limitations in the MapReduce paradigm. Spark is based on data representations that can be transformed in multiple steps while efficiently residing in memory. In contrast, the MapReduce paradigm relies on basic operations (i.e., map/reduce) that are applied to data batches read from and stored to disk. Spark has gained momentum in non-batch scenarios, e.g., iterative and real-time big data applications, as well as in batch applications that cannot be solved in a few stages of map/reduce operations.
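To make the map/reduce operations concrete, the sketch below mimics the three phases (map, shuffle, reduce) in plain Python on a toy list of flow records. It does not use the Hadoop or Spark APIs; the record layout and the function names are assumptions for illustration only.

```python
from collections import defaultdict
from functools import reduce

# Toy input: (destination port, bytes) per flow record; values are made up.
flows = [(443, 1500), (53, 80), (443, 700), (80, 400), (53, 120)]


def map_phase(record):
    """Map: emit a (key, value) pair for each input record."""
    port, nbytes = record
    return (port, nbytes)


def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(values):
    """Reduce: fold the values of one key into a single result."""
    return reduce(lambda a, b: a + b, values)


mapped = [map_phase(r) for r in flows]
bytes_per_port = {k: reduce_phase(v) for k, v in shuffle(mapped).items()}
```

An analysis that needs several such passes (e.g., an iterative algorithm) would repeat this cycle many times. In the MapReduce paradigm each pass reads from and writes to disk, which is the limitation that in-memory representations aim to remove.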
Several high-level languages and systems, such as Google’s Sawzall [pike_interpreting_2005], Yahoo’s Pig Latin [gates_building_2009], Facebook’s Hive [thusoo_hive_2009], and Microsoft’s SCOPE [chaiken_scope_2008] have been proposed to run on top of Hadoop. Moreover, several libraries such as Mahout [owen_mahout_2011] over MapReduce and MLlib over Spark have been introduced by the community to solve problems or fill gaps in the original Hadoop ecosystem. Finally, the ecosystem has been complemented with tools targeting specific processing models, such as GraphX and Spark Streaming, which support graph and stream processing, respectively.
Besides the Apache Hadoop distributions, proprietary platforms offer different features for data processing and cluster management. Such solutions include Cloudera CDH (http://www.cloudera.com/downloads/cdh/5-8-2.html), Hortonworks HDP (http://hortonworks.com/products/data-center/hdp/), and MapR Converged Data Platform (https://www.mapr.com/products/mapr-converged-data-platform).
III Categorizing NTMA applications
We rely on a subset of the taxonomy of network management applications found in [boutaba_comprehensive_2018] to categorize papers handling NTMA with big data approaches. The previous work lists eight categories, which are defined according to the final purpose of the management application, namely: (i) traffic prediction, (ii) traffic classification, (iii) fault management, (iv) network security, (v) congestion control, (vi) traffic routing, (vii) resource management, and (viii) QoS/QoE management.
We only survey works that fit into the first four categories for two reasons. First, whereas the taxonomy in [boutaba_comprehensive_2018] is appropriate for describing network management applications, the level of dependence of such applications on NTMA varies considerably. Traffic routing and resource management seem less dependent on large-scale measurements than traffic prediction and security, for example. Second, the literature on the use of big data approaches for congestion control, traffic routing, resource management, and QoS/QoE management was almost nonexistent at the time of surveying. We conjecture that either large-scale datasets targeting problems in these categories are not available, or researchers have not identified potential in applying big data approaches in those scenarios. We thus ignore the works using big data approaches for those cases.
III-B NTMA applications
Tab. I shows the categories used in our survey. Next, we list examples of NTMA applications in these categories.
III-B1 Traffic prediction
Traffic prediction consists of estimating the future status of network links. It serves as a building block for traffic engineering, helping to define, for example, the best moment to deploy more capacity in the network so as to maintain QoS levels.
Traffic prediction is often faced as a time-series forecasting problem [boutaba_comprehensive_2018]. Here both classic forecasting methods (e.g., ARIMA or SARIMA methods) and machine learning (e.g., deep neural networks) are employed. The problem is usually formulated as the estimation of traffic volumes based on previous measurements in the same links. Changes in network behavior, however, make such estimations very complicated. New services or sudden changes in service configurations (e.g., the deployment of novel bandwidth-hungry applications) pose major challenges to traffic prediction approaches.
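As a minimal illustration of forecasting link load from previous measurements, the sketch below uses simple exponential smoothing. This is a deliberately simpler stand-in for the ARIMA-family and neural methods mentioned above, not one of them; the function name, the smoothing factor and the sample values are assumptions.

```python
def exponential_smoothing_forecast(series, alpha=0.5):
    """One-step-ahead forecast with simple exponential smoothing.

    level_t = alpha * x_t + (1 - alpha) * level_{t-1}
    The forecast for the next interval is the last smoothed level.
    """
    if not series:
        raise ValueError("series must not be empty")
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    level = series[0]
    for x in series[1:]:
        level = alpha * x + (1 - alpha) * level
    return level


# Hypothetical link load samples in Mbit/s, one per measurement interval.
link_load_mbps = [100, 120, 110, 130]
forecast = exponential_smoothing_forecast(link_load_mbps, alpha=0.5)
```

Because the forecast depends only on past samples of the same link, a sudden change such as the roll-out of a bandwidth-hungry service is reflected only after it has been observed, which is the difficulty noted in the text.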
III-B2 Traffic classification
Traffic classification aims at identifying services producing traffic [valenti_reviewing_2013]. It is a crucial step for managing and monitoring the network. Operators need information about services, e.g., to understand their requirements and their impact on the overall network performance.
Traffic classification used to work well by simply inspecting information in network and transport protocols. For instance, Internet services used to be identified by simply inspecting TCP/UDP port numbers. However, traffic classification is no longer a simple task [nguyen_survey_2008]. First, the number of Internet services is large and continues to increase. Second, services must be identified by observing little information seen in the network. Third, little information remains visible in packets, since a major share of the Internet services run on top of a handful of encryption protocols (e.g., HTTPS over TCP). At last, Internet services are dynamic and constantly updated.
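The port-based approach can be sketched in a few lines. The port table below is a small illustrative subset of well-known port assignments, and the function name is hypothetical.

```python
# Small illustrative subset of well-known (protocol, port) assignments.
WELL_KNOWN_PORTS = {
    ("tcp", 22): "ssh",
    ("tcp", 25): "smtp",
    ("tcp", 53): "dns",
    ("udp", 53): "dns",
    ("tcp", 80): "http",
    ("tcp", 443): "https",
}


def classify_by_port(protocol, src_port, dst_port):
    """Label a flow by looking up its ports; server port may be on either side."""
    for port in (dst_port, src_port):
        label = WELL_KNOWN_PORTS.get((protocol, port))
        if label is not None:
            return label
    return "unknown"


labels = [
    classify_by_port("udp", 51000, 53),
    classify_by_port("tcp", 443, 52000),
    classify_by_port("tcp", 40000, 50000),
]
```

The sketch also shows why the approach no longer suffices: every service running over HTTPS receives the same `https` label, and services on non-standard ports fall into `unknown`.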
Many approaches have been proposed to perform traffic classification. They can be grouped according to the strategy used to classify the packets: (i) Packet inspection analyzes the content of packets searching for pre-defined messages [bujlow_independent_2015] or protocol fingerprints; (ii) Supervised machine learning methods extract features from the traffic and, in a training phase, build models to associate feature values to the services; (iii) Unsupervised machine learning methods cluster traffic without previous knowledge on the services and, as such, are appropriate to explore services for which training data is unavailable [erman_traffic_2006]; (iv) Behavioral methods identify services based on the behavior of end-nodes [karagiannis_blinc_2005]. The algorithms observe traffic to build models for nodes running particular services. The models describe, for instance, which servers are reached by the clients, with which data rate, the order in which servers are contacted, etc.
III-B3 Fault management
Fault management is the set of tasks to predict, detect, isolate, and correct faults in networks. The goal is to minimize downtime. Fault management can be proactive, e.g., when analytics predict faults based on measurements to avoid disruptions, or reactive, e.g., when traffic and system logs are evaluated to understand ongoing problems. In either case, a key step in fault management is the localization of the root-cause of problems [boutaba_comprehensive_2018].
In large networks, diverse elements may be impacted by faults: e.g., a failed router may overload other routes, thus producing a chain of faults in the network. Diverse network elements will produce system logs related to the problem, and the behavior of the network may be changed in diverse aspects. Detecting malfunctioning is often achieved by means of anomaly detection methods that identify abnormal behavior in traffic or unusual events in system logs. Anomalies, however, can be caused also by security incidents (described next) or normal changes in usage patterns. Analytics algorithms often operate evaluating traffic, system logs, and active measurements in conjunction, so as to increase the visibility of the network and ease the identification of root-causes of problems.
Many NTMA applications have been proposed for assisting cyber-security [liao_intrusion_2013]. The most common objective is to detect security flaws, viruses, and malware, so as to isolate infected machines and take countermeasures to minimize damages. Roughly speaking, there are two main approaches when searching for malicious network activity: (i) based on attack signatures; (ii) based on anomaly detection.
Signature-based methods build upon the idea that it is possible to define fingerprints for attacks. A monitoring solution inspects the source traffic/logs/events searching for (i) known messages exchanged by viruses, malware or other threats; or (ii) the typical communication patterns of the attacks – i.e., similar to behavioral traffic classification methods. Signature-based methods are efficient to block well-known attacks that are immutable or that mutate slowly. These methods however require attacks to be well-documented.
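The first kind of signature (known messages) amounts to pattern matching on the observed data. A minimal sketch follows; the signature names and byte patterns are hypothetical and chosen only for illustration, and real systems use far richer rule languages.

```python
# Hypothetical signatures: name -> byte pattern expected in the payload.
SIGNATURES = {
    "example-bot-beacon": b"BOT_HELLO",
    "example-path-traversal": b"/cgi-bin/../../etc/passwd",
}


def match_signatures(payload, signatures=SIGNATURES):
    """Return the sorted names of all signatures found in the payload."""
    return sorted(
        name for name, pattern in signatures.items() if pattern in payload
    )


hits = match_signatures(b"GET /cgi-bin/../../etc/passwd HTTP/1.1")
clean = match_signatures(b"GET /index.html HTTP/1.1")
```

A slightly mutated attack (e.g., a different beacon string) would not match any entry, which illustrates why such methods require attacks to be well documented.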
Methods based on anomaly detection [bhuyan_network_2014, garcia-teodoro_anomaly-based_2009] build upon the assumption that attacks will change the behavior of the network. They build models to summarize the normal network behavior from measurements. Live traffic is then monitored and alerts are triggered when the behavior of the network differs from the baseline. Anomaly detection methods are attractive since they allow the early detection of unknown threats (e.g., zero-day exploits). These methods, however, may not detect stealth attacks (i.e., false negatives), which are not sufficiently large to disturb the network. They sometimes suffer from large numbers of false positives too.
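The baseline-and-alert scheme can be sketched with a simple threshold on the deviation from the mean. The choice of a mean/standard-deviation model, the threshold `k` and the sample values are assumptions; methods in the literature use far more elaborate models.

```python
from statistics import mean, stdev


def fit_baseline(samples):
    """Summarize normal behavior as (mean, sample standard deviation)."""
    return mean(samples), stdev(samples)


def is_anomalous(value, baseline, k=3.0):
    """Alert when the value deviates more than k standard deviations."""
    mu, sigma = baseline
    return abs(value - mu) > k * sigma


# Hypothetical packets-per-second measurements taken during normal operation.
normal_pps = [1000, 1040, 980, 1010, 990, 1020, 1005, 995]
baseline = fit_baseline(normal_pps)

# Live measurements: the burst is flagged, small deviations are not.
live_pps = [1015, 5000, 1030]
alerts = [v for v in live_pps if is_anomalous(v, baseline)]
```

An attack that adds only a few packets per second stays within the threshold and is missed (a false negative), while lowering `k` to catch it would raise more false positives on normal fluctuations.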
III-C Big data challenges in NTMA applications
We argue that NTMA applications belonging to the categories above can profit from big data approaches. Processing the measurements these applications rely on poses the typical big data challenges (i.e., the “5-Vs”). We provide some examples to support the claim.
Consider first volume and velocity, taking traffic classification as an example: it has to be performed on the fly (e.g., on packets), and the input data are usually large. We will see later that researchers rely on different strategies to perform traffic classification on high-speed links, with some works applying algorithms to hundreds of Gbps streams.
Consider then variety. As more and more traffic is encrypted, algorithms to perform traffic classification or fault management, for example, have less information to operate on. Modern algorithms rely on diverse sources – e.g., see [trevisan_towards_2016] that combines DNS and flow measurements. Anomaly detection, as another example, is usually applied to a variety of sources too (e.g., traffic traces, routing information, etc.), so as to obtain diverse signals of anomalies.
In terms of veracity, we again cite cyber-security. Samples of attacks are needed to train classifiers to identify the attacks on real networks. Producing such samples is a challenging task. While simple attacks can be reproduced in the laboratory, elaborate attacks are hard to reproduce – or, worse, are simulated in unrealistic ways – thus limiting the analysis.
Finally, value is clearly critical in all the above applications – e.g., in cyber-security, a single attack that goes undetected can be unbearable to the operation of the network.
IV Data management for NTMA
IV-A Data collection
Measuring the Internet is a big data challenge. There are more than half a million networks, 1.6 billion websites, and more than 3 billion users, accessing more than 50 billion web pages, i.e., exchanging some zettabytes per year. Not surprisingly, the community has spent much work in designing and engineering tools for data acquisition at high speeds; some of them are discussed in [sakr_big_2018]. Here the main challenges addressed by the community seem to be the scalability of data collection systems and how to reduce data volumes at the collection points without impacting the data quality.
Measuring the Internet can be accomplished broadly in two ways: (i) active measurements, and (ii) passive measurements. The former indicates the process of injecting data into the network and observing the responses. It is not a scalable process and is typically used for troubleshooting. Passive measurements, on the contrary, build on the continuous observation of traffic and the extraction of indexes in real time.
A network exporter captures the traffic crossing monitoring points, e.g., routers aggregating several customers. The exporter sends out a copy of the observed network packets and/or traffic statistics to a collector. Data is then saved in storage formats suitable for NTMA applications. Analysis applications then access the data to extract knowledge.
The components of this architecture can be integrated into a single device (e.g., a router or a dedicated network monitor) or deployed in a distributed architecture. We will argue later that big data challenges emerge already at the first stage of the NTMA process and, as such, distributed architectures are common in practical setups. The large data rate in the monitoring points has pushed researchers and practitioners into the integration of many pre-processing functionalities directly into the exporters. The most prominent example is perhaps flow-based monitoring, in which only a summary of each traffic flow is exported to collectors. Next, we provide a summary of packet- and flow-based methods used to collect data for NTMA applications.
IV-A1 Packet-based methods
|Description||Duration||pcap (GB)||Headers up to L4 (%)|
|Full packets of 20 k ADSL users (morning)||60 mins||675||6.8|
|Full packets of 15 k Campus users (morning)||100 mins||913||6.3|
|Only 5 kB or 5 packets per flow, 20 k ADSL users||1 day||677||22.5|
Analyzing packets provides the highest flexibility for the NTMA applications, but also requires a significant amount of resources. Deep Packet Inspection (DPI) means to look into, and sometimes export, full packet contents that include application headers and payload. The data rate poses challenges (i.e., velocity), which can be faced using off-the-shelf hardware [trevisan_traffic_2017], provided hardware support is present [intel_data_2014]. Indeed, there exist technologies to perform DPI at multiple Gbit/s, while also saving the packets to disk [deri_10_2013].
The network monitoring community has proposed multiple alternatives to avoid the bottlenecks of packet-based methods. The classical approach is to perform pre-processing near the data collection. Filtering and sampling are usually set on collection points to export only a (small) subset of packets that pass the filtering and sampling rules. Other ad-hoc transformations may be employed too, e.g., the exporting of packet headers only, instead of full packet contents. In a similar direction, authors of [maier_enriching_2008] propose to collect only the initial packets of each flow, since these packets are usually sufficient for applications such as traffic classification.
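The three reduction strategies mentioned above (filtering, sampling, and exporting only the first bytes of each packet) can be combined in a single export function. The sketch below is a simplified illustration: packets are plain dictionaries, sampling is deterministic 1-in-N (real exporters often sample randomly), and all names are assumptions.

```python
def export_subset(packets, keep_port=None, sample_every=1, snaplen=None):
    """Apply a port filter, 1-in-N sampling and truncation before export."""
    exported = []
    matched = 0
    for pkt in packets:
        # Filtering: keep only packets to/from the given port, if any.
        if keep_port is not None and keep_port not in (pkt["sport"], pkt["dport"]):
            continue
        matched += 1
        # Sampling: export the 1st, (N+1)-th, (2N+1)-th ... matching packet.
        if (matched - 1) % sample_every != 0:
            continue
        # Truncation: keep only the first `snaplen` bytes (e.g., headers).
        data = pkt["data"] if snaplen is None else pkt["data"][:snaplen]
        exported.append({**pkt, "data": data})
    return exported


# Ten synthetic 100-byte packets alternating between ports 443 and 80.
packets = [
    {"sport": 50000 + i, "dport": 443 if i % 2 == 0 else 80, "data": bytes(100)}
    for i in range(10)
]
exported = export_subset(packets, keep_port=443, sample_every=2, snaplen=40)
exported_bytes = sum(len(p["data"]) for p in exported)
```

In this toy case 1 000 bytes of captured traffic shrink to 120 exported bytes, at the price of the information discarded by each rule.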
As an illustration, Tab. II describes the volume of packet traces captured in some real deployments. The first two lines report the size of packet traces captured at (i) an ISP network where around 20 k ADSL customers are connected; (ii) a campus network connecting around 15 k users. Both captures have been performed in morning hours and last for at least 1 hour. More than half a TB is saved in both cases. The table also shows that the strategy of saving only packet headers up to the transport layer only partially helps to reduce the data volume – e.g., around 45 GB of headers per hour would still be saved for the ISP trace. Finally, the last line shows the size of a full day of capture in the ISP network with a setup similar to [maier_enriching_2008] (i.e., saving only 5 packets or 5 kB per flow). Whereas the data volume is reduced significantly, more than 600 GB of pcaps are saved per day.
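The header volumes quoted above follow directly from the first two rows of Tab. II. The short computation below reproduces them; the function name is an assumption, and the input values are copied from the table.

```python
def header_gb_per_hour(duration_min, pcap_gb, header_pct):
    """Volume of headers (up to L4) per hour of capture, in GB."""
    header_gb = pcap_gb * header_pct / 100
    return header_gb * 60 / duration_min


# (description, duration in minutes, pcap size in GB, header share in %)
traces = [
    ("ADSL users, full packets", 60, 675, 6.8),
    ("Campus users, full packets", 100, 913, 6.3),
]
hourly = {name: header_gb_per_hour(d, gb, pct) for name, d, gb, pct in traces}
# ADSL: 675 GB * 6.8 % = 45.9 GB over 60 minutes.
# Campus: 913 GB * 6.3 % = 57.5 GB over 100 minutes, i.e. about 34.5 GB/hour.
```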
Nowadays, DPI is challenged by encryption as well as by restrictive privacy policies [fuchs_implications_2012]. Alternatives to performing DPI on encrypted traffic have been presented [sherry_blindbox_2015].
IV-A2 Flow-based methods
Flow-based methods process packets near the data collection, exporting only summaries of the traffic per flow [hofstede_flow_2014]. A network flow is defined as a sequence of packets that share common characteristics, identified by a key. NTMA applications that analyze flow records have lower transmission requirements since data is aggregated. Data privacy is better protected and issues related to encryption are partially avoided. Nevertheless, there is an unavoidable loss of information, which makes the analysis more limited. In 2013, Cisco estimated that approximately 90% of network traffic analyses are flow-based, leaving the remaining 10% for packet-based approaches [patterson_netflow_2013].
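A flow exporter essentially groups packets by key and keeps per-flow counters. The sketch below assumes the common 5-tuple key (source and destination address, source and destination port, protocol) and ignores flow timeouts and export protocols; the packet tuple layout and the record fields are assumptions for illustration.

```python
def aggregate_flows(packets):
    """Summarize packets into per-flow records keyed by the 5-tuple."""
    flows = {}
    for ts, src, dst, sport, dport, proto, length in packets:
        key = (src, dst, sport, dport, proto)
        record = flows.get(key)
        if record is None:
            # First packet of the flow: create its record.
            flows[key] = {"first": ts, "last": ts, "packets": 1, "bytes": length}
        else:
            # Later packets only update the counters and the end time.
            record["last"] = max(record["last"], ts)
            record["packets"] += 1
            record["bytes"] += length
    return flows


# Hypothetical packets: (timestamp, src, dst, sport, dport, proto, length).
packets = [
    (0.0, "10.0.0.1", "192.0.2.10", 50000, 443, "tcp", 60),
    (0.1, "10.0.0.1", "192.0.2.10", 50000, 443, "tcp", 1500),
    (0.2, "10.0.0.2", "192.0.2.53", 40000, 53, "udp", 80),
]
flows = aggregate_flows(packets)
```

Only the counters survive the aggregation: payloads and per-packet timing are gone, which is the loss of information mentioned above.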
Diverse flow monitoring technologies have been proposed. Cisco NetFlow, sFlow and IPFIX are the most common ones. NetFlow has been widely used for a variety of NTMA applications, for example for network security [nickless_combining_2000]. sFlow relies heavily on packet sampling (e.g., sampling 1 in every 1 000 packets). This may hurt the performance of many NTMA applications, even if sFlow is still suitable for detecting attacks, for example. Finally, IPFIX (IP Flow Information Export protocol) is the IETF standard for exporting flow information [hofstede_flow_2014]. IPFIX is flexible and customizable, allowing one to select the fields to be exported. IPFIX is used for a number of NTMA applications, such as the detection of SSH attacks [hofstede_ssh_2014].
Even flow-based monitoring may produce very large datasets [sakr_big_2018]. For illustration, Tab. III lists the volume of flow-level information exported when monitoring 50 Gbit/s on average. The table is built by extrapolating volumes reported by [hofstede_flow_2014] for flow exporters in a real deployment. The table includes Cisco’s NetFlow v5 and NetFlow v9 as well as IPFIX.
|Sampling rate||Protocol||Packets/s||Bits/s|
|1:1||NetFlow v5||4.1 k||52 M|
|1:1||NetFlow v9||10.2 k||62 M|
|1:10||NetFlow v9||4.7 k||27 M|
|1:1||IPFIX||12.5 k||75 M|
Flow-based NTMA applications would still face more than 70 Mbit/s if IPFIX is chosen to export data. Since these numbers refer to a single vantage point, NTMA applications that aggregate multiple vantage points may need to face several Gbit/s of input data. Finally, the table reports data speeds when employing sampling (see 1:10 rate). Sampling reduces the data volume considerably, but it limits NTMA applications.
Finally, NTMA applications often need to handle historical data. The storage needed to archive the compressed flow data from the vantage point used as an example in Tab. II grows linearly over time, and the archive reaches almost 30 TB after four years [sakr_big_2018].
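To put the export rates in perspective, the short computation below converts the IPFIX figure of Tab. III (75 Mbit/s) into a daily volume and estimates how many such vantage points add up to 1 Gbit/s. It assumes a constant rate and uncompressed export data; the function name is hypothetical.

```python
import math


def bits_per_second_to_gb_per_day(bps):
    """Daily volume in GB (10^9 bytes) for a constant bit rate."""
    seconds_per_day = 24 * 60 * 60
    return bps * seconds_per_day / 8 / 1e9


ipfix_bps = 75e6  # IPFIX export rate for one vantage point, from Tab. III
ipfix_gb_per_day = bits_per_second_to_gb_per_day(ipfix_bps)

# Number of similar vantage points whose export streams reach 1 Gbit/s.
vantage_points_for_1gbps = math.ceil(1e9 / ipfix_bps)
```

Under these assumptions a single exporter produces about 810 GB of flow export data per day, and 14 such exporters exceed 1 Gbit/s of aggregate input.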
IV-B Data ingestion
The previous step is the KDD step that depends the most on the problem domain since the way data is acquired varies a lot according to the given scenario. Not a surprise, generic big data frameworks mostly provide tools and methods for transporting raw data from its original sources into pre-processing and analytics pipelines. The process of transporting the data into the frameworks is usually called data ingestion. Since data at its source is still unprocessed, handling such raw data may be a huge challenge. Indeed, large data streams or a high number of distributed collectors will produce a deluge of data to be parsed and pre-processed in subsequent stages.
Several generic tools for handling big data ingestion can be cited: (i) Flume,555https://flume.apache.org a distributed system to collect logs from different sources; (ii) Sqoop,666http://sqoop.apache.org which allows to transmit data between Hadoop and relational databases; (iii) Kafka,777https://kafka.apache.org a platform that provides a distributed publish-subscribe messaging system for online data streaming; among others. These tools focus on scalability – e.g., they allow multiple streams to be processed in a distributed fashion. Once data is delivered to the framework, pre-processing can take place on the streams.
Considering NTMA, few solutions have been proposed to ingest traffic measurements into big data frameworks. Here the main challenge seems to be the transport of data in different formats and from various sources into the frameworks. The research community has made interesting progress on this front, and we cite two examples.
Apache Spot (http://spot.incubator.apache.org) is an open platform that integrates modules for different applications in the network security area. It relies on Kafka for performing data ingestion. Users and developers have to provide Spot with Python scripts that parse the original measurements into a pre-defined, but flexible, format. Both the original data and converted records are loaded and stored in the big data framework.
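To give a flavor of what such a parsing script may look like, the sketch below converts comma-separated flow lines into typed records. It is only an illustration: the input layout, the `SCHEMA` fields and the function name `parse_flows` are hypothetical and do not reproduce Spot's actual schema or interfaces.

```python
import csv
import io
from typing import Dict, Iterator, Union

# Hypothetical target schema: (field name, type converter).
# This is NOT Spot's actual schema; it only illustrates the idea.
SCHEMA = (
    ("start_ts", float),
    ("src_ip", str),
    ("dst_ip", str),
    ("src_port", int),
    ("dst_port", int),
    ("proto", str),
    ("bytes", int),
)

Record = Dict[str, Union[str, int, float]]


def parse_flows(raw_text: str) -> Iterator[Record]:
    """Convert comma-separated flow lines into typed records.

    Lines with the wrong number of fields, or with fields that fail
    type conversion, are skipped.
    """
    for row in csv.reader(io.StringIO(raw_text)):
        if len(row) != len(SCHEMA):
            continue  # malformed line: wrong number of fields
        try:
            record = {
                name: cast(value.strip())
                for (name, cast), value in zip(SCHEMA, row)
            }
        except ValueError:
            continue  # malformed line: a field failed type conversion
        yield record
```

Records in such a uniform shape can then be handed to a standard ingestion tool, independently of the original measurement format.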
A promising project is PNDA (Platform for Network Data Analysis; http://pnda.io/overview). Developed by the Linux Foundation, it is an open-source platform capable of managing and grouping together different data sources. It then performs analysis on such data sources. It does not force data into any specific schema and it allows the integration of other sources, producing custom code for the analysis stage. It makes use of standard tools in big data frameworks, including Kafka for data ingestion. Here, a number of plugins for reading data in virtually all popular NTMA formats are available on the project website.
The Spot and PNDA experiences clearly show that ingesting NTMA data into generic frameworks requires some effort to plug the NTMA formats into the frameworks. Once such plugins exist, standard ingestion tools like Kafka can be employed. The availability of open source code (e.g., PNDA plugins) makes it easier to integrate these functionalities in other NTMA applications.
IV-C Data storage
In theory, the generic storage systems of big data frameworks can be used with NTMA too. In practice, a number of characteristics of the big data frameworks complicate the storage of NTMA data. To ease the understanding of these challenges, we first provide some background on generic big data storage systems. Then, we evaluate how the NTMA community is employing these systems in NTMA applications.
IV-C1 Generic systems
Fig. 4 reproduces a taxonomy of big data storage and management systems [chen_big_2014, big_data_working_group_big_2014]. Colors represent the media (i.e., memory or persistent media) primarily exploited by the system.
Most big data storage systems focus on horizontal scalability, i.e., growing the capacity of the system across multiple servers, rather than upgrading a single server to handle increasing data volumes. This approach results in large distributed systems, which carry many risks, such as node crashes and network failures. Consistency, availability, and fault tolerance are therefore of major concern in big data storage [big_data_working_group_big_2014, hu_toward_2014].
In terms of file systems, Google pioneered the development with the implementation of the Google File System (GFS) [ghemawat_google_2003]. GFS was built with horizontal scalability in mind, thus running on commodity servers and providing fault tolerance and high performance. Colossus [mckusick_gfs_2009], the successor of GFS, and Facebook Haystack [beaver_finding_2010] are other examples. Open source derivatives of GFS appeared later, including Apache HDFS and Kosmosfs.
On a higher abstraction level, relational databases suffer performance penalties with large datasets. NoSQL databases [han_survey_2011] are alternatives that emerged for such scenarios. NoSQL databases scale better than the relational ones by overcoming the intrinsic rigidity of database schemas. NoSQL databases adopt different physical data layouts: (i) key-value databases store data in sets of key-value pairs organized into rows. Examples include Amazon’s Dynamo [decandia_dynamo_2007] and Memcached (http://memcached.org); (ii) column-oriented databases (inspired by Google’s BigTable [chang_bigtable_2008]) store data by column instead of by row. Examples are Cassandra [lakshman_cassandra_2010] and HBase (https://hbase.apache.org/); (iii) document-oriented databases store data in documents uniquely identified by a key. An example is MongoDB (https://www.mongodb.com/); (iv) graph-oriented databases store graphs, i.e., nodes and edges. They impose a graph schema on the database, but profit from it to implement efficient graph operations. An example is Titan (http://titan.thinkaurelius.com/).
Finally, NewSQL systems aim at providing performance similar to that of NoSQL databases, while maintaining properties of relational systems [stonebraker_newsql_2011], e.g., relational data models and SQL queries [cattell_scalable_2011]. An example is Google Spanner (https://cloud.google.com/spanner/).
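The difference between the key-value and column-oriented layouts mentioned above can be illustrated with plain in-memory structures. The sketch below is conceptual only: the flow records and field names are hypothetical, and real systems such as Cassandra or HBase add persistence, partitioning and replication on top of these ideas.

```python
from typing import Any, Dict, List

# Hypothetical flow records; field names are illustrative only.
ROWS: List[Dict[str, Any]] = [
    {"flow_id": "f1", "src_ip": "10.0.0.1", "bytes": 1200},
    {"flow_id": "f2", "src_ip": "10.0.0.2", "bytes": 300},
    {"flow_id": "f3", "src_ip": "10.0.0.1", "bytes": 4500},
]


def to_key_value(rows: List[Dict[str, Any]], key: str) -> Dict[Any, Dict[str, Any]]:
    """Key-value layout: each whole record is reachable through one key."""
    return {row[key]: row for row in rows}


def to_columns(rows: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Column layout: all values of a field are stored contiguously.

    Assumes every row carries the same fields.
    """
    columns: Dict[str, List[Any]] = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns
```

With the column layout, an aggregate such as `sum(to_columns(ROWS)["bytes"])` touches a single list, whereas the key-value layout favors fetching one complete record by its key.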
IV-C2 NTMA storage in big data frameworks
Since the Internet is a distributed system, a key problem is how to archive measurements in a centralized data store. Here no standard solution exists, despite multiple attempts to provide scalable and flexible approaches [sacerdoti_wide_2003, trammell_mplane_2014]. The measurements are usually collected using ad-hoc platforms and exported in formats that are not directly readable by big data frameworks. Therefore, special storage solutions and/or additional data transformation steps are needed.
For example, libpcap, libtrace, libcoral, libnetdude and Scapy are some libraries used for capturing packets. These libraries read and save traces using the pcap format (or pcap-ng). Distributed big data file and database systems, such as HDFS or Cassandra, are generally unable to read pcap traces directly, or in parallel, since the pcap traces are not separable – i.e., they cannot be easily split into parts. One would still need to reprocess the traces sequentially when reading data from the distributed file systems. A similar problem emerges for formats used to store flow-based measurements. For example, nfdump is a popular format used to store NetFlow measurements. While files are split into parts by the collector according to pre-set parameters (e.g., saving one file every five minutes), every single file is a large binary object. Hadoop-based systems cannot split such files into parts automatically.
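The reason pcap traces are hard to split can be seen from the classic pcap layout: a 24-byte global header followed by packet records, each preceded by a 16-byte header carrying the captured length. Record boundaries are therefore only discoverable by walking the file from the start. The sketch below illustrates this under simplifying assumptions (little-endian, microsecond-resolution classic pcap only, no pcap-ng); the helper names are our own.

```python
import struct
from typing import List, Tuple

GLOBAL_HEADER = struct.Struct("<IHHiIII")  # 24 bytes: magic, version, zone, sigfigs, snaplen, linktype
RECORD_HEADER = struct.Struct("<IIII")     # 16 bytes: ts_sec, ts_usec, incl_len, orig_len
MAGIC_LE_MICROS = 0xA1B2C3D4


def build_pcap(payloads: List[bytes]) -> bytes:
    """Build a tiny in-memory pcap trace (test helper, dummy timestamps)."""
    parts = [GLOBAL_HEADER.pack(MAGIC_LE_MICROS, 2, 4, 0, 0, 65535, 1)]
    for i, payload in enumerate(payloads):
        parts.append(RECORD_HEADER.pack(i, 0, len(payload), len(payload)))
        parts.append(payload)
    return b"".join(parts)


def packet_offsets(trace: bytes) -> List[Tuple[int, int]]:
    """Return (record offset, captured length) pairs by walking the trace.

    The offset of record k is only known after reading the headers of
    records 0..k-1, which is why a reader cannot start at an arbitrary
    block boundary.
    """
    magic = GLOBAL_HEADER.unpack_from(trace, 0)[0]
    if magic != MAGIC_LE_MICROS:
        raise ValueError("this sketch handles little-endian microsecond pcap only")
    offsets = []
    pos = GLOBAL_HEADER.size
    while pos + RECORD_HEADER.size <= len(trace):
        _, _, incl_len, _ = RECORD_HEADER.unpack_from(trace, pos)
        offsets.append((pos, incl_len))
        pos += RECORD_HEADER.size + incl_len
    return offsets
```

Since there is no synchronization marker between records, a worker assigned the middle block of a large trace has no reliable way to locate the next record header without such a sequential pass or an external index.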
A number of previous works propose new approaches to overcome these limitations: (i) loading the measurements in original format to the distributed system, while extending the frameworks to handle the classic formats; (ii) proposing new storage formats and tools for handling the big network data; (iii) transforming the network data and importing it into conventional big data storage systems (discussed in Sect. IV-D).
Lee and Lee [lee_toward_2013] have developed a Hadoop API called PcapInputFormat that can manage IP packets and NetFlow records in their native formats. The API allows NTMA applications based on Hadoop to seamlessly process pcap or NetFlow traces. It thus allows traces to be processed without any previous transformation while avoiding performance penalties of reading files sequentially in the framework. A similar direction is followed by Ibrahim et al. [ibrahim_study_2015], who develop traffic analysis algorithms based on MapReduce and a new input API to explore pcap files on Hadoop.
In [nagele_large-scale_2011], Nagele addresses the analysis of pcap files in a fast and scalable way by implementing a Java-based hadoop-pcap library. The project includes a Serializer/Deserializer that allows Hive to query pcaps directly. Authors of [tazaki_matatabi_2014] use the same Serializer/Deserializer to build a Hadoop-based platform for network security that relies on sFlow, NetFlow, DNS measurements and spam email captures.
Noting a similar gap for processing BGP measurements, authors of [orsini_bgpstream_2016] introduce BGPStream. BGPStream provides tools and libraries that connect live BGP data sources to APIs and applications. BGPStream can be used, for instance, for efficiently ingesting on-line data into stream processing solutions, such as Spark Streaming.
All these APIs are however specific, e.g., to HDFS, pcap or NetFlow. Thus, similar work has to be performed for each considered measurement format or analytics framework.
Some authors have taken a distinct approach, proposing new storage formats and solutions more friendly to the big network measurements. Authors of [bar_large-scale_2014] propose DBStream to calculate continuous and rolling analytics from data streams. Other authors have proposed to extend pcap and flow solutions both to achieve higher storage efficiency (e.g., compressing traces) and to include mechanisms for indexing and retrieving packets efficiently [fusco_net-fli_2010, fusco_pcapindex_2012]. These solutions are all built upon key ideas of big data frameworks, as well as key-value or column-oriented databases. They are however specialized to solve network monitoring problems. Yet, the systems are generally centralized, thus lacking horizontal scalability.
NTMA algorithms usually operate with feature vectors that describe the instances under study, e.g., network flows, contacted services, etc. We, therefore, consider as pre-processing all steps to convert raw data into feature vectors.
Some papers overcome the lack of storage formats by transforming the data when ingesting it into big data frameworks. Similarly, a set of features must be extracted for performing NTMA, both for packet and flow-based analysis. Next, we review works performing such pre-processing tasks for NTMA.
A popular approach to handle the big network measurements is to perform further transformations after the data leaves the collection point. Either raw pcap or packets pre-processed at the collection point (e.g., sampled or filtered) are passed through a series of transformations before being loaded into big data frameworks.
Authors of [wullink_entrada_2016] use an extra pre-processing step to convert measurements from the original format (i.e., pcap in HDFS) into a query-optimized format (i.e., Parquet). This conversion is shown to bring massive improvements in performance, but only pre-selected fields are loaded into the query-optimized format. Authors of [samak_scalable_2012] propose a Hadoop-based framework for network analysis that first imports data originally in perfSONAR format into Apache Avro. Authors of [lee_internet_2010] process flow logs in MapReduce, but transform the original binary files into text files before loading them into HDFS.
Marchal et al. [marchal_big_2014] propose “a big data architecture for large-scale security monitoring” that uses DNS measurements, NetFlow records and data collected at honeypots. Authors take a hybrid approach: they load some measurements into Cassandra while also deploying Hadoop APIs to read binary measurement formats directly (e.g., pcaps). The performance of the architecture is tested with diverse big data frameworks, such as Hadoop and Spark. Similarly, Spark Streaming is used to monitor large volumes of IPFIX data in [cermak_real-time_2016], with the IPFIX collector passing data directly to Spark in JSON format.
In [li_supervised_2013], sFlow records of a large campus network are collected and analyzed on a Hadoop system in order to classify host behaviors based on machine learning algorithms. sFlow data are collected, fields of interest are extracted and then ingested into Cassandra using Apache Flume. Sarlis et al. propose a system for network analytics based on sFlow and NetFlow (over Hadoop or Spark) that achieves 70% speedup compared to basic analytics implementations with Hive or Shark [sarlis_datix_2015]. Measurements are first transformed into optimized partitions that are loaded into HDFS and HBase together with indexes that help to speed up the queries.
An IPFIX-based lightweight methodology for traffic classification is developed in [murgia_lightweight_2016]. It uses unsupervised learning for word embedding on Apache Spark, receiving as input “decorated flow summaries”, which are textual flow summaries augmented with information from DNS and DHCP logs. Finally, specifically for big data scenarios, Cisco has published a comprehensive guide for network security using NetFlow and IPFIX [santos_network_2015]. Cisco presents OpenSOC, an integrated solution to protect against intrusions, zero-day attacks and known threats in big data frameworks. OpenSOC includes many functionalities to parse measurements and load them into big data frameworks using, for example, Apache Flume and Kafka.
All these works reinforce the challenge posed by the lack of NTMA formats friendly to big data frameworks. On the one hand, transformations boost analysis performance, and performance seems to be the main focus of the community so far. On the other hand, information may be lost in transformations. Eventually, data replicated in many formats increases redundancy and makes integration with other systems harder.
IV-D2 Feature engineering
Several libraries exist to perform feature extraction and selection in big data frameworks. While many of them are domain-specific, generic examples are found in Spark ML and Spark MLlib (https://spark.apache.org/mllib/). In contrast, research on feature extraction and selection in NTMA is generally scarce. Here the main challenge seems to be the lack of standard or widely accepted features for each type of NTMA application. At a minimum, the community lacks ways to compare and benchmark systems and features systematically.
Most works refer to old datasets, e.g., [chen_survey_2006, li_building_2009, amiri_mutual_2011]. The work in [iglesias_analysis_2015] studies traffic features in classic datasets for attack/virus detection and DM/ML testing (e.g., the DARPA dataset [mit_lincoln_laboratories_darpa_1999]). Such datasets have been criticized and their use discouraged [mchugh_testing_2000]. The authors consider whether the features are suitable for anomaly detection, showing that the features are highly correlated and thus mostly unnecessary.
Abt et al. [abt_performance_2013] study the selection of NetFlow features to detect botnet C&C communication, achieving accuracy and recall rates above 92%. In [kim_combined_2004], NetFlow features undergo feature selection for the case of DDoS detection. In [valenti_identifying_2011], IPFIX records are used in feature selection processes, obtaining a set of key features for the classification of P2P traffic. All these papers handle relatively small datasets. Few authors rely on large datasets [arzani_taking_2016], and those who do propose features that are finely tailored to the specific problem at hand. In a nutshell, each work proposes custom feature engineering, with no holistic solution yet.
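The observation that highly correlated features are mostly redundant suggests a simple filter, sketched below: a feature is kept only if its absolute Pearson correlation with every feature already kept stays below a threshold. This greedy procedure, the 0.95 threshold and the feature names in the usage note are assumptions for illustration, not the method of any of the cited works.

```python
import math
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def drop_correlated(features: Dict[str, List[float]], threshold: float = 0.95) -> List[str]:
    """Greedily keep features that are not strongly correlated with kept ones.

    Features are visited in insertion order, so the first feature of a
    correlated group is the one that survives.
    """
    kept: List[str] = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) < threshold for k in kept):
            kept.append(name)
    return kept
```

For instance, with per-flow `packets`, `bytes` and `duration` columns where bytes grow proportionally to packets, the filter would retain `packets` and `duration` and discard `bytes`.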
V Big data analytics for NTMA
The next step of the NTMA path is data analysis. We recall that we do not intend to provide an exhaustive survey on general data analytics for NTMA. In particular, here we focus on how the main analytics methods are applied to big NTMA datasets. Detailed surveys of other NTMA scenarios are available, e.g., in [bhuyan_network_2014, garcia-teodoro_anomaly-based_2009, liao_intrusion_2013, nguyen_survey_2008].
As for data management, generic frameworks exist and could be used for NTMA. We next briefly summarize them. Afterwards, we dig into the NTMA literature.
V-A Generic big data frameworks
Several taxonomies have been proposed to describe analysis algorithms. Algorithms are roughly classified as (i) statistical or (ii) machine learning. Machine learning algorithms are further categorized as supervised, unsupervised and semi-supervised, depending on the availability of ground truth and how data is used for training. Novel approaches are also often cited, such as deep neural networks and reinforcement learning.
Many challenges emerge when applying such algorithms to big data, due to the large volumes, high dimensionality, etc. Some machine learning algorithms simply do not scale linearly with the input size, requiring lots of resources for processing big data sets. These problems are usually tackled by (i) further pre-processing the input to reduce its complexity; (ii) parallelizing algorithms, sometimes replacing exact solutions with more efficient approximate alternatives.
Several approaches have been proposed to parallelize statistical algorithms [hastie_elements_2009] or to scale machine learning algorithms [raykar_fast_2007, sun_sparse_2010, jiang_cross-domain_2008]. Parallel versions of some statistical algorithms are presented in [bennett_numerically_2009]. Pébay et al. [pebay_design_2011] provide a survey of parallel statistics. Parallel neural networks are described in [mikolov_strategies_2011, byungik_ahn_neuron_2012, yuan_privacy_2014]. Parallel training for deep learning is covered in [bengio_learning_2009, le_building_2013, ciresan_multi-column_2012].
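As a concrete example of how a statistical computation can be parallelized, the sketch below computes mean and variance from partial summaries that are produced independently on each data partition and then merged. It uses the well-known pairwise update formulas for the count, the mean and the sum of squared deviations; we present it as a generic illustration, not as the exact procedure of the works cited above.

```python
from typing import Iterable, Tuple

# (count, mean, M2) where M2 is the sum of squared deviations from the mean.
Summary = Tuple[int, float, float]


def summarize(values: Iterable[float]) -> Summary:
    """Single-pass summary of one partition (Welford-style update)."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return n, mean, m2


def merge(a: Summary, b: Summary) -> Summary:
    """Combine two partition summaries without revisiting the raw data."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    if n == 0:
        return 0, 0.0, 0.0
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return n, mean, m2


def sample_variance(summary: Summary) -> float:
    """Unbiased sample variance; 0.0 when fewer than two values were seen."""
    n, _, m2 = summary
    return m2 / (n - 1) if n > 1 else 0.0
```

Because `merge` is associative, partition summaries can be combined in any order, which maps naturally onto a reduce step in MapReduce-like frameworks.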
Frameworks do exist to perform such analyses on big data. Fig. 5 reproduces a taxonomy of machine learning tools able to handle large data sets [bifet_big_2014]. Both non-distributed (e.g., Weka and R) and distributed (e.g., TensorFlow and MLlib) alternatives are popular. Additional challenges occur with streaming data, since algorithms must cope with strict time constraints. We see in the figure that tools targeting these scenarios are also available (e.g., StormMOA).
In terms of NTMA analytics, generic frameworks implementing algorithms that scale to big data could naturally be employed too. Next, we explore the literature to understand whether big data approaches and frameworks are actually employed in NTMA.
V-B Literature categorization
In our examination, we focus on understanding the depth of the application of big data techniques in NTMA. In particular Tab. IV evaluates each work under the following perspectives:
We check whether works face large data volumes. We arbitrarily define thresholds to consider a dataset to be big data: All works handling data larger than tens of GBs or works handling backbone network use cases. Similarly, we consider big data volumes when the study covers periods of months or years if dataset size is not specified.
We verify if popular frameworks are used, i.e., Spark, Hadoop, MapReduce; we accept also custom implementations that exploit the same paradigms.
We check if machine learning is used for NTMA.
We verify if big data platforms are used in the ML process (see Fig. 5).
We check velocity, i.e., if works leverage online analysis.
We address variety, i.e., if authors use more than one data source in the analysis.
We leave out two of the 5 “V’s”, i.e., veracity and value, since it is cumbersome to evaluate them. For example, while some works discuss aspects of veracity (e.g., highlighting false positives of trained classifiers), rarely the veracity of the big data used as input in the analysis is evaluated.
Tab. IV shows that big data techniques are starting to permeate NTMA applications. Network security attracts more big data research. In general, it is interesting to notice the adoption of machine learning techniques. However, observe the limited adoption of big data platforms for machine learning.
Next, we dig into salient conclusions of this survey.
|Category||Volume?||Big data framework?||ML based?||Big data ML?||Velocity - Online analysis?||Variety?|
|Traffic Prediction||[fiadino_call_2017] [shadi_hierarchical_2017] [gonzalez_net2vec_2017][wassermann_netperftrace_2017] [tian_tadoop_2015]||[tian_tadoop_2015]||[fiadino_call_2017] [shadi_hierarchical_2017] [gonzalez_net2vec_2017][wassermann_netperftrace_2017] [tian_tadoop_2015]||[gonzalez_net2vec_2017] [tian_tadoop_2015]||[fiadino_call_2017]|
|Traffic Classification||[apiletti_selina_2016][garcia_efficient_2018][li_deep_2018][trevisan_awesome_2018][vassio_users_2017]||[apiletti_selina_2016][li_deep_2018][casas_gml_2017] [trevisan_awesome_2018] [vassio_users_2017]||[apiletti_selina_2016] [fiadino_grasping_2016][garcia_efficient_2018][zhao_data_2018][li_deep_2018] [schiff_netslicer_2018] [casas_gml_2017][trevisan_awesome_2018]||[apiletti_selina_2016][trevisan_awesome_2018]||[apiletti_selina_2016][garcia_efficient_2018] [fiadino_grasping_2016]|
|Fault Management||[otomo_finding_2018] [arzani_taking_2016] [harper_cookbook_2018] [vaarandi_unsupervised_2018] [fontugne_hashdoop_2014] [kasai_network_2016][putina_telemetry-based_2018] [clemm_dna_2015][chandramouli_model-driven_2017]||[fontugne_hashdoop_2014][clemm_dna_2015][clemm_dna_2015]||[arzani_taking_2016] [harper_cookbook_2018] [vaarandi_unsupervised_2018] [fontugne_hashdoop_2014][kasai_network_2016] [putina_telemetry-based_2018][kobayashi_mining_2018] [vassio_users_2017]||[fontugne_hashdoop_2014][kobayashi_mining_2018]||[fontugne_hashdoop_2014][kasai_network_2016][clemm_dna_2015]||[kasai_network_2016] [clemm_dna_2015]|
|Network Security||[benzidane_toward_2016] [rathore_hadoop_2016] [spina_snooping_2015] [huang_new_2016] [mulinka_stream-based_2018][lee_detecting_2011] [uwagbole_applied_2017] [hameed_efficacy_2016]||[benzidane_toward_2016] [rathore_hadoop_2016] [shibahara_malicious_2017] [spina_snooping_2015] [li_world_2016] [cogranne_detecting_2018] [vanerio_ensemble-learning_2017] [lee_detecting_2011] [hameed_efficacy_2016]||[rathore_hadoop_2016] [shibahara_malicious_2017] [spina_snooping_2015] [li_world_2016] [cogranne_detecting_2018] [frishman_cluster-based_2017] [vanerio_ensemble-learning_2017] [mulinka_stream-based_2018] [uwagbole_applied_2017]||[cogranne_detecting_2018] [li_world_2016][tian_tadoop_2015]||[benzidane_toward_2016] [mulinka_stream-based_2018] [rathore_hadoop_2016] [tian_tadoop_2015] [hameed_efficacy_2016]|
V-C Single source, early reduction, sequential analysis
From Tab. IV, we can see that most of the works deal with big volumes of data. This outcome is predictable since network traffic is one of the leading sources of big data. From the last column, variety, we can notice that researchers infrequently use different sources together, limiting the analysis to specific use cases.
As we have seen before in Sect. IV, a group of works applies big data techniques in the first phases of the KDD process. This approach allows the analytics phase to be performed using non-distributed frameworks (Fig. 5).
For example, in [rathore_hadoop_2016] authors use Hadoop for real-time intrusion detection, but only computing feature values with MapReduce. Classic machine learning algorithms are used afterward on the reduced data. Similarly, Vassio et al. [vassio_users_2017] use big data approaches to reduce the data dimension, while the classification is done in a centralized manner, with traditional machine learning frameworks. Shibahara et al. [shibahara_malicious_2017] deploy a system to classify malicious URLs through neural networks, analyzing IP address hierarchies. Only the feature extraction is performed using the MapReduce paradigm.
In summary, we observe a majority of works adopting a single (big data) source, performing early data reduction with the approaches described in Sect. IV (i.e., pre-processing data), and then performing machine learning analysis with traditional non-distributed platforms.
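The pattern summarized above, i.e., early reduction with MapReduce followed by centralized learning, can be sketched in a few lines. The example below emulates the map, shuffle and reduce steps in a single process to compute per-host feature vectors; the flow record layout and the three features are hypothetical choices made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical flow record: (source IP, destination port, bytes).
Flow = Tuple[str, int, int]
Partial = Tuple[int, int, frozenset]


def map_flow(flow: Flow) -> Iterator[Tuple[str, Partial]]:
    """Map step: emit one partial feature tuple keyed by source host."""
    src, dst_port, nbytes = flow
    yield src, (1, nbytes, frozenset([dst_port]))


def reduce_host(partials: Iterable[Partial]) -> List[int]:
    """Reduce step: combine partials into [flows, bytes, distinct ports]."""
    flows, total_bytes, ports = 0, 0, frozenset()
    for count, nbytes, seen in partials:
        flows += count
        total_bytes += nbytes
        ports |= seen
    return [flows, total_bytes, len(ports)]


def extract_features(flows: Iterable[Flow]) -> Dict[str, List[int]]:
    """Run map, group by key (the shuffle), then reduce, in one process."""
    groups: Dict[str, List[Partial]] = defaultdict(list)
    for flow in flows:
        for key, value in map_flow(flow):
            groups[key].append(value)
    return {host: reduce_host(values) for host, values in groups.items()}
```

The output holds one short vector per host, typically orders of magnitude smaller than the raw flows, which is what makes a subsequent non-distributed machine learning step feasible.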
V-D Big data platforms enabling big NTMA
A small group of papers performs the analytics process with big data approaches. Here different directions are taken. In Hashdoop [fontugne_hashdoop_2014], authors split the traffic into buckets according to a hash function applied to traffic features. Anomaly detection methods are then applied to each bucket, directly implemented as Map functions. Authors of [li_world_2016] present a distributed semi-supervised clustering technique on top of MapReduce, using a local spectral subspace approach to analyze YouTube user comment-based graphs. Authors of [lee_detecting_2011] perform both pre-processing of network data using MapReduce (e.g., to filter out non-HTTP GET packets from HTTP traffic logs) as well as simple analytics to summarize the network activity per client-server pairs. Lastly, the Tsinghua University Campus has tested in its network an entropy-based anomaly detection system that uses IPFIX flows and runs over Hadoop [tian_tadoop_2015].
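A minimal sketch of the bucketing idea follows: flows are assigned to buckets by hashing a traffic feature, and a detector is then applied to each bucket independently, which is what allows the work to be expressed as Map functions. The choice of the source IP as hash key, of CRC32 as hash function and of destination-port entropy as per-bucket statistic are assumptions made here for illustration; they are not taken from Hashdoop or from the entropy-based system cited above.

```python
import math
import zlib
from collections import Counter, defaultdict
from typing import Dict, Hashable, Iterable, List, Tuple

# Hypothetical flow record: (source IP, destination port).
Flow = Tuple[str, int]


def bucket_of(key: str, n_buckets: int) -> int:
    """Deterministic bucket index (CRC32 is an arbitrary choice)."""
    return zlib.crc32(key.encode("utf-8")) % n_buckets


def split_into_buckets(flows: Iterable[Flow], n_buckets: int) -> Dict[int, List[Flow]]:
    """Assign each flow to a bucket by hashing its source IP."""
    buckets: Dict[int, List[Flow]] = defaultdict(list)
    for src, dst_port in flows:
        buckets[bucket_of(src, n_buckets)].append((src, dst_port))
    return buckets


def entropy(values: Iterable[Hashable]) -> float:
    """Shannon entropy, in bits, of the empirical distribution of values."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def port_entropy_per_bucket(flows: Iterable[Flow], n_buckets: int) -> Dict[int, float]:
    """Per-bucket statistic that a detector could compare against a baseline."""
    return {
        bucket: entropy(port for _, port in members)
        for bucket, members in split_into_buckets(flows, n_buckets).items()
    }
```

Since all flows sharing a key land in the same bucket, each bucket can be processed by a separate worker without exchanging state with the others.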
In traffic classification, Trevisan et al. [trevisan_awesome_2018] developed AWESoME, an SDN application that allows prioritizing traffic of critical Web services, while segregating others, even if they are running on the same cloud or served by the same CDN. To classify flows, the training, performed on large datasets, is implemented in Spark.
Considering other applications, some works consider the analysis of large amounts of non-traffic data with big data approaches. Authors in [spina_snooping_2015] use MapReduce as a basis for a distributed crawler, which is applied to analyze over 300 million pages from Wikipedia to identify reliable editors, and subsequently detect editors that are likely vandals. Comarela et al. [comarela_studying_2013] focus on routing and implement a MapReduce algorithm to study multi-hop routing table distances. This function, applied over a TB of data, produces a measure of the variation of paths in different timescales.
In a nutshell, here big data platforms are enablers to scale the analysis on large datasets.
V-E The rare cases of online analysis
Only a few works focus on online analysis and even fewer leverage big data techniques. Besides the previously cited [tian_tadoop_2015, rathore_hadoop_2016], Apiletti et al. in [apiletti_selina_2016] developed SeLINA, a network analyzer that offers human-readable models of traffic data, combining unsupervised and supervised approaches for traffic inspection. A specific framework for distributed network analytics that operates using Netflow and IPFIX flows is presented in [clemm_dna_2015]. Here, SDN controllers are used for the processing to improve scalability and analytics quality by dynamically adjusting traffic record generation.
In sum, online analysis in NTMA is mostly restricted to the techniques to perform high-speed traffic capture and processing, described in Sect. IV. When it comes to big data analysis, NTMA researchers have mostly focused on batch analysis, thus not facing challenges of running algorithms on big data streaming.
The usage of ML seems to be widespread, especially for network security and anomaly detection. However, just some works use machine learning coupled with big data platforms. A general challenge when considering machine learning for big data analytics is indeed parallelization, which is not always easy to reach. Not all machine learning algorithms can be directly ported into a distributed framework, basically due to their inherent centralized designs. This hinders a wider adoption of big data platforms for the analytics stage, constraining works to perform data reduction at pre-processing stages.
In sum, in terms of complexity, most ML algorithms scale poorly with large datasets and thus cannot be applied directly to the often humongous scale of NTMA data.
VI Challenges, open issues and ongoing work
We discuss some open issues and future directions that we have identified in the literature review.
1 — Lack of a standard and context-generic big NTMA platform: The data collection phase poses the major challenges for NTMA. The data transmission rate in computer networks keeps increasing, challenging the way probes collect and manage information. This is critical for probes that have to capture and process information on-the-fly and transmit results to centralized repositories. Flow-based approaches scale better, at the cost of losing details.
To solve the lack of flexible storage formats we have seen in Sect. IV-D that researchers have developed APIs or layers that transform the data in pre-defined shapes. Those APIs are not generic and not comprehensive. As an example, in NTMA one would like to associate to a given IP address both its geographical (e.g., country) and logical (e.g., Autonomous System Number) location. There is no standard library that supports even these basic operations in the frameworks.
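To make the IP annotation example concrete, the sketch below maps an address to a country and an Autonomous System Number through a longest-prefix match over a small table. The table content is entirely made up (documentation prefixes, reserved AS numbers and user-assigned country codes); in practice it would be loaded from routing or geolocation data, and the linear scan would be replaced by a more efficient structure such as a prefix tree.

```python
import ipaddress
from typing import List, Optional, Tuple

# Hypothetical prefix table: (prefix, country code, AS number).
# Prefixes, codes and AS numbers are placeholders, not real mappings.
PREFIX_TABLE: List[Tuple[ipaddress.IPv4Network, str, int]] = [
    (ipaddress.ip_network("192.0.2.0/24"), "ZZ", 64496),
    (ipaddress.ip_network("192.0.2.128/25"), "AA", 64497),
    (ipaddress.ip_network("198.51.100.0/24"), "ZZ", 64498),
]


def locate(address: str) -> Optional[Tuple[str, int]]:
    """Return (country, ASN) of the most specific matching prefix, if any."""
    ip = ipaddress.ip_address(address)
    best = None
    for network, country, asn in PREFIX_TABLE:
        if ip in network and (best is None or network.prefixlen > best[0].prefixlen):
            best = (network, country, asn)
    return None if best is None else (best[1], best[2])
```

A function of this kind is what one would typically wrap as a user-defined function to enrich flow records inside a big data framework.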
Considering analytics, few researchers tackled the problem from a big data perspective. There is a lack of generic approaches to access the data features, with the ability to run multiple analytics in a scalable way. Thus, researchers usually rely on single data sources and sequential/centralized algorithms, that are applied to reduced data (see Sect. V-C).
In a nutshell, the community has yet to arrive at a generic platform for the big NTMA problem, and most solutions appear to be customized to solve specific problems.
2 — Lack of distributed machine learning and data mining algorithms in big data platforms limits NTMA: Several researchers have started adopting machine learning solutions to tackle NTMA problems. However, as analyzed in Sect. V, most recent papers focus on “small data”, with few of them addressing the scalability problem of typical big data cases. Most papers use big data techniques just in the first steps of the work, for data processing and feature extraction. Most of the machine learning analysis is then executed in a centralized fashion. This design represents a missed opportunity. For example, applying machine learning to large datasets could produce more accurate models for NTMA applications.
From a scientific point of view, it is interesting to conjecture the causes of this gap: The reasons may be several, from the lack of expertise to platform limitations. We observe that the availability of machine learning algorithms in big data platforms is still at an early stage. Despite the availability of solutions like the Spark MLlib and ML tools that have started to provide some big data-tailored machine learning, not all algorithms are ported. Some of these algorithms are also simply hard to parallelize. Parallelization of traditional algorithms is a general problem that has to be faced for big data in general, and for big NTMA in particular.
3 — Areas where big data NTMA is still missing: From Tab. IV, it is easy to notice the lack of proposals in some important categories. For example, even though fault management is a category in which a great amount of data must usually be handled, few papers have addressed this problem with big data approaches. The reasons may be linked to what we discussed earlier, i.e., the lack of generic and standard NTMA platforms. Similarly, as examined in Sect. III, some categories of NTMA applications (e.g., QoS/QoE management) are hardly ever addressed with big data approaches.
4 — Lack of relevant and/or public datasets limits reproducibility: Within the scope of our survey, only two contributions disclose a public dataset, namely [putina_telemetry-based_2018] and [trevisan_awesome_2018]. Few works use open data, like the well-known MAWI dataset, which is used for example in [vanerio_ensemble-learning_2017, fontugne_hashdoop_2014], the (outdated) KDD CUP ’99 [frishman_cluster-based_2017, rathore_hadoop_2016], and Kyoto2006 [huang_new_2016]. Apart from these cases, public datasets are scarce and often not updated, limiting both the reproducibility of research and the benchmarking of new, possibly more scalable, solutions.
5 — Ongoing projects on big NTMA: We have seen a solid increase in the adoption of big data approaches in NTMA. Yet, we observe a fragmented picture, with some limitations especially regarding interoperability and standardization. In fact, ad-hoc methodologies are proliferating, with no platform to support the community.
In this direction, Apache Spot was a promising platform (see Sect. IV-B). Unfortunately, its development has stopped, which calls into question its practical adoption by the community and practitioners. PNDA is instead actively developed, and the project is starting to attract interest from the community, albeit at an early stage. Beam (https://beam.apache.org) is a framework offering the unification of batch and streaming models, increasing portability and easing the work of programmers, who do not need to write two code bases; yet, no applications for NTMA exist.
In sum, a lot of work remains to be done to arrive at a practical big data solution for NTMA applications. The NTMA community should start creating synergies and consolidating solutions while relying on the consolidated platforms offered by the big data community.
Alessandro D’Alconzo received the M.Sc. degree in Electronic Engineering with honors in 2003, and the Ph.D. in Information and Telecommunication Engineering in 2007, from Polytechnic of Bari, Italy. Since March 2018, he has been head of the Data Science office at the Digital Enterprise Division of Siemens Austria. Between 2016 and 2018, he was Scientist at the Center for Digital Safety & Security of AIT, Austrian Institute of Technology. From 2007 to 2015, he was Senior Researcher in the Communication Networks Area of the Telecommunications Research Center Vienna (FTW). His research interests embrace Big Data processing systems, network measurements and traffic monitoring, ranging from the design and implementation of statistical-based anomaly detection algorithms, to Quality of Experience evaluation, and the application of secure multiparty computation techniques to cross-domain network monitoring and troubleshooting.
Idilio Drago is an Assistant Professor (RTDa) at the Politecnico di Torino, Italy, in the Department of Electronics and Telecommunications. His research interests include Internet measurements, Big Data analysis, and network security. Drago has a PhD in computer science from the University of Twente. He was awarded an Applied Networking Research Prize in 2013 by the IETF/IRTF for his work on cloud storage traffic analysis.
Andrea Morichetta (S’17) received the M.Sc. degree in Computer Engineering in 2015, from Politecnico di Torino. He joined the Telecommunication Networks Group in 2016 as a PhD student under the supervision of Prof. Marco Mellia, funded by the BIG-DAMA project. In summer 2017 he did an internship at Cisco in San Jose, CA. In 2019 he spent six months at the Digital Insight Lab of the AIT Austrian Institute of Technology as a visiting researcher. His research interests are in the fields of traffic analysis, security and data analysis.
Marco Mellia (M’97-SM’08) graduated from the Politecnico di Torino with a Ph.D. in Electronics and Telecommunications Engineering in 2001, where he held a position as Full Professor. In 2002 he visited the Sprint Advanced Technology Laboratories, CA. In 2011, 2012, and 2013 he collaborated with Narus Inc, CA, working on traffic monitoring and cyber-security system design. His research interests are in traffic monitoring and analysis, and in applications of Big Data and machine learning techniques for traffic analysis, with applications to Cybersecurity and network monitoring. He has co-authored over 250 papers and holds 9 patents. He was awarded the IRTF Applied Networking Research Prize in 2013, and several best paper awards. He is Area Editor of ACM CCR, and part of the Editorial Board of IEEE/ACM Transactions on Networking.
Pedro Casas is Scientist in AI/ML for Networking at the Digital Insight Lab of the Austrian Institute of Technology in Vienna. He received an Electrical Engineering degree from Universidad de la República, Uruguay in 2005, and a Ph.D. degree in Computer Science from Institut Mines-Télécom, Télécom Bretagne in 2010. He was Postdoctoral Researcher at the LAAS-CNRS in Toulouse from 2010 to 2011, and Senior Researcher at the Telecommunications Research Center Vienna (FTW) from 2011 to 2015. His work focuses on machine-learning and data mining based approaches for Networking, big data analytics and platforms, Internet network measurements, network security and anomaly detection, as well as QoE modeling, assessment and monitoring. He has published more than 150 Networking research papers in major international conferences and journals, and received 13 awards for his work, including 7 best paper awards. He is general chair for different conferences, including the IEEE ComSoc ITC Special Interest Group on Network Measurements and Analytics.