algorithms are now being implemented and deployed on a large scale across countless application domains, including health-care, transportation, speech analysis, computer vision, market analysis, life sciences, and many others sejnowski2018deep.
Recently, ML applications have been moving to the cloud, in order to exploit high-performance parallel and distributed computing, which has given rise to the concept of Machine Learning as a Service (MLaaS) ribeiro2015mlaas. This usually refers to the availability of online platforms and frameworks that make it possible to implement in the cloud all the customary stages of a ML pipeline. Such stages include, for instance, input pre-processing, feature extraction, and in particular the training phase, which is usually the most expensive from a computational and memory consumption viewpoint yao2017complexity.
In this paper, we turn our attention to a particular aspect of MLaaS, that of deploying and parallelizing systems that have already been trained, and need to be made available to (possibly many) end-users. Although not as computationally expensive as the training phase, prediction and forecasting tasks may nonetheless be burdensome in terms of resources, especially when results must be delivered within a tight deadline, or when the service has to be made available to a wide public. In this setting, an infrastructure for MLaaS should support large-scale distributed batch processing, as well as run-time stream processing. Besides, scalability and fault tolerance are fundamental requirements. In this regard, since its first formulation in 2004, the MapReduce DBLP:conf/osdi/DeanG04 distributed programming model has gained widespread adoption in the big data research community. This success is mainly due to its simple yet powerful and intrinsically parallelizable paradigm. Several distributed computing engines have been proposed since then, to support the development of distributed programs for batch processing of very large data collections, providing autonomous fault-tolerance mechanisms and run-time infrastructure scaling capabilities. The latest evolution of these frameworks storm; flink; spark offers stream processing support, alongside the more traditional batch processing, and leverages different programming paradigms in addition to MapReduce. Altogether, this greatly simplifies the implementation of efficient distributed applications for the analysis of big data flows.
Our aim with this paper is to show how a MapReduce-inspired programming paradigm can be used to improve the performance and scalability of the ML pipeline, by parallelizing the customary steps of the prediction process and supporting the development of ready-to-use pre-trained services for the end user. These are our main contributions:
A structural characterization of systems addressing the prediction task of different ML applications, from a MapReduce-inspired parallelization viewpoint.
A detailed discussion of a concrete application of the outlined structure in the natural language processing domain, specifically in the area of argumentation mining, together with an empirical evaluation of the performance and scalability of the approach.
The paper is structured as follows. Section 2 discusses related work, introducing the concept of MLaaS and its interpretations in various domains of computer science. Section 3 describes scenarios where the proposed parallel architecture could play a role, as well as the challenging issues that must be addressed. Section 4 focuses on the proposal of MLaaS for the end user by presenting a case study in the area of argumentation mining. Section 5 illustrates the parallel architecture. Section 6 presents its empirical evaluation. Section 7 draws conclusions.
2 Related Work
MLaaS is a term found in various areas of computer science, where it is used to refer to different concepts. A significant body of literature focuses on the description and analysis of platforms that implement a whole ML pipeline. That may include, for instance, the capability to perform data pre-processing and feature selection, to choose the best-performing classifier, to train a model, and to predict the outcomes on query data. For example, Yao et al. yao2017complexity compare the performance and complexity of several solutions for building MLaaS applications implementing the entire ML pipeline. Similarly, Li et al. li2017scaling describe the distributed architecture exploited in ML applications at Uber, with a focus on model training and feature selection. A complete architecture for MLaaS is also described by Ribeiro et al. ribeiro2015mlaas, who present a specific analysis of three ML methods (among them nearest neighbors). In each of these works, however, the parallelization of the final prediction stage is only superficially addressed, or ignored altogether, whereas we believe that a ML tool provided as-a-service to a wide public of end-users cannot disregard the parallelization of this last step (although it is generally less computationally expensive than the previous ones). In particular, we argue that the prediction stages of different ML applications have common characteristics, which make the adoption of a MapReduce-oriented approach DBLP:journals/cacm/DeanG08 particularly suitable for the purpose. The need for a ML tool offered as a service through the adoption of MapReduce is also envisaged by other scholars. For instance, Chan et al. DBLP:conf/cikm/ChanSSC13 and Baldominos et al. DBLP:conf/cibd/BaldominosASI14 do so while focusing on the features that such a tool should expose, rather than on techniques to obtain scalability. The application of MapReduce to the ML pipeline comes with the great advantages brought by distributed computing architectures, which allow developers to focus on the implementation of their parallel program while disregarding lower-level architectural and infrastructural details, such as the coordination between nodes, the employment of heterogeneous hardware in the same data center DBLP:conf/osdi/ZahariaKJKS08; DBLP:journals/tpds/ChengRGJZ17 or even the use of multiple cloud data centers DBLP:journals/ijcc/AntoniuCBDFGPSTBTBBCKNS13; DBLP:conf/ucc/Clemente-Castello15; DBLP:conf/ucc/LoretiC15; DBLP:conf/hpcc/LoretiC15.
Another line of research that uses the MLaaS term focuses instead on the parallelization of training algorithms for ML systems. For example, a multicore implementation for the training of many ML systems, which exploits the MapReduce paradigm, is presented by Chu et al. chu2007map. Sergeev et al. DBLP:journals/corr/abs-1802-05799
present an interesting framework to enable faster and easier distributed training in TensorFlow tensor. Tamano et al. tamano2011optimizing illustrate an approach to job scheduling optimization for MapReduce tasks in ML applications. The challenges and opportunities of exploiting MLaaS in the context of the Internet of Things are discussed by Assem et al. assem2016machine, again with a focus on the aspects related to training and classifier selection.
Fewer strands of work are instead dedicated to the performance of parallel algorithms for already trained and deployed ML models. Xu et al. xu2015making propose a software architecture that encompasses model deployment for real-time analytics. Their emphasis is on the processing of big data collections, including RESTful web services for data wrapping and integration, dynamic model training, and up-to-date prediction and forecasting. The focus of other works, such as that by Hanzlik et al. hanzlik2018mlcapsule, is on model deployment, and in particular on problems related to model stealing and reverse engineering. Harnie et al. harnie2017scaling use the Apache Spark technology spark to achieve the desired scalability in chemoinformatics applications.
The problem of parallelizing sophisticated artificial intelligence-based reasoning engines has been studied by Loreti et al. DBLP:conf/wosp/LoretiCCM17; DBLP:journals/fgcs/LoretiCCM18 in the domain of business process compliance monitoring. However, the focus of that study lies in the input data-set partitioning strategies, whereas the present study aims to identify common patterns in various ML tasks, which can be mapped onto a MapReduce-inspired approach.
3 Prediction as a Service
The typical ML pipeline we aim to parallelize and distribute has the following characteristics. A collection of data is given as input, either as a set of batches, or as a continuous stream. As Figure 1
illustrates, a first stage of computation is present, where all the instances of the input dataset need independent processing. Such independence enables the distribution of the computation load across many nodes. This setup is common to many application domains. For instance, a great deal of natural language processing tasks consider independent input elements such as single sentences, paragraphs, or documents. Among such tasks we shall mention sentiment analysis, whose goal is to assign a sentiment to a given segment of text, be it a tweet, a post on a social network, or a comment on a newspaper article. Document categorization aims to classify a piece of text into one or more semantic categories. Fake news detection, as well as many other tasks, aims to detect sentences or text portions with certain characteristics manning1999foundations. The same can be said of many important tasks in computer vision. For example, the goal of image tagging or image classification is to assign a label (or a set of labels) to a given image. In video surveillance steger2018machine, as well as in many other video processing tasks, the computation is often carried out at the level of single frames, or small batches of frames. In other domains, such as bioinformatics, chemoinformatics, or genomics data analysis, several different predictors can be applied to input data, such as protein or DNA sequences, so as to classify different properties of single sequence elements baldi2001bioinformatics.
The output of this first stage of the pipeline is often used as input to a second stage. In general, the first stage can be considered a sort of filtering, or detection phase, whereas the second stage works as a sort of aggregation phase, where the instances detected during the first round are matched and compared with one another. That is the case in some natural language processing setups, where the text segments selected during the first phase are processed by a clustering algorithm, such as topic modeling, or by a further categorization stage. For instance, single sentences could first be classified according to coarse-grained categories, and then further classified into fine-grained categories, as usually happens with fact-checking systems, where candidate fake news items are first selected during an initial processing phase, and then a fact-checking algorithm is applied to the collection of instances found in the first phase. In some consumer-oriented applications Lippi2019nature, such as the detection of unfair contract clauses Lippi2019claudette, a first stage could identify sentences expressing potentially unfair clauses, whereas a second stage could predict whether a given clause is actually unfair, based on the context provided by other relevant sentences. In bioinformatics, the outputs of the predictors developed during the first stage are often aggregated into a higher-level set of predictions, which combine the information coming from the different classifiers harnie2017scaling. In computer vision, video segmentation is typically performed on groups of frames resulting from a first processing stage tekalp2015digital.
The general pipeline described so far recalls the principles of the MapReduce programming paradigm DBLP:journals/cacm/DeanG08. MapReduce is a well-known technique for simplifying the parallel execution of software, whereby the input data is partitioned into an arbitrary number of slices, each exclusively processed by a mapper task emitting intermediate results in the form of key/value pairs. The pairs are then passed on to other tasks, called reducers, which are in charge of merging together the values associated with the same key, and “emitting” the final result. As Figure 1 illustrates, the first phase of the general ML pipeline could indeed be modeled as a mapping task, where each instance undergoes the same processing to emit a key/value pair. The corresponding reduction task is then carried out in the second phase of the pipeline, where the outputs of the first phase are merged and processed together.
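As an illustration of this correspondence, the following minimal Python sketch casts the two phases in map/reduce form; detect and aggregate are hypothetical placeholders for the per-instance predictor of the first phase and for the merging logic of the second.

from collections import defaultdict

def map_phase(instances):
    # Phase 1: each instance is processed independently and emits a
    # (key, value) pair, e.g. (document_id, detected_component).
    for doc_id, instance in instances:
        value = detect(instance)           # hypothetical per-instance predictor
        if value is not None:              # optional filtering of negative instances
            yield doc_id, value

def reduce_phase(pairs):
    # Phase 2: values sharing the same key are merged and processed together,
    # e.g. all the components found in the same document.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: aggregate(values) for key, values in groups.items()}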
Programs reformulated according to the MapReduce model, i.e., in terms of map and reduce functions, can be automatically parallelized and executed on a network of computing nodes. To that end, there exist several MapReduce-oriented platforms for big data analytics Singh2014, which can turn a collection of computing nodes into a distributed infrastructure able to automatically spread (and balance) the execution tasks across the data center. Most of these platforms supply mechanisms to scale up or down the cluster on demand, e.g., to meet a strict deadline when analyzing large files, or take on an increase of the input rate when processing a stream. They also offer automated detection and runtime recovery from faults involving some computing nodes. All these features seem particularly beneficial to a ML tool provided as-a-service, especially in the presence of requirements of efficiency, scalability and fault tolerance, which a service offered to a wide public might have to meet.
3.1 Challenges of ML pipeline parallelization
Independently of the particular ML application, the implementation of the pipeline depicted in Figure 1 on a network of cooperating computing nodes presents three major architectural and technological challenges. First of all, while the first phase enjoys the benefits of a highly parallelizable structure, the aggregation step in the second phase needs careful consideration, to prevent it from turning into a bottleneck. Indeed, combining several records together is an expensive operation in distributed environments, because it causes a massive data shuffle over the network. This is actually a known issue for many MapReduce applications. In the case of MLaaS, it may be crucial to apply, if possible, a filtering operation aimed at reducing the input space for the aggregation function. For instance, one could bring forward some steps that conceptually belong further down the pipeline, in an effort to narrow down the data emitted at the end of the first phase.
A second challenge has to do with the trained model that supervised or semi-supervised ML applications commonly use in order to process the input data and provide predictions. When a distributed engine is employed, each computing node must be provided with a copy of such a model alongside the input data. However, depending on the application and the underlying ML technologies, the size of the trained model can be significant. Nonetheless, if the learning process is not continuous, meaning that the model evolves only during the training phase, the model remains stable throughout the prediction process. Accordingly, it can be distributed to all computing nodes at the beginning of the computation, once and for all, and then employed by each node independently to process/filter its portion of the input.
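As a minimal sketch of this idea in Apache Spark (Python API; load_model and the storage paths are hypothetical and used only for illustration), the trained model can be shipped to the workers once as a broadcast variable and then reused by every prediction task:

from pyspark import SparkContext

sc = SparkContext(appName="mlaas-prediction")

# Hypothetical helper: deserialize the already-trained model from shared storage.
model = load_model("hdfs:///models/trained_model.bin")

# Ship the read-only model to every worker once, instead of with every task.
bcast_model = sc.broadcast(model)

predictions = (sc.textFile("hdfs:///input/*.txt")
                 .map(lambda instance: bcast_model.value.predict(instance)))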
A final technological challenge has to do with the third-party software ML applications often employ at different stages of the pipeline. The integration of third-party methods into the framework of a distributed data processing engine might not be a straightforward operation. Ad-hoc solutions might be required, for instance, to limit the number of calls to external processes executing the third-party software.
3.2 Other aspects of MLaaS on large-scale data processing engines
The implementation of MLaaS on large-scale data processing engines not only poses architectural and technological challenges, but also requires making detailed design choices whose impact on the performance of the offered service may be crucial. For example, although the aggregation step of the second phase works by collecting together the output of the previous processing, performing the subsequent processing/filtering step on a single node is not mandatory, and indeed it is generally worth avoiding. If a certain degree of parallelism is possible (because, for example, the instances emitted by the aggregation step can be considered independently of one another), a good practice is to split the second phase, too, into tasks that can be carried out concurrently. Modern big data analytics engines (such as Apache Spark, which will be discussed in more detail in the following) support the implementation of distributed programs not strictly limited to a map and a reduce phase. More complex variants and combinations of the two are possible. As far as MLaaS is concerned, a further map step following the aggregation could contribute to boosting the performance of the second phase.
Furthermore, when the processing/filtering steps involve a large trained model, the task of loading such a model can be time consuming and should be performed carefully. Consider for example a natural language processing system whose first phase classifies the sentences of a text according to the features of a large trained model. In such a system it is certainly not recommended to load the model for each sentence. A much preferable solution is to load the model once, and then use it to analyze a substantial number of sentences together. Nonetheless, grouping too many sentences together reduces the degree of parallelism of the pipeline step, thus slowing down the computation. Depending on the specific predictive task to be carried out and on the degree of parallelism allowed by the underlying infrastructure, a trade-off must be found between the need to avoid unnecessarily repeating costly operations (such as model loading) and that of splitting the work to speed up the computation.
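A minimal sketch of this trade-off, again assuming Spark's Python API and hypothetical load_model/score helpers: by operating on whole partitions rather than on single sentences, the model is loaded once per group of sentences.

def classify_partition(sentences):
    # The (possibly large) model is loaded once per partition,
    # not once per sentence.
    model = load_model("/models/claim_detector.bin")   # hypothetical loader
    for sentence in sentences:
        yield sentence, model.score(sentence)          # hypothetical scoring call

# Each partition groups many sentences: larger partitions amortize the
# loading cost, smaller ones increase the degree of parallelism.
scores = sentences_rdd.mapPartitions(classify_partition)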
A similar conundrum regards the features extracted from the input data, which might be used in multiple subsequent steps of the pipeline. Feature extraction may be a costly operation, in which case it would be worthwhile performing it only once, and then handing over the features to the following steps until they are no longer needed. However, one must consider that between each pair of steps of the pipeline a data shuffle over the network might occur. Therefore, if feature extraction produces large outputs, the shuffling of such data on a network with limited bandwidth might cause poor performance. In these cases, there is a trade-off to consider between recomputing large features when needed and shuffling them between nodes.
Finally, a common feature of MLaaS applications is the requirement to accommodate a variety of input modes, in particular document batches and data streams. From a developer's standpoint, a shift of perspective is unavoidable when passing from a ML application working on input files that are already materialized and stored in a specific location (i.e., batch processing) to the analysis of a flow of input data (i.e., stream processing). Nonetheless, some relatively recent MapReduce-oriented platforms flink; spark allow this shift at the price of only slight changes in the application implementation. Since the first phase of our reference ML pipeline operates in principle on each input instance independently, it can operate on batches and streams alike. The second phase, instead, is likely to depend on the input mode, because it focuses on the aggregation and processing of the previous step's output. Indeed, for stream processing, aggregation entails the need to specify not only the collections of data to be merged, but also the period over which such an operation must be performed. For example, aggregation could be based on the most recent occurrences in the flow, on a specific time window, or on all the data received so far. Each option has a different meaning from the others and may produce completely different results. In the case of batch processing, instead, the aggregation task has a less varied range of semantics, because there are no concepts like the “time window” or “arrival instant” of an input entry.
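As a small illustration of this difference, the sketch below (Spark Streaming, Python API; an existing SparkContext sc is assumed, and the socket source, the helpers doc_id_of and detect, and all durations are purely illustrative) shows that the first phase is a plain per-instance map, while the second must additionally state the window over which keys are aggregated:

from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, 10)                     # 10 s micro-batches (illustrative)
lines = ssc.socketTextStream("localhost", 9999)    # hypothetical text source

# Phase 1: per-instance processing, identical for batches and streams.
scored = lines.map(lambda s: (doc_id_of(s), detect(s)))

# Phase 2: the aggregation must also specify *when* to aggregate,
# here over a 60 s window evaluated every 20 s.
aggregated = scored.groupByKeyAndWindow(60, 20)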
In an effort to ground the general discussion and illustrate concretely how MLaaS can be enabled by a MapReduce-oriented approach, the section that follows is devoted to a particular case study. The case study is a natural language processing application in the area of argumentation mining DBLP:journals/toit/LippiT16, consisting of an argument component identification task followed by an argument structure prediction task, whose goal is to identify links and relations between argument components.
4 A Case Study
Argumentation mining DBLP:journals/toit/LippiT16 is the task of extracting arguments from unstructured text by automated analysis methods. Such a task is usually, but not necessarily, tied to a specific genre, such as legal texts, persuasive essays, scientific literature, etc. One notable and increasingly popular application area where argumentation mining plays a key role is that of debating technologies mirkin-etal-2018-listening, where the data to be processed may consist of “static” Wikipedia pages as well as streamed audio signals. Owing also to the variety of addressed genres, the argumentation mining literature offers several alternative definitions of argument Peldszus2013, with varying degrees of sophistication. For the aims of this work, and without loss of generality, we shall refer to a common and rather basic claim/premise argument model walton1990reasoning, whereby an argument is composed of an assertion, or statement (the claim), supported by one or more premises, and the inference that connects claim and premises.
Argumentation mining includes several sub-tasks, spanning from claim and evidence detection (i.e., the identification of argument components), to attribution (i.e., detection of the authorship of an argument), to increasingly challenging tasks such as the prediction of the relations between arguments (argument structure prediction), the inference of implicit argument components (enthymemes), and so forth. Figure 2 offers an example of a possible output produced by an argumentation mining system.
One such system is Mining ARGuments frOm Text (MARGOT) DBLP:journals/eswa/0001T16. MARGOT exploits a combination of ML and natural language processing techniques in order to perform argumentation mining on unstructured texts of various genres. In particular, MARGOT exploits a Support Vector Machine (SVM)-based method for context-independent claim detection DBLP:conf/ijcai/LippiT15 using tree kernels, and extends its application to evidence detection and to the identification of component boundaries.
The initial version of MARGOT follows a pipeline of subsequent stages, by first segmenting the text into sentences, then detecting argumentative sentences, that is, sentences that include argument components (claims or evidence), and finally identifying the boundaries of each argument component. The current MARGOT prototype margot, also available as a web server (see http://margot.disi.unibo.it), adopts the “traditional”, sequential architecture. It analyzes each input sentence individually, thereby processing the document as a whole in a sequential manner, in order for each sentence to undergo all the subsequent steps of the pipeline. It does not address argument structure prediction.
In this work, taking MARGOT as a case study, we present a more advanced version of the system, where the claim and evidence detection stage is followed by a further SVM-based step, aimed at detecting the support links between all possible pairs of claims and evidence found in each document.
Following the model introduced in Section 3, we describe MARGOT's pipeline as composed of two subsequent phases. A first phase, shown in Figure 3, processes each sentence of the input file using the Stanford Parser DBLP:conf/acl/ManningSBFBM14, a widely adopted third-party software package. MARGOT uses the Stanford Parser to obtain a first set of features, that is, the trees encoding the grammatical structure of a sentence, known in the literature as constituency parse trees. From the same sentence, MARGOT also extracts the bag-of-words feature vector, which is a binary encoding of its words. The two sets of features are passed on to two different classifiers employing the SubSet Tree Kernel (SSTK), whose aims are to identify claims and evidence, respectively. All and only the sentences containing the identified claims/evidence are then sent to the second phase of the pipeline, which considers all the possible (claim, evidence) pairs in each file and calls another SVM-based classifier to detect the possible links between the two (link detection).
Given the complexity of the analysis to be carried out, especially when large files are mined, the sequential execution of MARGOT's pipeline can be highly time- and resource-consuming. The problem is exacerbated when the number of arguments detected in the first phase is large, since this entails considering, for each input file, the Cartesian product of the large sets of detected claims and premises. Furthermore, as we envisage the future necessity of argumentation mining analysis as a service, it is likely that such a service would be required to consider streams of input text instead of documents already materialized in a certain disk location. The run-time nature of stream processing further amplifies the need for scalable and reliable architectures to support argumentation mining as a service.
Nonetheless, as prescribed by the general model in Figure 1, the parser and the feature extractor of the first phase can process each sentence independently of the others, whereas the Cartesian product in the second phase operates as a pair-wise sentence aggregator. This observation suggests a way to distribute the computational load of MARGOT's pipeline on a network of computing nodes, leveraging a MapReduce-oriented approach and an engine for large-scale batch and stream processing.
5 A Parallel Architecture
Among the existing variety of MapReduce-oriented engines for large-scale data processing, some offer the possibility to analyze batches of documents already materialized in a certain location hadoop; others only deal with the processing of data flows storm; samza; dataflow; whereas a restricted number offer the possibility – particularly desirable for the current work – to operate on both batches and streams flink; spark. For the purposes of this work, and without loss of generality, we will describe how MARGOT can be re-implemented to be automatically executed on a distributed infrastructure using the facilities provided by Apache Spark spark. Apache Spark has become increasingly popular in recent years, also because it allows developers to write their applications in several different languages, without forcing them to think in terms of only map and reduce operators. Its good performance DBLP:conf/bigdataconf/VeigaEPTT16; DBLP:conf/ipps/ChintapalliDEFG16; DBLP:conf/iccS/SamosirIH16 and resilience to faults DBLP:conf/globecom/LopezLD16 have been empirically verified by various studies. We will first consider the case of large batches of input documents, and later refine the algorithm in order to accommodate streams as well.
5.1 MARGOT for batch processing
In case of batch processing, all the documents to be analyzed are already present in a certain (centralized or distributed) disk location, and the data analytics infrastructure is in charge of spreading (and balancing) the computation load on the available nodes.
As Figure 4 illustrates, in the first phase of MARGOT's pipeline the files to be analyzed are split into sentences. A collection of (key, value) pairs is thus produced, where the key is the file name and the value is the sentence found.
Then, the core operations of the first phase are performed on each sentence independently by applying a map function. This operation extracts the parse tree and the bag-of-words feature vector, and passes them as input to two third-party classifiers that emit, for each sentence, a claim and an evidence score, respectively. As these functions require sizable models (the parsing model, the stemmed dictionary and the claim/evidence models) to operate on each sentence, we apply the general suggestion of loading such models once at the beginning of the computation. The objects produced are sent to all the computing nodes by leveraging the concept of broadcast variable offered by Apache Spark, i.e., an immutable shared variable which is cached on each worker node of the Spark cluster. The output of the map function is again a (key, value) pair with the same key (the file name), but with the value (which contained just the sentence in the input) now enriched with the sentence's feature vectors and its claim and evidence scores.
Finally, two filters are applied to select only claims and evidence. Indeed, two different collections of pairs are produced: one containing the elements with a positive claim score, and another those with a positive evidence score.
The details of the first phase implementation on a distributed MapReduce-oriented platform are presented in Listing 1, following an Apache Spark-inspired approach with lambda functions. It is worthwhile underlining that, since the SVM-based classifiers are realized by third-party C software DBLP:conf/ecml/Moschitti06 that cannot be directly converted into a Spark broadcast variable, we attempt to minimize the initial overhead of loading this external software by resorting to a mapPartitions function. Differently from the map and mapValues functions, which are executed for each sentence, mapPartitions operates on a collection of sentences, namely a partition of all the sentences in the input files. The size of this partition is optimized by the underlying infrastructure, based on the number of available cores. As a consequence, the call to the external SVM-based classifiers and the loading of the models is not repeated for each sentence (which would be highly inefficient) but once for each group of sentences into which the infrastructure has partitioned the input. We shall remark that this solution could be applied to other MLaaS applications too, in order to deal with the frequent challenge of reducing the overhead of third-party software invocation.
Since the result of mapPartitions is later accessed by two different filters, cluster-wide caching (see Listing 1) is employed to avoid a duplicated computation of the same mapPartitions stage.
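Since Listing 1 is not reproduced here, the following simplified PySpark-style sketch conveys the shape of this first phase; files_rdd is assumed to hold (filename, text) pairs, and split_into_sentences, extract_bow, load_svm_models and the broadcast variables are hypothetical stand-ins for MARGOT's actual components:

# (filename, sentence) pairs, as produced by the initial splitting step.
sentences = files_rdd.flatMapValues(split_into_sentences)

def detect_components(records):
    # Loaded once per partition: broadcast parser/dictionary and the external
    # claim/evidence SVM classifiers (third-party C software).
    parser = bcast_parser.value
    claim_svm, evidence_svm = load_svm_models()
    for filename, sentence in records:
        tree = parser.parse(sentence)
        bow = extract_bow(sentence, bcast_dictionary.value)
        yield filename, (sentence, tree, bow,
                         claim_svm.score(tree, bow),
                         evidence_svm.score(tree, bow))

scored = sentences.mapPartitions(detect_components)
scored.cache()                                     # reused by both filters below

claims   = scored.filter(lambda kv: kv[1][3] > 0)  # positive claim score
evidence = scored.filter(lambda kv: kv[1][4] > 0)  # positive evidence score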
The second phase of the pipeline, detailed in Listing 2, aggregates the claims and evidence detected in the first. In particular, a distributed join operation is performed, to obtain a collection of all the possible (claim, evidence) pairs in each file. The subsequent map function considers each pair individually. It is worthwhile noticing how the general suggestion offered in Section 3, about parallelizing the step after the aggregation whenever possible, does find application in this case. Inside each map function, the link model and the pair of feature vectors in each record are employed by a third-party SVM-based classifier to predict a link score, indicating whether the claim and evidence in each pair are linked. As reported in Listing 2, similarly to claim/evidence detection, a mapPartitions function is actually performed at this stage, aiming to minimize the initial overhead of loading the external software. Finally, only the elements with positive link scores are kept (Listing 2), as these represent the final output of the argumentation mining algorithm.
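In the same spirit, a simplified sketch of the second phase, continuing the previous fragment (load_link_model is a hypothetical wrapper around the external link classifier):

# All the possible (claim, evidence) pairs found in the same file.
candidate_pairs = claims.join(evidence)   # (filename, (claim_record, evidence_record))

def detect_links(records):
    link_svm = load_link_model()          # loaded once per partition
    for filename, (claim_rec, evidence_rec) in records:
        score = link_svm.score(claim_rec, evidence_rec)
        yield filename, (claim_rec, evidence_rec, score)

links = (candidate_pairs
         .mapPartitions(detect_links)
         .filter(lambda kv: kv[1][2] > 0))  # keep only positive link scores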
5.2 MARGOT for stream processing
The distributed algorithm presented in the previous subsection can accommodate streams of input text after only minor modifications. In particular, the operations conducted on each sentence independently, such as map and filter, can be performed on streams and batches alike, whereas the semantics of the functions that merge together different collections of key-value pairs, such as the join operation in the second phase of the pipeline, needs further refining for stream processing.
Link detection on a batch of documents implicitly entails a natural definition of “scope”, which corresponds to the document at hand. Input files should thus be considered independently, so as to identify the claims and evidence therein, enabling link detection on a per-file basis, as shown in the previous section. When dealing with a stream of data, instead, two different semantics for link detection are possible.
Scope file pairing: if the stream itself contains an indication of the input file each sentence belongs to, we might be asked to pair claims with evidence within each file, with a semantics similar to the one adopted for batch processing, thus keeping track of the claims/evidence identified so far in each file along the stream.
Scope window pairing: if there is no explicit or implicit concept of document/file, a natural choice is to detect links on a sliding window of sentences in the input stream, that is, to find a connection between claims and evidence separated by at most a fixed number of other sentences in the stream. This sort of “locality principle” especially holds true in contexts such as argumentation mining, where claims and supporting evidence are typically close to each other in the input text Eger2017.
The implementation of the second phase of MARGOT for stream processing, presented in Listing 3, enables our system to accommodate both semantics. We do not report the details of the first phase, since it is rather similar to that of batch processing.
To perform link detection on a sliding window of the input stream, the flows of claim and evidence collections are sliced using a basic window operation before performing the join. In this way, only the claims and evidence inside each window are paired. To perform link detection with file scope, instead, a stateful operation is performed to maintain a growing collection of the claims encountered in each streamed file. This set of past claims is joined with each new piece of evidence detected in the stream (see Listing 3).
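A simplified PySpark Streaming sketch of the two semantics; claims_stream and evidence_stream are assumed to be keyed DStreams produced by the first phase, detect_links is the same hypothetical helper used above, the window durations are illustrative, and the stateful variant additionally requires a checkpoint directory to be configured:

# Scope window pairing: claims and evidence are joined only if they fall
# within the same sliding window of the stream (durations in seconds).
windowed_links = (claims_stream.window(5000, 100)
                  .join(evidence_stream.window(5000, 100))
                  .mapPartitions(detect_links))

# Scope file pairing: keep a growing, per-file collection of the claims seen
# so far, and pair every newly detected evidence with all of them.
def grow_claims(new_claims, claims_so_far):
    return (claims_so_far or []) + list(new_claims)

file_links = (claims_stream.updateStateByKey(grow_claims)
              .join(evidence_stream)
              .flatMapValues(lambda cv: [(c, cv[1]) for c in cv[0]])
              .mapPartitions(detect_links))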
6 Empirical Analysis
The objective of the analysis we discuss here is to gain a quantitative understanding of the performance enhancement that can be obtained when a ML task is distributed on a network of computing nodes. We are not interested here in evaluating the accuracy of the ML methods themselves, since that measure is independent of whether the architecture is parallel, and it has already been studied in previous works DBLP:journals/eswa/0001T16; DBLP:conf/ijcai/LippiT15. Indeed, MARGOT here works as a case study for investigating the scalability of a MLaaS application, in the case of both batch and stream processing.
6.1 Simulation setup
We evaluate the performance of the proposed distributed system for argumentation mining on a cluster of 126 physical nodes. One of these machines, configured as a Spark master, coordinates the work of the other 125 slaves. All the computers are equipped with 8 CPUs, 16GB of RAM, and a 400GB hard disk. The nodes are interconnected by a 100Mbit/s bandwidth local network. We evaluate the distributed version of MARGOT on a collection of input files downloaded from the Project Gutenberg web site (https://www.gutenberg.org). As we need to operate on texts containing a significant number of claims and evidence, common novels are not suitable as input. We therefore restrict our attention to essays in the English language. The considered dataset includes 50 files, yielding 466,483 total sentences. The complete source code of the distributed version of MARGOT is available on GitHub pm.
6.2 Evaluation approach
We run separate experiments to evaluate the performance of the system in the two execution modes: with batch input documents and text streams.
To investigate batch processing, we stored the documents downloaded from Project Gutenberg into a Hadoop Distributed File System (HDFS) hdfs, which automatically slices and distributes the files on the network of computing nodes. When studying the system for stream processing, instead, we must consider an input flow of text arriving at a certain rate, measured in bytes per second. In that case, it is crucial for the system to be able to perform all the steps in the pipeline while keeping up with the input rate. If the computation is slower than the input flow, not only will the system introduce an increasing delay in the time to emit the output, but the buffer area employed by Spark to temporarily store the data waiting to be analyzed may also eventually become saturated. The Spark Streaming module treats the flow by periodically slicing it into portions called micro-batches, which are later distributed on the network and separately processed on each node. The period of micro-batch slicing is a configurable parameter. As a general recommendation, a large micro-batch period helps to keep up with high input rates, at the price of an increased latency in the results. Because our goal is not to evaluate the performance of Spark Streaming's micro-batch processing mechanism, but to evaluate the scalability of the system, we fixed the batch time to 100 seconds for all the tests on streams.
Real-world stream processing services usually experience increases and decreases of the input rate over the day. In order to evaluate the proposed MLaaS application we progressively increase the input rate during each test, so as to identify the maximum input rate that the system can sustain before it starts falling behind.
In both streaming and batch cases, we conduct three scalability tests:
Test 1 – experiment with increasing input size (i.e., increasing file size for batches, and larger windows for streams). The objective is to determine the scalability of the overall MLaaS application.
Test 2 – experiment with increasing number of (key, value) pairs emitted by the first phase (i.e., the number of emitted claims and evidence that would be later joined and processed in the second phase). The objective is to study the effect of the aggregation step bottleneck on the performance.
Test 3 – experiment with increasing number of support vectors in the employed SVM model. The objective is to study the impact of computationally demanding ML tasks. (Because the features of each sentence must be compared with all the support vectors, it is well known that larger SVM models yield longer computation times.)
We shall remark that Tests 1 and 2 are independent of the specific ML application considered, whereas Test 3 has been specifically conceived in the context of MARGOT, because it is strictly related to the kind of operations conducted in its pipeline. However, it is of general interest, since SVM classifiers are widely popular due to their excellent performance in a large variety of tasks.
Concerning Test 1, Figure 8 illustrates the scalability of the distributed system for increasing amounts of input data. In particular, the plot on the left (Figure 8a) shows the time required to process datasets of different sizes using batch processing with increasing numbers of computing nodes. The sizes of the datasets are reported in Table 1.
As desired, the total execution time greatly benefits from the introduction of additional nodes. The most significant improvements are observed between 1 and 50 nodes. Beyond 50 nodes, the cost of distributing the tasks on the network balances out the benefits yielded by the additional computational resources. Furthermore, when dealing with small input files such as the “DS1” series, a slight performance loss is observed as the number of nodes increases from 25 to 50 and above. That could be the effect of the overhead generated by partitioning and distributing small amounts of data across the network.
The other plot (Figure 8b) illustrates the system's performance with text streams. The y-axis reports the maximum input rate, in bytes per second, that the system can tolerate without falling behind. The graph has been plotted by periodically increasing the input rate and checking when the processing time of each micro-batch in the stream started to exceed the configured micro-batch period. Hence, the higher the curve in the figure, the better the system performance. We ran several experiments by varying the window size. For example, with w=5,000s we indicate the performance of the system when claims and evidence are joined over a window of 5,000 seconds, that is, fifty times larger than the micro-batch period. In this case, when employing 125 nodes, the system cannot keep up with an input rate higher than 600 bytes/s. If we assume the average sentence length to be 200 characters, this means that the system cannot process more than 3 sentences per second. Although such a performance may seem unimpressive at first glance, we should consider the sheer number of claim/evidence pairs to be analyzed by the link classifier. In particular, with a 600 bytes/s input rate and a 100s micro-batch period, a 5,000 seconds window contains around 15,000 sentences.
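For clarity, both figures follow directly from the input rate and the assumed average sentence length:

600 bytes/s ÷ 200 bytes/sentence = 3 sentences/s,
(600 bytes/s × 5,000 s) ÷ 200 bytes/sentence = 15,000 sentences per window.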
Figure 8b also reports the performance of the “scope-file” series, where no window is employed and the claim/evidence join is instead executed on a per-file basis. The performance is worse than that obtained with a window of size 100s (up to around 900 sentences per window processed on 125 nodes) and better than with 1,000s (around 72,000 sentences per window). As the average number of sentences in each file is 9,000, the position of the “scope-file” curve between the “w=100s” and “w=1,000s” series appears reasonable.
Figure 11 illustrates the results of Test 2. In order to obtain different numbers of elements to be analyzed in the second phase, in Test 2 we artificially varied the filtering thresholds that MARGOT uses to identify claims and evidence at the end of the first phase. The names of the series (5%, 35%, 65% and 90%) indicate the percentage of input sentences that reach the second phase. As expected, both batch (Figure 11a) and stream (Figure 11b) processing reveal a significant effect of this parameter on the processing time and the maximum input rate. This confirms the bottleneck effect of the aggregation step in the ML pipeline. Nonetheless, the scalability trend is clearly maintained for all the series in the graphs.
Figure 14 illustrates the results of Test 3, whose aim was to study the impact of the number of support vectors in the link prediction model on the system’s performance. Table 2 summarizes the details of each model.
Table 2: for each model, its size (MB) and its number of support vectors.
As the SVM classifier checks the features of each (claim, evidence) pair against all the support vectors in the model, one would expect larger models to cause longer computation times. Rather surprisingly, instead, we observed that the effect of this parameter on the overall batch (Figure 14a) and stream (Figure 14b) processing performance is insignificant, and that scalability is not visibly affected.
7 Conclusions
A growing number of ML applications are being deployed as ready-to-use, already trained services for end-users. This calls for distributed architectures able to scale up such services to broader user communities and larger data collections.
In this paper, we presented a distributed architecture inspired by the MapReduce paradigm, which can be used to parallelize the prediction phase of a typical ML pipeline. We conducted an experimental evaluation on a real-world text mining case study, and discussed how the methodology is general enough to be applied to many other scenarios. We considered both batch and stream processing, and studied the performance gain that can be achieved by this architecture from many angles.
An interesting open challenge is how to effectively extend this architecture to accommodate ML applications dealing with structured data, such as sequences, trees, and graphs. In that case, since the relations between the input data are as relevant as the data themselves, it is unlikely that the first phase could be implemented through a simple split operation, followed by independent map processes. Instead, we expect that a more elaborate slicing procedure would be needed at the beginning of the pipeline, in order to correctly partition the data across the nodes. A technique used to achieve a desired level of parallelism with input sequences of data in another application domain involved data replication DBLP:journals/fgcs/LoretiCCM18. The idea was to divide the sequences into slices, replicating the data at the extremities, before assigning each slice to a different node. The identification of patterns for the split step that can be applied and reused in the case of more elaborate input structures would be an interesting subject for future investigation. An even more challenging setting would be the case where the input data has a highly connected structure, hindering any kind of split and re-partition of the work. Then a completely different approach could be explored: instead of slicing the data, one could provide each node with the whole dataset, but only a portion of the trained model. In this way, each machine of the distributed architecture could conduct a lightweight prediction analysis on all the input. As in the pipeline described in this work, the results of those analyses would then have to be conveniently aggregated in a subsequent phase.