With the data explosion in many domains, such as the Internet of Things (IoT), scientific experiments [14, 38, 47], e-commerce, and social media [10, 12], people face an increasing number of obstacles in processing and analyzing the sheer volume of unstructured data. These obstacles include interactive computing and user-specific element-wise data transformations. To break through these dilemmas, a growing number of data-intensive computing frameworks have been proposed, such as MapReduce, Apache Hadoop, and Spark. Generally, the mainstream approach to gaining computing capability and scalability behind these platforms is to distribute data and computations across a cluster of nodes so that a large volume of data can be processed in a parallel and robust manner within a reasonable time [32, 50]. The success of these frameworks owes to their MapReduce-like programming models, which are in turn based on data distribution techniques (e.g., the Resilient Distributed Dataset (RDD) in Apache Spark) and high-order functions (e.g., map, filter, reduce) that can take user-defined functions (UDFs) as arguments. The semantics of these high-order functions facilitate data parallelism by manipulating datasets in an element-wise way, while UDFs are applied to each element to produce the desired result.
Despite these advantages, improving the performance of data-intensive applications raises a few challenging issues. 1) Unstructured data usually expose little information about their schema unless metadata or annotations are provided by programmers, or help comes from runtime profiling tools. 2) It is difficult to apply conventional database-style optimizations, such as relational algebraic reordering and filter pushdown, to unstructured data directly, since the programming models of current data-intensive computing platforms lack information about data schema. 3) Although Spark can process raw unstructured data directly using the DataFrame or Dataset APIs, it needs to parse the data before performing transformations (e.g., map) and actions (e.g., reduceByKey). In particular, such applications can spend 80-90% of the entire execution time on data parsing. 4) Programming models usually treat UDFs as black boxes, so their semantics are hidden from the system, resulting in insufficient information for further optimization [17, 18, 33, 34]. 5) Runtime factors, such as cache management [31, 43], are not fully utilized to tune the performance of a specific operation's execution. Therefore, it is vital to integrate program semantics, data properties and runtime factors to improve the performance of data-intensive applications, since purely static optimizations are limited or impossible without efficient profiling information about data and runtime systems.
In this paper, we present a two-phase semantics-aware approach that combines static and dynamic program analysis to optimize data-intensive applications. The first phase is an offline static analysis. Source code and performance logs collected in prior executions are analyzed to refactor code by applying three kinds of optimizations: cache management, operation reordering and element pruning. The offline phase is developed as a compiler plugin of the host development languages (e.g., Scala, Java). However, not all performance issues can be fixed in the offline phase; some need support from runtime information, such as intermediate data size, memory usage and the execution time of operations. The second phase is an online dynamic analysis to obtain the required runtime information, where we instantiate a parameterized framework based on the instrumentation generated in the offline phase to trace applications' execution and extract profiling information concerning data and system status. Unless otherwise specified, we limit our discussion to the context of Apache Spark and demonstrate the approach's effectiveness on Spark applications in the following sections. Nevertheless, the proposed approach is general and can be extended to other data-intensive platforms. To the best of our knowledge, our implementation is the first compiler plugin to help users optimize data-intensive applications. The major contributions of this work are summarized as follows:
A framework on Spark using hybrid program analysis is proposed to optimize the performance of data-intensive applications on unstructured data at the code level.
We design approaches to detect three kinds of performance issues from the perspectives of data, code and system, respectively: cache management, operation reordering and element pruning.
The SODA prototype maximizes expected caching gain offline by reducing it to a convex-concave relaxation problem and leverages the Pipage Rounding approximation algorithm to construct a probabilistic cache policy within a constant factor of the optimal, in expectation.
A novel reference measurement, named Global Execution Distance, is proposed based on the application workflow to narrow down the search space of the Pipage Rounding algorithm.
A piggyback profiling tool is integrated with Spark's internal metrics and event system to gather statistics for applications at runtime.
The rest of this paper is organized as follows. Section II reviews performance problems and introduces the whole life cycle of SODA. A comprehensive discussion about the semantics-aware data model is provided in Section III. Section IV discusses the philosophy behind three kinds of optimization strategies. The evaluation and experiments of SODA are illustrated in Section V. Section VI sketches related work. Finally, conclusions and future work are presented in Section VII.
II. System Overview
SODA is proposed as a two-phase framework, i.e., offline static analysis and online dynamic analysis, to interactively and semi-automatically assist programmers in scrutinizing performance problems camouflaged in source code.
II-A Performance Problems
SODA looks for three kinds of performance problems: Cache Management (CM), Operation Reordering (OR) and Element Pruning (EP).
Cache Management (CM): It is crucial to manage cache resources for data analytics frameworks [36, 37, 39, 45] that leverage in-memory computing to speed up performance and bypass the hindrance of disk and network I/O. Within these systems, intermediate data blocks are placed in memory by default, and a block management component manages these blocks and determines when and which one is evicted from memory. Recently, a rich line of research has proposed different data block reference measurements to improve the cache hit ratio, such as least recently used (LRU), least reference count (LRC) and most reference distance (MRD). However, there remain two important factors that previous works have not taken into account, especially in Spark.
The executing order of all stages. This order can impact system performance, especially cache behaviors.
Data block size. Data blocks with the same reference count under LRU or other, fancier measurements might not all fit in memory at the same time, which raises the question of cache priority with regard to system performance.
In addition, programmers may forcibly persist a desired dataset in memory by invoking the corresponding APIs (i.e., the persist (or cache) method in Spark), resulting in a more complicated research problem. Therefore, an intelligent cache management mechanism using hybrid program analysis is needed to manage memory efficiently. In this paper, we propose a stage-level cache allocation strategy for a data-intensive system by reducing cache allocation to a convex optimization problem [6, 20, 42].
Operation Reordering (OR): A data-intensive system usually supports a rich set of operations, such as map, reduce, filter, reduceByKey, and join. A developer may face a variety of execution plans, each assembled from a sequence of operations associated with UDFs, to accomplish an application. Nonetheless, not all of these arrangements yield identical performance. Therefore, it is crucial to orchestrate the operations in an appropriate order to bypass common pitfalls that affect performance significantly. For example, filter pushdown and join reordering are two common optimization strategies to improve the performance of relational algebra-based systems when handling structured data. When processing unstructured datasets, however, it is difficult to apply these conventional database-style techniques to systems using non-relational algebraic programming models. In this paper, we propose SODA to break through such dilemmas and extend these two strategies into a more general approach for processing unstructured data.
Element Pruning (EP): It is common that not all portions of a dataset are used to produce output, which leads to a series of redundant I/O operations, such as disk I/O for reading/writing data and network I/O for transferring data among computing nodes. These redundant operations become more severe when processing unstructured data. Due to the lack of a predefined schema for a given dataset, it is difficult to trace the data workflow at a fine granularity (i.e., at the attribute level), and hence unused data attributes cannot be identified. In particular, the redundant portion of the dataset may become a dominant performance barrier when shuffling a huge volume of data across the network in a data-intensive computing system.
II-B Architecture & Background
The framework of SODA includes offline and online phases, as shown in Figure 1. The offline (static) phase is developed as a compiler plugin of the host programming languages (i.e., Scala, Java), and analyzes source code (src) and performance logs about data and runtime factors to generate a nearly-optimized and parameterized program. Firstly, the Code Analyzer analyzes source code with the help of a local compiler to construct a directed data operation graph (DOG), which represents the skeleton of an application. This graph comprises a set of nodes and edges, which denote operations and dataflows, respectively. In addition to static properties associated with the corresponding operations, a group of dynamic profiling data is extracted by the Log Analyzer from the performance log accumulated in prior executions, including the execution time, memory usage, and input and output data sizes of operations, runtime system status, etc. This information can be extracted from system logs [24, 25, 26] and provided by our profiling tool, which uses Javassist (Java Programming Assistant), a high-level bytecode instrumentation tool, to instrument Spark APIs and expose the information needed. Next, three optimization strategies, i.e., cache management, operation reordering and element pruning, are applied to assist users in scrutinizing performance problems. When a problem is found, users are informed about the performance bug by SODA and can then refactor the code. However, not all problems can be determined statically; some need more information from executions. For example, SODA makes use of the execution time and output size of operations to verify performance behavior and then create a global cache allocation strategy. To reduce the system overhead resulting from the profiling process, the Config Generator produces Profiling Guidance to inform the online phase about which operations and what kinds of computational resources need to be monitored.
In the online phase, SODA initializes an application with a parameterized configuration based on Profiling Guidance and starts a piggyback listener residing in each worker and master node to collect runtime information about memory usage, data property and system configuration. The profiling data would be accumulated and then delivered back as a performance log to the offline phase for further optimizations.
In this paper, SODA is implemented on Apache Spark, and several real-world Spark applications are used as benchmarks to evaluate its effectiveness. Apache Spark is an efficient and general engine for large-scale data processing that supports the scalability of MapReduce. Its main abstraction, named the Resilient Distributed Dataset (RDD), is a fault-tolerant and immutable collection of objects, which is logically partitioned across a cluster of computing nodes so that it can be manipulated in parallel. Spark's programming model provides two types of operations, transformations and actions. A transformation creates a new RDD from an existing one, while an action returns a value to the driver program. The lazy evaluation of transformations enables Spark to run more efficiently, since they do not compute their results until an action is invoked. An RDD has to be recomputed whenever an action is invoked on it, unless it is persisted in memory using the persist (or cache) method, which facilitates much faster access. Apache Spark automatically monitors cache usage on each node and drops old data partitions in an LRU fashion by default. In the Spark execution model, a Spark application is divided into a group of jobs executed in sequential order (multiple jobs can run simultaneously if they were submitted from separate threads), where a job is a parallel computation in response to a Spark action (e.g., save, collect). Within a job, multiple stages are generated and bounded by shuffle behaviors (e.g., reduce); stages run in parallel if there is no data dependency among them; otherwise, they are scheduled sequentially. Internally, a stage is a physical execution unit consisting of several operations. The unit is further divided into tasks, which share identical code but run on different data partitions in parallel.
Given that, we need a fine-grained profiling tool to analyze semantic properties in code as well as runtime factors, such as the evolution of data, the execution time of operations and system status, to narrow the gap between the programming model and the execution model.
III. Semantics-Aware Data Model
We propose a semantics-aware data model to keep track of the evolution of datasets.
III-A Attribute-Based Data Abstraction
SODA parses and represents an unstructured dataset as a multiset of elements, in which repetitive elements may be included, termed D = {e_1, e_2, ..., e_n}, where n is the number of elements. To exploit datasets more deeply and provide more information to optimizations, SODA treats an element as an ordered m-tuple, e = (a_1, a_2, ..., a_m), where a_i is the value(s) of an attribute attr_i. One or two datasets can be manipulated by an operation (including its user-defined function (UDF)) to generate a new dataset, where the operation can access and transform attributes of an element. Let D' = op_f(D) denote that a new dataset D' is generated by applying a unary operation op (e.g., map, flatMap, filter) and its corresponding UDF f to an input dataset D. Similarly, we can define binary operations. In the following discussion, we use unary operations to demonstrate our approach for simplicity; the same idea can be applied to binary operations.
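As a toy illustration of this abstraction (the function and attribute names here are our own, not SODA's actual implementation), a dataset can be modeled as a multiset of attribute tuples and a unary operation as a function over elements:

```python
# Hypothetical sketch: a dataset is a multiset (list) of elements, and an
# element is an ordered tuple of attribute values. A unary operation applies
# its UDF to every element (map) or keeps matching elements (filter).
def apply_unary(op, udf, dataset):
    """Apply a unary primitive with its UDF to a dataset of tuples."""
    if op == "map":
        return [udf(e) for e in dataset]
    if op == "filter":
        return [e for e in dataset if udf(e)]
    raise ValueError(f"unsupported operation: {op}")

# A dataset of 3-tuples (attr_1, attr_2, attr_3); duplicates are allowed.
D = [(1, "a", 10), (2, "b", 20), (2, "b", 20)]
D1 = apply_unary("map", lambda e: (e[0], e[2]), D)   # project two attributes
D2 = apply_unary("filter", lambda e: e[0] > 1, D1)   # keep matching elements
```

Note that the multiset semantics is preserved: the duplicate element appears twice in both D1 and D2.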
To process such a transformation in static code analysis, SODA first models the attributes of D and D' by analyzing their type information, as well as the input and output of f. Let A_D and A_{D'} denote all extracted attributes of D and D', respectively. Next, SODA analyzes the source code of f to create dataflows between A_D and A_{D'} at the attribute level.
III-B Primitive Operations
SODA defines six primitive operations to imitate common behaviors of a general data-intensive system, as shown in Table I.
| Operation | Notation | Examples in Apache Spark |
| --- | --- | --- |
| Map | map_f(D) | map, flatMap, mapValues, mapPartitions |
| Filter | filter_p(D) | filter, sample, collect |
| Set | set(D_1, D_2) | ++, intersection, union |
| Join | join_K(D_1, D_2) | join, leftOuterJoin, rightOuterJoin, fullOuterJoin |
| Group | group_K(D) | groupBy, groupByKey |
| Agg | agg_{f,v}(D) | reduce, aggregate, fold, max, min |
map_f(D) is an operation that returns a new dataset by applying f to each element of D. flatMap is a special map that flattens all elements of the input.
filter_p(D) is an operation that takes a predicate p as a parameter and keeps an element e when p(e) is true. Filter reduces the number of elements involved in the successive computation, and thereby reduces the data size for later computation and communication.
set(D_1, D_2) is an operation on two input datasets, D_1 and D_2, that generates a new dataset by applying a set operator (e.g., union, intersection) to them, where the two datasets D_1 and D_2 should have identical attribute sets.
join_K(D_1, D_2) is a binary operation on two input datasets, D_1 and D_2, that generates a new dataset by joining each pair of elements with matching keys K, where K is a subset of the attributes shared by both D_1 and D_2.
group_K(D) is a unary operation that returns a new dataset by collecting into one group the elements sharing identical value(s) on key(s) K.
agg_{f,v}(D) combines all elements in D into a single value with the help of f and an initial value v. Note that when the result of an operation (e.g., reduceByKey) is not a single value, we classify it as a "Group" operation.
Actually, there is an implicit shuffle operation behind the last four operations to transfer data across processing nodes, which dramatically affects the whole system's performance due to expensive I/O operations. One of the ultimate goals of SODA is to reduce the amount of shuffled data as much as possible with the help of our proposed optimization strategies. Although the above definitions involve at most two input datasets, it is easy to extend the concepts to accommodate more.
III-C Data Operational Graph
SODA builds a directed data operational graph (DOG) to represent an application and conducts three kinds of optimizing strategies atop this graph. A vertex depicts a primitive operation described in Table I and the dataset generated by this operation. An edge denotes the data flow between two operations. For each vertex, there is a group of properties accumulated from static analysis, dynamic analysis, or both, on code, data and the runtime system, which is defined in more detail in Table III. We also add two special nodes, named Source and Sink. The Source node is connected to all initial input datasets, while the sole output of every stage points to the Sink node. SODA conducts optimizations atop a DOG, rather than on an abstract syntax tree (AST), for the following reasons: 1) a data-intensive system usually supports various programming-language APIs (e.g., Scala, Java and Python APIs in Apache Spark), so a general optimization backend is compatible with different programming models; 2) SODA focuses on optimizations at the level of operations, rather than at the lower level of AST nodes; 3) there is a huge gap between AST nodes and the simulated system behaviors that interpret applications and datasets.
Execution model. Without loss of generality, SODA splits an execution plan of a DOG into a series of stages that are bounded by shuffle behaviors, denoted by S = {s_1, s_2, ..., s_k}. As shown in Figure 2, the toy application is composed of seven stages. A stage is delegated as a physical scheduling unit consisting of multiple operations to fulfill a sub-job. Generally, a stage s_i involves an execution path P(s_i) between the Source node and its target node t_i if no cache mechanism is provided; for example, P(s_i) is the set of nodes involved in computing the outcome t_i of stage s_i in Figure 2. Generally speaking, the computational cost of a stage is calculated by aggregating the execution times of all involved operations, denoted as C(s_i) = Σ_{v ∈ P(s_i)} c(v), where c(v) denotes the execution time of the operation of node v. Furthermore, the total execution time of an application is given by summing all stages' costs: C = Σ_{i=1}^{k} C(s_i).
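The cost model above can be sketched in a few lines; the stage paths, node names and timings below are illustrative assumptions, not measurements from SODA:

```python
# Stage cost C(s_i) sums the execution time c(v) of every node on the stage's
# execution path; the application's total cost sums over all stages.
def stage_cost(path, exec_time):
    """Cost of one stage: sum of operation times along its execution path."""
    return sum(exec_time[v] for v in path)

def total_cost(stages, exec_time):
    """Total application cost: sum of all stage costs (no caching assumed)."""
    return sum(stage_cost(p, exec_time) for p in stages.values())

# Hypothetical per-node execution times (seconds) and stage paths.
exec_time = {"load": 4.0, "map_1": 2.0, "filter_1": 1.0, "agg_1": 3.0}
stages = {
    "s1": ["load", "map_1", "filter_1"],           # path from Source to s1's target
    "s2": ["load", "map_1", "filter_1", "agg_1"],  # recomputed if nothing is cached
}
```

The overlap between the two paths is exactly what the cache allocation strategy of Section IV-A tries to exploit: caching the output of filter_1 would remove the shared prefix from s2's cost.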
It is well known that in a data-intensive system, stages can run in parallel if there is no data dependency among them. However, without loss of generality, we assume that they are scheduled in sequential order. SODA determines this order by analyzing the data dependencies of stages and the submission times of stages in prior executions, extracted from the performance log. An operation can be executed simultaneously by a cluster of executors on different data partitions. Technically, these executors can be equipped with configurable sizes of computing resources (e.g., CPU, memory). We also assume that the memory resource in an executor is divided into two sections, for storage (i.e., caching intermediate data) and computation (i.e., allocating objects). We denote by M the size of the storage memory.
IV. Optimization Strategies
There are three kinds of optimization strategies: cache management, operation reordering and element pruning.
IV-A Cache Management
In this section, we go over the details of the cache management policy. The notation is categorized and summarized in Table III.
Maximizing Expected Caching Gain. A global cache allocation is usually preferred to minimize the aggregated execution cost of an application. In particular, we assume C_0 is the real execution time of an application without any optimizations, which serves as an upper bound on the expected cost. Our objective here is to determine a feasible cache allocation X that maximizes the caching gain, i.e., the expected cost reduction attained by caching data, which is defined as F(X) = C_0 − Σ_{i=1}^{k} C(s_i, X), where C(s_i, X) is the predicted (or expected) computational cost of a stage s_i under the allocation X.
To determine a global cache allocation policy, a binary matrix X is defined to indicate the cache status of the data generated by a node v after executing a stage s_i, where the row order of X reflects the real-time scheduling order of all stages extracted from online profiling information. In the matrix, a cell with value 1 (i.e., x_{i,v} = 1) indicates that the output data of v is retained in memory after stage s_i is done (see Equation 5b); otherwise, i.e., when x_{i,v} = 0, the data is evicted from memory (see Equation 5c). It is worth mentioning that the cache capacity constraint in an executor (M is the size of the storage memory) limits the amount of involved data that can be retained in memory (see Equation 5d). Reading a column of X from top to bottom, it is easy to identify at which stage a node's data is stored into memory and at which stage it is evicted. Such an allocation plan tells programmers when to persist or unpersist data in memory in the code.
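For instance, the capacity constraint (Equation 5d) on such an allocation matrix can be checked as follows; this is a minimal sketch with hypothetical node names and sizes, not SODA's solver:

```python
# X is the binary allocation matrix: for each stage (row), which node outputs
# are kept in memory after the stage finishes. The capacity constraint demands
# that, after every stage, the cached data fits in the storage budget M.
def respects_capacity(X, sizes, M):
    """X: {stage: {node: 0/1}}; sizes: node -> data size; M: storage budget."""
    return all(
        sum(sizes[v] for v, kept in row.items() if kept) <= M
        for row in X.values()
    )

sizes = {"map_1": 6, "filter_1": 3, "agg_1": 5}
X = {"s1": {"map_1": 1, "filter_1": 1, "agg_1": 0},   # cache 6 + 3 = 9 units
     "s2": {"map_1": 0, "filter_1": 1, "agg_1": 1}}   # cache 3 + 5 = 8 units
```

With M = 10 both rows fit; with M = 8 the allocation after s1 (9 units) would violate the constraint.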
Given a global cache allocation, all operations involved in the computation of a stage are routed by following the execution path until it encounters the data of a node cached in memory; this data and its predecessors do not need to be recomputed. In the previous example of P(s_i), the cost shrinks accordingly if the data generated by nodes on the path are cached in memory. Next, given the current executing stage s_i with a global cache allocation X, the number of re-computations of a node v (because its data is used again later but not cached) needed to obtain the outcome of s_i is defined in Equation (1).
Here, paths(u, v) returns the set of paths from node u to node v (if u is identical to v, it is {v}), and prev(s_i) denotes the previously executed stage of s_i. Therefore, the predicted (or expected) computational cost of a stage can be expressed concisely under a global cache allocation policy X, as defined in Equation (2).
Finally, we try to obtain an allocation policy X that maximizes the aggregate expected caching gain:
Convex-Concave Relaxation. In particular, we seek solutions to the following problem:
where the feasible set is the set of binary matrices satisfying the source constraints, cache behaviors and cache capacity, i.e.,
where size(v) denotes the size of the intermediate data generated by the operation of v. As far as we know, the deterministic, combinatorial version of (4) is NP-hard, even when we already have background knowledge about the submitted application and runtime statistics. Nonetheless, we can relax it to a submodular maximization problem subject to knapsack constraints and apply a linear relaxation algorithm to optimize cache allocation at the stage level by minimizing the expected computational cost [20, 42]. Clearly, Equation (4) is not a convex optimization problem. However, it can be approximated as follows. We define a relaxed objective L(X) based on Equation (3) as:
Note that L(X) is a concave function, and now we have the following:
Global Execution Distance. So far, SODA can approximate a solution to (7) within a constant factor by searching the entire cache allocation space, which may lead to poor runtime performance. We believe that knowledge about the data flow and stage dependencies can mitigate this defect. Therefore, we devise a new metric to measure the time-locality distance of an operation, namely the execution distance, and introduce another constraint to the problem.
Definition IV.1 (Global Execution Distance (GED)).
For a node v, the execution distance is defined as the relative difference between the current execution point and a future stage in which v's data will be referenced: GED(v) = Σ_{s_j ∈ ref(v)} (o(s_j) − o_c), where ref(v) is the set of future stages referencing v, o(s_j) is the scheduling order of s_j, and o_c is the scheduling order of the current execution point.
In particular, there may be multiple execution distances for the data of a node v if it is used in several stages; in that case, the final number is the sum of all these distances. For instance, Table II shows the evolution of the execution distance of each node in Figure 2 as the workload runs along the scheduling order from top to bottom. In the first row of the table, there are twelve operations, whose data may be cached in memory after a stage is done; the leftmost two columns reveal the relationship between stages and their corresponding scheduling order. The number in each remaining cell indicates how far the future reference points are from the current executing stage, and it is recalculated and updated after each stage finishes. For example, after executing the stage with scheduling order 1, the execution distance of a node referenced in the stages with scheduling orders 2 and 3 is updated from 5 to 3: the new value is recalculated as (2 − 1) + (3 − 1) = 3. A cell can be set to zero if 1) the data generated by the node is referenced by another node in the same stage, or 2) the data gets referenced and there are no more references in the future. The cells with empty content denote nodes that have not been accessed so far.
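Under this reading of Definition IV.1, the distance update in the example can be reproduced with a short sketch (the function name and inputs are ours, for illustration only):

```python
# GED of a node at the current scheduling point: sum of (o(s_j) - o_c) over
# all future stages s_j whose scheduling order o(s_j) exceeds o_c and which
# reference the node's data.
def ged(future_refs, current_order):
    """future_refs: scheduling orders of the stages referencing the node."""
    return sum(o - current_order for o in future_refs if o > current_order)

# A node referenced by the stages with scheduling orders 2 and 3:
# before any stage runs (point 0) its distance is (2-0)+(3-0) = 5;
# after the stage with order 1 finishes it becomes (2-1)+(3-1) = 3.
```

Once no reference lies in the future, the distance drops to zero, matching the zero-cell rule described above.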
With the help of GED, we can also learn the set of candidates worth persisting in memory after a stage s_i is finished, termed Cand(s_i): it contains exactly the nodes whose cells are non-zero in the row of s_i in Table II. Therefore, we can narrow down the search space to approach an optimal solution to (8) by merely considering the data in Cand(s_i), rather than all data, for a stage s_i. Consider the following problem:
where the feasible set is the set of binary matrices satisfying the source constraints, cache behaviors, cache capacity, and the GED hypothesis, i.e.:
IV-B Operation Reordering
The goal of operation reordering (e.g., filter pushdown) is to improve an application's performance by reordering operations along the data path. There are two challenges: Is the reordering correct with respect to the original semantics? Does the reordering improve performance? To answer these questions, we first define Use-Set and Def-Set, following the dataflow techniques of static code analysis.
Definition IV.2 (Use-Set).
Given D' = op_f(D), Use-Set(op_f) = {a | a ∈ A_D and a is accessed by f}. The Use-Set defines all attributes of the input data used by op_f to generate D'.
Definition IV.3 (Def-Set).
Given D' = op_f(D), Def-Set(op_f) = {b | b ∈ A_{D'} and b is created or updated by f}. The Def-Set is the attribute set newly created by an operation op_f, or inherited directly from A_D.
Then, SODA uses a two-step approach to handle these two challenges. In the first step (static verification), Theorem IV.1 is proposed to ensure semantic correctness. It captures the fact that two successive operations can be reordered if the latter UDF does not use attributes that the former UDF defines.
Theorem IV.1. Two successive operations op_f and op_g on an execution path can be reordered, i.e., op_g(op_f(D)) = op_f(op_g(D)), if Use-Set(op_g) ∩ Def-Set(op_f) = ∅.
Let us take filter pushdown as an example to illustrate this theorem. Filter pushdown is a conventional optimization that pushes a filter toward the data loading side as far as possible so that the volume of intermediate data can be reduced.
Lemma IV.2. For filter_p(map_f(D)), map_f and filter_p can be reordered, i.e., filter_p(map_f(D)) = map_f(filter_p(D)), if Use-Set(filter_p) ∩ Def-Set(map_f) = ∅.
The full proof of Lemma IV.2 follows:
Assume the two plans P_1 = filter_p(map_f(D)) and P_2 = map_f(filter_p(D)).
We prove that P_1 = P_2. For a record e ∈ D, let P_1(e) and P_2(e) be the set of element(s) generated by applying the corresponding operations in sequence to e. Notice that P_1 = ⋃_{e ∈ D} P_1(e) and P_2 = ⋃_{e ∈ D} P_2(e); in all our proofs, a "set" refers to a dataset (mathematically termed a multiset), which allows repetitive elements, and the union operation (here an alias for the sum operation on multisets) preserves repetitive elements as well. To prove P_1 = P_2, it suffices to show that P_1(e) = P_2(e) for every e ∈ D. We prove it by justifying the following two cases, where 1_p is the indicator function of filter_p representing its selectiveness. 1. When 1_p(e) = 1, P_2(e) = map_f({e}) = {f(e)}. Now since Use-Set(filter_p) ∩ Def-Set(map_f) = ∅, we know that every attribute in Use-Set(filter_p) is unchanged by f, and by Definition IV.2, filter_p's behavior depends solely on the attribute set Use-Set(filter_p); we have 1_p(f(e)) = 1_p(e) = 1, and hence P_1(e) = filter_p({f(e)}) = {f(e)}.
Thus for e with 1_p(e) = 1, in case 1, we have P_1(e) = P_2(e).
2. Similarly, when 1_p(e) = 0, P_2(e) = map_f(∅) = ∅, and P_1(e) = filter_p({f(e)}) = ∅ since 1_p(f(e)) = 1_p(e) = 0.
Thus for e with 1_p(e) = 0, in case 2, we also have P_1(e) = P_2(e). Combining the results in both cases (which are all cases possible), we have proved P_1(e) = P_2(e) for every e ∈ D, and consequently, P_1 = P_2. ∎
Correspondingly, we obtain the following lemmas to determine whether a Filter operation can be pushed down before Group and Set operations, respectively.
For filter_p(group_K(D)), group_K and filter_p can be reordered, if Use-Set(filter_p) ∩ Def-Set(group_K) = ∅.
For filter_p(set(D_1, D_2)), filter_p and the Set operation can be reordered safely along the D_1 and D_2 data paths: filter_p(set(D_1, D_2)) = set(filter_p(D_1), filter_p(D_2)).
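The static side of these checks reduces to a set-disjointness test; the sketch below uses hypothetical attribute names, whereas the real analysis would extract the Use- and Def-Sets from UDF source code:

```python
# Theorem IV.1 as a predicate: two successive operations may be swapped if
# the latter UDF uses no attribute that the former UDF defines.
def can_reorder(former_def_set, latter_use_set):
    """True if Use-Set(latter) and Def-Set(former) are disjoint."""
    return former_def_set.isdisjoint(latter_use_set)

# A filter reading only "brand" can be pushed before a map defining "score";
# a filter reading "score" cannot, because the map (re)defines that attribute.
```

This condition guarantees semantic correctness only; whether the swap actually pays off is decided by the dynamic evaluation step below.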
In the second step (dynamic evaluation), two polynomial regression models (chosen for their wide applicability in engineering) are trained for the two operations using profiling information; they then predict the execution time of each operation on new input. If SODA gets positive feedback from the prediction models, it suggests that programmers reorder the two operations.
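As a sketch of this step (the degree, samples and names are assumptions, not SODA's actual training setup), one can fit a least-squares cost model of execution time versus input size and query it for a new input:

```python
# Least-squares fit of a degree-1 polynomial, time ≈ a + b * size, from
# profiling samples; higher degrees follow the same normal-equation idea.
def fit_linear(xs, ys):
    """Return a callable model fitted to (input size, execution time) pairs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return lambda x: (my - b * mx) + b * x

# Synthetic profiling samples: input size (GB) -> operation time (s).
sizes = [1.0, 2.0, 4.0, 8.0]
times = [2.0, 3.5, 6.5, 12.5]   # exactly 0.5 + 1.5 * size in this toy data
model = fit_linear(sizes, times)
```

Fitting one such model per operation lets SODA-style tooling compare the predicted cost of the original and the reordered plan on the expected input size before suggesting a swap.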
IV-C Element Pruning
Element pruning is an optimization that eliminates unused attributes of an element by analyzing data dependencies among operations at the attribute level. SODA analyzes an operation and its associated UDF(s) to determine the attribute dependencies between the input and output datasets of this operation. Then a directed data dependency graph (DDG) is built to represent the whole data flow of the application by combining all attribute dependency relationships among operations. A node represents an attribute of a dataset involved in an operation, while an edge from a node s to another node d indicates that d has either a data or a control dependency on s. A control dependency edge means that s and d are identical attributes; a data dependency means that the value of d is updated or created from s. An attribute node may have multiple incoming and outgoing edges. To identify an application's start and end points, we add two special nodes, source and sink, to this graph, connect all input attributes of the application to source, and connect all output attributes to sink. All the dummy edges outgoing from source and incoming to sink are assigned as control dependencies. Therefore, we can reduce this complicated optimization to the problem of traversing the graph and eliminating a node v if there exists no path between v and sink, since an attribute node can be eliminated safely if it does not contribute to producing an output of the application.
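The traversal described above can be sketched as a reachability test on a small DDG; the graph shape and attribute names below are illustrative, loosely modeled on the Listing 1 example:

```python
# An attribute node is prunable if no path leads from it to the sink node.
def reaches_sink(graph, node, sink="sink", seen=None):
    """Depth-first reachability from node to sink (graph assumed acyclic)."""
    seen = seen or set()
    if node == sink:
        return True
    seen.add(node)
    return any(reaches_sink(graph, n, sink, seen)
               for n in graph.get(node, ()) if n not in seen)

def prunable(graph):
    """All attribute nodes that contribute to no output of the application."""
    nodes = set(graph) | {n for succ in graph.values() for n in succ}
    return {v for v in nodes - {"source", "sink"}
            if not reaches_sink(graph, v)}

# attr_3 only flows into a grouped result that never reaches sink.
ddg = {"source": ["attr_1", "attr_2", "attr_3"],
       "attr_1": ["out_1"], "attr_2": ["out_1"],
       "attr_3": ["grouped_attr_3"],
       "out_1": ["sink"]}
```

Here both attr_3 and its grouped descendant are reported as safely removable, mirroring the yellow rectangles in Figure 3.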
Figure 3 shows an example of the data dependency graph of Listing 1. Each row represents a group of attributes of a dataset, named by the corresponding leftmost text above the dashed arrow. A rectangle represents an attribute, labeled by the text inside. It is obvious that the attribute "[attr_3]" does not contribute to "sink", although it is grouped by the groupByKey operation from the attribute "attr_3" in the first map. A preliminary experiment shows that this kind of awkward design leads to significant computation and I/O cost because of shuffling a huge volume of data among computing nodes over the network. According to our proposed constraint, there is no edge between these yellow rectangles and "sink", so they can be removed without changing the purpose of the code snippet.
V. Evaluation & Experiments
In this section, we use four real-world data-intensive applications from different domains to evaluate the overall effectiveness of SODA on a 9-node cluster of Apache Spark (v3.0.0), comparing the runtime performance of these applications before and after optimization by SODA. Each node is configured with an Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz, 32GB of DDR4-2133 MHz ECC main memory, and 1 GigE Ethernet as the internal communication channel between nodes.
System Log Analysis (SLA) is a job to find average ranking and total advertising revenue for each website within a specified date range. There are two datasets, uservisits and pageranks.
Customer Reviews Analysis (CRA) is a project that ranks the top 20 brands by average customer rating score in the book categories. The review dataset includes over 138.1 million customer reviews spanning May 1996 to July 2014.
Pre-Processing Job (PPJ) is a data-cleaning task that looks for products satisfying two criteria: 1) the product ID starts with "B000"; and 2) the average word count of the product description is greater than 100. Elements with N/A values are removed to avoid program crashes at runtime. The metadata dataset includes 15.5 million products.
V-B Effectiveness Assessment
To evaluate the effectiveness of SODA, we first manually examine all source code and determine, by rules of thumb, which problems are present. We then apply each optimization to the four benchmarks individually and record its detection result: Detected, Undetected, or Not Present. If a problem is detected but performance becomes worse after the revision, we label it as a Failed case. The results, shown in Table IV, quantify this performance. In general, SODA detects most potential performance problems successfully, with one exception: a Failed case in the SNA workload when the CM optimization is applied.
SLA is an application working on two datasets that exercises CM and EP; OR does not apply to this application. SODA detects both problems successfully.
CRA is a complicated student project using Filter, Join, and Agg operations, which exposes EP and OR problems to SODA. In addition, its complicated workflow allows SODA to dig into the CM issue. All of these problems are detected by SODA successfully.
SNA is a research project that involves all three optimizations. All of them are detected by SODA statically; however, CM degrades execution time while the other two have positive effects on the application. We therefore label CM as Failed. More discussion of this unusual phenomenon is given in the next section.
PPJ is a data-cleaning task involving Map, Filter, and Group operations. Its two problems, CM and EP, are successfully detected by SODA.
V-C Performance Behavior
We now evaluate the performance improvement SODA brings to each workload under each optimization. We submit the revised code to Spark and run each workload five times to obtain averaged experimental data. Figure 4 shows the execution time, shuffling data size, and GC time for each benchmark, and Table V lists the speedup each optimization achieves for each benchmark.
SLA. SODA detects two performance problems, CM and EP. The revised applications are submitted to Apache Spark and run 2.07% and 1.55% faster than the baseline (RDD), respectively (see Table V). Figure 4(a) reveals that these two optimizations do not affect shuffling data size, while the GC time under CM is about 2.2 times longer than the others, since the cached dataset triggers frequent GC procedures to collect JVM garbage.
CRA. All three optimizations, CM, OR, and EP, apply to this application, yielding speedups of 59.57%, 3.09%, and 6.38%, respectively (Table V). In Figure 4(b), OR and EP improve both execution time and shuffling data size, while CM speeds up execution without reducing shuffling data size. Although CM performs better than the other two, its GC time is larger. EP reduces shuffling data size significantly at minimal time cost.
SNA. Table V shows that the three optimizations change this application's speed by -7.88%, 9.70%, and 6.15%, respectively. We believe two factors cause the degraded performance (-7.88%) of the application revised by CM: 1) this benchmark is memory-intensive; and 2) most of an executor's storage memory is occupied by cached datasets, which puts high pressure on garbage collection threads. Since SODA only handles the cache memory capacity constraint and does not consider the mutual effect between storage and execution memory, such cases are difficult to avoid. Additionally, OR reduces shuffling data size significantly.
V-D System Overhead
In this section, we compare system overhead under different granularities of monitoring: monitoring no operations, the partial set of operations suggested by SODA, or all operations involved in the applications. Table VI shows the execution time of each application under each monitoring granularity. For the partial granularity, the profiling guidance for SLA and PPJ comes from CM's suggestions, and that for CRA and SNA from OR's suggestions. Monitoring all operations takes longer than the other two granularities. The reasons behind the acceptable system overhead lie in the lightweight design of the online phase: 1) enabling and customizing Spark's internal event and metrics subsystems emits only the needed information at low overhead; 2) exploiting the data access patterns behind program semantics and the DAG-based workflow provides an instrumentation guide for probing the runtime system. For instance, we only instrument candidate operations that contribute to future ones if they are persisted in memory. It is worth mentioning that the system overhead of an application depends on its characteristics, input data size, and system configuration.
VI Related Work
A few promising programming models and platforms have been proposed through the efforts of different disciplines to accommodate the sheer size of data, such as MapReduce, Apache Hadoop, Spark, and Flink. There is a surge of interest in optimizing data-intensive applications using semantics-aware approaches [1, 4, 17, 21, 22, 23, 33, 48, 40, 46, 49]. However, there is still a research gap between static and dynamic analyses for improving the performance of data-intensive systems. To the best of our knowledge, there is no optimization plugin of the Scala compiler for Spark RDD APIs; our proposed work is the first attempt in this direction.
Microsoft’s Scope compiler [8, 15, 17, 49] automatically optimizes a data-parallel program to eliminate unnecessary code and data. It performs early filtering and calculates small derived values to minimize the amount of data-shuffling I/O based on information derived by static code analysis. There is no dynamic information involved in its optimizations.
Spark Catalyst is an extensible query optimizer that leverages advanced programming language features (e.g., Scala's pattern matching and quasi-quotes) in a novel way within the core of the Spark SQL engine. Catalyst uses a tree architecture to represent operation nodes and conducts rule-based optimizations. Finally, cost-based optimization is performed by generating multiple plans and calculating their costs to choose an optimal one. Unfortunately, it only supports the Dataset and DataFrame APIs [19, 35]. Thus, applications developed with the RDD API cannot benefit from it directly.
As to cache management, the LRU caching policy is often used [45, 36]. To improve cache management, several research works, namely MemTune, LRC, and MRD, leverage the directed acyclic graph (DAG), data dependencies among stages, and the physical schedule unit (i.e., job and stage level) as new measurements of a data block reference. However, MemTune fails to answer which RDDs should be persisted in memory and when. LRC updates a reference count for each data block according to its usages within a stage, but it does not take into account data blocks spanning multiple stages. MRD proposes a fine-grained time-locality measurement of data block references, called reference distance, based on the physical schedule units assigned by the DAG Scheduler; nonetheless, the order of scheduling units does not fully reflect the real runtime execution order. Our approach in SODA is a novel stage-level global cache management policy that emphasizes two factors impacting system performance, especially cache behavior: the execution order of stages and the data block size.
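The interplay of these two factors can be pictured with a small eviction-scoring sketch. The scoring rule below is our own illustration of combining reference distance (derived from stage execution order) with block size; it is not SODA's exact policy, and all block names and sizes are made up:

```python
# Illustrative cache-eviction scoring that combines the two factors the text
# highlights: stage execution order (reference distance) and data block size.
# This is a sketch of the idea, not SODA's actual policy.

def eviction_order(current_stage, blocks):
    """blocks: dict name -> (size_mb, sorted stage ids that reference it).
    Returns block names ordered most-evictable first."""
    def score(item):
        name, (size, refs) = item
        future = [s for s in refs if s >= current_stage]
        # No future reference: infinite distance, evict first.
        dist = (future[0] - current_stage) if future else float("inf")
        # Farther next use, then larger size, means evict sooner.
        return (dist, size)
    return [name for name, _ in sorted(blocks.items(), key=score, reverse=True)]

blocks = {
    "rdd_a": (512, [1, 6]),   # large, next used far in the future
    "rdd_b": (64,  [2]),      # small, needed by the current stage
    "rdd_c": (256, [0]),      # never referenced again
}
print(eviction_order(current_stage=2, blocks=blocks))
# -> ['rdd_c', 'rdd_a', 'rdd_b']
```

A dead block is dropped before any live one, and among live blocks the one whose next use (by stage order) is farthest away goes first, with size as a tie-breaker; a plain LRU policy sees none of this structure.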
VII Conclusions and Future Work
In this paper, we propose a semantics-aware optimization approach that assists programmers in developing and optimizing an application interactively and semi-automatically. We propose three kinds of optimization strategies: cache management, operation reordering, and element pruning. Element pruning is a static rule-based model, while the other two are hybrid models using both static and dynamic information. To obtain dynamic information about the data and the runtime system, the online phase is implemented as a piggyback monitoring tool that integrates Spark's internal event component, metrics system, and source code profiling tools. Extensive empirical results on several real-world benchmarks using Spark RDD APIs reveal that the optimized code outperforms the original implementations.
In future work, we will extend operation reordering to Map as well as other operations. So far, SODA only handles filter and join reordering, and it helps programmers choose the right operation for acceptable performance; for example, reduceByKey can replace groupByKey to reduce shuffling data size. Another promising direction is to add more performance-oriented constraints to the global cache management policy, such as requiring that all datasets needed by an operation be persisted in memory simultaneously to gain better performance.
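The groupByKey-to-reduceByKey rewrite mentioned above pays off because reduceByKey combines values on the map side before the shuffle. The following toy simulation (the dataset and its partitioning are made up) counts the records each style would send across the network:

```python
# Why reduceByKey shuffles less than groupByKey for a word count:
# with map-side combining, each partition sends one partial sum per key
# instead of one record per element. Data below is a made-up example.
from collections import Counter

partitions = [["a", "b", "a", "a"], ["b", "b", "a", "c"]]

# groupByKey-style: every (key, value) record crosses the network.
group_shuffle = sum(len(p) for p in partitions)

# reduceByKey-style: combine locally first, then shuffle partial counts.
reduce_shuffle = sum(len(Counter(p)) for p in partitions)

print(group_shuffle, reduce_shuffle)  # -> 8 5
```

Here 8 records shrink to 5 partial sums; on skewed real datasets, where many elements share few keys, the reduction is far larger.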
This work was supported in part by NSF-1836881 and NSF-1952792.
- (2014) The Stratosphere platform for big data analytics. The VLDB Journal.
- (2019) Representations and optimizations for embedded parallel dataflow languages. ACM Transactions on Database Systems (TODS) 44 (1), pp. 4.
- (2009) Hadoop.
- (2015) Spark SQL: relational data processing in Spark. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pp. 1383–1394.
- (2009) That 'internet of things' thing. RFID Journal 22 (7), pp. 97–114.
- (2004) Convex optimization. Cambridge University Press.
- (2015) Apache Flink: stream and batch processing in a single engine. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering 36 (4).
- (2008) SCOPE: easy and efficient parallel processing of massive data sets. Proceedings of the VLDB Endowment 1 (2), pp. 1265–1276.
- (1998) Javassist—a reflection-based programming wizard for Java. In Proceedings of OOPSLA'98 Workshop on Reflective Programming in C++ and Java, Vol. 174, pp. 21.
- (2016) Dynamic trend detection in US border security social-media networks. In 2016 Interservice/Industry Training, Simulation and Education Conference (I/ITSEC).
- (2016) Dynamic trend detection in US border security social-media networks. In 2016 Interservice/Industry Training, Simulation and Education Conference (I/ITSEC).
- (2019) Interaction models for detecting nodal activities in temporal social media networks. ACM Transactions on Management Information Systems (TMIS) 10 (4), pp. 1–30.
- (2008) MapReduce: simplified data processing on large clusters. Communications of the ACM.
- (2013) Addressing big data issues in scientific data infrastructure. In 2013 International Conference on Collaboration Technologies and Systems (CTS), pp. 48–55.
- (2017) Static analysis for optimizing big data queries. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, pp. 932–937.
- (2016) Stage-aware performance modeling of DAG-based in-memory analytic platforms. In 2016 IEEE 9th International Conference on Cloud Computing (CLOUD), pp. 188–195.
- (2012) Spotting code optimizations in data-parallel pipelines through PeriSCOPE. In OSDI.
- (2012) Opening the black boxes in data flow optimization. Proceedings of the VLDB Endowment 5 (11), pp. 1256–1267.
- (2016) Optimizing interactive development of data-intensive applications. In Proceedings of the Seventh ACM Symposium on Cloud Computing, pp. 510–522.
- (2016) Adaptive caching networks with optimality guarantees. ACM SIGMETRICS Performance Evaluation Review 44 (1), pp. 113–124.
- (2011) Automatic optimization for MapReduce programs. Proceedings of the VLDB Endowment.
- (2012) PANACEA: towards holistic optimization of MapReduce applications. In Proceedings of the Tenth International Symposium on Code Generation and Optimization.
- (2018) A reinforcement learning based resource management approach for time-critical workloads in distributed computing environment. In 2018 IEEE International Conference on Big Data (Big Data), pp. 252–261.
- (2017) Log-based abnormal task detection and root cause analysis for Spark. In 2017 IEEE International Conference on Web Services (ICWS), pp. 389–396.
- Detecting anomaly in big data system logs using convolutional neural network. In IEEE Cyber Science and Technology Congress (CyberSciTech), pp. 151–158.
- (2019) LADRA: log-based abnormal task detection and root-cause analysis in big data processing with Spark. Future Generation Computer Systems 95, pp. 392–403.
- (2016) Addressing complex and subjective product-related queries with customer reviews. In Proceedings of the 25th International Conference on World Wide Web, pp. 625–635.
- (2015) Principles of program analysis. Springer.
- (2015) Gurobi optimizer reference manual. Gurobi Optimization, Inc.
- (2018) Filter before you parse: faster analytics on raw data with Sparser. Proceedings of the VLDB Endowment 11 (11), pp. 1576–1589.
- (2018) Reference-distance eviction and prefetching for cache management in Spark. In Proceedings of the 47th International Conference on Parallel Processing, pp. 1–10.
- (2017) A survey of semantics-aware performance optimization for data-intensive computing. In 2017 IEEE 15th Intl Conf on Dependable, Autonomic and Secure Computing (DASC/PiCom/DataCom/CyberSciTech), pp. 81–88.
- (2015) SOFA: an extensible logical optimizer for UDF-heavy data flows. Information Systems 52, pp. 96–125.
- (2017) Optimization of complex dataflows with user-defined functions. ACM Computing Surveys (CSUR) 50 (3), pp. 1–39.
- (2019) SparkCruise: handsfree computation reuse in Spark. Proceedings of the VLDB Endowment 12 (12), pp. 1850–1853.
- (2015) Apache Tez: a unifying framework for modeling and building data processing applications. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pp. 1357–1369.
- (2012) M3R: increased performance for in-memory Hadoop jobs. Proceedings of the VLDB Endowment 5 (12), pp. 1736–1747.
- (2011) Rapid 3D seismic source inversion using Windows Azure and Amazon EC2. In 2011 IEEE World Congress on Services, pp. 602–606.
- (2014) Storm@Twitter. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data.
- (2009) Atomicity and provenance support for pipelined scientific workflows. Future Generation Computer Systems 25 (5), pp. 568–576.
- (2016) MemTune: dynamic memory management for in-memory data analytic platforms. In 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 383–392.
- (2018) Intermediate data caching optimization for multi-stage and parallel big data frameworks. In 2018 IEEE 11th International Conference on Cloud Computing (CLOUD), pp. 277–284.
- (2017) LRC: dependency-aware cache management for data analytics clusters. In IEEE INFOCOM 2017, pp. 1–9.
- (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, pp. 2–2.
- (2010) Spark: cluster computing with working sets. HotCloud.
- (2017) MRapid: an efficient short job optimizer on Hadoop. In 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 459–468.
- (2015) Dart: a geographic information system on Hadoop. In 2015 IEEE 8th International Conference on Cloud Computing, pp. 90–97.
- (2014) SmartH: enabling multi-pipeline data transfer in HDFS. In 2014 43rd International Conference on Parallel Processing, pp. 30–39.
- (2012) Optimizing data shuffling in data-parallel computation by understanding user-defined functions. In NSDI.
- (2017) Handbook of big data technologies. Springer.