An Abstract View of Big Data Processing Programs

08/05/2021
by Joao Batista de Souza Neto, et al.
UFRN
CNRS

This paper proposes a model for specifying data flow based parallel data processing programs agnostic of target Big Data processing frameworks. The paper focuses on the formal abstract specification of non-iterative and iterative programs, generalizing the strategies adopted by data flow Big Data processing frameworks. The proposed model relies on Monoid Algebra and Petri Nets to abstract Big Data processing programs at two levels: a high level representing the program data flow and a lower level representing data transformation operations (e.g., filtering, aggregation, join). We extend the model for data processing programs proposed in [1] to enable the use of iterative programs. The general specification of iterative data processing programs implemented by data flow-based parallel programming models is essential given the democratization of iterative and greedy Big Data analytics algorithms. Indeed, these algorithms call for revisiting parallel programming models to express iterations. The paper gives a comparative analysis of the iteration strategies proposed by Apache Spark, DryadLINQ, Apache Beam and Apache Flink. It discusses how the model generalizes these strategies.



1 Introduction

The intensive processing of datasets with significant volume, variety and velocity scales, namely Big Data, calls for alternative parallel programming models adapted to the implementation of data analytics tasks and capable of exploiting the potential of those datasets. Large-scale data processing frameworks have implemented these programming models to provide execution infrastructures giving transparent access to large scale computing and memory resources.

Large scale data processing systems can be classified according to their purpose into general-purpose, SQL-based, graph processing, and stream processing systems Bajaber et al. (2016). These systems adopt different approaches to represent and process data. Examples of general-purpose systems are Apache Hadoop Hadoop (2019), Dryad/DryadLINQ Isard et al. (2007); Yu et al. (2008), Apache Flink Carbone et al. (2015), Apache Beam Beam (2016) and Apache Spark Zaharia et al. (2010). According to the programming model adopted for processing data, general-purpose systems can be control flow-based (like Apache Hadoop) or data flow-based (like Apache Spark). In these systems, a program is built from individual processing blocks. These processing blocks implement operations that perform transformations on the data. The interaction between these blocks defines the data flow that specifies the order in which operations are performed. Datasets exchanged among the blocks are modeled by data structures such as key-value tuples or tables. The system infrastructure manages the parallel and distributed processing of datasets transparently. This facility allows developers to avoid dealing with low-level details inherent to the use of distributed and parallel environments.

In this context, depending on the dataset properties (velocity, volume), performance expectations, and computing infrastructure characteristics (cluster, cloud, HPC nodes), choosing a well-adapted target system for running data processing programs is often a critical decision for the programmer. Indeed, each hardware/software facility has its particularities concerning the infrastructure and optimizations made to run a program in a parallel and distributed way. This diversity suggests that systems will have different performance scores depending on their context and available resources. The choice between different configuration options depends on the non-functional requirements of the project, the available infrastructure and even the preferences of the team that develops and executes the program. In this context, the formulation of more abstract, platform-agnostic program descriptions could help in the design of systems that would be deployed in a variety of contexts.

In a previous paper de Souza Neto et al. (2020) we introduced a model for non-iterative Big Data processing programs. The model was proposed as an abstract view of data flow systems such as Apache Spark. This paper extends the model for data processing programs proposed in de Souza Neto et al. (2020) to enable the use of iterative programs. Our model provides an abstract representation of the main aspects of data flow-based data processing systems: (i) the operations applied on data (e.g., filtering, aggregation, join); (ii) the representation of program execution through directed acyclic graphs (DAGs), where vertices represent operations and datasets, and edges represent data communication. In our model, a program is defined as a bipartite graph composed of transformations (i.e., operations) and the datasets being processed by transformations. When considering actual system restrictions on the predefinition of the number of iterations of any cycle, these graphs may be converted into DAGs for execution. Our model has two levels: a high level representing the program data flow and a lower level representing data transformation operations.

Throughout the paper, we use the name data flow to refer to the representation of the program’s data flow graph and transformations to refer to the operations over datasets that compose the program. We use Petri Nets Murata (1989) to represent the data flow, and Monoid Algebra Fegaras (2017, 2019) to model transformations. Monoid Algebra is a formal system to describe the processing of distributed data. The combined use of these formalisms allows the programming logic to be expressed independently of the target Big Data processing system (such as Apache Spark, DryadLINQ, Apache Beam or Apache Flink). In this way, we provide a formal, infrastructure-independent specification of data processing programs implemented according to data flow-based programming models.

To the best of our knowledge, most works addressing Big Data processing programs have, so far, concentrated efforts on technical and engineering challenges. Few works, such as Yang et al. (2010), Chen et al. (2017), and Ono et al. (2011), have worked on formal specifications that can be used to reason about their execution abstractly. Formally modeling the parallel execution implemented by systems of the same family can be important for comparing infrastructures, defining pipelines to test parallel data processing programs, and verifying program properties (such as correctness, completeness or concurrent access to data). In this work, we use the model to define mutation operators that can be instantiated for different systems. In particular, specifications in our model have been used as an intermediate representation of programs in a mutation testing tool for Apache Spark programs Souza Neto et al. (2020).

Besides the introduction of iterative processing primitives, this paper extends de Souza Neto et al. (2020) by (i) providing a full description of our model, including a more comprehensive use of the resources provided by Petri Nets; (ii) giving a more detailed comparison of data flow-based systems, to show how they can be modeled by our proposal.


The remainder of the paper is organized as follows. Section 2 presents the background concepts of the model, namely, Petri Nets and Monoid Algebra. Section 3 presents the model for formally expressing Big Data processing programs. Section 4 describes the main characteristics of data flow based Big Data processing frameworks and discusses how our proposal can model their operations and iteration strategies. Section 5 describes the general lines of the way the model can be used in a concrete program testing application. Section 6 introduces related work addressing approaches for generalizing control and data flow parallel programming models. Finally, Section 7 concludes the paper and discusses future work.

2 Background

This section briefly presents Petri Nets and Monoid Algebra, upon which our model is built. For a more detailed presentation, the reader can refer to Murata (1989); Fegaras (2017).

Petri Nets

Petri (1962) are a formal tool to model and analyze the behavior of distributed, concurrent, asynchronous, and/or non-deterministic systems Murata (1989). A Petri Net is defined as a directed bipartite graph that contains two types of nodes: places and transitions. Places represent the system’s state variables, while transitions represent the actions performed by the system. These two components are connected through directed edges that connect places to transitions and transitions to places. With these components, it is possible to represent (i) the different states of a system (places); (ii) the actions taken by the system to move from one state to another (transitions); and (iii) how the state changes due to actions (edges). This modeling is done by using tokens to decorate the places of the net. The distribution of the tokens among places indicates that the system is in a given state. The execution of an action (transition) moves tokens from one place to another, leading to an evolution of the system’s state.

Formally, a Petri net is a quintuple N = (P, T, F, W, M0) where P ∩ T = ∅ and:

P = {p1, p2, …, pm} is a finite set of places,
T = {t1, t2, …, tn} is a finite set of transitions,
F ⊆ (P × T) ∪ (T × P) is a finite set of edges,
W : F → ℕ+ is a function associating positive weights to edges,
M0 : P → ℕ is a function defining the initial marking of a net.

The execution of a system is defined by firing transitions. Firing a transition t consumes W(p, t) tokens from each of its input places p, and produces W(t, p′) tokens in each of its output places p′. The transition t can only be fired (it is said to be enabled) if there are at least W(p, t) tokens on all its input places p. The semantics of a given process is then given by the evolution of markings produced by firing enabled transitions.
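To make the firing rule concrete, the following minimal Scala sketch (ours, not part of the paper’s formalization; all names are illustrative) encodes weighted edges and markings, checks whether a transition is enabled, and fires it:

// Minimal sketch of the Petri net firing rule (illustrative, not the paper's formalization).
case class PetriNet[P, T](
    inW: Map[(P, T), Int],   // W(p, t): weight of the edge from place p to transition t
    outW: Map[(T, P), Int])  // W(t, p): weight of the edge from transition t to place p

type Marking[P] = Map[P, Int] // number of tokens currently held by each place

def enabled[P, T](net: PetriNet[P, T], m: Marking[P], t: T): Boolean =
  net.inW.collect { case ((p, `t`), w) => m.getOrElse(p, 0) >= w }.forall(identity)

def fire[P, T](net: PetriNet[P, T], m: Marking[P], t: T): Marking[P] = {
  require(enabled(net, m, t), s"transition $t is not enabled")
  val afterConsume = net.inW.collect { case ((p, `t`), w) => (p, w) }
    .foldLeft(m) { case (mk, (p, w)) => mk.updated(p, mk.getOrElse(p, 0) - w) }
  net.outW.collect { case ((`t`, p), w) => (p, w) }
    .foldLeft(afterConsume) { case (mk, (p, w)) => mk.updated(p, mk.getOrElse(p, 0) + w) }
}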

Monoid Algebra

was proposed in Fegaras (2017) as an algebraic formalism for data-centric distributed computing operations based on monoids and monoid homomorphisms. A monoid is an algebraic structure formed by a set S, an associative operation ⊕ on S and a neutral element e for ⊕. The operation ⊕ is usually used to identify the monoid. A monoid homomorphism is a function H over two monoids, say from ⊕ to ⊗, such that:

H(X ⊕ Y) = H(X) ⊗ H(Y).

Monoid algebra uses monoid and monoid homomorphism concepts to define operations on distributed datasets, which are represented as monoid collections. One type of monoid collection is the bag, an unordered collection of elements of type α (denoted as Bag[α]). The elements of Bag[α] are formed by using the unit injection function, which generates the unitary bag {{x}} from an element x, the associative operation ⊎, which unites two bags (X ⊎ Y), and the neutral element {{}}, which is an empty bag. Another monoid collection is the one formed by lists. It can be defined as an ordered bag. It can be defined from the set List[α] containing lists of elements of type α, using [x] as the unit injection of an element x, the list concatenation ++ as the associative operation and the empty list [] as the neutral element of the monoid.

Monoid algebra defines distributed operations as monoid homomorphisms over monoid collections (which represent distributed datasets). These homomorphisms are defined to abstractly describe the basic blocks of distributed data processing systems such as map/reduce or data flow systems. The key idea behind monoid algebra is to use the associativity property of the monoid operations and the homomorphism between monoids to represent the processing of partitioned data and the combination of the results, independently from how data is partitioned.


Let us now define the most common operations used in monoid algebra. The flatmap operation receives a function f of type α → Bag[β] and a collection X of type Bag[α] as input and returns a collection of type Bag[β] resulting from the union of the results of applying f to each element of X. This operation captures the essence of parallel processing since f can be executed in parallel on top of different data partitions of a distributed dataset. Notice that flatmap is a monoid homomorphism since it is a function that preserves the structure of bags.
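To make this homomorphism property concrete, the following Scala sketch (ours; bags are approximated by Scala Lists and all names are illustrative) checks that applying flatmap to two partitions separately and uniting the partial results is the same as applying it to the united collection, which is what allows partitioned data to be processed and combined independently of how data is partitioned:

// Sketch: bags approximated as Scala Lists; bag union approximated by concatenation.
type Bag[A] = List[A]

def union[A](x: Bag[A], y: Bag[A]): Bag[A] = x ++ y

// flatmap is a homomorphism from (Bag[A], union) to (Bag[B], union):
// flatmap(f, union(x, y)) == union(flatmap(f, x), flatmap(f, y))
def flatmap[A, B](f: A => Bag[B], x: Bag[A]): Bag[B] = x.flatMap(f)

val partition1: Bag[Int] = List(1, 2)
val partition2: Bag[Int] = List(3, 4)
val f: Int => Bag[Int] = x => List(x, x * 10)

assert(flatmap(f, union(partition1, partition2)) ==
  union(flatmap(f, partition1), flatmap(f, partition2)))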

The operations groupby and cogroup capture the data shuffling process by representing the reorganization and grouping of data. The groupby operation groups the elements of a collection of type Bag[κ × α] using the first component of type κ as a key, resulting in a collection of type Bag[κ × Bag[α]], where the second component is a collection containing all elements of type α that were associated with the same key in the initial collection. The cogroup operation works similarly to groupby, but it operates on two collections that have a key of the same type κ. In this way, the result of cogroup, when applied to two collections of type Bag[κ × α] and Bag[κ × β], is a collection of type Bag[κ × (Bag[α] × Bag[β])].

The reduce operation represents the aggregation of the elements of a collection of type Bag[α] into a single element of type α through the application of an associative function of type α × α → α.

The operation orderby represents the transformation of a bag of type Bag[κ × α] into a list of type List[κ × α] ordered by the key of type κ, which supports the total order ≤.

These operations are monoid homomorphisms, as proved in Fegaras (2017). This property makes it possible to keep the way data is distributed when parallelizing tasks transparent to the model. However, these operations are not enough to model applications where iteration is needed. For this, monoid algebra, as presented in Fegaras (2017), includes the repeat operation.

The repeat operation provided by Monoid Algebra allows the representation of iterative algorithms Fegaras (2017), such as machine learning and graph processing algorithms. The repeat operation receives a function f of type Bag[α] → Bag[α], a predicate p of type Bag[α] → bool, a count number n, and a collection X of type Bag[α] as input and returns a collection of type Bag[α] as output. The definition of repeat is given below Fegaras (2019):

repeat(f, p, n, X) = X, if n = 0 or ¬p(X)
repeat(f, p, n, X) = repeat(f, p, n − 1, f(X)), otherwise

The repeat operation stops when the counter n is zero or the condition in p is false. While these conditions are not met, the operation computes f(X) and decrements n recursively. Intuitively, in each iteration, the collection resulting from the previous iteration is processed by f, which produces a new collection for the next iteration (or for the output when repeat stops).
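A minimal executable sketch of this semantics in Scala (ours; bags again approximated by Lists) is:

// Sketch of the repeat operation over bags approximated as Lists (illustrative).
type Bag[A] = List[A]

def repeat[A](f: Bag[A] => Bag[A], p: Bag[A] => Boolean, n: Int, x: Bag[A]): Bag[A] =
  if (n <= 0 || !p(x)) x            // stop: counter exhausted or condition no longer holds
  else repeat(f, p, n - 1, f(x))    // otherwise apply the step function and recurse

// Example: double every element, at most 3 times, while all elements are below 100.
val result = repeat[Int](xs => xs.map(_ * 2), xs => xs.forall(_ < 100), 3, List(1, 2, 3))
// result == List(8, 16, 24)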

In addition, monoid algebra also supports the use of lambda expressions (λx. e) and conditionals (if-then-else).


Our proposal combines the use of Petri Nets with Monoid Algebra to build abstract versions of the primitives present in Big Data processing applications. The main goal of our approach is to have an abstract representation common to data-centric programs. This representation may be used to compare different frameworks and as an (intermediate) representation to translate, refine, or optimize programs.

3 Modeling Big Data Processing Programs

This section introduces the proposed formal model for Big Data processing programs. The model is organized in two levels: data flow, and transformations. Data flow in our model is defined using Petri Nets, and the semantics of the transformations applied to the data is modeled as monoid homomorphisms on datasets.

3.1 Data Flow

For the upper level of our two-level modelization, we define a graph representing the data flow of a data processing program. We rely on the data flow graph model presented in Kavi et al. (1986), which was formalized using Petri Nets Murata (1989).

A program P is defined as a bipartite directed graph where places stand for the distributed datasets (D) of the program, and transitions stand for its transformations (T). Datasets and transformations are connected by edges (E):

P = (D, T, E)

This graph can be seen as a Petri Net, as defined in Section 2. Datasets correspond to the places of the net and transformations correspond to the net transitions. The initial marking (M0) of the Petri Net represents the availability of the input datasets for the computation to begin. There will be as many tokens in an input dataset as the number of uses of this dataset in the program. The weight function is defined as 1 for every edge leaving a place and as n for every edge arriving at a place, where n is the number of times the exact same dataset is used in the program. That is, W(d, t) = 1 for each edge (d, t) ∈ D × T and W(t, d) = |{t′ : (d, t′) ∈ E}| for each edge (t, d) ∈ T × D, where {t′ : (d, t′) ∈ E} represents the set of transformations that receive d as input, i.e., its size is the number of edges coming out of d.

For the purpose of constructing the data flow model, the available transformations on the modeled frameworks fall into two categories: basic transformations (without cycles) and iterative transformations. We first present the more common case of acyclic programs. The extension of our model to deal with iterations is presented in Section 3.3. All basic transformations in our model can have their data flow modeled by either a single transition with one input and one output edge (see Figure 1(a)) or a single transition with two input edges and one output edge (see Figure 1(b)). We call unary transformations those that receive only one dataset as input and binary transformations those that receive two datasets as input. To construct the complete graph (actually a DAG), the transitions are to be sequenced by matching the corresponding input and output datasets.

(a) Unary Transformation.
(b) Binary Transformation.
Figure 1: Types of transformations in the data flow.

To illustrate the model, let us consider the Spark program shown in Figure 2. This program receives as input two datasets (RDDs) containing log messages (line 1). It makes the union of these two datasets (line 2), removes duplicate logs (line 3), and filters out header lines, i.e., logs that match a specific pattern (line 4). The program ends by returning the filtered RDD (line 5).

def unionLogsExample(firstLogs: RDD[String], secondLogs: RDD[String]): RDD[String] = {
        val aggregatedLogLines = firstLogs.union(secondLogs)
        val uniqueLogLines = aggregatedLogLines.distinct()
        val cleanLogLines = uniqueLogLines.filter((line: String) => !(line.startsWith("host") && line.contains("bytes")))
        return cleanLogLines
}
Figure 2: Sample log union program in Spark.

In this program, we can identify five RDDs, which will be referred to using short names for conciseness. So, D = {d1, d2, d3, d4, d5}, where d1 = firstLogs, d2 = secondLogs, d3 = aggregatedLogLines, d4 = uniqueLogLines, and d5 = cleanLogLines. For simplicity, each RDD in the code was given a unique name. It makes it easier to reference them in the text. However, the model considers that each RDD is uniquely identified, independently of the concrete name given to it in the code.

We can also identify the application of three transformations in the program; thus the set T in our example is defined as T = {t1, t2, t3}, where t1 = union, t2 = distinct, and t3 = filter with the predicate λline. ¬(line.startsWith("host") ∧ line.contains("bytes")).

Each transformation in T receives one or two datasets belonging to D as input and produces a dataset also in D as output. Besides, the sets D and T are disjoint and finite.

Edges connect datasets with transformations. An edge may either be a pair in D × T, representing the input dataset of a transformation, or a pair in T × D, representing the output dataset of a transformation. In this way, the set of edges of the program is defined as E ⊆ (D × T) ∪ (T × D).

The set E in our example program is, then:

E = {(d1, t1), (d2, t1), (t1, d3), (d3, t2), (t2, d4), (d4, t3), (t3, d5)}

Using these sets, we can define a graph representing the Spark program in Figure 2. This graph is depicted in Figure 3. The distributed datasets in D are represented as circle nodes, and the transformations in T are represented as thick bar nodes of the graph, as is usual in representing Petri Nets. The edges are represented by arrows that connect the datasets and transformations. The tokens marking d1 and d2 indicate that the program is ready to be executed (initial marking). For simplicity, we only indicate the weight of an edge of the Petri Net when it is different from 1.

Figure 3: Data flow representation of the program in Figure 2.

3.2 Data Sets and Transformations

The data flow model defined above represents (i) the datasets and transformations of a program P; and (ii) the order in which transformations are processed when the program is executed. These representations abstract away from the actual contents of the datasets and the semantics of the transformations.

To define the contents of datasets in D and the semantics of transformations in T, we make use of Monoid Algebra Fegaras (2017, 2019). Datasets are represented as monoid collections, and transformations are defined as operations supported by monoid algebra. These representations are detailed in the following.

3.2.1 Distributed Datasets

A distributed dataset in D can either be represented by a bag (Bag[α]) or a list (List[α]). Both structures represent collections of distributed data Fegaras (2019), capturing the essence of the concepts of RDD in Apache Spark, PCollection in Apache Beam, DataSet in Apache Flink and DryadTable in DryadLINQ. These structures provide an abstraction of the actual data distributed in a cluster in the form of a simple collection of items.

We define most of the transformations of our model in terms of bags. We consider lists only for transformations implementing sorts, which are the only ones in which the order of the elements in the dataset is relevant.

In monoid algebra, bags and lists can either represent distributed or local collections. Monoid homomorphisms treat these two kinds of collections in a unified way Fegaras (2019). In this way, we will not distinguish between distributed and local collections when defining our transformations.

3.2.2 Transformations

In our model, transformations on datasets take one or two datasets as input and produce one dataset as output. Transformations may also receive other types of parameters, such as functions, which represent data processing operations defined by the developer, and literals such as boolean constants. A transformation in the transformation set of a program is characterized by (i) the operation it implements, (ii) the types of its input and output datasets, and (iii) its input parameters.

We define the transformations of our model in terms of the operations of monoid algebra defined in Section 2. We group transformations into categories according to the types of operations that we identified in the data processing systems that we studied.

Mapping Transformations

transform values of an input dataset into values of an output dataset by applying a mapping function. Our model provides two mapping transformations: flatMap and map. Both transformations apply a given function f to every element of the input dataset to generate the output dataset, the only difference being the requirements on the type of f and its relation with the type of the generated dataset. Given an input dataset of type Bag[α], the map transformation accepts any f : α → β and generates an output dataset of type Bag[β], while the flatMap transformation requires f : α → Bag[β] to produce a dataset of type Bag[β] as output.

The definition of flatMap in our model is just the monoid algebra operation defined in Section 2:

flatMap(f, D) = flatmap(f, D)

The map transformation derives data of type Bag[β] when given a function f : α → β. For that to be modeled with the flatmap from monoid algebra, we create a lambda expression that receives an element x from the input dataset and results in a collection containing only the result of applying f to x (λx. {{f(x)}}). Thus, map is defined as:

map(f, D) = flatmap(λx. {{f(x)}}, D)
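A small Scala sketch of this reduction (ours; bags approximated as Lists) is:

// Sketch: map expressed in terms of flatmap over bags approximated as Lists (illustrative).
type Bag[A] = List[A]

def flatmap[A, B](f: A => Bag[B], d: Bag[A]): Bag[B] = d.flatMap(f)

// map wraps each result f(x) into a singleton bag and delegates to flatmap.
def map[A, B](f: A => B, d: Bag[A]): Bag[B] = flatmap((x: A) => List(f(x)), d)

// map((x: Int) => x + 1, List(1, 2, 3)) == List(2, 3, 4)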

Filter Transformation

uses a boolean function p to determine whether a data item should be mapped to the output dataset. As in the case of map, we use a lambda expression to build a singleton bag:

filter(p, D) = flatmap(λx. if p(x) then {{x}} else {{}}, D)

For each element x of the input dataset D, the filter transformation checks the condition p(x). It forms the singleton bag {{x}} or the empty bag ({{}}), depending on the result of that test. This lambda expression is then applied to the input dataset using the flatmap operation.

For instance, consider a boolean function p such that p(x) = (x > 2) and a bag X = {{1, 2, 3, 4}}; then filter(p, X) = {{3, 4}}.
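Under the same List-based approximation of bags used above, a sketch of filter (ours, illustrative) is:

// Sketch: filter expressed with flatmap, returning a singleton or an empty bag per element.
type Bag[A] = List[A]

def flatmap[A, B](f: A => Bag[B], d: Bag[A]): Bag[B] = d.flatMap(f)

def filter[A](p: A => Boolean, d: Bag[A]): Bag[A] =
  flatmap((x: A) => if (p(x)) List(x) else Nil, d)

// filter((x: Int) => x > 2, List(1, 2, 3, 4)) == List(3, 4)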

Grouping Transformations

group the elements of a dataset with respect to a key. We define two grouping transformations in our model: groupByKey and groupBy. The groupByKey transformation is defined as the groupby operation of Monoid Algebra. It maps a key-value dataset into a dataset associating each key to a bag. Our groupBy transformation uses a function k to map elements of the collection to a key before grouping the elements with respect to that key:

groupBy(k, D) = groupByKey(map(λx. (k(x), x), D))

For example, let us consider the identity function λx. x to define each key, a dataset X = {{1, 2, 2}} and a key/value dataset Y = {{(1, a), (2, b), (2, c)}}. Applying groupBy to X and groupByKey to Y results in:

groupBy(λx. x, X) = {{(1, {{1}}), (2, {{2, 2}})}}
groupByKey(Y) = {{(1, {{a}}), (2, {{b, c}})}}
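A Scala sketch of these two transformations over List-based bags (ours, illustrative rather than the paper’s formal definitions; the order of the groups in the result is not significant) is:

// Sketch: groupByKey as the grouping primitive, and groupBy defined on top of it.
type Bag[A] = List[A]

def groupByKey[K, V](d: Bag[(K, V)]): Bag[(K, Bag[V])] =
  d.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }.toList

def groupBy[K, A](key: A => K, d: Bag[A]): Bag[(K, Bag[A])] =
  groupByKey(d.map(x => (key(x), x)))

// groupBy(identity[Int], List(1, 2, 2)) contains (1, List(1)) and (2, List(2, 2))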

Set-like Transformations

correspond to binary mathematical operations in distributed collections such as those defined in set theory. They operate on two datasets of the same type and result in a new dataset of the same type. The definition of these transformations is based on the definitions in Fegaras (2019).

The union transformation represents the union of the elements of two datasets into a single dataset. This operation is represented in a simple way using the bag union operator (⊎):

union(D1, D2) = D1 ⊎ D2

We also define the intersection and subtract transformations. To define these transformations, we first define auxiliary operations some and all that represent the existential (∃) and universal (∀) quantifiers, respectively. These operations receive a predicate function p and reduce the dataset to a logical value:

Using some and all, we can define the transformations intersection and subtract as follows:

The intersection of bags D1 and D2 selects all elements of D1 appearing at least once in D2. Subtracting D2 from D1 selects all the elements of D1 that differ from every element of D2.
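An illustrative Scala sketch of some, all, intersection and subtract over List-based bags (our approximation, not the paper’s exact formulas) is:

// Sketch: existential/universal reductions and the set-like transformations built on them.
type Bag[A] = List[A]

def some[A](p: A => Boolean, d: Bag[A]): Boolean = d.map(p).foldLeft(false)(_ || _)
def all[A](p: A => Boolean, d: Bag[A]): Boolean  = d.map(p).foldLeft(true)(_ && _)

// Keep the elements of d1 that occur at least once in d2.
def intersection[A](d1: Bag[A], d2: Bag[A]): Bag[A] =
  d1.filter(x => some((y: A) => x == y, d2))

// Keep the elements of d1 that differ from every element of d2.
def subtract[A](d1: Bag[A], d2: Bag[A]): Bag[A] =
  d1.filter(x => all((y: A) => x != y, d2))

// intersection(List(1, 2, 2, 3), List(2, 3, 4)) == List(2, 2, 3)
// subtract(List(1, 2, 2, 3), List(2, 3, 4)) == List(1)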

Unlike the union operation on mathematical sets, the union transformation defined in our model maintains repeated elements from the two input datasets. To allow the removal of these repeated elements, we define the distinct transformation. To define distinct, we first map each element of the dataset to a key/value tuple containing the element itself as the key. Then, we group this key/value dataset, which results in a dataset in which each group is the repeated key itself. Finally, we map the key/value elements only to the key, resulting in a dataset with no repetitions. The distinct transformation is defined as follows:
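The following Scala sketch (ours, over List-based bags) mirrors those three steps:

// Sketch of distinct: pair each element with itself, group by key, then keep only the keys.
type Bag[A] = List[A]

def groupByKey[K, V](d: Bag[(K, V)]): Bag[(K, Bag[V])] =
  d.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }.toList

def distinct[A](d: Bag[A]): Bag[A] =
  groupByKey(d.map(x => (x, x)))   // steps 1 and 2: key each element by itself and group
    .map { case (k, _) => k }      // step 3: project back to the key, one per group

// distinct(List(1, 2, 2, 3, 3, 3)).sorted == List(1, 2, 3)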

Aggregation Transformations

collapse the elements of a dataset into a single element. The most common aggregations apply binary operations to the elements of a dataset, generating either a single value for the whole dataset or one value per group of values associated with a key. We represent these aggregations with the transformations reduce, which operates on the whole dataset, and reduceByKey, which operates on values grouped by key. The reduce transformation has the same behavior as the reduce operation of monoid algebra. reduceByKey is also defined in terms of reduce, but since its result is the aggregation of the elements associated with each key rather than the aggregation of all elements of the dataset, we first need to group the elements of the dataset by their keys:
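A sketch of reduceByKey in Scala over List-based bags (ours; the grouping step reuses the groupByKey sketch above) is:

// Sketch: reduceByKey groups the values by key and then reduces each group with f.
type Bag[A] = List[A]

def groupByKey[K, V](d: Bag[(K, V)]): Bag[(K, Bag[V])] =
  d.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }.toList

def reduce[A](f: (A, A) => A, d: Bag[A]): A = d.reduce(f)

def reduceByKey[K, V](f: (V, V) => V, d: Bag[(K, V)]): Bag[(K, V)] =
  groupByKey(d).map { case (k, vs) => (k, reduce(f, vs)) }

// reduceByKey[String, Int](_ + _, List(("a", 1), ("b", 2), ("a", 3))) contains ("a", 4) and ("b", 2)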

Join Transformations

implement relational join operations between two datasets. We define four join transformations, which correspond to well-known operations in relational databases: innerJoin, leftOuterJoin, rightOuterJoin, and fullOuterJoin. The innerJoin transformation combines the elements of two datasets based on a join predicate expressed as a relationship, such as having the same key. LeftOuterJoin and rightOuterJoin combine the elements of two datasets like an innerJoin, adding to the result all values of the left (right) dataset that do not match any element of the right (left) dataset. The fullOuterJoin of two datasets forms a new relation containing all the information present in both datasets.

See below the definition of the innerJoin transformation, which was based on the definition presented in Chlyah et al. (2019):

The definition of the other joins follows a similar logic, but conditionals are included to handle the different relationships. In cases where one side does not have pairs with a certain key, the result of the join contains an empty bag on that side and the element that has the key on the other side. The definitions of leftOuterJoin, rightOuterJoin, and fullOuterJoin are as follows:
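As a rough illustration only (ours, and much simpler than the cogroup-based definition referenced from Chlyah et al. (2019)), an innerJoin over List-based key/value bags can be sketched as:

// Sketch: innerJoin pairs the values of the two datasets that share the same key.
type Bag[A] = List[A]

def innerJoin[K, A, B](d1: Bag[(K, A)], d2: Bag[(K, B)]): Bag[(K, (A, B))] =
  for { (k1, a) <- d1; (k2, b) <- d2; if k1 == k2 } yield (k1, (a, b))

// innerJoin(List((1, "x"), (2, "y")), List((1, 10), (1, 11))) ==
//   List((1, ("x", 10)), (1, ("x", 11)))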

Sorting Transformations

add the notion of order to a bag. In practical terms, these operations receive a bag and form a list, ordered according to some criteria. Sort transformations are defined in terms of the orderby operation of monoid algebra, which transforms a Bag[κ × α] into a List[κ × α] ordered by the key of type κ, which supports the total order ≤ (we will also use a function that reverses the total order of a list, thus using ≥ instead of ≤). We define two transformations, the orderBy transformation that sorts a dataset of type Bag[α], and the orderByKey transformation that sorts a key/value dataset by the key. The definitions of our sorting transformations are as follows:

The boolean value used as the first parameter defines whether the direct order or its inverse is used.

To exemplify the use of sorting transformations, let us consider a bag of integers X = {{3, 1, 2}} and a key/value bag Y = {{(2, b), (1, a), (3, c)}}. Then, using the direct order, orderBy produces the list [1, 2, 3] from X and orderByKey produces the list [(1, a), (2, b), (3, c)] from Y.

3.3 Modeling Iterative Programs

Iterative algorithms apply an operation repeatedly until a predetermined number of iterations or given conditions are reached. Common iterative algorithms are machine learning algorithms, such as Logistic Regression Hastie et al. (2009), and graph analysis algorithms, such as PageRank Brin and Page (1998), which perform iterative optimizations and calculations.

Big Data processing systems like Apache Spark, Apache Flink, Apache Beam and Dryad/DryadLINQ represent their programs as DAGs (Directed Acyclic Graphs). These systems apply a lazy evaluation strategy to execute programs. Thus, the programs are first defined, then they are translated into an optimized DAG representing the execution plan, and, finally, they are sent to run in parallel. Due to this characteristic, iterative programs, characterized by cycles, must be translated into a DAG. Therefore, the operations executed iteratively in the program must be repeated n times in the DAG, where n is the number of iterations performed by the program.

In the systems Apache Spark, Apache Beam and Dryad/DryadLINQ, iterative programs are defined with the aid of loop statements (such as for and while) of the underlying programming language to control iterations. Apache Flink, on the other hand, has a native operation (iterate) for that, where iterative operations must be encapsulated in a step function that is performed a predetermined number of times or until a specific condition, given by a convergence function, is reached.

Our model relies on the Apache Flink approach to represent the data flow of iterative programs. We define the transformations to be executed iteratively, encapsulating them in a step function that will be repeated as many times as specified in the program. The input and output of the step function must be datasets of the same type so that the output of an iteration is an input for the next one.

Iterative Data Flow

to represent the data flow of an iterative program, we use auxiliary transitions to represent the beginning of the iterations, the repetition of the step function through a cycle in the graph, and the end of the iterations. In practice, these transitions are identity transformations since they do not make changes to the data, but only control the iterations. We assume that the iteration starts with an input dataset and that the step function will be executed n times, resulting in an output dataset. In the data flow model, we abstract the control of the number of iterations. Thus, the number of iterations in the data flow model is non-deterministic. We delegate this control to a specific transformation that will be presented later. Figure 4 shows how the data flow of an iterative program is represented in our model. We highlight the step function with dashed lines to represent the part repeated in each iteration.

Figure 4: Iterative data flow.

Each iteration data flow is represented by such a sub-net and, to construct the complete Petri Net for a program, it must be composed with the other transformations, as was the case with acyclic transformations. The place corresponding to its initial dataset is the output of some previous transition, and the place corresponding to its final dataset is the input of a transition that follows it or a final (output) place.

This model can be reduced into a model without cycles. This is possible because all of the studied systems either require an explicit limit of iterations (n) or evaluate the execution plan (which corresponds to the construction of the data flow model) before the actual execution of the transformations. Consequently, the execution plan always contains the information on the number of required iterations, making it possible to unfold the iteration as many times as needed. For example, considering the iterative data flow shown in Figure 4, when unfolding this data flow for 3 iterations (n = 3), we obtain the data flow shown in Figure 5, in which the auxiliary transitions were removed and the transformations within the step function were repeated 3 times.

Figure 5: Expanded iterative data flow for 3 iterations.
Iterative Transformations

we define the semantics of iterative transformations in terms of the repeat operation of monoid algebra, which receives a step function f of type Bag[α] → Bag[α], a predicate function p of type Bag[α] → bool, a counter n (n ≥ 0) and a bag of type Bag[α] as input and recursively applies the function f until the condition in p becomes false or n iterations occur, returning the resulting collection as output.

We define two iterative transformations: iterate and iterateWithCondition. The iterate transformation takes a step function f, a counter n, and a collection as input and applies f n times. The transformation iterateWithCondition is similar, but it receives an additional predicate function p, so it iterates n times or until the condition in p is false, whichever is reached first (the counter n is necessary to avoid an infinite loop if the condition never becomes false). The definitions of iterate and iterateWithCondition are as follows:
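In terms of the repeat sketch given earlier, these two transformations could be written as follows (our illustration; iterate simply uses a predicate that is always true):

// Sketch: iterative transformations expressed with the repeat operation (bags as Lists).
type Bag[A] = List[A]

def repeat[A](f: Bag[A] => Bag[A], p: Bag[A] => Boolean, n: Int, x: Bag[A]): Bag[A] =
  if (n <= 0 || !p(x)) x else repeat(f, p, n - 1, f(x))

// Apply the step function exactly n times.
def iterate[A](f: Bag[A] => Bag[A], n: Int, d: Bag[A]): Bag[A] =
  repeat(f, _ => true, n, d)

// Apply the step function at most n times, stopping earlier when p becomes false.
def iterateWithCondition[A](f: Bag[A] => Bag[A], p: Bag[A] => Boolean, n: Int, d: Bag[A]): Bag[A] =
  repeat(f, p, n, d)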

Example

to illustrate how an iterative program is represented in our model, let us consider the implementation of the PageRank algorithm Brin and Page (1998) in Apache Spark presented in Figure 6. This version was based on the implementation presented in Zaharia et al. (2012). The PageRank algorithm calculates the importance (ranking) of a page based on the number of links from other pages to it. Rankings are calculated iteratively so that in each iteration, a page contributes to the ranking of the pages it links to and updates its ranking with the contribution it receives from the other pages that link to it.

def pageRank(links: RDD[(String, Iterable[String])], n: Int) = {
    var ranks = links.map( link => (link._1, 1.0) )
    for(i <- 1 to n){
        val linksRanks = links.join(ranks)
        val values = linksRanks.map( lr => lr._2 )
        val contribs = values.flatMap { v =>
            val size = v._1.size
            v._1.map( url => (url, v._2 / size) )
        }
        val aggregContribs = contribs.reduceByKey( (a, b) => a + b )
        ranks = aggregContribs.map( rank => (rank._1, 0.15 + 0.85 * rank._2) )
    }
    ranks
}
Figure 6: PageRank implementation in Spark (based on Zaharia et al. (2012)).

The program shown in Figure 6 receives as input a key/value dataset of links, where the key is the address of a page, and the value is the collection of pages it links to (line 1). The program also receives as input the number of iterations (n) that will be performed. The program starts by creating the initial ranks dataset, in which each page (key) of the links dataset receives an initial ranking of 1.0 (line 2). The iterative part is defined between lines 3 and 12, where the iterations are controlled through a for statement executed from 1 to n. We abstract the block inside the for statement (lines 4 to 11) as the step function that receives the ranks dataset as input and produces, at the end of the iteration, a new version of the ranks dataset with the updated ranking of each page as output.

The step function starts with a join between links and ranks (line 4). Note that the links dataset is not changed in the step function, but is only used in the join with ranks. We then have a dataset where each element is a tuple containing the page address, the list of pages it links to and its ranking. Then we take only the part that contains the list of links and the ranking (line 5). After that, we calculate the contribution that each page sends to the ranking of the other pages it links to (lines 6 to 9). This contribution is equal to r/m, where r is the page ranking and m is the number of neighbors (pages it links to). Next, we aggregate the contributions with the reduceByKey transformation (line 10). Since the contribs dataset has key/value pairs where the key is a page and the value is a contribution it receives from another page, the result of the aggregation is a key/value dataset with the page (key) and the sum of all contributions it received (value). At the end of the step function (line 11), we update the ranks dataset so that the ranking of each page is equal to 0.15 + 0.85 * c, where c is the sum of all contributions received by the page. The program ends by returning the final ranks dataset with the ranking of each page calculated after n iterations (line 13).

To model the data flow of this program, we need to identify the datasets and transformations defined outside and inside the step function (iteration). Outside the step function, we have the input dataset links and the initial ranks dataset, defined before the iteration, which we call ranks0. We also have the map transformation that is applied to links to generate ranks0.

The datasets used within the step function are updated by each iteration. Within the step function, we denote the datasets and transformations with an iteration subscript i, representing that at each iteration, a new version of the dataset or transformation is created.

In this example, we have the datasets linksRanksi, valuesi, contribsi, aggregContribsi and ranksi (note that ranksi has the same type as ranks0). In order to fit the iteration subnet pattern, we need to distinguish between the ranks variable before an iteration and after it. This gives us the following set of places for our Petri Net:

We also have the innerJoin transformation (line 4), the map (line 5), the flatMap (lines 6 to 9), the reduceByKey (line 10) and the map (line 11).

The data flow graph representing the PageRank program is shown in Figure 7. In it, we can see the datasets and transformations defined and the edges that connect them. We can also see the transitions that represent the beginning, continuation and end of the iterations.

Figure 7: Data flow of the PageRank program.

In terms of Monoid Algebra, the program is defined as follows:

where i ranges from 1 to n.

The iteration, which begins with ranks0 and ends with the final ranks dataset, is defined as:

As we mentioned earlier, the data flow systems that we are modeling define their programs as DAGs, so the representation of iterative programs takes place through the repetition of the operations n times, where n is the number of iterations, with no cycles in the graph, unlike the cyclic representation used in our model. Our iteration representation is an abstraction of the expansion of the graph, but in fact, our model allows us to represent the DAG that would be created in the data flow systems. As an example, we can see the expanded representation of the data flow of the PageRank program for 3 iterations in Figure 8. In it, we can see that the iteration control transitions were removed and that the program is represented as a DAG.

Figure 8: Expanded data flow (without cycle) of the PageRank program for 3 iterations.

The principles of the example given above are applicable to any structured iteration defined at the Petri Net level. It is easy to see that the transformation from an iterative Petri Net into an acyclic one, for a given n, can be defined using graph transformation/rewriting.

4 Comparing Parallel Big Data Processing Frameworks

The model proposed in this paper uses as reference the characteristics of the programming strategies implemented by the most prominent data flow based Big Data processing frameworks, like Apache Spark Zaharia et al. (2010), Dryad/DryadLINQ Isard et al. (2007); Yu et al. (2008), Apache Flink Carbone et al. (2015) and Apache Beam Beam (2016). These frameworks use a similar DAG-based model to represent the workflow of data processing programs, despite adopting different strategies for executing programs, optimizing and processing data. DAGs are composed of data processing operations that are connected through communication channels. The channels are places for intermediate data storage between operations.

Our model captures DAGs (data processing operations and communication channels) with the Petri Net data flow component. The dataset nodes represent the communication channels among operations. They represent, at a high level, the abstractions used by Big Data processing frameworks for modeling distributed datasets, such as RDD in Apache Spark (see Figure 2 and Figure 3), PCollection in Apache Beam, DataSet in Apache Flink and DryadTable in DryadLINQ. Transformation nodes represent the processing operations that receive data from datasets and transmit the processing results to another dataset. The representations of the datasets and transformations in the data flow graph encompass the main abstractions of the DAGs in these systems and allow a program to be represented and analyzed independently of the system in which it will be executed. The semantics of transformations and datasets is represented in the model using Monoid Algebra.

In this paper, we focus on the abstract representation of both non-iterative and iterative Big Data processing programs. Therefore, the following lines compare and discuss the strategies adopted by existing frameworks for implementing this type of program. They also discuss how our model provides a general formal specification of these strategies.

4.1 Big Data Processing Frameworks

Big Data processing frameworks adopt control flow or data flow based parallel programming models for implementing programs. Dependence analysis is a formal theory, used in compilers, for determining ordering constraints between computations Kennedy and Allen (2001). The theory distinguishes between control and data dependencies. Control flow models focus on sequential (imperative) programming Ivanovs (2018): the data follows the control, and computations are executed explicitly based on the programmed sequence. Data flow models focus on data dependencies and allow avoiding spurious control dependencies like accidental locking Ivanovs (2018), which simplifies the definition of concurrent and independent computations.

Apache Hadoop Hadoop (2019)

is an open-source control flow system for the processing of distributed data that implements the MapReduce programming model Dean and Ghemawat (2004). MapReduce is a parallel computing model that divides processing into two operations: map and reduce. The map operation applies the same function to all the elements of a list of key/value records. The result of map is fed to the reduce operation, which processes key/value data aggregated by key. Other systems, such as Apache Spark Zaharia et al. (2010), Apache Flink Carbone et al. (2015), Apache Beam Beam (2016), and Dryad/DryadLINQ Isard et al. (2007); Yu et al. (2008), adopt data flow models that show better performance.

Both control and data flow parallel programming models reach expressiveness and execution limitations when implementing iterative data processing operations in many domains of data analysis, like machine learning or graph analysis. With the increasing interest in running these kinds of algorithms on massive datasets, there is a need to execute iterations in a massively parallel fashion. Therefore, existing systems propose different strategies for implementing iterative operations. The following lines analyze and compare these strategies.

Apache Spark

Zaharia et al. (2010) is a general purpose system for in-memory parallel data processing. Spark is centered on the concept of RDDs (Resilient Distributed Datasets), which are distributed datasets that can be processed in parallel in a processing cluster. Spark programs are represented through a DAG that defines the program’s data flow, where RDDs are processed by applying operations to them. Spark offers two types of operations: transformations, which process the data in an RDD and generate a new RDD as output, and actions, which save the contents of an RDD or generate a different result from an RDD. Spark adopts a lazy evaluation strategy, where actions trigger the processing of data, possibly applying transformations. For instance, the program unionLogsExample given in Figure 2 defines three transformations (lines 2, 3 and 4) that will be executed only when needed. This program encapsulates only transformations on RDDs; its execution is triggered by the call of an action later on. An example is the collect action, which triggers the processing of the transformations and collects the resulting RDD as a local data collection.

The in-memory processing of Spark proved to be more efficient than that of Apache Hadoop, making it more suitable for iterative programs since intermediate data does not need to be stored on disk Zaharia et al. (2010), as occurs in Hadoop. However, Spark does not have a native solution for defining iterative programs, making it necessary to use resources from the underlying programming language, like while and for loops, so that iterations can be defined. Since Spark adopts a lazy evaluation strategy, the definition of the data flow through the call of successive transformations forms an execution plan. This plan is optimized into a DAG and executed in parallel when an action is called. The definition of iterative programs follows the same principle. In this way, transformations called within an iteration form a step in the execution plan, causing these transformations to be repeated in the DAG as many times as the number of iterations programmed in the loop (see the PageRank example presented in Section 3.3).

Apache Beam

Beam (2016) is a unified model for defining both batch and streaming data-parallel processing pipelines. Beam is useful for implementing parallel data processing tasks, in which the problem can be decomposed into many smaller bundles of data that can be processed independently and in parallel. A pipeline can be executed by one of Beam’s supported distributed processing back-ends, which include Apache Flink, Apache Spark, and Google Cloud Dataflow.

Apache Beam programs are defined as data pipelines (Pipeline) that encapsulate a data flow built from distributed data collections (PCollection) and data processing operations (PTransform). Thus, a program is defined by reading an input dataset, applying operations to datasets and writing an output dataset. This pipeline is optimized into a DAG and submitted for execution to a back-end engine. Similar to Apache Spark, Beam does not provide a native solution for implementing iterative programs. Thus, the definition of iterative programs is based on the use of resources from the underlying programming language, relying on control external to the pipeline to manage iterations.

Dryad/DryadLINQ

Isard et al. (2007) is a system and model for parallel and distributed programming that was proposed by Microsoft. Dryad offered a flexible programming model by representing a program through a DAG where the vertices are processing operations and the edges are communication channels through which data is transferred. With this model, a program is not limited to just two operations as in MapReduce. Dryad was expanded through DryadLINQ Yu et al. (2008), a high-level interface that introduces an abstraction for representing distributed datasets (DryadTable) and offered a comprehensive set of operations. A program in DryadLINQ is represented by a data stream defined as a DAG, in which datasets are processed by applying operations in sequence. The definition of iterative programs in Dryad/DryadLINQ also follows the approach of Apache Spark and Apache Beam, i.e., there is no native operation to control iterations, but they can be defined using loops from the underlying programming language.

Apache Flink

is a framework and distributed processing engine for batch and streaming data processing Carbone et al. (2015). The system processes arbitrary data flow programs in a distributed runtime environment. As in other frameworks, the data flow is organized as a DAG with one or more entry or exit points. Flink implements a lightweight fault tolerance model based on the use of checkpoints that can be manually placed in the program or added by the system. Flink offers the DataSet API for batch processing and the DataStream API for stream processing. Both offer a comprehensive set of operations for data processing, including mapping, filtering and aggregation operations, in addition to other types of operations.

Among the Big Data processing frameworks analyzed in this work, Flink is the only one that offers a native solution for iterative programs. For the definition of iterative programs, Flink offers the iterate operation. This operation takes as an argument a higher-order function, called the step function, which encapsulates the iterative data flow that consumes an input dataset and produces an output dataset, which in turn is the input for the next iteration. The iterate operator implements a simple form of iteration: in each iteration, the step function consumes the entire input (the result of the previous iteration, or the initial dataset) and computes the next version of the partial solution. There are two options to specify termination conditions for an iteration: (i) a maximum number of iterations, in which case the iteration will be executed this many times; (ii) a custom convergence function that implements a convergence criterion to end the iterations. Flink also offers the delta iterate operator (iterateDelta) to address the case of incremental iterations that selectively modify elements of the solution and evolve it rather than fully recomputing it. This leads to more efficient algorithms, because not every element in the solution set changes in each iteration.

Table 1 compares the transformations defined by the model with the operations implemented in the Big Data processing frameworks. To do so, we grouped the transformations according to the types of processing they perform: Mapping, Filtering, Grouping, Sets, Aggregation, Joins and Ordering. We modeled the main types of operations provided by these frameworks. In the table, we also indicate how the model and the frameworks deal with iterative programs.

Mapping. Model: map, flatMap. Apache Spark: map, flatMap. Apache Flink: map, flatMap. Apache Beam: ParDo, FlatMapElements, MapElements. DryadLINQ: Select, SelectMany.
Filtering. Model: filter. Apache Spark: filter. Apache Flink: filter. Apache Beam: Filter. DryadLINQ: Where.
Grouping. Model: groupBy, groupByKey. Apache Spark: groupBy, groupByKey. Apache Flink: groupBy. Apache Beam: GroupByKey. DryadLINQ: GroupBy.
Sets. Model: union, intersection, subtract, distinct. Apache Spark: union, intersection, subtract, distinct. Apache Flink: union, distinct. Apache Beam: Flatten, Distinct. DryadLINQ: Union, Intersect, Except, Distinct.
Aggregation. Model: reduce, reduceByKey. Apache Spark: reduce, reduceByKey, aggregateByKey. Apache Flink: reduce, reduceGroup, aggregate. Apache Beam: Combine. DryadLINQ: Aggregate.
Joins. Model: innerJoin, leftOuterJoin, rightOuterJoin, fullOuterJoin. Apache Spark: join, leftOuterJoin, rightOuterJoin, fullOuterJoin. Apache Flink: join, leftOuterJoin, rightOuterJoin, fullOuterJoin. Apache Beam: CoGroupByKey. DryadLINQ: Join.
Ordering. Model: orderBy, orderByKey. Apache Spark: sortBy, sortByKey. Apache Flink: sortPartition, sortGroup. Apache Beam: (no direct equivalent). DryadLINQ: OrderBy.
Iteration. Model: iterate, iterateWithCondition. Apache Spark: supported with external for and while loops. Apache Flink: iterate, deltaIterate. Apache Beam: supported with external for and while loops. DryadLINQ: supported with external for and while loops.

Table 1: Comparing our model operations with operations in Big Data processing frameworks.

Some systems offer more specific operations that we do not define directly in our model. Guaranteeing complete coverage of all the operations of the considered systems is work in progress. However, most of the operations that are not directly represented in the model can easily be expressed using the transformations provided by the model, for example, classic aggregation operations like the maximum, the minimum or the sum of the elements in a dataset. We can easily represent these operations using the reduce transformation of the model:
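For illustration (ours, over the List-based bag approximation used in the previous sections), such aggregations reduce to simple binary functions passed to reduce:

// Sketch: classic aggregations expressed with the model's reduce transformation.
type Bag[A] = List[A]

def reduce[A](f: (A, A) => A, d: Bag[A]): A = d.reduce(f)

def sum(d: Bag[Int]): Int     = reduce[Int](_ + _, d)
def maximum(d: Bag[Int]): Int = reduce[Int]((x, y) => if (x >= y) x else y, d)
def minimum(d: Bag[Int]): Int = reduce[Int]((x, y) => if (x <= y) x else y, d)

// sum(List(1, 2, 3)) == 6; maximum(List(1, 5, 3)) == 5; minimum(List(4, 2, 9)) == 2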

Ideally, Big Data processing frameworks should allow users to express data flows using simple imperative statements while matching the performance of native data flow programs. We therefore believe that it is necessary to propose formal models, agnostic of the underlying programming models and their implementations, to reason abstractly about iterative and non-iterative data processing algorithms. The model proposed in the previous sections is such an abstraction of existing data flow-based programming models, independent of their specific implementations by different frameworks. It provides abstractions of data flow programming models that can be applied to specify parallel data processing programs independently of target systems.

An abstract representation of parallel data flow based programs can be used to address program testing challenges, beyond comparing Big Data processing tools, which can be useful when adopting a framework and when migrating solutions from one framework to another. Our model is used as a representation tool for defining mutation operators to apply mutation testing to data flow based Big Data processing programs. In the next section, we briefly discuss how this is done in a testing tool we developed.

5 Applications of the model

The abstract and formal concepts provided by the model make it suitable for the automation of software development processes, such as those performed by IDE tools. Consequently, we first applied the model to formalize the mutation operators presented in Souza Neto et al. (2020), where we explored the application of mutation testing to Spark programs, and in the tool TRANSMUT-Spark (publicly available at https://github.com/jbsneto-ppgsc-ufrn/transmut-spark) Souza Neto (2020) that we developed to automate this process. Mutation testing is a fault-based testing technique that relies on simulating faults to design and evaluate test sets Ammann and Offutt (2017). Faults are simulated by applying mutation operators, which are rules with modification patterns for programs (a modified program is called a mutant). In Souza Neto et al. (2020), we presented a set of mutation operators designed for Spark programs that are divided into two groups: mutation operators for the data flow and mutation operators for transformations. These mutation operators were based on faults found in Spark programs, with the idea of mimicking them.

Mutation operators for the data flow modify the DAG that defines the program. In general, we define three types of modifications in the data flow: replacing one transformation with another (both existing in the program), swapping the calling order of two transformations, and deleting the call of a transformation in the data flow. These modifications involve changes to the edges of the program. Besides, the replacement of a transformation by another must maintain type consistency, i.e., the input and output datasets of both transformations must be of the same types. In Figure 9, we exemplify these mutations on the data flow that was presented in Figure 3.

(a) Transformation Replacement.
(b) Transformations Swap.
(c) Transformation Deletion.
Figure 9: Examples of mutants created with mutation operators for data flow.

Mutation operators associated with transformations model the changes made to specific types of transformations, such as operators for aggregation transformations or set transformations. In general, we model two types of modifications: replacement of the function passed as a parameter of the transformation and replacement of a transformation by another of the same group. In the first type, we defined specific substitution functions for each group of transformations. For example, for a transformation of the aggregation type, we define five substitution functions to replace its aggregation function. Considering an aggregation transformation that receives as input a function that returns the greater of its two input parameters and an integer dataset, the mutation operator for aggregation transformation replacement will generate the following five mutants:
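The exact substitution functions are listed in Souza Neto et al. (2020); as an illustration of the kind of replacements such an operator applies, a plausible set (our reconstruction, not necessarily the paper’s exact list) is sketched below in Scala:

// Sketch: plausible substitution functions for an aggregation function (our reconstruction).
val f: (Int, Int) => Int = (x, y) => if (x >= y) x else y  // original: returns the greater value

val m1: (Int, Int) => Int = (x, y) => x          // keep only the first parameter
val m2: (Int, Int) => Int = (x, y) => y          // keep only the second parameter
val m3: (Int, Int) => Int = (x, y) => f(x, x)    // duplicate the first parameter
val m4: (Int, Int) => Int = (x, y) => f(y, y)    // duplicate the second parameter
val m5: (Int, Int) => Int = (x, y) => f(y, x)    // swap the parameters

// Each mutant program aggregates the integer dataset with one of m1..m5 instead of f.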

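The Scala sketch below instantiates this operator on a Spark reduce over an integer RDD. The concrete substitution set shown (keep first, keep second, duplicate first, duplicate second, swap parameters) is an assumption made for this example, in the spirit of the operators of Souza Neto et al. (2020); that paper should be consulted for the normative definitions.

    import org.apache.spark.rdd.RDD

    object AggregationMutantsSketch {
      // Original aggregation function: returns the greater of its two parameters.
      val f: (Int, Int) => Int = (x, y) => if (x > y) x else y

      // Assumed substitution functions used to build the five mutants.
      val substitutions: Seq[(Int, Int) => Int] = Seq(
        (x, y) => x,        // keep the first parameter
        (x, y) => y,        // keep the second parameter
        (x, y) => f(x, x),  // duplicate the first parameter
        (x, y) => f(y, y),  // duplicate the second parameter
        (x, y) => f(y, x)   // swap the parameters
      )

      // Original program and its mutants: the aggregation function is replaced.
      def original(data: RDD[Int]): Int = data.reduce(f)
      def mutants(data: RDD[Int]): Seq[Int] = substitutions.map(m => data.reduce(m))
    }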
In the other type of modification, we replace a transformation with others from the same group. For example, for set transformations (union, intersection, and subtraction), we replace one transformation with each of the remaining two; in addition, we replace the transformation with the identity of each of its two input datasets, and we also invert the order of the input datasets. Considering a set transformation that receives two integer datasets as input, the set transformation replacement operator generates the corresponding mutants, sketched below for the case of a union.

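A minimal Scala sketch of these mutants, assuming (for illustration only) that the original transformation is a union of two integer RDDs and using the Spark RDD set operations:

    import org.apache.spark.rdd.RDD

    object SetTransformationMutantsSketch {
      // Original program (illustrative choice): union of the two input datasets.
      def original(d1: RDD[Int], d2: RDD[Int]): RDD[Int] = d1.union(d2)

      // Mutants produced by the set transformation replacement operator.
      def toIntersection(d1: RDD[Int], d2: RDD[Int]): RDD[Int] = d1.intersection(d2) // other set operation
      def toSubtraction(d1: RDD[Int], d2: RDD[Int]): RDD[Int] = d1.subtract(d2)      // other set operation
      def identityFirst(d1: RDD[Int], d2: RDD[Int]): RDD[Int] = d1                   // identity of input 1
      def identitySecond(d1: RDD[Int], d2: RDD[Int]): RDD[Int] = d2                  // identity of input 2
      def invertedOrder(d1: RDD[Int], d2: RDD[Int]): RDD[Int] = d2.union(d1)         // inverted input order
    }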
The mutation operators for the other groups of transformations follow these two types of modification, respecting each group's type consistency and particularities. The tool TRANSMUT-Spark Souza Neto (2020) uses the model as an intermediate representation: it reads a Spark program and translates it into an implementation of the model, and the mutation operators are then applied at the model level. Using the model as an intermediate representation should also make it possible to extend the tool in the future so that mutation testing can be applied to programs written for Apache Flink, Apache Beam and DryadLINQ.
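The overall workflow can be summarized by the Scala sketch below; the trait and function names are hypothetical and do not reflect TRANSMUT-Spark's actual API, they only outline the translate-mutate-generate steps.

    // Hypothetical interfaces outlining the workflow; not TRANSMUT-Spark's API.
    trait ModelProgram // data flow (Petri Net) plus transformation specifications

    trait MutationOperator {
      def apply(p: ModelProgram): Seq[ModelProgram] // each result is a mutant
    }

    trait FrontEnd { def parse(source: String): ModelProgram }  // e.g., a Spark front end
    trait BackEnd  { def generate(p: ModelProgram): String }    // e.g., a Spark back end

    def mutate(source: String, fe: FrontEnd, be: BackEnd,
               operators: Seq[MutationOperator]): Seq[String] = {
      val model   = fe.parse(source)                      // source program -> model
      val mutants = operators.flatMap(op => op(model))    // operators applied on the model
      mutants.map(be.generate)                            // model mutants -> executable programs
    }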

6 Related Work

Data flow processing, which defines a pipeline of operations or tasks applied to datasets and exchanging data, has traditionally been formalized using (coloured) Petri Nets Lee and Messerschmitt (1987). Petri Nets are well adapted to modeling the organization (flow) of processing tasks that receive and produce data. For data processing programs based on data flow models, existing proposals use Petri Nets to model the flow and other formal tools to model the operations applied to the data. For example, Hidders et al. (2005, 2008) use nested relational calculus to formalize operations applied to data that is not in first normal form. Next, we describe works that have formalized parallel data processing programming models. The analysis focuses on the tools and strategies used to formalize control/data flows on the one hand and data processing operations on the other.

The authors in Yang et al. (2010) formalize MapReduce using CSP Brookes et al. (1984). The objective is to formalize the behavior of a parallel system that implements the MapReduce programming model. The system is formalized in terms of four components: Master, Mapper, Reducer and FS (file system). The Master manages the execution process and the interaction between the other components. The Mapper and Reducer components represent, respectively, the processes executing the map and reduce operations. Finally, the FS represents the file system that stores the data processed by the program. Together, these components capture the data processing pipeline implemented by MapReduce systems: loading data from the FS, executing a map function (by several mappers), shuffling and sorting, and executing a reduce function (by reducers). The model allows the analysis of the properties of, and the interaction between, the processes implemented by MapReduce systems.

In Ono et al. (2011), MapReduce applications are formalized with Coq, an interactive theorem prover. As in Yang et al. (2010), the authors formalize the components and the execution process of MapReduce systems. The user-defined functions of the map and reduce operations are also formalized in Coq, and these formal definitions are then used to prove the correctness of MapReduce programs. This differs from the work in Yang et al. (2010) (described above), which formalizes only the MapReduce system.

More recent work has proposed formal models for data flow programming models, particularly associated with Spark. The work in Chen et al. (2017) introduces PureSpark, a functional and executable specification for Apache Spark written in Haskell. The purpose of PureSpark is to specify parallel aggregation operations of Spark. Based on this specification, necessary and sufficient conditions are extracted to verify whether the outputs of aggregations in a Spark program are deterministic.

The work in Marconi et al. (2018) presents a formal model for Spark applications based on temporal logic. The model considers the DAG that forms the program and information about the execution environment, such as the number of available CPU cores, the number of tasks in the program, and the average execution time of the tasks. The model is then used to check time constraints and to make predictions about the program's execution time.

The research community has paid attention to the problem of supporting iterative programs in data flow-based programming frameworks and has proposed a number of solutions Alexandrov et al. (2019); Moldovan et al. (2018); Jeong et al. (2019). For example, Emma Alexandrov et al. (2019) can translate imperative control flow to Flink's native iterations, but only when there is a single while-loop without any other control flow statement in its body; this makes it unsuitable for data analytics tasks such as hyper-parameter optimization, simulated annealing, and strongly connected components. AutoGraph Moldovan et al. (2018) and Janus Jeong et al. (2019) compile imperative control flow to TensorFlow's native iterations Yu et al. (2018), but they do not support general data analytics beyond machine learning. Mitos Gévay et al. (2021) allows users to write imperative control flow constructs, such as regular while-loops and if statements.

7 Conclusions and Future Work

This paper presents a model for data flow processing programs. Our model combines two formal mathematical tools: Monoid Algebra and Petri Nets. Monoid Algebra is an abstract way of specifying operations over partitioned datasets, and Petri Nets are widely used to specify parallel computations. Our proposal combines these two models by building two-level specifications. The lower level uses Monoid Algebra to specify individual transformations (i.e., operations whose arguments and results are datasets). The upper level defines the program by means of a Petri Net, in which places are datasets and transitions represent operations over those datasets.

This paper is an extended version of de Souza Neto et al. (2020). The main technical differences with respect to that paper are the addition of iterations to the model and the use of our model to specify the operations available in several existing Big Data processing frameworks. In this sense, the paper gives the specification of the data processing operations (i.e., transformations) provided as built-in operations in Apache Spark, DryadLINQ, Apache Beam and Apache Flink.

In the proposed model, iterations are represented by a loop in the Petri Net that defines the program. Loops are unfolded to build a Petri Net without cycles, so as to obtain a DAG representing the program. This technique is both convenient and realistic. It is convenient because it preserves the distribution and associativity properties of the operations over datasets. It is realistic because it provides a general model of the strategies used by the most prominent Big Data processing frameworks to process loops Zaharia et al. (2012); Carbone et al. (2015); Beam (2016); Yu et al. (2008).


Beyond the interest of providing a formal model for data flow-based programs, our proposal can be used as a tool for comparing target systems or for defining program testing pipelines. We also showed how operations can be combined into data flows to implement mutation operators in mutation testing approaches. The model has already been used as an intermediate representation to specify mutation operators, which were then implemented in TRANSMUT-Spark, a software engineering tool for the mutation testing of Spark programs Souza Neto et al. (2020). Naturally, the extension of the model to include iterations will also lead us to define new mutation operators covering the testing of iterative programs. Another natural extension of this work is to instantiate the tool for other systems of the data flow family (DryadLINQ, Apache Beam, Apache Flink). This instantiation can be done by adapting TRANSMUT-Spark's front and back ends so that a program originally written for any of these systems can be tested with the mutation testing approach proposed in Souza Neto et al. (2020). This line of work, in which the model is used as an internal format, suits practitioners who do not need to see the formalism behind their tools. When exploring the similarities between different frameworks, however, our model can be used as a platform-agnostic way of formally specifying and analyzing the properties of a program before its implementation.


As future work, we intend to extend our model with Coloured Petri Nets (CPN) and CPN Tools Jensen et al. (2007) to specify the types of transformations over datasets explicitly and to manipulate, analyze and animate the specifications. This extension may be useful for detecting design problems at an early stage. We also plan to work on using the specifications for code generation targeting data flow systems similar to Apache Spark; a simple form of such code generation was implemented to generate test programs in the TRANSMUT-Spark back-end Souza Neto et al. (2020).

References

  • A. Alexandrov, G. Krastev, and V. Markl (2019) Representations and optimizations for embedded parallel dataflow languages. ACM Transactions on Database Systems (TODS) 44 (1), pp. 1–44. Cited by: §6.
  • P. Ammann and J. Offutt (2017) Introduction to Software Testing. Second edition, Cambridge University Press, New York, NY. Cited by: §5.
  • F. Bajaber, R. Elshawi, O. Batarfi, A. Altalhi, A. Barnawi, and S. Sakr (2016) Big Data 2.0 Processing Systems: Taxonomy and Open Challenges. Journal of Grid Computing 14 (3), pp. 379–405. External Links: Document, ISBN 1572-9184 Cited by: §1.
  • Apache Beam (2016) Apache Beam: an advanced unified programming model. External Links: Link Cited by: §1, §4.1, §4.1, §4, §7.
  • S. Brin and L. Page (1998) The anatomy of a large-scale hypertextual web search engine. Computer networks and ISDN systems 30 (1-7), pp. 107–117. Cited by: §3.3, §3.3.
  • S. D. Brookes, C. A. R. Hoare, and A. W. Roscoe (1984) A Theory of Communicating Sequential Processes. J. ACM 31 (3), pp. 560–599. External Links: ISSN 0004-5411, Document Cited by: §6.
  • P. Carbone, S. Ewen, S. Haridi, A. Katsifodimos, V. Markl, and K. Tzoumas (2015) Apache Flink: Stream and Batch Processing in a Single Engine. IEEE Data Engineering Bulletin 38 (4), pp. 28–38. Cited by: §1, §4.1, §4.1, §4, §7.
  • Y. Chen, C. Hong, O. Lengál, S. Mu, N. Sinha, and B. Wang (2017) An Executable Sequential Specification for Spark Aggregation. In Networked Systems, A. El Abbadi and B. Garbinato (Eds.), Cham, pp. 421–438. External Links: ISBN 978-3-319-59647-1 Cited by: §1, §6.
  • S. Chlyah, N. Gesbert, P. Genevès, and N. Layaïda (2019) An Algebra with a Fixpoint Operator for Distributed Data Collections. External Links: Link Cited by: §3.2.2.
  • J. B. de Souza Neto, A. M. Moreira, G. Vargas-Solar, and M. A. Musicante (2020) Modeling big data processing programs. In Formal Methods: Foundations and Applications, G. Carvalho and V. Stolz (Eds.), Cham, pp. 101–118. External Links: ISBN 978-3-030-63882-5 Cited by: An Abstract View of Big Data Processing Programs, §1, §1, §7, An Abstract View of Big Data Processing Programs.
  • J. Dean and S. Ghemawat (2004) MapReduce: Simplified Data Processing on Large Clusters. In OSDI’04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, pp. 137–150. Cited by: §4.1.
  • L. Fegaras (2017) An algebra for distributed Big Data analytics. Journal of Functional Programming 27, pp. e27. External Links: Document Cited by: §1, §2, §2, §2, §2, §3.2.
  • L. Fegaras (2019) Compile-Time Query Optimization for Big Data Analytics. Open Journal of Big Data (OJBD) 5 (1), pp. 35–61. External Links: ISSN 2365-029X, Link Cited by: §1, §2, §3.2.1, §3.2.1, §3.2.2, §3.2.
  • G. E. Gévay, T. Rabl, S. Breß, L. Madai-Tahy, J. Quiané-Ruiz, and V. Markl (2021) Efficient control flow in dataflow systems: when ease-of-use meets high performance. In IEEE 37th International Conference on Data Engineering (ICDE), Cited by: §6.
  • Hadoop (2019) Apache Hadoop Documentation. External Links: Link Cited by: §1, §4.1.
  • T. Hastie, R. Tibshirani, and J. Friedman (2009) The elements of statistical learning: data mining, inference, and prediction. Springer Science & Business Media. Cited by: §3.3.
  • J. Hidders, N. Kwasnikowska, J. Sroka, J. Tyszkiewicz, and J. Van den Bussche (2005) Petri net + nested relational calculus = dataflow. In OTM Confederated International Conferences" On the Move to Meaningful Internet Systems", pp. 220–237. Cited by: §6.
  • J. Hidders, N. Kwasnikowska, J. Sroka, J. Tyszkiewicz, and J. Van den Bussche (2008) DFL: A dataflow language based on Petri nets and nested relational calculus. Information Systems 33 (3), pp. 261–284. Cited by: §6.
  • M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly (2007) Dryad: Distributed Data-parallel Programs from Sequential Building Blocks. In Proceedings of the 2Nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007, EuroSys ’07, New York, NY, USA, pp. 59–72. External Links: ISBN 978-1-59593-636-3, Link, Document Cited by: §1, §4.1, §4.1, §4.
  • R. Ivanovs (2018) External Links: Link Cited by: §4.1.
  • K. Jensen, L. M. Kristensen, and L. Wells (2007) Coloured Petri Nets and CPN Tools for modelling and validation of concurrent systems. International Journal on Software Tools for Technology Transfer 9 (3), pp. 213–254. External Links: ISSN 1433-2787, Document Cited by: §7.
  • E. Jeong, S. Cho, G. Yu, J. S. Jeong, D. Shin, and B. Chun (2019) JANUS: Fast and flexible deep learning via symbolic graph execution of imperative programs. In 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 19), pp. 453–468. Cited by: §6.
  • K. M. Kavi, B. P. Buckles, and N. Bhat (1986) A Formal Definition of Data Flow Graph Models. IEEE Transactions on Computers C-35 (11), pp. 940–948. External Links: Document, ISSN 0018-9340 Cited by: §3.1.
  • K. Kennedy and J. R. Allen (2001) Optimizing compilers for modern architectures: a dependence-based approach. Morgan Kaufmann Publishers Inc.. Cited by: §4.1.
  • E. Lee and D. Messerschmitt (1987) Pipeline interleaved programmable DSP’s: Synchronous data flow programming. IEEE Transactions on acoustics, speech, and signal processing 35 (9), pp. 1334–1345. Cited by: §6.
  • F. Marconi, G. Quattrocchi, L. Baresi, M. M. Bersani, and M. Rossi (2018) On the Timed Analysis of Big-Data Applications. In NASA Formal Methods, A. Dutle, C. Muñoz, and A. Narkawicz (Eds.), Cham, pp. 315–332. External Links: ISBN 978-3-319-77935-5 Cited by: §6.
  • D. Moldovan, J. M. Decker, F. Wang, A. A. Johnson, B. K. Lee, Z. Nado, D. Sculley, T. Rompf, and A. B. Wiltschko (2018) Autograph: imperative-style coding with graph-based performance. arXiv preprint arXiv:1810.08061. Cited by: §6.
  • T. Murata (1989) Petri nets: Properties, analysis and applications. Proceedings of the IEEE 77 (4), pp. 541–580. External Links: Document, ISSN 0018-9219 Cited by: §1, §2, §2, §3.1.
  • K. Ono, Y. Hirai, Y. Tanabe, N. Noda, and M. Hagiya (2011) Using Coq in Specification and Program Extraction of Hadoop MapReduce Applications. In Software Engineering and Formal Methods, G. Barthe, A. Pardo, and G. Schneider (Eds.), Berlin, Heidelberg, pp. 350–365. External Links: ISBN 978-3-642-24690-6 Cited by: §1, §6.
  • C. A. Petri (1962) Kommunikation mit Automaten. Ph.D. Thesis, Universität Hamburg. Note: (In German) Cited by: §2.
  • J. B. Souza Neto (2020) Transformation mutation for Spark programs testing. Ph.D. Thesis, Federal University of Rio Grande do Norte (UFRN), Natal/RN, Brazil. Note: (In Portuguese) Cited by: §5, §5.
  • J. B. Souza Neto, A. Martins Moreira, G. Vargas-Solar, and M. A. Musicante (2020) Mutation Operators for Large Scale Data Processing Programs in Spark. In Advanced Information Systems Engineering, S. Dustdar, E. Yu, C. Salinesi, D. Rieu, and V. Pant (Eds.), Cham, pp. 482–497. External Links: ISBN 978-3-030-49435-3 Cited by: §1, §5, §7, §7.
  • F. Yang, W. Su, H. Zhu, and Q. Li (2010) Formalizing MapReduce with CSP. In 2010 17th IEEE International Conference and Workshops on Engineering of Computer Based Systems, Vol. , pp. 358–367. Cited by: §1, §6, §6.
  • Y. Yu, M. Abadi, P. Barham, E. Brevdo, M. Burrows, A. Davis, J. Dean, S. Ghemawat, T. Harley, P. Hawkins, et al. (2018) Dynamic control flow in large-scale machine learning. In Proceedings of the Thirteenth EuroSys Conference, pp. 1–15. Cited by: §6.
  • Y. Yu, M. Isard, D. Fetterly, M. Budiu, Ú. Erlingsson, P. K. Gunda, and J. Currey (2008) DryadLINQ: A System for General-purpose Distributed Data-parallel Computing Using a High-level Language. In Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation, OSDI’08, Berkeley, CA, USA, pp. 1–14. External Links: Link Cited by: §1, §4.1, §4.1, §4, §7.
  • M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica (2012) Resilient Distributed Datasets: A Fault-tolerant Abstraction for In-memory Cluster Computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, NSDI’12, Berkeley, CA, USA, pp. 2–2. External Links: Link Cited by: item 1, Figure 6, §3.3, §7.
  • M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica (2010) Spark: Cluster Computing with Working Sets. In Proceedings of the 2Nd USENIX Conference on Hot Topics in Cloud Computing, HotCloud’10, Berkeley, CA, USA, pp. 10–10. External Links: Link Cited by: §1, §4.1, §4.1, §4.1, §4.