Translation of Array-Based Loops to Distributed Data-Parallel Programs

Large volumes of data generated by scientific experiments and simulations come in the form of arrays, while the programs that analyze these data are frequently expressed in terms of array operations in an imperative, loop-based language. But as datasets grow larger, new frameworks for distributed Big Data analytics have become essential tools for large-scale scientific computing. Scientists, who are typically comfortable with numerical analysis tools but not with the intricacies of Big Data analytics, must now learn to convert their loop-based programs to distributed data-parallel programs. We present a novel framework for translating programs expressed as array-based loops to distributed data-parallel programs that is more general and efficient than related work. Although our translations are over sparse arrays, we extend our framework to handle packed arrays, such as tiled matrices, without sacrificing performance. We report on a prototype implementation on top of Spark and evaluate the performance of our system relative to hand-written programs.




1. Introduction

Most data used in scientific computing and machine learning come in the form of arrays, such as vectors, matrices, and tensors, while the programs that analyze these data are frequently expressed in terms of array operations in an imperative, loop-based language. These loops are inherently sequential, since they iterate over the collections by accessing their elements randomly, one at a time, using array indexing. Current scientific applications must analyze enormous volumes of array data using complex mathematical data processing methods. As datasets grow larger and data analysis computations become more complex, programs written with array-based loops must be rewritten to run on parallel or distributed architectures. Most scientists, though, are comfortable with numerical analysis tools, such as MatLab, and with imperative languages, such as FORTRAN and C, in which they express their array-based computations using algorithms found in standard data analysis textbooks, but they are not familiar with the intricacies of parallel and distributed computing. Because of the prevalence of array-based programs, a considerable effort has been made to parallelize such loops automatically. Most automated parallelization methods in High Performance Computing (HPC) exploit loop-level parallelism by using multiple threads to access the indexed data in a loop in parallel. But indexed array values that are updated in one loop step may be used in subsequent steps, thus creating loop-carried dependencies, called recurrences. The presence of such dependencies complicates the parallelization of a loop. DOALL parallelization (doall-kavi, ) identifies and parallelizes loops that do not have any recurrences, that is, loops whose statements can be executed independently. Although there is a substantial body of work on automated parallelization on shared-memory architectures in HPC, very little work has been done on applying these techniques to the new emerging distributed systems for Big Data analysis (with the notable exceptions of MOLD (mold:oopsla14, ) and Casper (casper:sigmod18, )).

In recent years, new frameworks in distributed Big Data analytics have become essential tools for large-scale machine learning and scientific discoveries. These systems, also known as Data-Intensive Scalable Computing (DISC) systems, have revolutionized our ability to analyze Big Data. Unlike HPC systems, which are mainly for shared-memory architectures, DISC systems are distributed data-parallel systems that run on clusters of shared-nothing computers connected through a high-speed network. One of the earliest DISC systems is Map-Reduce (dean:osdi04, ), which was introduced by Google and later became popular as open-source software with Apache Hadoop (hadoop, ). For each Map-Reduce job, one needs to provide two functions: a map and a reduce. The map function specifies how to process a single key-value pair to generate a set of intermediate key-value pairs, while the reduce function specifies how to combine all intermediate values associated with the same key. The Map-Reduce framework uses the map function to process the input key-value pairs in parallel by partitioning the data across a number of compute nodes in a cluster. Then, the map results are shuffled across the compute nodes so that values associated with the same key are grouped together and processed by the same compute node. Recent DISC systems, such as Apache Spark (spark, ) and Apache Flink (flink, ), go beyond Map-Reduce by maintaining dataset partitions in the memory of the compute nodes. Essentially, at their core, these systems remain Map-Reduce systems, but they provide rich APIs that implement many complex operations used in data analysis and support libraries for graph analysis and machine learning.
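The map/shuffle/reduce pipeline just described can be sketched as a tiny simulation in plain Python (an illustrative sketch only; the function names and the word-count example are ours, not the Hadoop API):

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    # Map phase: each input record yields zero or more (key, value) pairs.
    pairs = [kv for rec in records for kv in map_fn(rec)]
    # Shuffle phase: group all intermediate values by key, as the framework
    # does when it routes same-key pairs to the same compute node.
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    # Reduce phase: combine the values associated with each key.
    return {k: reduce_fn(k, vs) for k, vs in groups.items()}

# Classic word count: map emits (word, 1), reduce sums the counts.
counts = map_reduce(
    ["a rose is a rose", "is a rose"],
    map_fn=lambda line: [(w, 1) for w in line.split()],
    reduce_fn=lambda k, vs: sum(vs),
)
# counts == {"a": 3, "rose": 3, "is": 2}
```

In a real DISC system the map and reduce phases run partitioned across nodes; only the grouping step requires data exchange.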

The goal of this paper is to design and implement a framework that translates array-based loops to DISC operations. Not only do these generated DISC programs have to be semantically equivalent to their original imperative counterparts, but they must also be nearly as efficient as programs written by hand by an expert in DISC systems. If successful, in addition to parallelizing legacy imperative code, such a translation scheme would offer an alternative and more conventional way of developing new DISC applications.

DISC systems use data shuffling to exchange data among compute nodes; shuffling takes place implicitly between the map and reduce stages in Map-Reduce and during group-bys and joins in Spark and Flink. Essentially, all data exchanges across compute nodes are done in a controlled way using DISC operations, which implement data shuffling by distributing data based on some key, so that data associated with the same key are processed together by the same compute node. Our goal is to leverage this idea of data shuffling by collecting the cumulative effects of updates at each memory location across loop iterations and applying these effects in bulk to all memory locations using DISC operations. This idea was first introduced in MOLD (mold:oopsla14, ), but our goal is to design a general framework that translates loop-based programs using compositional rules that transform programs piece-wise, without having to search for program templates to match (as in MOLD (mold:oopsla14, )) or having to use a program synthesizer (as in Casper (casper:sigmod18, )).

Consider, for example, the incremental update C[A[i].K] += A[i].V in a loop, for a sparse vector C. The cumulative effects of all these updates throughout the loop can be performed in bulk by grouping the values across all loop iterations by the array index (that is, by the different destination locations) and by summing up the values in each group. Then the entire vector can be replaced with these new values. For instance, assuming that the values of C were zero before the loop, the following program

for i = 0, 9 do
    C[A[i].K] += A[i].V

can be evaluated in bulk by grouping the elements of A by the key A[i].K (the group-by key), and summing up all the values associated with each different group-by key. The resulting key-sum pairs are the new values for the vector C. If the sparse vectors A and C are represented as relational tables with schemas (K,V) and (I,V), respectively, then the new values of C can be calculated as follows in SQL:

insert into C select A.K as I, sum(A.V) as V
              from A group by A.K

For example, from the table A on the left we get the table C on the right:

    A (K,V)    C (I,V)
    (3,10)     (3,23)
    (3,13)     (5,25)
    (5,25)

These results are consistent with the outcome of the loop, which can be unrolled to the updates C[3]+=10; C[3]+=13; C[5]+=25.
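In Python terms, the sequential loop and its bulk group-by evaluation compute the same result (an illustrative sketch, with the sparse vectors held as plain lists and dicts):

```python
from itertools import groupby

A = [(3, 10), (3, 13), (5, 25)]  # sparse vector A as (K, V) pairs

# Sequential loop: C[A[i].K] += A[i].V, with C initially zero.
C_loop = {}
for k, v in A:
    C_loop[k] = C_loop.get(k, 0) + v

# Bulk evaluation: group A by its key and sum the values in each group,
# mirroring: insert into C select A.K, sum(A.V) from A group by A.K
# (itertools.groupby requires its input sorted on the grouping key).
C_bulk = {k: sum(v for _, v in grp)
          for k, grp in groupby(sorted(A), key=lambda kv: kv[0])}

assert C_loop == C_bulk == {3: 23, 5: 25}
```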

Instead of SQL, our framework uses monoid comprehensions (jfp17, ), which resemble SQL but have less syntactic sugar and are more concise. Our framework translates the previous loop-based program to the following bulk assignment, which calculates all the values of C using a bag comprehension that returns a bag of index-value pairs:

    C := {{ (k, +/v) | (k,v) <- A, group by k }}

A group-by operation in a comprehension lifts each pattern variable defined before the group-by (except the group-by keys) from some type T to a bag of T, indicating that each such variable must now contain all the values associated with the same group-by key value. Consequently, after we group by k, the variable v is lifted to a bag of values, one bag for each different k. In the comprehension result, the aggregation +/v sums up all the values in the bag v, thus deriving the new value of C for each index k.

A more challenging example, which is used as a running example throughout this paper, is the product of two square matrices M and N, computing R such that R[i,j] = Σ_k M[i,k]*N[k,j]. It can be expressed as follows in a loop-based language:

for i = 0, d-1 do
    for j = 0, d-1 do {
        R[i,j] := 0;
        for k = 0, d-1 do
            R[i,j] += M[i,k]*N[k,j]  }

A sparse matrix M can be represented as a bag of triples (i,j,v) such that v = M[i,j]. This program too can be translated to a single assignment that replaces the entire content of the matrix R with a new content, which is calculated using bulk relational operations. More specifically, if a sparse matrix is implemented as a relational table with schema (I,J,V), the matrix multiplication of the tables M and N can be expressed as follows in SQL:

select M.I, N.J, sum(M.V*N.V) as V
from M join N on M.J=N.I group by M.I, N.J

As in the previous example, instead of SQL, our framework uses a comprehension and translates the loop-based program for matrix multiplication to the following assignment:

    R := {{ (i, j, +/v) | (i,k,m) <- M, (k',j,n) <- N, k' == k,
                          let v = m*n, group by (i,j) }}

Here, the comprehension retrieves the values m = M[i,k] and n = N[k,j] as triples (i,k,m) and (k',j,n) such that k' == k, and sets v = m*n. After we group the values by the matrix indexes i and j, the variable v is lifted to a bag of numerical values m*n, for all k. Hence, the aggregation +/v will sum up all the values in the bag v, deriving Σ_k M[i,k]*N[k,j] for the (i,j) element of the resulting matrix. If we ignore non-shuffling operations, this comprehension is equivalent to a join between M and N followed by a reduceByKey operation in Spark.
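Ignoring distribution, the effect of this join-then-aggregate plan over the triple representation can be sketched in Python (an illustrative sketch; `sparse_matmul` is our own helper, not part of the framework):

```python
from collections import defaultdict

def sparse_matmul(M, N):
    """Multiply sparse matrices given as bags of (i, j, v) triples:
    join M and N on M.J = N.I, then group by (M.I, N.J) and sum."""
    # Join phase: index N by its row so each M entry meets matching N entries.
    N_by_row = defaultdict(list)
    for k, j, v in N:
        N_by_row[k].append((j, v))
    # Group-by/aggregation phase (a reduceByKey, in Spark terms).
    R = defaultdict(float)
    for i, k, m in M:
        for j, n in N_by_row[k]:
            R[(i, j)] += m * n
    return dict(R)

# 2x2 example: M = [[1, 2], [0, 3]], N = [[4, 0], [5, 6]]
M = [(0, 0, 1.0), (0, 1, 2.0), (1, 1, 3.0)]
N = [(0, 0, 4.0), (1, 0, 5.0), (1, 1, 6.0)]
R = sparse_matmul(M, N)
# R == {(0,0): 14.0, (0,1): 12.0, (1,0): 15.0, (1,1): 18.0}
```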

1.1. Highlights of our Approach

Our framework translates a loop-based program in pieces, in a bottom-up fashion over the abstract syntax tree (AST) representation of the program, by translating every AST node to a comprehension. Matrix indexing, such as M[i,j], is translated as follows:

    {{ v | (i',j',v) <- M, i' == i, j' == j }}

If M[i,j] exists, this comprehension returns the singleton bag {{ M[i,j] }}; otherwise, it returns the empty bag. Since a matrix access that normally returns a value of some type T is lifted to a comprehension that returns a bag of T, every term in the loop-based program must be lifted in the same way. For example, an integer multiplication e1*e2 must be lifted to the comprehension {{ x*y | x <- E1, y <- E2 }} over the two bags E1 and E2 (the lifted operands), which returns a bag of integers (the lifted result). Consequently, the term M[i,k]*N[k,j] in matrix multiplication is translated to:

    {{ x*y | x <- {{ v | (i',k',v) <- M, i' == i, k' == k }},
             y <- {{ w | (k'',j',w) <- N, k'' == k, j' == j }} }}

which, after unnesting the nested comprehensions and renaming some variables, is normalized to:

    {{ m*n | (i',k',m) <- M, (k'',j',n) <- N,
             i' == i, k' == k, k'' == k, j' == j }}

which is equivalent to a join between M and N.
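The lifting of array accesses and operators can be illustrated in Python, with bags modeled as lists (all names here are ours, for illustration only):

```python
# Illustrative sketch: a sparse matrix as a map from (i, j) to its value.
M = {(0, 0): 1.0, (0, 1): 2.0}
N = {(1, 1): 6.0, (0, 1): 5.0}

def access(A, i, j):
    # Lifted matrix indexing: a singleton bag if A[i,j] exists, else empty.
    return [A[(i, j)]] if (i, j) in A else []

def lifted_mult(xs, ys):
    # Lifted multiplication: a bag of products over the two lifted operands.
    return [x * y for x in xs for y in ys]

# The term M[0,1] * N[1,1] lifts to a comprehension over two bags:
prod = lifted_mult(access(M, 0, 1), access(N, 1, 1))   # [12.0]
# A missing entry propagates as the empty bag:
miss = lifted_mult(access(M, 1, 0), access(N, 1, 1))   # []
```

The empty-bag case is what makes the lifted semantics behave like a (natural) join: terms over absent entries simply contribute nothing.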

Incremental updates, such as R[i,j] += M[i,k]*N[k,j] in matrix multiplication, accumulate their values across iterations; hence, they must be considered in conjunction with the iterations. Consider the following loop, where f(k), g(k), and h(k) are terms that may depend on k:

for k = 0, 99 do
    M[f(k),g(k)] += h(k)

Suppose now that there are two values, k1 and k2, that have the same image under both f and g, that is, f(k1) = f(k2) and g(k1) = g(k2). Then, h(k1) and h(k2) should be aggregated together. In general, we need to bring together all values h(k) that have the same values for f(k) and g(k). That is, we need to group the values h(k) by f(k) and g(k) and sum up the values in each group. This is accomplished by the comprehension:

    M := {{ (i, j, +/v) | k <- 0..99, let i = f(k), let j = g(k),
                          let v = h(k), group by (i,j) }}

where k <- 0..99 is an iterator that corresponds to the for-loop and the summation +/v sums up all the values h(k) that correspond to the same indexes i and j.
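The equivalence between such a loop and its group-by evaluation can be checked with a small Python sketch (the particular f, g, and h below are arbitrary illustrative choices):

```python
from collections import defaultdict

# Hypothetical index and value functions, for illustration only.
f = lambda k: k % 4
g = lambda k: k % 3
h = lambda k: k * k

# Sequential loop: for k = 0, 99 do M[f(k), g(k)] += h(k)
M_loop = defaultdict(int)
for k in range(100):
    M_loop[(f(k), g(k))] += h(k)

# Bulk evaluation: group all k by the destination (f(k), g(k)) and
# sum h(k) within each group.
groups = defaultdict(list)
for k in range(100):
    groups[(f(k), g(k))].append(h(k))
M_bulk = {dest: sum(vs) for dest, vs in groups.items()}

assert dict(M_loop) == M_bulk
```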

If we apply this method to R[i,j] += M[i,k]*N[k,j], which is embedded in a triple-nested loop, we derive:

    R := {{ (i, j, +/v) | i <- 0..d-1, j <- 0..d-1, k <- 0..d-1,
                          let v = M[i,k]*N[k,j], group by (i,j) }}

After replacing M[i,k]*N[k,j] with its translation and unnesting the nested comprehensions, we get:

    R := {{ (i, j, +/v) | i <- 0..d-1, j <- 0..d-1, k <- 0..d-1,
                          (i',k',m) <- M, (k'',j',n) <- N,
                          i' == i, k' == k, k'' == k, j' == j,
                          let v = m*n, group by (i,j) }}

Joins between a for-loop and a matrix traversal, such as

    i <- 0..d-1, (i',k',m) <- M, i' == i

can be optimized to a matrix traversal, such as

    (i,k',m) <- M, inRange(i, 0, d-1)

where the predicate inRange(i, n, m) returns true if n <= i <= m. Based on this optimization, the previous comprehension becomes:

    R := {{ (i, j, +/v) | (i,k,m) <- M, (k',j,n) <- N, k' == k,
                          let v = m*n, group by (i,j) }}

which is the desired translation of matrix multiplication.

We present a novel framework for translating array-based loops to DISC programs using simple compositional rules that translate these loops piece-wise. Our framework translates an array-based loop to a semantically equivalent DISC program as long as the loop satisfies some simple syntactic restrictions, which are more permissive than the recurrence restrictions imposed by many current systems and can be statically checked at compile-time. For a loop to be parallelizable, many systems require that an array not be both read and updated in the same loop. For example, they reject the update V[i] := V[i-1] inside a loop over i because the vector V is read and updated in the same loop. But they also reject incremental updates, such as V[i] += w, because such an update too reads from and writes to the same vector V. Our framework relaxes these restrictions by accepting incremental updates of the form d ⊕= e in a loop, for some commutative operation ⊕ and for some terms d and e that may contain arbitrary array operations, as long as there are no other recurrences present. It translates such an incremental update to a group-by over the destination index, followed by a reduction of the values in each group using the operation ⊕. The operation ⊕ is required to be commutative because a group-by in a DISC system uses data shuffling across the compute nodes to bring the data that belong to the same group together, which may not preserve the original order of the data. Therefore, a non-commutative reduction may give results that differ from those of the original loop. We have proved the soundness of our framework by showing that our translation rules are meaning preserving for all loop-based programs that satisfy our restrictions. Given that our translation scheme generates DISC operations, this proof implies that loop-based programs that satisfy our restrictions are parallelizable.
Furthermore, the class of loop-based programs that can be handled by our framework is equal to the class of programs expressible in our target language, which consists of comprehensions (i.e., basic SQL), while-loops, and assignments to variables. Some real-world programs that contain irregular loops, such as bubble-sort, which requires swapping vector elements, are rejected.
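A minimal Python illustration of why the reduction operation must be commutative when the shuffle can reorder a group's values:

```python
from functools import reduce

values = [5, 3, 8, 1]            # a group's values in program order
shuffled = [8, 1, 5, 3]          # a possible arrival order after shuffling

# A commutative (and associative) operation is insensitive to arrival order:
add_in_order = reduce(lambda a, b: a + b, values)      # 17
add_shuffled = reduce(lambda a, b: a + b, shuffled)    # 17

# A non-commutative operation is not:
sub_in_order = reduce(lambda a, b: a - b, values)      # 5-3-8-1 = -7
sub_shuffled = reduce(lambda a, b: a - b, shuffled)    # 8-1-5-3 = -1
```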

Compared to the related work (MOLD (mold:oopsla14, ) and Casper (casper:sigmod18, )): 1) Our translation scheme is complete under the given restrictions, as it can correctly translate any program that does not violate these restrictions, while the related systems are far more limited and work on simple loops only; for example, neither of them can translate PageRank or Matrix Factorization. 2) Our translator is faster than the related systems, by orders of magnitude in some cases, since it uses compositional transformations without having to search for templates to apply (as in (mold:oopsla14, )) or use a program synthesizer to explore the space of valid programs (as in (casper:sigmod18, )). 3) Our translations have been formally verified, while Casper needs to call an expensive program validator after each program synthesis. Our system, called DIABLO (a Data-Intensive Array-Based Loop Optimizer), is implemented on top of DIQL (diql:BigData, ; diql, ), a query optimization framework for DISC systems that optimizes SQL-like queries and translates them to Java bytecode at compile-time. Currently, DIABLO has been tested on Spark (spark, ), Flink (flink, ), and Scala's Parallel Collections.

Although our translations are over sparse arrays, our framework can easily handle packed arrays, such as tiled matrices, without any fundamental extension. Essentially, the unpack and pack functions that convert dense array structures to sparse arrays and vice versa are expressed as comprehensions that can be fused with those generated by our framework, thus producing programs that directly access the packed structures without first converting them to sparse arrays. This fusion is hard to achieve in template-based translation systems, such as MOLD (mold:oopsla14, ), which may require different templates for different storage structures. The contributions of this paper are summarized as follows:

  • We present a novel framework for translating array-based loops to distributed data parallel programs that is more general and efficient than related work.

  • We provide simple rules for dependence analysis that detect recurrences across loops that cannot be handled by our framework.

  • We describe how our framework can be extended to handle packed arrays, such as tiled matrices, which can potentially result in better performance.

  • We evaluate the performance of our system relative to hand-written programs on a variety of data analysis and machine learning programs.

This paper is organized as follows. Section 2 reviews related work. Section 3 describes our framework in detail. Section 4 lists some optimizations on comprehensions that are necessary for good performance. Section 5 explains how our framework can be used on densely packed arrays, such as tiled matrices. Finally, Section 6 gives some performance results for some well-known data analysis programs.

2. Related Work

Most work on automated parallelization in HPC is focused on parallelizing loops that contain array scans without recurrences (DOALL loops) and total reductions (aggregations) (fisher:pldi94, ; jiang:pact18, ). As a generalization of these methods, DOACROSS parallelization (doall-kavi, ) separates the loop computations that have no recurrences from the rest of the loop and executes them in parallel, while the rest of the loop is executed sequentially. Other methods that parallelize loops with recurrences simply handle these loops as DOALL computations but perform a run-time dependency analysis to keep track of the dynamic dependencies, and sequentialize some computations if necessary (venkat:sc16, ). Recently, the work by Farzan and Nicolet (farzan:pldi17, ; farzan:pldi19, ) has described loop-to-loop transformations that augment the loop body with extra computations to facilitate parallelization.

Data parallelism is an effective technique for high-level parallel programming in which the same computation is applied to all the elements of a dataset in parallel. Most data-parallel languages limit their support to flat data parallelism, which is not well suited to irregular parallel computations: in flat data-parallel languages, the function applied in parallel over the elements of a dataset must itself be sequential, while in nested data-parallel languages this function too can be parallel. Blelloch and Sabot (nesl, ) developed a framework that supports nested data parallelism using flattening, a technique for converting irregular nested computations into regular computations on flat arrays. These techniques have been extended and implemented in various systems, such as Proteus (palmer:95, ). DISC-based systems do not support nested parallelism because it is hard to implement in a distributed setting. Spark, for example, does not allow nested RDDs and will raise a run-time error if the function of an RDD operation accesses an RDD. The DIQL and DIABLO translators, on the other hand, allow nested data-parallel computations in any form, translating them to flat-parallel DISC operations by flattening comprehensions and by translating nested comprehensions to DISC joins (diql:BigData, ).

The closest work to ours is MOLD (mold:oopsla14, ). To the best of our knowledge, this was the first work to identify the importance of group-by in parallelizing loops with recurrences on a DISC platform. Like our work, MOLD can handle complex indirect array accesses simply using a group-by operation. But, unlike our work, MOLD uses a rewrite system to identify certain code patterns in a loop and translate them to DISC operations. This means that such a system is only as good as its rewrite rules and the heuristic search it uses to apply them. Given that the correctness of its translations depends on the correctness of each rewrite rule, each such rule must be written and formally validated by an expert. Another similar system is Casper (casper:sigmod18, ), which translates sequential Java code into semantically equivalent Map-Reduce programs. It uses a program synthesizer to search over the space of sequential program summaries, expressed as IRs. Unlike MOLD, Casper uses a theorem prover based on Hoare logic to prove that the derived Map-Reduce programs are equivalent to the original sequential programs. Our system differs from both MOLD and Casper in that it translates loops directly to parallel programs using simple meaning-preserving transformations, without having to search for rules to apply. The actual rule-based optimization of our translations is done at a second stage using a small set of rewrite rules, thus separating meaning-preserving translation from optimization.

Another related work on automated parallelization for DISC systems is Map-Reduce program synthesis from input-output examples (smith:pldi16, ), which is based on recent advances in example-directed program synthesis. One important theorem for parallelizing sequential scans is the third homomorphism theorem, which states that any homomorphism (i.e., a parallelizable computation) can be derived from two sequential scans: a foldl that scans the sequence from left to right and a foldr that scans it from right to left. This theorem has been used to parallelize sequential programs expressed as folds (morita:pldi07, ) by heuristically synthesizing a foldr from a foldl first. Along these lines is GRAPE (fan:sigmod17, ), which requires three sequential incremental programs to derive one parallel graph analysis program, although these programs can be quite similar. Lara (lara, ) is a declarative domain-specific language for collections and matrices that allows linear algebra operations on matrices to be mixed with for-comprehensions for collection processing. This deep embedding of matrix and collection operations within the host programming language facilitates better optimization. Although Lara addresses matrix inter-operation optimization, unlike DIABLO, it does not support imperative loops with random matrix indexing. Another area related to automated parallelization for DISC systems is deriving SQL queries from imperative code (emani:sigmod16, ). Unlike our work, this work addresses aggregates, inserts, and appends to lists, but it does not address array updates. Finally, our bulk processing of loop updates resembles the framework described in (guravannavar:vldb08, ), which rewrites a stored procedure to accept a batch of bindings instead of a single binding. That way, multiple calls to a query under different parameters become a single call to a modified query that processes all the parameters in bulk. Unlike our work, which translates imperative loop-based programs on arrays, this framework modifies existing SQL queries and updates.

Much of the data generated by scientific experiments and simulations comes in the form of arrays, such as the results from high-energy physics, cosmology, and climate modeling. Many of these arrays are stored in scientific file formats that are based on array structures, such as CDF (Common Data Format), FITS (Flexible Image Transport System), GRIB (GRid In Binary), NetCDF (Network Common Data Format), and various extensions to HDF (Hierarchical Data Format), such as HDF5 and HDF-EOS (Earth Observing System). Many array-processing systems use special storage techniques, such as regular tiling, to achieve better performance on certain array computations. TileDB (tiledb, ) is an array data storage management system that performs complex analytics on scientific data. It organizes array elements into ordered collections called fragments, where each fragment is dense or sparse, and groups contiguous array elements into data tiles of fixed capacity. Unlike our work, the focus of TileDB is the I/O optimization of array operations by using small block updates to update the array stores. SciDB (scidb:sigmod10, ; scidb:ssdbm15, ) is a large-scale data management system for scientific analysis based on an array data model with implicit ordering. The SciDB storage manager decomposes arrays into a number of equal-sized and potentially overlapping chunks, in a way that allows parallel and pipelined processing of array data. Like SciDB, ArrayStore (arraystore:sigmod11, ) stores arrays into chunks, which are typically the size of a storage block. One of their most effective storage methods is a two-level chunking strategy with regular chunks and regular tiles. SystemML (systemML, ) is an array-based declarative language for expressing large-scale machine learning algorithms, implemented on top of Hadoop. It supports many array operations, such as matrix multiplication, and provides alternative implementations for each of them.
SciHadoop (scihadoop:sc11, ) is a Hadoop plugin that allows scientists to specify logical queries over arrays stored in the NetCDF file format. Its chunking strategy, called the Baseline partitioning strategy, subdivides the logical input into a set of partitions (sub-arrays), one for each physical block of the input file. SciHive (scihive, ) is a scalable array-based query system that enables scientists to process raw array datasets in parallel with a SQL-like query language. SciHive maps array datasets in NetCDF files to Hive tables and executes queries via Map-Reduce. Based on the mapping of array variables to Hive tables, SQL-like queries on arrays are translated to HiveQL queries on tables and are then optimized by the Hive query optimizer. SciMATE (scimate, ) extends the Map-Reduce API to support the processing of the NetCDF and HDF5 scientific formats, in addition to flat files. SciMATE supports various optimizations specific to scientific applications by selecting a small number of attributes used by an application and performing data partitioning based on these attributes. TensorFlow (tensorflow, ) is a dataflow language for machine learning that supports data parallelism on multi-core machines and GPUs but has limited support for distributed computing. Finally, MLlib (MLlib:mlr16, ) is a machine learning library built on top of Spark that includes algorithms for fast matrix manipulation based on native (C++ based) linear algebra libraries. Furthermore, MLlib provides a uniform set of high-level APIs consisting of several statistical, optimization, and linear algebra primitives that can be used as building blocks for data analysis applications.

3. Our Framework


Figure 1. Syntax of loop-based programs

3.1. Syntax of the Loop-Based Language

The syntax of the loop-based language is given in Figure 1. This is a proof-of-concept loop-based language; many other languages, such as Java or C, could be used instead. Types of values include parametric types for various kinds of collections, such as vectors, matrices, key-value maps, bags, and lists. To simplify our translation rules and the examples in this section, we do not allow nested arrays, such as vectors of vectors. There are two kinds of assignments: an incremental update d ⊕= e, for some commutative operation ⊕, which is equivalent to the update d := d ⊕ e, and all other assignments d := e. To simplify translation, variable declarations cannot appear inside for-loops. There are two kinds of for-loops that can be parallelized: a for-loop in which an index variable iterates over a range of integers, and a for-loop in which a variable iterates over the elements of a collection, such as the values of an array. Our current framework generates sequential code from a while-loop. Furthermore, if a for-loop contains a while-loop in its body, then this for-loop too becomes sequential and is treated as a while-loop. Finally, a statement block contains a sequence of statements.

3.2. Restrictions for Parallelization

Our framework can translate for-loops to equivalent DISC programs when these loops satisfy certain restrictions described in this section. In Appendix A, we provide a proof that, under these restrictions, our transformation rules to be presented in Section 3.8 are meaning preserving, that is, the programs generated by our translator are equivalent to the original loop-based programs. In other words, since our target language is translated to DISC operations, the loop-based programs that satisfy our restrictions are parallelizable.

Our restrictions use the following definitions. For any statement s in a loop-based program, we define three sets of L-values (destinations): the readers R(s), the writers W(s), and the aggregators A(s). The readers are the L-values read in s, the writers are the L-values written (but not incremented) in s, and the aggregators are the L-values incremented in s. For example, for the statement C[A[i].K] += A[i].V, where i is a loop index, the aggregators are { C[A[i].K] }, the readers are { A[i].K, A[i].V }, and the writers are the empty set. Two L-values e1 and e2 overlap, denoted by e1 ≈ e2, if they are the same variable, or they are projections e1'.A and e2'.A with e1' ≈ e2', or they are array accesses over the same array name. The context of a statement s, context(s), is the set of outer loop indexes for all loops that enclose s. Note that each for-loop must have a distinct loop index variable; if not, the duplicate loop index is replaced with a fresh variable. For an L-value e, indexes(e) is the set of loop indexes used in e.

An affine expression (aho:book, ) takes the form

    c0 + c1*i1 + ... + cn*in

where i1, ..., in are loop indexes and c0, ..., cn are constants. For an L-value e in a statement s, affine(e) is true if e is a variable, or a projection e'.A with affine(e'), or an array indexing whose indexes are affine expressions that use all the loop indexes in context(s). In other words, if affine(e) is true, then e is stored at a different location for different values of the loop indexes in context(s).

Definition 3.1 (Affine For-Loop).

A for-loop statement s is affine if it satisfies the following properties:

  1. for any non-incremental update d := e in s, affine(d) holds;

  2. there are no dependencies between any two statements s1 and s2 in s, that is, there are no L-values e1 ∈ W(s1) ∪ A(s1) and e2 ∈ R(s2) ∪ W(s2) ∪ A(s2) such that e1 ≈ e2, with the following exceptions:

    a. if e1 ∈ W(s1), e2 ∈ R(s2), e1 = e2, and s1 precedes s2;

    b. if e1 ∈ A(s1), e2 ∈ R(s2), s1 precedes s2, e1 = e2, and context(s1) ∩ context(s2) = indexes(e1).

Restriction 1 indicates that the destination of any non-incremental update must be a different location at each loop iteration. If the update destination is an array access, the array indexes must be affine and must completely cover all surrounding loop indexes. This restriction does not apply to incremental updates, which allow arbitrary array indexes in a destination as long as the array is not read in the same loop. Restriction 2 combined with exception (a) rejects any read and write on the same array in a loop except when the read comes after the write and the read and write are at the same location (e1 = e2), which, by Restriction 1, is a different location at each loop iteration. Exception (b) indicates that if we first increment and then read the same location, then these two operations must not be inside a for-loop whose loop index is not used in the destination. This is because the increment of the destination is completed within the for-loops whose loop indexes are used in the destination and across the rest of the surrounding for-loops. For example, the following loop:

for i = 0, n-1 do {
    for j = 0, n-1 do
        C[i] += M[i,j];
    V[i] := C[i] }

increments and then reads C[i]. The contexts of the first and second updates are {i,j} and {i}, respectively, and their intersection gives {i}, which is equal to the indexes of C[i]. If the statement V[i] := C[i] were instead inside the inner loop, it would violate exception (b), since the context intersection would have been {i,j}, which is not equal to the indexes of C[i].

An affine for-loop satisfies the following theorem, which is proved in Appendix A. It is used as the basis of our program translations.

Theorem 3.2.

An affine for-loop satisfies:


In fact, our restrictions in Definition 3.1 were designed in such a way that all affine for-loops satisfy this theorem and at the same time are inclusive enough to accept as many common loop-based programs as possible. In Appendix A, we prove that our program translations, to be described in Section 3.8, under the restrictions in Definition 3.1 are meaning preserving, which implies that all affine for-loops are parallelizable since the target of our translations is DISC operations.

For example, the incremental update:

which counts all in groups that have the same key , satisfies our restrictions since it increments but does not read . On the other hand, some non-incremental updates may be rejected outright. For example, the loop:

will be rejected by Restriction 2 because is both a reader and a writer. To alleviate this problem, one may rewrite this loop as follows:

which first stores to and then reads to compute . This program satisfies our restrictions but is not equivalent to the original program because it uses the previous values of to compute the new ones. Another example is:

which is also rejected because is not affine as it does not cover the loop indexes (namely, ). To fix this problem, one may redefine as a vector and rewrite the loop as:

Redefining variables by adding to them more array dimensions is currently done manually by a programmer, but we believe that it can be automated when a variable that violates our restrictions is detected.

A more complex example is matrix factorization using gradient descent (koren:comp09). The goal of matrix factorization is to split a matrix R of dimension n×m into two low-rank matrices P and Q of dimensions n×l and l×m, for a small l, such that the error between the predicted matrix P×Q and the original matrix R is below some threshold. One step of matrix factorization that computes the new values P and Q from the previous values P' and Q' can be implemented using the following loop-based program:

for i = 0, n-1 do
   for j = 0, m-1 do {
      pq := 0.0;
      for k = 0, l-1 do
         pq += P'[i,k]*Q'[k,j];
      error := R[i,j]-pq;
      for k = 0, l-1 do {
         P[i,k] += a*(2*error*Q'[k,j]-b*P'[i,k]);
         Q[k,j] += a*(2*error*P'[i,k]-b*Q'[k,j]); }}

where a is the learning rate and b is the normalization factor used to avoid overfitting. This program first computes pq, which is the [i,j] element of the product P'×Q', and error, which is the [i,j] element of the error matrix R−P'×Q'. Then, it uses error to improve P and Q. This program is rejected because the destinations of the assignments pq := 0.0 and error := R[i,j]-pq do not cover all loop indexes, and the read of pq violates exception (b) (since the intersection of the contexts of pq += P'[i,k]*Q'[k,j] and error := R[i,j]-pq is {i,j}, which is not equal to the indexes of pq). To rectify these problems, we can convert the variables pq and error to matrices, so that, instead of pq and error, we use pq[i,j] and error[i,j].
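The rectified program can be sketched in plain Python, with pq and error lifted to full n×m matrices so that every assignment destination covers all surrounding loop indexes. This sequential sketch is only illustrative of the rectified loop, not of the generated data-parallel code; the default values for a and b are our own.

```python
# Sequential sketch of one rectified factorization step. The names n, m, l,
# a, b follow the text; dense list-of-lists matrices stand in for the arrays.
def factorize_step(R, P_prev, Q_prev, a=0.002, b=0.02):  # a, b: illustrative
    n, m = len(R), len(R[0])
    l = len(P_prev[0])
    P = [row[:] for row in P_prev]         # new P starts from the previous P'
    Q = [row[:] for row in Q_prev]
    pq = [[0.0] * m for _ in range(n)]     # pq[i][j]: (i,j) entry of P'Q'
    error = [[0.0] * m for _ in range(n)]  # error[i][j]: R[i][j] - pq[i][j]
    for i in range(n):
        for j in range(m):
            for k in range(l):
                pq[i][j] += P_prev[i][k] * Q_prev[k][j]
            error[i][j] = R[i][j] - pq[i][j]
            for k in range(l):
                P[i][k] += a * (2 * error[i][j] * Q_prev[k][j] - b * P_prev[i][k])
                Q[k][j] += a * (2 * error[i][j] * P_prev[i][k] - b * Q_prev[k][j])
    return P, Q
```

Since pq[i][j] and error[i][j] now cover the loop indexes i and j, each destination is written at a distinct location per iteration, as Restriction 1 requires.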

3.3. Monoid Comprehensions

The target of our translations consists of monoid comprehensions, which are equivalent to the SQL select-from-where-group-by-having syntax. Monoid comprehensions were first introduced and used in the 90’s as a formal basis for ODMG OQL (tods00). They were recently used as the formal calculus for the DISC query languages MRQL (jfp17) and DIQL (diql:BigData). The formal semantics of monoid comprehensions, the query optimization framework, and the translation of comprehensions to a DISC algebra are given in our earlier work (jfp17; diql:BigData). Here, we describe the syntax only.

A monoid comprehension has the following syntax:

where the expression is the comprehension head and a qualifier is defined as follows:

The domain of a generator must be a bag. This generator draws elements from this bag and, each time, it binds the pattern to an element. A condition qualifier is an expression of type boolean. It is used for filtering out elements drawn by the generators. A let-binding binds the pattern to the result of . A group-by qualifier uses a pattern and an optional expression . If is missing, it is taken to be . The group-by operation groups all the pattern variables in the same comprehension that are defined before the group-by (except the variables in ) by the value of (the group-by key), so that all variable bindings that result in the same key value are grouped together. After the group-by, is bound to a group-by key and each one of these pattern variables is lifted to a bag of values. The result of a comprehension is a bag that contains all values of derived from the variable bindings in the qualifiers.
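As a plain-Python illustration of these semantics (a sketch, not the comprehension calculus itself): grouping a bag of key-value pairs by the key lifts the value variable to the bag of all values in its group, which can then be aggregated per group.

```python
# Sketch of group-by semantics: after grouping (k, v) pairs by the key k,
# the variable v is lifted to the bag of all v's bound in that group.
from collections import defaultdict

def group_by(pairs):
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)   # each binding of v joins the bag for its key
    return list(groups.items())

# e.g., per-group aggregation, as in: select (k, sum(v)) ... group by k
pairs = [("a", 1), ("b", 2), ("a", 3)]
result = [(k, sum(vs)) for k, vs in group_by(pairs)]
```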

Comprehensions can be translated to algebraic operations that resemble the bulk operations supported by many DISC systems, such as groupBy, join, map, and flatMap. We use to represent the sequence of qualifiers , for . To translate a comprehension to the algebra, the group-by qualifiers are first translated to groupBy operations from left to right. Given a bag of type , groups the elements of by their first component of type (the group-by key) and returns a bag of type . Let be the pattern variables in the sequence of qualifiers that do not appear in the group-by pattern , then we have:

That is, for each pattern variable , this rule embeds a let-binding so that this variable is lifted to a bag that contains all values in the current group. Then, comprehensions without any group-by are translated to the algebra by translating the qualifiers from left to right:

Given a function that maps an element of type to a bag of type and a bag of type , the operation maps the bag to a bag of type by applying the function to each element of and unioning together the results. Although this translation generates nested flatMaps from join-like comprehensions, there is a general method for identifying all possible equi-joins from nested flatMaps, including joins across deeply nested comprehensions, and translating them to joins and coGroups (jfp17).
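A minimal Python sketch of this operation, modeling bags as lists (the function name flat_map is ours): the nested call mimics how a comprehension with two generators and an equality condition unfolds into nested flatMaps before join detection.

```python
# flat_map(f, s) applies f to each element of s and unions the results,
# as used to translate comprehension generators to the algebra.
def flat_map(f, s):
    return [y for x in s for y in f(x)]

# A join-like comprehension [ (x, y) | x <- xs, y <- ys, x == y ] becomes
# nested flatMaps with the condition as a filter in the innermost function:
xs, ys = [1, 2, 3], [2, 3, 4]
joined = flat_map(lambda x: flat_map(lambda y: [(x, y)] if x == y else [], ys),
                  xs)
```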

Finally, nested comprehensions can be unnested by the following rule:


for any sequence of qualifiers , , and . This rule can only apply if there is no group-by qualifier in or when is empty. It may require renaming the variables in to prevent variable capture.

3.4. Array Representation

In our framework, a sparse array, such as a sparse vector or a matrix, is represented as a key-value map (also known as an indexed set), which is a bag of type , where is the array index type and is the array value type. More specifically, a sparse vector of type is captured as a key-value map of type , while a sparse matrix of type is captured as a key-value map of type .

Merging two compatible arrays is done with the array merging operation , defined as follows:

where returns the keys of . That is, is the union of and , except when there is and , in which case it chooses the latter value, . For example, is equal to . On Spark, the operation can be implemented as a coGroup.
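Modeling key-value maps as Python dicts, the merge operation can be sketched as follows (the function name merge is ours, not from the text); on Spark, the same effect is achieved with a coGroup, as noted above.

```python
# Sketch of array merging: the union of maps x and y where, on a common
# key, the value from y (the update) wins.
def merge(x, y):
    result = dict(x)
    result.update(y)   # keys present in both take y's value
    return result

V = {0: 1.0, 1: 2.0}
U = {1: 9.0, 2: 3.0}
merged = merge(V, U)
```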

An update to a vector is equivalent to the assignment . That is, the new value of is the current vector but with the value associated with the index (if any) replaced with . Similarly, an update to a matrix is equivalent to the assignment .

Array indexing, though, is a bit more complex because the indexed element may not exist in the sparse array. Instead of a value of type , indexing over an array of should return a bag of type , which can be for some value of type , if the value exists, or , if the value does not exist. Then, the vector indexing is , which returns a bag of type . Similarly, the matrix indexing is .
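Under the same dict model, indexing that returns a bag can be sketched as a list of zero or one elements (the helper name index is ours):

```python
# Sketch of lifted indexing: a singleton bag with the stored value when the
# entry exists, and the empty bag when it does not.
def index(array, i):
    return [array[i]] if i in array else []

V = {0: 3.0, 5: 7.5}
present = index(V, 5)   # singleton bag
absent = index(V, 2)    # empty bag
```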

We are now ready to express any assignment that involves vectors and matrices. For example, consider the matrices , , and of type matrix[float]. The assignment:


is translated to the assignment:


which uses a bag comprehension equivalent to a join between the matrices and . This assignment can be derived from assignment (3) using simple transformations. To understand these transformations, consider the product . Since both and have been lifted to bags, because they may contain array accesses, this product must also be lifted to a comprehension that extracts the values of and , if any, and returns their product:

Given that matrix accesses are expressed as:

the product is equal to:

which is normalized as follows using Rule (2), after some variable renaming:

Lastly, since the value of in the assignment is lifted to a bag, this assignment is translated to , that is, is augmented with an indexed set that results from accessing the lifted value of . If contains a value, the comprehension will return a singleton bag, which will replace with that value. After substituting the value with the term derived for , we get an assignment equivalent to the assignment (4).
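As a concrete illustration, assuming the assignment is an element-wise product of two sparse matrices M and N stored as dicts (the names and the element-wise reading are our assumptions), the derived comprehension behaves like a join on the common key (i, j) followed by a merge into R:

```python
# Sketch of the join-like comprehension: keep only keys present in both M
# and N (the join condition), multiply the paired values, and merge the
# resulting indexed set into R.
def elementwise_product(R, M, N):
    update = {k: m * N[k] for k, m in M.items() if k in N}  # join on (i, j)
    out = dict(R)
    out.update(update)
    return out

R = {(0, 0): 9.0, (1, 1): 9.0}
M = {(0, 0): 2.0, (0, 1): 3.0}
N = {(0, 0): 4.0, (1, 0): 5.0}
R2 = elementwise_product(R, M, N)
```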

3.5. Handling Array Updates in a Loop

We now address the problem of translating array updates in a loop. We classify updates into two categories:

  1. Incremental updates of the form , for some commutative operation , where is an update destination, which is also repeated as the left operand of . It can also be written as . For example, increments by 1.

  2. All other updates of the form .

Consider the following loop with a non-incremental update:


for some vectors and , and some terms and that depend on the index . Our framework translates this loop to an update to the vector , where all the elements of are updated at once, in a parallel fashion:


But this expression may not produce the same vector as the original loop if there are recurrences in the loop, such as when the loop body is . Furthermore, the join between range and in (6) looks unnecessary. We will transform such joins to array traversals in Section 3.6.

In our framework, for-loops are embedded as generators inside the comprehensions that are associated with the loop assignments. Consider, for example, matrix copying:

Using the translation of the assignment , the loop becomes:


To parallelize this loop, we embed the for-loops inside the comprehension as generators:


Notice the difference between the loop (7) and the assignment (8). The former will do 10*20 updates to while the latter will only do one bulk update that will replace all with at once. This transformation can only apply when there are no recurrences across iterations.
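The effect of the bulk update can be sketched in Python over the dict model of sparse matrices, with bounds 10 and 20 as in the text (the helper name bulk_copy is ours):

```python
# Sketch of the matrix-copy transformation: instead of 10*20 point updates
# M[i,j] := N[i,j], build one indexed set of all copied entries and merge
# it into M with a single bulk update.
def bulk_copy(M, N, n=10, m=20):
    update = {(i, j): v
              for i in range(n) for j in range(m)
              for v in ([N[(i, j)]] if (i, j) in N else [])}
    out = dict(M)
    out.update(update)   # one merge replaces all copied entries at once
    return out

M = {(0, 0): 1.0, (5, 5): 4.0}
N = {(0, 0): 5.0, (1, 1): 2.0}
copied = bulk_copy(M, N)
```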

3.6. Eliminating Loop Iterations

Before we present the details of program translation, we address the problem of eliminating index iterations, such as in assignment (6), and range(1, 10) and range(1, 20) in assignment (8). If there is a right inverse of such that , then the assignment (6) is optimized to:


where the predicate returns true if is within the range . Given that the right-hand side of an update may involve multiple array accesses, we can choose one whose index term can be inverted. For example, for , the inverse of is . In the case where no such inverse can be derived, the range iteration simply remains as is. One such example is the loop , which is translated to .
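A Python sketch of this optimization for a hypothetical loop `for i = 1, n-1 do V[i-1] := W[i]` (our example, not from the text): the destination index i-1 has the right inverse j ↦ j+1, so instead of iterating over the range we traverse W directly and keep only the entries whose index falls within the original loop bounds.

```python
# Sketch of range elimination via an inverted index term, over the dict
# model of sparse vectors.
def translate(V, W, n):
    def inrange(i):
        return 1 <= i <= n - 1          # the original loop bounds on i
    update = {i - 1: w for i, w in W.items() if inrange(i)}
    out = dict(V)
    out.update(update)                   # one bulk merge into V
    return out

V = {0: 1.0, 5: 5.0}
W = {1: 10.0, 3: 30.0, 9: 90.0}
result = translate(V, W, 5)              # entry at index 9 is out of range
```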

3.7. Handling Incremental Updates

There is an important class of recurrences in loops that can be parallelized using group-by and aggregation. Consider, for example, the following loop with an incremental update:


Let’s say, for example, that there are 3 indexes overall, , , and , that have the same image under , i.e., . Then, must be set to . In general, we need to bring together all values of whose indexes have the same image under . That is, we need to group by . Hence, the loop can be translated to a comprehension with a group-by:
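The effect of this translation can be sketched in plain Python as a sequential stand-in for the parallel group-by-and-sum (the index function e1 and value function e2 below are illustrative choices, not from the text):

```python
# Sketch of translating the incremental update
#   for i = 0, n-1 do V[e1(i)] += e2(i)
# into a group-by with aggregation: group the values e2(i) by the key
# e1(i), sum each group, and merge the per-key sums into V.
from collections import defaultdict

def incremental_update(V, n, e1, e2):
    sums = defaultdict(float)
    for i in range(n):
        sums[e1(i)] += e2(i)   # all i with the same image under e1 grouped
    out = dict(V)
    for k, s in sums.items():
        out[k] = out.get(k, 0.0) + s
    return out

V = {0: 100.0}
result = incremental_update(V, 9, lambda i: i % 3, lambda i: float(i))
```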