I. Introduction
Modern sequencing technologies generate a very large number of short DNA segments called reads, which are stitched together to recover longer DNA sequences and, ultimately, new genomes. Although millions of reads can be generated in a day to allow high sequencing coverage, the assembly process becomes highly computation-intensive. Single-threaded assemblers often require a high-end server with terabytes of RAM, and are still not efficient enough. As a result, parallel short-read assembly has attracted much attention recently thanks to the advances in big data systems. Many parallel (and often distributed) assemblers have emerged, including ABySS [17], Spaler [1], Ray [2] and SWAP-Assembler [12].
Instead of providing yet another parallel assembler, we developed a toolkit called PPAassembler (https://github.com/yaobaiwei/PPAAssembler), which implements the basic data structures and operations in genome assembly. The operations have strong performance guarantees, and can be composed to implement various sequencing strategies. Each operation may either read its input from the Hadoop Distributed File System (HDFS; Hadoop: http://hadoop.apache.org/), or directly obtain its input by converting the output of another operation in memory. As a result, PPAassembler can interoperate with existing big data platforms such as Hadoop and Spark [21], and intermediate results between consecutive jobs do not have to go through HDFS.
PPAassembler adopts the popular de Bruijn graph (DBG) based approach for sequencing [13]. Therefore, we built it on top of Pregel+ (http://www.cse.cuhk.edu.hk/pregelplus/), our open-source implementation of Google's Pregel framework for big graph processing. We remark that our solution is applicable to any Pregel-like system [4, 14, 6]; Pregel+ is adopted mainly due to its superior performance as reported by [10] and its wide application [5, 19, 18]. Since the assembly process also involves some non-graph operations, such as constructing the DBG from raw DNA reads, we also extended Pregel+'s API with new functionalities, including grouping and merging data by key, and in-memory data conversion for seamless job concatenation.
We summarized the key operations of existing assemblers, such as contig merging, tip removing and bubble filtering, and implemented them in PPAassembler as scalable Pregel programs. Specifically, each operation is implemented as a Practical Pregel Algorithm (PPA) as defined in [19], which runs for at most a logarithmic number of iterations (in the DBG size), where each iteration has linear space usage, computation cost and communication cost. Users may also add new operations to PPAassembler by implementing them with the user-friendly Pregel API.
Unlike ABySS [17], Ray [2] and SWAP-Assembler [12], PPAassembler decouples low-level execution (e.g., data distribution and communication) from the high-level assembly logic, allowing both layers to be optimized independently. For example, as the Implementation section of [17] indicates, ABySS needs to collect messages into larger 1KB packets for batch transmission, in order to hide the round-trip time of individual messages; such communication details are automatically taken care of and optimized by a Pregel-like system. Moreover, PPAassembler is compatible with existing big data platforms and can interoperate with other systems that perform various sequence mining and analytics tasks.
Although Spaler [1] is built on top of Spark to be compatible with existing big data platforms, its algorithms are rather ad hoc: they only demonstrate how genome assembly operations can be mapped to the Spark API, without any formal analysis of the computation complexity. Moreover, most operations in DBG-based sequencing are graph operations, for which Spaler [1] uses GraphX (Spark's graph API), which is often over one order of magnitude slower than tailor-made Pregel-like systems [20, 3]. PPAassembler adopts the efficient Pregel+ system to process the dominating graph operations in DBG-based sequencing, and extends the API to conveniently and efficiently support the non-graph operations required during assembly. We compared PPAassembler with other existing distributed assemblers on large simulated and real datasets, and the results show that PPAassembler significantly beats all other assemblers in execution time, while providing comparable sequencing quality.
The rest of this paper is organized as follows. Section II reviews the framework of Pregel, and the definition of PPA. Section III provides the necessary concepts in genome assembly for readers without bioinformatics background. Section IV presents the implementation of our various operations in PPAassembler. Finally, we report the experimental results in Section V, and conclude this paper in Section VI.
II. Pregel Review
For ease of presentation, we first define our graph notations. Given a graph G = (V, E), we denote the number of vertices by n = |V|, and the number of edges by m = |E|. We also denote the diameter of G by δ. If G is undirected, we denote v's neighbors by Γ(v) and v's degree by d(v). If G is directed, we denote v's in-neighbors (resp. out-neighbors) by Γ_in(v) (resp. Γ_out(v)) and v's in-degree (resp. out-degree) by d_in(v) (resp. d_out(v)). We denote the ID of a vertex v by id(v), and use v and id(v) interchangeably.
Computation Model. Pregel [11] distributes vertices to different machines in a cluster, where each vertex v is associated with its adjacency list (e.g., Γ(v)) and its attribute a(v). A program in Pregel implements a user-defined compute(.) function and proceeds in iterations (called supersteps). In each superstep, each active vertex v calls compute(msgs), where msgs is the set of incoming messages sent to v in the previous superstep. In v.compute(msgs), v may process msgs and update a(v), send new messages to other vertices, and vote to halt (i.e., deactivate itself). A halted vertex is reactivated if it receives a message in a subsequent superstep. The program terminates when all vertices are inactive and there is no pending message for the next superstep. Finally, the results (e.g., a(v)) are dumped to HDFS.
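To make the model concrete, below is a minimal single-process sketch of Pregel's vertex-centric semantics (vote-to-halt, message-driven reactivation). It is an illustration only, not Pregel+'s actual C++ API; real systems distribute the vertices across machines.

```python
# Minimal sketch of Pregel's superstep loop. compute(vid, state, msgs, send)
# returns True when the vertex votes to halt; a halted vertex is reactivated
# whenever it receives a message.
def run_pregel(vertices, compute):
    inbox = {vid: [] for vid in vertices}
    active = set(vertices)
    superstep = 0
    while active or any(inbox.values()):
        outbox = {vid: [] for vid in vertices}
        def send(dst, msg):
            outbox[dst].append(msg)
        for vid in vertices:
            if vid in active or inbox[vid]:       # active, or has messages
                halted = compute(vid, vertices[vid], inbox[vid], send)
                if halted:
                    active.discard(vid)           # vote to halt
                else:
                    active.add(vid)
        inbox = outbox                            # synchronous message delivery
        superstep += 1
    return superstep

# usage: every vertex floods its value; all values converge to the maximum
graph = {1: {"nbrs": [2], "val": 5},
         2: {"nbrs": [1, 3], "val": 9},
         3: {"nbrs": [2], "val": 1}}

def compute(vid, state, msgs, send):
    new_val = max([state["val"]] + msgs)
    if new_val > state["val"] or not state.get("sent"):
        state["val"] = new_val
        state["sent"] = True
        for u in state["nbrs"]:
            send(u, new_val)
        return False
    return True                                   # vote to halt

steps = run_pregel(graph, compute)                # all vals become 9
```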
Pregel numbers the supersteps, so that a user may access the current superstep number in compute(.) to decide the proper behavior. Pregel also supports aggregators, a mechanism for global communication. Each vertex can provide a value to an aggregator in compute(.) in a superstep. The system aggregates these values and makes the aggregated result available to all vertices in the next superstep.
Our Extensions to the Pregel API. We find the following two API extensions useful in implementing PPAassembler. Firstly, for two consecutive jobs J1 and J2, we allow J2 to directly obtain its input from the output of J1 in memory. In contrast, existing Pregel-like systems require J1 to first dump its output to HDFS, which is then loaded again by J2. Let the vertex class of job J1 (resp. J2) be V1 (resp. V2). To enable direct in-memory input, users define a user-defined function (UDF) convert(v), which indicates how to transform an object of class V1 (processed by J1) into (zero or more) input objects of class V2 (for job J2). After job J1 finishes, each machine generates a set of objects of type V2 by calling convert(.) on its assigned vertices of type V1 (which are then garbage collected). Since Pregel+ distributes vertices to machines by hashing vertex ID, the generated objects of type V2 are then shuffled according to their vertex IDs, before running job J2.
Secondly, the input data may not be in the format of one line per vertex. For example, each line may correspond to one edge, so the adjacency list of a vertex has to be assembled from multiple lines. To create vertices from such input data, we support a mini-MapReduce procedure during graph loading. Specifically, each line may generate (zero or more) key-value pairs (using a UDF map(.)), where the key is a vertex ID; these key-value pairs are then shuffled according to vertex ID. After each machine receives its assigned key-value pairs, the pairs are sorted by key, so that all pairs with the same key form a group. Finally, each group is processed (using a UDF reduce(.)) to create the input vertex object for the vertex whose ID equals the group's key.
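The loading procedure above can be sketched as follows; the UDF names `map_line` and `reduce_group` are illustrative stand-ins for the map(.)/reduce(.) UDFs, and the shuffle is simulated by a dictionary rather than network hashing.

```python
# Sketch of the mini-MapReduce graph-loading step: map() turns each input
# line into (vertexID, value) pairs, pairs are grouped by vertex ID, and
# reduce() builds one vertex object per group.
from collections import defaultdict

def map_line(line):                      # UDF map(.): one undirected edge per line
    u, v = line.split()
    return [(u, v), (v, u)]              # emit the edge under both endpoints

def reduce_group(key, values):           # UDF reduce(.): build the vertex object
    return {"id": key, "nbrs": sorted(values)}

def load_graph(lines):
    groups = defaultdict(list)
    for line in lines:                   # map + shuffle (hash by vertex ID)
        for k, v in map_line(line):
            groups[k].append(v)
    return {k: reduce_group(k, vs) for k, vs in groups.items()}

vertices = load_graph(["a b", "b c"])
# vertex "b" gets its adjacency list assembled from two different lines
```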
Practical Pregel Algorithm (PPA). Our prior work [19] defined a class of scalable Pregel algorithms called PPAs, and designed PPAs for many fundamental graph problems. These PPAs can be used as building blocks to design PPAs for more sophisticated graph problems, such as the DBG-based sequencing studied in this paper. Formally, a Pregel algorithm is called a balanced practical Pregel algorithm (BPPA) if it satisfies the following constraints:

1. Linear space usage: each vertex v uses O(d(v)) (or O(d_in(v) + d_out(v))) space of storage.

2. Linear computation cost: the time complexity of the compute(.) function for each vertex v is O(d(v)) (or O(d_in(v) + d_out(v))).

3. Linear communication cost: at each superstep, the size of the messages sent/received by each vertex v is O(d(v)) (or O(d_in(v) + d_out(v))).

4. At most logarithmic number of rounds: the algorithm terminates after O(log n) supersteps.
Constraints 1-3 offer good load balancing and linear cost at each superstep, while Constraint 4 bounds the total running time. Note that Constraint 4 also covers algorithms that run for a constant number of supersteps. For some problems, the per-vertex requirements of BPPA can be too strict, and we can only achieve overall linear space usage, computation and communication cost (still in O(log n) rounds). We call a Pregel algorithm that satisfies these relaxed constraints simply a practical Pregel algorithm (PPA). The workload skewness problem can be addressed using the request-respond API of Pregel+ [18]. We now review two PPAs proposed in [19], both of which will be used by PPAassembler for finding contigs in Section IV-B.
BPPA for List Ranking. Consider a linked list L with n vertices, where each vertex v keeps a value val(v) and a link pred(v) to its predecessor. The vertex at the head of L has pred(v) = null. For each vertex v in L, let us define sum(v) to be the sum of the values of all the vertices from v following the predecessor links to the head. The list ranking problem computes sum(v) for every vertex v in L, where the vertices are stored on HDFS in arbitrary order.
The BPPA for list ranking works as follows. Each vertex v initializes sum(v) = val(v). Then in each round, each vertex v does the following in compute(.): if pred(v) ≠ null, v sets sum(v) = sum(v) + sum(pred(v)) and pred(v) = pred(pred(v)); otherwise, v votes to halt. Note that to perform these updates, v needs to first request its predecessor pred(v) for sum(pred(v)) and pred(pred(v)), which takes another superstep. This process repeats until pred(v) = null for every vertex v, at which point all vertices vote to halt and we obtain sum(v) as desired.
Figure 1 illustrates how the algorithm works. Initially, the objects form a linked list where each vertex v holds its value val(v) and its predecessor link pred(v). In each round, every vertex v with pred(v) ≠ null adds sum(pred(v)) into sum(v) and jumps its predecessor link to pred(pred(v)). The number of vertices whose values get summed into sum(v) is doubled after each round, and thus the algorithm terminates in O(log n) rounds.
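The pointer-jumping rounds above can be sketched as a synchronous simulation; each simulated round stands for the constant number of Pregel supersteps needed to query the predecessor.

```python
# Sketch of the list-ranking BPPA: each round, every vertex adds its
# predecessor's running sum and jumps its pred link two hops back.
def list_ranking(val, pred):
    """val[v], pred[v] for each vertex v; pred of the head is None."""
    summ = dict(val)                             # sum(v) initialized to val(v)
    rounds = 0
    while any(pred[v] is not None for v in pred):
        new_sum, new_pred = dict(summ), dict(pred)
        for v in pred:
            p = pred[v]
            if p is not None:
                new_sum[v] = summ[v] + summ[p]   # sum(v) += sum(pred(v))
                new_pred[v] = pred[p]            # pred(v) = pred(pred(v))
        summ, pred, rounds = new_sum, new_pred, rounds + 1
    return summ, rounds

# chain 0 <- 1 <- 2 <- 3 with all values 1: sum(v) = v + 1
val = {v: 1 for v in range(4)}
pred = {0: None, 1: 0, 2: 1, 3: 2}
sums, rounds = list_ranking(val, pred)
```

Note that the doubling behavior shows up directly: a chain of four vertices finishes in two rounds rather than three.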
Simplified SV Algorithm. The SV algorithm was proposed in [19] for computing the connected components (CCs) of a big undirected graph in O(log n) supersteps, by adapting Shiloach-Vishkin's PRAM algorithm [16] to run in Pregel. In the SV algorithm, each round of computation requires three operations: tree hooking, star hooking, and shortcutting. However, we find that star hooking is actually an artifact required by the original Shiloach-Vishkin algorithm for correct termination in the PRAM setting. Here, we propose a simplified version of the SV algorithm that does not require star hooking; it is more efficient since the expensive check of whether a vertex is in a star (i.e., a tree with height 1), required by the original SV algorithm, is eliminated.
Throughout this algorithm, the vertices are organized into a forest such that all vertices in a tree belong to the same CC. Each vertex v maintains a link D[v] to its parent in the forest. We relax the tree definition a bit here to allow a tree root r to have a self-loop (i.e., D[r] = r).
At the beginning, each vertex v initializes D[v] = v, forming a self-loop as shown in Figure 2(a). Then, the algorithm proceeds in rounds, and in each round, the parent links are updated in two steps: (1) tree hooking (see Figure 2(b)): for each edge (u, v), if u's parent w = D[u] is a tree root and D[v] < w, we hook w as a child of v's parent (i.e., we set D[w] = D[v], merging the tree rooted at w into v's tree); (2) shortcutting (see Figure 2(c)): for each vertex v, we move it closer to the tree root by linking v to the parent of v's parent, i.e., we set D[v] = D[D[v]]. Note that Step 2 has no impact on v if v is a root or a child of a root.
The algorithm repeats these two steps until no vertex updates D[v] in a round (checked by using an aggregator), by which time every vertex is in a star, and each star corresponds to a CC. Since D[v] monotonically decreases during the computation, at the end D[v] equals the smallest vertex in v's CC (which is also the root of v's star). In other words, all vertices with the same value of D[v] constitute a CC. Since each round can be formulated in Pregel as a constant number of supersteps, and shortcutting guarantees an O(log n) round bound, the algorithm is a PPA.
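The two steps can be sketched as a sequential simulation. Following the monotonicity claim above, hooking is performed only towards a smaller parent ID, so D[v] never increases.

```python
# Sketch of the simplified SV algorithm: tree hooking (towards a smaller
# parent ID, keeping D monotonically decreasing) followed by shortcutting,
# repeated until no parent pointer changes. Vertex IDs must be comparable.
def connected_components(vertices, edges):
    D = {v: v for v in vertices}                 # every vertex starts as a root
    changed = True
    while changed:
        changed = False
        # tree hooking: treat each undirected edge in both directions
        for u, v in list(edges) + [(b, a) for a, b in edges]:
            w = D[u]
            if D[w] == w and D[v] < w:           # D[u] is a root; hook it lower
                D[w] = D[v]
                changed = True
        for v in vertices:                       # shortcutting: D[v] = D[D[v]]
            if D[D[v]] != D[v]:
                D[v] = D[D[v]]
                changed = True
    return D                                     # D[v] = smallest vertex in v's CC

D = connected_components([1, 2, 3, 4, 5], [(1, 2), (2, 3), (4, 5)])
# two components: {1, 2, 3} labeled 1, and {4, 5} labeled 4
```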
III. De Novo Genome Assembly
This section provides the necessary concepts in DBG-based sequencing for readers without bioinformatics background.
We model a DNA molecule as a very long sequence of nucleotides, where each nucleotide can take one of the four base types A, C, G and T. Since sequencing long DNA segments is errorprone, modern sequencing technologies generate a large number of short DNA segments, called reads. Figure 3 illustrates this process, where 4 DNA clones are sheared into 6 reads. Note that a DNA molecule consists of two strands coiled around each other, and we only consider strand 1 in Figure 3 for simplicity. Reads can have variable lengths, and sequencing errors may happen at some positions such as in reads 1⃝ and 5⃝ in Figure 3 (errors highlighted in red). Also, reads may overlap with each other, such as reads 2⃝ and 4⃝ in Figure 3 that share the segment “AGT”. It is through these overlaps that genome assembly algorithms stitch reads to get longer sequences (called contigs) or even the whole sequence.
De Bruijn Graph & k-mer. The DBG-based assembly approach first constructs a de Bruijn graph (DBG) from the reads, and then finds contigs from the DBG. To construct a DBG, each read is cut into consecutive subsequences of length k, where each subsequence is called a k-mer. For example, Figure 4 illustrates how we can generate 3-mers from reads 3⃝, 4⃝ and 6⃝ of Figure 3 (we consider k = 3 here), where read 3⃝ “ATTG” can be cut into two 3-mers “ATT” and “TTG”. For each k-mer, we define its prefix (resp. suffix) as the subsequence without the last (resp. first) nucleotide, which is a (k-1)-mer. The (k-1)-mers define the vertices in the DBG, and each k-mer defines an edge from its prefix to its suffix in the DBG. For example, in Figure 4, the first 3-mer of read 3⃝, i.e., “ATT”, defines a directed edge in the DBG from vertex “AT” to vertex “TT”. From all the 3-mers in Figure 4, we can create a path as shown at the top right corner of Figure 4, which essentially stitches reads 3⃝, 4⃝ and 6⃝ together into a longer contig “ATTGCAAGT”.
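The construction just described can be sketched in a few lines (strand directionality is ignored here; it is handled later in this section):

```python
# Sketch of DBG construction: cut each read into k-mers; each (k-1)-mer
# is a vertex and each k-mer is an edge from its prefix to its suffix.
def build_dbg(reads, k):
    edges = []
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges.append((kmer[:-1], kmer[1:]))   # prefix -> suffix
    return edges

# reads 3, 4 and 6 of Figure 3, with k = 3
edges = build_dbg(["ATTG", "TGCAA", "AAGT"], 3)
# the 7 edges chain AT->TT->TG->GC->CA->AA->AG->GT, spelling "ATTGCAAGT"
```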
Ideally, we would like to choose k to be large enough so that any subsequence of length (k-1) in the whole DNA sequence appears only once, i.e., any (k-1)-mer vertex of the DBG corresponds to a unique position in the whole sequence. If this is the case (an extreme case is when we set k equal to the length of the whole sequence), the DBG is essentially a path like in Figure 4, following which we can reconstruct the whole sequence. Also, since we are essentially hashing the (k-1)-mers contained in the whole sequence to 4^(k-1) possible values (the base 4 is because each nucleotide can take four values), a larger k reduces the chance of having two DNA subsequences of length (k-1) with the same content. However, we cannot set k arbitrarily large, since reads are short and any read with length less than k cannot contain any k-mer and will be ignored. We should set k so as to use the majority of the reads when constructing the DBG, but then a (k-1)-mer vertex in the DBG may correspond to several different positions in the whole sequence. We call such a vertex ambiguous. Clearly, the whole sequence corresponds to an Eulerian path of the DBG (not considering long repetitive sequence patterns that cause cycles), but there can be many Eulerian paths in the DBG. Thus, the goal of DBG-based sequencing is to find the maximal simple paths in the DBG that do not contain any ambiguous vertex; these paths constitute the contigs, i.e., longer subsequences of the whole sequence.
Error Correction. The above sequencing method does not consider the possibility that reads may contain errors. Read errors complicate the assembly process by introducing false vertices and edges into the DBG. Two typical error patterns are tips and bubbles, as illustrated in Figure 5, which shows the DBG constructed from the reads of Figure 3. A tip is a short dangling path in the DBG that leads to a dead end, such as edge “TG”→“GA” in Figure 5, which is contributed by the error in read 1⃝. A bubble is a subpath that branches off from a certain vertex on the main path of the DBG, and returns to the same path after a few hops. Figure 5 shows a bubble where the main path “GC”→“CA”→“AA”→“AG” is contributed by correct reads such as 4⃝ and 6⃝, and the erroneous subpath “GC”→“CT”→“TA”→“AG” is caused by read 5⃝, which has an error. Note that vertices “TG”, “GC” and “AG” become ambiguous simply because of the read errors. If we can correct the errors, we can obtain longer contigs. For example, in Figure 5, we essentially reconstruct the whole sequence as a single contig after removing the erroneous paths.
However, we should not be overly aggressive when correcting potential errors, as false alarms may create wrong (albeit longer) contigs. For example, we only consider a dangling path as a tip if it is short, since a long tip would have to be generated by multiple errors, which is unlikely. In fact, a long dangling path is most likely a valid contig, with its dead end caused by a lack of read coverage at the corresponding position in the whole sequence (or coverage only by reads with length less than k). Here, we define the coverage of a base pair (or position) in the whole sequence as the number of reads that cover it. For example, in Figure 3, the first nucleotide has coverage 1 (covered only by read 3⃝), and the second nucleotide has coverage 2 (covered by reads 1⃝ and 3⃝). As there are many DNA clones, it is unlikely (but still possible) that a particular base pair is never covered.
As for a bubble, we remove the subpath(s) with a very low coverage. Here, we define the coverage of a path as the minimum coverage among all edges on the path, where the coverage of an edge (i.e., a k-mer) is defined as the number of reads that generate that k-mer. A correct path is unlikely to have a low coverage as there are many DNA clones, so a low coverage is often the mark of an erroneous read. We also require a subpath to be similar to the main path (with high coverage) in order to remove it, since it is unlikely that multiple errors significantly change the corresponding subsequence. The similarity can be measured by the edit distance between the two subsequences represented by the two paths.
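The two bubble-filtering criteria above (path coverage and sequence similarity) can be sketched as follows; the threshold values an assembler would use are a tuning matter and are not shown here.

```python
# Sketch of the two bubble-filtering checks: path coverage is the minimum
# edge coverage along the path, and similarity is the edit distance between
# the sequences spelled by the two paths.
def path_coverage(edge_cov):                 # list of per-edge coverages
    return min(edge_cov)

def edit_distance(a, b):                     # classic dynamic programming
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[-1]

# a low-coverage branch differing by one substitution is a bubble candidate
low, high = path_coverage([1, 1, 2]), path_coverage([40, 38, 52])
dist = edit_distance("GCAAG", "GCTAG")       # one substitution apart
```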
Directionality. In the discussion so far, we have been assuming that the DNA molecule is just a long sequence. In reality, reads may be obtained from both strands of the DNA molecule. As Figure 3 shows, each strand has an end-to-end chemical orientation, and reads are always obtained in the 5'-to-3' direction. Specifically, strand 1 (resp. strand 2) is read from left to right (resp. from right to left). Let us temporarily read both strands from left to right; then the two nucleotides at the same position of the two strands constitute a base pair. For example, in Figure 3, the first two base pairs are (A, T) and (T, A). Nucleotides A and T (resp. C and G) are complementary to each other, and the two nucleotides in a base pair are always complementary, as shown in Figure 3.
Given a nucleotide x, we denote its complement by x̄. The reverse complement of a DNA sequence S, denoted by rc(S), is obtained by reversing S and complementing every nucleotide. For example, the reverse complement of strand 1 in Figure 3 is “GACTTGCAAT”, which is exactly strand 2 read in the 5'-to-3' direction.
Now consider read 4⃝ in Figure 3, which is replotted on strand 1 in Figure 6. If we read the same DNA segment on strand 2 in the 5'-to-3' direction, we obtain another read “ACTT”, which is exactly the reverse complement of read 4⃝. Figure 6 also shows the (k-1)-mer vertices and k-mer edges generated by these two reads (k = 3).
Ideally, we would like any length-(k-1) subsequence of a strand to appear only once in that strand, and not to appear in the other strand, so that each (k-1)-mer determines a unique segment location in only one of the two strands. Assuming this is the case, it is not difficult to see in Figure 6 that a (k-1)-mer and its reverse complement refer to the same position in the DNA molecule. For example, the two rightmost (k-1)-mer nodes “GT” and “AC” are the reverse complement of each other. We would like a (k-1)-mer and its reverse complement to correspond to a single vertex in the DBG, so that reads from different strands can be stitched to create longer contigs as long as the reads share overlapping DNA segments. To achieve this goal, we define the canonical form of a (k-1)-mer s as the lexicographically smaller sequence between s and rc(s), and use the canonical (k-1)-mer as a vertex in the DBG. For example, the rightmost (k-1)-mers “GT” and “AC” in Figure 6 both refer to the rightmost DBG vertex “AC” of the chain in the middle of Figure 6.
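Reverse complement and canonical-form selection can be sketched directly from the definitions above:

```python
# Sketch of reverse complement and canonical (k-1)-mer selection: a sequence
# and its reverse complement map to the same DBG vertex.
COMP = {"A": "T", "T": "A", "G": "C", "C": "G"}

def rc(s):
    return "".join(COMP[c] for c in reversed(s))

def canonical(s):
    return min(s, rc(s))                  # lexicographically smaller of the two

# strand 1 of Figure 3 maps to strand 2 read 5'-to-3'
s2 = rc("ATTGCAAGTC")                     # "GACTTGCAAT"
# "GT" and "AC" of Figure 6 share the canonical vertex "AC"
v = canonical("GT")
```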
Accordingly, each DBG edge now needs to have a polarity to indicate how the k-mer that generates the edge maps to the canonical vertices; the polarity takes one of the four values ++, +−, −+ and −−. Polarity is used to indicate the stitching directions when constructing contigs. We use an example to explain how edge polarity is determined. Consider the last k-mer of read “AAGT” from strand 1 in Figure 6 (k = 3), i.e., “AGT”, which creates an edge “AG”→“GT”. Edge source “AG” is already canonical and thus we give it a label +, while edge target “GT” needs to be converted to its reverse complement “AC” to become a DBG vertex, in which case we give it a label −. The edge polarity is simply the concatenation of the source and target labels, i.e., +−.
We say that the labels + and − are complementary to each other, and write comp(ℓ) for the complement of a label ℓ (i.e., comp(+) = − and comp(−) = +). It is not difficult to see the following property (e.g., from Figure 6).
Property III. An edge u→v with polarity ℓ_u ℓ_v is equivalent to the edge v→u with polarity comp(ℓ_v) comp(ℓ_u).
This property allows us to stitch k-mers generated from different strands. For example, consider k-mers “AAG” from strand 1 and “ACT” from strand 2, which generate the two edges “AA”→“AG” (with polarity ++) and “AC”→“AG” (with polarity +−). Although both edges are incident to “AG”, the labels at the side of “AG” do not match. Since the latter edge is equivalent to “AG”→“AC” (with polarity +−), we can stitch the edges to obtain “AA”→“AG”→“AC”, where both edges agree on label + for “AG” and are in the same direction.
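Property III can be sketched as a one-line edge-reversal rule, applied here to the stitching example above:

```python
# Sketch of Property III: reversing an edge swaps its endpoints and
# complements-and-swaps its polarity labels, which lets edges generated
# from opposite strands be stitched together.
FLIP = {"+": "-", "-": "+"}

def reverse_edge(u, v, pol):
    # edge u->v with polarity "lu lv" equals v->u with "comp(lv) comp(lu)"
    return v, u, FLIP[pol[1]] + FLIP[pol[0]]

# "ACT" (strand 2) yields AC->AG with polarity "+-"; reversing it gives
# AG->AC with polarity "+-", which stitches after AA->AG ("++") since
# both edges then give label "+" at vertex "AG".
rev = reverse_edge("AC", "AG", "+-")
```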
Finally, we remark that our discussion so far has assumed the ideal case; in reality, a (k-1)-mer may appear at multiple positions in both strands. Such a (k-1)-mer is ambiguous, and our goal is still to find the contigs, i.e., the maximal simple paths in the DBG that do not contain any ambiguous vertex.
IV. PPAassembler Algorithms
We first present our compact graph data structures, and then describe the operations supported by PPAassembler.
IV-A. Vertex & Edge Formats
We design compact data structures for the vertices and edges in our vertex-centric programs to be memory-efficient (note that genome assembly has a very high memory demand [9]).
Vertex ID. Each vertex in a Pregel program has a unique ID for message passing, and we use an integer to specify a vertex ID. There are two kinds of vertices in PPAassembler: (1) (k-1)-mer vertices and (2) contig vertices. We encode the sequence of a (k-1)-mer directly into its integer ID, so that different (k-1)-mers have different IDs. Recall that reads are cut into k-mers during DBG construction. Without loss of generality, let us assume that k ≤ 32, and hence we use a 64-bit integer ID (more bits are needed if k > 32). Each nucleotide is represented by two bits: A (00), T (11), G (10), C (01), and thus a (k-1)-mer requires at most 62 bits to represent. We align this binary sequence to the right of the 64-bit ID, and pad all remaining bits (at least 2) on the left with zeros. As an example, Figure 7(a) shows the ID of a 5-mer “ATTGC”. Sometimes, we would like to indicate that a (k-1)-mer or a contig has no neighbor along one direction (e.g., at the dead end of a tip). We use a dummy neighbor in this case, denoted by NULL, whose ID is the special 64-bit binary sequence with the most significant bit being 1 and all other bits being 0. Finally, since a contig can be an arbitrarily long sequence, we cannot encode the sequence into the contig's ID. Instead, since the contigs are distributed among the machines after their generation in PPAassembler, we let the i-th worker machine assign its j-th contig a 64-bit ID that equals the 32-bit integer representation of i concatenated with the 32-bit integer representation of j, as shown in Figure 7(c). To avoid collision with the ID of a (k-1)-mer, we also flip the most significant bit to 1.

Compared with directly using the sequence as the vertex ID, using an integer ID provides the following benefits: (1) Pregel heavily checks vertex IDs for message delivery, and integer IDs benefit from efficient word-level instructions; (2) no additional space is needed to store the sequence of a (k-1)-mer vertex, and the sequence-related processing can be executed efficiently by bitwise operations.
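The ID encodings above can be sketched as follows (the exact bit positions of the contig ID fields follow the layout described in the text):

```python
# Sketch of the 2-bit-per-nucleotide vertex ID encoding (A=00, C=01, G=10,
# T=11), right-aligned in a 64-bit word, plus the contig ID layout (machine
# index in the high 32 bits, per-machine counter in the low 32 bits, most
# significant bit flipped to 1).
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

def mer_id(seq):                 # assumes len(seq) <= 31, i.e. k <= 32
    x = 0
    for c in seq:
        x = (x << 2) | CODE[c]
    return x

def contig_id(machine, index):
    return (1 << 63) | (machine << 32) | index

NULL_ID = 1 << 63                # dummy neighbor marking a dead end

vid = mer_id("ATTGC")            # the 5-mer of Figure 7(a): bits 00 11 11 10 01
cid = contig_id(0, 1)            # first contig of worker 0
```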
Format of a (k-1)-mer Vertex. Each vertex also maintains an adjacency list of its neighboring vertices ((k-1)-mers or contigs). Since a contig is obtained by merging unambiguous (k-1)-mers, it has only two neighbors along its two opposite sequencing directions, where each neighbor is either an ambiguous (k-1)-mer or NULL (i.e., a dead end).
In contrast, a (k-1)-mer vertex may have more than two neighbors. There are two cases: (1) at the beginning, all vertices (and hence all neighbors) are (k-1)-mers generated from the reads; (2) in later processing, a neighbor of a (k-1)-mer may also be a contig generated by merging unambiguous (k-1)-mers (this later processing makes sense since error correction may render a currently ambiguous (k-1)-mer unambiguous, enabling further contig merging). Case (1) is the most memory-consuming, since the overlapping (k-1)-mers incur a lot of data redundancy; once the unambiguous (k-1)-mers (which are often the majority) get merged into contigs, memory consumption is usually no longer a problem due to the significantly reduced data volume.
Therefore, we compress the adjacency lists of (k-1)-mer vertices in Case (1) using compact bitmaps, which we describe next. Let us first ignore edge polarity; then a (k-1)-mer can have at most 4 in-neighbors and 4 out-neighbors (i.e., 8 neighbors). For example, the 4-mer “CCGT” can have at most 4 in-neighbors whose suffix matches its prefix “CCG”, i.e., “ACCG”, “TCCG”, “GCCG” and “CCCG”. Now taking the 4 possible edge polarities ++, +−, −+ and −− into account, we obtain 8 × 4 = 32 possible combinations, which we represent using the 32-bit bitmap shown in Figure 8(a). A bit is 1 if the corresponding neighbor exists (and 0 otherwise). For example, if the rightmost bit is 1, then we can obtain the neighbor (i.e., its ID that encodes the sequence) by (i) reverse-complementing the current (k-1)-mer (i.e., its ID) since the source-side polarity label is −, (ii) appending “C” to its suffix, and (iii) reverse-complementing the resulting sequence since the target-side polarity label is −. In addition to the 32-bit neighbor bitmap, a (k-1)-mer vertex also maintains a list of counts, one for each neighbor (i.e., each 1-bit in the bitmap), recording the coverage of the corresponding edge. The counts are stored as variable-length integers to save space (e.g., a small count can often be represented with just one byte). In fact, using Property III mentioned in Section III, we can further cut the bitmap size by half.
In uncompressed format, each (k-1)-mer neighbor (i.e., adjacency list item) of a (k-1)-mer vertex is represented by an 8-bit bitmap plus the coverage of the corresponding edge. The bitmap format is shown in Figure 8(b): the leftmost three bits are always 0, two bits indicate which nucleotide gets prepended (resp. appended) to the prefix (resp. suffix) of the current (k-1)-mer vertex to form the neighbor, one bit indicates whether the neighbor is an in-neighbor or an out-neighbor, and two bits indicate the polarity of the corresponding edge. For example, consider the 4-mer vertex “ACGG” in Figure 8(b). Its in-neighbor, node 1⃝, is represented by the bitmap 00010111, which indicates that the edge polarity is −−, and that the neighbor's sequence “CGGC” is obtained by reverse-complementing “ACGG” into “CCGT”, prepending G (10) to the prefix “CCG” to obtain “GCCG”, and then reverse-complementing it into “CGGC”. Finally, we sometimes need to indicate that a (k-1)-mer vertex reaches a dead end along one direction, in which case we use the bitmap 10000000 to indicate that the neighbor is NULL.
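Decoding such an adjacency list item can be sketched as follows. The exact bit order within the byte is an assumption made for illustration (here: three zero bits, two nucleotide bits, one in/out bit, two polarity bits); it reproduces the worked example for node 1⃝.

```python
# Sketch of decoding one 8-bit neighbor bitmap of a (k-1)-mer vertex,
# ASSUMING the layout 000 | nucleotide(2) | in-or-out(1) | polarity(2).
COMP = {"A": "T", "T": "A", "G": "C", "C": "G"}
NT = ["A", "C", "G", "T"]                      # 00, 01, 10, 11

def rc(s):
    return "".join(COMP[c] for c in reversed(s))

def decode_neighbor(vertex, bits):
    nt = NT[(bits >> 3) & 0b11]                # nucleotide to prepend/append
    is_in = (bits >> 2) & 0b1                  # 1 = in-neighbor
    pol = bits & 0b11                          # polarity, e.g. 0b11 = "--"
    s = rc(vertex) if pol & 0b10 else vertex   # source side: rc if label is "-"
    s = nt + s[:-1] if is_in else s[1:] + nt   # shift in the new nucleotide
    return rc(s) if pol & 0b01 else s          # target side: rc if label is "-"

# node (1) of Figure 8(b): bitmap 00010111 on vertex "ACGG" yields "CGGC"
nbr = decode_neighbor("ACGG", 0b00010111)
```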
A (k-1)-mer vertex tracks its contig neighbors differently from its (k-1)-mer neighbors; we shall discuss the format shortly.
Format of a Contig. Recall from Figure 7(c) that the ID of a contig does not contain its sequence information. In PPAassembler, a contig vertex keeps its sequence as a variable-length bitmap. For example, Figure 9 shows a DBG path where only the two end (k-1)-mer vertices are ambiguous, and the other (k-1)-mers are merged into a contig with bitmap 11 10 01 01 10 11 00 01. We always keep the contig-side edge polarity as +, so that the contig corresponds to a sequence in strand 1 rather than strand 2. Besides the bitmap, a contig vertex also maintains an in-neighbor (resp. out-neighbor), such as the (k-1)-mer vertex “CTGC” (resp. “TACA”) in Figure 9. Note that the in-neighbor and out-neighbor are uniquely defined since we have fixed the sequencing direction of any contig, i.e., 5'-to-3' in strand 1. Besides the ID, each neighbor is also stored with the neighbor-side edge polarity (e.g., the red and blue labels in Figure 9), and the coverage of the corresponding adjacent edge (e.g., 101 and 103 in Figure 9). Finally, a contig vertex also maintains its own coverage (e.g., 98 in Figure 9), which is computed as the minimum coverage over all edges (i.e., k-mers) merged into the contig.
We now consider how a (k-1)-mer vertex u tracks its contig neighbors. Recall that u tracks each (k-1)-mer neighbor by a bitmap plus a coverage value. It tracks each contig neighbor quite differently. Referring to Figure 9 again, while we can view the contig c as a neighbor of vertex “CTGC” (and of “TACA”), another perspective is to treat c as a label on the edge connecting “CTGC” to “TACA”. We adopt the latter perspective, and let u maintain c's information as a triplet including (i) the ID of the (k-1)-mer vertex at the other end of the edge; (ii) the direction of the edge (incoming or outgoing) and its polarity (e.g., the red and blue labels in Figure 9); (iii) c's ID, which can be used by u to request c's sequence (e.g., for further contig merging). Other information about c, such as its sequence length and coverage, can also be materialized in the corresponding adjacency list item of u to facilitate tip removing and bubble filtering, which eliminates the cost of requesting them from c during processing.
Vertex Types. First consider a (k-1)-mer vertex v, which can be of one of the following three types: (1) dead-end: such a vertex has only one neighbor; (2) linear: such a vertex has two neighbors, and when both edges are made to agree on the polarity label for v (either + or −, which can be enforced using Property III), one neighbor is an in-neighbor and the other is an out-neighbor; such a vertex is unambiguous; (3) branching: such a vertex has at least two neighbors but does not satisfy the requirement of the linear type; such a vertex is ambiguous. Note that v cannot have zero neighbors, since a (k-1)-mer vertex is contributed by the prefix or suffix of some k-mer.
Since a contig is generated by merging unambiguous (k-1)-mers, it can only be of the dead-end type or the linear type. Here, we say that a contig vertex is of the dead-end type iff at least one of its two neighbors is NULL, i.e., the contig corresponds to a dangling path in the DBG and is thus a tip candidate. Note that it is possible to have an isolated contig whose both ends are dead ends (i.e., NULL); it will be regarded as a tip unless it is long.
IV-B Operations and Their Algorithms
PPA-assembler provides a library of operations for flexible genome assembly in a distributed environment deployed with Hadoop. Each operation is implemented as a PPA (described in Section II) and is thus scalable; it can either load data from HDFS, or directly obtain its input from another operation's output in memory. Users may combine the provided operations to implement various sequencing strategies, and they may even integrate new operations or redefine existing ones (e.g., changing the criteria for judging tips and bubbles) using Pregel+'s vertex-centric API.
Overview. Figure 10 shows the data-flow diagram of PPA-assembler, which includes five operations: 1⃝ DBG construction, which constructs a DBG from the DNA reads, and outputs the (k-1)-mer vertices of the DBG along with their adjacency lists; 2⃝ contig labeling, which divides the vertices into two sets (ambiguous ones and unambiguous ones) and labels unambiguous vertices by the contigs that they belong to; 3⃝ contig merging, which merges unambiguous vertices into contigs according to the labels; 4⃝ bubble filtering, which filters any low-coverage contig that shares both ends with another high-coverage contig that has a similar sequence; 5⃝ tip removing, which takes the ambiguous (k-1)-mers and the contigs (after bubble filtering), and removes tips.
In fact, the output of tip removing can be fed to the "contig labeling" operation again to grow longer contigs (see arrow 6⃝ in Figure 10), since the previous error-correction operations may have converted some ambiguous (k-1)-mer vertices into unambiguous ones, and operations 2⃝–5⃝ may loop as many times as needed (though we typically loop for just one more round). In the first round, the inputs to operations "contig labeling" and "contig merging" must be (k-1)-mers, but starting from the second round, the inputs may contain a mix of (k-1)-mers and contigs. For ease of discussion, we focus on the first round when discussing the "contig labeling" and "contig merging" operations.
1⃝ DBG Construction. This operation loads DNA reads from HDFS, and creates a DBG from them through two mini MapReduce phases: (i) the first phase extracts k-mers from the reads, and (ii) the second phase constructs (k-1)-mer vertices and their adjacency lists from the extracted k-mers, which form the DBG.
We first describe phase (i). In real DNA data, a read's sequence may contain the element "N" besides "A", "T", "G", "C"; such an element indicates that the nucleotide could not be determined due to noise in measurement. Therefore, in map(.), a read is first split into sequences at the "N" elements, and each sequence is parsed to obtain its k-mers using a sliding window of k elements, as illustrated in Figure 4. The sequence of a k-mer is directly encoded in its 64-bit integer ID, which functions as the key for shuffling. In each worker machine, if a k-mer is obtained for the first time, the worker creates an (ID, count) pair for it with count = 1; otherwise, the k-mer's count is increased by 1. After shuffling, for each k-mer, all its counts (from all workers) are input to reduce(.), which sums these counts to obtain the total count of the k-mer; reduce(.) outputs the k-mer as an (ID, count) pair only if its coverage reaches a user-defined threshold. We filter a low-coverage k-mer since it is very likely to be contributed by erroneous reads.
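Phase (i) can be sketched sequentially as follows; the mini MapReduce shuffle is replaced by a single in-memory counter, and the function names are illustrative:

```python
from collections import Counter

def extract_kmers(read, k):
    """Split a read at 'N' elements, then slide a window of k elements over
    each resulting segment."""
    for seg in read.split("N"):
        for i in range(len(seg) - k + 1):
            yield seg[i:i + k]

def count_and_filter(reads, k, threshold):
    """Phase (i) sketch: count all k-mers, then drop those whose coverage
    (total count) is below the user-defined threshold."""
    counts = Counter()
    for read in reads:
        counts.update(extract_kmers(read, k))
    return {mer: c for mer, c in counts.items() if c >= threshold}
```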
In Phase (ii), each remaining k-mer is input to map(.), which extracts the two (k-1)-mers that correspond to its prefix and suffix. In each worker, if an extracted (k-1)-mer is obtained for the first time, the worker creates a (k-1)-mer vertex for it. A directed edge from the prefix (k-1)-mer vertex to the suffix (k-1)-mer vertex is also added to the adjacency lists of both vertices, where an adjacency list is represented by a 32-bit bitmap as shown in Figure 8, and the edge count (which equals the k-mer's count) is also recorded or incremented, using a variable-length integer. The (k-1)-mer vertices with partially constructed adjacency lists are then shuffled by their 64-bit integer IDs. After shuffling, for each (k-1)-mer, its partial adjacency lists (from all workers) are input to reduce(.), which combines them to obtain the complete adjacency list (still represented in 32 bits), and sums the counts of each edge to obtain the edge coverage (represented compactly using a variable-length integer).
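A single-strand sketch of phase (ii); the real system additionally encodes polarities and packs the adjacency list into a 32-bit bitmap rather than a nested dictionary:

```python
from collections import defaultdict

def build_dbg(kmers):
    """Phase (ii) sketch: each k-mer contributes a directed edge from its
    (k-1)-mer prefix vertex to its (k-1)-mer suffix vertex; edge coverage
    accumulates the k-mer counts."""
    adj = defaultdict(lambda: defaultdict(int))  # prefix -> suffix -> coverage
    for mer, count in kmers.items():
        adj[mer[:-1]][mer[1:]] += count
    return adj
```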
2⃝ Contig Labeling. Let us call a path that only contains vertices of types dead-end and unambiguous an unambiguous path. The "contig labeling" operation marks all vertices on each maximal unambiguous path with a unique label, so that they can later be grouped to create a contig. The operation is executed right after "1⃝ DBG construction", when the input vertices are all (k-1)-mers. It can also be executed after "5⃝ tip removing" to find longer contigs, in which case some input vertices may already be contigs.
A vertex is at one end of a maximal unambiguous path if its type is dead-end, or if its type is unambiguous and at least one of its neighbors is of type ambiguous. For example, in Figure 9, vertex "GGCA" is at the left end of contig "TGCCGTAC", since its neighbor "CTGC" is of type ambiguous. The contig labeling operation first recognizes contig-ends in two supersteps: (1) in superstep 1, every vertex of type ambiguous broadcasts its ID to all its neighbors, and then votes to halt; it will never be reactivated, since the remaining computation only involves unambiguous vertices; (2) in superstep 2, a vertex recognizes itself as a contig-end if it is of type dead-end, or if it is of type unambiguous and receives the ID of an ambiguous vertex sent in superstep 1.
There are two methods to find all maximal unambiguous paths (i.e., contigs) in O(log n) supersteps, both of which require contig-end vertices to remove all their edges with ambiguous vertices, so that the DBG becomes a set of isolated unambiguous paths, each corresponding to a contig. The first method is to run the simplified SV algorithm described in Section II, so that every vertex v is labeled with the smallest vertex ID in its connected component (i.e., the isolated unambiguous path containing v). The second method is to use the idea of list ranking described in Section II to find all unambiguous paths in O(log L) rounds, where L is the length of the longest unambiguous path. We now describe the second algorithm in more detail.
We illustrate this algorithm using the example of Figure 9, which is replotted in Figure 11. Each edge is plotted along with its equivalent edge in the other direction as determined by Property III, and each vertex is denoted by its integer ID (e.g., "GGCA" is encoded with bitmap 10100100, which is 164). As mentioned two paragraphs above, in superstep 2, a vertex v that recognizes itself as a contig-end needs to remove its edges with any ambiguous vertex. Instead of deleting such an edge from v's adjacency list, we replace it with a self-loop edge, but we flip the second most significant bit of the target ID (i.e., v's own ID) to indicate that v is a contig-end. For example, in Figure 11, the vertex with ID 164 replaces the edge with its ambiguous neighbor (which sent its ID to 164 in superstep 1) by a self-loop edge to itself, so one entry in its pair of neighbor IDs becomes the flipped version of 164. Recall from Figure 7 that the second most significant bit of a 64-bit ID is neither used to encode the (k-1)-mer sequence nor used to differentiate ID types.
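The contig-end flag can be sketched as follows, taking the second most significant bit of a 64-bit ID to be bit 62; that bit position is an assumption on our part, consistent with the ID layout described for Figure 7:

```python
# Contig-end flag sketch: bit 62 of a 64-bit vertex ID is assumed unused by
# the sequence encoding and the ID-type field, so it can mark contig-ends.
FLAG = 1 << 62

def flip(vid):
    """Toggle the contig-end flag on a vertex ID."""
    return vid ^ FLAG

def is_contig_end(vid):
    """True iff the ID carries the contig-end flag."""
    return bool(vid & FLAG)
```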
In our list-ranking approach, each unambiguous vertex maintains a pair of IDs, which is initialized as the pair of neighbor IDs set in superstep 2, as illustrated by round 0 in Figure 11. Note that a vertex of type dead-end also has a pair of IDs, since its NULL neighbor is replaced with the self-loop edge (a dead-end vertex is itself a contig-end). We then perform list ranking in both sequencing directions of a contig, a process we call bidirectional list ranking. We pass messages in both directions rather than from one end of a contig to the other, since the two ends are recognized symmetrically in superstep 2, and edge direction alone is not sufficient to determine the sequencing direction, as explained by Property III.
In the ID-pair maintained by a vertex v, each ID corresponds to v's predecessor in one sequencing direction, and it is updated to the predecessor's predecessor after each round until it becomes the flipped ID of a contig-end. We illustrate this process by considering the vertex v with ID 105 in Figure 11, whose ID-pair is initially (164, 26). In round 0, v sends its ID to its two predecessors 164 and 26 in one superstep. In the next superstep, vertex 26 receives v's ID 105, checks its own ID-pair, finds the predecessor that is not the received ID, i.e., 108, and responds it back to v. Similarly, v also receives the flipped ID of 164 from the contig-end 164, and it then replaces its current ID-pair with the received pair of IDs. In round 1, v sends a request only to 108, since it has already reached the contig-end in the other direction. It receives 108's predecessor in that direction, which is the flipped ID of the other contig-end, and updates its ID-pair accordingly. Since v has now reached both contig-ends, it votes to halt and will not participate in any future computation. In fact, all vertices reach both contig-ends before round 2 and vote to halt, so the computation stops in 2 rounds. It is not difficult to see that the number of hops between a vertex and its predecessors doubles in each round, so the computation stops in O(log L) supersteps and is thus a BPPA. When the computation terminates, the ID-pair of each vertex contains the flipped IDs of its two contig-ends. Obviously, each ID-pair uniquely defines a contig, and we use the smaller contig-end vertex ID as the contig-label.
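The pointer-doubling rounds can be simulated sequentially on the example path 164-105-26-108. This sketch assumes a globally consistent left/right orientation of every ID-pair, which the distributed algorithm does not have; it obtains the same effect through the request/response messages described above:

```python
FLAG = 1 << 62                        # contig-end flag bit (see Figure 7)

def flip(v):
    return v ^ FLAG

def flipped(v):
    return bool(v & FLAG)

def list_rank(pairs):
    """pairs: {vid: [left_pred, right_pred]}; a contig-end stores its own
    flipped ID in place of the missing neighbor. Each round, every pointer
    that has not yet reached a contig-end jumps to its predecessor's
    predecessor in the same direction (pointer doubling)."""
    rounds = 0
    while any(not flipped(p) for pr in pairs.values() for p in pr):
        pairs = {v: [l if flipped(l) else pairs[l][0],
                     r if flipped(r) else pairs[r][1]]
                 for v, (l, r) in pairs.items()}
        rounds += 1
    return pairs, rounds
```

On the four-vertex example from Figure 11, this converges in 2 rounds with every vertex holding the flipped IDs of both contig-ends, matching the walkthrough above.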
Bidirectional list ranking alone is not sufficient if the DBG contains a cycle of vertices of type unambiguous (a special contig, if large enough), since these vertices will never reach an end. Note that as long as the ID-pairs in some non-cycle contig are not yet finalized, each round will have some vertices vote to halt due to reaching both contig-ends. Therefore, if the number of active vertices is larger than 0 and does not decrease after a round, only cycles remain, and the algorithm switches to running our simplified SV algorithm on the remaining active vertices, so that each vertex in a cycle obtains the smallest ID in the cycle. Bidirectional list ranking is preferred since each round takes only 2 supersteps, much fewer than required by a round of the SV algorithm. On the other hand, running the SV algorithm over the vertices in cycles at the end is fast, since very few active vertices remain.
To summarize, if only the simplified SV algorithm is adopted, then each vertex obtains its contig-label as the smallest vertex ID in its contig; if bidirectional list ranking is adopted, each vertex in a non-cycle contig obtains its contig-label as the smaller contig-end vertex ID, while each vertex in a cycled contig obtains its contig-label as the smallest vertex ID in the cycle.
3⃝ Contig Merging. This operation takes the labeled unambiguous vertices as input, and uses a mini MapReduce procedure to group the vertices by their labels. All vertices with the same contig-label are input to reduce(.), which then merges the sequences of these vertices to obtain the contig.
We now describe the merging process in reduce(.). First, a hash table is constructed over all the vertices in the contig group, so that we can look up a vertex object (storing information like its sequence and neighbors) by its 64-bit integer ID. We also identify a contig-end vertex, which has a neighbor not in the group (either NULL or of type ambiguous), to start the stitching with. If such a vertex cannot be found, the contig is a cycle and we start stitching from an arbitrary vertex.
We then order all the vertices from the starting vertex (and meanwhile, set the edge directions properly), so that they can be stitched in order. Let us denote the starting vertex by v1, and denote the subsequent vertices after ordering by v2, …, vn. Initially, we find the neighbor of v1 that is not its self-loop. We let this neighbor be v1's out-neighbor (i.e., v2), and let the other neighbor of v1 be its in-neighbor. Edge directions and polarities are properly adjusted using Property III if they are originally inconsistent. We then obtain v2 from the hash table for processing, using its ID stored in v1's adjacency list. In general, for each vertex vi, we let vi-1 be its in-neighbor, and let the other neighbor be the out-neighbor (i.e., vi+1); then vi+1 can be obtained from the hash table (using its ID in vi's adjacency list) to continue the ordering process. The ordering finishes when all vertices have been processed.
If the contig is of type dead-end, we exit reduce(.) when the aggregated contig length is not above the user-specified tip-length threshold (since the contig is a tip). In all other cases, we stitch the vertices in the order v1, v2, …, vn to construct the contig. Specifically, if the edge polarity on v1's side indicates strand 2, we reverse-complement v1's sequence and append it to the contig's sequence; otherwise, v1's sequence is directly appended. For each subsequent vertex vi, we similarly check the edge polarity on its side, and use vi's sequence or its reverse complement to update the contig's sequence. Note that consecutive sequences overlap by k-2 elements, which must be taken into account. For example, in Figure 9, assume that vertex "GGCA" has already appended the sequence "TGCC" to the contig's sequence (the edge polarity on the side of "GGCA" indicates strand 2, so its reverse complement is appended); then vertex "CGGC" should only append the complement of its first element (not its last element, since the edge polarity on the side of "CGGC" also indicates strand 2), leading to "TGCCG". We also set the contig's coverage as the minimum edge coverage seen during the concatenation, and set the contig's two neighbors as v1's in-neighbor and vn's out-neighbor.
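The reverse-complement stitching can be sketched as follows. We use "-" to denote the polarity that requires reverse-complementing, an assumption on our part that matches the "GGCA"-appends-"TGCC" example above; the sketch handles only (k-1)-mer pieces, which overlap by k-2 elements, so after the first piece each oriented sequence contributes exactly one new element:

```python
COMP = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement: the same molecule read from the other strand."""
    return seq.translate(COMP)[::-1]

def stitch(pieces):
    """pieces: ordered (sequence, polarity) pairs of (k-1)-mer vertices; a
    '-' polarity means the stored sequence is reverse-complemented first.
    Consecutive oriented sequences overlap by k-2 elements, so only the last
    element of each subsequent piece is appended."""
    seq, pol = pieces[0]
    contig = revcomp(seq) if pol == "-" else seq
    for seq, pol in pieces[1:]:
        oriented = revcomp(seq) if pol == "-" else seq
        contig += oriented[-1]   # the k-2 overlapping elements are present
    return contig
```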
4⃝ Bubble Filtering. The previously constructed contigs may then enter the "bubble filtering" operation for further filtering through a mini MapReduce procedure. In map(.), each contig with neighbors u1 and u2, both of type ambiguous, associates itself with a key formed from (u1, u2) for shuffling. As a result, all contigs that share the two neighboring ambiguous vertices are input to reduce(.); let us denote them by c1, c2, …, cm. We then process each contig ci as follows: if ci is not already pruned, we check whether any contig cj (j < i) can prune ci, or vice versa. Specifically, we first compute the edit distance between ci's sequence and cj's sequence or its reverse complement (depending on whether ci's and cj's edge directions are consistent, i.e., u1-to-u2 or u2-to-u1). If the distance is smaller than a user-defined threshold, we mark ci (resp. cj) as pruned if its coverage is smaller than that of cj (resp. ci).
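A sequential sketch of the reduce(.) logic; the orientation handling (choosing each sequence or its reverse complement) is omitted, and a textbook dynamic-programming edit distance stands in for whatever distance routine the system uses:

```python
def edit_distance(a, b):
    """Standard rolling-array dynamic-programming edit distance."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[-1]

def filter_bubbles(contigs, threshold):
    """Sketch of reduce(.): `contigs` are (sequence, coverage) pairs sharing
    both ambiguous neighbors; a contig is pruned when a similar-sequence
    rival has higher coverage. Returns a pruned flag per contig."""
    pruned = [False] * len(contigs)
    for i, (si, ci) in enumerate(contigs):
        if pruned[i]:
            continue
        for j in range(i):
            if pruned[j]:
                continue
            sj, cj = contigs[j]
            if edit_distance(si, sj) < threshold:
                if ci < cj:
                    pruned[i] = True
                elif cj < ci:
                    pruned[j] = True
    return pruned
```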
5⃝ Tip Removing. This operation takes both the ambiguous (k-1)-mers and the merged contigs as input. We first need to update the adjacency lists of the ambiguous (k-1)-mers, to link them to the newly merged contigs. In fact, since some contigs may have been removed by bubble filtering, some ambiguous (k-1)-mers may have changed their types from ambiguous to unambiguous or dead-end. Recall from Section IV-A that a (k-1)-mer vertex stores a contig neighbor by maintaining (1) the contig vertex's ID (e.g., for requesting its sequence), (2) the vertex that the contig connects to on the other end, and (3) other contig information such as its length. We set the adjacency lists of the (k-1)-mer vertices in two supersteps: (i) in superstep 1, each contig vertex sends the information mentioned above to both of its neighbors (if not NULL); then (ii) in superstep 2, each (k-1)-mer vertex collects this information into its adjacency list.
Since only path length matters during tip removing, we only need to check the (k-1)-mer vertices, as each (k-1)-mer vertex maintains each contig neighbor c in the form of c's ID, c's sequence length, and the (k-1)-mer vertex on the other end of c. However, when deleting such an edge (because it is part of a tip), a message should be sent to the contig vertex (using c's ID stored in the adjacency-list item) to tell it to delete itself; we take this for granted and do not emphasize it in the subsequent algorithm description.
Note that the removal of tips may cause some vertices of type unambiguous to change their type to dead-end, hence generating new tips. As a result, we run our vertex-centric tip-removing procedure for multiple phases, until no new dead-end vertex is generated at the end of a phase.
In each phase, we start message passing from vertices of type dead-end, where a message records (1) the sender's ID, (2) the cumulative sequence length, and (3) a type REQUEST. A dead-end vertex initializes the cumulative sequence length with its own sequence length. When a vertex of type unambiguous receives a REQUEST message, it relays the message to the other neighbor (i.e., not the sender) after increasing the cumulative sequence length by 1 (contributed by itself), plus the contig length minus the overlap length if the relaying edge contains a contig (the subtraction avoids double-counting the overlapping sequence).
The REQUEST message ends at an ambiguous- or dead-end-typed vertex u, which checks whether the cumulative sequence length is no larger than the tip-length threshold. If so, u sends a message of type DELETE back to the sender to delete the vertices on the dangling path. The DELETE message is relayed back by the unambiguous vertices until it reaches the dead-end vertex that initiated the REQUEST message, and vertex and contig deletions are triggered along the backward message propagation.
A special case is when a tip has two dead-end-typed ends. Since both end vertices initiate a REQUEST message sent towards each other, when the two DELETE messages are sent back, they meet in the middle of the tip (rather than reaching the other dead-end).
An ambiguous-typed vertex also deletes its edge to the neighbor to which it sends a DELETE message, and if its type thereby becomes dead-end, it keeps itself activated to initiate a REQUEST message in the next phase.
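The multi-phase protocol above can be approximated sequentially by walking from each dead-end vertex; messaging, polarity bookkeeping, and the overlap subtraction are all omitted in this sketch, and the graph representation is our own:

```python
def remove_tips(adj, length, tip_threshold):
    """adj: {v: set of neighbors}; length[v]: sequence length contributed by v.
    Repeatedly (one outer iteration per phase) walk from each dead-end vertex
    through degree-<=2 vertices, deleting the dangling path when its
    cumulative length is within the tip-length threshold."""
    changed = True
    while changed:
        changed = False
        for v in [u for u in adj if len(adj[u]) == 1]:
            if v not in adj or len(adj[v]) != 1:   # may already be deleted
                continue
            path, prev, cur = [v], None, v
            total = length[v]
            while True:
                nbrs = [u for u in adj[cur] if u != prev]
                if not nbrs:
                    break                          # reached the path's far end
                nxt = nbrs[0]
                if len(adj[nxt]) > 2:
                    break                          # stop at an ambiguous vertex
                path.append(nxt)
                total += length[nxt]
                prev, cur = cur, nxt
            if total <= tip_threshold:             # DELETE along the path
                for u in path:
                    for w in adj.pop(u):
                        if w in adj:
                            adj[w].discard(u)
                changed = True
    return adj
```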
V Experiments
The state-of-the-art parallel assemblers often use ad-hoc designs. For example, ABySS [17] builds the DBG by letting each (k-1)-mer send messages to its 8 possible neighbors (with A/T/G/C prepended/appended) to establish edges. This increases ambiguity (and hence reduces contig length), since an edge will be created between the 2-mers "CA" (e.g., contributed by the 3-mer "CAT") and "AA" (e.g., contributed by "GAA") even though the 3-mer "CAA" does not exist in the DNA molecule. As another example, instead of using list ranking or SV to find maximal unambiguous paths as contigs, Spaler [1] iteratively breaks each unambiguous path at sampled vertices to form segments, and then merges segments that meet at a sampled boundary vertex. The process is repeated until such vertices account for more than 1/3 of all vertices in the graph, and this heuristic provides no guarantee of path maximality. PPA-assembler avoids ad-hoc designs through well-designed assembly logic (which can be further customized by users) that is decoupled from low-level distributed communication by using the performance-optimized vertex-centric and mini MapReduce procedures provided by Pregel+.
We compare PPA-assembler with the state-of-the-art parallel assemblers ABySS (version 1.5.2), Ray (version 2.3.1) and SWAP-Assembler (version 3.0). Spaler is not open-sourced and is thus not included in our comparison. We use the simple workflow 1⃝2⃝3⃝4⃝5⃝6⃝2⃝3⃝ in Figure 10 for PPA-assembler, i.e., contigs are grown once more after error correction. However, we remark that users may customize their own workflows, change the existing operations (e.g., add coverage-threshold pruning to bubble filtering), or add new operations implemented in Pregel+'s API (e.g., branch splitting [1] for error correction) to implement different assembly strategies, including those of ABySS, Ray and SWAP-Assembler.
All experiments were conducted on a cluster of 16 machines connected by Gigabit Ethernet, each with 48 GB DDR3 RAM and 12 cores (two Intel Xeon E5-2620 CPUs). We used the same value of k for defining k-mers throughout. For PPA-assembler, the edit-distance threshold for bubble filtering is set to 5, and the length threshold for tip removing is set to 80, since we found the sequencing results to be very stable around these parameter values. We used the default settings of ABySS, Ray and SWAP-Assembler in their respective experiments.
We ran our experiments with the 4 datasets shown in Table I, which are listed in increasing order of data volume. The two smaller datasets are generated from NCBI's reference gene sequences Homo Sapiens Chromosome X (HCX, http://www.ncbi.nlm.nih.gov/nuccore/NC_000023.11) and Homo Sapiens Chromosome 2 (HC2, https://www.ncbi.nlm.nih.gov/nuccore/NC_000002.12), using the ART software [8] (http://www.niehs.nih.gov/research/resources/software/biostatistics/art/). These datasets have reference sequences, so we can measure the sequencing quality exactly. To further test scalability, we also selected the 2 larger datasets Human Chromosome 14 (HC14, http://gage.cbcb.umd.edu/data/Hg_chr14) and Bombus Impatiens (BI, http://gage.cbcb.umd.edu/data/Bombus_impatiens), which were downloaded from the GAGE project [15]. All datasets are in FASTQ format, which includes the sequence of each DNA read.
Running Time & Scalability. We use bidirectional list ranking for contig labeling in this set of experiments. Figure 12 shows the performance of the four assemblers on the two large datasets HC14 and BI (the results on HCX and HC2 are similar and thus omitted). For each assembler, we show the end-to-end execution time of assembly when each machine runs 1, 2, 3 and 4 workers, respectively. We can see that PPA-assembler is much faster than all the other assemblers in all cases, thanks to the efficient Pregel+ backend and our well-designed algorithms. The performance gap widens as the input dataset grows (e.g., compare Figure 12(b) to Figure 12(a)), which demonstrates the superior scalability of PPA-assembler with respect to data size. In contrast, Ray has the poorest performance, often one order of magnitude slower than the other assemblers.
As for scalability with respect to the number of workers, the performance of PPA-assembler, SWAP-Assembler and Ray keeps improving as the number of workers increases. In contrast, the performance of ABySS is insensitive to the number of workers; in fact, more workers may even lead to a longer assembly time.
TABLE II: Bidirectional list ranking (LR) vs. simplified SV for labeling (k-1)-mers

Datasets | # of Supersteps (LR / SV) | # of Messages (LR / SV) | Runtime in s (LR / SV)
HCX  | 26 / 86 | 2,325 M / 5,913 M  | 93 / 212
HC2  | 28 / 93 | 1,498 M / 3,644 M  | 58 / 128
HC14 | 67 / 93 | 2,342 M / 6,852 M  | 213 / 415
BI   | 60 / 86 | 6,705 M / 22,958 M | 239 / 723
TABLE III: Bidirectional list ranking (LR) vs. simplified SV for labeling contigs

Datasets | # of Supersteps (LR / SV) | # of Messages (LR / SV) | Runtime in s (LR / SV)
HCX  | 32 / 44 | 2.16 M / 5.28 M    | 0.51 / 0.67
HC2  | 12 / 37 | 1.05 M / 2.74 M    | 0.20 / 0.50
HC14 | 22 / 51 | 6.04 M / 22.46 M   | 1.06 / 1.83
BI   | 38 / 65 | 74.36 M / 280.04 M | 3.77 / 10.26
Bidirectional List Ranking vs. Simplified SV. These are the two approaches to contig labeling. As mentioned, while both algorithms are PPAs that run for a logarithmic number of rounds (and hence supersteps), each round of SV requires more supersteps than a round of list ranking, so list ranking (LR) is expected to be much faster. Recall that we run PPA-assembler with the simple workflow 1⃝2⃝3⃝4⃝5⃝6⃝2⃝3⃝ in Figure 10, where "2⃝ contig labeling" is performed twice: once for labeling unambiguous (k-1)-mers, and once for labeling contigs (to grow longer ones).
Table II and Table III compare LR and SV for labeling (k-1)-mers and for labeling contigs, respectively, on the four datasets, where we report (1) the number of supersteps, (2) the number of messages, and (3) the running time. We can see that LR runs for far fewer supersteps, sends far fewer messages, and is much faster than SV. The message counts and runtimes in Table III are three orders of magnitude smaller than those in Table II, since the number of vertices is significantly reduced after we merge unambiguous (k-1)-mers into contigs. For example, the DBG of the HC2 dataset has 46.97 M vertices, which is reduced to 1.00 M vertices after merging unambiguous (k-1)-mers into contigs, and further to 68,264 vertices after these contigs are merged following error correction.
Sequencing Quality. We now assess the sequencing quality of the assemblers. We remark that, for PPA-assembler, we are only evaluating the adopted workflow; PPA-assembler can easily be configured with other assembly strategies that lead to higher sequencing quality. Even with the adopted workflow, PPA-assembler achieves comparable (if not better) quality, as we present next.
We used the popular assessment tool QUAST [7], which reports various quality metrics commonly used in genetic analysis. These metrics include: (1) N50, defined as the length of the contig that contains the middle element when all contigs are concatenated from the longest to the shortest; (2) the number of contigs whose length is larger than 500 bp; (3) the length of the largest contig; (4) the total length of all contigs; (5) genome coverage, i.e., the percentage of bases in the genome that are covered; (6) the number of misassembled contigs; (7) the unaligned length; and so on. Some of these metrics do not require a reference sequence (e.g., N50), while others do (e.g., the number of misassembled contigs).
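As a concrete illustration, N50 as defined in (1) can be computed from the contig lengths alone:

```python
def n50(contig_lengths):
    """N50: the length of the contig containing the middle base when all
    contigs are concatenated from longest to shortest."""
    total = sum(contig_lengths)
    acc = 0
    for length in sorted(contig_lengths, reverse=True):
        acc += length
        if acc * 2 >= total:       # this contig covers the middle base
            return length
    return 0                       # empty assembly
```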
TABLE IV: Sequencing quality on the HC2 dataset

Metric | PPA | ABySS | Ray | SWAP
# of contigs | 22,707 | 29,231 | 26,739 | 12,477
Total length | 36,878,742 | 31,426,810 | 20,854,349 | 8,232,160
N50 | 2,070 | 1,184 | 779 | 640
Largest contig | 16,376 | 7,166 | 3,248 | 1,982
GC (%) | 40.89 | 41.77 | 41.03 | 41.21
# Misassemblies | 1 | 4 | 1 | 167
Misassembled length | 1,366 | 3,666 | 520 | 115,998
Unaligned length | 24 | 427 | 1,227 | 47,810
Genome fraction (%) | 76.285 | 65.104 | 42.981 | 16.963
# Mismatches per 100 kbp | 0.43 | 13.75 | 1.04 | 43.02
# Indels per 100 kbp | 0.03 | 0.10 | 0.09 | 5.32
Largest alignment | 16,376 | 7,166 | 3,248 | 1,982
We present the sequencing quality of the assemblers on HC2 in Table IV. Since HC2 has a reference sequence, we obtained all the quality metrics reported by QUAST. The best results among the assemblers are highlighted in Table IV, and we can observe that PPA-assembler performs the best on the majority of the metrics (and comparably on the others). For example, it has the highest N50 value and the fewest misassemblies (i.e., only 1 misassembled contig). It is worth mentioning that the second round of contig merging is effective: N50 is 1,074 after we merge unambiguous (k-1)-mers into contigs, and it improves to 2,070 (i.e., nearly doubles) after we merge contigs following error correction. The results on HCX are similar and thus omitted.
TABLE V: Sequencing quality on the HC14 dataset

Metric | PPA | ABySS | Ray | SWAP
Number of contigs | 41,445 | 18,008 | 45,984 | 47,252
Total length | 62,667,868 | 26,586,604 | 63,456,459 | 63,752,569
N50 | 1,891 | 1,847 | 1,641 | 1,605
Largest contig | 16,069 | 15,744 | 15,116 | 13,251
We report the sequencing quality of the assemblers on HC14 in Table V. Since HC14 has no reference sequence, we cannot report many of the metrics, such as misassemblies. The results show that PPA-assembler achieves the largest N50 value, performs the best in 2 of the 4 metrics, and achieves performance comparable to the best in the other two. The results on BI are similar and thus omitted.
VI Conclusion
We presented PPA-assembler, a scalable and flexible de novo genome assembler that is built on a popular big data framework and provides strict performance guarantees. PPA-assembler is many times faster than other distributed assemblers, and achieves comparable (if not better) sequencing quality.
References
 [1] A. Abu-Doleh and U. V. Catalyurek. Spaler: Spark and GraphX based de novo genome assembler. In 2015 IEEE International Conference on Big Data, pages 1013–1018. IEEE, 2015.
 [2] S. Boisvert, F. Laviolette, and J. Corbeil. Ray: simultaneous assembly of reads from a mix of high-throughput sequencing technologies. Journal of Computational Biology, 17(11):1519–1533, 2010.
 [3] Y. Bu, V. R. Borkar, J. Jia, M. J. Carey, and T. Condie. Pregelix: Big(ger) graph analytics on a dataflow engine. PVLDB, 8(2):161–172, 2014.
 [4] A. Ching, S. Edunov, M. Kabiljo, D. Logothetis, and S. Muthukrishnan. One trillion edges: Graph processing at Facebook-scale. PVLDB, 8(12):1804–1815, 2015.
 [5] X. Feng, L. Chang, X. Lin, L. Qin, and W. Zhang. Computing connected components with linear communication cost in Pregel-like systems. In ICDE, pages 85–96, 2016.
 [6] J. E. Gonzalez, R. S. Xin, A. Dave, D. Crankshaw, M. J. Franklin, and I. Stoica. GraphX: Graph processing in a distributed dataflow framework. In OSDI, pages 599–613, 2014.
 [7] A. Gurevich, V. Saveliev, N. Vyahhi, and G. Tesler. QUAST: quality assessment tool for genome assemblies. Bioinformatics, 29(8):1072–1075, 2013.
 [8] W. Huang, L. Li, J. R. Myers, and G. T. Marth. ART: a next-generation sequencing read simulator. Bioinformatics, 28(4):593–594, 2012.
 [9] D. Kleftogiannis, P. Kalnis, and V. B. Bajic. Comparing memory-efficient genome assemblers on stand-alone and cloud infrastructures. PLoS ONE, 8(9):e75505, 2013.
 [10] Y. Lu, J. Cheng, D. Yan, and H. Wu. Large-scale distributed graph computing systems: An experimental evaluation. PVLDB, 8(3), 2015.
 [11] G. Malewicz, M. H. Austern, A. J. C. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: a system for large-scale graph processing. In SIGMOD Conference, pages 135–146, 2010.
 [12] J. Meng, B. Wang, Y. Wei, S. Feng, and P. Balaji. SWAP-Assembler: scalable and efficient genome assembly towards thousands of cores. BMC Bioinformatics, 15(Suppl 9):S2, 2014.
 [13] P. A. Pevzner, H. Tang, and M. S. Waterman. An Eulerian path approach to DNA fragment assembly. Proceedings of the National Academy of Sciences, 98(17):9748–9753, 2001.
 [14] S. Salihoglu and J. Widom. GPS: a graph processing system. In SSDBM, page 22, 2013.
 [15] S. L. Salzberg, A. M. Phillippy, A. Zimin, D. Puiu, T. Magoc, S. Koren, T. J. Treangen, M. C. Schatz, A. L. Delcher, M. Roberts, et al. GAGE: A critical evaluation of genome assemblies and assembly algorithms. Genome Research, 22(3):557–567, 2012.
 [16] Y. Shiloach and U. Vishkin. An O(log n) parallel connectivity algorithm. Journal of Algorithms, 3(1):57–67, 1982.
 [17] J. T. Simpson, K. Wong, S. D. Jackman, J. E. Schein, S. J. Jones, and I. Birol. ABySS: a parallel assembler for short read sequence data. Genome Research, 19(6):1117–1123, 2009.
 [18] D. Yan, J. Cheng, Y. Lu, and W. Ng. Effective techniques for message reduction and load balancing in distributed graph computation. In WWW, pages 1307–1317, 2015.
 [19] D. Yan, J. Cheng, K. Xing, Y. Lu, W. Ng, and Y. Bu. Pregel algorithms for graph connectivity problems with performance guarantees. PVLDB, 7(14):1821–1832, 2014.
 [20] F. Yang, J. Li, and J. Cheng. Husky: Towards a more efficient and expressive distributed computing framework. PVLDB, 9(5):420–431, 2016.
 [21] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauly, M. J. Franklin, S. Shenker, and I. Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In NSDI, pages 15–28, 2012.