
Managing Reliability Skew in DNA Storage

DNA is emerging as an increasingly attractive medium for data storage due to a number of important and unique advantages it offers, most notably its unprecedented durability and density. While the technology is evolving rapidly, the prohibitive cost of reads and writes and the high frequency and peculiar nature of errors occurring in DNA storage pose a significant challenge to its adoption. In this work we make a novel observation that the probability of successfully recovering a given bit from any type of DNA-based storage system depends heavily on its physical location within the DNA molecule. In other words, when used as a storage medium, some parts of DNA molecules appear significantly more reliable than others. We show that large differences in reliability between different parts of DNA molecules lead to highly inefficient use of error-correction resources, and that commonly used techniques such as unequal error-correction cannot be used to bridge the reliability gap between different locations in the context of DNA storage. We then propose two approaches to address the problem. The first approach is general and applies to any type of data; it stripes the data and ECC codewords across DNA molecules in a particular fashion such that the effects of errors are spread out evenly across different codewords and molecules, effectively de-biasing the underlying storage medium. The second approach is application-specific and seeks to leverage the underlying reliability bias through application-aware mapping of data onto DNA molecules, such that data that requires higher reliability is stored in more reliable locations, whereas data that needs lower reliability is stored in less reliable parts of DNA molecules. We show that the proposed data mapping can be used to achieve graceful degradation in the presence of high error rates, or to implement the concept of approximate storage in DNA.


1. Introduction

The digital universe keeps growing exponentially, mostly due to widespread use and improved capabilities of image and video capturing devices (Seagate, 2017; Jevdjic et al., 2017). At the same time, the density and the production rate of conventional storage devices increase at much slower rates (Seagate, 2017; Ceze et al., 2019; Bornholt et al., 2016), reducing our ability to preserve all the data we generate.

This widening gap between the storage demand and supply can be bridged through a new and radical storage technology that uses DNA as a storage medium (Church et al., 2013; Goldman et al., 2013) and offers a number of important and unique advantages:

  • Unparalleled Density. To illustrate, all the data stored in Facebook’s datacenter in Oregon, which is entirely dedicated to storage of high-density archived data, could fit into a sugar cube when stored in DNA, whereas our entire digital universe could fit into several bottles of DNA (Bornholt et al., 2016).

  • Unmatched Durability. Depending on the method of preservation, data stored in the DNA format could last for hundreds of thousands of years (Grass et al., 2015). This is in stark contrast to conventional storage technologies that retain data for a few years or decades, requiring new hardware acquisition and data transfer.

  • Eternally Relevant Interfaces. While the read/write interfaces of all storage devices eventually become obsolete, humans will always have an existential interest to read and write DNA (Bornholt et al., 2016).

  • Efficient Random Access. Leveraging PCR, one of the fundamental reactions in biochemistry, allows us to selectively extract and read only the object of interest among petabytes of data (Bornholt et al., 2016; Yazdi et al., 2017, 2015; Organick et al., 2018; Tomek et al., 2019, 2021). The key implications are that both the cost and the latency of read operations are constant, regardless of the amount of data stored in the system.

  • Efficient data manipulation. A number of operations, such as copying vast amounts of data or executing intelligent queries (Stewart et al., International Conference on DNA Computing and Molecular Programming; Bee et al., 2021), can be conveniently performed in the molecular domain.

While the technology is evolving rapidly, and the first fully automated end-to-end DNA storage system has recently been demonstrated (Takahashi et al., 2019), a number of major challenges remain to be overcome. The primary obstacle to DNA storage adoption is the prohibitive cost of reads and writes. Writing digital data into DNA (synthesis) and converting it back into digital form (sequencing) incur prohibitive capital and operating costs (Ceze et al., 2019), impeding not only the commercial deployment of DNA storage but also research efforts. The problem is further exacerbated by the high rates and complex nature of errors in DNA. Namely, the processes involved in DNA synthesis (write), data access, and DNA sequencing (read), and even the software processing of DNA sequences, are approximate in nature and highly prone to errors (Goldman et al., 2013; Organick et al., 2018; Bornholt et al., 2016). The lower the cost of these processes, the higher the error rates (Organick et al., 2018; Takahashi et al., 2019). As a result, significant amounts of redundant resources must be invested to allow for full recovery of binary data from DNA molecules. As such, efficient handling of errors in DNA storage is critical to reducing the cost (Yazdi et al., 2017; Bornholt et al., 2016; Grass et al., 2015; Organick et al., 2018).

In this work we make an important observation that, from the system point of view, some parts of DNA molecules represent significantly more reliable locations to store data compared to other parts. We find that the relative order of reliability of different locations within a strand can be easily established at the time of encoding, while the magnitude of the skew depends, in a complex manner, on a few parameters explained later. Interestingly, the reason behind this skew is not directly related to the underlying chemistry, but to the fact that each piece of data must be reconstructed from multiple noisy copies that suffered from insertions and deletions of DNA bases; finding the consensus among such noisy copies requires an algorithmic step whose ability to correctly reconstruct the original bits fundamentally depends on the position of the bits within the molecule. As long as there are any insertion/deletion errors present in the process of reading and writing, all DNA storage architectures will observe reliability skew across positions within each molecule. If the storage system is not aware of it, the existence of the skew will lead to considerable inefficiencies in provisioning of costly error-correction resources, and will lead to higher costs for read and write operations.

Thinking of DNA storage as an information channel, one could theoretically compensate for the skew in reliability by employing uneven error correction, which would be a natural solution to a biased channel. Under uneven error correction, the ECC redundancy is provisioned to each data location in proportion to its reliability properties; less reliable locations receive proportionately more error-correction resources, and vice versa, potentially reversing the skew (Guo et al., 2016; Jevdjic et al., 2017). However, to optimally provision redundancy, we need to know the exact magnitude of the reliability skew, which in turn requires knowledge of the precise error profile of all chemical processes involved. Given that the durability of DNA is measured in thousands of years and that DNA reading (sequencing) technologies evolve rapidly, optimally configuring unequal redundancy for its use in 1000 years is impossible. In fact, DNA storage is such a dynamic and unpredictable stochastic channel that even with full knowledge of the synthesis and sequencing technologies, it is nearly impossible to accurately estimate the magnitude of the reliability skew.

To effectively address this reliability bias, we propose two techniques. The first technique, called Gini (a name inspired by the Gini wealth inequality index), removes any positional reliability bias in DNA storage. Gini amortizes the impact of errors by interleaving error-correction codewords across many DNA molecules, such that the impact of errors is spread evenly across a large group of ECC codewords.

Our second proposed technique, called DnaMapper, builds on the insight that different pieces of data may have different reliability needs (Sampson et al., 2013; Guo et al., 2016; Jevdjic et al., 2017). Having a storage system that offers different classes of reliability on one hand (the set of storage locations that occupy a given position in each molecule forming a reliability class), and data with diverse reliability needs on the other, DnaMapper performs an application-aware mapping of data onto DNA molecules such that data requiring higher reliability is stored in more reliable locations, whereas data that needs lower reliability is stored in less reliable parts of DNA molecules.

To summarize, this work makes the following contributions:

  • We are the first to make the observation that all DNA storage architectures experience reliability skew, such that some parts of the molecules are significantly more reliable than others and the relative order of reliability of different locations within a molecule can be easily and statically determined.

  • We propose Gini, a technique that spreads the impact of errors in DNA storage evenly across ECC codewords, such that every ECC codeword is affected by a nearly identical number of errors. As a result, every error has a similar probability of being corrected, regardless of its spatial origin. In contrast to conventional techniques like uneven error-correction (Guo et al., 2016; Jevdjic et al., 2017), Gini is guaranteed to always evenly spread the errors, regardless of the underlying error profile and sequencing technology used. Gini is applicable to any type of data and can be used to improve the reliability of the storage system, and/or to reduce the amount of expensive reads and writes.

  • We propose DnaMapper, a priority-based mapping scheme that maps data onto DNA according to the reliability needs of different bits, such that the data that requires high reliability is stored in most reliable parts of DNA molecules, and conversely, data whose corruption is more tolerable is stored in less reliable locations. This mapping scheme is general and can be applied to any type of data that has a notion of quality and is ideally suited for images and videos.

We evaluate DnaMapper using a simple yet sufficiently effective heuristic for determining the relative order of bits in standard JPEG images according to their reliability needs. While other heuristics would most likely yield better results (Guo et al., 2016; Jevdjic et al., 2017), this method is much simpler and does not need to maintain any metadata to describe the mapping, and therefore imposes no storage overhead. This method is also content-agnostic, which allows for approximate storage of end-to-end encrypted data, unlike previous approximate storage methods (Guo et al., 2016; Jevdjic et al., 2017).

All proposed mechanisms involve no storage overhead and can be easily integrated into any DNA storage pipeline. In fact, we integrate both Gini and DnaMapper into the state-of-the-art DNA storage pipeline (Organick et al., 2018) through simple data reshuffling. We also showcase the feasibility and practicality of the proposed techniques at a small scale in the wetlab, where we successfully retrieved from DNA and decoded all images stored in all proposed formats. Using simulation, we show that DnaMapper can provide graceful degradation in the case of higher-than-expected error rates, as well as reduce the reading cost by up to 50% while retrieving images of the same quality as the baseline system. Gini reduces the reading cost by up to 30% and the writing cost by up to 12.5%.

The rest of the paper is organized as follows. Section 2 provides background on DNA-based storage architectures and prevailing error-correction mechanisms used. Section 3 introduces and analyzes the problem of consensus finding and provides the intuition behind the reliability skew. Sections 4 and 5 describe the proposed mechanisms. Section 6 describes our experimental methodology, while Section 7 presents the evaluation highlights. We conclude in Section 9.

2. Background

2.1. Storage and Retrieval of Data in DNA

Writing data into DNA relies on DNA synthesis, which is the chemical process of creating artificial DNA molecules. Unlike biological DNA molecules, which contain billions of base pairs (bp), artificial DNA molecules (strands) are limited in length due to the practical limitations of artificial DNA synthesis. Today's cost-efficient synthesis creates strands that are only 100-200 bp long and can hold less than 50 bytes of data. Large files therefore must be split into smaller chunks (Bornholt et al., 2016; Organick et al., 2018), and to be able to reassemble the original data from the chunks, each chunk must contain the ordering information (an index).

Chunks of binary data can be encoded into a sequence of {A, C, G, T} bases using a variety of coding techniques. Some coding techniques seek to avoid immediate repetition of bases, also known as homopolymers (e.g., AAA) (Bornholt et al., 2016), to reduce the chance of sequencing errors on some machines, while others seek to balance the ratio of G+C bases versus A+T (a.k.a. the GC content) to improve the chances of successful synthesis (Yazdi et al., 2015, 2017). Although these constraints come at the expense of coding efficiency, they can also be useful in detecting and correcting errors that cause the imposed constraints to be violated (Yazdi et al., 2015). Without loss of generality, in this work we assume a simple coding scheme in which two bits of data are directly mapped to one DNA base (e.g., 00 = A, 01 = C, 10 = G, 11 = T), which achieves the maximum information density.
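For illustration, a minimal sketch of this direct 2-bits-per-base mapping (the bit-to-base assignment is the one given above; the helper names are ours):

    BIT_PAIR_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
    BASE_TO_BIT_PAIR = {base: pair for pair, base in BIT_PAIR_TO_BASE.items()}

    def bits_to_bases(bits):
        # Encode an even-length bit string into DNA bases, two bits per base.
        assert len(bits) % 2 == 0
        return "".join(BIT_PAIR_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))

    def bases_to_bits(strand):
        # Decode a DNA string back into the original bit string.
        return "".join(BASE_TO_BIT_PAIR[base] for base in strand)

    # Example: the byte 00011011 maps to the four bases ACGT.
    assert bits_to_bases("00011011") == "ACGT"
    assert bases_to_bits("ACGT") == "00011011"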

Once encoded into DNA strings, chunks belonging to different files are tagged with different special sequences, known as primers, with one primer prepended at the beginning and the other appended at the end of each molecule. All chunks belonging to the same file are tagged with the same pair of primers. The primers are used as parameters of the PCR reaction, which essentially provides a chemical data lookup mechanism. Each pair of primers logically represents a key in a key-value store (Bornholt et al., 2016) and enables random access within large volumes of data (Organick et al., 2018; Yazdi et al., 2015). The final DNA strings are synthesized into molecules, usually with millions of copies of each.

The first big step in retrieving a file is isolating the molecules with the correct primer pair through the process of selective amplification (PCR reaction). The isolated molecules are read using one of many available sequencing methods, which have different accuracy, throughput, and cost characteristics. The outcome of the sequencing process is a large collection of DNA strands (reads). The average number of reads per original molecule represents an important metric called sequencing coverage. A higher coverage implies higher chances of correctly reconstructing the original strands. Unfortunately, the sequencing coverage is directly proportional to the sequencing cost. Therefore, minimizing the required sequencing coverage is crucial to reducing the cost of reading from DNA.

After filtering out the reads with incorrect primers, the remaining reads must be clustered (Rashtchian et al., 2017) based on similarity, so that reads corresponding to different data chunks can be disambiguated. The similarity metric of interest is usually assumed to be edit distance (Organick et al., 2018; Rashtchian et al., 2017), which corresponds to the minimum number of operations (insertions, deletions, substitutions) required to convert one string into another. With each cluster containing a number of noisy copies of a single chunk of the original data, the next big step is to find the consensus among the noisy copies and produce the best estimate as to what the original strand was. This important step is the subject of the next section. After identifying the most probable original strand for each data chunk (cluster), the next step is to convert each DNA strand back into the binary form and reassemble the chunks into a single file using the ordering information encoded in every chunk. Error correction can be performed either before (Yazdi et al., 2015, 2017), after (Organick et al., 2018), or during (Bornholt et al., 2016; Goldman et al., 2013) the reassembly, with the prevailing approach described below.
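As a reference point, the similarity metric itself is the textbook dynamic program below (a sketch; production pipelines use optimized or approximate variants for clustering at scale):

    def edit_distance(a, b):
        # Minimum number of insertions, deletions, and substitutions turning a into b.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            cur = [i]
            for j, cb in enumerate(b, start=1):
                cur.append(min(prev[j] + 1,                  # delete ca
                               cur[j - 1] + 1,               # insert cb
                               prev[j - 1] + (ca != cb)))    # substitute or match
            prev = cur
        return prev[-1]

    # e.g., edit_distance("ACGT", "AGT") == 1 (one deletion)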

2.2. Error Correction in DNA Storage

Figure 1. DNA Storage Architecture with Reed-Solomon ECC (Organick et al., 2018)

Figure 1 shows the state-of-the-art DNA storage architecture (Organick et al., 2018) with Reed-Solomon error-correcting codes. In this architecture, data is laid out into a matrix, with every DNA molecule representing a column in the matrix, while each row is a Reed-Solomon codeword. The whole matrix represents a unit of encoding/decoding. The key parameter of Reed-Solomon codes is the symbol size: for a symbol size of b bits, the codeword must have 2^b - 1 symbols (the entire codeword thus has b(2^b - 1) bits). Based on the desired error-correction capabilities, the symbols in a codeword are split between k data symbols and r redundant symbols, with k + r = 2^b - 1; in each encoded unit there are k data molecules and r redundancy molecules for error correction. As explained in the previous subsection, each molecule must have an index required for ordering. Note that the redundant symbols are synthesized into separate DNA molecules and therefore they need to be ordered as well, so that they can be placed in the correct column within the matrix at decoding time. The consequence of this is that the ordering information cannot be protected by error correction. The ordering index must contain at least log2(k + r) bits, which is log2(k + r)/2 bases. Given that k + r = 2^b - 1, the index will have b bits or b/2 nucleotide bases, which equals the size of the symbol. In the example in Figure 1, the indexes are placed at the beginning of each molecule, while the data symbols are mapped column-wise, molecule by molecule.

A Reed-Solomon codeword with r redundant symbols can correct up to r erasures within a codeword, or detect and correct up to r/2 errors. Erasures are types of errors that happen when data is missing or wrong but the location of the missing/wrong data is known. Erasures are generally less costly because they do not require detection; in this architecture they usually happen in the case of clustering errors or the loss of entire clusters: when the cluster at index i is missing, an erasure is present in every row of the matrix at location i. It is important to note that in this architecture, errors such as insertions and deletions within each molecule (column) are detected as substitution errors in the corresponding codewords (rows) and corrected as such.
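A small sketch of the layout arithmetic described above, using standard Reed-Solomon properties (the symbol names b, k, and r follow the description; the function and the redundancy fraction below are only an illustration):

    def rs_layout(symbol_bits, redundancy_fraction):
        # Encoding-matrix parameters for a given Reed-Solomon symbol size.
        n = 2 ** symbol_bits - 1            # symbols per codeword: k + r = 2^b - 1
        r = int(n * redundancy_fraction)    # redundant symbols = redundancy molecules
        k = n - r                           # data symbols = data molecules
        return {
            "symbols_per_codeword": n,
            "data_molecules": k,
            "redundancy_molecules": r,
            "index_bases_per_molecule": symbol_bits // 2,   # b bits = b/2 bases
            "correctable_erasures_per_codeword": r,         # up to r erasures
            "correctable_errors_per_codeword": r // 2,      # or up to r/2 corrected errors
        }

    # Illustration with 16-bit symbols: 65,535 symbols per codeword.
    print(rs_layout(symbol_bits=16, redundancy_fraction=0.184))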

Note that the data layout in Figure 1 is similar to the data layout in a RAID disk setup (Patterson et al., 1988). In our case a column represents a DNA molecule, whereas in conventional storage a column would represent a sector. A sector failure would constitute an erasure, which in the DNA storage context maps to a failure to read a particular DNA strand.

3. Consensus Finding and Reliability Skew

Figure 2. Simplified example of finding a consensus string

After clustering DNA reads based on edit distance, each cluster is separately processed to find the most likely original DNA sequence corresponding to the noisy reads within that cluster. We can formulate this problem as follows: Let S be an unknown string of length L over the alphabet Σ = {A, C, T, G}. We are given N noisy copies of S, each generated independently by distorting S at each position with probability p, where N corresponds to the sequencing coverage. In other words, for each position i, we either delete the i-th character of S (deletion), or insert a character chosen uniformly at random from Σ at that position (insertion), or replace the i-th character of S with another character from the alphabet (substitution), or keep it unchanged. For the sake of simplicity, we assume that each of the error types occurs with probability p/3, but our model can be easily generalized to support different probabilities for each error type. Given these N noisy copies, the goal is to reconstruct S. More formally, the goal is to find a string of length L such that the sum of edit distances to all N inputs is minimal across all strings of length L. A string that minimizes the sum of edit distances to the given inputs across all strings (not constrained to a given length) is known as the edit distance median. The problem of finding the edit distance median is known to be NP-complete (Nicolas and Rivals, 2004), and so is our problem of finding the median of length L, which we refer to as the constrained edit distance median.
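Writing d_E for edit distance, Σ^L for the set of length-L strings over Σ, and R_1, ..., R_N for the noisy reads, the objective can be stated compactly as:

    Ŝ = argmin over x ∈ Σ^L of  d_E(x, R_1) + d_E(x, R_2) + ... + d_E(x, R_N).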

When the input strings originate from the same original string, the task belongs to a class of problems in information theory commonly referred to as trace reconstruction problems. The study of a variation of this problem was initiated by Batu et al. (Batu et al., 2004), considering only binary deletion channels. They proposed a simple algorithm called Bitwise Majority Alignment and showed that it can be used to exactly reconstruct the original string with high probability using O(log n) noisy copies when the error rate is O(1/log n), for strings of length n. A subsequent work (Kannan and McGregor, 2005) extended this result to binary channels with both insertions and deletions for an error rate of O(1/log^2 n). Viswanathan and Swaminathan (Viswanathan and Swaminathan, 2008) improved this result by presenting an algorithm that needs the same number of traces for insertion and deletion probabilities of O(1/log n). Several researchers have advanced these results further, as explained in a recent survey (Bhardwaj et al., 2020), proposing lower and upper bounds for the number of traces necessary and sufficient for reconstruction with high probability (Krishnamurthy et al., 2019; Duda et al., 2016). It is important to note that while all these important theoretical findings establish the relationship between the length of the string (n), the error rate (p), and the probability of successful exact reconstruction of the entire string, they have not looked at the probability of reconstruction of individual characters within the string and how it relates to the characters' positions in the string.

3.1. Reliability Skew

To give an intuition as to why the reliability skew exists and why the length of the original strand poses a challenge to the reconstruction, let’s consider an example in Figure 2 with five noisy copies of an original sequence. When the noisy copies contain only substitutions (Figure 2a), we can perform a simple majority vote for each position independently and correctly reconstruct the original string even when the coverage (number of inputs) is small and the error rate is high. However, if we allow insertions and deletions to happen (Figure 2b), we cannot apply such a simple reconstruction procedure. First, the copies can have different lengths. Second, the characters may not be in their original positions even if all strings were of the same, desired length. To perform the reconstruction, we must place the characters in each string into their original positions. Because the characters at the beginning and the end cannot be misplaced by too many positions, it is best to start fixing the misplaced characters from each end of the sequence.

Looking at the first column in Figure 2b, we see that A is the most likely character and T is an outlier in the first string. We can safely assume that the first character of the original sequence is A, but to continue using the first string we must understand what kind of distortion it has suffered and try to undo it to guarantee the correct placement of characters. Because the second and third characters are CG in most of the sequences, including the first one, we can assume that this was most likely a case of substitution. So we correct the first string by converting T to A and move forward by one position (Figure 2c).

Figure 3. Probability of correctly identifying a base in the string based on its position.
Figure 4. Probability of correctly identifying a base in the string based on its position in case of a 2-way reconstruction.

Looking at the second column, we can conclude that the consensus character is C (Figure 2c), with G being an outlier in the second string. Observing that the next two characters in all strings are GT except for the second one (TA), we can conclude that the second string likely suffered from a deletion of C. We therefore fix the second string by inserting C before G and move on to the next position (Figure 2d). In the third column, we can see that the consensus character is G with A being an outlier in the last string. To fix the error, we look at the next two characters, which are GT for the last string and TA for all other strings. We can assume that A was likely inserted before G in the last string. We fix the error and move on (Figure 2e).

This example illustrates the key problem with consensus finding algorithms: whenever we encounter an outlier, we have to make an assumption as to what the error was and then correct the error based on that assumption. If our assumption is not correct, the original error plus any error introduced by our mis-correction will propagate to the next position. As we advance towards the end of the sequence, the errors accumulate and our ability to find a consensus diminishes. This phenomenon is illustrated in Figure 3, which shows the probability of correctly identifying a nucleotide base based on its position within the string. As we can see, the error probability increases sharply as the base index increases. An obvious implication is that longer strings will have higher maximum error probability. The problem illustrated in Figure 3 is a direct result of the fact that placing a base in its correct position cannot be done independently of other bases when deletions and insertions are present; this was not the case in Figure 2a, which assumes that only substitutions can happen.

Fortunately, the problem of consensus finding for linear structures is symmetric; we can align all the strings to the right and start the reconstruction process from the other end, as illustrated in Figure 2f. In this case, we can use only the first (i.e., better) half of the string reconstructed from left to right, and the other half from the string reconstructed from right to left, to create the final consensus string as the best of both worlds. Figure 4 illustrates the probability of correctly identifying a base based on its position after applying the described 2-way reconstruction procedure. As we can see, the probability of error is low at the ends of the string, gradually increases towards the middle, and is highest in the middle of the string.
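The sketch below captures the spirit of this procedure (illustration only, not the exact algorithm used in production pipelines; the fixed two-character lookahead and the function names are ours):

    from collections import Counter

    def one_way_consensus(reads, L, lookahead=2):
        # Simplified left-to-right consensus finding. At each output position we
        # majority-vote the characters under the per-read cursors; a disagreeing
        # read is classified, via a short lookahead, as a substitution (advance
        # its cursor by 1), an insertion (advance by 2, skipping the inserted
        # character), or a deletion (do not advance).
        cursors = [0] * len(reads)
        consensus = []
        for _ in range(L):
            current = [(r, p) for r, p in zip(reads, cursors) if p < len(r)]
            if not current:
                break
            c = Counter(r[p] for r, p in current).most_common(1)[0][0]
            ref = Counter(r[p + 1:p + 1 + lookahead] for r, p in current).most_common(1)[0][0]
            consensus.append(c)
            for i, (r, p) in enumerate(zip(reads, cursors)):
                if p >= len(r):
                    continue
                if r[p] == c:
                    cursors[i] = p + 1        # agreement: consume one character
                elif r[p + 1:p + 1 + lookahead] == ref:
                    cursors[i] = p + 1        # looks like a substitution
                elif r[p + 2:p + 2 + lookahead] == ref:
                    cursors[i] = p + 2        # looks like an insertion
                # otherwise assume a deletion and leave the cursor in place
        return "".join(consensus)

    def two_way_consensus(reads, L):
        # Reconstruct from both ends and keep the better half from each direction.
        left = one_way_consensus(reads, L)
        right = one_way_consensus([r[::-1] for r in reads], L)[::-1]
        half = L // 2
        return left[:half] + right[-(L - half):]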

The consensus finding approach explained above, and different variations of it, are commonly used in DNA data storage pipelines (Organick et al., 2018). However, recent work on trace reconstruction has introduced other practical algorithms with various definitions and optimization criteria. Most notably, Sabary et al. (Sabary et al., 2020) have proposed an iterative reconstruction algorithm that solves the DNA Reconstruction Problem with the goal of finding an estimate with minimal edit distance from the original string. This algorithm outperforms other algorithms used in practice in terms of accuracy (Sabary et al., 2020). Although this algorithm does not follow the two-sided approach explained above, the skew is still present for various parameters, as shown in Figure 5, except in the case where only substitutions are present (brown line). The shape of the skew is the same, with the peak being higher for smaller coverage and/or larger error rates. (The experiments were performed using the available code; the algorithm occasionally produces a result of incorrect length, and we excluded such strands when plotting Figure 5.)

In Figure 5, the top four lines (as per the legend) assume equal distribution of substitutions, deletions, and insertions (1/3rd each). For the last two lines, purple and brown, we change the breakdown of the three error types. Note that in case of the brown line (10% substitutions, no indels), there is no observable skew and the algorithm easily reconstructs the data. This is expected, as substitutions alone, similarly to bit-flips, do not create any skew. In contrast, the orange line with the same error rate of 10% but with equal representation of the error types, shows significant skew. In fact, even a 2x lower error rate (5%) with equal representation of the error types (blue line) shows some skew in the middle and presents a bigger challenge for reconstruction compared to 10% substitution errors (brown).

Let us now compare the green and purple lines, which contain the same number of insertions and deletions (indels), while the green line additionally has 5% substitution errors. While a substitution-only error rate of 10% has no impact on the skew (brown curve), the addition of just 5% substitution errors has a significant impact in the presence of indels, as measured by the difference between the purple and green lines. In conclusion, while in isolation substitutions do not create any skew and are easy to correct, they amplify the skew and complicate the reconstruction in the presence of indels. Interestingly, adding an extra strand (red line) has a similarly strong impact as removing substitutions (purple line).

Figure 5. Reliability skew observed in the state-of-the-art trace reconstruction algorithm (Sabary et al., 2020).

3.2. Fundamental Nature of the Reliability Skew

A natural question to ask is whether the observed skew is solely a result of using a particular algorithm, or whether it is an unavoidable property of trace reconstruction in the presence of insertions and deletions. To answer this question empirically, we experiment with strings of small length, for which optimal trace reconstruction is possible in reasonable time through a brute-force search for an edit distance median of the target length. If we can observe the error peak in the output of optimal algorithms, it can be seen as evidence that the skew is inherent to all optimal reconstruction algorithms.

Without loss of generality, we assume a binary alphabet ({0, 1} instead of {A, C, G, T}). For practical reasons, we limit the length of the original bit string to 20 bits. For a large number of input bit strings of length 20, we construct a noisy cluster of size N through insertions, deletions, and substitutions with a total probability of p=20%. We then find all strings of length L such that the sum of edit distances to all strings in the cluster is minimal. Importantly, if there are multiple such strings, we pick one in an adversarial manner, such that the selected string is more accurate towards the middle than towards the ends (compared to the original string), in an effort to create a skew opposite to the one we expect to see.
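A sketch of this brute-force search and the adversarial tie-breaking (simplified; edit_distance is the dynamic program shown earlier, and the weighting inside adversarial_pick is one possible way to prefer middle-accurate candidates):

    from itertools import product

    def all_length_L_medians(cluster, L):
        # Every binary string of length L minimizing the summed edit distance
        # to the noisy cluster (feasible only for small L, e.g. L = 20).
        best_cost, best = float("inf"), []
        for bits in product("01", repeat=L):
            s = "".join(bits)
            cost = sum(edit_distance(s, read) for read in cluster)
            if cost < best_cost:
                best_cost, best = cost, [s]
            elif cost == best_cost:
                best.append(s)
        return best

    def adversarial_pick(candidates, original):
        # Among equally good medians, prefer the one that agrees with the original
        # in the middle rather than at the ends, working against the expected skew.
        L = len(original)
        def middle_weighted_agreement(s):
            return sum((s[i] == original[i]) * min(i + 1, L - i) for i in range(L))
        return max(candidates, key=middle_weighted_agreement)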

Figure 6. The probability of incorrectly reconstructing a bit as a function of its position in a bit string of length L=20 for N=2, 4, 8 and 16 and probability of error p=20%.

Figure 6 shows the probability of incorrectly reconstructing a bit as a function of its position in a bit string of length L=20, for N=2, 4, 8 and 16 and a probability of error p=20%. Interestingly, while a higher number of reads reduces the peak error probability, the shape of the curve does not change significantly. We can see that even when an optimal algorithm uses all available degrees of freedom to reverse the expected bias, the reliability skew is still present and significant.

3.3. Implications on Reliability and Future Trends

The shape of the error-probability curve as a function of base position has profound implications for reliability. The bases at the beginning and the end of each molecule are reliable places to store data, whereas the bases in the middle area are significantly less reliable. The trends in DNA sequencing (reading) and synthesis (writing) suggest that the skew in reliability between different positions will have even more significant consequences in the future. First, the synthesis process improves over time, producing longer molecules (Ceze et al., 2019). Longer molecules are desirable because they result in proportionately fewer data chunks per file, reducing the overheads of primers and ordering indexes. However, longer molecules make the problem of consensus finding more challenging and significantly exacerbate the reliability skew. Second, new sequencing technologies using nanopores dramatically reduce reading costs (Jain et al., 2016), but introduce much higher (an order of magnitude) error rates, significantly complicating the consensus finding step (Yazdi et al., 2017; Duda et al., 2016) and resulting in even steeper error probability curves. Finally, lower sequencing coverage is desirable as it directly reduces the cost of sequencing (Organick et al., 2018). However, lower coverage implies significantly harder and less accurate consensus finding. All these trends suggest that the reliability bias is likely to increase in the future, and DNA-based storage systems must be aware of the inherent reliability skew to avoid significant over-provisioning of error-correction resources.

4. Flattening the Curve

Figure 7. Example of an architecture with uneven provisioning of redundancy

4.1. Unequal Error Correction

Given the reliability bias in DNA storage, one may be tempted to apply unequal error correction, since DNA storage allows for high-precision ECC tuning. Although the complex error-correction mechanism depicted in Figure 1 requires an entire matrix to be encoded as a unit, it is still possible to provide a custom amount of redundancy for each codeword (row) within the matrix, with high precision. Figure 7 demonstrates what such uneven ECC would look like: the most reliable locations in all molecules (the first and the last row) receive the least amount of redundancy, while the rows in the middle receive significantly more. In this case there is no longer a clear separation between data-only and redundancy-only molecules. While most of the molecules still contain only data symbols, and some molecules may still contain only redundancy symbols, a number of molecules contain a mix of data and redundancy.

While redundancy can be provisioned in every row in a very precise manner, there is no way to know how much redundancy each row should receive in advance. The desired variability in redundancy may change when the sequencing method changes, or even when the coverage changes. As shown in Figure 5, increasing the sequencing coverage from 5 to 6 may change the magnitude of the skew by 2x, and per-strand coverage is not possible to control (Organick et al., 2018). Yet, to implement unequal redundancy, we would have to assume a particular skew curve and fix the redundancy in each row at the time of encoding, which clearly is not a solution that can stand the test of time, given that DNA is a durable, archival storage medium that lasts for thousands of years (Grass et al., 2015) and the sequencing methods are more than likely to change multiple times during the lifetime of data.

Furthermore, even if we had perfect knowledge of the sequencing technology and the protocol to be used at the time of reading, and even if we somehow knew the target sequencing coverage and the exact algorithm to be used for consensus finding, the unequal redundancy approach would still have serious problems, because coverage is never fixed across all clusters. Instead, coverage follows a Gamma distribution (Organick et al., 2018), with significant variation in size across individual clusters. As such, despite having the desired average coverage, some clusters will have more reads and some fewer, the level of skew in individual clusters will differ, and very few of them will have the exact average coverage for which we statically provisioned the skew at the time of encoding.

4.2. Gini

The baseline architecture (Organick et al., 2018) depicted in Figure 1 provides great protection against erasures, i.e., the losses of entire molecules during sequencing: when a molecule is lost, every codeword sees a single erasure. However, ordinary deletions and insertions will accumulate a great deal of error in the middle of each molecule (column). As a result, the rows in the middle, each of which maps to an ECC codeword, will suffer many more errors compared to the codewords towards the ends.

Figure 8. Codeword interleaving in Gini

To propose an effective solution, we take inspiration from coding theory in mobile communications, where a similar problem of positional error bias across both rows and columns of the encoding matrix has been observed (Yi and Lee, 1997). We make the observation that, while we cannot control the spatial distribution of errors, we can control the impact of such errors by carefully defining our codewords. Instead of distributing codewords across rows of the encoding matrix, we can distribute them diagonally, as shown in Figure 8a. Since the number of columns is usually orders of magnitude larger than the number of rows, each codeword will wrap around the matrix many times, evenly cycling through all positions in all molecules. Consequently, the errors in the middle will be equally distributed across all codewords, unlike in the baseline, where all errors coming from the middle of every molecule are concentrated in the same codeword.

Gini removes all positional reliability bias in DNA storage by spreading the impact of errors evenly across a large group of ECC codewords. When it comes to erasures, Gini matches the capabilities of the baseline, as every symbol in every molecule belongs to a different codeword. Note that for this to happen, we must ensure that when wrapping a diagonal codeword around the matrix, we continue from the next column, as shown in Figure 8a. Also note that we can decide to exclude arbitrary rows from this interleaving. For example, we could exclude the first and last rows and reserve them for very important data, treating them as separate codewords, while the rest of the codewords are interleaved across the remaining rows, as shown in Figure 8b, where we essentially create two reliability classes.
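One way to express this diagonal assignment in code (a sketch consistent with our reading of Figure 8a; the function names are illustrative):

    def gini_codeword_of(row, col, num_rows):
        # Diagonal interleaving: matrix position (row, col) belongs to codeword
        # (col - row) mod num_rows, so each codeword takes exactly one symbol from
        # every molecule (column) and cycles evenly through all row positions.
        return (col - row) % num_rows

    def gini_position_of(codeword, symbol_index, num_rows):
        # Inverse mapping: the symbol_index-th symbol of a codeword sits in column
        # symbol_index, on the diagonal that wraps to row 0 and continues from the
        # next column.
        return ((symbol_index - codeword) % num_rows, symbol_index)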

Gini does not change the number of errors that take place, but simply redistributes them in a way that equally impacts every codeword, allowing all errors to have a similar probability of successful correction, regardless of their spatial origin. Gini can be used to improve the reliability of the system and reduce the number of copies of each molecule that must be read by a sequencer, leading to commensurate savings in the reading cost. Gini can also be used to reduce the amount of error-correction resources while keeping the reliability constant, leading to savings in both DNA reads and writes. Gini is applicable to any type of data, requires no storage overhead and can be easily integrated into state-of-the-art DNA storage architectures.

5. Application-Aware Data Mapping

Figure 9. Priority-based mapping of data onto Reed-Solomon symbols with DnaMapper

In this section, we describe how we can leverage the skew in reliability across different parts of DNA molecules on the one hand, and the skew in reliability needs across different data bits on the other, to produce an optimal mapping of data onto molecules without the need to remove the bias in the storage medium.

5.1. General Framework

Given data bits of known reliability needs, and given storage cells of known reliability properties, what is the optimal mapping of bits to cells that maximizes the retrieved data quality? Interestingly, the answer to this question does not depend on any absolute values indicating the reliability needs of bits, or any absolute values indicating the reliability properties of storage cells. It also does not matter whether cell A is 3x more reliable than cell B, or only 2x. It is easy to show (proof by contrapositive) that the optimal mapping is always the one in which the bit with the highest reliability needs is mapped to the cell of highest reliability, the bit with the next highest needs is mapped to the cell of next highest reliability, and so on. In other words, only the ranking of cells by reliability and the ranking of bits by reliability needs matter. This is of crucial importance in our context, because unlike the amplitude of the skew, the ranking of bases by reliability in DNA storage can be statically determined and does not depend on the retrieval process.
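In code, this rank-matching argument amounts to nothing more than sorting both sides (a sketch; the priority and reliability scores are whatever quantities induce the two rankings):

    def rank_based_mapping(bit_priorities, cell_reliabilities):
        # Map the i-th most important bit to the i-th most reliable cell.
        # Only the two orderings matter, not the magnitudes of the scores.
        bits_by_need = sorted(range(len(bit_priorities)),
                              key=lambda b: bit_priorities[b], reverse=True)
        cells_by_reliability = sorted(range(len(cell_reliabilities)),
                                      key=lambda c: cell_reliabilities[c], reverse=True)
        return dict(zip(bits_by_need, cells_by_reliability))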

5.2. Mapping data onto DNA molecules

As described in Section 2, each file is split into chunks of fixed size and each chunk is mapped into a separate molecule. Given k + r molecules, b bits (which is b/2 bases) must be reserved within each molecule for the ordering information (index), and the data bits are mapped onto the remaining positions.

5.2.1. Baseline Mapping

The baseline mapping of data onto molecules is shown in Figure 1. We place the first chunk of data into molecule 0, the next chunk of data into molecule 1, and so on. The last chunk of data is placed into molecule k-1, and the remaining r molecules are redundant.

5.2.2. Priority-Based Mapping

Recall that the bases at the beginning and the end of DNA molecules represent reliable data locations, whereas the bases in the middle are unreliable. Due to symmetry, each position in the molecule has a corresponding position of the same reliability. Because the ordering information is of utmost importance, we place it in the most reliable part of each molecule, which is the first (or last) location. Note that the index placement in this scheme happens to be the same as in the baseline mapping. The next most reliable locations are the last bases of each molecule. We therefore stripe the 2(k+r) most important data bits across the molecules, placing them in the next most reliable set of bases, which is the last base of each molecule (two bits per base). The next 2(k+r) bits are placed in the second position of each molecule, next to the index information. The rest of the bits are placed in a zig-zag fashion, as shown in Figure 9. Note that regardless of how data is mapped into the matrix, the redundancy symbols are created by the Reed-Solomon encoder for each row after the mapping is done, and no remapping is performed on those symbols. Once the redundancy symbols are created, every symbol in the matrix is encoded into DNA bases and each column is synthesized into a molecule.
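A sketch of the resulting zig-zag order of rows, from most to least reliable (this follows our reading of Figure 9; the helper name is ours):

    def zigzag_row_order(num_rows):
        # Rows of the encoding matrix ordered from most to least reliable:
        # start at both ends of the molecule and walk towards the middle.
        order, lo, hi = [], 0, num_rows - 1
        while lo <= hi:
            order.append(lo)
            if hi != lo:
                order.append(hi)
            lo, hi = lo + 1, hi - 1
        return order

    # e.g. zigzag_row_order(6) == [0, 5, 1, 4, 2, 3]: row 0 holds the index, the
    # most important data goes to the last row, then row 1, then the second-to-last
    # row, and so on towards the middle.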

5.3. Determining Priority of Bits

Priority of bits within a file can be determined for data types that have a notion and a metric of quality, such as images and videos. Previous work has proposed techniques for classifying bits into reliability classes based on the amount of damage that is caused by corrupting a bit in a given class, for progressively encoded images (Guo et al., 2016) as well as for H.264 videos (Jevdjic et al., 2017). Different classes of bits are then stored separately according to their reliability needs. These techniques require additional metadata about the placement of the bits to be encoded into the storage substrate so that the files can be reconstructed from bits stored in different locations, and such metadata has to be stored in the most reliable locations (Guo et al., 2016; Jevdjic et al., 2017).

While the techniques proposed by prior work are a great use case for our system, they are not open-sourced, and they are quite complex, so we leave their integration for future work. In this work, we propose a very simple and effective technique for determining the importance of bits for standard JPEG images that does not require any additional metadata to be stored. The basic idea comes from two simple observations:

  • Pixels in JPEG images are grouped into small encoding units. Each unit is placed in the JPEG file such that it depends only on the previously encoded units.

  • JPEG uses highly efficient but error-prone entropy coding. Corrupting a bit in a JPEG file may confuse the entropy coder in a way that precludes the decoding of the subsequent bits.

These two observations imply that bits that come earlier in the file tend to need more reliable storage than bits that come later. To validate this observation, we profile a JPEG image by flipping one bit at a time, decoding the resulting image, and measuring the quality loss with respect to the original image. Figure 10 shows the PSNR quality loss in decibels as a function of the position of the corrupted bit in the file. We can see that the maximum loss is incurred by corrupting bits at the beginning of the file, and the minimum loss by corrupting bits at the end. Based on these observations, we simply prioritize bits according to their location in the image file.
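A sketch of this profiling loop (assuming Pillow and NumPy; the sampling stride, the treatment of undecodable outputs, and the function names are our own choices, not part of the profiling methodology described above):

    import io
    import numpy as np
    from PIL import Image

    def psnr(a, b):
        # Peak signal-to-noise ratio in dB between two uint8 images of equal shape.
        mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
        return float("inf") if mse == 0 else 10 * np.log10(255.0 ** 2 / mse)

    def profile_bit_importance(jpeg_bytes, stride=4096):
        # For a sample of bit positions, flip the bit, decode, and record the PSNR
        # of the corrupted decode against the clean decode (lower PSNR = more
        # damage); undecodable or resized outputs count as maximal damage.
        clean = np.asarray(Image.open(io.BytesIO(jpeg_bytes)).convert("RGB"))
        damage = []
        for bit in range(0, len(jpeg_bytes) * 8, stride):
            corrupted = bytearray(jpeg_bytes)
            corrupted[bit // 8] ^= 1 << (bit % 8)
            try:
                img = np.asarray(Image.open(io.BytesIO(bytes(corrupted))).convert("RGB"))
                score = psnr(clean, img) if img.shape == clean.shape else 0.0
            except Exception:
                score = 0.0
            damage.append((bit, score))
        return damage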

Figure 10. PSNR quality loss in dB as a function of the corrupted bit position

5.4. Using DNAMapper

In contrast to unequal error correction, DnaMapper does not require knowledge of the exact magnitude of the reliability skew; given the ranking of data bits by their reliability needs, DnaMapper only requires the ranking of DNA storage locations by reliability (which can be easily established and does not change with the technology) to optimally map data to DNA. DnaMapper can be used to achieve graceful quality degradation in the presence of high error rates, or to implement the concept of approximate storage. DnaMapper retrieves data (e.g., images or videos) using the minimum amount of resources required to reconstruct the data at the desired quality level. Similarly, if the invested resources are not sufficient to fully recover the data, or if the retrieval process introduces more errors than can be corrected, DnaMapper is still capable of retrieving useful data of sufficiently high quality. In other words, as the noise level increases, the quality of DnaMapper-stored data degrades gracefully, still yielding useful data, but of gradually lower quality.

6. Methodology

6.1. Simulation

To evaluate the proposed techniques with realistic system parameters, we perform our evaluation using simulation. Because simulation is not financially limited by the cost of synthesis, we assume longer DNA strands of up to 750 bases and a set of large files of variable sizes. We put together a group of 10 images of different resolutions and qualities, whose sizes vary between 5KB and 1.5MB. All images are encrypted, and the total size of all files is 8.7MB. We encode all the files into the same encoding unit (matrix) to demonstrate how files of different sizes can be mixed in a practical manner while remaining compatible with both Gini and DnaMapper. For the sake of completeness, an additional file containing the names and sizes of all files is encoded together with the files and acts as a directory; in the case of DnaMapper it is given the highest priority.

6.1.1. Storage Architecture.

Out of the 750 bases in a DNA strand, 40 bases are reserved for a pair of access primers and 8 bases (equivalent to 16 bits) for the ordering information, while the remaining 656 bases are used for data. We use Reed-Solomon codes with 16-bit symbols and 2^16 - 1 = 65535 symbols per codeword, as in the prominent DNA storage demonstrations (Organick et al., 2018). Each DNA molecule can hold exactly 82 Reed-Solomon symbols, and therefore our Reed-Solomon matrix has 82 rows and holds up to 10.5MB of data and redundancy. We use 18.4% of the symbols in each codeword for redundancy, leaving us with 8.7MB of pure data per unit of encoding.
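For concreteness, the capacity arithmetic works out roughly as follows (rounded to the figures quoted above): 656 data bases per molecule × 2 bits per base = 1,312 bits = 82 16-bit Reed-Solomon symbols per column; with 18.4% of the 65,535 columns used for redundancy, roughly 53,477 columns carry data, and 53,477 columns × 82 symbols × 2 bytes ≈ 8.77MB ≈ 8.7MB of pure data per encoding unit.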

We evaluate three techniques: 1) the baseline architecture (Organick et al., 2018), which is unaware of the skew; 2) Gini, which interleaves the codewords; and 3) the priority-based mapping scheme described in Section 5 (DnaMapper), where the priority of a bit is approximated by its position in the image file. In the case of priority mapping, we run into an interesting problem of how to rank the bits by reliability when we have multiple files of different sizes. Among the few heuristics we tried, the following one turned out to be the fairest and best performing: given N classes of reliability (i.e., N rows in the matrix), we give each file a fraction of storage in each reliability class in proportion to the file size. This means that the high-order bits of all files are stored in the strongest reliability class, and the low-order bits of all files are in the weakest class. Under this heuristic, we noticed that the presence of errors affects all files similarly in terms of image quality loss, regardless of the image size. The only exception is the directory file, all of whose bits are given the highest priority.

6.1.2. Data retrieval and decoding.

We simulate the retrieval process by creating a number of copies of each encoded DNA string and injecting insertion, deletion, and substitution errors. The error rates of modern sequencing methods vary greatly, from around 1% (Organick et al., 2018) for high-end Illumina next-generation sequencers (NGS) to 12-15% (Duda et al., 2016) for low-cost nanopore-based sequencers (Jain et al., 2016). We simulate a spectrum of error rates to account for a variety of sequencing methods. We also vary the average coverage between 3 and 20 by creating an appropriate number of noisy copies. The decoding process starts with data clustering. Fortunately, since we know the source strand for each noisy copy, our data is perfectly clustered, which allows us to eliminate the effects of imperfect clustering algorithms. For consensus finding, we use the two-sided approach (Organick et al., 2018), as the other notable algorithm (Sabary et al., 2020) does not always produce an output of the desired length.
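A sketch of the error-injection step (assuming, as in Section 3, equal probability for insertions, deletions, and substitutions; the function and parameter names are ours):

    import random

    def noisy_copy(strand, p, alphabet="ACGT"):
        # Independently distort each position with total probability p,
        # split evenly between deletion, insertion, and substitution.
        out = []
        for base in strand:
            r = random.random()
            if r < p / 3:                      # deletion: drop the base
                continue
            elif r < 2 * p / 3:                # insertion: random base before the original
                out.append(random.choice(alphabet))
                out.append(base)
            elif r < p:                        # substitution: replace with a different base
                out.append(random.choice([b for b in alphabet if b != base]))
            else:
                out.append(base)               # no error at this position
        return "".join(out)

    def make_cluster(strand, coverage, p):
        # Generate `coverage` independent noisy reads of one strand.
        return [noisy_copy(strand, p) for _ in range(coverage)]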

To simulate different reading costs, we vary the coverage by generating a large pool of noisy strands for each DNA string. We start at a low coverage, and progressively add more strands from the pool. For every coverage point, we decode the reconstructed strands back into binary data, reassemble them into one piece, correct the errors, recombine the bits into individual files based on the directory information, decrypt every file, and finally evaluate the quality of the resulting images. We repeat this process 50 times for each data point and report the averages.

6.2. Wetlab validation

To validate our tool-chains end-to-end, we performed wetlab experiments in which we synthesized two small images into DNA using all three organizations (baseline, Gini, DnaMapper), and later retrieved them, sequenced them using NGS (at a 0.3% error rate), and successfully decoded all of them. Figure 15 (left) shows one of the successfully decoded photos. Note that the software toolchains we use for data encoding and decoding are identical for the simulated and wetlab experiments. The only difference is that when simulating, we provide synthetic input data with the desired magnitude of errors instead of the data that would come from sequencing. We present results only for simulation, since the impact of the proposed techniques at the ultra-low error rates of NGS is negligible.

7. Evaluation

7.1. Gini

Figure 11. Positional distribution of errors per codeword in baseline (red) and Gini (blue) at the error rate of 9% and sequencing coverage of 20.

Figure 11 shows the number of errors each codeword receives in the case of the baseline, where each codeword is a row in the matrix, and Gini, where each codeword is diagonally striped across the matrix. The experiment is done at an error rate of 9% and a sequencing coverage of 20. We can see that in the baseline, rows that are closer to the ends experience significantly fewer errors at the expense of the rows in the middle, with a prominent peak in the middle. In contrast, Gini's interleaving of codewords across both rows and columns ensures that every codeword experiences a similar number of errors, effectively flattening the curve and removing the bias. Note that the area under the curves, which corresponds to the total number of errors, is the same.

Figure 12. Minimum sequencing coverage required for error-free decoding as a function of error rate.

Because Gini removes the bias, it requires a lower sequencing coverage to retrieve the data exactly, without any errors, compared to the baseline, which must provision for the worst case. Figure 12 shows the minimum sequencing coverage (the lower the better) required for error-free decoding as a function of error rate, for the baseline and Gini. As we can see, Gini reduces the required sequencing coverage, and therefore the reading cost, by 20% for small error rates, and by up to 30% for higher error rates. Similarly, if the same sequencing coverage is used for both Gini and the baseline, Gini has significantly higher chances of exact, error-free decoding, increasing the reliability of the system.

Figure 13. Minimum sequencing coverage required for error-free decoding as a function of effective redundancy. The error rate is fixed at 9%.

To evaluate Gini's potential for savings in synthesis cost, we fix the error rate at 9% and gradually reduce the amount of Gini's error-correction resources until Gini matches the coverage of the baseline at that error rate (a coverage of 17). We simulate the reduction in error-correction resources by introducing erasures in a controllable manner, so that the effective redundancy is reduced. Figure 13 shows that Gini's redundancy can be reduced from 18.4% to only 6% while matching the coverage requirements of the baseline, which is a 67% reduction in redundancy and a 12.5% reduction in the entire synthesis cost.

7.2. Data Mapping

Figure 14. Quality loss of retrieved images as a function of coverage at various error rates. Solid lines denote the baseline data mapping, dashed lines denote DnaMapper, and dotted lines denote Gini.
Figure 15. Original image (left), 1.2dB loss (middle), and 7.1dB loss (right).

Figure 14 shows the quality loss in decibels for images retrieved from the simulated DNA storage system for the baseline data mapping, the proposed DnaMapper scheme, and Gini, while varying the coverage from 3 to 20. For very low error rates, all systems can successfully decode the files at any coverage. However, as we increase the error rate, the image quality degradation experienced by the baseline system increases sharply as we reduce coverage. In contrast, DnaMapper loses quality very gradually as coverage decreases. For example, at an error rate of 12% and a coverage of 13, the baseline experiences catastrophic data loss such that the image cannot be decoded, whereas DnaMapper loses only 0.3dB in quality, which is not even noticeable (a loss of up to 1dB is considered unnoticeable (Guo et al., 2016)). As a reference, Figure 15 shows examples of decoded images: the image on the left was successfully decoded with no errors, the image in the middle suffered a 1.2dB loss, and the last image suffered a 7.1dB loss. The gap between DnaMapper and the baseline increases with the error rate, leading to a 20-50% reduction in reading cost for the same quality target. As in the case of Gini, an identical analysis could be performed for redundancy savings when coverage is kept constant, but we omit it for brevity.

It is important to note that Gini (dotted lines) reduces the coverage needed for error-free decoding by flattening the error curve: as long as the number of errors stays below the threshold that the codewords can handle, every codeword is decoded without a single uncorrected error. However, the moment the coverage drops below that threshold, all codewords fail to decode at once, because Gini’s interleaving causes the threshold to be crossed in all codewords simultaneously. As a result, Gini can occasionally perform even worse than the baseline, which in the high-error regime can at least decode some of the rows that are far from the middle.
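This all-or-nothing behaviour can be seen with a toy example (the numbers below are illustrative, not measured): when the per-codeword error counts are flattened, every codeword sits near the mean, so a small drop in the per-codeword error budget t takes all of them past the threshold at once, whereas a biased profile keeps its low-error rows decodable.

    # Toy illustration of the decoding cliff (illustrative numbers, not data).
    def decodable_codewords(errors_per_codeword, t):
        """Count codewords a decoder with per-codeword error budget t recovers."""
        return sum(e <= t for e in errors_per_codeword)

    flattened = [10] * 8                      # Gini-like: every codeword near the mean
    biased    = [2, 4, 8, 18, 20, 16, 7, 5]   # baseline-like: peak in the middle, same total
    print(decodable_codewords(flattened, 10), decodable_codewords(biased, 10))  # 8 vs 5
    print(decodable_codewords(flattened, 9),  decodable_codewords(biased, 9))   # 0 vs 5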

In contrast to Gini, DnaMapper ensures graceful quality degradation as the level of noise increases, which is particularly valuable in scenarios where the noise levels cannot be predicted. It also allows us to trade quality for cost in a controllable manner. Compared to Gini, DnaMapper tends to suffer minor quality loss in the medium-noise range, as it does not flatten the error curve. For example, at an error rate of 12%, DnaMapper incurs a quality degradation of 0.03dB at a coverage of 14 and 0.1dB at a coverage of 13; in contrast, Gini retrieves the data error-free at a coverage of 14, but its output is no longer decodable at a coverage of 13.

Note that we use PSNR as the image quality metric because it is an objective metric, familiar to researchers outside the media-processing community. However, we believe that a subjective quality metric that accounts for user perception would be more relevant. The study of such metrics is beyond the scope of this work.

7.3. Evaluating the Bit Ranking Method

Figure 16 compares our simple bit-ranking heuristic against an oracle ranking. To determine the oracle ranking, we use the following brute-force method: we corrupt an image file by flipping one bit at a time and evaluate the quality loss in decibels (dB) using the peak signal-to-noise ratio (PSNR) as the quality metric. We then sort all the bits according to the quality loss, such that the bits that cause a higher quality loss when flipped are ranked higher and placed in positions of higher reliability. We benchmark the oracle approach on a medium-sized image file (300KB), for which computing the oracle ranking was feasible. In this experiment, we do not use any error correction.
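A condensed version of this brute-force procedure is sketched below. The decode callback and the PSNR helper are our own stand-ins for the image pipeline, and the sketch assumes decode always returns a pixel sequence of the same length, i.e., it does not handle flips that make the file undecodable.

    import math

    def psnr(reference, corrupted, max_val=255):
        """Peak signal-to-noise ratio between two equally sized pixel sequences."""
        mse = sum((a - b) ** 2 for a, b in zip(reference, corrupted)) / len(reference)
        return float('inf') if mse == 0 else 10 * math.log10(max_val ** 2 / mse)

    def oracle_ranking(image_bytes, decode):
        """Rank bit positions from most to least damaging when flipped."""
        reference = decode(image_bytes)            # pixels of the intact image
        losses = []
        for bit in range(len(image_bytes) * 8):
            corrupted = bytearray(image_bytes)
            corrupted[bit // 8] ^= 1 << (bit % 8)  # flip a single bit
            losses.append((psnr(reference, decode(bytes(corrupted))), bit))
        # lower PSNR after the flip = larger quality loss = higher ranking
        return [bit for _, bit in sorted(losses)]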

Note that the oracle method does not perform visibly better than our method, despite relying on a computationally expensive exhaustive search and requiring an unacceptable storage overhead to store the rankings. Errors in DNA storage are not independent of each other, nor is their impact on the quality of the decoded image independent. For example, the second of two consecutive errors in neighboring bits of an image is much less likely to affect the quality than the first one; this is something our oracle cannot capture. The presented “oracle” is thus the best approximation of the actual oracle that we were able to evaluate. We believe that more sophisticated bit-ranking methods can be developed for various types of data, but these are out of scope for this paper. Also note that our proof-of-concept ranking method incurs zero storage overhead, is extremely simple, and allows end-to-end encrypted files to be stored, as the bit ranking is determined without looking at the content of the images.

Figure 16. Comparison of the proposed reliability ranking method with the ideal oracle ranking.

8. Discussion and Related Work

Impact on locality. Although this work shuffles contiguous data bits across different molecules, it introduces no new locality trade-offs. The reason is that an encoding unit must be encoded, fetched, and decoded together, with all (or, erasures permitting, most) of its data present, regardless of whether the data is internally shuffled. Locality across encoding units is likewise unaffected by our techniques.

Breakdown of Error Types. A large-scale study found that in a typical data retrieval workflow that uses NGS, about 25-30% of errors are indels (insertions and deletions), with the rest being substitutions (Organick et al., 2018). The same study found that over 60% of errors are indels in nanopore-based workflows. Both workflows assume conventional DNA synthesis, where most of the time and resources are spent ensuring that every base is synthesized exactly once. The emerging enzymatic synthesis technology (Lee et al., 2019) relaxes this rule, which dramatically inflates the number of indels at the moment of synthesis (e.g., ACGT can be synthesized as AAACTT); we expect this to further exacerbate the skew problem.

Realistic Error Rates. DNA sequencing methods have been evolving in the direction of higher noise for the past 45 years. The oldest sequencing method still in use is Sanger sequencing from the 1970s; it remains the most accurate, but it is impractical as it correctly reads only DNA pools dominated by a single strand. Next-Generation Sequencing (NGS) appeared two decades ago with lower accuracy but much higher throughput. Nanopore-based methods that emerged more recently introduce orders of magnitude more errors, but bring other benefits such as low-cost, real-time sequencing. DNA-based storage at the right scale still needs many orders of magnitude of improvement in sequencing cost, latency, throughput, and environmental cost, and it is quite possible that the error rates of such methods will exceed 30% (Sabary et al., 2020), or be even higher if enzymatic synthesis (Lee et al., 2019) is used.

Relationship to Approximate Storage/Memory. Similarly to prior work (Guo et al., 2016; Jevdjic et al., 2017), DnaMapper uses heuristics to rank bits by their reliability needs for a given data type. However, all prior work on approximate storage/memories assumes a uniformly reliable storage medium that is deliberately engineered into reliability classes (e.g., by selectively reducing the DRAM refresh rate (Liu et al., 2011; Raha et al., 2017; Jung et al., 2016), or by adding different amounts of redundancy in Flash/PCM (Guo et al., 2016; Jevdjic et al., 2017)) to match the various reliability needs of the data. In contrast, DNA is the first substrate with an intrinsic and dynamically changing reliability skew, and DnaMapper is the first technique that maps data with varying reliability needs onto a substrate with varying reliability properties.

9. Conclusion

In this paper we made the novel observation that the probability of successfully recovering a given bit from any type of DNA-based storage system highly depends on the bit’s physical location within the DNA molecule. We showed that this reliability skew is fundamental to all DNA storage systems, and that it leads to highly inefficient use of error-correction resources and to higher synthesis and sequencing costs. We proposed two approaches to address the problem. The first approach, Gini, distributes errors across error-correction codewords in a way that equalizes their impact across codewords, without increasing the size of the encoding unit; it effectively removes the positional bias and reduces the associated costs. The second approach, DnaMapper, seeks to leverage the bias and relies on application-aware mapping of data onto DNA molecules, such that data that requires higher reliability is stored in more reliable locations, whereas data that needs less reliability is stored in less reliable parts of DNA molecules, reducing the cost of sequencing and providing graceful degradation. Both proposed mechanisms involve no storage overhead and can be directly integrated into any DNA storage pipeline.

Acknowledgements.
The authors would like to sincerely thank Lara Dolecek, Cyrus Rashtchian, and Sergey Yekhanin for many useful discussions that took place in the earlier stages of this work, as well as Cheng-Kai Lim for his immense help with wet-lab experiments. The authors also thank their shepherd, Hung-Wei Tseng, and all the other reviewers for their valuable feedback. This work was partially funded by the Advanced Research and Technology Innovation Centre (ARTIC) at the National University of Singapore, project FCT-RP1.

References

  • T. Batu, S. Kannan, S. Khanna, and A. McGregor (2004) Reconstructing strings from random traces. In Proceedings of the Fifteenth Annual ACM-SIAM Symposium on Discrete Algorithms, Cited by: §3.
  • C. Bee, Y. Chen, M. Queen, D. Ward, X. Liu, L. Organick, G. Seelig, K. Strauss, and L. Ceze (2021) Molecular-level similarity search brings computing to dna data storage. In Nature Communications, Cited by: 5th item.
  • V. Bhardwaj, P. Pevzner, C. Rashtchian, and Y. Safonova (2020) Trace reconstruction problems in computational biology. In IEEE Transactions on Information Theory, Cited by: §3.
  • J. Bornholt, R. Lopez, D. Carmean, L. Ceze, G. Seelig, and K. Strauss (2016) A dna-based archival storage system. In International Conference on Architectural Support for Programming Languages and Operating Systems, Cited by: 1st item, 3rd item, 4th item, §1, §1, §2.1, §2.1, §2.1, §2.1.
  • L. Ceze, J. Nivala, and K. Strauss (2019) Molecular digital data storage using dna. In Nature Reviews Genetics, Cited by: §1, §1, §3.3.
  • G. Church, Y. Gao, and S. Kosur (2013) Next-generation digital information storage in dna. In Nature, Cited by: §1.
  • J. Duda, W. Szpankowski, and A. Grama (2016) Fundamental bounds and approaches to sequence reconstruction from nanopore sequencers. In arXiv:1601.02420v1, Cited by: §3.3, §3, §6.1.2.
  • N. Goldman, P. Bertone, S. Chen, C. Dessimoz, E. LeProust, B. Sipos, and E. Birney (2013) Towards practical, high-capacity, low-maintenance information storage in synthesized dna. In Nature, Cited by: §1, §1, §2.1.
  • R. Grass, R. Heckel, M. Puddu, D. Paunescu, and W. Stark (2015) Robust chemical preservation of digital information on dna in silica with error-correcting codes. In Angewandte Chemie International Edition, Cited by: 2nd item, §1, §4.1.
  • K. Guo, K. Strauss, L. Ceze, and H. Malvar (2016) High-density image storage using approximate memory cells. In International Conference on Architectural Support for Programming Languages and Operating Systems, Cited by: 2nd item, §1, §1, §1, §5.3, §7.2, §8.
  • M. Jain, H. Olsen, B. Paten, and M. Akeson (2016) The oxford nanopore minion: delivery of nanopore sequencing to the genomics community. In Genome Biology, Cited by: §3.3, §6.1.2.
  • D. Jevdjic, K. Strauss, L. Ceze, and H. Malvar (2017) Approximate storage of compressed and encrypted videos. In International Conference on Architectural Support for Programming Languages and Operating Systems, Cited by: 2nd item, §1, §1, §1, §1, §5.3, §8.
  • M. Jung, D. Mathew, C. Weis, and N. Wehn (2016) Approximate computing with partially unreliable dynamic random access memory-approximate dram. In The 53rd Annual Design Automation Conference (DAC’16)., Cited by: §8.
  • S. Kannan and A. McGregor (2005) More on reconstructing strings from random traces: insertions and deletions. In Proceedings of the International Symposium on Information Theory, Cited by: §3.
  • A. Krishnamurthy, A. Mazumdar, A. McGregor, and S. Pal (2019) Trace reconstruction: generalized and parameterized. In arXiv:1904.09618v1, Cited by: §3.
  • H. Lee, R. Kalhor, N. Goela, J. Bolot, and G. Church (2019) Terminator-free template-independent enzymatic DNA synthesis for digital information storage. In Nature Communications, Cited by: §8, §8.
  • S. Liu, K. Pattabiraman, T. Moscibroda, and B. Zorn (2011) Flikker: saving dram refresh-power through critical data partition. In International Conference on Architectural Support for Programming Languages and Operating Systems, Cited by: §8.
  • F. Nicolas and E. Rivals (2004) Hardness results for the center and median string problems under the weighted and unweighted edit distances. In Journal of Discrete Algorithms, Cited by: §3.
  • L. Organick, S. Ang, Y. Chen, R. Lopez, S. Yekhanin, K. Makarychev, M. Racz, G. Kamath, P. Gopalan, B. Nguyen, C. Takahashi, S. Newman, S. Parker, C. Rashtchian, K. Stewart, G. Gupta, R. Carlson, J. Mulligan, D. Carmean, G. Seelig, L. Ceze, and K. Strauss (2018) Random access in large-scale dna data storage. In Nature biotechnology, Cited by: 4th item, §1, §1, Figure 1, §2.1, §2.1, §2.1, §2.2, §3.1, §3.3, §4.1, §4.1, §4.2, §6.1.1, §6.1.1, §6.1.2, §8.
  • D. A. Patterson, G. Gibson, and R. H. Katz (1988) A case for redundant arrays of inexpensive disks (raid). SIGMOD Rec. 17 (3). Cited by: §2.2.
  • A. Raha, S. Sutar, H. Jayakumar, and V. Raghunathan (2017) Quality configurable approximate dram. In IEEE Trans. Comput. 66, 7 (2017), 1172–1187, Cited by: §8.
  • C. Rashtchian, K. Makarychev, M. Rácz, S. Ang, D. Jevdjic, S. Yekhanin, L. Ceze, and K. Strauss (2017) Clustering billions of reads for dna data storage. In Advances in Neural Information Processing Systems, Cited by: §2.1.
  • O. Sabary, A. Yucovich, G. Shapira, and E. Yaakobi (2020) Reconstruction algorithms for dna storage systems. In International Conference on DNA Computing and Molecular Programming, Cited by: Figure 5, §3.1, §6.1.2, §8.
  • A. Sampson, J. Nelson, K. Strauss, and L. Ceze (2013) Approximate storage in solid-state memories. In International Symposium on Microarchitecture, Cited by: §1.
  • Seagate (2017) Data age 2025: don’t focus on big data; focus on the data that’s big. In IDC White Paper, Cited by: §1.
  • K. Stewart, Y. Chen, D. Ward, X. Liu, G. Seelig, K. Strauss, and L. Ceze (2018) A content-addressable dna database with learned sequence encodings. In International Conference on DNA Computing and Molecular Programming, Cited by: 5th item.
  • C. Takahashi, B. Nguyen, K. Strauss, and L. Ceze (2019) Demonstration of end-to-end automation of dna data storage. In Nature Scientific Reports 9, Cited by: §1.
  • K. Tomek, K. Volkel, E. Indermaur, J. Tuck, and A. Keung (2021) Promiscuous molecules for smarter file operations in DNA-based data storage. In Nature Communications, Cited by: 4th item.
  • K. Tomek, K. Volkel, A. Simpson, A. Hass, E. Indermaur, J. Tuck, and A. Keung (2019) Driving the scalability of DNA-based information storage systems. In American Chemical Society, Cited by: 4th item.
  • K. Viswanathan and R. Swaminathan (2008) Improved string reconstruction over insertion-deletion channels. In Proceedings of the Nineteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’08. Cited by: §3.
  • S. Yazdi, R. Gabrys, and O. Milenkovic (2017) Portable and error-free dna-based data storage. In Nature Scientific Reports 7, Cited by: 4th item, §1, §2.1, §2.1, §3.3.
  • S. Yazdi, Y. Yuan, J. Ma, H. Zhao, and O. Milenkovic (2015) A rewritable, random-access dna-based storage system. In Nature Scientific Reports 5, Cited by: 4th item, §2.1, §2.1, §2.1.
  • C. Yi and J. Lee (1997) Interleaving and decoding scheme for a product code for a mobile data communication. In IEEE Transactions on Communications, Cited by: §4.2.