
Robust Multi-Read Reconstruction from Contaminated Clusters Using Deep Neural Network for DNA Storage

10/20/2022
by   Yun Qin, et al.
Tianjin University

DNA has immense potential as an emerging data storage medium. The principle of DNA storage is the conversion and flow of digital information between binary code streams, quaternary bases, and actual DNA fragments. This process inevitably introduces errors, posing challenges to accurate data recovery. Sequence reconstruction consists of inferring the DNA reference from a cluster of erroneous copies. A common assumption in existing methods is that all the strands within a cluster are noisy copies originating from the same reference, thereby contributing equally to the reconstruction. However, this is not always valid considering the existence of contaminated sequences caused, for example, by DNA fragmentation and rearrangement during the DNA storage process. This paper proposes a robust multi-read reconstruction model using a deep neural network (DNN), which is resilient to contaminated clusters with outlier sequences, as well as to noisy reads with IDS errors. The effectiveness and robustness of the method are validated on three next-generation sequencing datasets, where a series of comparative experiments are performed by simulating varying contamination levels that occur during the process of DNA storage.



I Introduction

Nowadays, the information explosion leads to the generation of massive data, which brings great challenges to traditional storage systems such as mobile hard disks, USB flash memory, and integrated circuits. When utilizing these storage media, several problems inevitably arise, including insufficient storage duration, high energy consumption, and environmental pollution [8]. Meanwhile, the Deoxyribonucleic Acid (DNA) molecule has emerged as a promising storage medium, owing to its theoretically high storage density and long retention time, which meet the demand of storing huge amounts of data [34, 2]. The workflow of DNA storage is summarized in Figure 1.

Generally, DNA storage consists of first encoding the binary stream into strings over the alphabet {A, T, C, G}, chemically synthesizing short DNA oligos, namely references, and then storing the synthesized DNA strands in vitro or in vivo. To read the information via next-generation sequencing, the references must be retrieved from a large, unordered collection of error-prone reads. This is because both synthesis and sequencing in DNA storage inevitably introduce insertion-deletion-substitution (IDS) errors into the DNA strands, with the error probability being 1%-2% for mainstream next-generation sequencing and up to 10% for Nanopore sequencers [6]. During sequencing, each single reference outputs an uncertain number of noisy copies, and the reads corresponding to different references are gathered without ordering [34, 18]. Clustering is usually applied to the sequencing file, such that the noisy reads originating from the same reference are grouped into clusters [22]. After that, multi-read reconstruction, which is the topic of this paper, is performed to infer the original reference from a cluster of noisy reads [24].

During the past ten years, a lot of research has been devoted to the sequence reconstruction problem in DNA storage. Roughly, existing methods fall into three categories: the consensus methods based on Bitwise Majority Alignment (BMA) [9, 33, 29, 24], the statistical inference methods [26, 25, 13], and the recent deep learning ones [1, 19, 15]. The BMA and its variations are elaborated for IDS channels and applied to DNA storage systems in [9, 33, 24]. They perform position-to-position alignment among multiple reads and implement a majority voting strategy. The BMA-based methods are effective, especially for datasets with low IDS error rates.
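To make the consensus idea concrete, the following toy Python sketch performs position-wise majority voting over a set of reads. It is illustrative only: the function name is ours, and it omits the realignment step that actual BMA variants use to handle insertions and deletions.

```python
# Toy position-wise majority vote, the core of BMA-style consensus methods.
from collections import Counter

def majority_vote(reads):
    """Consensus of (roughly) aligned reads by per-position majority."""
    length = min(len(r) for r in reads)
    return "".join(Counter(r[i] for r in reads).most_common(1)[0][0]
                   for i in range(length))

print(majority_vote(["ACGT", "ACGA", "TCGT"]))  # -> "ACGT"
```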

The second category is based on statistical inference, where at each position of the sequence, the maximum a posteriori (MAP) probabilities of all the possible input symbols are estimated and compared [26, 25, 13]. In [26], marker codes are inserted into LDPC codes at fixed intervals for error correction, and the decoder is based on a forward-backward (FB) algorithm. In [25], a drift vector is introduced to model the insertion/deletion errors in each received word, and a factor graph is derived for joint probability estimation. Concatenated codes are considered in [13], whose inner codes and channels are modeled as joint Hidden Markov Models (HMM) and for which the BCJR inference is derived. The so-called Trellis BMA marries BMA with BCJR decoding and achieves a complexity linear in the number of traces [30]. However, due to the computational overhead, the feasible number of reads per cluster can hardly exceed ten when applying these methods in practical DNA storage systems.

With the emergence of deep learning, a few recent works have attempted to exploit deep neural networks (DNN) to address multi-read reconstruction [1], as well as single-read reconstruction [19], in DNA storage systems. Similar in spirit to this work, the main idea is to train a DNN model with good error correction capacity that can map a cluster of noisy reads to the corresponding DNA reference. As this work also focuses on multi-read reconstruction using DNN, the relevant works [1, 19, 15] are reviewed in Section II.

In practice, the stability and robustness of current DNA storage systems are threatened by contaminated sequences that occur at different stages of DNA storage. Unlike a noisy read that differs from its reference by only a few IDS errors, we use the term contaminated sequences for strands with a more significant edit distance from the original DNA references. Several factors contribute to the occurrence of contaminated sequences. In long-term storage and under certain conditions, DNA strands are susceptible to degradation, which results in strand breaks and loss [17]. Nonspecific amplification inevitably causes frequent DNA breaks and rearrangements, where oligos are fragmented and rejoined into new ones [28], as shown in Figure 2. Contaminated sequences also include the complementary strands of the references produced during sequencing [16]. Considering the security issue in DNA storage, contaminated sequences may even be added intentionally for the purpose of data encryption [12, 27, 32]. Obviously, the existence of contaminated sequences makes the already challenging reconstruction problem more difficult [17, 28].

In all the aforementioned methods, every strand within a cluster contributes equally to the reconstruction of the reference strand, which holds only when the cluster under reconstruction comprises exclusively the noisy copies originating from the same reference. However, such a prerequisite of perfect clustering is not always achievable, given the properties of current DNA storage systems. When the sequencing file contains a portion of contaminated sequences, clustering algorithms will fail to generate clusters in accordance with the latent DNA references. On the other hand, as sequencing is biased towards strands with specific properties, existing clustering methods (e.g., [35, 22, 21]) risk losing rarely sequenced references [1, 17], that is, assigning their reads to the wrong clusters. To the best of our knowledge, no existing method differentiates sequence quality and reliability within the cluster in the context of sequence reconstruction for DNA storage.

This paper proposes a robust multi-read reconstruction method based on DNN. Taking advantage of the attention mechanism and the conformer block, the proposed model is resilient to contaminated clusters with outlier sequences, as well as to noisy reads with IDS errors. The main contributions are as follows:

  • Integration of sequence quality into multi-read reconstruction. To the best of our knowledge, this is the first multi-read reconstruction method that takes into account sequence reliability within the cluster. After being scored according to sequence quality by the attention module, strands contribute to the reconstruction to varying degrees. Thus, the effect of various kinds of contaminated sequences can be suppressed automatically.

  • Error correction capacity for IDS errors within the cluster. The proposed model realizes the correction of IDS errors within the cluster. The Conformer-Encoder has strong feature extraction ability, such that the local features extracted by the convolutional layers and the global features extracted by the attention module are smartly integrated. The resulting features are high-level and representative, such that the underlying reference of the noisy cluster can be well recovered by a single-layer long short-term memory (LSTM) decoder.

  • Sequence reconstruction model accommodating varying cluster sizes. The network is trained directly on clusters of different sizes, rather than summing up the reads within a cluster to form a structured input format [1]. Thereby, it is compatible with input clusters of varying sizes at the testing stage.

  • Small network with fewer parameters. The proposed neural network has a small structure (on the order of millions of parameters) with good generalization ability. This helps to mitigate the overfitting issue caused by the shortage of training data when using DNN to address the sequence reconstruction problem in DNA storage.

Fig. 1: Overview of the DNA storage system. The workflow consists of five stages: encoding, synthesis, storage, sequencing, and decoding.
Fig. 2: Illustration of strand breaks and rearrangements in DNA data storage.

The rest of this paper is organized as follows. Related work is reviewed in Section II. In Section III, we present the proposed multi-read reconstruction model. Experimental results and analysis are given in Section IV. Finally, Section V concludes the paper.

II Related Work

We succinctly review several deep learning-based sequence reconstruction methods in DNA storage. The literature most relevant to this paper is the so-called DNAformer, a scalable and robust solution for DNA sequence reconstruction recently proposed in [1]. The model is based on DNN and is well adapted to imperfect but fast clustering of copies. Benefiting from convolutions, Xception blocks, and transformers, the model has good capacity to correct IDS errors (especially substitutions) within the cluster. Besides the dissimilar network designs, our method differs from DNAformer in the following aspects.

  1. The input of DNAformer is an element-wise sum of multiple copies, implying that every sequence within a cluster is equally important to the reconstruction. It fails to consider the differences in sequence quality caused by the existence of contaminated sequences. On the contrary, our method scores every sequence within the cluster, and accordingly, the strands contribute to the reconstruction at different levels.

  2. To overcome the shortage of training data, DNAformer applies the Synthetic Data Generator (SDG) [4] to generate sufficient DNA sequences for training the model, with the sequence error rates estimated by SOLQC [23]. Alternatively, our method circumvents this issue by designing a small but efficient network, which can be trained with far fewer labeled samples.
Nahum et al. [19] established a single-read reconstruction model for DNA-based storage systems, aiming at learning the error patterns from only a single sequence via a global, context-aware method. The model uses an encoder-decoder transformer architecture composed of two paired BERT models. The error correction is regarded as a self-supervised sequence-to-sequence task, and the network is trained using synthetic sequences generated by SDG [4].

In [15], a basecaller-decoder integration method is proposed for recovering Nanopore sequencing data, where Viterbi error correction and a recurrent neural network (RNN) are combined. The reference is reconstructed directly from multiple records of the raw signal, instead of being inferred from highly noisy basecalled reads. This yields a 3-fold reduction in reading cost compared to previous work.

Fig. 3: Illustration of the multi-read reconstruction problem defined in this paper. Binary files are encoded as DNA references. Multi-read reconstruction starts from a noisy cluster containing the erroneous copies originating from the original reference and the contaminated sequences occurring at different stages of DNA storage. The proposed reconstruction method aims at finding a mapping (characterized by a neural network) that minimizes the distance between the cluster and the original reference.

III The Proposed Model

We formulate the multi-read reconstruction problem mathematically, and then describe the proposed Robust multi-read Reconstruction from Contaminated Clusters using DNN (RRCC-DNN) method in detail.

III-A Problem Statement

Let $\Sigma = \{\mathrm{A}, \mathrm{T}, \mathrm{C}, \mathrm{G}\}$ represent the four DNA nucleotides. Let $\mathcal{C} = \{y_1, y_2, \dots, y_N\}$ be a noisy cluster, which contains erroneous reads originating from the same reference $x \in \Sigma^L$, together with contaminated sequences introduced at various stages of the DNA storage process. Based on this assumption, the DNA multi-read reconstruction algorithm is a mapping

$F: \mathcal{C} \mapsto \hat{x}$,

which receives the $N$ sequences and outputs $\hat{x}$, an estimate of $x$, as shown in Figure 3. In this work, we focus on deploying a DNN model to find such a mapping that the distance between $x$ and $\hat{x}$ is minimized.
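As an illustration of this formulation, the following Python sketch defines the edit (Levenshtein) distance, a natural choice for the distance between $x$ and $\hat{x}$; the type alias and function names are ours, not from the paper.

```python
# Minimal sketch: a cluster is a list of reads; a reconstruction algorithm is
# any function mapping the cluster to an estimate of the reference.
from typing import Callable, List

DNA = str  # a string over the alphabet {A, T, C, G}

def edit_distance(a: DNA, b: DNA) -> int:
    """Levenshtein distance; insertions, deletions, substitutions all cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion in a
                            curr[j - 1] + 1,            # insertion in a
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

# A reconstruction algorithm F receives a cluster and returns an estimate x_hat.
Reconstructor = Callable[[List[DNA]], DNA]
```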

Fig. 4: Model architecture. The proposed RRCC-DNN is composed of Attention Module, Conformer-Encoder, and LSTM-Decoder, which correspond to the three colored regions in the figure, respectively.

III-B Model Overview

We aim at addressing the sequence reconstruction problem defined in Section III-A by deep learning. As shown in Figure 4, the proposed neural network is based on the encoder-decoder architecture and is mainly composed of three components, i.e., the Attention Module, the Conformer-Encoder, and the LSTM-Decoder. The attention mechanism [3] is used to automatically suppress the effect of suspicious contaminated sequences while amplifying the contribution of sequences that likely originate from the cluster reference. Placed at the front of the model, the Attention Module scores the quality of every sequence in the input cluster and generates a high-level, weight-averaged feature accordingly. The Conformer-Encoder is expected to learn the IDS error patterns within a cluster, owing to its powerful feature extraction ability. It interactively combines the local features extracted by the convolution with the global features generated by the attention module. The decoder is a single-layer LSTM, which outputs the predicted reference of the input cluster. Next, we present the sequence embedding and the three model components in detail, as well as the loss function.

III-B1 Sequence Embedding

The model input is a cluster of a non-fixed number of DNA sequences with varying lengths. Before being fed to the network, each sequence is represented by one-hot encoding at a prefixed, uniform length $L$, where zeros are padded to short strands. In this way, every sequence is converted to a matrix of size $4 \times L$, each column being a one-hot vector indicating the corresponding base at that index position.
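A minimal sketch of this embedding is given below; the base ordering (A, C, G, T) is an illustrative assumption.

```python
# One-hot embedding of a read into a 4 x L matrix with zero-padding.
import numpy as np

BASES = "ACGT"  # assumed channel order

def one_hot_embed(read: str, L: int) -> np.ndarray:
    """Convert a read to a 4 x L one-hot matrix, zero-padding short strands."""
    mat = np.zeros((4, L), dtype=np.float32)
    for pos, base in enumerate(read[:L]):  # truncate reads longer than L
        mat[BASES.index(base), pos] = 1.0
    return mat

cluster = ["ACGTAC", "ACGAC"]  # toy cluster with an IDS error
batch = np.stack([one_hot_embed(r, L=8) for r in cluster])  # shape (2, 4, 8)
```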

III-B2 Attention Module

As illustrated in Figure 5, the attention module consists of convolutional layers followed by an attention mechanism [31]. For every strand feature, we perform two successive 1D convolution operations, with kernel sizes of 3 and 5, to model the position shifts caused by synchronization errors, while reducing the number of feature channels from 4 to 2 and finally to 1. The resulting one-dimensional vectors are scored by the attention mechanism in a similar way as in [5].

Let $\mathbf{h}_i$ be the input feature of the $i$-th strand in the cluster, and $\mathbf{v}_i$ be the corresponding vector after convolution. The attention mechanism is applied as

$e_i = \mathbf{w}^{\top} f(\mathbf{W} \mathbf{v}_i + \mathbf{b}) + k$.  (1)

Here, the linear transform with parameters $\mathbf{W}$ and $\mathbf{b}$ is used to project the vector to a lower-dimensional space, thus reducing the parameter number of the network. After a nonlinear activation layer $f(\cdot)$, the feature is transformed to a sequence-wise attention score $e_i$ via a linear layer (parameterized by $\mathbf{w}$ and $k$). By applying the softmax function, the scalar is normalized over all the strands within the cluster as

$\alpha_i = \exp(e_i) / \sum_{j=1}^{N} \exp(e_j)$,  (2)

where $N$ is the cluster size, and $\alpha_i$ is the final attention score of the $i$-th sequence. Obviously, the attention score reflects the importance of each strand within the cluster. As a result, the weight-averaged feature for the given cluster becomes

$\bar{\mathbf{h}} = \sum_{i=1}^{N} \alpha_i \mathbf{h}_i$.  (3)

Here, every sequence contributes to the representation differently according to its quality, with the importance of high-quality reads amplified and the effect of low-scored strands suppressed automatically. After the attention module, a linear layer and convolutional upsampling are applied to represent the feature (3) in an enlarged feature space. As a 4-dimensional representation is not enough to characterize a position in the sequence, we expand the feature dimension from 4 to a larger dimension $d$ in the experiments.
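The following hedged PyTorch sketch mirrors Eqs. (1)-(3): per-strand convolutions (kernel sizes 3 and 5, channels 4 -> 2 -> 1, as in the text), a scored attention over the strands, and a weighted average. The hidden width and all layer names are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionModule(nn.Module):
    """Scores each strand of a cluster and returns the weight-averaged feature."""
    def __init__(self, L: int, hidden: int = 64):
        super().__init__()
        # Two 1D convolutions (kernels 3 and 5) reduce channels 4 -> 2 -> 1;
        # padding keeps the sequence length L unchanged.
        self.conv1 = nn.Conv1d(4, 2, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(2, 1, kernel_size=5, padding=2)
        self.proj = nn.Linear(L, hidden)   # W, b in Eq. (1)
        self.score = nn.Linear(hidden, 1)  # w, k in Eq. (1)

    def forward(self, cluster: torch.Tensor) -> torch.Tensor:
        # cluster: (N, 4, L), the one-hot strands of a single cluster
        v = self.conv2(F.relu(self.conv1(cluster))).squeeze(1)  # (N, L)
        e = self.score(torch.tanh(self.proj(v)))                # (N, 1), Eq. (1)
        alpha = torch.softmax(e, dim=0)                         # (N, 1), Eq. (2)
        return (alpha.unsqueeze(-1) * cluster).sum(dim=0)       # (4, L), Eq. (3)

fused = AttentionModule(L=160)(torch.randn(12, 4, 160))  # toy cluster of 12 reads
```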

Fig. 5: Attention Module. Each noisy copy is converted to a two-dimensional matrix by one-hot encoding and zero-padding. After convolution, each feature is transformed into a one-dimensional vector, and is fed to the attention mechanism to estimate a scalar score. Finally, the weight-averaged feature for the given cluster is generated.

III-B3 Conformer-Encoder

Concerning the encoder, we adapt the convolution-augmented transformer (conformer), which was proposed for speech recognition and outperformed CNN- and transformer-based models with state-of-the-art results [11]. The conformer combines self-attention and convolution, where the former captures the global features while the latter learns the relative-offset-based local interactions. As a result, it can correct IDS errors through the extracted rich semantic information. As shown in Figure 4, the Conformer-Encoder consists of multi-head self-attention layers and convolution layers sandwiched between two feed-forward modules with shortcut connections, where layer normalization is always applied at the junction of two modules.

The multi-head self-attention module (MHSA) [31] is computed by the scaled dot-product attention

$\mathrm{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{softmax}(\mathbf{Q}\mathbf{K}^{\top} / \sqrt{d_k})\,\mathbf{V}$,  (4)

where $\mathbf{Q}$, $\mathbf{K}$, $\mathbf{V}$ are linear transformations of the input feature. In this work, we employ $h$ parallel attention heads, namely the concatenation of the scaled dot-product attention results, yielding

$\mathrm{MHSA}(\mathbf{X}) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\,\mathbf{W}^{O}$,  (5)

where each $\mathrm{head}_i$ is computed from (4), and $\mathbf{W}^{O}$ maps the concatenated feature back to the original dimension $d$. As for the convolution module (Conv), we perform two depthwise separable convolutions with kernel size 31 to capture local correlations among sequence positions. Each feed-forward module (FFN) has two linear layers, which first double and then restore the original feature dimension.

Mathematically, for input $\mathbf{x}$ to the Conformer-Encoder, the output $\mathbf{y}$ is:

$\tilde{\mathbf{x}} = \mathbf{x} + \frac{1}{2}\mathrm{FFN}(\mathbf{x})$,  (6)
$\mathbf{x}' = \tilde{\mathbf{x}} + \mathrm{MHSA}(\tilde{\mathbf{x}})$,  (7)
$\mathbf{x}'' = \mathbf{x}' + \mathrm{Conv}(\mathbf{x}')$,  (8)
$\mathbf{y} = \mathrm{LayerNorm}(\mathbf{x}'' + \frac{1}{2}\mathrm{FFN}(\mathbf{x}''))$.  (9)
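A hedged PyTorch sketch of one such block, following the standard conformer formulation of [11] and Eqs. (6)-(9), is shown below. The model dimension and head count are illustrative assumptions; only the kernel size 31 and the half-step feed-forward residuals follow the text.

```python
import torch
import torch.nn as nn

def ffn(d: int) -> nn.Sequential:
    """Feed-forward module: two linear layers that double then restore dim d."""
    return nn.Sequential(nn.LayerNorm(d), nn.Linear(d, 2 * d),
                         nn.SiLU(), nn.Linear(2 * d, d))

class ConformerBlock(nn.Module):
    def __init__(self, d: int = 128, heads: int = 4, kernel: int = 31):
        super().__init__()
        self.ffn1, self.ffn2 = ffn(d), ffn(d)
        self.norm_att = nn.LayerNorm(d)
        self.mhsa = nn.MultiheadAttention(d, heads, batch_first=True)
        self.norm_conv = nn.LayerNorm(d)
        # Depthwise separable convolution with kernel size 31, as in the text
        self.depthwise = nn.Conv1d(d, d, kernel, padding=kernel // 2, groups=d)
        self.pointwise = nn.Conv1d(d, d, 1)
        self.norm_out = nn.LayerNorm(d)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, L, d)
        x = x + 0.5 * self.ffn1(x)                                 # Eq. (6)
        a = self.norm_att(x)
        x = x + self.mhsa(a, a, a, need_weights=False)[0]          # Eq. (7)
        c = self.norm_conv(x).transpose(1, 2)                      # (batch, d, L)
        x = x + self.pointwise(self.depthwise(c)).transpose(1, 2)  # Eq. (8)
        return self.norm_out(x + 0.5 * self.ffn2(x))               # Eq. (9)

y = ConformerBlock()(torch.randn(2, 160, 128))  # (batch, positions, features)
```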

III-B4 LSTM-Decoder

As an advanced variant of RNN, LSTM can model long-range dependencies well for chronological data [10]. As DNA is context-dependent, sequential data, we employ a single-layer LSTM decoder. Although simple, it is sufficient for the reconstruction task, owing to the powerful feature extraction ability of the Conformer-Encoder. The decoder reduces the feature dimension back to 4, outputting for each position the estimated probabilities of each base.

The proposed RRCC-DNN model is trained using a cross-entropy loss function defined as

$\mathcal{L} = -\sum_{l=1}^{L} \mathbf{y}_l^{\top} \log \hat{\mathbf{p}}_l$,  (10)

where $L$ is the sequence length, $\mathbf{y}_l$ is the one-hot label vector indicating the base category at the $l$-th position, and $\hat{\mathbf{p}}_l$ represents the predicted probability vector produced by the proposed neural network.
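For concreteness, a minimal PyTorch sketch of the decoder and the loss in Eq. (10) follows; the encoder feature dimension is an assumed placeholder.

```python
import torch
import torch.nn as nn

d, L = 128, 160                  # assumed encoder dimension and sequence length
decoder = nn.LSTM(input_size=d, hidden_size=d, num_layers=1, batch_first=True)
head = nn.Linear(d, 4)           # per-position scores over {A, C, G, T}

features = torch.randn(8, L, d)  # encoder output for a batch of 8 clusters
logits = head(decoder(features)[0])  # (8, L, 4)

labels = torch.randint(0, 4, (8, L))  # base category at each position
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 4), labels.reshape(-1))  # Eq. (10)
```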

Dataset                                       | Erlich et al. [7] | Organick et al. [20] | Chandak et al. [3]
Number of original sequences                  | 72000             | 607150               | 11710
Length of the original sequence               | 152               | 150                  | 150
Synthesis                                     | Twist Bioscience  | Twist Bioscience     | CustomArray
Sequencing                                    | Illumina MiSeq    | Illumina NextSeq     | Illumina iSeq
Number of original sequences aligned to reads | 72000             | 596669               | 11710
Missing clusters                              | 0                 | 10481                | 0
Number of reads aligned to original sequences | 13328870          | 14486345             | 1065117
TABLE I: Data description.
Dataset                       | Erlich et al. [7] | Organick et al. [20] | Chandak et al. [3]
Training set: cluster number  | 36000             | 296317               | 5857
Training set: cluster size    | 5-30              | 5-30                 | 5-30
Training set: number of reads | 628875            | 5587728              | 101643
Testing set: cluster number   | 36000             | 296325               | 5853
Testing set: cluster size     | 5-30              | 5-30                 | 5-30
Testing set: number of reads  | 630945            | 5586351              | 102744
TABLE II: Statistics of the training and testing sets.

IV Experimental Results

IV-A Data Preparation and Training Details

We use three well-known datasets for DNA-based storage provided in Erlich et al.[7], Organick et al.[20], and Chandak et al.[3]. Dataset descriptions are given in Table I. Each dataset comprises two files for sequence reconstruction, one containing the disordered collection of the noisy reads and the other recording all the original references.

As no ground-truth clusters are available in practice, we first apply the Burrows-Wheeler Alignment tool (BWA) [14] to both files and take the sequence alignment results as perfect clusters, where each read is matched to its closest reference.

In the experiments, we set the cluster size range to 5-30, a modest range for the sequence reconstruction task, by randomly picking reads from each previously obtained cluster. This is because the number of reads in the original sequencing file is large, signifying a commonly large cluster size with much information redundancy (e.g., about 185 copies per reference on average in the dataset of Erlich et al. [7], Table I).

To simulate the challenging scenarios where DNA storage is under threat of contamination, we inject into each cluster a certain proportion of contaminated sequences generated from the following sources (a simulation sketch is given after the list):

  • Misclustered sequence. When the clustering is imperfect, a sequence will be assigned to the wrong cluster.

  • Reverse complementary strand. The sequencing process generates the reverse complementary sequence of a DNA strand [16], and strands in the opposite orientation risk being assigned to the same cluster under imperfect clustering.

  • Random sequence. Inspired by [12], we consider randomly generated DNA to simulate fake information intentionally added to the original file.

  • Splicing of DNA fragments. Such errors are simulations of DNA breakages and rearrangements that frequently occur in DNA storage and PCR amplification-based DNA strand replication, as mentioned in [28].
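The following Python sketch illustrates how such contaminated strands can be simulated and injected; all helper names are ours, and each injected strand is drawn with equal probability from the four sources above, as stated in the experiment description.

```python
import random

BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(s: str) -> str:
    """Reverse complementary strand of s."""
    return s.translate(COMPLEMENT)[::-1]

def random_dna(length: int) -> str:
    """Uniformly random DNA string, simulating intentionally added fake data."""
    return "".join(random.choice(BASES) for _ in range(length))

def spliced_fragment(pool, length: int) -> str:
    """Join fragments of two unrelated strands (break + rearrangement)."""
    a, b = random.sample(pool, 2)
    cut = random.randint(1, length - 1)
    return (a[:cut] + b[cut:])[:length]

def contaminate(cluster, other_reads, level: float):
    """Inject round(level * |cluster|) contaminated strands into a cluster."""
    length = len(cluster[0])
    sources = [
        lambda: random.choice(other_reads),                  # misclustered read
        lambda: reverse_complement(random.choice(cluster)),  # reverse complement
        lambda: random_dna(length),                          # random sequence
        lambda: spliced_fragment(other_reads, length),       # spliced fragments
    ]
    n = round(level * len(cluster))
    return cluster + [random.choice(sources)() for _ in range(n)]
```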

To demonstrate the effectiveness of the proposed model under different contamination levels, we inject contaminated sequences into each cluster, with equal probability for every candidate source. More precisely, on each of the three datasets, five simulations with contamination levels in {0%, 5%, 10%, 15%, 20%} are performed using the proposed method as well as the comparative sequence reconstruction approaches. Here, 0% corresponds to the case without extra added contaminated sequences, where the clusters are composed of the reads from the original sequencing file.

Table II reports the training and testing sets for the three datasets. For each experiment, the proportion of training data to test data is set to 1:1. We sort the training samples according to their cluster sizes and divide the training set into batches such that each batch comprises clusters of the same size. Training and testing are performed on a single 2080Ti GPU. We set the batch size to 64 and the initial learning rate to 0.005. The Adam optimizer is applied, together with an L2 regularization term to prevent model overfitting.
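For reference, the stated optimization settings translate into the following PyTorch configuration; the Adam momentum terms and the L2 coefficient shown are common defaults, assumed here because the original values did not survive extraction.

```python
import torch

model = torch.nn.Linear(4, 4)  # stand-in for the RRCC-DNN model
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=0.005,            # initial learning rate, as stated in the text
    betas=(0.9, 0.999),  # assumed Adam defaults; original values lost
    weight_decay=1e-4,   # assumed L2 regularization coefficient
)
```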

(a) Erlich et al.[7]
(b) Organick et al.[20]
(c) Chandak et al.[3]
Fig. 6: Changes of the reconstruction success rate with respect to the contamination level, ranging from 0% to 20%, using the proposed RRCC-DNN on the three datasets.
(a) Erlich et al.[7]
(b) Organick et al.[20]
(c) Chandak et al.[3]
Fig. 7: Frequency histograms of the edit distance measured between the wrong prediction and the corresponding cluster reference.

IV-B Evaluation Metric and Comparative Methods

The effectiveness of the proposed method is evaluated by comparison with three state-of-the-art sequence reconstruction methods, where the performance is measured by the success rate, given by

$\text{Success rate} = \#\{\text{perfectly reconstructed references}\} \,/\, \#\{\text{clusters under test}\}$.  (11)

In this formula, a sequence contributes to the success rate only if it is perfectly reconstructed, without error at any index position.
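A minimal sketch of this metric, assuming predictions and references are plain strings:

```python
from typing import List

def success_rate(predictions: List[str], references: List[str]) -> float:
    """Fraction of clusters whose reference is reconstructed without any error."""
    exact = sum(p == r for p, r in zip(predictions, references))
    return exact / len(references)
```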

  • Iterative Reconstruction [24]: This algorithm uses multiple methods to revise strands from clusters and returns the candidate sequence most likely to be the original reference. The error-vectors majority algorithm is used to correct insertion and substitution errors, while the pattern-path algorithm is applied to correct deletion errors.

  • Divider BMA [24]: This BMA-based algorithm divides the received clusters into three sub-clusters by sequence length. Majority voting is applied to the sequences of correct length; then deletion and insertion error corrections are performed on the sub-clusters with shorter and longer sequences, respectively.

  • BMA Lookahead [9]: This is an improved version of the BMA method. For sequences whose current symbol does not match the majority of symbols, a "prior window" looking at the next two (or more) symbols is used.

IV-C Results Analysis

We report the reconstruction success rates of the proposed RRCC-DNN at different contamination levels on the testing sets of all three datasets, as shown in Figure 6. For Erlich et al. [7], the success rates reach 99.86%, 99.78%, 99.78%, 99.76%, and 99.74%, corresponding to 51, 78, 79, 87, and 93 wrong predictions out of 36000 clusters. For Organick et al. [20], the success rates are 99.82%, 99.70%, 99.53%, 99.52%, and 99.58%, with 540, 897, 1391, 1408, and 1230 wrong predictions out of 296325 clusters. On the third dataset, Chandak et al. [3], the numbers are 97.68%, 97.37%, 97.06%, 96.77%, and 96.45%, with 136, 154, 172, 189, and 208 wrong predictions out of 5853 testing clusters. On all three datasets, the performance of the RRCC-DNN model remains stable, with only a slight decrease in success rate, as the proportion of contaminated strands is gradually increased, even up to 20%. This confirms the robustness and stability of the proposed RRCC-DNN model in the presence of contaminated sequences. Notice that the success rates are relatively low on the third dataset at all contamination levels. This is due to the higher IDS error rates, as well as the mismatch in sequence lengths.

IV-C1 Wrong Prediction Analysis

Figure 7 illustrates the frequency histograms of the edit distance, measured between each incorrectly predicted sequence and the corresponding cluster reference. As observed, most of the wrong predictions have a small edit distance to their original reference, meaning that almost every sequence position can be correctly predicted by the proposed model, even when the reference is not perfectly reconstructed.

IV-C2 Impact of Cluster Size

Fig. 8: Changes of the success rate in terms of the smallest cluster size $S_{\min}$, namely the cluster sizes of the dataset range from $S_{\min}$ to 30, on the dataset of Organick et al. [20].

We investigate how the smallest cluster size in a dataset affects the reconstruction success rate under varying contamination conditions. The results are given in Figure 8. As observed, for all contamination rates, the success rate increases with the smallest cluster size $S_{\min}$. For sufficiently large $S_{\min}$, the success rates reach 99.98%, 99.96%, and 99.89% under contamination levels of 0%, 10%, and 20%, respectively.

IV-D Comparative Study

Figure 9 reports the success rates obtained on all three datasets at varying contamination levels using the proposed RRCC-DNN and three other sequence reconstruction strategies. We observe that on the dataset of Chandak et al. [3], the proposed RRCC-DNN always provides the best results at all contamination levels. This is because most reads in this dataset have a larger length than that of the original reference, signifying a high IDS error rate. Thanks to the Conformer-Encoder module, our model can efficiently capture the general IDS error patterns and is thus resilient to such position shifts within the strands. The Divider BMA [24] fails on this dataset.

When the dataset is not contaminated at all, our method provides reconstruction results comparable to the state-of-the-art methods. On the first two datasets, the success rates of RRCC-DNN are lower than those of Iterative Reconstruction [24] and BMA Lookahead [9], but slightly higher than that of Divider BMA [24].

The advantages of the proposed RRCC-DNN become increasingly evident as the proportion of contaminated sequences in the dataset grows. For example, when the contamination proportion is increased to 10% on the Erlich et al. [7] data, the performance decreases of RRCC-DNN, Iterative Reconstruction [24], Divider BMA [24], and BMA Lookahead [9] are 0.10%, 0.15%, 0.37%, and 0.22%, respectively. Compared to its counterparts, the proposed method is the least affected by cluster contamination. With 10% contamination, RRCC-DNN is second to Iterative Reconstruction [24] by 0.06% on the dataset of Erlich et al. [7], and second to BMA Lookahead [9] by 0.07% on the second dataset. When the contamination proportion reaches 15%, RRCC-DNN outperforms all the competing methods on all three datasets. As the contamination proportion continues to increase, its advantage in terms of success rate becomes significant.

The above discussions demonstrate the good behavior and robustness of the proposed RRCC-DNN in the presence of contaminated sequences, as well as IDS errors within clusters.

(a) Contamination level: 0%
(b) Contamination level: 10%
(c) Contamination level: 15%
(d) Contamination level: 20%
Fig. 9: Comparison of success rates using the proposed RRCC-DNN, Iterative Reconstruction [24], Divider BMA [24], and BMA Lookahead [9], on three datasets at contamination levels 0%, 10%, 15%, and 20%.

IV-E Ablation Study

Model architecture       | Erlich et al. [7]        | Organick et al. [20]     | Chandak et al. [3]
Contamination level      | 0% / 10% / 20%           | 0% / 10% / 20%           | 0% / 10% / 20%
Our model                | 99.86% / 99.78% / 99.74% | 99.82% / 99.53% / 99.58% | 97.68% / 97.06% / 96.44%
-Attention               | 70.72% / 63.76% / 43.45% | 80.65% / 72.49% / 57.53% | 67.89% / 65.98% / 62.79%
-Attention+Normalization | 99.21% / 98.56% / 96.34% | 99.81% / 99.12% / 97.51% | 97.68% / 95.33% / 90.12%
-Conformer+Transformer   | 99.57% / 99.32% / 99.00% | 99.15% / 99.12% / 98.47% | 97.45% / 96.78% / 96.28%
TABLE III: Ablation study (success rate).

The ablation study is designed to demonstrate the necessity of the Attention Module and the effectiveness of the Conformer-Encoder. To this end, we first remove the attention mechanism from the Attention Module and directly feed the model with the summation of all the input sequences within a cluster. As shown in Table III, the resulting model performs poorly on all datasets at varying contamination levels.

We further impose an equal, normalized weight on every input strand. As seen from Table III, the resulting reconstruction performance is always inferior to that of our proposed model, especially when severe contamination is present in the dataset. The gaps in success rate between the two models reach 3.40%, 2.07%, and 6.32% on the three datasets when the contamination rate reaches 20%. In the case without additional contaminated sequences, this model achieves results similar to our model.

Finally, we modify our model by replacing the Conformer block with a Transformer block. The difference is that the Conformer has a convolution module and a pair of feed-forward modules, while the Transformer has only one feed-forward module [31]. As shown in Table III, the performance of the latter model is satisfactory but inferior to that of our model, demonstrating the effectiveness of the proposed Conformer-Encoder.

V Conclusion

In this paper, we proposed a DNN-based multi-read reconstruction model for DNA storage, which is robust to noisy reads with IDS errors and, more importantly, resilient to the contaminated sequences introduced during the DNA storage process. The proposed network has an encoder-decoder architecture with three pivotal components. The Attention Module suppresses the effect of contaminated sequences on the reconstruction by automatically scoring the strands within the cluster and generating a representative, weight-averaged feature for subsequent tasks. The Conformer-Encoder has a sandwich structure and tackles most of the IDS errors within a cluster thanks to its advanced feature extraction capacity. The single-layer LSTM-Decoder finally predicts the reference DNA of the input cluster. We demonstrate the effectiveness and robustness of the proposed RRCC-DNN on three next-generation sequencing datasets through a series of comparative experiments, where different levels of contamination caused by various factors during the DNA storage process are simulated. An ablation study is also provided to verify the necessity of the attention mechanism and the conformer block in the proposed model. Future work will focus on adapting the proposed sequence reconstruction model to Nanopore sequencing data with higher error rates.

References

  • [1] D. Bar-Lev, I. Orr, O. Sabary, T. Etzion, and E. Yaakobi (2021) Deep dna storage: scalable and robust dna storage via coding theory and deep learning. arXiv preprint arXiv:2109.00031. Cited by: 3rd item, §I, §I, §I, §II.
  • [2] L. Ceze, J. Nivala, and K. Strauss (2019) Molecular digital data storage using dna. Nature Reviews Genetics 20 (8), pp. 456–466. Cited by: §I.
  • [3] S. Chandak, K. Tatwawadi, B. Lau, J. Mardia, M. Kubit, J. Neu, P. Griffin, M. Wootters, T. Weissman, and H. Ji (2019) Improved read/write cost tradeoff in dna-based data storage using ldpc codes. In 2019 57th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp. 147–156. Cited by: §III-B, TABLE I, TABLE II, 6(c), 7(c), §IV-A, §IV-C, §IV-D, TABLE III.
  • [4] G. Chaykin (2021) DNA storage simulator. External Links: Link Cited by: item 2, §II.
  • [5] B. Desplanques, J. Thienpondt, and K. Demuynck (2020) ECAPA-tdnn: emphasized channel attention, propagation and aggregation in tdnn based speaker verification. In Proc. Interspeech 2020, pp. 3830–3834. External Links: Document Cited by: §III-B2.
  • [6] Y. Dong, F. Sun, Z. Ping, Q. Ouyang, and L. Qian (2020) DNA storage: research landscape and future prospects. National Science Review 7 (6), pp. 1092–1107. Cited by: §I.
  • [7] Y. Erlich and D. Zielinski (2017) DNA fountain enables a robust and efficient storage architecture. science 355 (6328), pp. 950–954. Cited by: TABLE I, TABLE II, 6(a), 7(a), §IV-A, §IV-A, §IV-C, §IV-D, TABLE III.
  • [8] K. Goda and M. Kitsuregawa (2012) The history of storage systems. Proceedings of the IEEE 100 (Special Centennial Issue), pp. 1433–1440. Cited by: §I.
  • [9] P. S. Gopalan, S. Yekhanin, S. D. Ang, N. Jojic, M. Racz, K. Strauss, and L. Ceze (2018-July 26) Trace reconstruction from noisy polynucleotide sequencer reads. Google Patents. Note: US Patent App. 15/536,115 Cited by: §I, Fig. 9, 3rd item, §IV-D, §IV-D.
  • [10] K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber (2017) LSTM: a search space odyssey. IEEE Transactions on Neural Networks and Learning Systems 28 (10), pp. 2222–2232. External Links: Document Cited by: §III-B4.
  • [11] A. Gulati, J. Qin, C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu, et al. (2020) Conformer: convolution-augmented transformer for speech recognition. arXiv preprint arXiv:2005.08100. Cited by: §III-B3.
  • [12] J. Kim, J. H. Bae, M. Baym, and D. Y. Zhang (2020) Metastable hybridization-based dna information storage to allow rapid and permanent erasure. Nature communications 11 (1), pp. 1–8. Cited by: §I, 3rd item.
  • [13] A. Lenz, I. Maarouf, L. Welter, A. Wachter-Zeh, E. Rosnes, and A. G. i Amat (2021) Concatenated codes for recovery from multiple reads of dna sequences. In 2020 IEEE Information Theory Workshop (ITW), pp. 1–5. Cited by: §I, §I.
  • [14] H. Li and R. Durbin (2009) Fast and accurate short read alignment with burrows–wheeler transform. bioinformatics 25 (14), pp. 1754–1760. Cited by: §IV-A.
  • [15] X. Lv, Z. Chen, Y. Lu, and Y. Yang (2020) An end-to-end oxford nanopore basecaller using convolution-augmented transformer. In 2020 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 337–342. Cited by: §I, §I, §II.
  • [16] V. Mallet and J. Vert (2021) Reverse-complement equivariant networks for dna sequences. Advances in Neural Information Processing Systems 34, pp. 13511–13523. Cited by: §I, 2nd item.
  • [17] K. Matange, J. M. Tuck, and A. J. Keung (2021) DNA stability: a central design consideration for dna data storage systems. Nature communications 12 (1), pp. 1–9. Cited by: §I, §I.
  • [18] L. C. Meiser, P. L. Antkowiak, J. Koch, W. D. Chen, A. X. Kohll, W. J. Stark, R. Heckel, and R. N. Grass (2020) Reading and writing digital data in dna. Nature Protocols 15 (1), pp. 86–101. Cited by: §I.
  • [19] Y. Nahum, E. Ben-Tolila, and L. Anavy (2021) Single-read reconstruction for dna data storage using transformers. arXiv preprint arXiv:2109.05478. Cited by: §I, §I, §II.
  • [20] L. Organick, S. D. Ang, Y. Chen, R. Lopez, S. Yekhanin, K. Makarychev, M. Z. Racz, G. Kamath, P. Gopalan, B. Nguyen, et al. (2018) Random access in large-scale dna data storage. Nature biotechnology 36 (3), pp. 242–248. Cited by: TABLE I, TABLE II, 6(b), 7(b), Fig. 8, §IV-A, §IV-C, TABLE III.
  • [21] G. Qu, Z. Yan, and H. Wu (2022) Clover: tree structure-based efficient dna clustering for dna-based data storage. Briefings in Bioinformatics. Cited by: §I.
  • [22] C. Rashtchian, K. Makarychev, M. Racz, S. Ang, D. Jevdjic, S. Yekhanin, L. Ceze, and K. Strauss (2017) Clustering billions of reads for dna data storage. Advances in Neural Information Processing Systems 30. Cited by: §I, §I.
  • [23] O. Sabary, Y. Orlev, R. Shafir, L. Anavy, E. Yaakobi, and Z. Yakhini (2020-08) SOLQC: synthetic oligo library quality control tool. Bioinformatics 37 (5), pp. 720–722. External Links: ISSN 1367-4803, Document Cited by: item 2.
  • [24] O. Sabary, A. Yucovich, G. Shapira, and E. Yaakobi (2020) Reconstruction algorithms for dna-storage systems. bioRxiv. Cited by: §I, §I, Fig. 9, 1st item, 2nd item, §IV-D, §IV-D, §IV-D.
  • [25] R. Sakogawa and H. Kaneko (2020) Symbolwise map estimation for multiple-trace insertion/deletion/substitution channels. In 2020 IEEE International Symposium on Information Theory (ISIT), pp. 781–785. Cited by: §I, §I.
  • [26] R. Shibata, G. Hosoya, and H. Yashima (2016) Fixed-symbols-based synchronization for insertion/deletion/substitution channels. In 2016 International Symposium on Information Theory and Its Applications (ISITA), pp. 686–690. Cited by: §I, §I.
  • [27] I. Shomorony and R. Heckel (2021) DNA-based storage: models and fundamental limits. IEEE Transactions on Information Theory 67 (6), pp. 3675–3689. Cited by: §I.
  • [28] L. Song, F. Geng, Z. Gong, X. Chen, J. Tang, C. Gong, L. Zhou, R. Xia, M. Han, J. Xu, B. Li, and Y. Yuan (2022-09-12) Robust data storage in dna by de bruijn graph-based de novo strand assembly. Nature Communications 13 (1), pp. 5361. External Links: ISSN 2041-1723, Document Cited by: §I, 4th item.
  • [29] S. R. Srinivasavaradhan, M. Du, S. Diggavi, and C. Fragouli (2019) Symbolwise map for multiple deletion channels. In 2019 IEEE International Symposium on Information Theory (ISIT), pp. 181–185. Cited by: §I.
  • [30] S. R. Srinivasavaradhan, S. Gopi, H. D. Pfister, and S. Yekhanin (2021) Trellis bma: coded trace reconstruction on ids channels for dna storage. In 2021 IEEE International Symposium on Information Theory (ISIT), pp. 2453–2458. Cited by: §I.
  • [31] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. Advances in neural information processing systems 30. Cited by: §III-B2, §III-B3, §IV-E.
  • [32] P. K. Vippathalla and N. Kashyap (2022) The secure storage capacity of a dna wiretap channel model. arXiv preprint arXiv:2201.05995. Cited by: §I.
  • [33] S. M. Yekhanin and M. Z. Racz (2020-February 20) Trace reconstruction from reads with indeterminant errors. Google Patents. Note: US Patent App. 16/105,349 Cited by: §I.
  • [34] V. Zhirnov, R. M. Zadegan, G. S. Sandhu, G. M. Church, and W. L. Hughes (2016) Nucleic acid memory. Nature materials 15 (4), pp. 366–370. Cited by: §I, §I.
  • [35] E. Zorita, P. Cusco, and G. J. Filion (2015) Starcode: sequence clustering based on all-pairs search. Bioinformatics 31 (12), pp. 1913–1919. Cited by: §I.