Thermodynamically Stable DNA Code Design using a Similarity Significance Model

DNA code design aims to generate a set of DNA sequences (codewords) with minimum likelihood of undesired hybridizations among the sequences and their reverse-complement (RC) pairs (cross-hybridization). Inspired by the distinct hybridization affinities (or stabilities) of the perfect double helices formed by individual single-stranded DNA (ssDNA) strands and their RC pairs, we propose a novel similarity significance (SS) model to measure the similarity between DNA sequences. In particular, instead of directly measuring the similarity of two sequences with a metric, the proposed SS evaluates how much more likely the undesirable hybridizations are to occur than the desirable hybridizations in the presence of the two measured sequences and their RC pairs. With this SS model, we construct thermodynamically stable DNA codes subject to several combinatorial constraints using a sorting-based algorithm. The proposed scheme yields DNA codes with larger code sizes and wider free energy gaps (hence better cross-hybridization performance) than existing methods.




I Introduction

A DNA code is an ensemble of $q$-ary ($q = 4$) length-$n$ sequences subject to combinatorial biological constraints and with controlled maximum similarity among sequences and their reverse-complement (RC) pairs. Such sequences are sought after for a wide range of applications including DNA computing [1], DNA memory, and DNA data storage [22, 31]. Theoretical bounds on DNA codes satisfying combinatorial biochemical constraints, such as GC content, word-word distance, and word-reverse complementary word distance, have been explored in [18, 15, 7]. Concurrently, several code construction approaches have been proposed, including template-based construction [3, 16, 14], search algorithm-based construction [11, 27, 28], and coding theoretic construction [12, 20].

In DNA code design, one major criterion to be controlled is the maximum similarity or minimum distance between DNA sequences and/or their reverse-complement (RC) sequences (i.e., DNA strands as physical entities). This criterion has a significant impact on the cross-hybridization performance of a DNA code, which can be evaluated thermodynamically. Conventionally, thermodynamic models and distance models have been used for DNA code construction, of which the latter are recognized as having lower complexity [23]. In particular, the Hamming distance model has been widely used in theory-based code design [12, 18, 15, 7, 27, 13], while the edit distance model has been either adopted in a straightforward fashion [28, 8, 4] or generalized to pairwise alignment algorithms [2] that are widely implemented in biological software [29, 21, 32, 33]. In a nutshell, the existing similarity/distance models may differ in how they obtain or quantify the similarity value, but they are all built upon directly comparing the two sequences under consideration. In contrast, we introduce a new similarity significance (SS) model that involves not only the two compared sequences but also their RC pairs. In this model, the RC pairs serve as the reference pairs for quantifying the affinity-mediated similarity of the two compared sequences.

With the proposed SS model, we design DNA codes using a maximum SS constraint to restrain the cross-hybridization of a set of synthesized DNA strands (i.e., the code). Specifically, by using SS, the similarity of two different DNA strands (e.g., $x$ and $y$) is determined by evaluating how much more likely one strand ($x$ or $y$) is to hybridize with the RC pair of the other strand (denoted by $\bar{y}$ or $\bar{x}$) than with its own RC pair ($\bar{x}$ or $\bar{y}$). This differs from the classic sequence comparison approach, where distance/similarity metrics are directly used to determine how similar two DNA sequences are. To quantify the SS of two sequences, we use a sequence alignment between the two sequences with a biased similarity weight allocation for DNA symbols that are identical under the alignment. The DNA codes are obtained by a sorting-based exhaustive search algorithm. The constructed DNA codewords satisfy the predefined maximum SS parameters while complying with the balanced GC content and maximum homopolymer run constraints. The existence of codes with larger code sizes and wider free energy gaps (better cross-hybridization performance) under the SS model than under traditional metrics and existing measures implies that SS better approximates the thermodynamic property.

II Definition, Notation

A hybridization conformation is a double helix formation produced from single-stranded DNA (ssDNA) strand(s) with two chemically distinct ends (i.e., the 5’ end and the 3’ end) under the Watson-Crick (WC) complement rule, where adenine (A) binds with thymine (T) and cytosine (C) binds with guanine (G). For a DNA sequence denoted by $x$, the desirable hybridization is the perfect hybridization that occurs between $x$ and its WC RC pair (denoted by $\bar{x}$). For instance, $x$ = 5’-TTCCGAT-3’ and $\bar{x}$ = 5’-ATCGGAA-3’. All other hybridizations involving $x$ are considered undesirable for $x$ and constitute cross-hybridization. The stability of a hybridization conformation can be predicted by a thermodynamic metric, the Gibbs free energy, for which a lower value (usually negative) implies a more stable and more likely helix conformation [23]. As such, the hybridization affinity of DNA strand(s) can be inferred from the minimum free (Gibbs) energy (MFE), which corresponds to the most likely conformation structure of the considered ssDNA strand(s). For simplicity, $\Delta G$ is used to denote the MFE in what follows.
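The reverse-complement relation in the example above can be checked with a few lines of Python. This is a minimal sketch; the function name `reverse_complement` is a label chosen here for illustration and does not come from the paper.

```python
# Watson-Crick pairing: A<->T and C<->G.
_WC = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the WC reverse complement of a sequence written 5'->3'."""
    # Complement every base, then reverse so the result also reads 5'->3'.
    return seq.upper().translate(_WC)[::-1]


print(reverse_complement("TTCCGAT"))  # ATCGGAA
```

Applying the function twice returns the original sequence, which is a quick sanity check when building candidate sets.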

Consider a DNA code $\mathcal{C}$ consisting of unique sequences (e.g., $x, y \in \mathcal{C}$). There are three categories of undesirable hybridizations that might occur to sequence $x$. First, a self-complement hybridization that occurs within $x$, e.g., $x$ folds with itself. Second, a sequence-sequence (tag-tag) hybridization, which occurs between $x$ and another candidate ssDNA strand $y$. Third, a sequence-RC (tag-target) hybridization, in which $x$ hybridizes with the RC pair of another candidate ssDNA strand ($\bar{y}$).

MFE is widely used to characterize the hybridization affinity between two ssDNA strands. To estimate the cross-hybridization performance of a DNA code (i.e., a set of DNA sequences), generalized thermodynamic criteria based on MFE have been proposed in [26, 34]. Similarly, we use the free energy gap ($\delta$) to evaluate a DNA code, which is defined by

$$\delta = \min_{x, y \in \mathcal{C},\, x \neq y} \Big\{ \min\big\{ \Delta G(x),\ \Delta G(x, y),\ \Delta G(x, \bar{y}),\ \Delta G(\bar{x}, \bar{y}) \big\} - \Delta G(x, \bar{x}) \Big\}, \tag{1}$$

where $\Delta G(x, y)$ represents the MFE of two unique sequences $x$ and $y$ that pertain to a DNA code $\mathcal{C}$, and $\bar{x}$ and $\bar{y}$ are the WC RC pairs of $x$ and $y$, respectively. In (1), the first three $\Delta G$ terms within the inner $\min$ correspondingly refer to the MFE of the three undesirable hybridizations of strand $x$ mentioned above, i.e., self-fold hybridization ($\Delta G(x)$), sequence-sequence hybridization ($\Delta G(x, y)$), and sequence-RC hybridization ($\Delta G(x, \bar{y})$). Conversely, the $\Delta G(x, \bar{x})$ outside the inner $\min$ represents the MFE of the desirable hybridization of strand $x$ (i.e., the complete duplex formation with its RC pair $\bar{x}$). Note that $\Delta G(\bar{x}, \bar{y})$, which is the MFE of the undesirable hybridization formed by the RC pairs of codeword strands (i.e., $\bar{x}$ and $\bar{y}$), is also included in (1) for consistency with [34]. Notably, the $\delta$ used in this work is more stringent than the free energy gaps used in [34] and [26], as their definitions exclude the MFE of self-fold hybridizations (i.e., $\Delta G(x)$) and RC-RC hybridizations (i.e., $\Delta G(\bar{x}, \bar{y})$), respectively. In this work, $\delta$ is calculated based on the MFEs obtained from the online tool DINAMelt [19]. According to (1), a larger $\delta$ represents a wider gap between the free energies of the desirable and undesirable hybridizations, and thus indicates a better DNA code. Therefore, the free energy gap defined in (1) can be used to evaluate the cross-hybridization performance of a DNA code.
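The free energy gap described above (the smallest difference, over codeword pairs, between the most stable undesirable hybridization and the desirable duplex) can be sketched as follows. The `mfe` callable is a hypothetical interface standing in for a thermodynamic tool such as DINAMelt, and `toy_mfe` is a deliberately crude placeholder used only to make the sketch runnable; neither is taken from the paper.

```python
from itertools import permutations
from typing import Callable, Optional, Sequence

_WC = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    """Watson-Crick reverse complement of a 5'->3' sequence."""
    return seq.translate(_WC)[::-1]


# Hypothetical interface: mfe(a, b) is the MFE of strands a and b hybridizing,
# and mfe(a, None) is the MFE of strand a folding on itself.
MfeFn = Callable[[str, Optional[str]], float]


def free_energy_gap(code: Sequence[str], mfe: MfeFn) -> float:
    """Smallest (undesirable MFE - desirable MFE) over ordered codeword pairs."""
    if len(code) < 2:
        raise ValueError("need at least two codewords")
    gaps = []
    for x, y in permutations(code, 2):  # ordered pairs, so each strand plays x
        xr, yr = revcomp(x), revcomp(y)
        # Self-fold, sequence-sequence, sequence-RC and RC-RC hybridizations.
        undesirable = min(mfe(x, None), mfe(x, y), mfe(x, yr), mfe(xr, yr))
        gaps.append(undesirable - mfe(x, xr))  # desirable duplex: x with its RC
    return min(gaps)


def toy_mfe(a: str, b: Optional[str]) -> float:
    """Toy stand-in, NOT a thermodynamic model: -1 per WC pair in the
    full-overlap antiparallel duplex, and 0 for self-folding."""
    if b is None:
        return 0.0
    return -float(sum(p.translate(_WC) == q for p, q in zip(a, reversed(b))))


print(free_energy_gap(["AAAA", "CCCC"], toy_mfe))  # 4.0
```

In practice `toy_mfe` would be replaced by calls to a real MFE predictor; the structure of the loop is the part that mirrors the definition.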

III Proposed similarity significance (SS) model

The proposed similarity significance (SS) is inspired by the distinct affinities of individual perfect hybridization (between the WC complement pair, i.e., an ssDNA and its RC) and the definition of the free energy gap of a DNA code (i.e., the MFE difference between undesirable hybridizations and desirable hybridizations).

Unlike the conventional measures (e.g., Hamming distance) which quantify the distance/similarity of two different DNA sequences, the proposed SS quantifies the significance of the mutual similarity between two different sequences (which is undesirable) over the self similarity between identical sequences or to say sequences with themselves (which is desirable). This model inherently aligns with the thermodynamic metric of the code, i.e., the free energy gap between the undesirable hybridizations and desirable hybridizations. Specifically, is attained by computing the ratio of the similarity of sequences and against the similarity of the sequences and or the similarity of sequences and . Naturally, , which shares the same essence with the normalization in this respect. However, unlike the normalized similarity where the normalization coefficient, i.e., the denominator, is usually fixed as the length of the sequences or resolved by the sequence alignment, the denominator in SS depends on the individually varied affinities of the perfect WC conformations that relate to the compared sequences.

We now compare the definition of the proposed SS model with those of the traditional similarity models. Consider two sequences $x$ and $y$. The similarity measured by a standard similarity model can be written as

$$S(x, y) = s(x, y), \tag{2}$$

in which $s(x, y)$ is obtained by comparing sequences $x$ and $y$ with any customized similarity model, such as the Hamming model or the edit model.

Likewise, the similarity measured by a normalized similarity model can be written as

$$S_{\mathrm{norm}}(x, y) = \frac{s(x, y)}{N}, \tag{3}$$

in which $s(x, y)$ is the same as in (2) and $N$ is the normalization parameter, which might depend on the alignment or the length of the two compared sequences.

In contrast, the similarity measured by the SS model is written as

$$\mathrm{SS}(x, y) = \frac{s_w(x, y)}{\min\{ s_w(x, x),\ s_w(y, y) \}}, \tag{4}$$

in which $s_w(x, y)$ is obtained by comparing two sequences as in (2) but differs from (2) in how it is quantified. To obtain $s_w$ in (4), a weighted similarity allocation (WSA) mechanism is indispensable. The WSA is built upon biochemical characteristics, and it is of critical importance because it distinguishes the SS from the normalized similarity.

III-A Quantifying similarity using SS model

The key concept of SS is to measure the similarity of two unique DNA sequences by quantifying the relative likelihood of undesirable hybridizations against perfect WC hybridizations in the presence of both sequences and their RC pairs. In particular, the likelihood of hybridizations could be measured by thermodynamic models or approximated by any similarity model that reflects the distinct hybridization affinities. Hence, any existing measure that can differentiate the affinity of hybridizations among sequences could be used as the basis of SS computation [2, 10, 5, 17, 9]. Here, we use a best alignment criterion (BAC) and a weighted similarity allocation (WSA) for quantifying similarity using SS, where the BAC initiates the base-pair comparison, and the WSA distinguishes the SS from the normalized similarity. The SS computation has four steps. First, the BAC is applied to determine the best alignment between the two compared sequences. Second, based on the WSA mechanism, the similarity score is obtained under the best alignment. Third, the similarity score of each sequence with itself is calculated using the WSA mechanism. Lastly, the ratio of the score in step 2 to the minimum score in step 3 is calculated as the result. Before presenting the formal formulation of the proposed SS, the BAC and the WSA that form the basis of SS computation are briefly explained as follows.

The best alignment criterion (BAC) determines the best alignment of two DNA sequences (e.g., and ) in terms of approximating the alignment with which the most potential hybridization (between and ) occurs. In an alignment, and denote the length of consecutively identical bases and the number of other identical bases between two compared sequences, respectively. With the insight of the biased effect of the continuous similarity over the discontinuous similarity on the conventional sequence alignment methods (e.g., BLAST [2]), a BAC with priority to is used. Specifically, we derive the BAC by maximizing the continuous similarity while minimizing the dissimilarity using the objective function , where is the length of the sequences, and is the dissimilarity. Let be the corresponding values of under the best alignment, the criterion can be summarized as,


The weighted similarity allocation (WSA) mechanism gives biased similarity scores to the aligned base-pairs formed by bases from the compared sequences. An aligned base-pair is either an identical base-pair or a distinct base-pair, and only identical base-pairs are given non-zero similarity weights. Moreover, considering that the binding between G and C is tighter than that between A and T due to the three hydrogen bonds between G and C, the WSA allocates an additional similarity weight to an identical G or C base-pair whose preceding (left) aligned base-pair is also an identical base-pair. Specifically, two additional weights, termed the consecutive G/C weight and the non-consecutive G/C weight, are assigned to an identical G/C base-pair whose preceding aligned base-pair is an identical G/C base-pair or an identical A/T base-pair, respectively. Owing to this, the BAC related in (5) becomes , where and are as defined. Hence, the similarity vector of each compared sequence pair can be derived under the best alignment that has been determined by the BAC.

The entries of the vector are the similarity weights of the aligned base-pairs from left to right of the alignment (excluding the two overhang ends). Note that values of the additional weights in the (undesirable) similarity vector of unique sequences (i.e., ) and the (desirable) similarity vector of identical sequences (i.e., ), can be inconsistent (i.e., ). Two instances of the similarity vectors are shown in Figure 1, in which the best alignment region is gray-shadowed and the identical base-pairs under the best alignment are bold and italic. The , , and correspondingly denote the state information transferred from the last aligned base-pair to the identical G/C base-pair for regulating the weight allocation. Specifically, indicates that there is no aligned base-pair in the left or the last/left aligned base-pair is not identical, thus no extra weight is added to the comparing identical G/C base-pair; indicates that the left aligned base-pair is identical G base-pair or C base-pair, thus the consecutive G/C weight is added to the comparing identical G/C base-pair; while indicates that the left is an identical A or T base-pair, thus the non-consecutive G/C weight is added to the comparing base-pair.

Fig. 1: Two examples of similarity vectors under best alignments.

With the similarity vector under the best alignment of two compared sequences, the similarity score is obtained by accumulating the entries of the vector. According to the definition discussed above, the SS of two different sequences $x$ and $y$ of length $n$ is the ratio of the undesirable similarity score (between $x$ and $y$) to the minimal desirable similarity score (between $x$ and $x$ or between $y$ and $y$), and can be formulated as

$$\mathrm{SS}(x, y) = \frac{\sum \mathbf{v}_{x,y}}{\min\big\{ \sum \mathbf{v}_{x,x},\ \sum \mathbf{v}_{y,y} \big\}}, \tag{6}$$

where $\sum \mathbf{v}$ denotes the accumulation of the values in vector $\mathbf{v}$, in which the size of the vector is $n - o$, where $o$ is the length of the overhanging bases under the best alignment. Note that for $\mathbf{v}_{x,x}$ and $\mathbf{v}_{y,y}$, $o = 0$.
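The four-step SS computation described in this section can be sketched in Python as follows. Several choices below are assumptions made only for illustration and are not taken from the paper: alignments are ungapped shifts, the best alignment is the shift with the longest run of consecutive identical bases (ties broken by the number of other identical bases), each identical base-pair has a base weight of 1, and the extra weights `w_cons` and `w_noncons` default to 1.0 and 0.5.

```python
def aligned_pairs(x, y, shift):
    """Overlapping base pairs when y is shifted right by `shift` positions."""
    return [(x[i], y[i - shift]) for i in range(len(x)) if 0 <= i - shift < len(y)]


def run_and_rest(pairs):
    """(longest run of identical pairs, number of other identical pairs)."""
    best = cur = total = 0
    for a, b in pairs:
        if a == b:
            cur += 1
            total += 1
            best = max(best, cur)
        else:
            cur = 0
    return best, total - best


def best_alignment(x, y):
    """Assumed BAC: prefer the longest identical run, then other matches."""
    shifts = range(-(len(y) - 1), len(x))
    return max((aligned_pairs(x, y, s) for s in shifts), key=run_and_rest)


def similarity_vector(pairs, w_cons=1.0, w_noncons=0.5):
    """WSA sketch: weight per aligned pair, overhanging ends excluded."""
    vec, prev = [], None  # prev: base of the identical pair to the left, if any
    for a, b in pairs:
        if a != b:
            vec.append(0.0)
            prev = None
            continue
        w = 1.0  # assumed base weight of an identical pair
        if a in "GC" and prev is not None:
            # Extra weight depends on whether the left pair was G/C or A/T.
            w += w_cons if prev in "GC" else w_noncons
        vec.append(w)
        prev = a
    return vec


def similarity_significance(x, y, **weights):
    """Score of (x, y) divided by the smaller of the two self-scores."""
    def score(a, b):
        return sum(similarity_vector(best_alignment(a, b), **weights))
    return score(x, y) / min(score(x, x), score(y, y))


print(similarity_significance("GGAA", "GGTT"))  # 0.6
```

With these assumed weights, GGAA and GGTT share the run GG (score 1 + 2 = 3), while each sequence scores 5 against itself, giving 3/5. A sequence compared with itself always yields 1.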

IV DNA codes with combinatorial constraints

We design DNA codes having combinatorial constraints using a sorting-based exhaustive search.

IV-A Code with balanced GC content and maximum homopolymer run constraints

DNA sequences composed of a balanced number of ’G’/’C’ and ’A’/’T’ symbols are desirable for parallel biological reactions because they are likely to have a uniform melting temperature ($T_m$). In addition, the maximum homopolymer run constraint, which restricts the maximum allowable length of consecutively repeated symbols in a DNA sequence (named continuity in [24, 25, 34]), is also desirable for DNA codes. Our previous result in [30] has shown an efficient construction of codes subject to these two constraints.
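The two constraints can be expressed as simple predicates. The sketch below enumerates candidates by brute force rather than using the construction of [30], and it assumes that "balanced" means a G/C count equal to half the length, rounded either way; both choices, and the default `max_run=2`, are assumptions made for illustration.

```python
from itertools import groupby, product


def gc_count(seq: str) -> int:
    """Number of G or C symbols in the sequence."""
    return sum(base in "GC" for base in seq)


def max_homopolymer_run(seq: str) -> int:
    """Length of the longest run of one repeated symbol."""
    return max((len(list(group)) for _, group in groupby(seq)), default=0)


def is_constrained(seq: str, max_run: int = 2) -> bool:
    """Balanced GC content (assumed: n//2 or (n+1)//2) and bounded runs."""
    n = len(seq)
    balanced = gc_count(seq) in (n // 2, (n + 1) // 2)
    return balanced and max_homopolymer_run(seq) <= max_run


def constrained_sequences(n: int, max_run: int = 2) -> list:
    """Brute-force enumeration; only practical for short lengths."""
    all_seqs = ("".join(p) for p in product("ACGT", repeat=n))
    return [s for s in all_seqs if is_constrained(s, max_run)]


print(len(constrained_sequences(2, max_run=1)))  # 8
```

Brute force grows as 4^n, so a dedicated construction is preferable for longer sequences.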

IV-B Code with minimum distance/maximum similarity

The minimum distance (or maximum similarity) between sequences has been used as a constraint for DNA code design [12, 18, 15, 7, 27, 13, 8]. The sequence-sequence distance (SSD) and sequence-RC distance (SRCD), which correspondingly imply the sequence-RC hybridization affinity and the sequence-sequence/RC-RC/self-fold hybridization affinity, are required to satisfy the predefined constraint. In this work, the similarity significance $\mathrm{SS}$ is introduced to measure the similarity of two DNA sequences. Therefore, similar to previous works [13, 23], the SS can be categorized into the sequence-sequence SS (SSSS) and the sequence-RC SS (SRCSS). As such, the DNA code design problem can be summarized as searching for a set $\mathcal{C}$ of constrained DNA sequences of length $n$ such that

$$\mathcal{C} \subseteq \{A, T, C, G\}^n, \qquad \mathrm{SS}(x, y) \le \tau \ \text{ and } \ \mathrm{SS}(x, \bar{y}) \le \tau \quad \forall\, x, y \in \mathcal{C},\ x \neq y, \tag{7}$$

where $\{A, T, C, G\}^n$ is the complete set of sequences of length $n$ with symbols from the alphabet set {A, T, C, G}; $\mathrm{SS}(x, y)$ and $\mathrm{SS}(x, \bar{y})$ are the SSSS and SRCSS, respectively, with $x, y \in \mathcal{C}$; and $\tau$ is a predefined maximum SS parameter.

With the approximated mapping existing in the SS measure and the MFE , i.e.,, the free energy gap defined in (1) could be approximated using a SS gap with a renewed formula in terms of as follows,


Since , (8) becomes for a DNA code generated based on (7). Therefore, either or can be set in advance for designing a DNA code. For simplicity, maximum SS was used as the design parameter in our DNA code design.

IV-C An exhaustive search of constrained DNA codes

The initial candidate constrained DNA sequences are generated using our method in [30]. This constrained sequence initialization outperforms the random sequence initialization of most evolutionary search algorithms [6, 25] by avoiding much of the search complexity. With this initialization, the DNA code design problem is turned into an exhaustive search problem with the maximum SS as the only constraint.

The following notations are used in Algorithm 1:

  • : A set of -length DNA sequences subject to GC content and maximum homopolymer run .

  • : A set of n-length constrained DNA sequences with maximum SS , i.e., DNA codes.

  • : The cardinality of a set.

  • : The sub-string starting from th symbol to th symbol of sequence .

  • : A subset of built by grouping sequences with the same suffix, where x is an index indicator.

  • : A neighboring group pair built by linking the groups with the minimum Hamming distance in the suffix.

0:  ,
0:   Initialisation:
Pre-processing:
1:  for all  such that are equal do
2:     Grouping into
3:  end for
4:  Sort the groups into with an order of increasing
5:  Link each a neighboring group pair
Searching:
6:  for  to  do
7:     Select a sequence .
8:     if  satisfies (7) then
9:        Add the valid to , and eliminate from
10:        Select a sequence , and verify the validity
11:        Update as line 9 if valid
12:     else
13:        go back to line 7
14:     end if
15:  end for
Algorithm 1 An exhaustive search of DNA codes
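The overall shape of the search (group candidates by suffix, visit the groups in a fixed order, and keep a candidate only if it stays below the similarity threshold against every accepted codeword and its RC) can be sketched as below. This is a simplified greedy variant rather than Algorithm 1 itself: groups are visited in lexicographic suffix order, the neighbouring-group linking step is omitted, and `hamming_similarity` is a stand-in for the SS measure. The names `greedy_code`, `tau`, and `suffix_len` are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Iterable, List

_WC = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    """Watson-Crick reverse complement of a 5'->3' sequence."""
    return seq.translate(_WC)[::-1]


def hamming_similarity(x: str, y: str) -> float:
    """Stand-in similarity (NOT the SS measure): fraction of equal positions."""
    return sum(a == b for a, b in zip(x, y)) / len(x)


def greedy_code(
    candidates: Iterable[str],
    sim: Callable[[str, str], float],
    tau: float,
    suffix_len: int = 2,
) -> List[str]:
    """Greedily collect candidates whose similarity to every accepted
    codeword and to its reverse complement stays at or below tau."""
    # Pre-processing: group candidates that share the same suffix.
    groups = defaultdict(list)
    for seq in candidates:
        groups[seq[-suffix_len:]].append(seq)

    # Searching: visit groups in sorted suffix order (a simplification).
    code: List[str] = []
    for suffix in sorted(groups):
        for cand in groups[suffix]:
            ok = all(
                sim(cand, c) <= tau and sim(cand, revcomp(c)) <= tau
                for c in code
            )
            if ok:
                code.append(cand)
    return code


print(greedy_code(["AACC", "AACG", "TTGG"], hamming_similarity, tau=0.5))
# ['AACC', 'TTGG']
```

Any similarity function with the same two-argument signature, including an SS implementation, can be passed as `sim` without changing the search loop.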

V Result

Given a specific sequence length, we investigate the code size and cross-hybridization performance of the codes constructed using different similarity/distance models. All codes are generated following the same searching process as discussed above except using different models as the similarity/distance measures of two sequences. The cross-hybridization is reflected by the free energy gap of . The MFEs related to the were obtained from DINAMelt by setting the hybridization temperature to the universal C. Note that as lower MFE values (negative) imply higher hybridization potentials, for convenience, the MFEs with positive values indicating nearly no hybridization potential are set to 0.

| Model \ Sequence length | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|
| Hamming | 1.0 (4) | 2.1 (9) | 2.1 (22) | 2.3 (18) | 3.7 (15) |
| SS | 2.4 (5) | 2.1 (11) | 2.3 (32) | 3.5 (18) | 4.8 (15) |

TABLE I: Comparing free energy gap (code size) with Hamming model

V-1 Comparison with Hamming-based codes

Table I lists the free energy gaps of the codes, with the code sizes in parentheses. For all given lengths, the SS model yields codes whose free energy gaps and code sizes are both no smaller than those of the Hamming-based codes. This indicates that, by using the SS model, more DNA sequences of a given length can be generated with no adverse effect on the cross-hybridization performance. The thresholds used for the different models may vary, provided that the generated codes are comparable in terms of code size and energy gap. The thresholds are chosen so that the code sizes do not become too large for the energy gaps to be calculated. Note that the energy gaps are currently calculated manually based on the free energies obtained from DINAMelt.

| Model \ Sequence length | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|
| Edit | 2.7 (3) | 1.1 (6) | 2.4 (13) | 4.2 (10) | 5.4 (7) |
| SS | 3.0 (3) | 2.4 (6) | 2.7 (13) | 4.7 (10) | 6.4 (7) |

TABLE II: Comparing free energy gap (code size) with Edit model

V-2 Comparison with Edit-based codes

The edit distance is a more stringent metric than the Hamming distance. As such, under the same minimum distance limit, codes constructed with the edit distance are likely to have larger energy gaps because they contain fewer codewords than Hamming-based codes. Therefore, for a fair comparison, we use expurgation to give the constructed SS-based codes the same code sizes as the edit-based codes. Table II shows that, for all lengths considered, the expurgated SS-based codes have larger free energy gaps than the edit-based codes of the same size, which implies that the SS-based codes have better cross-hybridization performance.

V-3 Comparison with codes in [34]

We set the maximum SS constraint based on the minimum distance (out of ) that was used in [34]. We generate each DNA code with maximum SS with values of parameter and sequence length same as [34]. All codes generated follow , where a biased similarity weight is allocated to the consecutively identical G/C base-pairs. The free energy gap and the size of the codes are shown in Table III and Table IV, respectively. The parenthetical data are from [34].

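| Sequence length \ Minimum distance | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|
| 4 | 2.7 (1.54) | - | - | - | - |
| 5 | 3.3 (1.94) | 4.5 (2.45) | - | - | - |
| 6 | 2.4 (1.73) | 5.5 (3.04) | 4.5 (3.47) | - | - |
| 7 | 2.1 (1.94) | 4.1 (2.69) | 6.3 (3.89) | 6.5 (4.36) | - |
| 8 | 2.3 (2.59) | 4.1 (3.25) | 6.0 (3.98) | 7.5 (5.12) | 7.5 (5.59) |

TABLE III: Comparing the free energy gap of DNA codes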

Table III shows that all generated codes except the one with sequence length 8 and minimum distance 4 have better cross-hybridization performance (i.e., a wider free energy gap) than the codes generated in [34]. For that code, however, an 11% decrease in the energy gap (2.3 versus 2.59) comes with an increase of about 256% in code size, to roughly 3.6 times the size in [34] (32 versus 9 in Table IV). Generally, the gap is expected to be lower with a larger code size. However, Table III together with Table IV indicates that, for specific sequence lengths, the proposed method can generate DNA codes with both larger size and wider gap. More specifically, for sequence length 7 and minimum distance 4, we generate a code with size 11 and gap 2.1 against 5 and 1.94 in [34], respectively. For sequence length 8 and minimum distance 5, we generate a code with size 7 and gap 4.1 against 6 and 3.25. Note that the thermodynamic property of the DNA codes in [34] might be worse in practice, as self-fold hybridizations were neglected in their code design and free energy gap definition. Moreover, unlike [34], our codes satisfy the maximum homopolymer run/continuity constraint that is desirable for achieving better stability, which explains the smaller code sizes in a few cases in Table IV.

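| Sequence length \ Minimum distance | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|
| 4 | 1 (1) | - | - | - | - |
| 5 | 1 (2) | 1 (1) | - | - | - |
| 6 | 5 (5) | 1 (2) | 1 (1) | - | - |
| 7 | 11 (5) | 3 (4) | 1 (2) | 1 (1) | - |
| 8 | 32 (9) | 7 (6) | 2 (4) | 1 (2) | 1 (1) |

TABLE IV: The comparison on the size of DNA codes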
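The comparisons drawn from Tables III and IV can be re-derived mechanically. The sketch below transcribes both tables as (this work, [34]) pairs and reads the rows as sequence length and the columns as minimum distance; that reading of the table layout is an assumption based on the surrounding text.

```python
# (sequence length, minimum distance) -> (this work, [34]); layout is assumed.
gap = {
    (4, 4): (2.7, 1.54),
    (5, 4): (3.3, 1.94), (5, 5): (4.5, 2.45),
    (6, 4): (2.4, 1.73), (6, 5): (5.5, 3.04), (6, 6): (4.5, 3.47),
    (7, 4): (2.1, 1.94), (7, 5): (4.1, 2.69), (7, 6): (6.3, 3.89),
    (7, 7): (6.5, 4.36),
    (8, 4): (2.3, 2.59), (8, 5): (4.1, 3.25), (8, 6): (6.0, 3.98),
    (8, 7): (7.5, 5.12), (8, 8): (7.5, 5.59),
}
size = {
    (4, 4): (1, 1),
    (5, 4): (1, 2), (5, 5): (1, 1),
    (6, 4): (5, 5), (6, 5): (1, 2), (6, 6): (1, 1),
    (7, 4): (11, 5), (7, 5): (3, 4), (7, 6): (1, 2), (7, 7): (1, 1),
    (8, 4): (32, 9), (8, 5): (7, 6), (8, 6): (2, 4), (8, 7): (1, 2),
    (8, 8): (1, 1),
}

# Cells where the gap of this work is narrower than that of [34].
narrower = [k for k, (ours, ref) in gap.items() if ours < ref]

# Cells where this work has both a wider gap and a larger code.
both_better = [
    k for k in gap
    if gap[k][0] > gap[k][1] and size[k][0] > size[k][1]
]

# Trade-off for the single exception: relative gap loss and size ratio.
ours_gap, ref_gap = gap[(8, 4)]
ours_size, ref_size = size[(8, 4)]
gap_drop = (ref_gap - ours_gap) / ref_gap
size_ratio = ours_size / ref_size

print(narrower)              # [(8, 4)]
print(both_better)           # [(7, 4), (8, 5)]
print(round(gap_drop, 3))    # 0.112
print(round(size_ratio, 2))  # 3.56
```

The single exception trades roughly 11% of the gap for a code about 3.6 times larger, and two cells improve on both measures at once.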

VI Conclusion

We have introduced a new model for designing DNA codes with controlled cross-hybridization performance. This SS model weighs the significance of the undesirable similarity against the desirable similarity, leveraging the fact that bias exists in the hybridization affinities between individual oligo and its RC pair. An improved BAC and a biased similarity weight allocation were incorporated for the realization of SS. Based on the proposed model and a sorting-based search algorithm, DNA codes with different sequence lengths and maximum SS were generated, while satisfying several combinatorial constraints. The free energy gaps and code sizes of these codes imply that the proposed SS presents better approximation towards the thermodynamic property than the traditional models.


  • [1] L. M. Adleman (1994) Molecular computation of solutions to combinatorial problems. Science 266 (5187), pp. 1021–1024. Cited by: §I.
  • [2] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman (1990) Basic local alignment search tool. Journal of molecular biology 215 (3), pp. 403–410. Cited by: §I, §III-A, §III-A.
  • [3] M. Arita and S. Kobayashi (2002) DNA sequence design using templates. New Generation Computing 20 (3), pp. 263–277. Cited by: §I.
  • [4] N. Bennenni, K. Guenda, and T. A. Gulliver (2019) Greedy construction of dna codes and new bounds. Applicable Algebra in Engineering, Communication and Computing 30 (3), pp. 207–216. Cited by: §I.
  • [5] M. A. Bishop, A. G. D’yachkov, A. J. Macula, T. E. Renz, and V. V. Rykov (2007) Free energy gap and statistical thermodynamic fidelity of dna codes. Journal of Computational Biology 14 (8), pp. 1088–1104. Cited by: §III-A.
  • [6] V. M. Cervantes-Salido, O. Jaime, C. A. Brizuela, and I. M. Martinez-Perez (2013) Improving the design of sequences for DNA computing: A multiobjective evolutionary approach. Applied Soft Computing 13 (12), pp. 4594–4607. Cited by: §IV-C.
  • [7] Y. M. Chee and S. Ling (2008) Improved lower bounds for constant GC-content DNA codes. IEEE Transactions on Information Theory 54 (1), pp. 391–394. Cited by: §I, §I, §IV-B.
  • [8] A. D’yachkov, A. Macula, T. Renz, P. Vilenkin, and I. Ismagilov (2005) New results on DNA codes. In Proceedings. International Symposium on Information Theory, 2005. ISIT 2005., pp. 283–287. Cited by: §I, §IV-B.
  • [9] A. G. D’yachkov, A. N. Kuzina, N. Polyansky, A. Macula, and V. V. Rykov (2014) DNA codes for nonadditive stem similarity. Problems of Information Transmission 50 (3), pp. 247–269. Cited by: §III-A.
  • [10] A. G. D’yachkov, A. J. Macula, W. K. Pogozelski, T. E. Renz, V. V. Rykov, and D. C. Torney (2004) A weighted insertion-deletion stacked pair thermodynamic metric for dna codes. In International Workshop on DNA-Based Computers, pp. 90–103. Cited by: §III-A.
  • [11] R. Deaton, M. Garzon, R. C. Murphy, J. A. Rose, D. R. Franceschetti, and S. E. Stevens Jr (1996) Genetic search of reliable encodings for DNA-based computation. In Proceedings of the First Annual Conference on Genetic Programming, Vol. 419, pp. 421. Cited by: §I.
  • [12] P. Gaborit and O. D. King (2005) Linear constructions for DNA codes. Theoretical Computer Science 334 (1-3), pp. 99–113. Cited by: §I, §I, §IV-B.
  • [13] S. Kawashimo, H. Ono, K. Sadakane, and M. Yamashita (2006) DNA sequence design by dynamic neighborhood searches. In International Workshop on DNA-Based Computers, pp. 157–171. Cited by: §I, §IV-B.
  • [14] O. D. King and P. Gaborit (2007) Binary templates for comma-free DNA codes. Discrete Applied Mathematics 155 (6-7), pp. 831–839. Cited by: §I.
  • [15] O. D. King (2003) Bounds for DNA codes with constant GC-content. the electronic journal of combinatorics 10 (1), pp. 33. Cited by: §I, §I, §IV-B.
  • [16] S. Kobayashi, T. Kondo, and M. Arita (2002) On template method for DNA sequence design. In International Workshop on DNA-Based Computers, pp. 205–214. Cited by: §I.
  • [17] A. J. Macula, A. Schliep, M. A. Bishop, and T. E. Renz (2008) New, improved, and practical k-stem sequence similarity measures for probe design. Journal of Computational Biology 15 (5), pp. 525–534. Cited by: §III-A.
  • [18] A. Marathe, A. E. Condon, and R. M. Corn (2001) On combinatorial DNA word design. Journal of Computational Biology 8 (3), pp. 201–219. Cited by: §I, §I, §IV-B.
  • [19] N. R. Markham and M. Zuker (2005) DINAMelt web server for nucleic acid melting prediction. Nucleic acids research 33 (suppl_2), pp. W577–W581. Cited by: §II.
  • [20] O. Milenkovic and N. Kashyap (2005) On the design of codes for DNA computing. In International Workshop on Coding and Cryptography, pp. 100–119. Cited by: §I.
  • [21] H. B. Nielsen, R. Wernersson, and S. Knudsen (2003) Design of oligonucleotides for microarrays and perspectives for design of multi-transcriptome arrays. Nucleic Acids Research 31 (13), pp. 3491–3496. Cited by: §I.
  • [22] L. Organick, S. D. Ang, Y. Chen, R. Lopez, S. Yekhanin, K. Makarychev, M. Z. Racz, G. Kamath, P. Gopalan, B. Nguyen, et al. (2018) Random access in large-scale dna data storage. Nature biotechnology 36 (3), pp. 242. Cited by: §I.
  • [23] V. Phan and M. H. Garzon (2009) On codeword design in metric DNA spaces. Natural Computing 8 (3), pp. 571. Cited by: §I, §II, §IV-B.
  • [24] S. Shin, D. Kim, I. Lee, and B. Zhang (2002) Evolutionary sequence generation for reliable DNA computing. In Proceedings of the 2002 Congress on Evolutionary Computation. CEC’02 (Cat. No. 02TH8600), Vol. 1, pp. 79–84. Cited by: §IV-A.
  • [25] S. Shin, I. Lee, D. Kim, and B. Zhang (2005) Multiobjective evolutionary optimization of DNA sequences for reliable DNA computing. IEEE transactions on evolutionary computation 9 (2), pp. 143–158. Cited by: §IV-A, §IV-C.
  • [26] M. R. Shortreed, S. B. Chang, D. Hong, M. Phillips, B. Campion, D. C. Tulpan, M. Andronescu, A. Condon, H. H. Hoos, and L. M. Smith (2005) A thermodynamic approach to designing structure-free combinatorial DNA word sets. Nucleic acids research 33 (15), pp. 4965–4977. Cited by: §II, §II.
  • [27] D. C. Tulpan, H. H. Hoos, and A. E. Condon (2002) Stochastic local search algorithms for DNA word design. In International Workshop on DNA-Based Computers, pp. 229–241. Cited by: §I, §I, §IV-B.
  • [28] B. Wang, X. Zheng, S. Zhou, C. Zhou, X. Wei, Q. Zhang, and Z. Wei (2018) Constructing DNA barcode sets based on particle swarm optimization. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB) 15 (3), pp. 999–1002. Cited by: §I, §I.
  • [29] X. Wang and B. Seed (2003) Selection of oligonucleotide probes for protein coding sequences. Bioinformatics 19 (7), pp. 796–802. Cited by: §I.
  • [30] Y. Wang, M. Noor-A-Rahim, E. Gunawan, Y. L. Guan, and C. L. Poh (2019) Construction of bio-constrained code for DNA data storage. IEEE Communications Letters. Cited by: §IV-A, §IV-C.
  • [31] Y. Wang, M. Noor-A-Rahim, J. Zhang, E. Gunawan, Y. L. Guan, and C. L. Poh (2019) High capacity dna data storage with variable-length oligonucleotides using repeat accumulate code and hybrid mapping. Journal of Biological Engineering 13 (1), pp. 89. Cited by: §I.
  • [32] Q. Xu, M. R. Schlabach, G. J. Hannon, and S. J. Elledge (2009) Design of 240,000 orthogonal 25mer DNA barcode probes. Proceedings of the National Academy of Sciences 106 (7), pp. 2289–2294. Cited by: §I.
  • [33] J. Ye, G. Coulouris, I. Zaretskaya, I. Cutcutache, S. Rozen, and T. L. Madden (2012) Primer-BLAST: a tool to design target-specific primers for polymerase chain reaction. BMC bioinformatics 13 (1), pp. 134. Cited by: §I.
  • [34] Q. Zhang, B. Wang, X. Wei, and C. Zhou (2013) A novel constraint for thermodynamically designing DNA sequences. PloS one 8 (8), pp. e72180. Cited by: §II, §II, §IV-A, §V-3, §V-3, §V-3.