, and introduced a DNA coder for the autoencoder’s latent space tensor.
2.1 DNA data storage
The memory of humanity relies on our ability to manage increasingly large amounts of data over periods ranging from a few years to several centuries. Current tools are no longer sufficient, and it is necessary to consider game-changing solutions that can become operational quickly. One of the most promising is to store information in the form of DNA, just as living beings store their genome. Indeed, DNA provides a very stable storage medium over very long periods and requires only simple conditions to implement.
The first step in the encoding workflow is the construction of a dictionary of codewords composed of the symbols A, C, G and T, also called nucleotides. The DNA-coded information stream must respect some biochemical constraints on the combinations of bases that form a DNA fragment: homopolymers, high or low GC content, and repeated patterns should be avoided. One must also take into account that the process involves biochemical procedures which can corrupt the encoded data. Synthesis, sequencing, storage and the manipulation of DNA (mainly PCR amplification) may introduce errors in the form of substitutions or indels (insertions or deletions of nucleotides), and may jeopardize the integrity of the stored content.
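The three constraints listed above can be checked with a few lines of code. The following is a minimal sketch rather than the procedure used in this work: the function names and the thresholds (maximum run length, GC bounds, pattern length) are illustrative assumptions, since the text does not give numerical limits.

```python
def max_homopolymer_run(seq):
    """Length of the longest run of identical consecutive nucleotides."""
    longest = run = 0
    prev = None
    for base in seq:
        run = run + 1 if base == prev else 1
        longest = max(longest, run)
        prev = base
    return longest


def gc_content(seq):
    """Fraction of nucleotides that are G or C (0.0 for an empty sequence)."""
    return sum(base in "GC" for base in seq) / len(seq) if seq else 0.0


def has_repeated_pattern(seq, k):
    """True if any substring of length k occurs more than once in seq."""
    seen = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in seen:
            return True
        seen.add(kmer)
    return False


def respects_constraints(seq, max_run=3, gc_min=0.4, gc_max=0.6, pattern_len=8):
    """Check the three constraints; all thresholds are assumed, not sourced."""
    return (
        max_homopolymer_run(seq) <= max_run
        and gc_min <= gc_content(seq) <= gc_max
        and not has_repeated_pattern(seq, pattern_len)
    )
```

For instance, `respects_constraints("AAAACGCG")` fails under these assumed thresholds because of the run of four A nucleotides.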
However, to ensure better adaptation to the characteristics of the storage medium, i.e., DNA, and possibly achieve higher storage efficiency, it is preferable to design coding algorithms specific to DNA storage. Indeed, since the cost of DNA synthesis is relatively high, it is also important to compress the data as efficiently as possible before synthesizing the sequence into DNA. As an example, among other relevant works,  proposed a Discrete Wavelet Transform (DWT) image decomposition in which the DWT coefficients are scalar quantized and then encoded using a quaternary code. By contrast, following the example of what is done today in JPEG standardization for image coding with the development of JPEG AI (https://jpeg.org/jpegai/index.html), we propose in this work to use neural networks such as autoencoders for image coding with quaternary codes.
2.2 Compressive autoencoders for DNA image storage
The general idea of our proposed encoding process is depicted in figure 1 and can be roughly described by the following steps. First, the input image is compressed using a lossy or near-lossless image coder. Here, we propose to use a compressive autoencoder to learn the image characteristics as well as the noisy biochemical process. Convolutional autoencoders have shown good properties for image denoising and therefore appear to be a good candidate for dealing with the noise introduced by the DNA data storage process. Second, a quantization operation is applied to the latent vector. The reason is that the latent space does not suffice as a compressed representation, because it uses floating-point numbers. The goal of quantization is to reduce the cost of the latent representation on the one hand and to facilitate its encoding on the other. Third, in the case of DNA storage, the quantized latent vector is encoded with a quaternary code over the alphabet {A, C, G, T}. Decoding is performed by the decoder. The quantization operation is one of the major problems when training compressive autoencoders. Indeed, to train a model, all the operations have to be differentiable, which is not the case for quantization (most of the time it is equivalent to a rounding function). However, this can be tackled by using a linear approximation function as in .
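The differentiability issue mentioned above can be illustrated with a small sketch. It assumes that the linear approximation is simply the identity function, so that the gradient passes through the rounding unchanged; this is one common choice and is stated here as an assumption, not as the exact approximation used in this work. The function names are hypothetical.

```python
import math


def quantize_forward(z):
    """Forward pass: round each latent component to the nearest integer."""
    return [float(math.floor(v + 0.5)) for v in z]


def quantize_backward(upstream_grad):
    """Backward pass under the assumed identity approximation.

    The true derivative of rounding is zero almost everywhere, which would
    block training. Replacing it with the derivative of the identity (1.0)
    lets the upstream gradient reach the encoder unchanged.
    """
    return list(upstream_grad)


def squared_error_grad(q, target):
    """Gradient of sum((q - target)^2) with respect to q."""
    return [2.0 * (a - b) for a, b in zip(q, target)]
```

With this sketch, the gradient with respect to the unquantized latent vector is obtained as `quantize_backward(squared_error_grad(quantize_forward(z), target))`.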
During training, the compressive autoencoder has to minimize two quantities:
The distortion, which corresponds to the squared difference between the original image and the image reconstructed after compression and decompression.
The entropy of the quantized latent space, computed in base 4. Since the entropy is closely linked to the rate, we use both terms interchangeably in the rest of the paper.
The structure of the compressive autoencoder model we propose in this paper is closely related to the one proposed by Theis et al. It is described in section 3. The introduction of the biochemical noise in the model is described in section 4.
3 Non-noisy latent space
3.1 Neural Network
We designed our proposed autoencoder to obtain a latent space with higher dimensionality than the one proposed in , with the objective of reaching higher bit-rates. Because we did not reduce the size of the latent space as much as in , our new model is shallower. This means that our new model has a reduced number of parameters, which could cause some overfitting problems during the training process. To mitigate this risk, we decided to add new residual blocks with skip connections to the model.
When encoding data, the common practice is to operate on integers rather than floating-point numbers. For that reason, the output of the autoencoder's encoding layers in the latent space cannot be encoded with a quaternary code as it is, and a quantization operator must be introduced in the process. Given an input image, the output (decoded) image of the autoencoder can be expressed as a composition of several functions, applied in the following order: the encoding part of the autoencoder, the quantizer that rounds the components z of the latent vector to integer values, the DNA-encoding algorithm, the DNA-decoding algorithm, and the decoding part of the autoencoder.
The DNA-encoding and decoding algorithms map a sequence of symbols to and from a quaternary stream composed of the letters of the alphabet {A, C, G, T}. In the absence of noise, DNA decoding exactly inverts DNA encoding, so the decoding part of the autoencoder receives the quantized latent vector unchanged. In this work we have used the DNA fixed-length code proposed in .
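The composition described above can be written compactly as follows. The symbols are assumed notation chosen for illustration and are not necessarily those of the original equations: $x$ and $\hat{x}$ denote the input and decoded images, $f$ and $g$ the encoding and decoding parts of the autoencoder, $Q$ the quantizer, and $C$ and $D$ the DNA-encoding and DNA-decoding algorithms.

```latex
\begin{align}
  \hat{x} &= g\Bigl(D\bigl(C\bigl(Q(f(x))\bigr)\bigr)\Bigr) \tag{1} \\
  D\bigl(C(s)\bigr) &= s
    \quad \text{for any sequence of quantized symbols } s
    \text{ (noiseless case)}
\end{align}
```

Under this notation, the noiseless case reduces to $\hat{x} = g\bigl(Q(f(x))\bigr)$, which is why the DNA code itself does not affect the reconstruction quality when no error occurs.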
3.2 Loss function
We used in our work a classical loss function that combines two terms: the distortion between the input and output images, and the entropy of the quantized latent space. The entropy is computed in base 4 from the probability of each quantized component value, and is therefore expressed in nucleotides per component.
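To make the unit of "nucleotides per component" concrete, the sketch below estimates the base-4 entropy of a list of quantized values from their empirical frequencies and combines it with a mean squared error. The function names and the `weight` parameter balancing the two terms are assumptions made for illustration; the text does not specify how the terms are weighted, and a trainable model would need a differentiable estimate of the entropy rather than this empirical count.

```python
import math
from collections import Counter


def entropy_base4(symbols):
    """Empirical entropy of the symbols, in nucleotides per component."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum((c / n) * math.log(c / n, 4) for c in counts.values())


def mse(x, x_hat):
    """Mean squared difference between two equally sized pixel lists."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def rate_distortion_loss(x, x_hat, quantized_latent, weight=1.0):
    """Distortion plus weighted rate; `weight` is a hypothetical trade-off."""
    return mse(x, x_hat) + weight * entropy_base4(quantized_latent)
```

As a sanity check, four equiprobable values give an entropy of one nucleotide per component, which matches the fact that a single nucleotide can distinguish four symbols.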
To encode the quantized values, we use a fixed-length coding system: every value is coded with a DNA codeword of the same length. The set of all different codewords of a given length is finite. In the case of a quaternary code, the number of different codewords is 4 raised to the power of the codeword length, but only a subset of them are DNA codewords that respect the biochemical constraints (mainly, no codeword containing homopolymer runs such as AAAA or TTTT can be generated). We call codebook the set of constrained DNA codewords of this length that can be used for the fixed-length encoding. Ensuring that any compressed image can be encoded means that the number of different possible values output by the quantizer has to be smaller than or equal to the number of codewords available.
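The relation between codeword length, constraints and codebook size can be explored by brute-force enumeration. This is only an illustrative sketch: it filters on homopolymer runs alone, with an assumed `max_run` limit, and does not reproduce the fixed-length code actually used in this work. It also ignores runs that may appear at the boundary between two concatenated codewords.

```python
from itertools import product

ALPHABET = "ACGT"


def has_homopolymer(word, max_run):
    """True if the word contains a run of identical bases longer than max_run."""
    run = 1
    for prev, cur in zip(word, word[1:]):
        run = run + 1 if cur == prev else 1
        if run > max_run:
            return True
    return False


def build_codebook(length, max_run=1):
    """All quaternary words of the given length without forbidden runs."""
    words = ("".join(w) for w in product(ALPHABET, repeat=length))
    return [w for w in words if not has_homopolymer(w, max_run)]


def is_encodable(num_quantizer_values, codebook):
    """Every quantizer output needs its own codeword."""
    return num_quantizer_values <= len(codebook)
```

For codewords of length 4, forbidding any repeated adjacent base leaves 108 of the 256 quaternary words, so a quantizer with at most 108 output values could be encoded under this assumed constraint.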
The compression model output was bounded using a hyperbolic tangent function. The uniform quantization was then applied to the output of that bounded function, giving a finite number of different possible values. The number of possible values can be adjusted with the quantization step.
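The effect of the quantization step on the number of possible values can be sketched as follows. The exact form of the quantizer is an assumption here: the bounded output is divided by the step and rounded to the nearest integer index. The names `bounded_quantize` and `max_num_levels` are hypothetical.

```python
import math


def bounded_quantize(y, step):
    """Bound y to [-1, 1] with tanh, then round to a uniform integer index."""
    return math.floor(math.tanh(y) / step + 0.5)


def max_num_levels(step):
    """Upper bound on the number of distinct indices for a given step.

    Since tanh stays within [-1, 1], the index magnitude never exceeds
    k = round(1 / step), giving at most 2 * k + 1 values.
    """
    k = math.floor(1.0 / step + 0.5)
    return 2 * k + 1
```

With a step of 0.25 this gives at most 9 values, and halving the step to 0.125 raises the bound to 17, which illustrates how a smaller step requires a larger codebook.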
4 Introducing biochemical noise
As described previously, one must also take into account that the process involves biochemical procedures which can corrupt the encoded data. Synthesis, sequencing, storage and the manipulation of DNA may introduce errors in the form of substitutions or indels (insertions or deletions of nucleotides), and may jeopardize the integrity of the stored content. DNA storage can then be viewed as a naturally noisy channel for which appropriately resilient encoding solutions need to be defined.
In order to adapt the autoencoders to the noisy channel of DNA data storage, one needs to introduce a noise model between the encoding and decoding parts. Since substitutions are the prevalent type of error in the DNA storage channel, we focus on them in this work. A substitution, as mentioned earlier, is the phenomenon of one nucleotide being changed into another one (see figure 2).
Furthermore, since we are using the fixed-length codes of  for encoding the quantized values, a substitution error affects only one codeword and has no effect on the decoding of its neighbors, as shown in figure 2. This means that the model for the substitution noise can be established and applied at the level of the quantized tensor (the coefficients in the latent space) and not necessarily at the nucleotide level.
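The locality argument above can be checked on a toy strand. In this sketch the strand, the codeword length of 4 and the helper names are all made up for illustration; the point is only that, with fixed-length codewords, a single substitution changes exactly one codeword.

```python
def split_codewords(strand, length):
    """Cut a strand into consecutive codewords of the same length."""
    return [strand[i:i + length] for i in range(0, len(strand), length)]


def substitute(strand, position, new_base):
    """Replace the nucleotide at the given position by another one."""
    return strand[:position] + new_base + strand[position + 1:]


def corrupted_indices(original, noisy, length):
    """Indices of the codewords that differ between the two strands."""
    pairs = zip(split_codewords(original, length), split_codewords(noisy, length))
    return [i for i, (a, b) in enumerate(pairs) if a != b]
```

For example, substituting position 5 of `"ACGTTGCAAGCT"` only alters the second codeword, so only one latent coefficient is affected. An insertion or deletion would instead shift every following codeword, which is why this argument is specific to substitutions.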
Consider the substitution noise introduced by the biochemical process. The autoencoder input/output function given in equation (1) can be rewritten so that this noise is introduced at the level of the latent space and the quaternary decoder is designed to decode a noisy code. Because of the biochemical constraints, the DNA code is not a complete quaternary code, and thus a noisy codeword might not be decodable. To ensure decodability, when a noisy codeword is not decodable, we replace it with the closest valid codeword in the codebook in terms of Hamming distance.
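The replacement rule described above amounts to a nearest-neighbor search in the codebook. The sketch below assumes codewords of equal length and breaks ties by codebook order, which is an arbitrary choice made here; the text does not say how ties are resolved. The small codebook is a made-up example.

```python
def hamming(a, b):
    """Number of positions at which two equal-length words differ."""
    return sum(x != y for x, y in zip(a, b))


def nearest_codeword(noisy, codebook):
    """Return the noisy word if valid, else the closest word in the codebook."""
    if noisy in codebook:
        return noisy
    # min() keeps the first minimum, so ties follow codebook order (assumed).
    return min(codebook, key=lambda word: hamming(noisy, word))
```

With the toy codebook `["ACGT", "CATG", "TGCA"]`, the undecodable word `"ACGA"` is mapped back to `"ACGT"`, which is at Hamming distance 1.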
In this work we assumed that this noise can be modeled by an i.i.d. Gaussian process. The optimization of the autoencoder during training is still performed by minimizing the loss function given by equation (3).
5 Experimental results
5.1 Implementation of the training
We trained the model with the Flickr 30k image dataset. During the training step, we used batches of 32 random crops of size 96x96 from the Flickr images. The training process was separated into two steps: the first with a learning rate of 1e-4 for 200 epochs, and the second with a learning rate of 1e-5 for 500 epochs. Models were trained independently for each quantization step.
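The training settings above can be summarized in a short sketch. It assumes that the two stages run back to back, so that the learning rate switches at epoch 200, and that crops are drawn uniformly inside each image; both points are interpretations of the text, and the function names are hypothetical.

```python
import random

BATCH_SIZE = 32   # crops per batch, as stated in the text
CROP_SIZE = 96    # crop side in pixels, as stated in the text


def learning_rate(epoch):
    """Two-stage schedule: 1e-4 for the first 200 epochs, then 1e-5."""
    return 1e-4 if epoch < 200 else 1e-5


def random_crop_box(width, height, size=CROP_SIZE, rng=random):
    """Return (left, top, right, bottom) of a random size x size crop."""
    left = rng.randrange(width - size + 1)
    top = rng.randrange(height - size + 1)
    return left, top, left + size, top + size
```

A batch would then be formed by drawing `BATCH_SIZE` such boxes from randomly selected training images.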
The model was then evaluated using the Kodak dataset. For each compression rate chosen, we computed the performance for each image of the dataset, as well as the average over the whole dataset. In figure 4, the gray curve (called avg) represents the average performance of the model and its resistance to different levels of noise.
5.2 Training process without substitution noise
In this section, we describe the performance of the networks trained as described in section 3 using the model of formula (1), and how well this performance is maintained when noise is introduced into the channel at the encoding step but not during training. The experiments were conducted on different autoencoder models, each one trained for a given compression rate (i.e., a given quantization step as defined in formula (2)).
The results are presented in figure 4, where the PSNR is evaluated for different noise levels and averaged over all the images of the Kodak dataset. The results are provided for two different rates, 6 and 4 bits/nucleotide. The reconstruction maintains a good visual quality until the substitution noise level reaches around 5%, at which point numerous artifacts start to drastically affect the reconstruction, which only preserves an approximation of the general image features, as shown in figure 3(a). These results show that the substitution noise has a strong influence on the performance of our models. Beyond a few percent of errors, the images become practically unusable. This justifies the development of compression methods adapted to noise.
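For reference, the PSNR used as the quality metric above can be computed as in the following sketch. It assumes 8-bit images flattened into lists of pixel values, hence the default peak of 255; the function name and signature are illustrative.

```python
import math


def psnr(original, reconstructed, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)
```

The dataset-level figure would then be the mean of the per-image PSNR values at a given noise level.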
5.3 Training process including substitution noise
In this section, we evaluate the performance of our autoencoder models trained with a noisy latent space as proposed in section 4 and formula (5). For each quantization step (or equivalently each rate in nucleotides per component) we trained a specific autoencoder.
Here, the most important parameter is the noise level. It increases throughout the training from 0 (meaning that at the beginning of the training, the model learns with non-noisy data) to a maximum noise level, which is used during the last epochs of the training session. Note that a more complex alternative would be to train several models, one for each noise level, instead of training a single model for all the possible noise levels.
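The progressive increase of the noise level can be sketched as a simple schedule. The linear shape of the ramp and the `hold_epochs` parameter (the number of final epochs spent at the maximum level) are assumptions made for illustration; the text only states that the level grows from 0 to a maximum used during the last epochs.

```python
def noise_level(epoch, total_epochs, max_level, hold_epochs=0):
    """Noise level for a given epoch under an assumed linear ramp.

    The level grows linearly from 0 and reaches max_level at the start of
    the last hold_epochs epochs, where it stays constant.
    """
    ramp_end = total_epochs - hold_epochs
    if ramp_end <= 0:
        return max_level
    return max_level * min(1.0, epoch / ramp_end)
```

For example, with 100 epochs, a maximum level of 5% and the last 10 epochs held at the maximum, the level is 2.5% at epoch 45 and 5% from epoch 90 onward.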
The results shown in figure 4 indicate that, by training our models to adapt to substitution noise, we obtained a gain of between 0.5 and 1.5 dB in PSNR, depending on the level of noise. On the other hand, the models adapted to noise seem to underperform when no noise is introduced in the latent space. Furthermore, the visual results provided by the optimized autoencoder remain very satisfactory for a substitution noise level around 5% (see figure 3(b)), showing strong robustness to high substitution noise levels and the good performance of the proposed solution.
6 Conclusion
In this work, we have developed a compression solution for image storage on synthetic DNA that is robust to substitution noise. The proposed approach is based on a compressive autoencoder optimized for DNA fixed-length encoding technologies. A noise model was developed and introduced in the autoencoder optimization to analyze its effects on the reconstruction quality of the decoded image. Training the compression neural network with the noise model showed improvements over the network trained without it (figure 4).
In future work, experimenting with entropy-based DNA coding systems instead of fixed-length encoding might bring some improvements, since the compression network minimizes entropy. Introducing new types of noise (insertions and deletions), together with solutions to minimize their effects, is another direction for improvement.
-  Lucas Theis, Wenzhe Shi, Andrew Cunningham, and Ferenc Huszár, “Lossy image compression with compressive autoencoders,” International Conference on Learning Representations, 2017.
-  Johannes Ballé, David Minnen, Saurabh Singh, Sung Jin Hwang, and Nick Johnston, “Variational image compression with a scale hyperprior,” International Conference on Learning Representations, 2018.
-  George Toderici, Damien Vincent, Nick Johnston, Sung Jin Hwang, David Minnen, Joel Shor, and Michele Covell, “Full resolution image compression with recurrent neural networks,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
-  Fei Yang, Luis Herranz, Joost van de Weijer, José A. Iglesias Guitián, Antonio M. López, and Mikhail G. Mozerov, “Variable rate deep image compression with modulated autoencoder,” IEEE Signal Processing Letters, vol. 27, pp. 331–335, 2020.
-  Yoojin Choi, Mostafa El-Khamy, and Jungwon Lee, “Variable rate deep image compression with a conditional autoencoder,” IEEE/CVF International Conference on Computer Vision (ICCV), 2019.
-  Thierry Dumas, Aline Roumy, and Christine Guillemot, “Autoencoder based image compression: can the learning be quantization independent?,” 2018.
-  V Oliveira, T Oberlin, M. Chabert, Charly Poulliat, Mickaël Bruno, C Latry, M Carlavan, S Henrot, F Falzon, and Roberto Camarero, “Simplified entropy model for reduced-complexity end-to-end variational autoencoder with application to on-board satellite image compression,” 09 2020.
-  S. M. Hossein Tabatabaei Yazdi, Ryan Gabrys, and Olgica Milenkovic, “Portable and error-free DNA-based data storage,” Scientific Reports, 2017.
-  N. Goldman, P. Bertone, and S. Chen, “Towards practical, high-capacity, low-maintenance information storage in synthesized DNA,” Nature, 2013.
-  Melpomeni Dimopoulou, Marc Antonini, Pascal Barbry, and Raja Appuswamy, “A biologically constrained encoding solution for long-term storage of images onto synthetic DNA,” European Signal Processing Conference (EUSIPCO), 2019.