Unsupervised Representation Learning of DNA Sequences

by Vishal Agarwal, et al.

Recently, several deep learning models have been used for DNA sequence based classification tasks. Such tasks often require long, variable-length DNA sequences as input. In this work, we use a sequence-to-sequence autoencoder model to learn, in an unsupervised manner, a fixed-dimensional latent representation of long, variable-length DNA sequences. We evaluate the learned latent representation both quantitatively and qualitatively on the supervised task of splice site classification. The quantitative evaluation is done under two different settings. Our experiments show that these representations can be used as features or priors in closely related tasks such as splice site classification. Further, in our qualitative analysis, we use a model attribution technique, Integrated Gradients, to infer significant sequence signatures influencing the classification accuracy. We show that the identified splice signatures agree well with existing knowledge.




1 Introduction

Recently, there has been a surge in studies using deep learning models for DNA sequence based classification tasks. One of the primary reasons for the adoption of such methods is representation learning, or feature learning, from raw data. In DNA sequence based classification tasks, the raw data are DNA sequences over the four nucleotides A, T, G, and C. Most studies choose fixed-length DNA sequences as input by fixing a context window. However, in many cases the important nucleotides do not lie within the same context window in all input sequences. Hence, there is a need for models that can handle long as well as variable-length DNA sequences as input. Such a model can then take into account both short-range (local) and long-range (global) dependencies.

In this work, we primarily focus on learning representations for long, variable-length DNA sequences. We use an autoencoder-based sequence-to-sequence LSTM model that encodes the input sequence into a fixed-length latent embedding and then reconstructs the original input sequence from the embedding alone. The learned representations comprise the model parameters and the fixed-length latent embedding. This allows the model to aggregate information by implicitly learning parameters that summarize the input sequence well as a fixed-dimensional latent representation. The learned representations can then be used as features or a priori information in various tasks related to DNA sequences, such as splice site prediction.

We evaluate our model on the splice site classification task. In genomics, splicing is an important phenomenon that leads to protein diversity in the body. We perform two quantitative evaluations and one qualitative evaluation of the learned latent representation of the input DNA sequence. The quantitative evaluation is carried out in two different settings. For the qualitative analysis, we use Integrated Gradients, a model attribution technique proposed by Sundararajan et al. (2017). It attributes the predicted classification score to the input features and identifies the relevant regions and motifs influencing splicing.

Figure 1: Graphical illustration of the entire work. Model 1 corresponds to the sequence-to-sequence autoencoder. Models 2(a) and 2(b) show the classifiers used to evaluate the learned representation on a supervised task. Integrated Gradients uses the prediction score to provide model attribution and aid visualization.

2 Related work

Most previous work on feature learning and motif identification has been in the context of supervised learning. Dutta et al. (2018b) proposed a method for learning distributed feature representations of splice junctions using k-mers. Barekatain et al. (2017) introduced spline transformations to improve upon traditional neural networks, learn better representations, and improve prediction accuracy. Lanchantin et al. (2017) used a memory matching network to dynamically learn a memory bank of motifs using a sequence classification task. Lee & Yoon (2015) proposed a Restricted Boltzmann Machine based model with a new training method called boosted Contrastive Divergence to predict non-canonical splice sites and learn non-canonical feature vectors that could not be identified by traditional methods. Some works have also used Convolutional Neural Networks (CNNs) to learn representations implicitly (Kim et al., 2018; Alipanahi et al., 2015; Zhang et al., 2016; Lanchantin et al., 2016; Dutta et al., 2018a).

Previous work leverages labeled data to learn feature representations and motifs. Moreover, the learned features are in most cases specific to one classification task. To our knowledge, no work has been done on learning representations in an unsupervised setting for general use. In our work, we learn representations in a more general unsupervised setting so that they are not task-specific and can be used as features or priors in problems related to genomics.

3 Model Description

In this section, we describe our model for learning fixed-length latent embeddings of sequences. Figure 1 shows a complete graphical overview of the work.

3.1 Sequence Autoencoder

We use an autoencoder-like sequence-to-sequence model to learn fixed-length representations of sequences. The model (Model 1 in Figure 1) consists of an encoder LSTM and a decoder LSTM, inspired by Sutskever et al. (2014). The encoder network uses a bidirectional LSTM to process the input sequence from both ends and map it to a fixed-length embedding. A bidirectional LSTM is used because it captures context better: it processes the input in both directions, from past to future and from future to past, and maintains two hidden states. The encoder summarizes the input sequence and captures important motif information. The decoder network uses a unidirectional LSTM to reconstruct the input sequence from the latent embedding alone. The motivation is to capture relevant features that summarize the input sequence well enough to reconstruct it.

4 Experiment Design

4.1 Dataset

We use GENCODE annotations (Harrow et al., 2012) based on the human genome assembly GRCh38 to prepare the dataset. The model is trained on an earlier release, version 20, and tested on a newer release, version 26, after removing all junction pairs that are also present in version 20. Human genome data corresponding to 24 chromosomes, namely 1 to 22, X, and Y, are used. Each of these chromosomes consists of multiple genes, and each gene is made up of multiple exons and introns. Intron lengths vary from 1 to about a million nucleotides, but we choose only those introns whose length is above 30. For true data, or positive sequences, we choose intron sequences beginning with GT, corresponding to a donor site (exon-intron boundary), and ending with AG, corresponding to an acceptor site (intron-exon boundary). Each such sequence constitutes a splice site sample containing both the donor and the acceptor site. The junction pairs are extracted from protein-coding genes only. This leaves us with 290,502 positive samples from version 20, which we use for autoencoder training, and 5,612 positive samples from version 26, which we use for testing in the quantitative analysis discussed in Section 5.
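The selection rule for positive samples can be illustrated with a minimal Python sketch. The function names and the toy sequences are hypothetical; the sketch assumes the intron sequences have already been extracted from the annotation as plain strings, which is not shown here.

```python
def is_candidate_junction(intron: str, min_length: int = 30) -> bool:
    """True for introns longer than min_length that start with GT and end with AG."""
    seq = intron.upper()
    return len(seq) > min_length and seq.startswith("GT") and seq.endswith("AG")


def select_positive_introns(introns, min_length=30):
    """Keep only the introns that satisfy the canonical GT...AG rule."""
    return [seq for seq in introns if is_candidate_junction(seq, min_length)]


# Toy inputs (made up for illustration, not taken from GENCODE).
introns = [
    "GT" + "A" * 40 + "AG",  # canonical and long enough: kept
    "GT" + "C" * 10 + "AG",  # canonical but only 14 nt long: dropped
    "GC" + "A" * 40 + "AG",  # does not start with GT: dropped
]
selected = select_positive_introns(introns)
```

Only the first toy sequence passes the filter, since the other two violate either the length threshold or the GT donor requirement.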

For negative data generation, we use a technique based on existing work (Zhang et al., 2016; Dutta et al., 2018b). In this approach, false data are randomly sampled from the genome based on some heuristics. For each false junction pair, occurrences of the consensus dimers GT and AG are chosen at random such that the donor and acceptor lie on the same chromosome and the distance between them falls within the length range of the true data. We create a large pool of negative sequences and then sample from it so that the number of negative samples equals the number of positive samples.
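One possible reading of this sampling heuristic is sketched below in Python. It is an illustration under stated assumptions, not the procedure of the cited works: the function names, the retry limit `max_tries`, and the exclusion of annotated junctions via `true_junctions` are all assumptions, and the chromosome is a randomly generated toy string.

```python
import random


def find_dimer_positions(chromosome: str, dimer: str):
    """Return every start index at which the 2-nt dimer occurs."""
    return [i for i in range(len(chromosome) - 1) if chromosome[i:i + 2] == dimer]


def sample_negative_junctions(chromosome, true_junctions, min_len, max_len,
                              n_samples, rng, max_tries=10000):
    """Randomly pair a GT with a downstream AG on the same chromosome.

    Junctions are (start, end) pairs with an exclusive end. Pairs whose length
    lies outside [min_len, max_len], or that appear in true_junctions, are
    rejected (the latter check is an assumption of this sketch).
    """
    donors = find_dimer_positions(chromosome, "GT")
    acceptors = find_dimer_positions(chromosome, "AG")
    negatives = set()
    if not donors or not acceptors:
        return []
    for _ in range(max_tries):
        if len(negatives) >= n_samples:
            break
        start = rng.choice(donors)
        end = rng.choice(acceptors) + 2  # include the AG dimer itself
        if min_len <= end - start <= max_len and (start, end) not in true_junctions:
            negatives.add((start, end))
    return sorted(negatives)


rng = random.Random(0)
chromosome = "".join(rng.choice("ACGT") for _ in range(2000))  # toy chromosome
negatives = sample_negative_junctions(chromosome, set(), 31, 200, 5, rng)
```

Balancing the classes then amounts to requesting as many negatives as there are positives through `n_samples`.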

4.2 Unsupervised Learning

The autoencoder model takes DNA sequences as input and learns latent representations from them in an unsupervised way, without labels. The one-hot representation of the sequence is fed to an embedding layer, whose output goes into the encoder bi-LSTM. The encoder outputs a fixed-length 256-dimensional vector, the latent embedding summarizing the input. The latent vector is then fed to the decoder LSTM, which predicts the input sequence one time step at a time. On the decoder side, the output at each step is computed as a softmax over the nucleotides A, T, G, and C.

The model was trained by minimizing the cross-entropy loss for 300 epochs, with Adam as the optimization algorithm. All experiments were performed on an NVIDIA Titan Xp GPU with 12 GB of memory. The hyperparameter of the model is the dimension of the latent representation. We experimented with 128-, 256-, and 512-dimensional representations and found 256 to perform best, giving the lowest cross-entropy loss.

The next section describes the quantitative and qualitative analysis of the learned representation, which we carry out to verify that the learned representations are useful.
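To make the input encoding and the per-step softmax output concrete, the following Python sketch computes a one-hot encoding and the mean reconstruction cross-entropy for a short sequence. It does not include the LSTM itself: the decoder logits are supplied by hand, and the function names and the channel order A, T, G, C are assumptions made for illustration.

```python
import math

NUCLEOTIDES = "ATGC"  # channel order assumed here; the text does not fix one


def one_hot(sequence):
    """Encode a DNA string as a list of 4-dimensional one-hot vectors."""
    index = {n: i for i, n in enumerate(NUCLEOTIDES)}
    vectors = []
    for nucleotide in sequence.upper():
        vector = [0.0] * len(NUCLEOTIDES)
        vector[index[nucleotide]] = 1.0
        vectors.append(vector)
    return vectors


def log_softmax(logits):
    """Numerically stable log-softmax over one time step's logits."""
    m = max(logits)
    log_total = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_total for x in logits]


def reconstruction_loss(logits_per_step, target_sequence):
    """Mean cross-entropy between per-step softmax outputs and the true nucleotides."""
    targets = one_hot(target_sequence)
    total = 0.0
    for logits, target in zip(logits_per_step, targets):
        log_probs = log_softmax(logits)
        total -= sum(t * lp for t, lp in zip(target, log_probs))
    return total / len(targets)


sequence = "GTAG"
uniform_logits = [[0.0, 0.0, 0.0, 0.0] for _ in sequence]        # uninformed decoder
confident_logits = [[10.0 * v for v in vec] for vec in one_hot(sequence)]
loss_uniform = reconstruction_loss(uniform_logits, sequence)      # equals log(4)
loss_confident = reconstruction_loss(confident_logits, sequence)  # close to zero
```

A decoder that assigns equal probability to all four nucleotides incurs a loss of log 4 per position, which gives a natural reference point when reading the training loss.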

5 Results and Discussion

In this section, we describe the quantitative and qualitative analysis of the learned representation and discuss the results.

5.1 Splice site classification

The quantitative analysis of the learned representations is done on a supervised task under two different settings. First, an LSTM is trained for splice site prediction on DNA sequences. Instead of initializing the LSTM with random weights, we initialize it with the trained encoder weights to add a priori information (Model 2a in Figure 1). This gives the discriminative model a good starting point, so that it converges faster and reaches higher classification accuracy. We compare this model with a baseline model of similar architecture but with randomly initialized parameters. We also experimented with different architectures, namely LSTM, bidirectional LSTM, and bidirectional LSTM with attention. Table 1 compares these models and shows that the encoder-initialized model performs better than the randomly initialized baseline for every architecture.
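The idea of starting the classifier from the trained encoder can be sketched independently of any deep learning framework. In the Python sketch below, parameters are plain nested lists, and the parameter names (`lstm.input`, `lstm.hidden`, `output`), shapes, and the uniform initializer are all hypothetical; a real implementation would copy the corresponding tensors of the framework in use.

```python
import random


def random_matrix(rows, cols, rng):
    """Small random values, standing in for a framework's default initializer."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def init_classifier(param_shapes, pretrained, rng):
    """Build classifier parameters, copying pretrained ones when name and shape match.

    Parameters without a pretrained counterpart (for example the output layer)
    are initialized randomly.
    """
    params, copied = {}, []
    for name, (rows, cols) in param_shapes.items():
        source = pretrained.get(name)
        shape_matches = (
            source is not None
            and len(source) == rows
            and all(len(row) == cols for row in source)
        )
        if shape_matches:
            params[name] = [row[:] for row in source]  # copy, do not alias
            copied.append(name)
        else:
            params[name] = random_matrix(rows, cols, rng)
    return params, copied


rng = random.Random(0)
# Stand-in for the trained encoder weights.
encoder_weights = {
    "lstm.input": random_matrix(4, 8, rng),
    "lstm.hidden": random_matrix(8, 8, rng),
}
classifier_shapes = {"lstm.input": (4, 8), "lstm.hidden": (8, 8), "output": (8, 2)}
classifier, copied = init_classifier(classifier_shapes, encoder_weights, rng)
```

The randomly initialized baseline corresponds to calling `init_classifier` with an empty `pretrained` dictionary.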

In the second evaluation setting, the latent embeddings are used as features for the same task of splice site identification (Model 2b in Figure 1). The difference between the two settings is that the former uses the DNA sequence as input, whereas the latter uses only the 256-dimensional fixed-length latent embedding as the input feature vector. We use a Support Vector Machine (SVM), a 2-layer Artificial Neural Network (ANN), and a vanilla Recurrent Neural Network (RNN) to assess the effectiveness of the latent representations. If the embedding captures motif information, we expect these classifiers to perform well. The results for this setting are shown in Table 2.


Initialization            Model               Accuracy
Random weights            LSTM                95.43%
Random weights            Bi-LSTM             96.04%
Random weights            Bi-LSTM Attention   97.23%
Autoencoder initialized   LSTM                98.54%
Autoencoder initialized   Bi-LSTM             98.60%
Autoencoder initialized   Bi-LSTM Attention   99.07%
Table 1: Classification accuracy for encoder initialized LSTM model

Model Accuracy
SVM 98.63%
ANN 98.88%
Vanilla RNN 98.93%
Table 2: Classification accuracy for simple classifier model

5.2 Model Attribution

This section describes the qualitative analysis of our model, which attributes the predicted output of the supervised task to the input sequence. To this end, we use a popular attribution technique, Integrated Gradients, proposed by Sundararajan et al. (2017). The resulting visualizations highlight important regions in the sequence, which helps us identify motifs present in it. These motifs can be interpreted as signals that influence splicing. Integrated Gradients requires a baseline against which it compares the prediction of the network; it then assigns attribution to the features that differ between the input and the baseline. In our case, we chose a zero embedding matrix as the baseline. The attribution score is calculated by accumulating the gradient of the network prediction score with respect to the embedding at points along the straight-line path from the baseline to the input, and multiplying the result by the difference between the input and baseline feature values. In our experiment, we used 200 steps for gradient calculation along the path.
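The computation described above can be written as a Riemann-sum approximation of the path integral. The Python sketch below applies it to a toy logistic model with an analytic gradient, which is only a stand-in for the trained classifier; the weights, input, and function names are assumptions for illustration. The zero baseline and the 200 steps follow the text.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def model(x, weights, bias):
    """Toy prediction score: logistic regression (stand-in for the classifier)."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def model_gradient(x, weights, bias):
    """Analytic gradient of the toy prediction score with respect to x."""
    p = model(x, weights, bias)
    return [p * (1.0 - p) * w for w in weights]


def integrated_gradients(x, baseline, weights, bias, steps=200):
    """Accumulate gradients along the straight line from baseline to x,
    average them, and scale by (x - baseline) for each feature."""
    accumulated = [0.0] * len(x)
    for k in range(1, steps + 1):
        alpha = k / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        gradient = model_gradient(point, weights, bias)
        accumulated = [a + g for a, g in zip(accumulated, gradient)]
    return [(xi - b) * a / steps for xi, b, a in zip(x, baseline, accumulated)]


weights, bias = [0.5, -1.0, 2.0], 0.0  # made-up toy parameters
x = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]             # zero baseline, as in the text
attributions = integrated_gradients(x, baseline, weights, bias, steps=200)
score_difference = model(x, weights, bias) - model(baseline, weights, bias)
```

A useful sanity check is the completeness property of Integrated Gradients: the attributions should sum approximately to the difference between the prediction at the input and at the baseline, which is `score_difference` here.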

Figure 2: Average attribution score per position. (a) Donor. (b) Acceptor.
Figure 3: Average attribution score per position for positive and negative sequences. (a) Donor. (b) Acceptor.
Figure 4: Average attribution score per position for each nucleotide. (a) Donor. (b) Acceptor.

For visualization, we take a window of 40 nt upstream and downstream of both the acceptor and the donor site. Figures 2(a) and 2(b) show the attribution score averaged over all data for the donor and acceptor, respectively. The nucleotides closest to the sites clearly influenced the model most, which agrees with existing knowledge. The same analysis for positive and negative example sequences is shown in Figures 3(a) and 3(b). We also analyze the attribution score separately for each nucleotide A, T, G, and C; the comparison for donor and acceptor sites is shown in Figures 4(a) and 4(b). Here, too, the observed result agrees with existing knowledge. For example, the nucleotide G is very important for the donor site. Finally, we generate a sequence logo from the attribution scores to identify important motifs or splicing signals. Figures 5(a) and 5(b) show the sequence logos for the donor and acceptor around the most relevant region given by the attribution score. The donor results show a strong GT signal, and the acceptor results show a strong AG signal. This agrees with the known consensus motifs.
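The per-position averaging behind such plots can be sketched as follows. This Python example is illustrative only: how the windows were aligned in the original experiments is not specified, so the sketch assumes each sequence comes with the index of its splice site and skips offsets that fall outside the sequence. The function name and toy scores are hypothetical.

```python
def average_attribution_per_position(attributions, site_indices, flank=40):
    """Average attribution at each offset in [-flank, +flank] around a splice site.

    attributions: one list of per-nucleotide scores per sequence.
    site_indices: index of the splice site within each sequence.
    Returns a dict mapping offset to mean score over the sequences covering it.
    """
    offsets = range(-flank, flank + 1)
    sums = {offset: 0.0 for offset in offsets}
    counts = {offset: 0 for offset in offsets}
    for scores, site in zip(attributions, site_indices):
        for offset in offsets:
            position = site + offset
            if 0 <= position < len(scores):
                sums[offset] += scores[position]
                counts[offset] += 1
    return {o: sums[o] / counts[o] for o in offsets if counts[o] > 0}


# Toy attribution scores for two sequences, both with the site at index 2.
toy_attributions = [
    [0.0, 1.0, 4.0, 1.0, 0.0],
    [0.0, 3.0, 6.0, 3.0, 0.0],
]
profile = average_attribution_per_position(toy_attributions, [2, 2], flank=2)
```

In the toy profile the mean score peaks at offset 0 and decays with distance from the site, which is the qualitative shape described for the donor and acceptor plots.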

Figure 5: Sequence logo to visualize important motifs attributed by the model. (a) Donor. (b) Acceptor.

6 Conclusion

In this work, we presented an unsupervised approach to learning representations of DNA sequences in a latent space, using a sequence-to-sequence autoencoder-like framework. We exploit this autoencoder model in two ways: first, the learned weight parameters are used to initialize a classifier with a similar architecture, and second, the latent representation is used as the input feature vector for three different classifiers: SVM, ANN, and vanilla RNN. The results indicate that the use of pre-trained weight parameters helps in faster convergence with improved accuracy. Furthermore, our analysis shows that the learned latent embeddings are good features, as three different classifiers gave similar performance using them as input. Finally, the attribution analysis shows that, for the splice site classification task, the model picks out significant regions of the input DNA sequence that agree with existing knowledge.


  • Alipanahi et al. (2015) Alipanahi, B., Delong, A., Weirauch, M. T., and Frey, B. J. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nature Biotechnology, 33:831, Jul 2015.
  • Barekatain et al. (2017) Barekatain, M., Gagneur, J., Cheng, J., and Avsec, Ž. Modeling positional effects of regulatory sequences with spline transformations increases prediction accuracy of deep neural networks. Bioinformatics, 34(8):1261–1269, 11 2017. ISSN 1367-4803. doi: 10.1093/bioinformatics/btx727. URL https://doi.org/10.1093/bioinformatics/btx727.
  • Dutta et al. (2018a) Dutta, A., Dalmia, A., Athul, R., Singh, K. K., and Anand, A. Inference of splicing motifs through visualization of recurrent networks. bioRxiv, 2018a. doi: 10.1101/451906. URL https://www.biorxiv.org/content/early/2018/10/25/451906.
  • Dutta et al. (2018b) Dutta, A., Dubey, T., Singh, K. K., and Anand, A. Splicevec: Distributed feature representations for splice junction prediction. Computational Biology and Chemistry, 74:434 – 441, 2018b. ISSN 1476-9271.
  • Harrow et al. (2012) Harrow, J., Frankish, A., Gonzalez, J. M., Tapanari, E., Diekhans, M., Kokocinski, F., Aken, B. L., Barrell, D., Zadissa, A., Searle, S., Barnes, I., Bignell, A., Boychenko, V., Hunt, T., Kay, M., Mukherjee, G., Rajan, J., Despacio-Reyes, G., Saunders, G., Steward, C., Harte, R., Lin, M., Howald, C., Tanzer, A., Derrien, T., Chrast, J., Walters, N., Balasubramanian, S., Pei, B., Tress, M., Rodriguez, J. M., Ezkurdia, I., van Baren, J., Brent, M., Haussler, D., Kellis, M., Valencia, A., Reymond, A., Gerstein, M., Guigó, R., and Hubbard, T. J. GENCODE: the reference human genome annotation for the ENCODE project. Genome Res, 22(9):1760–1774, Sep 2012. ISSN 1549-5469. doi: 10.1101/gr.135350.111. URL https://www.ncbi.nlm.nih.gov/pubmed/22955987.
  • Kim et al. (2018) Kim, M., De Neve, W., Zuallaert, J., Godin, F., Soete, A., and Saeys, Y. SpliceRover: interpretable convolutional neural networks for improved splice site prediction. Bioinformatics, 34(24):4180–4188, 06 2018. ISSN 1367-4803. doi: 10.1093/bioinformatics/bty497. URL https://doi.org/10.1093/bioinformatics/bty497.
  • Lanchantin et al. (2016) Lanchantin, J., Singh, R., Lin, Z., and Qi, Y. Deep motif: Visualizing genomic sequence classifications. CoRR, abs/1605.01133, 2016. URL http://arxiv.org/abs/1605.01133.
  • Lanchantin et al. (2017) Lanchantin, J., Singh, R., and Qi, Y. Memory matching networks for genomic sequence classification. CoRR, abs/1702.06760, 2017. URL http://arxiv.org/abs/1702.06760.
  • Lee & Yoon (2015) Lee, T. and Yoon, S. Boosted categorical restricted boltzmann machine for computational prediction of splice junctions. In Bach, F. and Blei, D. (eds.), Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pp. 2483–2492, Lille, France, 07–09 Jul 2015. PMLR.
  • Srivastava et al. (2015) Srivastava, N., Mansimov, E., and Salakhutdinov, R. Unsupervised learning of video representations using LSTMs. In Proceedings of the 32nd International Conference on Machine Learning - Volume 37, ICML’15, pp. 843–852. JMLR.org, 2015.
  • Sundararajan et al. (2017) Sundararajan, M., Taly, A., and Yan, Q. Axiomatic attribution for deep networks. In Precup, D. and Teh, Y. W. (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 3319–3328, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
  • Sutskever et al. (2014) Sutskever, I., Vinyals, O., and Le, Q. V. Sequence to sequence learning with neural networks. In Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N. D., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems 27, pp. 3104–3112. Curran Associates, Inc., 2014.
  • Zhang et al. (2016) Zhang, Y., Liu, X., MacLeod, J. N., and Liu, J. Deepsplice: Deep classification of novel splice junctions revealed by rna-seq. In 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 330–333, Dec 2016. doi: 10.1109/BIBM.2016.7822541.