Unaligned Sequence Similarity Search Using Deep Learning

09/16/2019 ∙ by James K. Senter, et al. ∙ The University of Tennessee, Knoxville 0

Gene annotation has traditionally required direct comparison of DNA sequences between an unknown gene and a database of known ones using string comparison methods. However, these methods do not provide useful information when a gene does not have a close match in the database. In addition, each comparison can be costly when the database is large since it requires alignments and a series of string comparisons. In this work we propose a novel approach: using recurrent neural networks to embed DNA or amino-acid sequences in a low-dimensional space in which distances correlate with functional similarity. This embedding space overcomes both shortcomings of the method of aligning sequences and comparing homology. First, it allows us to obtain information about genes which do not have exact matches by measuring their similarity to other ones in the database. If our database is labeled this can provide labels for a query gene as is done in traditional methods. However, even if the database is unlabeled it allows us to find clusters and infer some characteristics of the gene population. In addition, each comparison is much faster than traditional methods since the distance metric is reduced to the Euclidean distance, and thus efficient approximate nearest neighbor algorithms can be used to find the best match. We present results showing the advantage of our algorithm. More specifically we show how our embedding can be useful for both classification tasks when our labels are known, and clustering tasks where our sequences belong to classes which have not been seen before.



There are no comments yet.


page 1

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

I Introduction

The central dogma of biology states that all organisms contain DNA, which is transcribed into RNA and then translated into proteins, which catalyze the chemical reactions that define life. DNA sequences that encode for specific proteins are known as genes. Thus, understanding the function of DNA sequences that encode genes is a fundamental task across all fields of biology.

In order to interpret sequence data, it is usually necessary to annotate sequences identified as genes. This is commonly done by aligning unknown sequences to ones of known function using algorithms such as Basic Local Alignment Search Tool (BLAST) [1] and comparing them based on the fraction of identical nucleotides (or amino acids, after in silico translation). Sequence identity comparisons are typically very accurate when sequence identity is high, which is one of the reasons these methods are so common in comparative genomics [27]. However, these methods do have downsides which prevent them from being useful under certain conditions.

First, when using these methods there are still many genes which cannot be annotated. For instance, in the Tara Oceans data set [24], an average of 50% and up to 80% of bacterioplankton genes lacked sufficient homology to genes in databases of known function to be confidently annotated [36]. This should be no surprise: gene function is overwhelmingly studied in genes that derive from microbes that grow in culture, whereas the vast majority of microbes on Earth belong to uncultured taxa [35, 14].

Second, even when using annotated data-sets, annotations are often vague. For instance, the 100 most common annotations in the RefSeq database of high-quality genomes [20] include such vague annotations as ‘acyl carrier protein’, ‘porin’, and ‘peptidase’. Any of these broad categories contain an enormous diversity of gene sequences and tertiary protein structures.

In addition, even high-quality annotations lose important information, because many types of important sequence information, such as relative amino acid content or factors that affect temperature optima of gene products, are discarded during the annotation process. These factors may be an important part of differences in ecosystem function, but they would not show up in ecosystem analyses based on annotations.

Finally, BLAST searches can be slow, especially when we wish to compare our sequence to multiple others and not just find the best match. This is mainly due to the need for alignment and string comparison.

Here we embed DNA or amino acid sequences in a low-dimensional space with a neural network combining convolutional and recurrent layers. The embedded data can be searched much more rapidly thanks to dimensionality reduction. Furthermore, the convolutional layers of the network allow recognition of important features while remaining robust to sequencing error such as insertions or deletions, and the LSTM layers capture long-term correlations in sequences that may be biologically important but difficult to identify in sequence alignments. Finally, this approach can identify sequences that are similar in some biological respect but which may not have any measurable sequence alignment, an ability which may be useful in sequence analysis tasks other than sequence annotation, such as identifying properties of gene products like temperature optima or enzyme lifetime.

Fig. 1:

In BLAST (left), a database of proteins must be searched for those with similar seeds (short strings) to the unknown protein. In our approach (right), the known proteins and the unknown protein are all transformed by a neural network into a low-dimensional vector representation where proteins can be grouped by similarity.

Our paper is structured as follows. In Sec. II

we discuss the previous research on which our work is based. This includes both work in bioinformatics regarding DNA sequence interpretation, and recent machine learning research on sequence classification and embedding space learning. In Sec.

III we motivate and describe the network architecture we use, and the training method which allows us to learn a useful embedding space. In Sec. IV we present results on multiple different experiments, and analyze some of the parameters in order to choose the ideal ones. Finally, we conclude in Sec. V.

Ii Related Work

Ii-a DNA Sequence Interpretation

BLAST and its variants (e.g. gapped BLAST and PSI-BLAST [2], BLAST+ [5]) have traditionally been used to identify regions of sequence homology between a query sequence and database of reference sequences. The algorithm includes 3 main steps. (1) A list is compiled of important seeds (short strings of nucleotides or amino acids) appearing in the query sequences. (2) The reference database is scanned to find locations of of the same seeds, aided by an index of seed locations. (3) Matches between query seeds and reference seeds are extended to determine whether the areas neighboring the seeds match as well as the seeds. For each query sequence, the reference sequences with the best matches are returned.

Several improvements have been made to the original BLAST. USEARCH [7] reduces search time by returning only a few high-quality matches rather than considering all possible matches. DIAMOND [4] constructs a double index to traverse query and reference seeds more quickly. GPU-BLAST [39], HPC-BLAST [29], and H-BLAST [41] parallelize the database search on high-performance systems.

BLAST and its improvements all have the same search limitation: They search for very close matches using certain confidence levels, and do not provide distances to the entire set (or subset). Comparing to all sequences would be very expensive since it would require multiple alignments and string comparisons. Our goal is to allow faster comparisons by providing a significantly smaller numeric representation of each protein in the database by preprocessing it with our neural network. Euclidean distance on short vectors is a faster similarity metric than string comparison on long sequences. Also, clustering-based algorithms such as fast nearest neighbors [19] or approximate nearest neighbor search algorithms such as Neighborhood Graph and Tree [12] allow for nearest neighbor search of the database in sub-linear time. Figure 1 illustrates how our approach differs from BLAST.

As a practical matter, Hidden Markov Models (HMMs) are often part of gene annotation strategy,

[6]. For instance, the popular annotation package PROKKA [34] uses a hierarchical strategy, beginning with BLAST+ searches of increasingly expansive databases and ending with HMM searches of protein family databases, e.g. [10]. However, these have their own limitations as well. Mainly, since the DNA sequence is assumed to be Markovian, it means that the state transitions depend only on the current state and not anything in the past. This is most likely not physically true for genetic data, and although we can alter the HMM’s to consider previous states as well, we must define exactly how many previous states will be included.

Our approach is to embed sequences in a lower-dimensional space, and to use those vectors for comparison. This has been done in a few previous works. One technique to embedding a DNA sequence is [40], which uses a word2vec model to encode short sequences. DSPACE [33] is more similar to our approach, training a neural network to embed amino acid sequences for multiple supervised tasks. Unlike these models, we introduce LSTM layers which allows dealing with sequences of different lengths in addition to capturing dependencies which are distant in the sequence (as opposed to convolutional layers which only capture local patterns) . Our network structure is inspired by DanQ [25], which uses 1-dimensional convolutional layers followed by LSTM layers. The convolutional layers recognize short-term patterns, while the LSTM layers recognize long-term patterns, making the combination stronger than either half alone. We adapt this structure by adding bidirectionality and the possibility to deal with varying length sequences to produce embeddings and show that this network is more accurate than DSPACE, which only includes convolutional and dense layers.

Ii-B Deep Learning on Sequences

Machine learning using sequence data is not a new problem and has been studied extensively for many years, particularly in the natural language processing community. One of the early methods used for this type of data was the HMM

[3] which is a statistical model in which the process is Markovian with observable (the signal) and hidden (the prediction) events. Machine learning can be used to find the HMM parameters and dynamic programming (for example the Viterbi algorithm [38]) can then be used to find the maximum likelihood predictions. These methods have been used for many different types of sequence data such as speech recognition [26] and gene finding [15].

More recently, with the advent of deep learning algorithms, a new set of machine learning algorithms has been developed which is not limited by the same constraints and therefore is able to achieve much better results. More specifically, recurrent neural networks (RNNs) have been able to achieve excellent prediction results on sequences since they can utilize high dimensional hidden states which can remember an unlimited amount of past information [37]

. This allows the network to discover much longer temporal dependencies as compared to HMM’s. One type of RNN, the long short-term memory network (LSTM)

[11], has been especially successful in achieving state of the art results on a variety of tasks as it is able to better learn long term dependencies [8]. Finally, bidirectional RNN’s [32] have been used for prediction in both the forward and backward directions of the sequence.

In this work we adopt these methods to work on DNA sequences. More specifically, we use bidirectional LSTM’s on top of convolutional layers to predict the protein class of an unknown gene.

Ii-C Learning Embedding Spaces

The process of training a neural network on one task and using an intermediate layer of that network to create an embedding space for different tasks has been applied in different domains. For example, Word2vec [17] learns a vector representation of words by training a network to predict future words from past words, and the resulting embedding places words with similar meaning closer together.

VGG-Face [21]

creates an embedding of faces, for use in face recognition: learning the distinguishing features of each face regardless of position, lighting, etc. As not all identities are known at training time it is not possible to build a simple face classifier, and therefore the task is to determine if two faces are of the same person. This is accomplished by learning an embedding in which faces of the same person are close to each other, while faces of different people are far in the embedding space. They use two different types of training: either using triplet loss to train the embedding directly, or training a classifier and removing the final layer to get the embedding. In our work we adopt the latter.

The concept is the same across domains: a high-dimensional input space is converted to a low-dimensional space containing the most important features of each element for a specific task. The distance between two elements in this low dimensional space is a measure of their similarity, useful for many tasks beyond the original training task. For example Word2vec vectors can be used for sentiment analysis

[42], while VGG-Face embeddings can be used to search for lookalikes [28].

In this work we use deep neural networks to learn an embedding space for DNA sequences. Similar to VGG-Face [21] we assume that we don’t know all protein classes ahead of time and therefore cannot rely on a classifier. In addition, as has been shown, we expect these embeddings to be useful for other tasks as well. Although we use a similar training setup to VGG-Face, our network architecture is different than theirs as we are dealing with sequence data and therefore use RNN’s.

Iii Methods

We frame the original problem we are trying to solve in the following way. Given a query DNA sequence , we wish to label it with a protein label (i.e. N-acetyltransferase) . In addition, we have a database of other DNA sequences , each with its own label . We wish to compare to all sequences in in order to find a match, and transfer the label. For example, if the best match to our query sequence is , we can simply give it the label . This matching process can be done using algorithms such as FASTA [22] and BLAST [1] which provide a matching score between two sequences.

However, as these algorithms require alignments and string comparisons they can only return a few best matches. In addition, these algorithms have no direct way to measure the importance of certain subsequences. Therefore, in this work we propose a different way to compare the sequences. We first learn an embedding method . The function embeds a varying length sequence into a -dimensional Euclidean space. Once we have learned such an embedding, the similarity between two sequences (regardless of their length) can be found simply by calculating their Euclidean distance .

The function’s parameters can be learned using our labeled database. The goal of learning would be to create a space where genes with similar function are close to each other as compared to ones with different function. That is:


Where and are both labeled with and is not labeled with . Learning a space in which Eq. 1 is true gives us the advantage of being able to do more than simply label the sequence using our known database. For example, given distances to multiple other sequences in we can infer something about the functionality of the query sequence. In addition, given a group of unknown genes, we can use the embedding space to cluster them, thus finding out how many types of genes there are and how they relate to each other.

Similarly to other works in computer vision we can do this by first training a deep neural network classifier on a large number of classes using our labeled database

. We can view the output of the network’s penultimate layer as the embedding vector . The final dense layer can be viewed as a linear classifier over the embedding layer, and therefore we expect genes of similar function to be close in the embedding space.

Therefore, although we use the classification layer for training and to test our classification task, it is removed when comparing sequences to one another. By simply calculating the Euclidean distance of the output vectors from the embedding layer we can measure the similarity between sequences.

Iii-a Network Architecture

Fig. 2: Network architecture, including sizes for each layer and output. All layers are presented except for the non-linear activations. Notice that the top layer (Dense) is only used for training and during the classification task. For embedding we remove this layer and directly use the 256 length vector produced by the LSTM. Details are provided in Sec. III-A.

Our network is detailed in Figure 2. The input to the network is a

matrix. The columns represent a one-hot encoding of the 4 nucleotides

and the rows represent the maximum length of a sequence. For example, a value of at position means that the first nucleotide in the sequence is and positions will be zero. If a sequence is shorter than 4500, all the extra columns are set to

. Masking is applied so that the zero padding does not affect the final result.

The first layer is a 1D convolution (stride 3, kernel size 3, 26 filters) to represent encoding of every 3 nucleotides into amino acids. Another 1D convolutional layer (stride 1, kernel size 26, 320 filters) followed by a max pooling layer (stride 13, kernel size 13) reduces the size of the sequence and represents short patterns of amino acids. The sequence then passes through a bidirectional LSTM layer (output size 640), retaining an output for each step of the sequence. This layer captures long-term trends in the sequence regardless of direction. Next, a forward LSTM layer, retaining only the output from the last step, collects a summary of the sequence. The output from the final LSTM layer is the embedding: a 256 length vector.

For training through classification, the embedding layer is followed by a dense layer with output size equal to the number of classes. Training is performed by minimizing the cross entropy loss using batch gradient decent. Given that the final dense layer is a simple linear classifier, the embedding should learn features that are useful for classification without itself being tied to specific classes. The convolutional layers have ReLU activation, the LSTM layers have tanh activation, and the final class layer has softmax activation.

For comparison we also trained a DSPACE model, using the same architecture as in the source code of [33], with an extra stride 3 convolutional layer after the input to account for using DNA sequences instead of amino acid sequences. The inputs to this sequence are the same: length-4500 sequences with extra space padded with zeros, but masking was not possible as this model requires inputs of a fixed length. This network contains several 1D convolutional layers followed by several dense layers, culminating in an embedding layer. After the embedding layer, we replaced their output layers with a dense layer for class prediction.

We plan on releasing all of our code and models as part of the publication of this paper. In addition we will release the data-sets we used to train and test the models to ensure reproducibility.

Iii-B Classification Training

We used protein sequences from the RefSeq database [20], v83, filtered to contain only bacteria and archaea. Our first training set consisted of approximately 40 million sequences from the 30 most common classes. As an example we show the top 10 class names in Table I. The training set and the test set had the same proportion of each class. We then collected separate datasets containing the most common 100 and 1000 classes. For better training efficiency, we did not use all sequences from these classes, keeping only about 16 million sequences per dataset with equal representation for each class. Each training dataset had a corresponding test set of approximately 1 million sequences, also with equal numbers of examples from each class.

ABC transporter ATP-binding protein MFS transporter
LysR family transcriptional regulator transcriptional regulator
ABC transporter permease membrane protein
DNA-binding response regulator N-acetyltransferase
TetR/AcrR family transcriptional regulator alpha/beta hydrolase
TABLE I: The top 10 protein classes in our subset of the RefSeq dataset [20]. These provide an example of the type of classes we are using for classification.

We trained our model on each of the 30, 100, and 1000 class datasets, plus a DSPACE model on the 100 class dataset. The models were trained to minimize categorical cross entropy loss using an Adam (for LSTM) or Nadam (for DSPACE) optimizer with learning rate 0.001. Each network was trained on 200,000 random batches of 100 sequences. Training a model on a Quadro P5000 GPU took approximately two days.

Iii-C Embedding Analysis

To test the quality of each embedding on unseen classes, we arbitrarily chose 1000/10,000 classes not used for training. From each class we chose a random pair of sequences, and treated the first of each pair as a query , while the second sequences of all pairs were the database . If the embedding accurately reflects the biological function of the sequence, two sequences belonging to the same class should be closer in the embedding space than two from different classes (Eq. 1).

We therefore found the Euclidean distance from each query sequence to the entire database, producing a ranking of the most similar genes from the database. Then, we determined how many queries had the correct answer (the sequence from the same class) within the top N closest database sequences, where N = 1, 10, 20, or 50. The fraction of gene pairs placed close together by the embedding (with several definitions of closeness) is similar to the information retrieval measure ”recall-at-n” and is a way to quantify the embedding, or how well the model can group proteins from classes never seen before.

Iv Results

Iv-a Classification Results

We first present the results of using our neural network architecture for classification; that is, both the training set and the test set contain DNA sequences with the same protein labels. The number of classes and division between training and test set are described in Sec. III-B.

Table II shows the accuracy on the test set for a different number of classes. In addition, we compare our results with the DSPACE model [33]. As expected, as the number of classes increases our results slightly decrease since there is more chance for error. However, even when the number of classes is multiplied by 10 (from 100 to 1000) the accuracy only drops by a few percent showing that the network scales well. Our model clearly outperforms the DSPACE baseline and shows that the network is able to label DNA sequences from classes it has been trained on.

Model Test Accuracy
30 class LSTM .968
100 class LSTM .914
1000 class LSTM .896
100 class DSPACE .832
TABLE II: Accuracy on test data when training and testing our network with different amounts of classes. We also compare to other previously published network architecture DSPACE [33].

Fig. 3:

The confusion matrix for our 30 class model. The colors scale from dark blue (100%) to white (0%). Notice how all classes are nearly classified perfectly except for one class due to our ground truth labels. More information is given in Sec.



Fig. 3 shows the confusion matrix based on our 30 class model. It adds another way to interpret the results, and emphasizes a flaw in the data and labels we are using. Although our total classification rate for the 30 class model is 96.8%, when looking at the confusion matrix it is clear that most classes achieve a very high precision, with only one class achieving a 32% accuracy (class 18), which is mostly classified as class 7. However, the labels of these classes are ”N-acetyltransferase” (7) and ”GNAT family N-acetyltransferase”(18), which are essentially the same protein but are separated into two different classes in the RefSeq dataset. This ‘error’ suggests that the network has correctly learned that two differently named classes refer to the same broad set of sequences.

Iv-B DNA Sequence Embedding Results

Next we present results to show that our embedding space is meaningful even for DNA sequences which are not from the classes which appeared in the training data. We remove the final classification layer from the network and use the testing strategy discussed in Sec. III-C.

Tables III and IV

show results of the embedding analysis with pairs of sequences from 1000 and 10,000 classes, respectively. As expected the probability of having the closest sequence be of the same class as the query one is lower than our classification results. However, it is important to look at the entire table to realize the advantages our method provides. For example, when training on 1000 classes the most similar sequence is of the same class 53% of the time (Table

III). When comparing this to a random chance of 0.1% this is an impressive result especially given that these classes have never been seen by the network and there is only one matching sequence in our database.

In addition, if we do not only focus on the closest sequence, but instead look at the top sequences, we see that although the correct match is not always ranked the highest, it is usually ranked high. For example, when examining the top 50 classes out of 10,000 (0.5% of the results) the correct result is there 70% of the time. This result could be extremely useful since if needed we can then perform a more traditional sequence comparison (for example BLAST) on a much smaller subset of our database and still find the correct match.

A few other observations can be made when analyzing the results. First, as the number of classes used for training increases (30, 100, 1000) so does the accuracy of the embeddings. This is reasonable since the network can generalize better when ”seeing” more classes during training. In addition, when we increase our database size by 10 fold from 1000 to 10,000 (comparing table III to IV), the accuracy does not drop by much showing that our method is relatively robust to the size of the database.

Table V repeats the embedding experiment comparing different embedding layer sizes on the 100 class LSTM model. Embedding quality increased from 128 to 256, but decreased from 256 to 512, suggesting that if the embedding becomes too large, overfitting on the training data reduces the ability to embed unknown classes.

Fig. 4: t-SNE visualization of embedded sequences from 10 protein classes using our 100-class model, each in a different color.

In order to visualize our embedding space, we use t-SNE from Scikit-learn [23], to transform 100 length-256 embeddings from each of 10 classes into a 2-dimensional space, where close vectors in the 256-dimensional space are also close in the 2-dimensional space. The embeddings were generated by the 100-class model using sequences from the test data, from 10 classes that were also used in training. The result in Figure 4 reveals that our embedding can in fact place similar sequences close together.

Model N=1 N=10 N=20 N=50
30 class LSTM .252 .423 .488 .591
100 class LSTM .350 .598 .680 .793
1000 class LSTM .530 .715 .744 .790
100 class DSPACE .231 .360 .416 .502
TABLE III: Fraction of query sequences with a correct match in the top N matches for each model, out of 1000 class pairs.
Model N=1 N=10 N=20 N=50
30 class LSTM .190 .272 .301 .364
100 class LSTM .253 .378 .434 .526
1000 class LSTM .389 .592 .641 .701
100 class DSPACE .202 .273 .299 .346
TABLE IV: Fraction of query sequences with a correct match in the top N matches for each model, out of 10000 class pairs.
Embedding Size N=1 N=10 N=20 N=50
128 .338 .591 .674 .767
256 .350 .598 .680 .793
512 .326 .508 .585 .671
TABLE V: Fraction of query sequences with a correct match in the top N matches for each model, out of 1000 class pairs. Comparison of different embedding sizes using the 100-class LSTM model.

Iv-C Length Analysis

Fig. 5:

The probability distributions of sequence length (left) and distance from mean sequence length (right) for correct and incorrect examples (each adds up to 1). The fact that the percentage of correct vs. incorrect classifications stays relatively constant, shows that our network is not relying heavily on the sequence length.

One of our concerns was that the classifier was making decisions simply based on the length of each sequence rather than its content. Therefore, we compared the lengths of sequences correctly classified by the 100-class LSTM model to the lengths of incorrectly classified sequences in the test set with 1 million examples. The results in Figure 5 indicate that while correct sequences are more likely to be close to the mean, for a given difference in length the accuracy does not change much. Therefore, length is not an important factor in the classifier’s decision.

Iv-D Noise Analysis

Fig. 6: The accuracy of our classification model under different noise sources as a function of the amount of noise. See Sec. IV-D for more details.

To test how robust our model is to noise and mutations in a DNA sequence, we performed alterations on each sequence in the 100-class test set and examined how they effect test accuracy. We tried two noise sources:

  • A simple probabilistic model of read noise from [18]. The basic idea is that we go through the sequence and change each base independently with a given probability .

  • We also look at how the network behaves when part of the sequence is missing. We randomly select a starting base from our sequence and then remove percent of the total sequence length.

Figure 6 shows how accuracy decreases as a function of for both noise sources. As expected, large mutations harm accuracy, but overall the model is able to handle small mutations without significant error.

One thing to note is that when removing parts of the sequence we chose a starting base at a location which is divisible by 3. Since our model depends on converting groups of 3 nucleotides to an amino acid, accuracy for deletions decreased significantly if the deleted segment was not aligned by groups of 3. For example, removing a single nucleotide from the sequence will cause all amino acid encodings after that nucleotide to be changed.

A simple way around this issue in the future is to use amino acid sequences instead of nucleotide sequences. Another possibility is to change the stride on the first layer of our network from 3 to 1, so that every possible group of 3 nucleotides is considered. When we tried training with stride 1, test accuracy decreased slightly and training took significantly longer, so although this approach is costly it is possibly helpful if this type of noise is a concern.

V Conclusion

In this work we presented a novel method for classifying and embedding DNA sequences. Using this method in conjunction with fast nearest neighbor algorithms we can find best matches to a query sequence even if it is from a previously unseen class. More importantly, using this embedding space provides not only a simple best match, but also distances to other sequences, thus providing functional information even for sequences which do not have an exact match in our database. We measure the robustness of our method both to dealing with unseen classes and to dealing with different noise sources.

We expect this type of work to be useful for many applications besides simple gene annotation. For example, we have begun working on using these embeddings to characterize microbial biogeographic provinces in the ocean. The basic idea is to embed ocean sequence data collected from different regions and examine if certain regions are more similar in our embedding space than others. This problem would be hard to solve using traditional comparative genomics methods as many of the genes are unknown and it would be very expensive to do a comprehensive BLAST comparison. Our initial results show that these embeddings lead to plausible groupings of regions.

Although we show promising results in this paper, more work needs to be done to understand the full potential from our method. For example, as described in Sec. IV-A, given the dataset used, accuracy is not expected to be perfect because of the ground truth classes given. The class labels used for training the classifier are not mutually exclusive and therefore confusion between such classes is expected due to the overlap. This is a difficult biological problem, because experimental determination of gene function is labor intensive [16] and virtually all gene databases are polluted by annotation errors [9, 13, 30]. More accurate results will require non-overlapping, error-free databases.

In addition, 1000 random classes (the largest number of classes we used for training) might not be representative enough of the entire database. More work can be done both in examining the effect of using larger amounts of classes for training and how to select a diverse set of classes to better represent the entire genome.

Finally, we are currently further examining the parameter space and architecture changes which will lead to better embeddings. For example, instead of training using a classification layer we are experimenting with using triplet-loss [31] to train the embedding layer directly. This has led to improved embeddings for face recognition and therefore we expect to see an improvement in our embeddings as well.

However, even with these limitations, the results presented in this work reveal that our embedding is indeed capable of placing sequences with similar function close together, even when that class is not seen in training. Beyond simply speeding up the search for similar proteins, an embedding could allow prediction of protein function and properties, and lead to new ways of using sequence data in biology research.

Vi Acknowledgements

This material is based upon work supported by (1) the University of Tennessee, Knoxville College of Arts and Sciences, (2) Tickle College of Engineering, and (3) the Joint Institute for Computational Sciences. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the University of Tennessee or Intel Corporation.

©2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.


  • [1] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman (1990) Basic local alignment search tool. Journal of molecular biology 215 (3), pp. 403–410. Cited by: §I, §III.
  • [2] S. F. Altschul, T. L. Madden, A. A. Schäffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman (1997-09) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research 25 (17), pp. 3389–3402. External Links: ISSN 0305-1048, Document, Link, http://oup.prod.sis.lan/nar/article-pdf/25/17/3389/3639509/25-17-3389.pdf Cited by: §II-A.
  • [3] L. E. Baum and J. A. Eagon (1967)

    An inequality with applications to statistical estimation for probabilistic functions of markov processes and to a model for ecology

    Bulletin of the American Mathematical Society 73 (3), pp. 360–363. Cited by: §II-B.
  • [4] B. Buchfink, C. Xie, and D. H. Huson (2015) Fast and sensitive protein alignment using diamond. Nature methods 12 (1), pp. 59. Cited by: §II-A.
  • [5] C. Camacho, G. Coulouris, V. Avagyan, N. Ma, J. Papadopoulos, K. Bealer, and T. L. Madden (2009) BLAST+: architecture and applications. BMC bioinformatics 10 (1), pp. 421. Cited by: §II-A.
  • [6] S. R. Eddy (2011) Accelerated profile hmm searches. PLoS computational biology 7 (10), pp. e1002195. Cited by: §II-A.
  • [7] R. C. Edgar (2010-08) Search and clustering orders of magnitude faster than BLAST. Bioinformatics 26 (19), pp. 2460–2461. External Links: ISSN 1367-4803, Document, Link, http://oup.prod.sis.lan/bioinformatics/article-pdf/26/19/2460/16896486/btq461.pdf Cited by: §II-A.
  • [8] A. Graves (2013) Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850. Cited by: §II-B.
  • [9] M. Green and P. Karp (2005) Genome annotation errors in pathway databases due to semantic ambiguity in partial ec numbers. Nucleic acids research 33 (13), pp. 4035–4039. Cited by: §V.
  • [10] D. H. Haft, J. D. Selengut, R. A. Richter, D. Harkins, M. K. Basu, and E. Beck (2012) TIGRFAMs and genome properties in 2013. Nucleic acids research 41 (D1), pp. D387–D395. Cited by: §II-A.
  • [11] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §II-B.
  • [12] M. Iwasaki and D. Miyazaki (2018)

    Optimization of indexing based on k-nearest neighbor graph for proximity search in high-dimensional data

    arXiv preprint arXiv:1810.07355. Cited by: §II-A.
  • [13] C. E. Jones, A. L. Brown, and U. Baumann (2007) Estimating the annotation error rate of curated go database sequence annotations. BMC bioinformatics 8 (1), pp. 170. Cited by: §V.
  • [14] K. G. Lloyd, A. D. Steen, J. Ladau, J. Yin, and L. Crosby (2018) Phylogenetically novel uncultured microbial cells dominate earth microbiomes. MSystems 3 (5), pp. e00055–18. Cited by: §I.
  • [15] A. V. Lukashin and M. Borodovsky (1998) GeneMark. hmm: new solutions for gene finding. Nucleic acids research 26 (4), pp. 1107–1115. Cited by: §II-B.
  • [16] K. Michalska, A. D. Steen, G. Chhor, M. Endres, A. T. Webber, J. Bird, K. G. Lloyd, and A. Joachimiak (2015) New aminopeptidase from “microbial dark matter” archaeon. The FASEB Journal 29 (9), pp. 4071–4079. Cited by: §V.
  • [17] T. Mikolov, K. Chen, G. Corrado, and J. Dean (2013) Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. Cited by: §II-C.
  • [18] A. Motahari, K. Ramchandran, D. Tse, and N. Ma (2013-07) Optimal dna shotgun sequencing: noisy reads are as good as noiseless reads. In 2013 IEEE International Symposium on Information Theory, Vol. , pp. 1640–1644. External Links: Document, ISSN 2157-8117 Cited by: 1st item.
  • [19] M. Muja and D. G. Lowe (2009) Fast approximate nearest neighbors with automatic algorithm configuration.. VISAPP (1) 2 (331-340), pp. 2. Cited by: §II-A.
  • [20] N. A. O’Leary, M. W. Wright, J. R. Brister, S. Ciufo, D. Haddad, R. McVeigh, B. Rajput, B. Robbertse, B. Smith-White, D. Ako-Adjei, A. Astashyn, A. Badretdin, Y. Bao, O. Blinkova, V. Brover, V. Chetvernin, J. Choi, E. Cox, O. D. Ermolaeva, C. M. Farrell, T. Goldfarb, T. Gupta, D. H. Haft, E. Hatcher, W. Hlavina, V. S. Joardar, V. K. Kodali, W. Li, D. R. Maglott, P. Masterson, K. M. McGarvey, M. R. Murphy, K. O’Neill, S. Pujar, S. H. Rangwala, D. Rausch, L. D. Riddick, C. L. Schoch, A. Shkeda, S. S. Storz, H. Sun, F. Thibaud-Nissen, I. Tolstoy, R. E. Tully, A. R. Vatsan, C. Wallin, D. Webb, W. Wu, M. J. Landrum, A. Kimchi, T. A. Tatusova, M. DiCuccio, P. A. Kitts, T. D. Murphy, and K. D. Pruitt (2016) Reference sequence (refseq) database at ncbi: current status, taxonomic expansion, and functional annotation.. Nucleic Acids Research 44 (Database-Issue), pp. 733–745. Cited by: §I, §III-B, TABLE I.
  • [21] O. M. Parkhi, A. Vedaldi, A. Zisserman, et al. (2015) Deep face recognition.. In bmvc, Vol. 1, pp. 6. Cited by: §II-C, §II-C.
  • [22] W. Pearson (2003) Finding protein and nucleotide similarities with fasta. Current protocols in bioinformatics 4 (1), pp. 3–9. Cited by: §III.
  • [23] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay (2011) Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830. Cited by: §IV-B.
  • [24] S. Pesant, F. Not, M. Picheral, S. Kandels-Lewis, N. Le Bescot, G. Gorsky, D. Iudicone, E. Karsenti, S. Speich, R. Troublé, et al. (2015) Open science resources for the discovery and analysis of tara oceans data. Scientific data 2, pp. 150023. Cited by: §I.
  • [25] D. Quang and X. Xie (2016) DanQ: a hybrid convolutional and recurrent deep neural network for quantifying the function of dna sequences. Nucleic acids research 44 (11), pp. e107–e107. Cited by: §II-A.
  • [26] L. R. Rabiner (1989) A tutorial on hidden markov models and selected applications in speech recognition. Proceedings of the IEEE 77 (2), pp. 257–286. Cited by: §II-B.
  • [27] B. Rost (2002) Enzyme function less conserved than anticipated. Journal of molecular biology 318 (2), pp. 595–608. Cited by: §I.
  • [28] A. Sadovnik, W. Gharbi, T. Vu, and A. Gallagher (2018) Finding your lookalike: measuring face similarity rather than face identity. In

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops

    pp. 2345–2353. Cited by: §II-C.
  • [29] S. E. Sawyer, B. Rekepalli, M. D. Horton, and R. G. Brook (2015) HPC-blast: distributed blast for xeon phi clusters. In Proceedings of the 6th ACM Conference on Bioinformatics, Computational Biology and Health Informatics, pp. 512–513. Cited by: §II-A.
  • [30] A. M. Schnoes, S. D. Brown, I. Dodevski, and P. C. Babbitt (2009) Annotation error in public databases: misannotation of molecular function in enzyme superfamilies. PLoS computational biology 5 (12), pp. e1000605. Cited by: §V.
  • [31] F. Schroff, D. Kalenichenko, and J. Philbin (2015) Facenet: a unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 815–823. Cited by: §V.
  • [32] M. Schuster and K. K. Paliwal (1997) Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing 45 (11), pp. 2673–2681. Cited by: §II-B.
  • [33] A. S. Schwartz, G. J. Hannum, Z. R. Dwiel, M. E. Smoot, A. R. Grant, J. M. Knight, S. A. Becker, J. R. Eads, M. C. LaFave, H. Eavani, et al. (2018) Deep semantic protein representation for annotation, discovery, and engineering. BioRxiv, pp. 365965. Cited by: §II-A, §III-A, §IV-A, TABLE II.
  • [34] T. Seemann (2014) Prokka: rapid prokaryotic genome annotation. Bioinformatics 30 (14), pp. 2068–2069. Cited by: §II-A.
  • [35] A. D. Steen, A. Crits-Christoph, P. Carini, K. M. DeAngelis, N. Fierer, K. G. Lloyd, and J. C. Thrash (2019) High proportions of bacteria and archaea across most biomes remain uncultured. The ISME journal, pp. 1–5. Cited by: §I.
  • [36] S. Sunagawa, L. P. Coelho, S. Chaffron, J. R. Kultima, K. Labadie, G. Salazar, B. Djahanschiri, G. Zeller, D. R. Mende, A. Alberti, et al. (2015) Structure and function of the global ocean microbiome. Science 348 (6237), pp. 1261359. Cited by: §I.
  • [37] I. Sutskever, J. Martens, and G. E. Hinton (2011) Generating text with recurrent neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 1017–1024. Cited by: §II-B.
  • [38] A. Viterbi (1967) Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE transactions on Information Theory 13 (2), pp. 260–269. Cited by: §II-B.
  • [39] P. D. Vouzis and N. V. Sahinidis (2010) GPU-blast: using graphics processors to accelerate protein sequence alignment. Bioinformatics 27 (2), pp. 182–188. Cited by: §II-A.
  • [40] S. Woloszynek, Z. Zhao, J. Chen, and G. L. Rosen (2019) 16S rrna sequence embeddings: meaningful numeric feature representations of nucleotide sequences that are convenient for downstream analyses. PLoS computational biology 15 (2), pp. e1006721. Cited by: §II-A.
  • [41] W. Ye, Y. Chen, Y. Zhang, and Y. Xu (2017-01) H-BLAST: a fast protein sequence alignment toolkit on heterogeneous computers with GPUs. Bioinformatics 33 (8), pp. 1130–1138. External Links: Document, Link, http://oup.prod.sis.lan/bioinformatics/article-pdf/33/8/1130/25150344/btw769.pdf Cited by: §II-A.
  • [42] D. Zhang, H. Xu, Z. Su, and Y. Xu (2015) Chinese comments sentiment classification based on word2vec and svmperf. Expert Systems with Applications 42 (4), pp. 1857 – 1863. External Links: ISSN 0957-4174, Document, Link Cited by: §II-C.