1 Introduction
Over the course of evolution, biological sequences constantly mutate and a large part of biological research is based on the analysis of these mutations. Biologists have developed accurate statistical models to estimate the evolutionary distance between pairs of sequences based on their edit distance
: the minimum number of (weighted) insertions, deletions or substitutions required to transform a string into another string .However, the computation of this edit distance kernel with traditional methods is bound to a quadratic complexity and hardly parallelizable, making its computation a bottleneck in large scale analyses, such as microbiome studies sberro2019large ; pasolli2020large ; kurilshikov2021large . Furthermore, the accurate computation of similarities among multiple sequences, at the foundation of critical tasks such as hierarchical clustering and multiple sequence alignment, is computationally intractable even for relatively small numbers of sequences. Problems that in other spaces are relatively simple become combinatorially hard in the space of sequences defined by the edit distance. For example, finding the Steiner string, a classical problem in bioinformatics that can be thought of as computing the geometric median in the space of sequences, is NPcomplete.
Classical algorithms and heuristics needleman1970general ; kariin1995dinucleotide ; sokal1958statistical ; chenna2003multiple widely used in bioinformatics for these tasks are dataindependent and, therefore, cannot exploit the lowdimensional manifold assumption that characterises realworld data verma2020robust ; tillquist2019low ; klimovskaia2020poincare . Leveraging the available data to produce efficient and datadependent heuristics and representations would greatly accelerate largescale analyses that are critical to biological research.
While the number of available biological sequences has grown exponentially over the past decades, machine learning approaches to problems related to string matching zheng2019sense ; koide2018neural
have not been adopted widely in bioinformatics due to their limitation in accuracy and speed. In contrast to most tasks in computer vision and NLP, string matching problems are typically formalised as combinatorial optimisation problems. These discrete formulations do not fit well with the current deep learning approaches. Moreover, representation learning methods based on Euclidean spaces struggle to capture the hierarchical structure that characterises realworld biological datasets due to evolution. Finally, common selfsupervised learning approaches, very successful in NLP, are less effective in the biological context where relations tend to be between sequences rather than between bases
mcdermott2021rethinking .In this work, we present Neural Distance Embeddings (NeuroSEED), a general framework to produce representations for biological sequences where the distance in the embedding space is correlated with the evolutionary distance between sequences. NeuroSEED provides fast approximations of the distance kernel , lowdimensional representations for biological sequences, tractable analysis of the relationship between multiple sequences in the embedding geometry and a way to decode novel sequences.
Firstly, we reformulate several existing approaches into NeuroSEED highlighting their contributions and limitations. Then, we examine the task of embedding sequences to preserve the edit distance that is the basis of the framework. This analysis reveals the importance of datadependent approaches and of using a geometry that matches the underlying data distribution well. The hyperbolic space is able to capture the implicit hierarchical structure given by biological evolution and provides an average 22% reduction in embedding RMSE against the best competing geometry.
We show the potential of the framework and its wide applicability by analysing two fundamental tasks in bioinformatics involving the relations between multiple sequences: hierarchical clustering and multiple sequence alignment. For both tasks, unsupervised approaches using NeuroSEED encoders are able to match the accuracy of common heuristics while being orders of magnitude faster. For hierarchical clustering, we also explore a method based on the continuous relaxation of Dasgupta’s cost in the hyperbolic space which provides a 15x runtime reduction at similar quality levels. Finally, for multiple sequence alignment
, we devise an original approach based on variational autoencoders that matches the performance of competitive baselines while significantly reducing the runtime complexity.
As a summary our contributions are: (i) We introduce NeuroSEED, a general framework to map sequences in geometric vector spaces, and reformulate existing approaches into it. (ii) We show how the hyperbolic space can bring significant improvements to the datadependent analysis of biological sequences. (iii) We propose several heuristic approaches to classical bioinformatics problems that can be constructed on top of NeuroSEED embeddings and provide significant running time reduction against classical baselines.
2 Bioinformatics tasks
The field of bioinformatics has developed a wide range of algorithms to tackle the classical problems that we explore. Here we present the tasks and briefly mention their motivation and some of the baselines we test. More details are provided in Appendix B.
Edit distance approximation
In this work, we always deal with the classical edit distance where the same weight is given to every string operation, but all the approaches developed can be applied to any distance function of choice (which is given as an oracle). For example, when using amino acid sequences, one of the different metric variants of the classical substitution matrices such as mPAM250 xu2004metric would be a good choice. As baseline approximation heuristics, we take kmer kariin1995dinucleotide , which is the most commonly used alignmentfree method and represents sequences by the frequency vector of subsequences of a certain length, and FFP sims2009alignment , another alignmentfree method which looks at the JensenShannon divergence between distributions of kmers.
Hierarchical clustering (HC)
Discovering the intrinsic hierarchical structure given by evolutionary history is a critical step of many biological analyses. Hierarchical clustering (HC) consists of, given a pairwise distance function, defining a tree with internal points corresponding to clusters and leaves to datapoints. Dasgupta’s cost dasgupta2016cost measures how well the tree generated respects the similarities between datapoints. As baselines we consider classical agglomerative clustering algorithms (Single florek1951liaison , Complete sorensen1948method and Average Linkage sokal1958statistical ) and the recent technique chami2020trees that uses a continuous relaxation of Dasgupta’s cost in the hyperbolic space.
Multiple sequence alignment (MSA)
Aligning three or more sequences is used for the identification of active and binding sites as well as conserved protein structures, but finding its optimal solution is NPcomplete. A related task to MSA is the approximation of the Steiner string which minimises the sum of the distances (consensus error) to the sequences in a set.
Datasets
To evaluate the heuristics we chose three datasets containing different portions of the 16S rRNA gene, crucial in microbiome analysis clemente2015microbiome , one of the most promising applications of our approach. The first, Qiita clemente2015microbiome , contains more than 6M sequences of up to 152 bp that cover the V4 hypervariable region. The second, RT988 zheng2019sense , has only 6.7k publicly available sequences of length up to 465 bp covering the V3V4 regions. Both datasets were generated by Illumina MiSeq ravi2018miseq and contain sequences of approximately the same length. Qiita was collected from skin, saliva and faeces samples, while RT988 from oral plaques. The third dataset is the Greengenes fulllength 16S rRNA database mcdonald2012improved , which contains more than 1M sequences of length between 1,111 to 2,368. Moreover, we used a dataset of synthetically generated sequences to test the importance of datadependent approaches. A full description of the data splits for each of the tasks is provided in Appendix B.4.
3 Neural Distance Embeddings
The underlying idea behind the NeuroSEED framework, represented in Figure 1, is to map sequences in a continuous space so that the distance between embedded points is correlated to the one between sequences. Given a distribution of sequences and a distance function between them, any NeuroSEED
approach is formed by four main components: an embedding geometry, an encoder model, a decoder model, and a loss function.
Embedding geometry
The distance function between the embedded points defines the geometry of the embedding space. While this factor has been mostly ignored by previous work zheng2019sense ; chen2020predicting ; zhang2019neural ; dai2020convolutional ; gomez2017lsde , we show that it is critical for this geometry to reflect the relations between the sequences in the domain. In our experiments, we provide a comparison between Euclidean, Manhattan, cosine, squared Euclidean (referred to as Square) and hyperbolic distances (details in Appendix D).
Encoder model
The encoder model maps sequences to points in the embedding space. In this work we test a variety of models as encoder functions: linear layer, MLP, CNN, GRU cho2014learning and transformer vaswani2017attention with local and global attention. The details on how the models are adapted to the sequences are provided in Appendix C. Chen et al. chen2020predicting proposed CSM, an encoder architecture based on the convolution of subsequences. However, as also noted by Koide et al. koide2018neural , this model does not perform well when various layers are stacked and, due to the interdependence of cells in the dynamic programming routine, it cannot be efficiently parallelised on GPU.
Decoder model
For some tasks it is also useful to decode sequences from the embedding space. This idea, employed in Section 7.2 and novel among the works related to NeuroSEED, enables to apply the framework to a wider set of problems.
Loss function
The simplest way to train a NeuroSEED encoder is to directly minimise the MSE between the sequences’ distance and its approximation as the distance between the embeddings:
(1) 
where is a constant or learnable scalar. Depending on the application that the learned embeddings are used for, the MSE loss may be combined or substituted with other loss functions such as the triplet loss for closest string retrieval (Appendix F), the relaxation of Dasgupta’s cost for hierarchical clustering (Section 7.1) or the sequence reconstruction loss for multiple sequence alignment (Section 7.2).
There are at least five previous works zheng2019sense ; chen2020predicting ; zhang2019neural ; dai2020convolutional ; gomez2017lsde that have used approaches that can be described using the NeuroSEED framework. These methods, summarised in Table 1, show the potential of approaches based on the idea of NeuroSEED, but share two significant limitations. The first is the lack of analysis of the geometry of the embedding space, which we show to be critical. The second is that the range of tasks is limited to edit distance approximation (EDA) and closest string retrieval (CSR). We highlight how this framework has the flexibility to be adapted to significantly more complex tasks involving relations between multiple sequences such as hierarchical clustering and multiple sequence alignment.
Method  Geometry  Encoder  Decoder  Loss  Tasks 

Zheng et al. zheng2019sense  Jaccard  CNN  ✗  MSE  EDA 
Chen et al. chen2020predicting  Cosine  CSM  ✗  MSE  EDA 
Zhang et al. zhang2019neural  Euclidean  GRU  ✗  MAE + triplet  EDA & CSR 
Dai et al. dai2020convolutional  Euclidean  CNN  ✗  MAE + triplet  EDA & CSR 
Gomez et al. gomez2017lsde  Square  CNN  ✗  MSE  EDA & CSR 
Section 5  Hyperbolic  CNN & transformer  ✗  MSE  EDA 
Section 6  Hyperbolic  CNN & transformer  ✗  MSE  HC & MSA 
Section 7.1  Hyperbolic  Linear  ✗  Relaxed Dasgupta  HC 
Section 7.2  Cosine  Linear  ✓  MSE + reconstr.  MSA 
Appendix F  Hyperbolic  CNN & transformer  ✗  MSE & triplet  CSR 
4 Related work
Hyperbolic embeddings
Hyperbolic geometry is a nonEuclidean geometry with constant negative sectional curvature and is often referred to as a continuous version of a tree for its ability to embed trees with arbitrarily low distortion. The advantages of mapping objects with implicit or explicit hierarchical structure in the hyperbolic space have also been shown in other domains nickel2017poincare ; chamberlain2017neural ; chami2020low ; klimovskaia2020poincare . In comparison, this work deals with a very different space defined by the edit distance in biological sequences and, unlike most of the previous works, we do not only derive embeddings for a particular set of datapoints, but train an encoder to map arbitrary sequences from the domain in the space.
Sequence Distance Embeddings
The clear advantage of working in more computationally tractable spaces has motivated significant research in Sequence Distance Embeddings cormode2003sequence (also known as lowdistortion embeddings) approaches to variants of the edit distance muthukrishnan2000approximate ; cormode2001permutation ; ostrovsky2007low ; batu2006oblivious ; jowhari2012efficient ; andoni2012approximating ; chakraborty2016streaming . However, they are all dataindependent and have shown weak performance on the ‘unconstrained’ edit distance.
Hashing and metric learning
NeuroSEED also fits well into the wider research on learning to hash wang2017survey , where the goal is typically to map a high dimensional vector space into a smaller one preserving distances. Another field related to NeuroSEED is metric learning kulis2012metric ; bellet2013survey , where models are trained to learn embeddings from examples of similar and dissimilar pairs.
Localitysensitive hashing
Finally, the work presented is complementary to the line of research in localitysensitive hashing methods. Researchers have developed a series of heuristics especially for the tasks of sequence clustering li2006cd ; steinegger2018clustering and local alignment altschul1990basic . These use as subroutines embeddings/features based on alignmentbased or alignmentfree methods, and therefore, fall into the limitations we highlight in the paper. Future work could analyse how to improve these heuristics with NeuroSEED embeddings.
5 Edit distance approximation
In this section we test^{1}^{1}1
Code, datasets and tuned hyperparameters can be found at
https://github.com/gcorso/NeuroSEED. the performance of different encoder models and distance functions to preserve an approximation of the edit distance in the NeuroSEED framework trained with the MSE loss. To make the results more interpretable and comparable across datasets, we report results using % RMSE:(2) 
where is the maximum sequence length. This can be interpreted as an approximate average error in the distance prediction as a percentage of the size of the sequences.
Datadependent vs dataindependent methods
Figures 2 and 3 show that, across the datasets and the distance functions, the datadependent models learn significantly better representations than dataindependent baselines. The main reason for this is their ability to focus on and dedicate the embedding space to a manifold of much lower dimensionality than the complete string space. This observation is further supported by the results in Appendix E, where the same models are trained on synthetic random sequences and the dataindependent baselines are able to better generalise.
Our analysis also confirms the results from Zheng et al. zheng2019sense and Dai et al. dai2020convolutional which showed that convolutional models outperform feedforward and recurrent models. We also show that transformers, even when with local attention, produce, in many cases, better representations. Attention could provide significant advantages when considering more complex definitions of distance that include, for example, inversions dobzhansky1938inversions , duplications and transpositions pevzner2000computational .
Hyperbolic space
The most interesting and novel results come from the analysis of the geometry of the embedding space. In these biological datasets, there is an implicit hierarchical structure that is well reflected by the hyperbolic space. Thanks to this close correspondence even relatively simple models perform very well with the hyperbolic distance. In convolutional models, the hyperbolic space provides a 22% average RMSE reduction against the best competing geometry for each model.
The benefit of using the hyperbolic space is clear when analysing the dimension required (Figure 4). The hyperbolic space provides significantly more efficient embeddings, with the model reaching the ‘elbow’ at dimension 32 and matching the performance of the other spaces with dimension 128 with only 4 to 16. Given that the space to store the embeddings and the time to compute distances between them scale linearly with the dimension, this provides a significant improvement in downstream tasks over other NeuroSEED approaches.
Running time
The computational complexity of approximating the pairwise distance matrix^{2}^{2}2Computing the pairwise distance matrix of a set of sequences is a critical step of many algorithms including the ones analysed in the next section. for sequences of length is reduced from to assuming constant embedding size and a model linear with respect to the sequence length. This translates into a very significant runtime improvement even when comparing only 5000 sequences, as can be seen from the last column in Figure 2. Moreover, while the training takes longer than the baselines due to the more complex optimisation, in applications such as microbiome analysis, biologists typically analyse data coming from the same distribution (e.g. the 16S rRNA gene) for multiple individuals, therefore the initial cost would be significantly amortised.
6 Unsupervised heuristics
In this section, we show how competitive heuristics for hierarchical clustering and multiple sequence alignment can be built on the lowdistortion embeddings produced by the models trained in the previous section.
Hierarchical clustering
Agglomerative clustering, the most commonly used class of HC algorithms, can be accelerated when run directly on NeuroSEED embeddings produced by the pretrained model. This reduces the complexity to generate the pairwise distance matrix from to and allows to accelerate the clustering itself using geometric optimisations like localitysensitive hashing.
We test models with no further tuning from Section 5 on a dataset of 10k unseen sequences from the Qiita dataset. The results (Figure 5) show that there is no statistical difference in the quality of the hierarchical clustering produced with ground truth distances compared to that with convolutional or attention hyperbolic NeuroSEED embeddings. Instead, the difference in Dasgupta’s cost between different architectures and geometries is statistically significant and it results in a large performance gap when these trees are used for tasks such as MSA shown below. The total CPU time taken to construct the tree is reduced from more than 30 minutes to less than one in this dataset and the difference gets significantly larger when scaling to datasets of more and longer sequences.
Multiple sequence alignment
Clustal, the most popular MSA heuristic, is formed by a phylogenetic tree estimation phase that produces a guide tree then used by a progressive alignment phase to compute the complete alignment. However, the first of the two phases, based on hierarchical clustering, is typically the bottleneck of the algorithm. On 1200 RT988 sequences (used below), the construction of the guide tree takes 35 minutes compared to 24s for the progressive alignment. Therefore, it can be significantly accelerated using NeuroSEED heuristics to generate matrix and guide tree. In these experiments, we construct the tree running the Neighbour Joining algorithm (NJ) saitou1987neighbor on the NeuroSEED embeddings and then pass it on the Clustal commandline routine that performs the alignment and returns an alignment score.
Again, the results reported in Figure 6 show that the alignment scores obtained when using the NeuroSEED
heuristics with attention models are not statistically different from those obtained with the ground truth distances. Most of the models also show a relatively large variance in performance across different runs. This has positive and negative consequences: the alignment obtained using a single run may not be very accurate, but, by training an ensemble of models and applying each of them, we are likely to obtain a significantly better alignment than the baseline while still only taking a fraction of the time.
7 Supervised heuristics
In this section we propose and evaluate two methods to adapt NeuroSEED to the tasks of hierarchical clustering and multiple sequence alignment with tailored loss functions.
7.1 Relaxed approach to hierarchical clustering
An alternative approach to hierarchical clustering uses the continuous relaxation chami2020trees of Dasgupta’s cost dasgupta2016cost as loss function to embed sequences in the hyperbolic space (for more details see Appendix B.2 and chami2020trees ). In comparison to Chami et al. chami2020trees , we show that it is possible to significantly decrease the number of pairwise distances required by directly mapping the sequences into the space. This allows to considerably accelerate the construction, especially when dealing with a large number of sequences without requiring any pretrained model. Figure 1 shows an example of the relaxed approach when applied to a small dataset of proteins.
The results, plotted in Figure 7, show that a simple linear layer mapping sequences to the hyperbolic space is capable of obtaining with only pairwise distances very similar results to those from agglomerative clustering ( distances) and hyperbolic embedding baselines ( distances). In the RT988 dataset this corresponds to, respectively, 6700x and 82x fewer labels and leads to a reduction of the total running time from several hours (>2.5h on CPU for agglomerative clustering, 14h on GPU for hyperbolic embeddings) to less than 10 minutes on CPU for the relaxed NeuroSEED approach (including label generation, training and tree inference) with no pretraining required. Finally, using more complex encoding architectures such as MLPs or CNNs does not provide significant improvements.
7.2 Steiner string approach to multiple sequence alignment
An alternative approach to multiple sequence alignment uses a decoder from the vector space to convert the Steiner string approximation problem (Appendix B.3) in a continuous optimisation task.
This method, summarised in Figure 8 and detailed in Appendix G, consists of training an autoencoder to map sequences to and from a continuous space preserving distances using only pairs of sequence at a time (where calculating the distance is feasible). This is achieved by combining in the loss function a component for the latent space edit distance approximation and one for the reconstruction accuracy of the decoder. The first is expressed as the MSE between the edit distance and the vector distance in the latent space. The second consists of the mean elementwise crossentropy loss of the decoder’s outputs with the real sequences. At test time the encoder embeds all the sequences in the set, the geometric median point (minimising the sum of distances in the embedding space) is found with a relatively simple optimisation procedure and then the decoder is used to find an approximation of the Steiner string. During training, Gaussian noise is added to the embedded point in the latent space forcing the decoder to be robust to points not directly produced by the encoder.
As baselines, we report the average consensus error (average distance to the strings in the set) obtained using: a random sequence in the set (random), the centre string of the set (centre) and two competitive greedy heuristics (greedy1 and greedy2) proposed respectively by kruzslicz1999improved and casacuberta1997greedy .
The results show that the models consistently outperform the centre string baseline and are close to the performance of the greedy heuristics suggesting that they are effectively decoding useful information from the embedding space. The computational complexity for strings of size is reduced from for the centre string and for the greedy baselines to for the proposed method. Future work could employ models that directly operate in the hyperbolic space peng2021hyperbolic to further improve the performance.
8 Conclusion
In this work, we proposed and explored Neural Distance Embeddings, a framework that exploits the recent advances in representation learning to embed biological sequences in geometric vector spaces. By studying the capacity to approximate the evolutionary edit distance between sequences, we showed the strong advantage provided by the hyperbolic space which reflects the biological hierarchical structure.
We then demonstrated the effectiveness and wide applicability of NeuroSEED on the problems of hierarchical clustering and multiple sequence alignment. For each task, we experimented with two different approaches: one unsupervised tying NeuroSEED embeddings into existing heuristics and a second based on direct exploitation of the geometry of the embedding space via a tailored loss function. In all cases, the proposed approach performed on par with or better than existing baselines while being significantly faster.
Finally, NeuroSEED provides representations that are well suited for human interaction as the embeddings produced can be visualised and easily interpreted. Towards this goal, the very compact representation of hyperbolic spaces is of critical importance klimovskaia2020poincare . This work also opens many opportunities for future research direction with different types of sequences, labels, architectures and tasks. We present and motivate these directions in Appendix A.
Acknowledgements
The authors thank Saro Passaro for his insightful comments and suggestions and Frank Stajano, Andrei Margeloiu, Hannes Stärk and Cristian Bodnar for their review of the manuscript.
References
 [1] Hila Sberro, Brayon J Fremin, Soumaya Zlitni, Fredrik Edfors, Nicholas Greenfield, Michael P Snyder, Georgios A Pavlopoulos, Nikos C Kyrpides, and Ami S Bhatt. Largescale analyses of human microbiomes reveal thousands of small, novel genes. Cell, 2019.
 [2] Edoardo Pasolli, Francesca De Filippis, Italia E Mauriello, Fabio Cumbo, Aaron M Walsh, John Leech, Paul D Cotter, Nicola Segata, and Danilo Ercolini. Largescale genomewide analysis links lactic acid bacteria from food with the gut microbiome. Nature communications, 2020.
 [3] Alexander Kurilshikov, Carolina MedinaGomez, Rodrigo Bacigalupe, Djawad Radjabzadeh, Jun Wang, Ayse Demirkan, Caroline I Le Roy, Juan Antonio Raygoza Garay, Casey T Finnicum, Xingrong Liu, et al. Largescale association analyses identify host factors influencing human gut microbiome composition. Nature Genetics, 2021.
 [4] Saul B Needleman and Christian D Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of molecular biology, 1970.
 [5] Samuel Kariin and Chris Burge. Dinucleotide relative abundance extremes: a genomic signature. Trends in genetics, 1995.
 [6] Robert R Sokal. A statistical method for evaluating systematic relationships. Univ. Kansas, Sci. Bull., 1958.
 [7] Ramu Chenna, Hideaki Sugawara, Tadashi Koike, Rodrigo Lopez, Toby J Gibson, Desmond G Higgins, and Julie D Thompson. Multiple sequence alignment with the Clustal series of programs. Nucleic acids research, 2003.
 [8] Archit Verma and Barbara E Engelhardt. A robust nonlinear lowdimensional manifold for single cell rnaseq data. BMC bioinformatics, 2020.
 [9] Richard C Tillquist. Lowdimensional representation of biological sequence data. In Proceedings of the 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, 2019.
 [10] Anna Klimovskaia, David LopezPaz, Léon Bottou, and Maximilian Nickel. Poincaré maps for analyzing complex hierarchies in singlecell data. Nature communications, 2020.

[11]
Wei Zheng, Le Yang, Robert J Genco, Jean WactawskiWende, Michael Buck, and
Yijun Sun.
SENSE: Siamese neural network for sequence embedding and alignmentfree comparison.
Bioinformatics, 2019.  [12] Satoshi Koide, Keisuke Kawano, and Takuro Kutsuna. Neural edit operations for biological sequences. In NeurIPS, 2018.
 [13] Matthew McDermott, Brendan Yap, Peter Szolovits, and Marinka Zitnik. Rethinking relational encoding in language model: Pretraining for general sequences. arXiv preprint arXiv:2103.10334, 2021.
 [14] UniProt Consortium. Uniprot: a hub for protein information. Nucleic acids research, 2015.
 [15] Weijia Xu and Daniel P Miranker. A metric model of amino acid substitution. Bioinformatics, 2004.
 [16] Gregory E Sims, SeRan Jun, Guohong A Wu, and SungHou Kim. Alignmentfree genome comparison with feature frequency profiles (FFP) and optimal resolutions. Proceedings of the National Academy of Sciences, 2009.

[17]
Sanjoy Dasgupta.
A cost function for similaritybased hierarchical clustering.
In
Proc. Annual ACM SIGACT Symposium on Theory of Computing
, 2016.  [18] Kazimierz Florek, Jan Łukaszewicz, Julian Perkal, Hugo Steinhaus, and Stefan Zubrzycki. Sur la liaison et la division des points d’un ensemble fini. In Colloquium mathematicum, 1951.
 [19] Thorvald Julius Sørensen. A method of establishing groups of equal amplitude in plant sociology based on similarity of species content and its application to analyses of the vegetation on Danish commons. 1948.
 [20] Ines Chami, Albert Gu, Vaggos Chatziafratis, and Christopher Ré. From trees to continuous embeddings and back: Hyperbolic hierarchical clustering. In NeurIPS, 2020.
 [21] Jose C Clemente, Erica C Pehrsson, Martin J Blaser, Kuldip Sandhu, Zhan Gao, Bin Wang, Magda Magris, Glida Hidalgo, Monica Contreras, Óscar NoyaAlarcón, et al. The microbiome of uncontacted amerindians. Science advances, 2015.
 [22] Rupesh Kanchi Ravi, Kendra Walton, and Mahdieh Khosroheidari. Miseq: a next generation sequencing platform for genomic analysis. Disease gene identification, 2018.
 [23] Daniel McDonald, Morgan N Price, Julia Goodrich, Eric P Nawrocki, Todd Z DeSantis, Alexander Probst, Gary L Andersen, Rob Knight, and Philip Hugenholtz. An improved greengenes taxonomy with explicit ranks for ecological and evolutionary analyses of bacteria and archaea. The ISME journal, 2012.
 [24] Jian Chen, Le Yang, Lu Li, and Yijun Sun. Predicting alignment distances via continuous sequence matching. bioRxiv, 2020.
 [25] Xiyuan Zhang, Yang Yuan, and Piotr Indyk. Neural embeddings for nearest neighbor search under edit distance. 2019.
 [26] Xinyan Dai, Xiao Yan, Kaiwen Zhou, Yuxuan Wang, Han Yang, and James Cheng. Convolutional embedding for edit distance. In Proc. of SIGIR, 2020.
 [27] Lluís Gómez, Marçal Rusinol, and Dimosthenis Karatzas. Lsde: Levenshtein space deep embedding for querybystring word spotting. In ICDAR, 2017.
 [28] Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proc. of EMNLP, 2014.
 [29] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
 [30] Maximilian Nickel and Douwe Kiela. Poincaré embeddings for learning hierarchical representations. arXiv preprint arXiv:1705.08039, 2017.
 [31] Benjamin Paul Chamberlain, James Clough, and Marc Peter Deisenroth. Neural embeddings of graphs in hyperbolic space. arXiv preprint arXiv:1705.10359, 2017.

[32]
Ines Chami, Adva Wolf, DaCheng Juan, Frederic Sala, Sujith Ravi, and
Christopher Ré.
Lowdimensional hyperbolic knowledge graph embeddings.
In Proc. of ACL, 2020.  [33] Graham Cormode. Sequence distance embeddings. PhD thesis, Department of Computer Science, University of Warwick, 2003.
 [34] S Muthukrishnan and Süleyman Cenk Sahinalp. Approximate nearest neighbors and sequence comparison with block operations. In Proc. annual ACM symposium on Theory of computing, 2000.
 [35] Graham Cormode, Shan Muthukrishnan, and Süleyman Cenk Sahinalp. Permutation editing and matching via embeddings. In International Colloquium on Automata, Languages, and Programming, 2001.
 [36] Rafail Ostrovsky and Yuval Rabani. Low distortion embeddings for edit distance. Journal of the ACM (JACM), 2007.
 [37] Tugkan Batu, Funda Ergun, and Cenk Sahinalp. Oblivious string embeddings and edit distance approximations. In SODA, 2006.
 [38] Hossein Jowhari. Efficient communication protocols for deciding edit distance. In European Symposium on Algorithms, 2012.
 [39] Alexandr Andoni and Krzysztof Onak. Approximating edit distance in nearlinear time. SIAM Journal on Computing, 2012.
 [40] Diptarka Chakraborty, Elazar Goldenberg, and Michal Koucký. Streaming algorithms for embedding and computing edit distance in the low distance regime. In Proc. of STOC, 2016.
 [41] Jingdong Wang, Ting Zhang, Nicu Sebe, Heng Tao Shen, et al. A survey on learning to hash. IEEE transactions on pattern analysis and machine intelligence, 2017.
 [42] Brian Kulis et al. Metric learning: A survey. Foundations and trends in machine learning, 2012.
 [43] Aurélien Bellet, Amaury Habrard, and Marc Sebban. A survey on metric learning for feature vectors and structured data. arXiv preprint arXiv:1306.6709, 2013.
 [44] Weizhong Li and Adam Godzik. Cdhit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics, 2006.
 [45] Martin Steinegger and Johannes Söding. Clustering huge protein sequence sets in linear time. Nature communications, 2018.
 [46] Stephen F Altschul, Warren Gish, Webb Miller, Eugene W Myers, and David J Lipman. Basic local alignment search tool. Journal of molecular biology, 1990.
 [47] Th Dobzhansky and Alfred H Sturtevant. Inversions in the chromosomes of Drosophila pseudoobscura. Genetics, 1938.
 [48] Pavel Pevzner. Computational molecular biology: an algorithmic approach. 2000.
 [49] Naruya Saitou and Masatoshi Nei. The neighborjoining method: a new method for reconstructing phylogenetic trees. Molecular biology and evolution, 1987.
 [50] Ferenc Kruzslicz. Improved greedy algorithm for computing approximate median strings. Acta Cybernetica, 1999.

[51]
Francisco Casacuberta and M De Antonio.
A greedy algorithm for computing approximate median strings.
In
Proc. of National Symposium on Pattern Recognition and Image Analysis
, 1997.  [52] Wei Peng, Tuomas Varanka, Abdelrahman Mostafa, Henglin Shi, and Guoying Zhao. Hyperbolic deep neural networks: A survey. arXiv preprint arXiv:2101.04562, 2021.
 [53] Albert Gu, Frederic Sala, Beliz Gunel, and Christopher Ré. Learning mixedcurvature representations in product spaces. In Proc. of ICLR, 2019.
 [54] Ondrej Skopek, OctavianEugen Ganea, and Gary Bécigneul. Mixedcurvature variational autoencoders. In Proc. of ICLR, 2020.
 [55] OctavianEugen Ganea, Gary Bécigneul, and Thomas Hofmann. Hyperbolic neural networks. In NeurIPS, 2018.

[56]
Ines Chami, Zhitao Ying, Christopher Ré, and Jure Leskovec.
Hyperbolic graph convolutional neural networks.
In NeurIPS, 2019.  [57] Michael M Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst. Geometric deep learning: going beyond Euclidean data. IEEE Signal Processing Magazine, 2017.
 [58] Michael M. Bronstein, Joan Bruna, Taco Cohen, and Petar Veličković. Geometric deep learning: Grids, groups, graphs, geodesics, and gauges, 2021.
 [59] Dan Gusfield. Algorithms on stings, trees, and sequences: Computer science and computational biology. Acm Sigact News, 1997.

[60]
Phillip Compeau and PA Pevzner.
Bioinformatics algorithms: an active learning approach
. 2018.  [61] William J Masek and Michael S Paterson. A faster algorithm computing string edit distances. Journal of Computer and System sciences, 1980.
 [62] Arturs Backurs and Piotr Indyk. Edit distance cannot be computed in strongly subquadratic time (unless SETH is false). In Proc. of STOC, 2015.
 [63] Igor Ulitsky, David Burstein, Tamir Tuller, and Benny Chor. The average common substring approach to phylogenomic reconstruction. Journal of Computational Biology, 2006.
 [64] ChrisAndre Leimeister and Burkhard Morgenstern. Kmacs: the kmismatch average common substring approach to alignmentfree sequence comparison. Bioinformatics, 2014.
 [65] Lusheng Wang and Tao Jiang. On the complexity of multiple sequence alignment. Journal of computational biology, 1994.
 [66] Robert C Edgar. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic acids research, 2004.
 [67] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research, 2014.
 [68] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proc. ICML, 2015.
 [69] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
 [70] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Proc. of ICLR, 2015.
 [71] Elad Hoffer and Nir Ailon. Deep metric learning using triplet network. In International workshop on similaritybased pattern recognition, 2015.
 [72] Vassileios Balntas, Edgar Riba, Daniel Ponsa, and Krystian Mikolajczyk. Learning local feature descriptors with triplets and shallow convolutional neural networks. In Proc. of BMVC, 2016.

[73]
Florian Schroff, Dmitry Kalenichenko, and James Philbin.
Facenet: A unified embedding for face recognition and clustering.
In CVPR, 2015.  [74] Diederik P. Kingma and Max Welling. Autoencoding variational bayes. In Proc. of ICLR, 2014.
 [75] Emile Mathieu, Charline Le Lan, Chris J Maddison, Ryota Tomioka, and Yee Whye Teh. Continuous hierarchical representations with Poincaré variational autoencoders. In NeurIPS, 2019.

[76]
Michael JD Powell.
A direct search optimization method that models the objective and constraint functions by linear interpolation.
In Advances in optimization and numerical analysis. 1994.  [77] Charles George Broyden. The convergence of a class of doublerank minimization algorithms 1. general considerations. IMA Journal of Applied Mathematics, 1970.
 [78] Roger Fletcher. A new approach to variable metric algorithms. The computer journal, 1970.
 [79] Donald Goldfarb. A family of variablemetric methods derived by variational means. Mathematics of computation, 1970.
 [80] David F Shanno. Conditioning of quasinewton methods for function minimization. Mathematics of computation, 1970.
 [81] Pauli Virtanen, Ralf Gommers, Travis E Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, et al. Scipy 1.0: fundamental algorithms for scientific computing in Python. Nature methods, 2020.
Appendix A Limitations and future work
We believe that the NeuroSEED framework has the potential to be applied to numerous problems and this work constitutes an initial analysis of its geometrical properties and applications. Here, we list some of the limitations of the current analysis and the potential directions of research to cover them.
Type of sequences
All realworld datasets analysed consist of sequence reads of the same part of the genome. This is a widespread setup for sequence analysis but not ubiquitous. Shotgun metagenomics consists of sequencing random parts of the genome. This would generate sequences lying on a lowdimensional manifold where the hierarchical relationship of evolution is combined with the relationship based on the specific position in the whole genome. Therefore, more complex geometries, such as product spaces gu2018learning ; skopek2019mixed , might be best suited.
Type of labels
In this project, we work with edit distances between sequences, these are too expensive for largescale analysis, but it is feasible to produce a large enough training set. For different definitions of distance, however, this might not be the case, future work could explore the robustness of this framework to inexact estimates of the distances as labels.
Architectures
Throughout the project, we used models that have been shown to work well for other types of sequences and tasks. However, the correct inductive biases that models should have to perform NeuroSEED might be different and even dependent on the type of distance they try to preserve. chen2020predicting ; koide2018neural provide some initial work in this direction with respect to the edit distance. Moreover, the capacity of the hyperbolic space could be further exploited using models that directly operate in the space peng2021hyperbolic ; ganea2018hyperbolic ; chami2019hyperbolic .
Selfsupervised embeddings
Finally, the direct use of the embeddings produced by NeuroSEED for downstream tasks would enable the application of a wide range of geometric data processing tools to the analysis of biological sequences.
Longterm impact
We believe the combination of NeuroSEED embeddings and geometric deep learning bronstein2017geometric ; bronstein2021geometric techniques could be beneficial to analyse and track the spectrum of mutations in a wide variety of biological and medical applications. This would have positive societal impacts in domains like microbiome analysis and managing epidemics. However, this could also have unethical applications in fields such as genome profiling.
Appendix B Bioinformatics tasks
The field of bioinformatics has developed a wide range of algorithms to tackle the classical problems that we explore. We describe here the methods that are most closely related to our work. For a more comprehensive overview, the interested reader is recommended Gusfield gusfield1997algorithms and Compeau et al. compeau2018bioinformatics .
b.1 Edit distance approximation
The task of finding the distance or similarity between two strings and the related task of global alignment lies at the foundation of bioinformatics.
Alignmentbased methods
Classical algorithms to find the edit distance, such as Needleman–Wunsch needleman1970general , are based on the process of finding an alignment between the two strings via dynamic programming. However, these are bound to a quadratic complexity w.r.t. the length of the input sequence, the best algorithm masek1980faster has a complexity and there is evidence that this cannot be improved backurs2015edit .
Alignmentfree methods
With the rapid improvement of sequencing technologies and the subsequent increase in demand for largescale sequence analyses, alternative computationally efficient sequence comparison methods have been developed under the category of alignmentfree methods.
kmer kariin1995dinucleotide is the most commonly used alignmentfree method and basis for many other algorithms (such as FFP sims2009alignment , ACS ulitsky2006average and kmacs leimeister2014kmacs ). It considers all the sequences of a fixed length k, kmers, and constructs a vector where each entry corresponds with the number of occurrences of a particular kmer in the sequence. The distance between the strings is then approximated by some type of distance between the vectors. Therefore, kmer generates vectors of size and estimates the edit distance as where is the only parameter of the model whose optimal value can be obtained with a single pass of the training set ^{3}^{3}3Except when using the hyperbolic space, in which case the radius of the hypersphere to which points are projected and are learned via gradient descent.:
(3) 
where . Therefore:
(4) 
b.2 Hierarchical clustering
Single, Complete and Average Linkage
The most common class of algorithms for hierarchical clustering, referred to as agglomerative methods, works in a bottomup manner recursively merging similar clusters. These differ by the heuristics used to choose clusters to merge and include Single florek1951liaison , Complete sorensen1948method and Average Linkage (or UPGMA) sokal1958statistical . They typically run in and require the whole distance matrix as input. Thus, with the edit distance, the total complexity is ).
Dasgupta’s cost
Dasgupta dasgupta2016cost proposed a global objective function that can be associated with the HC trees. Given a rooted binary tree , for two datapoints and let be their pairwise similarity, their lowest common ancestor in and the subtree rooted at . Dasgupta’s cost of given is then defined as:
(5) 
In this work is taken to be where is the normalised distance between sequences and .
b.3 Multiple sequence alignment
Multiple Sequence Alignment (MSA) consists of aligning three or more sequences and is regularly used for phylogenetic tree estimation, secondary structure prediction and critical residue identification. Finding the global optimum alignment of sequences is NPcomplete wang1994complexity , therefore many heuristics have been proposed.
Progressive alignment
The most commonly used programs such as the Clustal series chenna2003multiple and MUSCLE edgar2004muscle are based on a phylogenetic tree estimation phase from the pairwise distances which produces a guide tree, which is then used to guide a progressive alignment phase. To replicate the classical edit distance used, Clustal is run with a substitution matrix with all the entries 1 except 0 on the main diagonal and gap opening and extension penalties equal to 1.
Consensus error and Steiner string
It is hard to quantify the goodness of a particular multiple alignment and there is no single wellaccepted measure gusfield1997algorithms . One option is to find the sequence that minimises the consensus error to the set of strings : . The optimal string is known as Steiner string, while the centre string is the one in which minimises and has an upper bound gusfield1997algorithms . Algorithms to find an approximation of the Steiner string typically use greedy heuristics casacuberta1997greedy ; kruzslicz1999improved .
b.4 Datasets
As realworld datasets we used the Qiita, RT988 and Greengenes datasets of 16S rRNA subsequences. Experiments were also run on synthetic datasets formed by sequences randomly generated. In all datasets the splitting of sequences between train/val/test was random and duplicate sequences were discarded. Below we list the sizes of the datasets used for the results presented, these datasets can be downloaded from the public code repository.
Edit distance approximation
RT988 and Greengenes 5000/500/1200 sequences (train/val/test, 25M training pairwise distances), Qiita 7000/700/1500 sequences (49M distances), synthetic 70k/10k/20k sequences (3.5M distances).
Hierarchical clustering
the RT988 dataset is formed by 6.7k sequences to cluster while the Qiita one contains 10k sequences. The Qiita dataset used in the unsupervised approach is disjoint from the training set of the models.
Multiple sequence alignment
for the unsupervised approach the test set from the edit distance RT988 dataset was used, while the Steiner string approach was tested on the RT988 dataset using 4500/700 sequences for training/validation and 50 groups of 30 sequences for each of which the model computes an approximation of the Steiner string.
Appendix C Neural architectures
The framework of NeuroSEED is independent of the choice of architecture for the encoder. For each approach proposed in this project, we experiment with a series of models among the most commonly used in the literature for the analysis of sequences. In this section, we give some detail on how each model was adapted to the task at hand.
Linear & MLP
operate on the input sequence using the onehot encodings, padding to the maximum sequence length and flattening as a vector.
Cnn
is also applied to the padded sequence of onehot elements. They are conceptually similar to the kmer baseline with a few distinctions: CNNs can learn the kernels to apply, CNNs are equivariant not invariant to the translation of the patterns and, with multiple layers, CNNs can exploit hierarchical patterns in the data.
Gru
cho2014learning operates on the sequence of onehot sequence elements.
Transformer
vaswani2017attention every token is formed by 416 bases and is given a specific positional encoding using sinusoidal functions. We test both global attention where every token queries all the others and local where it only queries its 2 neighbours. Local attention allows the model to have a complexity linear w.r.t. the number of tokens.
All the models are integrated with various forms of regularisation including weight decay, dropout srivastava2014dropout , batch normalisation ioffe2015batch and layer normalisation ba2016layer and optimised using the Adam optimiser kingma2014adam . In the hyperbolic space, the embedded points are first projected on a hypersphere of learnable radius and then to the hyperbolic space.
Appendix D Distance functions
The key idea behind NeuroSEED is to map sequences into a vector space so that the distances in the sequence and the vector space are correlated. In this appendix, we present various definitions of distance in the vector space that we explored: L1 (referred as Manhattan), L2 (Euclidean), L2 squared (square), cosine and hyperbolic distances. For the hyperbolic space, we use the Poincaré ball model that embeds the points of the ndimensional Riemannian manifold in an ndimensional unit sphere where denotes the Euclidean norm. Given a pair of vectors and of dimension , the definitions for the distances are:
(6) 
(7) 
(8) 
(9) 
(10) 
Appendix E Distortion on synthetic datasets
We used a dataset of randomly generated sequences to test the importance of datadependent approaches and understand whether the improvements shown in Section 5 are brought by a better capacity of the neural models to model the edit distance mutation process or their ability to focus on the lowerdimensional manifold that the realworld data lies on.
In Figure 10, the picture that emerges from the results on the synthetic dataset is dramatically different from the one of realworld datasets and confirms the hypothesis that the advantage of neural models in realworld datasets is mainly due to their capacity to exploit the lowdimensional assumption. Instead, in the synthetic dataset, the best neural models perform only on par (taking into account the difference embedding space dimension) with the baselines. This is caused by two related challenges: the incredibly large space of sequences () that the model is trying to encode and the diversity between training and test sequences due to the random sampling. These make the task of learning a good encoding task too tough for currently feasible sizes of models and training data.
Appendix F Closest string retrieval
This task consists of finding the sequence that is closest to a given query among a large number of reference sequences and is very commonly used by biologists to classify newly sequenced genes.
Task formulation
Given a pretrained encoder , its closest string prediction is taken to be the string that minimises for each . This allows for sublinear retrieval via localitysensitive hashing or other data structures which is critical in realworld applications where databases can have billions of reference sequences. As performance measures, we report the top1, top5 and top10 percentage accuracies, where top indicates the percentage of times the closest string is ranked in the top predictions.
Triplet loss
The triplet loss hoffer2015deep ; balntas2016learning ; schroff2015facenet is widely used in the field of metric learning kulis2012metric ; bellet2013survey to learn embeddings that can be considered as a more direct form of supervision for this task. Given three examples with feature vectors a (anchor), p (positive) and n (negative) where the p is supposed to be closer to a than n, the triplet loss is typically defined as:
(11) 
where is the safety margin and a given distance function between vectors (typically Euclidean or cosine).
Results
Figure 11 shows that convolutional and attentionbased datadependent models significantly outperform the baselines even when these operate on larger dimensions. In terms of distance functions, the cosine distance achieves performances on par with the hyperbolic. An explanation is that for a set of points on the same hypersphere, the ones with the smallest cosine or hyperbolic distance are the same. The models trained with MSE of pairwise distances and the ones with triplet loss from Section 5 performed similarly except for the hyperbolic space where the triplet loss produces unstable training. The stabilisation of the triplet loss in the hyperbolic space and further comparisons between the two training frameworks are left to future work.
Appendix G Steiner string approach to MSA
In this section we explain more in details the Steiner string approach to multiple sequence alignment introduced in Section 7.2.
Training
For this approach, it is necessary to train not only an encoder model but also a decoder. The resulting autoencoder is trained with pairs of sequences (and their true edit distance) which are encoded into the latent vector space and then decoded. The loss combines an edit distance approximation component and a sequence reconstruction one. The first is expressed as the MSE between the real edit distance and the vector distance between the latent embeddings. The second is expressed as the mean elementwise crossentropy loss of the outputs with the real sequences. While this elementwise loss does not perfectly reflect the edit distance, it is an effective solution to the problem of lack of differentiability of the latter. Therefore, given two strings and of length and a vector distance , the loss of a model with encoder and decoder is:
(12) 
where is a hyperparameter that controls the tradeoff between the two components and represents the crossentropy.
One issue with this strategy is that the decoder is not learning to decode any point in the continuous space, but only those of the discrete subspace of points to which the generator maps some sequence from the domain. This creates a problem when, at test time, we try to decode points that are outside the subspace hoping to retrieve the string that maps to the point in the subspace closest to it. To alleviate this issue, during training, Gaussian noise is added to the embedded point in the latent space before decoding it, which forces the decoder to be robust to points not produced by the encoder. To make the noisy model trainable with gradient descent, we employ the reparameterization trick commonly used for Variational AutoEncoders kingma2013auto making the randomness an input to the model. Therefore, the reconstruction loss becomes:
(13) 
where and is a hyperparameter.
In the hyperbolic space adding the Euclidean Gaussian distribution would not distribute uniformly, therefore we Wrapped Normal generalisation of the Gaussian distribution to the Poincaré ball
mathieu2019continuous was used. Finally, for the cosine space, we normalise the outputs of the encoder and the input of the decoder to the unit hypersphere.Testing
At test time, given a set of strings, we want to obtain an approximation of the Steiner string, which minimises the consensus error (sum of distance to the strings in the set). In the sequence space with the edit distance finding the median point is a hard combinatorial optimisation problem. However in the space of real vectors with the distance functions used in this project, it becomes a relatively simple procedure which can be done explicitly in some cases (e.g. with square distance) or using classical optimisation algorithms^{4}^{4}4If the distance function is convex such as in the Euclidean case, the resulting optimisation problem is also convex.. Therefore, the Steiner string of a set of strings is approximated by:
(14) 
The continuous optimisation is performed using the COBYLA powell1994direct (for the hyperbolic distance) and BFGS broyden1970convergence ; fletcher1970new ; goldfarb1970family ; shanno1970conditioning (for all the others) algorithms implemented in the Python library SciPy virtanen2020scipy .
The produced predictions are then discretised to obtain actual sequences taking the most likely character for each element in the sequence and then evaluated by computing their average consensus error:
(15) 