Vector Embeddings by Sequence Similarity and Context for Improved Compression, Similarity Search, Clustering, Organization, and Manipulation of cDNA Libraries

08/08/2023
by Daniel H. Um, et al.
This paper demonstrates the utility of organized numerical representations of genes in research involving flat string gene formats (i.e., FASTA/FASTQ). FASTA/FASTQ files currently have several limitations, such as large file sizes, slow processing speeds for mapping and alignment, and contextual dependencies. These limitations significantly hinder investigations and tasks that involve finding similar sequences. The solution lies in transforming sequences into an alternative representation that is easier to cluster into similar groups than the raw sequences themselves. By assigning a unique vector embedding to each short sequence, it is possible to cluster cDNA libraries more efficiently and to improve compression performance for their string representations. Furthermore, by learning alternative coordinate vector embeddings based on the contexts of codon triplets, we can demonstrate clustering based on amino acid properties. Finally, by using this sequence embedding method to encode barcodes and cDNA sequences, we can improve the time complexity of similarity search by coupling vector embeddings with an algorithm that determines the proximity of vectors in Euclidean space; this allows us to perform sequence similarity searches in a quicker and more modular fashion.
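To make the idea of similarity search over vector embeddings concrete, here is a minimal sketch in Python. It is not the authors' method: the abstract describes learned embeddings, whereas this sketch substitutes a simple normalized k-mer count vector as a stand-in, and the names `kmer_embedding`, `nearest`, and the toy `library` are hypothetical. The search is a brute-force linear scan over Euclidean distances; any speedup in practice would presumably come from a spatial index, which is not shown here.

```python
from itertools import product
from math import dist

ALPHABET = "ACGT"


def kmer_embedding(seq, k=3):
    """Map a DNA string to a fixed-length vector of normalized k-mer counts.

    Assumption: a k-mer frequency vector stands in for the learned
    embedding described in the abstract.
    """
    index = {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=k))}
    vec = [0.0] * len(index)
    counted = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k].upper()
        if kmer in index:  # skip windows containing N or other symbols
            vec[index[kmer]] += 1.0
            counted += 1
    if counted:
        vec = [v / counted for v in vec]
    return vec


def nearest(query, library, k=3):
    """Return (name, distance) of the library entry closest to the query.

    Brute-force scan using Euclidean distance between embeddings.
    """
    q = kmer_embedding(query, k)
    scored = ((name, dist(q, kmer_embedding(s, k))) for name, s in library.items())
    return min(scored, key=lambda pair: pair[1])


# Toy library of made-up sequences (hypothetical data, for illustration only).
library = {
    "a": "ATGGCCATTGTAATGGGCCGC",
    "b": "TTTTTTTTTTAAAAAAAAAA",
    "c": "GGGCGCGCGCGGGCCCGCGC",
}

name, distance = nearest("ATGGCCATTGTAATGGGCCGC", library)
print(name, distance)
```

Because every sequence maps to a vector of the same length (64 entries for k = 3), comparisons no longer depend on sequence length or alignment, which is the property the abstract relies on for faster, more modular searches.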

research
11/03/2019

Attributed Sequence Embedding

Mining tasks over sequential data, such as clickstreams and gene sequenc...
research
01/31/2020

Convolutional Embedding for Edit Distance

Edit-distance-based string similarity search has many applications such ...
research
01/31/2020

Edit Distance Embedding using Convolutional Neural Networks

Edit-distance-based string similarity search has many applications such ...
research
09/16/2019

Unaligned Sequence Similarity Search Using Deep Learning

Gene annotation has traditionally required direct comparison of DNA sequ...
research
08/31/2020

Complex-valued embeddings of generic proximity data

Proximities are at the heart of almost all machine learning methods. If ...
research
06/24/2021

byteSteady: Fast Classification Using Byte-Level n-Gram Embeddings

This article introduces byteSteady – a fast model for classification usi...
research
09/30/2020

Distributed Many-to-Many Protein Sequence Alignment using Sparse Matrices

Identifying similar protein sequences is a core step in many computation...
