Spike2Vec: An Efficient and Scalable Embedding Approach for COVID-19 Spike Sequences

With the rapid global spread of COVID-19, more and more data related to the virus is becoming available, including genomic sequence data. The number of genomic sequences publicly available on platforms such as GISAID is currently several million and increases every day. The availability of such Big Data creates a new opportunity for researchers to study the virus in detail, which is particularly important given the dynamics of the COVID-19 variants that emerge and circulate. This rich data source can give us insights into the best ways to perform genomic surveillance for this and future pandemic threats, with the ultimate goal of mitigating or eliminating such threats. Analyzing and processing several million genomic sequences is a challenging task. Although traditional methods for sequence classification have proven effective, they are not designed to deal with these specific types of genomic sequences, and most existing methods also face scalability issues. Previous studies tailored to coronavirus genomic data proposed using spike sequences (a subsequence of the genome), rather than the complete genomic sequence, to perform machine learning (ML) tasks such as classification and clustering. However, those methods also suffer from scalability issues. In this paper, we propose Spike2Vec, an efficient and scalable feature vector representation for each spike sequence that can be used for downstream ML tasks. Through experiments, we show that Spike2Vec not only scales to several million spike sequences, but also outperforms the baseline models on predictive metrics such as accuracy and F1-score.


