GaKCo: a Fast GApped k-mer string Kernel using COunting

04/24/2017
by   Ritambhara Singh, et al.

String Kernel (SK) techniques, especially those using gapped k-mers as features (gk), have been highly successful at classifying sequences such as DNA, protein, and text. However, the state-of-the-art gk-SK becomes extremely slow when we increase the dictionary size (Σ) or allow more mismatches (M). This is because the current gk-SK uses a trie-based algorithm to calculate the co-occurrence of mismatched substrings, resulting in a time cost that grows as O(Σ^M). We propose a fast algorithm for calculating the Gapped k-mer Kernel using Counting (GaKCo). GaKCo uses associative arrays to calculate the co-occurrence of substrings through cumulative counting. The algorithm is fast, scales to larger Σ and M, and is naturally parallelizable. We provide a rigorous asymptotic analysis that compares GaKCo with the state-of-the-art gk-SK. Theoretically, the time cost of GaKCo is independent of the Σ^M term that slows down the trie-based approach. Experimentally, we observe that GaKCo achieves the same accuracy as the state-of-the-art and is faster by factors of 2, 100, and 4 on classifying sequences of DNA (5 datasets), protein (12 datasets), and character-based English text (2 datasets), respectively. GaKCo is shared as an open source tool at <https://github.com/QData/GaKCo-SVM>.
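
To make the counting idea concrete, the following is a minimal sketch of how associative arrays can count substring co-occurrences for one pair of sequences. It is not the authors' implementation: the function name `gapped_kmer_cooccurrence` and the parameters `g` (window length) and `k` (number of positions kept) are assumptions chosen for illustration, and the sketch returns only a raw cumulative co-occurrence count, without the mismatch-profile correction, normalization, or parallelization that the abstract alludes to.

```python
from collections import Counter
from itertools import combinations


def gapped_kmer_cooccurrence(s, t, g, k):
    """Count pairs of length-g windows (one from s, one from t) that agree
    on k chosen positions, summed over every choice of k positions.

    Illustrative sketch only; names and parameters are hypothetical.
    """
    total = 0
    # Each choice of k positions out of g defines one gapped pattern.
    for positions in combinations(range(g), k):
        # Associative array: gapped k-mer -> number of occurrences in s.
        counts_s = Counter(
            tuple(s[i + p] for p in positions) for i in range(len(s) - g + 1)
        )
        # Same for t.
        counts_t = Counter(
            tuple(t[i + p] for p in positions) for i in range(len(t) - g + 1)
        )
        # Co-occurrence is the dot product of the two count tables.
        total += sum(c * counts_t[key] for key, c in counts_s.items())
    return total


if __name__ == "__main__":
    print(gapped_kmer_cooccurrence("ACGT", "ACGA", g=3, k=2))
```

In this sketch the work depends on the number of position choices and on the sequence lengths, but never enumerates the alphabet, which is the property the abstract contrasts with the trie-based O(Σ^M) cost.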


