Efficient Approximation Algorithms for String Kernel Based Sequence Classification

12/12/2017
by Muhammad Farhan, et al.

Sequence classification algorithms, such as SVM, require a similarity (or distance) measure between pairs of sequences. A commonly used notion of similarity is the number of matches between the k-mers (length-k contiguous subsequences) of the two sequences. Extending this definition by considering two k-mers to match if their distance is at most m yields better classification performance, but it also makes the problem computationally much harder. Known algorithms for computing this similarity have computational complexity that renders them applicable only for small values of k and m. In this work, we develop novel techniques to efficiently and accurately estimate the pairwise similarity score, which enables us to use much larger values of k and m and obtain higher predictive accuracy. This opens up a broad avenue for applying this classification approach to audio, image, and text sequences. Our algorithm achieves excellent approximation performance with theoretical guarantees. In the process, we solve an open combinatorial problem that was posed as a major hindrance to the scalability of existing solutions. We give analytical bounds on the quality and runtime of our algorithm and report its empirical performance on real-world biological and music sequence datasets.
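
To make the similarity notion concrete, the following is a minimal brute-force sketch, not the approximation algorithm the abstract refers to. It assumes the distance between two k-mers is the Hamming distance and simply counts the pairs of k-mers, one from each sequence, whose distance is at most m. The function names (`kmers`, `hamming`, `approx_match_similarity`) are hypothetical and chosen only for illustration.

```python
from itertools import product


def kmers(seq, k):
    """Return all length-k contiguous substrings of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def approx_match_similarity(x, y, k, m):
    """Count pairs of k-mers (one from x, one from y) whose Hamming
    distance is at most m. Assumption: Hamming distance is the k-mer
    distance; this is an illustrative brute-force count only."""
    return sum(
        1
        for a, b in product(kmers(x, k), kmers(y, k))
        if hamming(a, b) <= m
    )


if __name__ == "__main__":
    # Exact matching (m = 0) versus allowing one mismatch (m = 1).
    print(approx_match_similarity("ACGT", "ACGA", k=2, m=0))
    print(approx_match_similarity("ACGT", "ACGA", k=2, m=1))
```

This sketch compares every pair of k-mers, so its cost grows with the product of the two sequence lengths times k. It is useful for checking small cases by hand, whereas the abstract's concern is precisely that exact computation does not scale to larger k and m.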
