
Evolutionary distances in the twilight zone - a rational kernel approach
Phylogenetic tree reconstruction is traditionally based on multiple sequ...

Exact Indexing of Time Series under Dynamic Time Warping
Dynamic time warping (DTW) is a robust similarity measure of time series...

Regular sequences and synchronized sequences in abstract numeration systems
The notion of b-regular sequences was generalized to abstract numeration...

Sequence Covering Similarity for Symbolic Sequence Comparison
This paper introduces the sequence covering similarity, that we have for...

Identification of functionally related enzymes by learning-to-rank methods
Enzyme sequences and structures are routinely used in the biological sci...

Analogical Dissimilarity: Definition, Algorithms and Two Experiments in Machine Learning
This paper defines the notion of analogical dissimilarity between four o...

Unaligned Sequence Similarity Search Using Deep Learning
Gene annotation has traditionally required direct comparison of DNA sequ...

Efficient Approximation Algorithms for String Kernel Based Sequence Classification
Sequence classification algorithms, such as SVM, require a definition of a distance (similarity) measure between two sequences. A commonly used notion of similarity is the number of matches between k-mers (k-length subsequences) in the two sequences. Extending this definition by considering two k-mers to match if their distance is at most m yields better classification performance. This, however, makes the problem computationally much more complex. Known algorithms to compute this similarity have computational complexity that renders them applicable only for small values of k and m. In this work, we develop novel techniques to efficiently and accurately estimate the pairwise similarity score, which enables us to use much larger values of k and m and obtain higher predictive accuracy. This opens up a broad avenue for applying this classification approach to audio, image, and text sequences. Our algorithm achieves excellent approximation performance with theoretical guarantees. In the process, we solve an open combinatorial problem, which was posed as a major hindrance to the scalability of existing solutions. We give analytical bounds on the quality and runtime of our algorithm and report its empirical performance on real-world biological and music sequence datasets.
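To make the similarity notion concrete, the following is a minimal brute-force sketch, not the approximation algorithm the abstract announces. It assumes that "distance" means Hamming distance and that the score counts every pair of k-mers (one from each sequence) within distance m; the exact kernel definition in the full text may differ. The function names `kmers`, `hamming`, and `mismatch_similarity` are hypothetical and chosen only for illustration.

```python
from itertools import product


def kmers(seq, k):
    """Return all contiguous length-k substrings of seq, repeats included."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def mismatch_similarity(s, t, k, m):
    """Count pairs (k-mer of s, k-mer of t) with Hamming distance <= m.

    Assumption: this pair-counting reading of the abstract is illustrative
    only. With m = 0 it reduces to counting exact k-mer matches.
    """
    return sum(
        1
        for a, b in product(kmers(s, k), kmers(t, k))
        if hamming(a, b) <= m
    )


if __name__ == "__main__":
    # Exact matches only (m = 0), then allowing one mismatch (m = 1).
    print(mismatch_similarity("ACGT", "ACGA", k=3, m=0))
    print(mismatch_similarity("ACGT", "ACGA", k=3, m=1))
```

This direct enumeration compares every k-mer of one sequence against every k-mer of the other, so its cost grows with the product of the two sequence lengths times k. That quadratic cost per pair of sequences is one reason a faster estimate of the score is attractive for large datasets.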