Exhaustive Exact String Matching: The Analysis of the Full Human Genome

Exact string matching has been a fundamental problem in computer science for decades because of its many practical applications. Some relate to common procedures, such as searching in files and text editors; others, more recently, relate to more advanced problems such as pattern detection in Artificial Intelligence and Bioinformatics. Dozens of algorithms and methodologies have been developed for pattern matching, and several programming languages, packages, applications and online systems can perform exact string matching in biological sequences. These techniques, however, are limited to searching for specific, predefined strings in a sequence. This paper presents a novel methodology (called Ex2SM), a pipeline of advanced data structures and algorithms explicitly designed for text mining, that can detect every possible repeated string in multivariate biological sequences. In contrast to known algorithms in the literature, the methodology presented here is string agnostic: it does not require an input string to search for, but instead detects every string that occurs at least twice, regardless of attributes such as length, frequency, alphabet or overlap. The complexity of the problem and the potential of the proposed methodology are demonstrated by an experimental analysis of the entire human genome. More specifically, all repeated strings with a length of up to 50 characters have been detected, an achievement that is practically impossible with other algorithms because the number of possible strings of that length grows exponentially.
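To make the notion of string-agnostic detection concrete, the following is a minimal brute-force sketch in Python. It is not the Ex2SM pipeline, whose data structures and algorithms are not described in this abstract; the function name `repeated_substrings` and the toy sequence are assumptions chosen purely for illustration. The sketch reports every substring up to a given length that occurs at least twice, counting overlapping occurrences, without being given any query string.

```python
from collections import defaultdict


def repeated_substrings(sequence, max_len):
    """Return {substring: [start positions]} for every substring of
    length 1..max_len that occurs at least twice in `sequence`.

    Hypothetical helper for illustration only; overlapping occurrences
    are counted. Memory grows with len(sequence) * max_len, so this is
    only usable on short inputs.
    """
    positions = defaultdict(list)
    n = len(sequence)
    for length in range(1, max_len + 1):
        # Slide a window of the current length across the sequence.
        for start in range(n - length + 1):
            positions[sequence[start:start + length]].append(start)
    # Keep only the substrings seen at two or more positions.
    return {s: p for s, p in positions.items() if len(p) >= 2}


if __name__ == "__main__":
    repeats = repeated_substrings("ACGTACGA", 4)
    for s in sorted(repeats, key=lambda s: (len(s), s)):
        print(s, repeats[s])
```

This naive approach stores every window explicitly, which is feasible for a toy sequence but not for a genome-scale input with lengths up to 50; that gap is the scalability problem the abstract refers to.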
