Comparing Two Counting Methods for Estimating the Probabilities of Strings

11/08/2022
by Ayaka Takamoto, et al.

There are two ways to count the occurrences of a string in a larger string. One counts every position at which the string is found. The other counts how many copies of the string can be extracted without overlapping. The two counts diverge when the string is part of a periodic pattern. This research reports that the difference is significant when estimating the occurrence probability of a pattern. In this study, the strings used in the experiments are approximations of time-series data. The task is to classify strings by estimating their probability or computing their information quantity. First, the frequencies of all substrings of a string are computed; the two counting methods may produce different frequencies for the same substring. Second, the most probable segmentation of the string is selected. The probability of the string is the product of the probabilities of the substrings in the selected segmentation. The classification results demonstrate that the difference between the counting methods is statistically significant, and that the non-overlapping method performs better.
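
The two counting methods and the segmentation step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. All function names are hypothetical, the `max_len` cap on substring length is introduced only to keep the example small, and the normalisation (a substring's count divided by the total count of all substrings considered) is an assumption, since the abstract does not say how frequencies are turned into probabilities.

```python
import math


def count_overlapping(text, pattern):
    """Count every position in text at which pattern starts."""
    if not pattern:
        return 0
    last_start = len(text) - len(pattern)
    return sum(1 for i in range(last_start + 1) if text.startswith(pattern, i))


def count_non_overlapping(text, pattern):
    """Count copies of pattern that can be extracted without overlapping."""
    if not pattern:
        return 0
    # str.count scans left to right and counts non-overlapping occurrences.
    return text.count(pattern)


def substring_probs(text, max_len, counter):
    """Relative frequency of each substring of length <= max_len.

    Assumption: probability = count / sum of counts over all substrings
    considered. The abstract does not specify the normalisation.
    """
    counts = {}
    for n in range(1, max_len + 1):
        for i in range(len(text) - n + 1):
            s = text[i:i + n]
            if s not in counts:
                counts[s] = counter(text, s)
    total = sum(counts.values())
    return {s: c / total for s, c in counts.items()}


def best_segmentation_log_prob(s, probs, max_len):
    """Log-probability of the most probable segmentation of s.

    Dynamic programming over end positions: best[end] is the highest
    sum of log-probabilities over segmentations of s[:end].
    """
    best = [-math.inf] * (len(s) + 1)
    best[0] = 0.0
    for end in range(1, len(s) + 1):
        for start in range(max(0, end - max_len), end):
            p = probs.get(s[start:end])
            if p and best[start] > -math.inf:
                best[end] = max(best[end], best[start] + math.log(p))
    return best[len(s)]


text = "abababab"  # a periodic string, where the two counts differ
print(count_overlapping(text, "abab"))      # 3 (starts at 0, 2, 4)
print(count_non_overlapping(text, "abab"))  # 2 (starts at 0, 4)
```

Passing `count_overlapping` or `count_non_overlapping` as the `counter` argument of `substring_probs` gives two probability tables for the same training string, so the same query string can receive a different score from `best_segmentation_log_prob` under each method.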


