Fuzzy Segmentations of a String

01/31/2022
by   Armen Kostanyan, et al.

This article discusses a particular case of the data clustering problem, where it is necessary to find groups of adjacent text segments of the appropriate length that match a fuzzy pattern represented as a sequence of fuzzy properties. To solve this problem, a heuristic algorithm for finding a sufficiently large number of solutions is proposed. The key idea of the proposed algorithm is the use of the prefix structure to track the process of mapping text segments to fuzzy properties. An important special case of the text segmentation problem is the fuzzy string matching problem, when adjacent text segments have unit length and, accordingly, the fuzzy pattern is a sequence of fuzzy properties of text characters. It is proven that the heuristic segmentation algorithm in this case finds all text segments that match the fuzzy pattern. Finally, we consider the problem of a best segmentation of the entire text based on a fuzzy pattern, which is solved using the dynamic programming method. Keywords: fuzzy clustering, fuzzy string matching, approximate string matching


1 Introduction

This paper belongs to the field of pattern recognition, which “is concerned with the automatic discovery of regularities in data through the use of computer algorithms and with the use of these regularities to take actions such as classifying the data into different categories” [5]. Specifically, data clustering deals with the grouping of elements into clusters based on similarities defined in one way or another [13]. Due to the difficulties in accurately describing clusters in many applications, modern approaches use fuzzy clusters [4]. Data clustering involves sequence labeling, which assigns categorical labels to specific parts of a sequential structure.

We consider the sequence labeling problem for the case when the sequential structure is given in the form of text, i.e., as a sequence of characters in some alphabet. The label category is defined as a pattern represented as a sequence of fuzzy properties of adjacent segments of text. To solve this problem (called the fuzzy local segmentation problem), we propose a heuristic algorithm that finds a significant part of the occurrences of a given pattern in the text. For the efficiency of the proposed algorithm, the concept of a prefix structure is introduced, which is a generalization of the concept of the prefix function array used in the KMP string matching algorithm [6]. A distinctive feature of the prefix structure is the tracking of the process of assigning adjacent text segments to pattern symbols.

In the particular case when adjacent text segments have unit length and the pattern, accordingly, is a sequence of fuzzy properties of alphabet characters, the fuzzy local segmentation problem becomes a fuzzy string matching problem. The latter can be viewed in the context of approximate string matching, variations of which are distance-based string matching [11], string matching using patterns with meta-characters, and, more generally, string matching with patterns represented as regular expressions [1, 2]. A detailed review of these works is presented in the monograph [12].

In this paper, we prove that the heuristic algorithm designed to solve the fuzzy local segmentation problem, adapted for the fuzzy string matching problem, finds all occurrences of a fuzzy pattern in the text. This result summarizes the previous research by the authors in the field of string matching. Particularly, in [9] the periodicity in the pattern was used to improve the efficiency of the preprocessing phase in the method of string matching with finite automata and in the KMP algorithm. In [10], a non-deterministic transition system was constructed to describe the possibilities of processing a given text in order to find all occurrences of a fuzzy pattern in it. In [7], an efficient algorithm was proposed for determining all occurrences of a fuzzy pattern in the text, imitating the KMP algorithm with a two-dimensional prefix table. The prefix structure-based solution we propose in this paper improves this result in terms of memory usage.

Another special case of the fuzzy local segmentation problem arises when it is necessary to split the entire text into adjacent segments of at least a given length in order to best match the fuzzy pattern. We call this problem the fuzzy global segmentation problem. A special case of this problem is Bellman’s string segmentation problem [3], in which it is required to split the text into adjacent segments so that elements from the same segment are closely related to each other. (In contrast to this, we assume that segmentation should be done according to a fuzzy pattern in order to match it in the best possible way.) This problem was considered in [8], where an algorithm for finding an optimal solution using the dynamic programming approach was proposed. In this paper, we present this solution in a more general form.

The paper is organized as follows.

Section 2 presents the fuzzy local segmentation problem and a heuristic algorithm for solving it using the prefix structure. Section 3 introduces the fuzzy string matching problem as a special case of the fuzzy local segmentation problem. It is proven that the heuristic algorithm for the fuzzy local segmentation problem, adapted for this case, finds all occurrences of a fuzzy pattern in the text. Section 4 presents a solution to the fuzzy global segmentation problem. Finally, the conclusion summarizes the obtained results.

2 Fuzzy Local Segmentations

2.1 Preliminaries

Suppose $L$ is a linearly ordered set with the smallest element $0$ and the largest element $1$. According to [14], a fuzzy subset $A$ of the universal set $U$ is defined by the membership function $\mu_A : U \to L$ that associates with each element $u$ from $U$ the value $\mu_A(u)$ from $L$, called the degree of membership of $u$ in $A$. A fuzzy subset $A$ of $U$ can be represented by the additive form

$$A = \sum_{u \in U} \mu_A(u) / u.$$

We say that an element $u$ certainly belongs to $A$ if $\mu_A(u) = 1$, and certainly does not belong to $A$ if $\mu_A(u) = 0$. Otherwise, if $0 < \mu_A(u) < 1$, we say that $u$ belongs to $A$ with degree $\mu_A(u)$.

2.2 Problem definition

Let $\Sigma$ be an alphabet of characters and $\Sigma^*$ be the set of all finite-length strings over $\Sigma$.

We define a fuzzy segmentation symbol (or, in short, a segmentation symbol) as a fuzzy subset of that allows strings in to be measured by elements from L. Given a segmentation symbol and a string , we say that x matches with degree .

A segmentation symbol is said to be regular if

  • The value can be computed in time, and

  • For any , the values and can be obtained from the value in constant time with appropriate tracking of the calculation of the value .

It follows from this definition that for any string and for any regular segmentation symbol , the value can be computed in time. From now on, we assume that the segmentation symbols are regular, unless otherwise stated.

Define the text as a sequence T[1..n] of characters from , where n is the length of the text. The problem of fuzzy segmentation of a text that we are considering is based on the concept of a fuzzy segmentation pattern (or, in short, a segmentation pattern), which is defined as an array of segmentation symbols.

We define the problem of finding a segmentation pattern in the text using 2 parameters, the first of which is a restriction on the length of the text segment, and the second is a restriction on the degree of matching. More precisely, let us define the first parameter as a pair = (, ), where the numbers and determine the minimum and maximum lengths of the string to be found, and the second parameter as L that determines the minimum degree of matching. For and a segmentation symbol , we say that matches and write if and

Given the text the segmentation pattern and the restrictions (, ), we define a valid - segmentation of as a sequence

of adjacent segments of that meet the - restrictions, that is for all

We define the - fuzzy local segmentation problem (or, in short, the - segmentation problem) as the problem of finding all valid - segmentations of .

Let L be the segment [0, 1] of ordered reals and = {0, 1} be a two-element alphabet. Consider the segmentation symbols and such that for all , and are the relative numbers of 0’s and 1’s in x, respectively (it is easy to check that the specified symbols are regular).
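To make the regularity condition concrete for the two symbols of this example, here is a minimal Python sketch. The class name `RatioTracker` and its methods are hypothetical and not taken from the paper, and the convention that an empty segment has degree 0 is an assumption. The sketch keeps a running count of the target character, so the degree of a segment extended on the right or shortened on the left is available in constant time.

```python
from collections import deque


class RatioTracker:
    """Tracks the relative number of `target` characters in a segment."""

    def __init__(self, target):
        self.target = target
        self.chars = deque()  # characters of the current segment
        self.hits = 0         # occurrences of `target` in the segment

    def push_right(self, ch):
        """Extend the segment by one character on the right in O(1)."""
        self.chars.append(ch)
        self.hits += ch == self.target

    def pop_left(self):
        """Drop the leftmost character of the segment in O(1)."""
        ch = self.chars.popleft()
        self.hits -= ch == self.target

    def degree(self):
        """Matching degree in [0, 1]; 0.0 for an empty segment (assumed)."""
        return self.hits / len(self.chars) if self.chars else 0.0
```

A tracker created with `RatioTracker("0")` plays the role of the symbol measuring the relative number of 0's; one created with `RatioTracker("1")` plays the role of the other symbol.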

Suppose that and the segmentation parameters are . Then, there are the following valid - segmentations of :

2.3 Brute force solution

A brute force solution to the - segmentation problem can be obtained by considering all increasing sequences of the text positions and checking if the segmentation

generated by J is a valid – segmentation.

This method of finding the valid segmentations is inefficient since the number of sequences to be considered is , which is exponential in n if . Based on the KMP string matching algorithm, we propose a heuristic method for constructing a sufficiently large number of segmentations, although not necessarily all of them.

2.4 Segment capture heuristic (SC-Heuristic)

The proposed heuristic is based on the KMP string matching algorithm. Note that in the KMP algorithm, moving forward in the text is carried out by one position. On the contrary, the proposed algorithm moves forward through the text by the length of the shortest segment that satisfies the (, ) - restriction for the next segmentation symbol. The found segment is captured in subsequent matches with symbols in the pattern.

We use the following functions to move through the text:

  • . Using the global parameters and , this function returns the rightmost position of the shortest segment that starts at position i of , and (, ) - matches the segmentation symbol (assume that this function returns if no such segment exists). It follows from the regularity of that the value can be calculated in time O().

  • . This function returns if , and otherwise.
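The first of these functions can be sketched in Python as below. The name `shortest_match_end`, the explicit parameters (the paper treats the restrictions as global), and the use of `None` as the "no such segment" value are assumptions. For brevity the sketch recomputes the degree from scratch for each candidate length, so it does not reach the running time that regularity of the symbol makes possible.

```python
def shortest_match_end(text, i, symbol, lmin, lmax, d):
    """Return the 1-based end position of the shortest segment that starts
    at 1-based position i, has length in [lmin, lmax], and matches `symbol`
    with degree >= d. Return None if no such segment exists (assumed
    sentinel). Assumes lmin >= 1."""
    n = len(text)
    for length in range(lmin, lmax + 1):
        end = i + length - 1              # 1-based inclusive end
        if end > n:
            break
        if symbol(text[i - 1:end]) >= d:
            return end
    return None


if __name__ == "__main__":
    ones = lambda x: x.count("1") / len(x)
    print(shortest_match_end("0110", 1, ones, 1, 3, 0.5))
```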

Given an array and an index , we denote by the s - length postfix of z for which

For a given segmentation pattern , an array of strings in the alphabet such that , and the minimum matching degree , we define the x-border of as a subarray such that

(that is, the last components of match the first symbols of with a degree of at least ). We denote the longest -border of .
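A direct, unoptimized way to compute the length of the longest border is sketched below in Python. The function name is hypothetical, and two points are assumptions: the border is taken to be proper (strictly shorter than the array itself, by analogy with KMP), and the strings in the array are assumed to already satisfy the length restriction, so only the degree is checked.

```python
def longest_border_length(pattern, x, d):
    """Largest k < len(x) such that the last k strings of x match the
    first k symbols of `pattern` with degree >= d; 0 if there is none.

    `pattern` is a list of functions mapping a string to a degree.
    """
    s = len(x)
    for k in range(min(s - 1, len(pattern)), 0, -1):
        # x[s - k + j] is the j-th of the last k components of x
        if all(pattern[j](x[s - k + j]) >= d for j in range(k)):
            return k
    return 0


if __name__ == "__main__":
    zeros = lambda t: t.count("0") / len(t)
    ones = lambda t: t.count("1") / len(t)
    print(longest_border_length([zeros, ones, zeros], ["00", "1", "0"], 1.0))
```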

Let us define the x-prefix function for as a mapping

such that for all

Additionally, suppose that .

In addition to the segmentation symbols and defined in Example 2.2, let us introduce the segmentation symbols and so that for all , and are the maximum number of consecutive 0’s and 1’s divided by , respectively. The regularity of and follows from the fact that the values of (resp., ) and (resp., ) can be obtained from and by keeping track of the number of consecutive 0’s and 1’s along with the number of last (resp., first) 0’s and 1’s in .
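As an illustration, a symbol of this kind could be evaluated from scratch as in the Python sketch below. The function name is hypothetical, and the divisor is assumed to be the length of the string. The sketch does not implement the constant-time updates that the regularity argument relies on.

```python
from itertools import groupby


def max_run_ratio(x, ch):
    """Length of the longest run of consecutive `ch` in the non-empty
    string x, divided by len(x) (divisor assumed)."""
    longest = max((len(list(g)) for c, g in groupby(x) if c == ch), default=0)
    return longest / len(x)


if __name__ == "__main__":
    print(max_run_ratio("0011101", "1"))
```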

Suppose Then, the -prefix function for will be defined as follows:

For a given segmentation pattern , let us define the P-based prefix structure as a triplet , where

  • is the length of the structure, ,

  • is a -length array of strings in the alphabet such that

    ,

  • is a -length array of values of the -prefix function for .

Let us define two basic operations over the prefix structure :

  • . This operation converts the -based prefix structure to the -based prefix structure such that , where . As a result of performing this operation, the length of the non-empty prefix structure is decreased by at least 1.

  • Assuming that , this operation converts the -based prefix structure to the -based prefix structure such that

    • ,

    • for ,

    • for ,

  • where is the first element of the sequence for which . Additionally, suppose that if there is no such element.

As a result of performing this operation the length of the prefix structure is increased by 1.

Assuming the arrays and are organized as multi-queue and multi-stack respectively, with operations

  • : inserts into multi-queue/multi-stack,

  • : removes elements from multi-queue/multi-stack,

consider the following implementations of these operations:

Input: A prefix structure
Output: The is reduced
; ; ; //leaves the last elements of ; //leaves the first elements of
Algorithm 1 reduce // Reduces a prefix structure
Input: A prefix structure , where A such that                                    ( and are global parameters)
Output: The is extended by
1 ; //adds to the multi-queue ; while  and  do
2       ;
3 end while
; //adds new element to the multi-stack ;
Algorithm 2 extend //Extends a prefix structure

The prefix structure based implementation of the algorithm using the and operations is shown in Figure 1.

Input: Text , pattern , global parameters and
Output: A set of valid - segmentations of
1 ; //current prefix structure ; //current position in the text while  do
2       while  and  do
3             ;
4       end while
5      if   then
6             ;
7       end if
8      if  then //entire pattern matched
9             ; ;
10       end if
11      ;
12 end while
Algorithm 3 Prints a set of valid segmentations of
Figure 1: Prefix structure based implementation of the algorithm.

Suppose the segmentation symbols and are defined as in Example 2.4, .

Consider the following description of the processing of these data by the algorithm:

  • matches , generating the segmentation .

  • Then, we have a mismatch for in position .

  • Since , we fix the matching with the segment and continue processing for and .

  • matches , generating the first valid segmentation

  • Since , we fix the matching with and continue processing for and .

  • matches , generating the second valid segmentation

  • As a result, the algorithm generates the following two valid - segmentations of :

  • Note that the following - valid segmentation of is not found by the algorithm:

(See illustration in Figure 2.)

Figure 2: Illustration of the execution of the algorithm.

ANALYSIS

Let us use the potential method to estimate the complexity of the algorithm. Define the potential before each execution of the body of the external while loop to be , which is initially 0 and never becomes negative. Considering that the text segments included in can be identified by pairs of indices, suppose that the elements of multi-queue are pairs of index values.

Ignoring for now the function call in line 5, we can argue that the while loop in lines 5-7 has amortized complexity, since the actual cost of an iteration is compensated by a decrease in potential.

The operation in line 9 has amortized complexity. Indeed, its actual cost is due to iterations of the while loop in lines 3-5 of the procedure, at each of which we calculate the matching degree in time. At the same time, this operation increases the potential by 1, which gives for both actual and amortized costs.

Finally, the if statement in lines 11-14 has amortized complexity due to the actual cost of the operation and the amortized cost of the procedure.

Thus, we get the amortized complexity for the body of the outermost while loop. Since it is executed times, we have total complexity ignoring function calls.

It follows from the regularity of the segmentation symbols that a single call to function takes time. The total number of calls is since each call to this function in line 5 is accompanied by a decrease in potential. Obviously, the total number of calls to the same function in line is again . Thus, we get the total time for all calls to the function.

Summarizing the above, we conclude that the algorithm has time complexity

The algorithm uses extra memory required to represent the prefix structure .

It is important to figure out what fraction of valid segmentations the algorithm is guaranteed to recognize. The following statement sheds light on this question.

There is an extreme case of behavior of the algorithm when it finds only of valid - segmentations of .

Proof.

Let us define the segmentation symbols and the minimum matching degree so that the string satisfies the property iff and only one position of contains the value .

Suppose that , .

In this case, the algorithm produces a single valid segmentation of (starting at position 1), while there are valid segmentations starting at positions , respectively. ∎

3 Fuzzy String Matching

3.1 Problem definition

Let us consider a specific case of the fuzzy segmentation problem when . In this case, we rename the fuzzy segmentation symbol to fuzzy symbol, which can be defined as a fuzzy subset of . Similarly, we rename the fuzzy segmentation pattern to fuzzy pattern, which is defined as a sequence of fuzzy symbols of length . Finally, we rename the - fuzzy segmentation problem to the - fuzzy string matching problem and formulate it as follows.

Given text , fuzzy pattern and threshold , find all positions , , (hereinafter - match positions) in such that

Let us choose and define the fuzzy symbols , and as follows:

Then for and there are two - match positions in , which are and

The fuzzy string matching problem is a direct generalization of the classical string matching problem. It was investigated in [10], where a non-deterministic transition system was constructed to describe the possibilities of processing a given text to find all occurrences of a fuzzy pattern in it, and an efficient algorithm was proposed to determine a certain part of the occurrences.
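As a baseline for the problem stated above, a naive check of every alignment can be sketched in Python. This is not the algorithm proposed in this paper. The function name, the representation of a fuzzy symbol as a dictionary from characters to degrees, and the choice to report a match by its 1-based starting position are assumptions made for illustration.

```python
def fuzzy_match_positions(text, pattern, d):
    """Return all 1-based start positions i such that, for every j, the
    character text[i + j - 1] belongs to the fuzzy symbol pattern[j]
    with degree >= d. Characters absent from a symbol have degree 0."""
    n, m = len(text), len(pattern)
    return [
        i + 1
        for i in range(n - m + 1)
        if all(pattern[j].get(text[i + j], 0.0) >= d for j in range(m))
    ]


if __name__ == "__main__":
    pattern = [{"a": 1.0, "b": 0.5}, {"b": 1.0}]
    print(fuzzy_match_positions("abbbab", pattern, 0.5))
```

The check costs time proportional to the product of the text and pattern lengths, which is the cost that prefix-based methods aim to avoid.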

3.2 Solution to the fuzzy string matching problem

In [7], an -time algorithm was proposed to solve the fuzzy string matching problem using a two-dimensional prefix table, which is a generalization of the one-dimensional prefix array used in the KMP algorithm. In this section, we propose a new -time algorithm for the same problem, obtained by customizing the algorithm. Unlike the algorithm from [7], which has -space complexity, the proposed algorithm is more efficient in terms of memory usage and has -space complexity.

Let us note a number of simplifications in the algorithm and related data structures, which result in a solution to the fuzzy string matching problem.

  • In the prefix structure , becomes an array of characters from instead of an array of strings in ,

  • The condition in line 5 becomes ,

  • The if statement in lines 8-10 becomes
    if   then

                   
          end if
  • The procedure in line 12 can be simplified to ,

  • The statement in line 15 becomes .

4 Fuzzy Global Segmentations

4.1 Problem definition

In this problem, we remove the restriction on the maximum length of substrings and consider the problem of splitting the entire string into substrings with the minimum length for optimal matching to the segmentation pattern . For a more accurate assessment of the quality of segmentation, we add an accumulation operation to the lattice , and instead of the minimum degree of matching of text segments with pattern symbols, use the value of the accumulation operation applied to text segments. In [8], a dynamic programming algorithm was proposed to find a best solution for a particular case of this problem. Below we formulate it in a more general form and present an algorithm for constructing a best solution.

Assume that the binary monotonic accumulation operation is defined on the set of measures , so that is a commutative monoid with respect to , with neutral element 1 and zero element 0. That is, for all
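Under the usual reading of these terms, the required properties can be written out as below for all $a, b, c \in L$. The symbol $\otimes$ for the accumulation operation and the variable names are notational assumptions.

```latex
\begin{align*}
a \otimes b &= b \otimes a, \\
(a \otimes b) \otimes c &= a \otimes (b \otimes c), \\
a \otimes 1 &= a, \\
a \otimes 0 &= 0, \\
a \le b &\implies a \otimes c \le b \otimes c.
\end{align*}
```

Multiplication on the segment [0, 1] of real numbers satisfies all five conditions, as does taking the minimum.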

For a text , an integer and a restriction such that , define the - decomposition of as any sequence

of adjacent segments of of at least length that cover .

Given text , segmentation pattern and restriction , we define the - fuzzy global segmentation problem (or, in short, the - global segmentation problem) as an - decomposition of that maximizes the value

Let us denote

for all - decompositions of into substrings .

Suppose that the segmentation symbols and , as previously, are defined as the relative number of 0’s and 1’s, respectively, . Suppose also that is the segment of real numbers and the operation is the multiplication.

In this case, the only solution to the global segmentation problem is the decomposition , where

with .

Note that when , then there is another solution , where

4.2 Solution to the global segmentation problem

[Optimal substructure of the global segmentation problem] Suppose . Then

1. .

2.

for all .

Proof.

The first statement is obvious. The second statement follows from the fact that for a solution to the - global segmentation problem for , we have that

  • must match a substring for some value such that (to have the last segment of at least length), and (to have solution to the , - global segmentation problem for ).

  • The segmentation should be a solution to the - global segmentation problem for , since otherwise, the solution can be improved.

Recursive computation of based on Theorem 4.2 will be inefficient due to overlapping subproblems. To avoid this, let us use the dynamic-programming approach in the bottom-up version.

For , denote The optimal substructure of the global segmentation problem dictates the following recurrent equation for calculating :

The optimal cost of the global segmentation is obviously . To construct an optimal segmentation as well, let us maintain the value , equal to the index maximizing the value in formula (4.2.1).
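The recurrence and the reconstruction of an optimal segmentation can be sketched together in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name, the argument `combine` (defaulting to multiplication), the representation of symbols as functions from strings to degrees, and the return convention are all hypothetical. For brevity each degree is recomputed from scratch, so the sketch does not exploit regularity of the symbols.

```python
def global_segmentation(text, pattern, lmin, combine=lambda a, b: a * b):
    """Split all of `text` into len(pattern) adjacent segments, each of
    length >= lmin, maximizing the accumulated matching degree.

    Returns (best_value, segments) with 1-based inclusive (start, end)
    pairs, or (None, []) if no admissible decomposition exists.
    """
    n, m = len(text), len(pattern)
    if m == 0 or lmin < 1 or n < m * lmin:
        return None, []
    # best[j][i]: optimal value for matching pattern[:j] against text[:i]
    best = [[None] * (n + 1) for _ in range(m + 1)]
    # cut[j][i]: end of the prefix covered by the first j - 1 symbols
    cut = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(lmin, n + 1):
        best[1][i] = pattern[0](text[:i])
    for j in range(2, m + 1):
        for i in range(j * lmin, n + 1):
            for k in range((j - 1) * lmin, i - lmin + 1):
                value = combine(best[j - 1][k], pattern[j - 1](text[k:i]))
                if best[j][i] is None or value > best[j][i]:
                    best[j][i], cut[j][i] = value, k
    # Walk the stored cut points back from the end of the text.
    segments, i = [], n
    for j in range(m, 0, -1):
        k = cut[j][i]
        segments.append((k + 1, i))
        i = k
    return best[m][n], segments[::-1]


if __name__ == "__main__":
    zeros = lambda x: x.count("0") / len(x)
    ones = lambda x: x.count("1") / len(x)
    print(global_segmentation("0011", [zeros, ones], 1))
```

Storing the maximizing index alongside each value is what allows the segmentation itself, and not only its cost, to be recovered.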

The memoization and construction phases of the proposed algorithm for solving the - global segmentation problem are provided in Figures 5 and 6.

Input: Text , pattern and global parameter
Output: The -value matrix and the integer matrix
1 for  to  do
2       ;
3 end for
4 for  to  do
5       for  to  do
6             ; for  downto  do
7                   ; if   then
8                         ;
9                   end if
10                  
11             end for
12            
13       end for
14      
15 end for
return and ;
Algorithm 5 Constructs the auxiliary matrices and
Figure 5: Construction of the -value matrix and the integer matrix .
Input: Global parameter Matrix and indices and such that
Output: The pairs of indices determining a best solution to the - global segmentation
problem for
1 if   then
2       ;
3 end if
4 else
5      ; ;
6 end if
Algorithm 6 Prints a solution to the - glob. segm. problem for
Figure 6: Extraction of a solution based on the matrix .

The initial call to the procedure is .

With the initial data taken from Example 4.1, the procedure creates the matrices and shown in Figure 7.

Figure 7: Matrices and created based on and

The procedure, processing the matrix , prints the pair last, the pair second to last, and the pair first. These pairs correspond to the following optimal decomposition of :

with .

ANALYSIS

Three nested loops with headers in lines 6, 7 and 9 of the procedure are executed at most , and times, respectively. The execution of the body of the innermost loop in lines 10-13 can be made constant, since the regularity of the segmentation symbols implies that the value can be obtained from the value in constant time. As a result, we have time complexity for the procedure. The procedure obviously runs in time. Thus, the proposed solution to the global segmentation problem has time complexity .

The procedure requires space to store the -value matrix and the integer matrix .

5 Conclusion

The paper considers the problem of text segmentation according to a fuzzy pattern. This problem is investigated in the following two aspects:

  • As a fuzzy segmentation problem aimed at finding text segmentations matching the pattern with given lower and upper limits on the length of the segmentation units, and

  • As a fuzzy decomposition problem aimed at decomposing the entire text into adjacent segments in order to best match the pattern, with a given lower limit on the length of the decomposition units.

For the fuzzy segmentation problem, a heuristic algorithm is proposed for finding a sufficiently large number of occurrences of a pattern in the text. In the special case, when it is required that segments have a unit length, this problem is transformed into the fuzzy string matching problem, when the occurrence of a pattern in the text means that there is a segment in the text having a length equal to the length of the pattern, with one-to-one correspondence between segment characters and pattern symbols. It is proven that the heuristic segmentation algorithm adapted for this particular case finds all occurrences of the pattern in the text.

For the fuzzy decomposition problem, an algorithm for finding a best solution is developed using the dynamic programming approach.

All proposed algorithms are implemented and verified on test cases.

If and are the text and the pattern lengths, respectively, and are limits on segment length, then the proposed algorithms have the following time and space complexities:

  • Fuzzy segmentation problem: ,

  • Fuzzy string matching problem: ,

  • Fuzzy decomposition problem: .

6 Acknowledgements

This work was supported by the Ministry of Education, Science, Culture and Sports of the Republic of Armenia, project 21T-1B326.

References

  • [1] R. A. Baeza-Yates and G. Navarro (1996) A faster algorithm for approximate string matching. In Proceedings of the 7th Annual Symposium on Combinatorial Pattern Matching, CPM ’96, Berlin, Heidelberg, pp. 1–23. ISBN 3540612580.
  • [2] R. Baeza-Yates and G. Navarro (2006) Multiple approximate string matching. pp. 174–184. ISBN 978-3-540-63307-5.
  • [3] R. Bellman (1961) On the approximation of curves by line segments using dynamic programming. Commun. ACM 4 (6), pp. 284. ISSN 0001-0782.
  • [4] J. Bezdek (1981) Pattern recognition with fuzzy objective function algorithms. ISBN 978-1-4757-0452-5.
  • [5] C. M. Bishop (2006) Pattern recognition and machine learning (information science and statistics). Springer-Verlag, Berlin, Heidelberg. ISBN 0387310738.
  • [6] D. Knuth, J. H. Morris, and V. Pratt (1977) Fast pattern matching in strings. SIAM J. Comput. 6, pp. 323–350.
  • [7] A. H. Kostanyan (2020) Fuzzy string matching using a prefix table. Mathematical Problems of Computer Science 54, pp. 116–121.
  • [8] A. Kostanyan and A. Harmandayan (2019) Mapping a fuzzy pattern onto a string. In 2019 Computer Science and Information Technologies (CSIT), pp. 5–8.
  • [9] A. Kostanyan and A. Karapetyan (2019) String matching in case of periodicity in the pattern. In Recent Research in Control Engineering and Decision Making, O. Dolinina, A. Brovko, V. Pechenkin, A. Lvov, V. Zhmud, and V. Kreinovich (Eds.), Cham, pp. 61–66. ISBN 978-3-030-12072-6.
  • [10] A. Kostanyan (2017) Fuzzy string matching with finite automata. In 2017 Computer Science and Information Technologies (CSIT), pp. 9–11.
  • [11] G. M. Landau and U. Vishkin (1986) Efficient string matching with k mismatches. Theoretical Computer Science 43, pp. 239–249. ISSN 0304-3975.
  • [12] W. F. Smyth (2013) Computing regularities in strings: a survey. European Journal of Combinatorics 34 (1), pp. 3–14. ISSN 0195-6698.
  • [13] Z. Wang (2017) Image segmentation by combining the global and local properties. Expert Systems with Applications 87.
  • [14] L. A. Zadeh (1975) The concept of a linguistic variable and its application to approximate reasoning-I. Information Sciences 8 (3), pp. 199–249. ISSN 0020-0255.