On-shelf Utility Mining of Sequence Data

11/26/2020 ∙ by Chunkai Zhang, et al. ∙ Harbin Institute of Technology, University of Illinois at Chicago

Utility mining has emerged as an important and interesting topic owing to its wide application and considerable popularity. However, conventional utility mining methods have a bias toward items that have longer on-shelf time as they have a greater chance to generate a high utility. To eliminate the bias, the problem of on-shelf utility mining (OSUM) is introduced. In this paper, we focus on the task of OSUM of sequence data, where the sequential database is divided into several partitions according to time periods and items are associated with utilities and several on-shelf time periods. To address the problem, we propose two methods, OSUM of sequence data (OSUMS) and OSUMS+, to extract on-shelf high-utility sequential patterns. For further efficiency, we also designed several strategies to reduce the search space and avoid redundant calculation with two upper bounds time prefix extension utility (TPEU) and time reduced sequence utility (TRSU). In addition, two novel data structures were developed for facilitating the calculation of upper bounds and utilities. Substantial experimental results on certain real and synthetic datasets show that the two methods outperform the state-of-the-art algorithm. In conclusion, OSUMS may consume a large amount of memory and is unsuitable for cases with limited memory, while OSUMS+ has wider real-life applications owing to its high efficiency.


1. Introduction

It is well known that the tasks of association rule mining (zhang2019privacy), clustering (huang2019ultra), classification (pavlinek2017text), and prediction (wu2018hybrid), which primarily aim at extracting interesting information and patterns from data repositories, perform a crucial role in the domain of knowledge discovery in database (i.e., data mining) (han2011data). Among them, association rule mining concentrates predominantly on the existence of a causal relationship in databases within the framework of support and confidence, where a subproblem of frequent itemset mining (FIM) exists (agrawal1994fast). As the name implies, the goal of FIM is to discover all frequent itemsets with respect to a threshold called support, which is the lower limit of occurrences. As a significant part in the domain of pattern mining, the research issue of FIM has been extensively studied so that the patterns, i.e., frequent itemsets, can be extracted easily and efficiently. However, FIM cannot tackle sequence data, where each item is embedded in a timestamp. The inherent sequential order among items results in a serious combinatorial explosion of the search space that results in not only a significantly long execution time but also large memory consumption. To handle this issue, a series of sequential pattern mining (SPM) algorithms were proposed (agrawal1995mining; ayres2002sequential). Given a sequence database, the mining objective of the SPM algorithms is to extract the complete set of frequent subsequences as sequential patterns. Bearing a striking resemblance to FIM, SPM identifies patterns with the measure of frequency. Therefore, both FIM and SPM approaches belong to the frequency-oriented framework. More details about FIM and SPM can be obtained from (fournier2017surveyi; fournier2017surveys).

Evidently, the implicit assumption within the frequency-oriented framework is that the frequency of a pattern, as the measure, represents its significance and interest; however, the frequency of a pattern cannot completely reveal the interesting aspects of a pattern (gan2020fast). There are several measures, such as utility, diversity, conciseness, applicability, and surprisingness that should be considered while extracting and ranking patterns with respect to their potential interests to the user (geng2006interestingness). Among them, utility is a semantic measure, that is, it considers the objective factors of the data itself, as well as the preferences of the user. In general, this type of interestingness is based on user-specified utility functions that can fit the importance of data, which are generated from domain knowledge or customized expectation of users. In practice, in a retail store, decision makers prefer goods with a high return on investment to those merely sold well, which is based on achieving profit maximization. For instance, the sales volume of televisions may be relatively low when compared to that of pencils; however, even certain infrequent sales patterns that include televisions may yield a higher profit than those pertaining to pencils. Hence, in this circumstance, the measure of patterns is supposed to consider a tradeoff between sales volume and unit profit to determine the preference of users. Thus, the concept of utility was introduced into data mining, resulting in the inception of the field of utility mining (gan2019utility; gan2020proum). As (gan2018surveyuo) formulated, all utility mining methods belong to the utility-oriented framework, which considers not only the quantity (i.e., internal utility) but also relative importance (i.e., external utility) of items. 
Accordingly, FIM and SPM have been generalized as the tasks of high-utility itemset mining (HUIM) (gan2018surveyii) and high-utility SPM (HUSPM) (truong2019survey), respectively, by incorporating the concept of utility. When compared to HUIM, HUSPM considers the sequential ordering of items and has to encounter a more serious combinatorial explosion of the search space, which results in a greater challenge in discovering desired patterns. Till now, several previous studies have developed a number of strategies and efficient data structures so that the HUSPM methods can be successfully applied to numerous real-life situations such as mobile commerce environment (shie2011mining), web browsing analysis (ahmed2010mining), and gene regulation (zihayat2017mining).

HUSPM algorithms can identify interesting and informative patterns from complex databases because the problem of HUSPM simultaneously considers frequency, sequential order, and utility. However, HUSPM only considers relative time, i.e., the sequential order of items in a sequence, and neglects absolute time in the database (lan2014discovery). In reality, each item in the database appears only in certain user-specified time periods, namely its on-shelf time. In real-world situations, certain goods are only placed on the shelf during short time periods; for example, in most retail stores, T-shirts are sold only in the summer and are removed from the shelf in other seasons. As a result, the pattern ⟨pencil, pencil sharpener⟩, which is on the shelf throughout the year, is more likely to be of high utility, while the pattern ⟨T-shirt, shorts⟩ may demonstrate a low utility even though the items sell well during their short on-shelf time. Conventional utility mining methods therefore have a bias toward items that have longer on-shelf time, as they have a higher chance of generating a high utility. To eliminate the bias, the problem of on-shelf utility mining (OSUM) was proposed, where the database is divided into several partitions according to time periods and each item is associated with several on-shelf time periods. Incorporating on-shelf information into the original measure, that is, utility, helps to discover interesting and informative time-related patterns. On the one hand, consumers expect to buy their favorite goods at any time whether they are on sale or not, and they may blacklist the store even if the goods run out of stock on the shelves only a few times. On the other hand, retailers value every row on their shelves, where each inch of space is valuable; moreover, they have to ensure that the right brands and goods are placed on the shelf at the appropriate times. 
Therefore, these extracted time-related patterns are helpful for retailers to make decisions.

To fulfill the task of OSUM, HUIM and HUSPM are extended to the problems of OSUM in transaction data (lan2014shelf; radkar2015mining; fournier2015foshu) and sequence data (lan2014discovery), respectively, while considering the on-shelf time. In the task of OSUM in transaction data, algorithms can identify on-shelf high-utility itemsets (osHUIs) by recursively enumerating itemsets in an ascending alphabetical order. For further efficiency improvement, a series of upper bounds as well as pruning strategies were developed to reduce the massive search space. To address the sequence data, Lan et al. (lan2014discovery) proposed a novel approach serving as the first solution to discover on-shelf high-utility sequential patterns (osHUSPs). This issue of OSUM of sequence data considers a vital character, that is timestamps embedded in the items; thus, a method to determine osHUSPs has to overcome certain technical challenges, which are as presented below.

First, the calculation of utility is more complex than that of frequency as a subsequence may occur in a sequence multiple times. Therefore, knowing whether a subsequence will appear in a sequence is insufficient; determining all occurrences and choosing a proper calculation method are required.

Second, according to the Apriori property, the frequency measure in the frequency-oriented framework is anti-monotonic. However, this downward closure property is not satisfied for patterns with embedded utility values, which implies that it is impracticable to reduce the search space in utility mining by relying on the anti-monotone properties of the frequency-oriented framework.

Third, a common and efficient method to mine patterns is by enumerating patterns in an alphabetical order recursively, which presents a general issue, that is, critical combinatorial explosion of the search space. This is because the intrinsic sequential ordering leads to several possibilities of concatenation, which is the operation of generating longer candidates. Thus, it is necessary to design tight upper bounds and powerful pruning strategies to overcome inefficient checking of a large number of candidates.

Fourth, the search space in the problem sharply increases when the on-shelf information is incorporated. This is because the method must enumerate sequences in different time periods. Moreover, it is not straightforward to adapt the pruning techniques in HUSPM for OSUM of sequence data, and only a few pruning strategies were developed to address this problem.

To the best of our knowledge, only one work (lan2014discovery) focuses on extracting osHUSPs in sequence data. The research topic is not yet mature, and the only existing algorithm, referred to as a two-phase approach for mining high on-shelf utility sequential patterns (TP-HOUS), has significant room for improvement in terms of execution time, memory consumption, candidate filtering, and scalability. Moreover, no systematic problem statement has been formulated. In this paper, we formulate the problem of OSUM of sequence data and propose two efficient algorithms. The major contributions of this study can be summarized as follows:

  • We formulated the problem of OSUM of sequence data as discovering the complete set of osHUSPs. In particular, important notations, concepts and principles in the problem are defined.

  • A two-phase method, namely On-Shelf Utility Mining of Sequences (OSUMS), is designed with a novel storage data structure named periodical q-matrix. Two local pruning strategies are proposed to reduce the search space with two upper bounds, namely time prefix extension utility (TPEU) and time reduced sequence utility (TRSU), and a further strategy is designed to avoid redundant calculations.

  • We also designed a one-phase method, namely OSUMS+, which overcomes the two challenges present in the two-phase algorithms. OSUMS+ adopts a novel storage data structure named periodical utility chain and utilizes two global pruning strategies with the two upper bounds TPEU and TRSU.

  • A series of experiments on six real-world datasets were conducted to evaluate the performance of the two proposed methods. The experimental results show that they outperform the only existing algorithm TP-HOUS. In addition, when compared to the one-phase algorithm OSUMS+, OSUMS consumes a significantly larger amount of memory and is less efficient.

The remainder of this article is organized as follows. Section 2 briefly reviews related work about utility mining and OSUM. Section 3 presents basic definitions and formulates the problem of OSUM of sequence data. In Section 4, we present the details of the two proposed methods with several novel data structures, upper bounds, and strategies. An experimental evaluation of the designed methods is presented in Section 5. Finally, conclusions and future work are discussed in Section 6.

2. Related Work

In this section, we separately review the prior literature pertaining to utility mining and OSUM. In particular, we also discuss the drawbacks of two-phase algorithms in the process of pattern mining, which also exist in the state-of-the-art algorithm in the domain of OSUM of sequence data.

2.1. Utility Mining

The problem of HUIM is to select the high-utility itemsets (HUIs) from a transaction database where each row is an itemset and each item has a utility representing its importance based on the utility functions. Chan et al. (chan2003mining) first incorporated the concept of utility for discovering highly desirable statistical patterns (i.e., HUIs); moreover, they presented a level-wise approach with a novel pruning strategy by relying on a weaker but anti-monotonic condition. Then, Liu et al. (liu2005two) developed the two-phase algorithm, a milestone in the development of HUIM. They pioneered an upper bound named transaction-weighted utilization (TWU), which satisfies the downward closure property. The anti-monotonicity of the upper bound guarantees that the complete set of HUIs is included in the candidates with high TWU values generated in the first phase. After all candidates are identified, one more database scan is required to calculate their exact utilities in the second phase, and the HUIs are output at the same time. Inspired by the two-phase algorithm, a series of two-phase methods such as incremental high-utility pattern (ahmed2009efficient), UP-Growth (tseng2012efficient), and Maximum Utility Growth (yun2014high) were developed. When compared to the two-phase algorithm, these subsequent approaches extract HUIs in two similar phases but, by working in a pattern-growth manner, avoid generating candidates that do not appear in the database. As explained, the two-phase algorithms have to retain a large number of candidates in memory and calculate their utilities in the second phase, which are the major causes of the high computational cost. In the worst situation discussed in (fournier2019survey), all candidates are evaluated in all transactions, which incurs several database scans. 
To avoid the scalability issue caused by storing a large number of candidates, one-phase algorithms were designed, which achieved a significant breakthrough in terms of efficiency. In the mining process, whether a candidate has a high utility can be identified as soon as it is enumerated; thus, the memory for storing the candidate can be released immediately. To date, one-phase HUIM approaches such as fast HUIM (fournier2014fhm), modified HUI-Miner (peng2017mhuiminer), and efficient HUIM (zida2015efim) have been extensively studied. More details about HUIM can be obtained from comprehensive surveys (fournier2019survey; gan2018surveyii).

The primary goal of HUSPM is to identify the subsequences with high utility, which are referred to as high-utility sequential patterns (HUSPs), from a sequence database. Similar to the development history of HUIM, early HUSPM methods extracted desired HUSPs in two phases. Ahmed et al. are the pioneers who first incorporated utility into SPM (ahmed2010novel). They proposed a novel HUSPM framework for more real-world applicable information discovery. Further, they developed two two-phase algorithms, i.e., utility level (UL) and utility span (US). UL is a level-wise candidate generation method, which implies that it may generate candidates that do not appear in the database. The brute-force algorithm UL is more simple and straightforward than US; however, it generates too many candidates in the first phase and requires several database scans that are time consuming. The other algorithm, US, exploits a pattern-growth approach, which generates candidates based on relatively small-scale projected databases such that it requires a maximum of three database scans. With a striking resemblance to the two-phase HUIM algorithms, the two-phase methods of HUSPM also demonstrate two limitations. The first limitation is that a considerable amount of memory is occupied for storing the candidates generated in the first phase. The second one is that it is time consuming to compute actual utilities of candidates. To address the issue, a series of one-phase algorithms were proposed. Yin et al. (yin2012uspan) formulated the HUSPM problem statement and designed a formal framework for HUSPM. A novel method USpan was introduced to discover the desired HUSPs using a new structure called lexicographic quantitative sequence tree (LQS-Tree), which represents the search space to be traversed. Moreover, they developed depth-first and width-first pruning strategies by adopting sequence-weighted utilization (SWU) and sequence-projected utilization (SPU) upper bounds, respectively. 
However, the SPU is not a true overestimation of the utility value, which implies that the width-first pruning strategy may prune nodes representing HUSPs in the LQS-Tree. Then, Alkan et al. (alkan2015crom) proposed the high utility sequential pattern extraction (HuspExt) algorithm based on a pattern tree. Each node in the tree structure stores the information necessary for calculating a tighter upper bound named cumulated rest of match (CRoM) and for facilitating a pruning strategy called pruning before candidate generation. To mine HUSPs more efficiently, Wang et al. (wang2016efficiently) extended USpan (yin2012uspan) to HUS-Span, which can quickly discover patterns. Besides SWU, the method adopts two tight upper bounds, prefix extension utility (PEU) and reduced sequence utility (RSU), which can be easily obtained from a utility chain. This compact data structure expedites the calculation of not only upper bound values but also utility values. To date, HUSPM algorithms have been researched considerably (gan2020fast; gan2020proum), and the mining process is becoming increasingly efficient owing to compact data structures and effective pruning strategies.

2.2. On-shelf Utility Mining

The aforementioned utility mining approaches assume by default that all items are on the shelf all the time in market basket analysis. Thus, there exists a bias in that patterns having more on-shelf time are more likely to be extracted. To eliminate the bias, the problem of OSUM (lan2011discovery; lan2014shelf; fournier2015foshu; radkar2015mining; dam2017efficient; lan2014discovery) was generalized in utility mining by considering the on-shelf time of items. Lan et al. (lan2011discovery) first defined the task of OSUM in transaction data as the process of discovering the complete set of osHUIs from a transaction database with respect to a user-specified threshold. They also proposed an efficient two-phase algorithm, TP-HOU, with a data structure based on a periodical total transaction utility table that increases the execution efficiency. In the first phase, TP-HOU generates, as candidates, promising on-shelf utility itemsets that have high on-shelf utility in the current time period. Then, it calculates the actual on-shelf utility in the entire database in the second phase. Similar to the two-phase utility mining methods, the algorithm also suffers from the two drawbacks that degrade the efficiency. Subsequently, their team extended the problem to the case where certain items are associated with negative utility values (lan2014shelf). To cope with the issue, they designed the two-phase three-scan algorithm for mining osHUIs with negative profit (TS-HOUN), which requires only three database scans by adopting a proper upper bound and an effective itemset generation method. For further efficiency improvement, the mining process was reduced to one phase by Fournier-Viger et al. (fournier2015foshu) in the faster osHUI mining (FOSHU) algorithm. FOSHU handles all time periods simultaneously and introduces novel strategies to handle negative values efficiently; consequently, it is reported to run up to 1,000 times faster than the then state-of-the-art TS-HOUN. 
Moreover, certain extension problems were generalized to extract other interesting patterns; for example, discovering top-k osHUIs (dam2017efficient) and mining osHUIs from dynamically updated databases (radkar2015mining).

Considering the on-shelf time, utility values, and sequential order, Lan et al. (lan2014discovery) presented a new research issue, that is, OSUM of sequence data, which is intrinsically more complex than the task in transaction data owing to a combinatorial explosion of the huge search space. They also developed a two-phase method, TP-HOUS, to mine osHUSPs in a temporal quantitative sequence database. Moreover, the sequence-utility upper bound (SUUB), as well as a corresponding pruning strategy, was designed to speed up the mining process. As a two-phase algorithm, TP-HOUS suffers from the common limitations of the aforementioned two-phase methods. To the best of our knowledge, TP-HOUS is the only existing method for OSUM of sequence data. There is significant room for improvement in terms of execution time, memory consumption, and scalability.

3. Definitions and Problem Statement

In this section, we introduce the significant definitions, including concepts, notations, and principles, used in the domain of OSUM. Note that certain basic definitions are derived from prior works (lan2014discovery; gan2020proum). Moreover, we formulate a normative problem statement of OSUM of sequence data.

3.1. Preliminaries

Let the finite set $I = \{i_1, i_2, \ldots, i_N\}$ be a universal set, where the elements are the items that may appear in the problem. An itemset $X$ is a set of distinct items and satisfies $X \subseteq I$. If the length of $X$, that is, the number of items contained in the itemset, is equal to $k$, the itemset is called a $k$-itemset. A sequence $t = \langle X_1, X_2, \ldots, X_n \rangle$ is a list of itemsets in chronological order, where $X_j \subseteq I$ for $1 \le j \le n$. The number of itemsets contained in $t$ is the size of $t$. We define $l = \sum_{j=1}^{n} |X_j|$ as the length of $t$, and $t$ is called an $l$-sequence if its length is $l$. For example, $X = \{a\ b\}$ is a two-itemset, and $t = \langle \{a\ f\}, \{b\ e\ f\} \rangle$ is a five-sequence as it consists of five items.

Definition 3.1.

A quantitative item is defined as an ordered tuple $(i:q)$, where $i \in I$ and $q$ is a positive integer representing the internal utility of item $i$. A quantitative itemset, denoted as $Y = \{(i_1:q_1)\ (i_2:q_2)\ \cdots\ (i_k:q_k)\}$, is a set of $k$ quantitative items. Similarly, we also quantify the sequence and define the quantitative sequence $s = \langle Y_1, Y_2, \ldots, Y_n \rangle$, where $Y_j$ is a quantitative itemset for $1 \le j \le n$. For brevity, we use the prefix "q-" to denote the term quantitative. For example, a quantitative item/quantitative itemset/quantitative sequence can be denoted as q-item/q-itemset/q-sequence, respectively.

Definition 3.2.

Let $H = \{h_1, h_2, \ldots, h_m\}$ denote a set of mutually disjoint time periods. A temporal q-sequence database $D$ is a set of q-sequences, i.e., $D = \{s_{1,1}, s_{1,2}, \ldots, s_{m,n_m}\}$, where $s_{j,k}$ is the $k$-th q-sequence in the $j$-th time period. In general, a temporal q-sequence database can be represented as a set of triples, each of which is denoted as (TID, SID, $s$), where $s$ is a q-sequence, TID is the time period in which $s$ occurs, and SID is the unique identifier of $s$ within that time period. In addition, each item that appears in the temporal q-sequence database is associated with a positive integer named external utility and with its on-shelf time period information. To facilitate the statement, the set of all q-sequences within the time period $h$ in $D$ is denoted as $D_h$.

For convenience, we assume all items/q-items in an itemset/q-itemset are arranged in ascending alphabetical order in the remainder of this paper. As a running example, a temporal q-sequence database, including five q-sequences and six types of items, is listed in Table 1, where $\langle \{(b:1)\ (d:3)\}, \{(c:4)\ (e:1)\} \rangle$ is the first q-sequence within the time period $h_1$. Each item listed in Table 1 is associated with an external utility listed in Table 2. In addition, the on-shelf time information of items is listed in Table 3, where 1 indicates that the item is on the shelf in the corresponding time period. For example, $c$ is on the shelf all the time, while $f$ is only on the shelf in $h_3$. In addition, $s_{1,1}$ and $s_{1,2}$ are included in $D_{h_1}$.

| TID | SID | q-sequence |
|---|---|---|
| $h_1$ | $s_{1,1}$ | $\langle \{(b:1)\ (d:3)\}, \{(c:4)\ (e:1)\} \rangle$ |
| $h_1$ | $s_{1,2}$ | $\langle \{(b:2)\ (e:3)\}, \{(c:4)\}, \{(b:1)\ (c:3)\} \rangle$ |
| $h_2$ | $s_{2,1}$ | $\langle \{(c:3)\ (d:4)\}, \{(a:3)\ (c:1)\}, \{(a:2)\ (c:3)\ (d:1)\} \rangle$ |
| $h_2$ | $s_{2,2}$ | $\langle \{(a:3)\ (d:2)\}, \{(a:1)\ (e:2)\}, \{(c:3)\}, \{(b:2)\ (c:4)\} \rangle$ |
| $h_3$ | $s_{3,1}$ | $\langle \{(a:4)\ (e:2)\ (f:2)\}, \{(c:1)\ (e:3)\} \rangle$ |

Table 1. Running example of a temporal q-sequence database

| Item | a | b | c | d | e | f |
|---|---|---|---|---|---|---|
| External utility | \$2 | \$3 | \$1 | \$1 | \$2 | \$4 |

Table 2. Utility table

| Item | $h_1$ | $h_2$ | $h_3$ |
|---|---|---|---|
| a | 0 | 1 | 1 |
| b | 1 | 0 | 0 |
| c | 1 | 1 | 1 |
| d | 1 | 1 | 0 |
| e | 1 | 1 | 1 |
| f | 0 | 0 | 1 |

Table 3. On-shelf time periods of items

Let us consider a q-itemset $Y = \{(i_1:q_1)\ (i_2:q_2)\ \cdots\ (i_k:q_k)\}$ with a length of $k$. We define the utility of the item $i_j$ within $Y$ as $u(i_j, Y) = q_j \times eu(i_j)$, where $q_j$ is the internal utility (usually representing quantity) of the item $i_j$ within $Y$, and $eu(i_j)$ (usually representing unit profit) is the external utility of $i_j$ for $1 \le j \le k$. For the q-itemset $Y$, its utility can be defined as $u(Y) = \sum_{j=1}^{k} u(i_j, Y)$. Then, the utility of a q-sequence $s = \langle Y_1, Y_2, \ldots, Y_n \rangle$, denoted as $u(s)$, is defined as $u(s) = \sum_{j=1}^{n} u(Y_j)$. Moreover, the utility of a temporal q-sequence database $D$, denoted as $u(D)$, is the sum of the utilities of the q-sequences contained in $D$, i.e., $u(D) = \sum_{s \in D} u(s)$.

Consider the example in Table 1. The utility of the item $f$ within the first itemset in $s_{3,1}$ can be calculated as $u(f, \{(a:4)\ (e:2)\ (f:2)\}) = 2 \times \$4 = \$8$. Then, we have $u(s_{3,1}) = \$20 + \$7 = \$27$ and $u(D) = \$12 + \$22 + \$22 + \$27 + \$27 = \$110$.
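The utility definitions above can be checked against the running example with a short script. This is only an illustrative sketch: the dictionary-based encoding of q-itemsets and the names `EXTERNAL_UTILITY`, `DATABASE`, `item_utility`, `itemset_utility`, `sequence_utility`, and `database_utility` are assumptions made for this example, not part of the original method.

```python
# External utilities from Table 2 (unit profit per item).
EXTERNAL_UTILITY = {"a": 2, "b": 3, "c": 1, "d": 1, "e": 2, "f": 4}

# Table 1, in row order. Each q-sequence is a list of q-itemsets, and each
# q-itemset maps an item to its internal utility (quantity).
DATABASE = [
    [{"b": 1, "d": 3}, {"c": 4, "e": 1}],
    [{"b": 2, "e": 3}, {"c": 4}, {"b": 1, "c": 3}],
    [{"c": 3, "d": 4}, {"a": 3, "c": 1}, {"a": 2, "c": 3, "d": 1}],
    [{"a": 3, "d": 2}, {"a": 1, "e": 2}, {"c": 3}, {"b": 2, "c": 4}],
    [{"a": 4, "e": 2, "f": 2}, {"c": 1, "e": 3}],
]


def item_utility(item, q_itemset):
    """Utility of one item in a q-itemset: internal utility times external utility."""
    return q_itemset[item] * EXTERNAL_UTILITY[item]


def itemset_utility(q_itemset):
    """Utility of a q-itemset: sum of the utilities of its items."""
    return sum(item_utility(item, q_itemset) for item in q_itemset)


def sequence_utility(q_sequence):
    """Utility of a q-sequence: sum of the utilities of its q-itemsets."""
    return sum(itemset_utility(q_itemset) for q_itemset in q_sequence)


def database_utility(database):
    """Utility of the database: sum of the utilities of its q-sequences."""
    return sum(sequence_utility(q_sequence) for q_sequence in database)


if __name__ == "__main__":
    print(item_utility("f", DATABASE[4][0]))        # 8
    print([sequence_utility(s) for s in DATABASE])  # [12, 22, 22, 27, 27]
    print(database_utility(DATABASE))               # 110
```

The printed values agree with the figures quoted in the example (\$8, \$27 for the last q-sequence, and \$110 for the database).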

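Definition 3.3.

Let there be two itemsets $X$ and $X'$. We say that $X$ is a subset of $X'$, denoted as $X \subseteq X'$, if every item of $X$ also belongs to $X'$. Moreover, given two sequences $t = \langle X_1, X_2, \ldots, X_n \rangle$ and $t' = \langle X'_1, X'_2, \ldots, X'_{n'} \rangle$, $t$ is a subsequence of $t'$, denoted as $t \subseteq t'$, if and only if there exist integers $1 \le k_1 < k_2 < \cdots < k_n \le n'$ such that $X_j \subseteq X'_{k_j}$ for $1 \le j \le n$.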

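Definition 3.4.

Given a q-sequence $s = \langle Y_1, Y_2, \ldots, Y_n \rangle$ and a sequence $t = \langle X_1, X_2, \ldots, X_{n'} \rangle$, if $n = n'$ and the items in $Y_j$ are the same as those in $X_j$ for $1 \le j \le n$, then we say $t$ matches $s$, which is denoted as $t \sim s$.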

Definition 3.5.

Given a q-sequence $s$ and a sequence $t$, if there exists a sequence $t'$ such that $t' \sim s$ and $t \subseteq t'$, then we say $t$ is contained in $s$. In the remainder of this article, for convenience, $t \sqsubseteq s$ is used to indicate that $t$ is contained in $s$.

For instance, $\{a\}$ and $\{a\ c\}$ are both subsets of $\{a\ b\ c\}$, and $\langle \{b\}, \{c\} \rangle$ is a subsequence of the sequence $\langle \{b\ d\}, \{c\ e\} \rangle$. Consider the running example, where the sequence $\langle \{b\ d\}, \{c\ e\} \rangle$ matches the q-sequence $s_{1,1}$. Then, we can say that the sequence $\langle \{b\}, \{c\} \rangle$ is contained in $s_{1,1}$. Note the special circumstance in which a sequence matching a q-sequence is also contained in it.
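The subsequence, match, and containment relations can be illustrated with a few lines of code. The sketch below is an assumption-laden illustration rather than the authors' implementation: sequences are encoded as lists of Python sets, q-sequences as lists of dictionaries, and the function names `is_subsequence`, `matches`, and `is_contained` are hypothetical. The subsequence test uses a greedy left-to-right scan, which is sufficient because matching each itemset at its earliest feasible position never rules out a later match.

```python
def is_subsequence(t, t_prime):
    """Return True if sequence t is a subsequence of sequence t_prime.

    Both arguments are lists of sets. Each itemset of t must be a subset of
    some itemset of t_prime, with the order of itemsets preserved.
    """
    position = 0
    for itemset in t:
        # Advance to the earliest remaining itemset that is a superset.
        while position < len(t_prime) and not set(itemset) <= set(t_prime[position]):
            position += 1
        if position == len(t_prime):
            return False
        position += 1
    return True


def matches(t, q_sequence):
    """Return True if t has the same itemsets as q_sequence once quantities are dropped."""
    return len(t) == len(q_sequence) and all(
        set(x) == set(y) for x, y in zip(t, q_sequence)
    )


def is_contained(t, q_sequence):
    """Return True if t is a subsequence of the sequence that matches q_sequence."""
    return is_subsequence(t, [set(y) for y in q_sequence])


if __name__ == "__main__":
    # First q-sequence of Table 1: item -> internal utility.
    first = [{"b": 1, "d": 3}, {"c": 4, "e": 1}]
    print(matches([{"b", "d"}, {"c", "e"}], first))  # True
    print(is_contained([{"b"}, {"c"}], first))       # True
    print(is_contained([{"c"}, {"b"}], first))       # False (order matters)
```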

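Definition 3.6.

Let us consider a sequence $t = \langle X_1, X_2, \ldots, X_n \rangle$ that is contained in a q-sequence $s = \langle Y_1, Y_2, \ldots, Y_{n'} \rangle$. According to Definition 3.3, there is an integer sequence $1 \le k_1 < k_2 < \cdots < k_n \le n'$; then, we say $t$ has an instance in $s$ at position $p = \langle k_1, k_2, \ldots, k_n \rangle$. The last index $k_n$ is called the extension position of the instance.

Note that a sequence could have multiple instances in a q-sequence at different positions, and different instances may have the same extension position. Consider the sequence $t = \langle \{a\}, \{c\} \rangle$ in Table 1. It has four instances in $s_{2,2}$ at positions $\langle 1, 3 \rangle$, $\langle 2, 3 \rangle$, $\langle 1, 4 \rangle$, and $\langle 2, 4 \rangle$, where the extension position of the first two instances is 3.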

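Definition 3.7.

Let there be a sequence $t = \langle X_1, X_2, \ldots, X_n \rangle$ and a q-sequence $s = \langle Y_1, Y_2, \ldots, Y_{n'} \rangle$. Let us suppose $t$ has an instance in $s$ at position $p = \langle k_1, k_2, \ldots, k_n \rangle$. Then, the utility of the instance is denoted as $u(t, p, s)$ and defined as $u(t, p, s) = \sum_{j=1}^{n} \sum_{i \in X_j} u(i, Y_{k_j})$, where $k_j$ is the position in $s$ of the q-itemset that corresponds to $X_j$.

For example, in Table 1, the first instance of $t = \langle \{a\}, \{c\} \rangle$ in $s_{2,2}$ is at position $p = \langle 1, 3 \rangle$; then, the utility of this instance can be calculated as $u(t, p, s_{2,2}) = 3 \times \$2 + 3 \times \$1 = \$9$.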

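Definition 3.8.

Let there be a sequence $t$ and a q-sequence $s$. Assume that the set of extension positions of $t$ in $s$ is $E = \{e_1, e_2, \ldots, e_m\}$. The utility of $t$ in $s$ at extension position $e$ is defined as $u(t, e, s) = \max\{u(t, \langle k_1, k_2, \ldots, k_n \rangle, s) \mid \langle k_1, k_2, \ldots, k_n \rangle \in P \text{ and } k_n = e\}$, where $P$ is the set of positions of the instances of $t$ in $s$. The utility of $t$ in $s$ is the maximum of the utilities at all its extension positions, which is defined as $u(t, s) = \max\{u(t, e, s) \mid e \in E\}$.

For example, in Table 1, the first two instances of the sequence $t = \langle \{a\}, \{c\} \rangle$ in $s_{2,2}$ are at positions $\langle 1, 3 \rangle$ and $\langle 2, 3 \rangle$ with the same extension position 3. Then, we have $u(t, 3, s_{2,2}) = \max\{\$9, \$5\} = \$9$. The utility of $t$ in $s_{2,2}$ is $u(t, s_{2,2}) = \max\{\$9, \$10\} = \$10$.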
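To make the notions of instance, extension position, and maximum utility concrete, the following sketch enumerates every instance of a sequence in a q-sequence by brute force. It is meant only as an illustration under stated assumptions: the encoding (lists of sets and dictionaries), the 1-based positions, and the names `instances`, `instance_utility`, `utility_at_extension`, and `utility` are choices made for this example, and the exhaustive enumeration is far less efficient than the data structures the paper proposes.

```python
from itertools import combinations

# External utilities from Table 2.
EXTERNAL_UTILITY = {"a": 2, "b": 3, "c": 1, "d": 1, "e": 2, "f": 4}

# Fourth q-sequence of Table 1 (second q-sequence of the second time period).
Q_SEQUENCE = [{"a": 3, "d": 2}, {"a": 1, "e": 2}, {"c": 3}, {"b": 2, "c": 4}]


def instances(t, s):
    """Yield every position (1-based, strictly increasing) at which t occurs in s."""
    for positions in combinations(range(1, len(s) + 1), len(t)):
        if all(set(x) <= set(s[k - 1]) for x, k in zip(t, positions)):
            yield positions


def instance_utility(t, positions, s):
    """Utility of the instance of t in s located at the given positions."""
    return sum(
        s[k - 1][item] * EXTERNAL_UTILITY[item]
        for x, k in zip(t, positions)
        for item in x
    )


def utility_at_extension(t, e, s):
    """Maximum utility among the instances of t in s whose last position is e."""
    values = [instance_utility(t, p, s) for p in instances(t, s) if p[-1] == e]
    return max(values, default=0)


def utility(t, s):
    """Utility of t in s: the maximum over all instances (0 if t does not occur)."""
    return max((instance_utility(t, p, s) for p in instances(t, s)), default=0)


if __name__ == "__main__":
    t = [{"a"}, {"c"}]
    print(list(instances(t, Q_SEQUENCE)))          # [(1, 3), (1, 4), (2, 3), (2, 4)]
    print(utility_at_extension(t, 3, Q_SEQUENCE))  # 9
    print(utility_at_extension(t, 4, Q_SEQUENCE))  # 10
    print(utility(t, Q_SEQUENCE))                  # 10
```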

Definition 3.9.

The sequence utility of a q-sequence $s$, denoted as $su(s)$, is the sum of the utilities of all items contained in $s$. The periodical total sequence utility of time period $h$, denoted as $ptsu(h)$, is defined as the sum of the sequence utilities of all q-sequences within the time period $h$, i.e., $ptsu(h) = \sum_{s \in D_h} su(s)$ (lan2014discovery).

Definition 3.10.

Given a sequence $t$ and time period $h$, the periodical utility of $t$ within $h$ is the sum of the utilities of $t$ in each q-sequence with TID = $h$ that contains $t$, which is defined as $pu(t, h) = \sum_{s \in D_h \wedge t \sqsubseteq s} u(t, s)$.

Let us consider the time period $h_1$ in Table 1. The sequence utility of $s_{1,1}$ is $su(s_{1,1}) = \$3 + \$3 + \$4 + \$2 = \$12$; further, we have $ptsu(h_1) = \$12 + \$22 = \$34$. Moreover, the periodical utility of $t = \langle \{a\}, \{c\} \rangle$ within time period $h_2$ is $pu(t, h_2) = \$9 + \$10 = \$19$.

Definition 3.11.

The set of on-shelf time periods of a sequence $t$, denoted as $osp(t)$, is the set of time periods in which every item contained in $t$ is on the shelf, that is, the intersection of the on-shelf time periods of its items. Given a temporal q-sequence database, the on-shelf utility of a sequence $t$, denoted as $ou(t)$, is defined as $ou(t) = \sum_{h \in osp(t)} pu(t, h)$. Moreover, the on-shelf utility ratio of $t$ is defined as $our(t) = ou(t) / \sum_{h \in osp(t)} ptsu(h)$.

For example, given the sequence $t = \langle \{a\}, \{c\} \rangle$, we have $osp(t) = \{h_2, h_3\}$. Let us consider the running example in Table 1. We have $ou(t) = \$19 + \$9 = \$28$, as $t$ is only on the shelf within time periods $h_2$ and $h_3$. Then, we can easily obtain $our(t) = \$28 / (\$49 + \$27) = 36.8\%$.
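The chain from periodical utility to on-shelf utility ratio can be reproduced on the running example. The sketch below rests on two assumptions that are inferred from the worked figures rather than stated explicitly: the five q-sequences of Table 1 fall into three time periods as two, two, and one (which is what the totals \$34, \$49, and \$27 imply), and a sequence counts as on the shelf only in periods where all of its items are on the shelf (which is what the example for {a}, {c} implies). All identifiers are hypothetical, and the brute-force `utility` helper stands in for the paper's data structures.

```python
from itertools import combinations

EXTERNAL_UTILITY = {"a": 2, "b": 3, "c": 1, "d": 1, "e": 2, "f": 4}

# Time period -> q-sequences (assumed grouping of the rows of Table 1).
DATABASE = {
    1: [
        [{"b": 1, "d": 3}, {"c": 4, "e": 1}],
        [{"b": 2, "e": 3}, {"c": 4}, {"b": 1, "c": 3}],
    ],
    2: [
        [{"c": 3, "d": 4}, {"a": 3, "c": 1}, {"a": 2, "c": 3, "d": 1}],
        [{"a": 3, "d": 2}, {"a": 1, "e": 2}, {"c": 3}, {"b": 2, "c": 4}],
    ],
    3: [
        [{"a": 4, "e": 2, "f": 2}, {"c": 1, "e": 3}],
    ],
}

# Table 3: item -> time periods in which it is on the shelf.
ON_SHELF = {"a": {2, 3}, "b": {1}, "c": {1, 2, 3}, "d": {1, 2}, "e": {1, 2, 3}, "f": {3}}


def utility(t, s):
    """Maximum utility over all instances of t in s (0 if t is not contained in s)."""
    best = 0
    for positions in combinations(range(len(s)), len(t)):
        if all(set(x) <= set(s[k]) for x, k in zip(t, positions)):
            value = sum(
                s[k][i] * EXTERNAL_UTILITY[i] for x, k in zip(t, positions) for i in x
            )
            best = max(best, value)
    return best


def ptsu(period):
    """Periodical total sequence utility: total utility of all q-sequences in a period."""
    return sum(q * EXTERNAL_UTILITY[i] for s in DATABASE[period] for y in s for i, q in y.items())


def pu(t, period):
    """Periodical utility of t: sum of its utilities over the q-sequences of a period."""
    return sum(utility(t, s) for s in DATABASE[period])


def on_shelf_periods(t):
    """Periods in which every item of t is on the shelf (assumed reading, see lead-in)."""
    return set.intersection(*(ON_SHELF[i] for x in t for i in x))


def ou(t):
    """On-shelf utility: periodical utilities summed over the on-shelf periods of t."""
    return sum(pu(t, h) for h in on_shelf_periods(t))


def our(t):
    """On-shelf utility ratio: on-shelf utility relative to the total utility of those periods."""
    return ou(t) / sum(ptsu(h) for h in on_shelf_periods(t))


if __name__ == "__main__":
    t = [{"a"}, {"c"}]
    print(sorted(on_shelf_periods(t)))   # [2, 3]
    print(pu(t, 2), pu(t, 3), ou(t))     # 19 9 28
    print(round(100 * our(t), 1))        # 36.8
```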

4. Proposed Algorithms

By referring to the only existing algorithm TP-HOUS (lan2014discovery), which splits the mining process into two phases, we designed an efficient two-phase method, namely OSUMS. Similar to TP-HOUS, OSUMS scans the original database once to discover all promising osHUSPs in the first phase. Then, in the second phase, OSUMS calculates the actual on-shelf utility values of the promising osHUSPs and outputs the complete set of osHUSPs. For further efficiency improvement, we introduce several novel data structures, upper bounds, and strategies. Although OSUMS is more efficient than TP-HOUS, it also suffers from the two intrinsic limitations of two-phase approaches. The first limitation is that retaining the promising osHUSPs generated in the first phase occupies a large amount of memory; the second limitation is that identifying whether each promising osHUSP has a high on-shelf utility (i.e., calculating its actual on-shelf utility) incurs significantly high computational costs. To overcome these issues, we improved OSUMS and proposed a one-phase method, OSUMS+. When compared to OSUMS, the OSUMS+ method utilizes similar storage data structures and upper bounds, but adopts two global strategies to check each promising osHUSP immediately, as soon as it is extracted.

Note that if a sequence is not a promising osHUSP within any time period, then it is impossible for it to be an osHUSP in the database. We present the details of the two approaches below.

4.1. OSUMS Approach

To discover osHUSPs efficiently, TP-HOUS adopts an upper bound, SUUB, as well as a pruning strategy to reduce the search space. However, SUUB is so loose that TP-HOUS generates abundant candidates in the mining process. To significantly reduce the number of candidates, OSUMS adopts two upper bounds, TPEU and TRSU, based on PEU and RSU designed in (wang2016efficiently), respectively, as well as two local pruning strategies. Note that the processes of determining promising osHUSPs in OSUMS within each time period are independent; thus, we call the pruning strategies in OSUMS local ones. We also designed two storage data structures that are convenient for calculating the two upper-bound values and the actual on-shelf utility values. In addition, to expedite the checking process that determines osHUSPs in the second phase, we developed a novel structure named candidate tree (CTree) and an efficient strategy. In theory, OSUMS is able to outperform the relatively simple method TP-HOUS.

To facilitate the statement of the proposed methods, first, we present certain basic and essential definitions as shown below.

Definition 4.1 ().

Let t = ⟨I_1, I_2, …, I_n⟩ be a sequence and i be an item. The I-Extension operation of t appends i to the last itemset I_n. The operation results in a new sequence t', which is called an I-Extension sequence of t. The other operation, the S-Extension of t, places a new itemset that contains only i behind I_n, which also generates a new sequence t'. Generated by the S-Extension operation, t' is called an S-Extension sequence of t.

For example, given t = ⟨{a}⟩, the sequence ⟨{a c}⟩ is an I-Extension sequence of t, while ⟨{a}, {c}⟩ is an S-Extension sequence of t. Obviously, any nonempty sequence can be generated by a series of I-/S-Extension operations from the empty sequence.
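The two extension operations can be illustrated with a short Python sketch. Representing a sequence as a tuple of itemsets, each a tuple of items kept in alphabetical order, is an assumption made here for illustration; the function names `i_extension` and `s_extension` are likewise hypothetical.

```python
from typing import Tuple

Itemset = Tuple[str, ...]
Sequence = Tuple[Itemset, ...]


def i_extension(seq: Sequence, item: str) -> Sequence:
    """Append `item` to the last itemset of `seq` (I-Extension)."""
    if not seq:
        raise ValueError("I-Extension needs a non-empty sequence")
    last = tuple(sorted(seq[-1] + (item,)))
    return seq[:-1] + (last,)


def s_extension(seq: Sequence, item: str) -> Sequence:
    """Append a new itemset containing only `item` to `seq` (S-Extension)."""
    return seq + ((item,),)


a = s_extension((), "a")      # <{a}>, built from the empty sequence
print(i_extension(a, "c"))    # (('a', 'c'),)
print(s_extension(a, "c"))    # (('a',), ('c',))
```

Any longer sequence can be obtained by chaining these two calls, which is how the search space is enumerated.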

Definition 4.2 ().

Given a sequence t and a q-sequence s, let us assume that t has an instance in s at extension position p and that the extension item is i. The rest utility of t in s at extension position p, denoted as ru(t, p, s), is defined as

\[ ru(t, p, s) = \sum_{i' \in s \,\wedge\, i \prec i'} u(i'), \]

where i ≺ i' means that the q-item i' is located behind i in s.

Let us consider the example listed in Table 1. The sequence ⟨{a}, {a}⟩ has an instance in a q-sequence of Table 1 at extension position 2; its rest utility there is $4 + $3 + $6 + $4 = $17.

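Let us consider a sequence t and a q-sequence s, and let us assume that t has an instance in s at extension position p. According to (wang2016efficiently), the PEU upper bound of t in s at p is defined as:

\[ PEU(t, p, s) = \begin{cases} u(t, p, s) + ru(t, p, s), & \text{if } ru(t, p, s) > 0 \\ 0, & \text{otherwise.} \end{cases} \]

Suppose t has several instances in s at extension positions P = {p_1, p_2, …, p_m}. The PEU value of t in s is the maximum PEU over all extension positions, that is, PEU(t, s) = max{PEU(t, p_1, s), PEU(t, p_2, s), …, PEU(t, p_m, s)}. Based on PEU, we introduce the TPEU upper bound of t within a time period h, which is defined as TPEU(t, h) = Σ_{s ∈ D_h} PEU(t, s), where D_h denotes the q-sequences of the database that belong to h.

For instance, in Table 1, let us consider the sequence t = ⟨{a}, {c}⟩. Its PEU values in the two q-sequences of the time period that contain it are $9 + $1 = $10 and $9 + $10 = $19. Thus, the TPEU of t in this time period is $10 + $19 = $29.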
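The arithmetic of this example can be checked with a minimal sketch. It assumes the usual PEU rule from (wang2016efficiently), namely utility plus rest utility when the rest utility is positive and zero otherwise; the function names and the list-of-pairs layout are illustrative choices, not the authors' implementation.

```python
from typing import List, Tuple

# One (utility, rest utility) pair per extension position of the sequence.
Position = Tuple[float, float]


def peu_at(utility: float, rest_utility: float) -> float:
    """PEU at one extension position (0 when nothing is left to extend)."""
    return utility + rest_utility if rest_utility > 0 else 0


def peu(positions: List[Position]) -> float:
    """PEU in one q-sequence: the maximum over all extension positions."""
    return max((peu_at(u, ru) for u, ru in positions), default=0)


def tpeu(period: List[List[Position]]) -> float:
    """TPEU in one time period: the sum of PEU over its q-sequences."""
    return sum(peu(positions) for positions in period)


period = [
    [(9, 1)],            # first q-sequence: one extension position
    [(9, 10), (10, 0)],  # second q-sequence: two extension positions
]
print(tpeu(period))  # 29
```

The pairs are the dollar values quoted for ⟨{a}, {c}⟩ in the running example; the second position of the second q-sequence contributes nothing because its rest utility is zero.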

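Assume that the sequence t' is an extension sequence of the sequence t. Then, the RSU upper bound of t' in a q-sequence s (wang2016efficiently) is defined as follows.

\[ RSU(t', s) = \begin{cases} PEU(t, s), & \text{if } t \sqsubseteq s \wedge t' \sqsubseteq s \\ 0, & \text{otherwise.} \end{cases} \]

Based on RSU, we define a novel upper bound TRSU that can be adopted in the problem of OSUM of sequence data, that is, TRSU(t', h) = Σ_{s ∈ D_h} RSU(t', s).

Let there be two sequences t = ⟨{a}, {c}⟩ and t' = ⟨{a}, {c}, {b}⟩. The TRSU of t' is $19, because only one q-sequence of the time period contains both t and t', and the PEU of t in that q-sequence is $19, which we have calculated.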
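A small sketch can make the TRSU computation concrete. It assumes that RSU of an extension sequence in a q-sequence equals the PEU of its prefix when the q-sequence contains the extension sequence, and zero otherwise. The q-sequences `s1` and `s2` below are hypothetical stand-ins (items only, no quantities) rather than rows of Table 1; only the PEU values $10 and $19 come from the running example.

```python
from typing import Dict, List, Set, Tuple

Sequence = Tuple[Tuple[str, ...], ...]


def contains(q_sequence: List[Set[str]], seq: Sequence) -> bool:
    """True if `seq` is a subsequence of `q_sequence` (greedy earliest match)."""
    pos = 0
    for itemset in seq:
        while pos < len(q_sequence) and not set(itemset) <= q_sequence[pos]:
            pos += 1
        if pos == len(q_sequence):
            return False
        pos += 1
    return True


def trsu(extension: Sequence, period: Dict[str, List[Set[str]]],
         prefix_peu: Dict[str, float]) -> float:
    """Sum the prefix's PEU over the q-sequences that contain `extension`."""
    return sum(prefix_peu.get(sid, 0)
               for sid, q_sequence in period.items()
               if contains(q_sequence, extension))


period = {
    "s1": [{"a", "b"}, {"c"}],      # contains <{a},{c}> but not <{a},{c},{b}>
    "s2": [{"a"}, {"c"}, {"b", "c"}],
}
prefix_peu = {"s1": 10, "s2": 19}   # PEU of <{a},{c}> in each q-sequence
print(trsu((("a",), ("c",), ("b",)), period, prefix_peu))  # 19
```

Because an extension sequence always contains its prefix, testing containment of the extension alone is sufficient here.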

To clearly explain the proposed algorithms, we design a structure called periodical lexicographic sequence forest (PLS-Forest) to represent the entire search space in the mining process of OSUMS based on the concept of lexicographic sequence tree (LS-Tree) (yin2013efficiently). LS-Tree is a tree structure where each node represents a sequence, with the root representing an empty sequence. For a node in the LS-Tree, the sequence represented by it is the extension sequence of that of its parent. Each LS-Tree in PLS-Forest represents the search space within one of the time periods. We show an example of PLS-Forest in Figure 1. The LS-Trees in a PLS-Forest may be different as items are on the shelf at different time periods. Without loss of generality, we present all children of a node in alphabetical ascending order. Note that PLS-Forest and LS-Tree are both abstractly conceptual structures, and the real search space may be different depending on different situations (gan2020proum).

Figure 1. Partial periodical lexicographic sequence forest for the running example

4.1.1. Storage Data Structure

Based on the q-matrix data structure of USpan (yin2012uspan), we introduce a novel data structure, namely the periodical q-matrix, where each matrix represents a q-sequence in a temporal q-sequence database. Besides a matrix that stores utility and rest utility information, a periodical q-matrix also records the time period and the identifier of the q-sequence. In practice, the periodical q-matrices with the same time period are placed in one list, which can be indexed by the time period in memory. For better visualization, we present the periodical q-matrices of the running example in Figure 2, where the details of only one periodical q-matrix are shown for brevity. Here, we briefly introduce the q-matrix, whose elements are indexed by the on-shelf items within the time period and by the q-itemset numbers. Each element has two values: the first is the utility of the q-item, and the second is the sum of the utilities of the q-items behind it (also called the rest utility). Items that do not appear in a q-itemset are given a utility value of zero.

Let us observe the record for item a in the periodical q-matrix detailed in Figure 2. Both values in the first entry are $0, as a does not appear within the first q-itemset of the q-sequence. According to the definition of utility, u = 3 × $2 = $6 and ru = $1 + $4 + $3 + $1 = $9 can be calculated; we thus obtain the second entry ($6, $9). The remaining entries are calculated in the same manner.

Figure 2. Periodical q-matrices for the running example
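As a rough illustration of how such a matrix could be filled, the sketch below builds one from a q-sequence. The input layout (a list of item-to-quantity dictionaries), the class and function names, and the sample prices and quantities are all hypothetical; they are not the data of Table 1. Absent items receive the pair (0, 0), following the description above.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class PeriodicalQMatrix:
    period: int                                # time period of the q-sequence
    sid: int                                   # identifier of the q-sequence
    cells: Dict[str, List[Tuple[int, int]]]    # item -> (utility, rest utility) per q-itemset


def build_q_matrix(period, sid, q_sequence, external_utility):
    # Utility of every q-item in sequence order (alphabetical inside a q-itemset).
    ordered = [(idx, item, qty * external_utility[item])
               for idx, itemset in enumerate(q_sequence)
               for item, qty in sorted(itemset.items())]
    remaining = sum(u for _, _, u in ordered)
    cells = {item: [(0, 0)] * len(q_sequence) for item in external_utility}
    for idx, item, u in ordered:
        remaining -= u                         # utility of the q-items located behind
        cells[item][idx] = (u, remaining)
    return PeriodicalQMatrix(period, sid, cells)


matrix = build_q_matrix(
    period=1, sid=1,
    q_sequence=[{"b": 2}, {"a": 3, "c": 1}, {"b": 1}],
    external_utility={"a": 2, "b": 1, "c": 3},
)
print(matrix.cells["a"])  # [(0, 0), (6, 4), (0, 0)]
```

Storing the rest utility next to each utility is what lets the upper bounds be read off without rescanning the q-sequence.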

A simple and intuitive method to calculate the utility of a candidate is to scan the entire database. Obviously, this brute-force method incurs significantly high computational costs, as it also checks q-sequences that cannot contain the candidate. A q-sequence can contain a sequence only if it contains the prefix of that sequence. Thus, OSUMS recursively constructs projected databases, represented by the compact and efficient utility chain data structure (wang2016efficiently), to reduce the scan scope. The projected databases of candidates are relatively small and store the information necessary for calculating utilities and upper bounds. Here, a candidate refers to a sequence that must be checked to determine whether it has high on-shelf utility. The utility chain of a sequence t consists of multiple utility lists and a head table. The head table includes a series of tuples (SID, PEU), each of which corresponds to a q-sequence containing t and indexes a utility list. The SID value is the identifier of the q-sequence within the time period, and PEU is the upper bound value of t in this q-sequence. Suppose t has m extension positions p_1, p_2, …, p_m in the q-sequence s. The corresponding utility list then has m utility elements, each consisting of the following three fields: 1) field tid is the k-th extension position p_k; 2) field acu is the utility of t at the k-th extension position (i.e., u(t, p_k, s)); and 3) field ru is the rest utility of t at the k-th extension position in s (i.e., ru(t, p_k, s)).

Figure 3. Projected database of ⟨{a}, {c}⟩ in OSUMS

Figure 3 illustrates the projected database, represented by a utility chain, of the sequence t = ⟨{a}, {c}⟩ within one time period. Let us consider the second utility list as an example. The sequence t has two extension positions in the corresponding q-sequence, namely 2 and 3; therefore, the utility list consists of two elements. The utility and rest utility are $9 and $10 at position 2, and $10 and $0 at position 3. Thus, the PEU value is $19 according to the definition of PEU. As can be observed, the utility chain holds the essential information about t in compact form; consequently, scanning the projected database is sufficient to check whether a candidate has high on-shelf utility. Based on the projected database of a sequence, the projected databases of its extension sequences, which are one item longer, can be constructed. More details about the utility chain can be found in (wang2016efficiently).
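The layout of a utility chain can be sketched with a few dataclasses. This is only an illustration under stated assumptions: the class names are hypothetical, the SID values 1 and 2 and the extension position of the first list are placeholders, and the utility of a sequence in a q-sequence is taken to be the largest `acu` among its instances. The (acu, ru) values are those quoted for ⟨{a}, {c}⟩ in the running example.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class UtilityElement:
    tid: int   # extension position
    acu: int   # utility of the sequence at this extension position
    ru: int    # rest utility at this extension position


@dataclass
class UtilityList:
    sid: int                                          # q-sequence identifier
    elements: List[UtilityElement] = field(default_factory=list)

    @property
    def peu(self) -> int:
        # Value kept in the head table: max over positions, 0 where ru == 0.
        return max((e.acu + e.ru if e.ru > 0 else 0 for e in self.elements), default=0)

    @property
    def utility(self) -> int:
        # Utility of the sequence in this q-sequence: its best instance.
        return max((e.acu for e in self.elements), default=0)


@dataclass
class UtilityChain:
    lists: List[UtilityList]

    def head_table(self) -> List[Tuple[int, int]]:
        return [(ul.sid, ul.peu) for ul in self.lists]

    def periodical_utility(self) -> int:
        return sum(ul.utility for ul in self.lists)


chain = UtilityChain([
    UtilityList(1, [UtilityElement(2, 9, 1)]),
    UtilityList(2, [UtilityElement(2, 9, 10), UtilityElement(3, 10, 0)]),
])
print(chain.head_table())          # [(1, 10), (2, 19)]
print(chain.periodical_utility())  # 19
```

Both the upper bound and the exact periodical utility are obtained from the chain alone, without touching q-sequences that do not contain the prefix.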

To avoid the redundant utility calculations in the second mining process, we propose a novel data structure called CTree based on a trie (i.e., prefix tree) for storing certain calculation results in the first mining process. Just as its name implies, the tree structure buffers the candidates whose periodical utility values have been calculated within at least one time period. Moreover, the structure also retains the related information by indicating whether it must be checked at certain time periods. All the descendants of a node have a common prefix of the string associated with that node, and the root is associated with the empty sequence. There are certain virtual nodes, denoted as -1, which indicate the end of the itemset. Excluding these virtual nodes, any other node represents a sequence. To facilitate efficient utility calculation and support multiple database updates, each node in CTree contains the following auxiliary information:

  • Item: The item of the node.

  • Seq: The candidate represented by the node.

  • On-shelfTime: A bit set that represents the on-shelf information of Seq. Its size is equal to the number of time periods. The i-th bit is set to 1 when Seq is on the shelf in the i-th time period; otherwise, it is set to 0.

  • CalculatedTime: A bit set that represents the time periods in which the utility of Seq has been calculated. Its size is also equal to the number of time periods. The i-th bit is set to 1 when the utility of Seq has been calculated in the i-th time period; otherwise, it is set to 0.

  • ChildrenList: A list containing the children of the node.

  • isPromising: A Boolean value indicating whether Seq is a promising osHUSP within at least one time period.

  • Utility: A list that stores the periodical utility values of Seq. The size of the list is the number of time periods. If Seq is not on the shelf in the i-th time period, or it is on the shelf but its periodical utility in the i-th time period has not been calculated, the i-th value is set to 0; otherwise, it holds the periodical utility of Seq in that time period.

  • CUtility: The sum of the values in the list Utility.

Figure 4. Partial candidate tree for the running example

Figure 4 illustrates the CTree of the running example listed in Table 1 when the minimum on-shelf utility threshold is 0.3. For brevity, we only provide the values of Item and Seq for each node, together with the full structure of the node that represents the sequence t = ⟨{a}, {c}⟩. We now discuss how this node is constructed and how its information is updated under the same threshold. In the first phase, t is generated as a candidate and its periodical utility is calculated as $19 within one time period. Then, OSUMS builds the node representing ⟨{a}, {c}⟩ as a child of the node representing ⟨{a}⟩, because the candidate occurs for the first time in the mining process. After construction, OSUMS updates the information of the new node and the children of its parent. Finally, OSUMS sets the Boolean field isPromising to TRUE, as t is a promising osHUSP within this time period. Later, when OSUMS generates t again within another time period, updating the auxiliary information of the node and its parent is sufficient, and no construction operation is needed.

In the second phase, OSUMS traverses the CTree with a depth-first strategy. If the sequence represented by the current node is not a promising osHUSP within any time period, it is skipped. Otherwise, an AND operation is used to obtain the time periods in which the candidate is on the shelf but its periodical utility has not been calculated. Then, OSUMS checks whether the candidate is an osHUSP by a series of calculations. In this way, OSUMS fully exploits the calculation results of the first mining process and avoids redundant utility calculations. Furthermore, let us assume that the average length of the candidates is L and the number of candidates is N; then, maintaining the CTree takes O(L × N) time, which is linear in the total size of the candidates, whereas updating the information of promising osHUSPs takes polynomial time in TP-HOUS (lan2014discovery).
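The following sketch shows one way the CTree bookkeeping could look in Python. Several points are assumptions made for illustration: bit sets are stored as integers, a "-1" node separates consecutive itemsets on a path, period indices start at 0, and the time periods still to be evaluated are obtained by ANDing On-shelfTime with the complement of CalculatedTime, which is one reading of the AND operation described above. All names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Tuple


@dataclass
class CTreeNode:
    item: str                                   # Item ("-1" marks the end of an itemset)
    on_shelf: int = 0                           # On-shelfTime bit set
    calculated: int = 0                         # CalculatedTime bit set
    is_promising: bool = False
    utility: Dict[int, int] = field(default_factory=dict)   # period -> periodical utility
    children: Dict[str, "CTreeNode"] = field(default_factory=dict)

    @property
    def c_utility(self) -> int:
        return sum(self.utility.values())


def to_path(seq) -> List[str]:
    """Flatten <{a}, {c}> into ['a', '-1', 'c']."""
    path: List[str] = []
    for k, itemset in enumerate(seq):
        if k:
            path.append("-1")
        path.extend(itemset)
    return path


def insert(root, seq, period, utility, promising, on_shelf) -> CTreeNode:
    node = root
    for key in to_path(seq):                    # create missing nodes along the path
        node = node.children.setdefault(key, CTreeNode(key))
    node.on_shelf = on_shelf
    node.calculated |= 1 << period
    node.utility[period] = utility
    node.is_promising = node.is_promising or promising
    return node


def pending_periods(node, n_periods) -> List[int]:
    """Periods where the candidate is on the shelf but has no utility yet."""
    mask = node.on_shelf & ~node.calculated
    return [h for h in range(n_periods) if (mask >> h) & 1]


def candidates(node, prefix=()) -> Iterator[Tuple[Tuple[str, ...], CTreeNode]]:
    """Depth-first walk yielding promising, non-virtual nodes."""
    for key, child in sorted(node.children.items()):
        path = prefix + (key,)
        if key != "-1" and child.is_promising:
            yield path, child
        yield from candidates(child, path)


root = CTreeNode("root")
node = insert(root, (("a",), ("c",)), period=0, utility=19, promising=True, on_shelf=0b11)
print(pending_periods(node, 2))          # [1]
print([p for p, _ in candidates(root)])  # [('a', '-1', 'c')]
```

Each insertion touches one node per item of the candidate, which is the source of the linear maintenance cost mentioned above.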

4.1.2. Strategies

As can be observed in Figure 1, the PLS-Forest is an enormous structure, especially when a large number of items appear in the database. To avoid a combinatorial explosion of the search space, we propose two pruning strategies based on the downward closure property of the two upper bounds TPEU and TRSU. The two upper bounds are tighter than the only existing one, SUUB (lan2014discovery), in the problem of OSUM of sequence data, which implies that OSUMS is able to prune more of the search space and generate fewer candidates. Note that the CTree may differ under various settings, because certain branches in the PLS-Forest may be removed by the pruning strategies, in which case the candidates they represent are not generated.

Theorem 4.3 ().

Given a temporal -sequence database and two sequences and , assuming that the node representing is a descendant of the node representing (i.e., is a prefix sequence of ), it can be obtained within a time period in that

Proof.

Let = , where is a common concatenation operation of sequences. is a nonempty sequence as is a prefix sequence of . Given a -sequence , if is contained in , then and . The utility of in is the sum of two parts, which can be denoted as = + . shows the utility of an instance of at extension position in , and is the utility of an instance of in where the first item of such instance is after . Obviously, . Then, the following can be obtained:

Since the -sequence containing must contain as ; we have the relationship = = within the time period of . Finally, = . ∎

By utilizing the downward closure property of the TPEU upper bound in a time period, we designed a depth pruning strategy called local depth pruning (LDP). Given a sequence t represented by a node in the LS-Tree of a time period h, a minimum on-shelf utility threshold, and a temporal q-sequence database, the descendants of t can be safely pruned without affecting the mining results when the TPEU of t in h falls below the minimum utility required in h.

Theorem 4.4 ().

Given a temporal -sequence database and two sequences and , suppose the node representing is the descendant of the node representing or = , we have

Proof.

Let us assume that is the extension sequence of the sequence ; then, is also a prefix of . Given a -sequence , we have according to Proof 4.1.2. Based on the definition of RSU, we have when contains both and . Then, . Conversely, if does not contain , then = = 0, as does not contain . A conclusion can be drawn that . The -sequence containing contains as ; thus, it can be obtained that = = within the time period of . Finally, = . ∎

As shown in the proof of Theorem 4.4, the TRSU upper bound is anti-monotonic within any time period of a temporal q-sequence database. Based on TRSU, we developed a width pruning strategy called local width pruning (LWP). Given a sequence t represented by a node in the LS-Tree of a time period h, a minimum on-shelf utility threshold, and a temporal q-sequence database, OSUMS can prune the node itself and all of its descendants when the TRSU of t in h falls below the minimum utility required in h.

Unlike the upper bound TPEU, the TRSU of a sequence overestimates the utility values of both the sequence itself and its descendants. We call the two pruning strategies local because the mining processes of OSUMS are independent among time periods; consequently, they only prune in one LS-Tree of the PLS-Forest at a time and do not affect the search space of other time periods.
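The two local checks can be written as one-line predicates. The sketch assumes that the minimum utility required in a time period is the threshold multiplied by that period's ptsu; this form of the condition, the function names, and the ptsu values 60 and 80 are assumptions for illustration. The bounds $29 and $19 and the threshold 0.3 are taken from the running example.

```python
def ldp_keeps_descendants(tpeu: float, ptsu: float, threshold: float) -> bool:
    """LDP: descendants are explored only if TPEU reaches the period's minimum utility."""
    return tpeu >= threshold * ptsu


def lwp_keeps_node(trsu: float, ptsu: float, threshold: float) -> bool:
    """LWP: the node and its descendants survive only if TRSU reaches that minimum."""
    return trsu >= threshold * ptsu


# <{a},{c}> has TPEU 29; its extension <{a},{c},{b}> has TRSU 19.
print(ldp_keeps_descendants(29, ptsu=60, threshold=0.3))  # True
print(lwp_keeps_node(19, ptsu=80, threshold=0.3))         # False
```

Because each call uses the ptsu of a single time period, a decision taken in one LS-Tree has no effect on the others.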

Moreover, we also designed a novel strategy termed avoid redundancy calculations (ARC) to filter candidates when traversing the CTree to avoid calculating actual on-shelf utility. Clearly, a sequence in CTree may have its periodical utility values within certain time periods and have a value of zero within some others according to the field Utility. The reason why zero occurs within the time periods is that the sequence is not on the shelf or the sequence has no possibility to be a promising osHUSP, which is pruned by the aforementioned two local pruning strategies. We denote the set of the time periods belonging to the second circumstance as .

Theorem 4.5 ().

Given a temporal -sequence database , minimum on-shelf utility , and sequence , suppose the structure CTree has been constructed and the node representing exists in the CTree, has no possibility of being an osHUSP and can be skipped when .

Proof.

Clearly, after the first mining process, the on-shelf utility of can be divided into two parts. The first part is the sum over periodical utility values of within the time periods, denoted as , when the periodical utility values of have been calculated. The periodical utility values in this part are stored in the array Utility, and the integer CUtility is the summation over all values. The second part is the sum of the periodical utility values of within the time periods, denoted as , when has been pruned and its periodical utility value has not been calculated. The reason why has been pruned is that is not a promising osHUSP, that is, . Then, we obtain the following.

Then, the following equation can be obtained.

Given a sequence in the CTree, there is no requirement to calculate the actual on-shelf utility of , and the node can be skipped when , as must not be an osHUSP. Note that the set can be easily obtained by an AND operation over the two bit sets On-shelfTime and CalculatedTime. The ARC strategy can significantly reduce the high computational cost as it is time consuming to calculate the periodical utility of in a time period when starting from the beginning.
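A hedged sketch of the ARC test follows. It rests on two assumptions that should be checked against the formal definitions: a sequence pruned in a time period has a periodical utility strictly below threshold × ptsu of that period, and a sequence is an osHUSP only if its on-shelf utility reaches threshold × the summed ptsu of its on-shelf time periods. The function name, the 0-based period indices, and the ptsu values are hypothetical.

```python
from typing import Dict


def arc_can_skip(c_utility: float, on_shelf: int, calculated: int,
                 ptsu: Dict[int, float], threshold: float) -> bool:
    """True when the candidate cannot be an osHUSP, so its utility need not be computed."""
    n = len(ptsu)
    pending = on_shelf & ~calculated            # on the shelf, utility not calculated
    # Every pending period contributes less than threshold * ptsu[h].
    bound = c_utility + sum(threshold * ptsu[h] for h in range(n) if (pending >> h) & 1)
    required = threshold * sum(ptsu[h] for h in range(n) if (on_shelf >> h) & 1)
    return bound < required


ptsu = {0: 60, 1: 80}
print(arc_can_skip(19, on_shelf=0b11, calculated=0b01, ptsu=ptsu, threshold=0.3))  # False
print(arc_can_skip(15, on_shelf=0b11, calculated=0b01, ptsu=ptsu, threshold=0.3))  # True
```

The strict comparison keeps the test conservative: a candidate is skipped only when even the optimistic bound misses the requirement.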

4.1.3. Overview of OSUMS

Based on the storage data structures and the three designed strategies, the proposed OSUMS algorithm can be stated as follows. To facilitate the statement, we present the main pseudocode of OSUMS in Algorithms 1 and 2.

Input: D: a temporal q-sequence database;
UT: a utility table with external utility values of items;
OT: a table containing on-shelf information of items;
δ: a minimum on-shelf utility threshold.
Output: osHUSP: the complete set of osHUSPs;
1 build a new CTree;
2 scan the original database to: (i) calculate the periodical total sequence utility (ptsu) value of each time period; (ii) build the periodical q-matrix of each q-sequence in D;
3 for each time period h in D do
4       scan the q-sequences in h to: (i) calculate the periodical utility value and TPEU value of each 1-sequence; (ii) construct the projected database of each 1-sequence in h;
5       for each 1-sequence t in h do
6             if t does not exist in the CTree then
7                   construct the node representing t;
8             end if
9             update information of the node representing t and its parent;
              // The LDP strategy
10            if the descendants of t are not pruned by LDP in h then
11                  call Local_PGrowth(t, the projected database of t in h, h);
12            end if
13      end for
14 end for
15 for each node representing a sequence t in the CTree do
16      obtain the time periods in which t is on the shelf but its periodical utility has not been calculated;
        // The ARC strategy
17      if t is not skipped by ARC then
18            calculate the on-shelf utility of t and its on-shelf utility ratio;
19            if the on-shelf utility ratio of t is no less than δ then
20                  add t to osHUSP;
21            end if
22      end if
23 end for
24 return osHUSP
ALGORITHM 1 OSUMS Algorithm

The OSUMS method takes a temporal q-sequence database, a utility table UT, a table with on-shelf information OT, and a minimum on-shelf utility threshold as the inputs. The key idea of OSUMS is to enumerate the promising osHUSPs independently in each time period in the first mining process (Lines 3–14), and then identify whether they have high on-shelf utility in the second mining process (Lines 15–23). Initially, OSUMS constructs a new CTree for storing calculation results (Line 1). Then, OSUMS scans the original database to obtain the ptsu values and periodical q-matrices (Line 2). For each time period (Lines 3–14), OSUMS scans its q-sequences to calculate essential values and construct the projected databases of all 1-sequences in that time period (Line 4). Then, the CTree is updated according to each 1-sequence (Lines 6–9). Note that if the node representing the candidate does not yet appear in the tree, it is first added as a child of the root (Lines 6–8). Subsequently, OSUMS adopts the LDP pruning strategy to judge whether the descendants of the current 1-sequence can be promising osHUSPs (Lines 10–12). If they cannot, it backtracks to the root of the LS-Tree. Otherwise, it calls the Local_PGrowth procedure to mine longer promising osHUSPs by recursively enumerating candidates with the current 1-sequence as prefix. In the second mining process, OSUMS traverses the CTree to identify osHUSPs. For each node in the tree structure, OSUMS uses the AND operation to obtain the time periods in which the sequence is on the shelf but its periodical utility has not been calculated (Line 16). Then, OSUMS adopts the ARC strategy to judge whether the actual on-shelf utility must be calculated (Line 17). If the sequence is not skipped, OSUMS calculates the unknown periodical utility values in these time periods to verify whether the sequence has high on-shelf utility (Lines 18–19). If so, it adds the sequence to the set osHUSP (Line 20). Finally, the complete set of osHUSPs discovered by OSUMS is returned (Line 24).

Input: t: a sequence as prefix;
D_t: the projected database of t;
h: the current time period.
1 for each utility list ul in D_t do
2       obtain the periodical q-matrix pm of ul;
3       scan pm to obtain the set of I-Extension items ilist;
4       scan pm to obtain the set of S-Extension items slist;
5 end for
6 for each item i ∈ ilist do
7       t' ← the I-Extension sequence of t with i;
        // The LWP strategy
8       if t' is pruned by LWP in h then
9             continue;
10      end if
11      construct the projected database of t' within the time period h;
12      put t' into seqlist;
13 end for
14 for each item i ∈ slist do
15      t' ← the S-Extension sequence of t with i;
        // The LWP strategy
16      if t' is pruned by LWP in h then
17            continue;
18      end if
19      construct the projected database of t' within the time period h;
20      put t' into seqlist;
21 end for
22 for each sequence t' ∈ seqlist do
23      if t' does not exist in the CTree then
24            construct the node representing t';
25      end if
26      update information of the node representing t' and its parent;
        // The LDP strategy
27      if the descendants of t' are not pruned by LDP in h then
28            call Local_PGrowth(t', the projected database of t', h);
29      end if
30 end for
ALGORITHM 2 Local_PGrowth

The Local_PGrowth procedure presented in Algorithm 2 takes a sequence t, the current time period h, and the projected database of t in h as the inputs. Sequences with the prefix t are enumerated by applying the I-Extension and S-Extension operations. The method first scans the periodical q-matrices to obtain the items to be extended (Lines 1–5). For each item i in ilist, a new sequence t' is generated from t by an I-Extension operation with i (Line 7). Note that the TRSU value of t' in h is simultaneously obtained from the scan (Lines 3–4). According to the LWP pruning strategy, OSUMS then determines whether the node representing t' and its descendants should be pruned (Lines 8–10). If the node has not been pruned, t' is saved and its projected database in h is constructed (Lines 11–12). Each item in slist is processed in a similar manner (Lines 14–21). After the extension operations, OSUMS updates the CTree according to the calculation results of the newly generated sequences (Lines 23–26). Finally, the Local_PGrowth procedure is recursively called, subject to the LDP strategy, to mine longer promising osHUSPs with the prefix t' in time period h (Lines 27–29).
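The scan that collects the items to be extended can be sketched as follows. The sketch assumes the usual convention that I-Extension items share the itemset of the extension position and are alphabetically larger than the last item of the prefix, while S-Extension items come from later itemsets. The function name, the 1-based positions, and the sample q-sequences (items only) are hypothetical.

```python
from typing import List, Set, Tuple


def extension_items(q_sequence: List[Set[str]], positions: List[int],
                    last_item: str) -> Tuple[List[str], List[str]]:
    """Return (ilist, slist) for a prefix whose instances end at `positions`."""
    ilist, slist = set(), set()
    for p in positions:                          # 1-based extension positions
        ilist.update(i for i in q_sequence[p - 1] if i > last_item)
        for itemset in q_sequence[p:]:           # itemsets located behind position p
            slist.update(itemset)
    return sorted(ilist), sorted(slist)


# Prefix <{a}> ending at position 1.
print(extension_items([{"a", "b"}, {"c"}], [1], "a"))            # (['b'], ['c'])
# Prefix <{a},{c}> ending at positions 2 and 3.
print(extension_items([{"a"}, {"c"}, {"b", "c"}], [2, 3], "c"))  # ([], ['b', 'c'])
```

In the procedure itself, the same pass would also accumulate the TRSU of each candidate extension, so no second scan is needed for the LWP check.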

4.2. OSUMS+ Approach

The designed pruning strategies guarantee that only on-shelf low-utility sequences are pruned in the PLS-Forest. Thus, the OSUMS method discovers the complete set of osHUSPs while reducing the search space to improve its performance. However, it must retain all candidates in memory in the form of the CTree before performing the second phase. This huge tree structure consumes a large amount of memory, especially when many candidates are generated in the first mining process. Consequently, we designed a one-phase algorithm named OSUMS+, in which each candidate is checked for high on-shelf utility as soon as it is enumerated as a promising osHUSP. The search space of OSUMS+ can be represented by a single LS-Tree, as the method handles all time periods simultaneously. This subsection presents the details of the OSUMS+ approach.

4.2.1. Storage Data Structure

Similar to OSUMS, the OSUMS+ method also uses the periodical q-matrix to store each q-sequence. It facilitates identifying the items to be extended and constructing the projected databases of 1-sequences. Moreover, a novel data structure called the periodical utility chain is designed, based on the concept of the utility chain (wang2016efficiently), to represent the projected database. The periodical utility chain of a sequence is the union of its utility chains in each of its on-shelf time periods. To distinguish between the utility chains generated from different time periods, we incorporated a new field Time, which is the identifier of the time period, into the head table. As an example, Figure 5 shows the projected database of the sequence ⟨{a}, {c}⟩ in the OSUMS+ approach. This sequence is on the shelf within two time periods, and its utility chain within one of them has been illustrated in Figure 3. Thus, Time in the head table is equal to that time period in the first two utility lists. Following a recursive process, the projected databases of longer sequences can be built from those of the sequences they extend.

Figure 5. Projected database of ⟨{a}, {c}⟩ in OSUMS+
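The head table of a periodical utility chain can be sketched as the union of per-period head tables, each entry tagged with its time period. The names `HeadEntry`, `merge_chains`, and `tpeu_by_period` are hypothetical, as are the period identifiers 1 and 2 and the single entry of period 2; the PEU values 10 and 19 are those of the utility chain discussed for Figure 3.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class HeadEntry(NamedTuple):
    time: int   # Time: identifier of the time period
    sid: int    # SID: q-sequence identifier inside that period
    peu: int    # PEU of the sequence in that q-sequence


def merge_chains(chains: Dict[int, List[Tuple[int, int]]]) -> List[HeadEntry]:
    """Union the per-period (SID, PEU) head tables into one periodical head table."""
    return [HeadEntry(time, sid, peu)
            for time in sorted(chains)
            for sid, peu in chains[time]]


def tpeu_by_period(head: List[HeadEntry]) -> Dict[int, int]:
    """Recover the per-period TPEU values from the merged head table."""
    totals: Dict[int, int] = defaultdict(int)
    for entry in head:
        totals[entry.time] += entry.peu
    return dict(totals)


head = merge_chains({1: [(1, 10), (2, 19)], 2: [(1, 7)]})
print([e.time for e in head])   # [1, 1, 2]
print(tpeu_by_period(head))     # {1: 29, 2: 7}
```

Keeping the Time field in the head table lets a single structure serve all time periods while still allowing per-period quantities to be separated when needed.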

4.2.2. Pruning Strategies

OSUMS+ adopts the TPEU and TRSU upper bounds, as well as two global pruning strategies. We defined the calculation methods of the TPEU and TRSU upper bounds in a time period in Section 4.1. Then, we set the upper bounds for the entire on-shelf time as presented below. Given a sequence t and a temporal q-sequence database D, we have TPEU(t) = Σ_h TPEU(t, h) and TRSU(t) = Σ_h TRSU(t, h), where h ranges over the time periods of D.

Theorem 4.6 ().

Given a temporal -sequence database and two sequences and , suppose the node representing is a descendant of the node representing (i.e., is a prefix sequence of ), within a time period in , we obtain

Proof.

Given a time period in , we have according to Theorem 4.3. Then, the following can be obtained

Subsequently, the following is obtained.

Clearly, the TPEU upper bound holds the downward closure property over the entire on-shelf time. We then developed a depth pruning strategy termed global depth pruning (GDP), which is described below. Given a sequence t represented by a node in the LS-Tree, a minimum on-shelf utility threshold, and a temporal q-sequence database, the descendants of t can be safely pruned without affecting the mining results when the TPEU of t falls below the minimum utility required by the threshold.

Theorem 4.7 ().

Given a temporal -sequence database and two sequences and , suppose the node representing is the descendant of the node representing or = , we obtain

Proof.

Given a time period in , we have according to Theorem 4.4. Then, the following is obtained.

Then,

As proven above, the TRSU upper bound is anti-monotonic over the entire on-shelf time in a temporal q-sequence database. We then designed a width pruning strategy termed global width pruning (GWP), which is described below. Given a sequence t that is represented by a node in the LS-Tree, a minimum on-shelf utility threshold, and a temporal q-sequence database, t and its descendants can be safely pruned when the TRSU of t falls below the minimum utility required by the threshold.

When compared to the two local pruning strategies in OSUMS, the two global strategies are able to affect all time periods. Several branches in the tree structure LS-Tree can be efficiently and safely pruned.

4.2.3. Overview of OSUMS+

To facilitate the statement of the proposed OSUMS+ method, we present its pseudocode in Algorithms 3 and 4.

Input: D: a temporal q-sequence database;
UT: a utility table with external utility values of items;
OT: a table containing on-shelf information of items;
δ: a minimum on-shelf utility threshold.
Output: osHUSP: the complete set of osHUSPs;
1 first scan the original database to: (i) calculate the periodical total sequence utility (ptsu) value of each time period; (ii) build the periodical q-matrix of each q-sequence in D;
2 second scan the original database to: (i) calculate the on-shelf utility and PEU value of each 1-sequence; (ii) construct the projected database of each 1-sequence;
3 for each 1-sequence t do
4       calculate the on-shelf utility ratio of t;
5       if the on-shelf utility ratio of t is no less than δ then
6             add t to osHUSP;
7       end if
        // The GDP strategy
8       if the descendants of t are not pruned by GDP then
9             call Global_PGrowth(t, the projected database of t);
10      end if
11 end for
12 return osHUSP
ALGORITHM 3 OSUMS+ Algorithm

As can be observed in Algorithm 3, the input and output of OSUMS+ are identical to those of OSUMS. In addition, both methods first scan the original database to calculate the ptsu values and construct the periodical q-matrices (Line 1). Then, OSUMS+ scans the database again to obtain the related information of all 1-sequences (Line 2). For each 1-sequence t, OSUMS+ first calculates its on-shelf utility ratio to check whether t is an osHUSP (Lines 4–7). Adopting the GDP strategy, the method prunes the descendants of t if the TPEU of t does not satisfy the given condition (Line 8). Otherwise, the function Global_PGrowth is called to check longer sequences with the prefix t (Line 9). After the recursive process, OSUMS+ finally returns the complete set of osHUSPs (Line 12).

Input: : a sequence as prefix;
: the projected database of ;
1 for each utility list ul in  do
2       obtain the periodical -matrix pm of ul;
3       scan pm to obtain the set of -Extension items ilist;
4       scan pm to obtain the set of -Extension items slist;
5      
6 end for
7for each item  do
8       ;
       // The GWP strategy
9       if  then
10             continue;
11            
12       end if
13      construct projected database of within the time period ;
14       put into seqlist;
15      
16 end for
17for each item  do
18       ;
       // The GWP strategy
19       if  then
20             continue;
21            
22       end if
23      construct projected database of within the time period ;
24       put into seqlist;
25      
26 end for
27for each sequence  do
28       if  then
29