
Smart System: Joint Utility and Frequency for Pattern Classification

06/09/2022
by   Qi Lin, et al.

Nowadays, the environments of smart systems for Industry 4.0 and the Internet of Things (IoT) are experiencing fast industrial upgrading. Big data technologies such as decision making, event detection, and classification are developed to help manufacturing organizations achieve smart systems. By applying data analysis, the potential value of rich data can be maximized, helping manufacturing organizations complete another round of upgrading. In this paper, we propose two new algorithms for big data analysis, namely UFC_gen and UFC_fast. Both algorithms are designed to collect three types of patterns to help people determine the market positions of different product combinations. We compare these algorithms on various types of datasets, both real and synthetic. The experimental results show that both algorithms can successfully achieve pattern classification by extracting three different types of interesting patterns from all candidate patterns based on user-specified thresholds of utility and frequency. Furthermore, the list-based UFC_fast algorithm outperforms the level-wise UFC_gen algorithm in terms of both execution time and memory consumption.


1. Introduction

In today's manufacturing environment, massive raw data are collected from shops, online e-commerce platforms, sensors, and electronic equipment. These data can be related to user behaviors, transaction records, the characteristics of products, production lines, and so on. The volume of data in the manufacturing industries keeps growing (choudhary2009data; dogan2020machine), which raises a very important issue: how to discover potential knowledge from high-volume databases. During the past few years, big data analysis has received great attention in many fields, especially smart manufacturing. In big data analysis, knowledge discovery in databases (KDD) (Chen1996Data) aims to extract useful patterns from data, and data mining (DM) (Chen1996Data; gan2017data) is the key step of KDD since it provides the algorithms or models that efficiently discover useful patterns and knowledge. Association rule mining (ARM) (agrawal1994fast; hong1999mining) and pattern classification are two of the most attractive and widely studied fields in data mining. ARM has been very popular over the past two decades: researchers determine the correlations and associations among items, thereby predicting future trends or developing feasible strategies. To describe the frequency of a pattern and the reliability of a rule, the concepts of support and confidence are adopted to record such information in a database. The first step of ARM is to filter the data; a common and traditional method is to formulate rules by utilizing algorithms from high-frequency itemset mining (HFIM) (luna2019frequent; hong1999mining; han2004mining). During this step, high-frequency itemsets (HFIs) are discovered, and the itemsets that do not meet the minimum frequency threshold are eliminated. The second step is to determine the hidden rules or patterns among the HFIs and choose the rules with higher confidence. Finally, these interesting rules are applied in the real world to help design appropriate strategies for different tasks. In the past, a number of ARM approaches have been proposed, such as Apriori (han2011data), FP-growth (han2004mining), H-Mine (pei2001h), and Eclat (zaki2000scalable).

Based on ARM, pattern classification (PC) and classification based on associations (CBA) (thabtah2007review; abdelhamid2014associative; nguyen2012classification; thabtah2004mmac) have been studied to label items or build classifiers that predict future trends. They make use of the rule discovery process of ARM by extracting effective rules that precisely generalize the training databases. Nowadays, pattern classification has been applied in various real-world applications such as commercial prediction, financial analysis, and phishing website detection (abdelhamid2014phishing). For example, pattern classification algorithms extract classifiers containing high-confidence association rules to label websites, thereby identifying which websites should be regarded as phishing websites. Since ARM normally adopts frequency as the sole standard for extracting valuable knowledge, every item is assigned the same utility (value), and frequency becomes the only precondition for obtaining association rules. Such a strategy clearly ignores the varying importance of items, and strategies developed from it may not improve the sales of some items. To overcome this drawback, utility-driven pattern mining (gan2021survey) has been proposed. In smart manufacturing, high-utility patterns are always welcomed by merchants because they simply create higher profits. High-utility pattern mining (HUPM) (chan2003mining; liu2005two; lin2017fdhup) assigns different utilities to items and considers both the occurred quantity (internal utility) and the unit value (external utility). In recent years, a series of alternative approaches have been proposed, including IHUP (ahmed2009efficient), HUI-Miner (liu2012mining), UMEpi (gan2019utility), UP-growth (tseng2010up), TKU (tseng2015efficient), and so on.

In the actual manufacturing process, both frequency and utility are essential indicators for determining the position of a product combination. On the one hand, in HFIM, high-frequency itemsets are discovered, and they are often related to ARM; a number of association rules are derived from these frequent itemsets. However, if a majority of these itemsets have low utility, mining them might be a waste of resources, unless their frequency is so high that new strategies can still be built on them. On the other hand, we do not expect a similar situation in HUPM: in manufacturing, if many of the discovered high-utility itemsets are quite infrequent, then the discovered rules may be less valuable. Clearly, one of the major challenges in pattern mining for manufacturing is how to better locate each pattern's position and design more specific and valuable strategies for bundled sales. Therefore, if frequency and utility are taken into account together in big data analysis, we can better evaluate each combination of different products, obtain a more accurate position for each combination, and thus develop more effective and specific bundled-sale strategies. In this study, we introduce a new model for pattern classification; we formulate it as joint utility and frequency for pattern classification and develop two effective algorithms.

Based on the properties of frequency and utility, there are four types of patterns in pattern classification: (i) High Frequency and High Utility Itemset (HFHUI); (ii) High Frequency and Low Utility Itemset (HFLUI); (iii) Low Frequency and High Utility Itemset (LFHUI); and (iv) Low Frequency and Low Utility Itemset (LFLUI). Since LFLUIs are meaningless and numerous, our proposed model focuses on the three interesting types, i.e., HFHUI, HFLUI, and LFHUI. The key contributions of this paper are summarized as follows:

  • In data analysis, only a few studies consider both frequency and utility as two common standards for discovering interesting patterns. In this study, two algorithms, namely UFC_gen and UFC_fast, are developed to help better locate the market positions of different product combinations in real-world applications such as smart manufacturing. Both algorithms can efficiently collect three different types of patterns.

  • Based on utility and frequency, several powerful strategies are developed to prune the search space. Different strategies are applied in UFC_gen and UFC_fast to ensure the reliability of pattern classification.

  • A novel structure named frequency-utility-list (FU-list for short) is designed for the efficient UFC_fast algorithm. This structure stores information about the transaction identifier, frequency, and remaining utility of each pattern.

  • Extensive experiments on real and synthetic datasets show that the UFC_gen and UFC_fast algorithms can be applied to datasets of various sizes, and that UFC_fast has remarkably better performance than UFC_gen.

Note that some key concepts and an initial algorithm were presented in a preliminary version (lin2021joint) of this article. The remainder of this extended version is organized as follows. The relevant literature and necessary background of association rule mining, associative classification, and utility-driven pattern mining are reviewed in Section 2. Descriptions of associative classification and the statement of the studied problem are provided in Section 3. The detailed process of the level-wise algorithm, UFC_gen, is presented in Section 4. Then, in Section 5 we provide a supplementary algorithm, UFC_fast, which performs better on the pattern classification problem. Extensive experimental results are presented in Section 6. Finally, Section 7 concludes the paper and highlights several future studies.

2. Related Work

In this section, we review related work in big data analysis, including data mining in manufacturing, utility-driven pattern mining, and pattern classification.

2.1. Data mining in manufacturing

The general goal of data mining is to find potential relationships between items, discover hidden patterns, and predict future trends. In smart manufacturing, data mining algorithms can help diagnose faults, uncover relationships among products, and design complex strategies. With the extensive use of machines and the large-scale production of items, data mining plays an increasingly important role in manufacturing. In the past two decades, a number of approaches have been developed and applied in real manufacturing processes (choudhary2009data; dogan2020machine). For instance, Fan et al. (fan2001data) proposed an algorithm that automatically discovers knowledge from production databases for fault diagnosis and can ultimately optimize device performance for a given target. In 2017, Nakata et al. (nakata2017comprehensive) proposed a system that can identify the cause of failure from wafer failure map patterns and manufacturing histories; by applying big data analysis to different stages, the system helps engineers support their work.

ARM (agrawal1993mining; brin1997beyond) was first proposed to analyze basket data and is nowadays widely applied in many areas. In smart manufacturing industries, ARM can be used to find associations among products and to induce rules that help predict user behavior. In general, there are two main steps in association rule mining. The first step is to discover frequent itemsets by counting the occurrences of each possible combination in the dataset. In the next step, rules are generated and selected from these itemsets according to the confidence of each rule. In the past, approaches such as Apriori (agrawal1994fast) and FP-growth (han2004mining) were applied to obtain the HFIs. Among them, Apriori, proposed by Agrawal and Srikant (agrawal1994fast) in 1994, is the most classic algorithm. According to the frequency downward closure property, if a 1-itemset's frequency cannot meet the minimum frequency (min_fre) threshold, then none of its supersets can be a high-frequency itemset. Apriori adopts this property to discover the HFIs using a generate-and-test search. Although it avoids a brute-force enumeration, it usually needs to scan the database multiple times, and therefore requires long execution times and a large search space. To overcome these shortcomings, the FP-growth algorithm (han2004mining) requires only two scans of the database by utilizing a tree structure, called the FP-tree, which extends its branches using prefixes. Compared with Apriori, FP-growth has lower time and space complexity. To date, a large number of HFIM and ARM algorithms have been proposed in different domains (xie2010max; gan2017data; zaki2000scalable; luna2019frequent).

2.2. Utility-driven pattern mining

In the field of ARM, a number of rules with high support and confidence can be discovered by applying existing algorithms, but the items included in the rules may not be the best targets for a company. In smart manufacturing, the profit of products is generally emphasized, and people are eager to know which product combinations can bring more profit. High-utility pattern mining (HUPM) (gan2021survey; 2gan2018survey; gan2018privacy) addresses this problem well, since in manufacturing organizations utility can often be interpreted as profit. As the frequency downward closure property was first developed for HFIM, to maintain a similar property in HUPM, the transaction-weighted utility downward closure property was first applied in the two-phase model proposed by Liu et al. (liu2005two). In the first phase, it scans the database level by level to obtain a set of candidate patterns whose TWU satisfies the minimum utility threshold. In the second phase, the database is scanned again to compute the exact utility of each candidate, so all the actual HUIs are found in the second phase. To date, a series of utility-driven mining approaches have been proposed, as reviewed in (gan2021survey), including IHUP (ahmed2009efficient) for processing dynamic data, UP-growth (tseng2010up) for processing static data, utility mining in uncertain data (lin2016efficient; gan2020utility), FHN (lin2016fhn) for processing transaction data containing both positive and negative utility values, UMEpi (gan2019utility) for discovering high-utility episodes, up-to-date high-utility pattern mining (lin2015efficient; gan2020utility2), TKU (tseng2015efficient) and TKUS (zhang2021tkus) for top-k mining, HUPM in dynamic profit databases (nguyen2019mining) or noisy databases (baek2021approximate), and utility mining on sequence data (gan2020proum; gan2021fast; zhang2021shelf). Besides, based on the new utility-occupancy concept, the HUOPM (gan2020huopm) and UHUOPM (chen2021discovering) algorithms were proposed. Some studies tried to combine frequency and utility linearly in a weighted model so that more information about the patterns can be included (wang2007pushing; gan2018extracting). Similarly, Shao et al. (shao2015mining) proposed an algorithm to mine combined patterns with high utility and frequency; such a pattern considers the utility of the generated association rules, aiming to discover rules containing HUIs. In addition, a fast utility-frequent mining algorithm (FUFM) (shankar2009fast) was introduced in earlier work.

2.3. Pattern classification

In smart manufacturing, pattern classification aims to automatically classify each pattern (a single product or a product combination) so as to help decision makers locate the market position of each pattern and design more suitable strategies. Unlike general pattern mining, which applies a single measure, pattern classification often considers multiple measures; for example, some studies classify patterns according to frequency and utility together. In the past, some researchers combined frequency and utility linearly in a weighted model so that more information about the patterns could be included. Lin et al. (lin2017fdhup) introduced the concept of discriminative high-utility patterns with strong frequency affinity. Similarly, Shao et al. (shao2015mining) proposed an algorithm to mine combined patterns with high utility and frequency; such a pattern considers the utility of the generated association rules, aiming to discover rules containing high-utility patterns. Meanwhile, a fast utility-frequent mining algorithm (FUFM) (shankar2009fast) was also introduced, which performs frequency-based mining and HUPM in two different phases in order to categorize the patterns. The joint utility-frequency approach discussed in the following sections draws on both pattern classification and utility-driven pattern mining. Besides, the two designed algorithms are quite different: the first adopts a two-phase manner, and the second utilizes a vertical data structure containing the transaction identifier, frequency, and remaining utility. Both algorithms aim to collect three different kinds of patterns based on frequency and utility.

3. Preliminaries

3.1. Definitions

In this subsection, the preliminaries and definitions of key terms related to pattern classification and utility-driven mining are presented. A transaction database and its utility table, used as a running example, are given in Table 1 and Table 2, respectively.

tid Transaction Utility
t1 (, 1), (, 2), (, 1) $13
t2 (, 2), (, 3), (, 2) $23
t3 (, 2), (, 2), (, 2) $16
t4 (, 2), (, 1), (, 1), (, 3) $10
t5 (, 1), (, 2), (, 2), (, 1) $12
Table 1. Example database
Item A B C D E F G
Utility ($) 5 3 2 1 4 2 1
Table 2. Utility table

Suppose there is a finite set of distinct items I = {i1, i2, …, im}, which is stored in a transaction database D = {t1, t2, …, tn}. Each transaction tj contains a subset of I and a unique identifier, abbreviated as tid. Each item i in a transaction tj has a positive value q(i, tj), called its internal utility, which can also be interpreted as the occurrence (quantity) of the item in the transaction. A set of definitions and properties is given as follows:

Definition 3.1 (Utility of an itemset).

The external utility eu(i) of an item i is given by the utility table, and the external utility of an itemset X is eu(X) = ∑_{i∈X} eu(i). The utility of an item i in a transaction tj, u(i, tj), measures the total utility of that item in the transaction and is calculated as u(i, tj) = eu(i) × q(i, tj). The utility of an itemset X in a transaction tj is u(X, tj) = ∑_{i∈X} u(i, tj), where X ⊆ tj and tj ∈ D. Thus, the utility of an itemset X is the sum of its utility in all transactions containing it, i.e., u(X) = ∑_{X⊆tj∈D} u(X, tj).

For instance, in Table 1, the external utility of item A is $5, so if A occurs twice in a transaction, its utility in that transaction is $5 × 2 = $10. The utility of an itemset in a transaction is obtained by summing the utilities of its items in that transaction; for example, a 2-itemset whose items have external utilities $5 and $3, occurring once in one transaction and twice in another, has utility ($5 + $3) × 1 + ($5 + $3) × 2 = $24 in the database. The transaction-weighted utility of an itemset X, denoted TWU(X) (tseng2015efficient), is the sum of the transaction utilities of all the transactions containing X, defined as TWU(X) = ∑_{X⊆tj∈D} TU(tj), in which TU(tj) = ∑_{i∈tj} u(i, tj).

Definition 3.2 (Support of an itemset).

The support (aka frequency) of an itemset is associated with the internal utilities of its items. It is defined as sup(X) = ∑_{X⊆tj∈D} q(X, tj), where q(X, tj) is the occurrence count of X in tj (for a single item, its internal utility; for a larger itemset, the minimum internal utility among its items in tj).

For instance, in Table 1, an item that appears only in transactions t1 and t2 has TWU = TU(t1) + TU(t2) = $13 + $23 = $36. An itemset that occurs once, twice, and twice in the three transactions containing it has support 1 + 2 + 2 = 5.

Definition 3.3 ().

(High-utility itemset and high-frequency itemset) An itemset is called a high-utility itemset (HUI) if its utility in a database is higher than or equal to a specified utility threshold, denoted as min_util. Similarly, an itemset is called a high-frequency itemset (HFI) if its support is higher than or equal to a specified frequency threshold, denoted as min_fre.

Table 1 shows a transaction database containing 5 transactions, where each letter represents a specific item and the corresponding number represents its quantity in that transaction. The utility table is shown in Table 2. If min_util is $15 and min_fre is 3, then the complete set of HUIs contains six itemsets, with utilities $15, $24, $16, $24, $15, and $16, and the complete set of HFIs contains ten itemsets, with supports 3, 8, 5, 3, 4, 3, 4, 3, 4, and 3.
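To make these definitions concrete, the following Java sketch computes u(X), TWU(X), and sup(X) over a small made-up transaction database; the items, quantities, and method names are illustrative assumptions and do not reproduce the running example of Table 1.

import java.util.*;

// Minimal sketch of the basic measures from Section 3.1 (utility, TWU, support).
// The toy database below is made up for illustration only.
public class UtilityMeasures {

    // external utilities (utility table)
    static Map<String, Integer> eu = Map.of("A", 5, "B", 3, "C", 2);

    // each transaction maps item -> internal utility (quantity)
    static List<Map<String, Integer>> db = List.of(
        Map.of("A", 1, "B", 2),            // t1
        Map.of("A", 2, "B", 3, "C", 2),    // t2
        Map.of("B", 1, "C", 4)             // t3
    );

    // u(X, t) = sum over items of eu(i) * q(i, t); returns -1 if X is not contained in t
    static int utilityInTransaction(Set<String> x, Map<String, Integer> t) {
        if (!t.keySet().containsAll(x)) return -1;
        return x.stream().mapToInt(i -> eu.get(i) * t.get(i)).sum();
    }

    // u(X) = sum of u(X, t) over all transactions containing X
    static int utility(Set<String> x) {
        return db.stream().map(t -> utilityInTransaction(x, t))
                 .filter(v -> v >= 0).mapToInt(Integer::intValue).sum();
    }

    // TU(t) = utility of the whole transaction; TWU(X) = sum of TU(t) over transactions containing X
    static int twu(Set<String> x) {
        return db.stream().filter(t -> t.keySet().containsAll(x))
                 .mapToInt(t -> utilityInTransaction(t.keySet(), t)).sum();
    }

    // sup(X) = sum over transactions containing X of the minimum quantity of X's items
    static int support(Set<String> x) {
        return db.stream().filter(t -> t.keySet().containsAll(x))
                 .mapToInt(t -> x.stream().mapToInt(t::get).min().orElse(0)).sum();
    }

    public static void main(String[] args) {
        Set<String> ab = Set.of("A", "B");
        System.out.println("u(AB)   = " + utility(ab));   // (5*1+3*2) + (5*2+3*3) = 30
        System.out.println("TWU(AB) = " + twu(ab));       // TU(t1) + TU(t2) = 11 + 23 = 34
        System.out.println("sup(AB) = " + support(ab));   // min(1,2) + min(2,3) = 3
    }
}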

3.2. Problem statement

With the significant progress in data mining and knowledge discovery, numerous data mining algorithms have been applied in many fields such as business. With appropriate applications in marketing, salespeople can better develop their own strategies and obtain more profit.

Our study mainly focuses on pattern classification using the indicators of frequency and utility. Both are essential for developing strategies from a transaction database because they are the fundamental information of an item or itemset. However, utility or frequency is widely used as the sole measure in many typical applications, such as shopping basket analysis. Two situations then arise: What if an item (or itemset) with high frequency has quite low utility (low profit in business) in a transaction database? What if an item (or itemset) with high utility occurs only a few times in a transaction database?

The two situations above indicate two types of itemsets: high-frequency itemsets with low utility and low-frequency itemsets with high utility. At the same time, there are also itemsets that bring high profit and occur frequently. Naturally, in most cases people prefer high-frequency itemsets with high utility. However, the first two types of itemsets can also have a positive impact, though they are less important than high-frequency itemsets with high utility. Therefore, these three types of itemsets play different but important roles in strategy development and rule design.

To better discover the potential of all patterns, an acceptable solution is to classify patterns according to these two indicators simultaneously. Given the threshold of frequency and utility, there are four types of patterns (wang2007pushing).

Definition 3.4 ().

By taking frequency and utility together into account, there are four types of patterns: High Frequency and High Utility itemset (HFHUI); High Frequency and Low Utility itemset (HFLUI); Low Frequency and High Utility itemset (LFHUI); and Low Frequency and Low Utility itemset (LFLUI). These four patterns can be illustrated in Fig. 1.

Figure 1. Four types of patterns (wang2007pushing).

However, among these four categories, LFLUIs contribute little to rule discovery and strategy design. Therefore, the goal of our proposed algorithms is to collect the other three types of itemsets from the given database. Naturally, if both the utility and the frequency of an itemset satisfy the corresponding thresholds, it belongs to HFHUI. If its frequency satisfies the threshold but its utility does not, it is an HFLUI. Similarly, LFHUI and LFLUI are determined in the same way. Recall that in the running example the minimum frequency threshold is 3 and the minimum utility threshold is $15. In this case we obtain five HFHUIs, with (utility, frequency) pairs ($15, 3), ($24, 4), ($16, 4), ($24, 3), and ($15, 4); four HFLUIs, with pairs ($6, 3), ($3, 3), ($6, 3), and ($9, 3); and one LFHUI, with pair ($16, 2).
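As a minimal sketch of this four-way classification, the following Java fragment labels a pattern from its utility and frequency; the class and method names are hypothetical, and the thresholds in the usage example are those of the running example ($15 and 3).

// Minimal sketch of the four-way classification from Definition 3.4.
// The Category enum and method names are illustrative, not the authors' API.
public class PatternClassifier {

    enum Category { HFHUI, HFLUI, LFHUI, LFLUI }

    // classify one pattern by comparing its utility and frequency with the thresholds
    static Category classify(double utility, int frequency, double minUtil, int minFre) {
        boolean highUtil = utility >= minUtil;
        boolean highFre  = frequency >= minFre;
        if (highFre && highUtil)  return Category.HFHUI;
        if (highFre)              return Category.HFLUI;   // high frequency, low utility
        if (highUtil)             return Category.LFHUI;   // low frequency, high utility
        return Category.LFLUI;                             // discarded by the proposed model
    }

    public static void main(String[] args) {
        // running-example thresholds: min_util = $15, min_fre = 3
        System.out.println(classify(24, 4, 15, 3)); // HFHUI
        System.out.println(classify(6, 3, 15, 3));  // HFLUI
        System.out.println(classify(16, 2, 15, 3)); // LFHUI
    }
}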

Generally, the application of our proposed algorithms in smart manufacturing can be split into the following steps: (1) read the transaction records from the database; (2) extract the necessary information from the records, i.e., the id of each good and its frequency and utility in each transaction; (3) apply UFC_gen or UFC_fast to these records; (4) obtain the classification results for the three types of patterns; and (5) finally, perform advanced analysis and design new strategies according to the classification results. These processes are shown in Fig. 2.

Figure 2. Classification processes in smart manufacturing.

4. UFC_gen Algorithm

In this section, we introduce an algorithm called UFC_gen. The key idea is to collect three types of itemsets based on their utility and frequency in two phases. Details are presented below.

In the first phase, by utilizing the transaction-weighted utility downward closure property (liu2005two) and the frequency downward closure property, the UFC_gen algorithm can efficiently generate new candidates and put them into the set of final candidates. In the second phase, an extra scan of the whole transaction database is needed to calculate the real utility of each itemset in the final candidate set.

4.1. Phase I

In the first scan of the database, we record the TWU and frequency of each 1-itemset, and then generate new candidates by applying an operation called connection operation level by level until there are no new candidates generated. Now we introduce some significant properties and definitions which are applied in Phase I.

Property 1 ().

(TWU downward closure property (liu2005two)) In a given database D, for any itemset X, if TWU(X) is less than a user-specified min_util, then neither X nor any superset of X can be an HUI. That is, if TWU(X) < min_util, then for any X′ ⊇ X, u(X′) ≤ TWU(X′) ≤ TWU(X) < min_util.

For example, assume that min_util is $20. In Table 1 we can find an itemset whose TWU is exactly $16. Because $16 < $20, by Property 1 neither this itemset nor any of its supersets can be an HUI.

Property 2 ().

(Frequency downward closure property (agrawal1994fast)) In a given database D, for any itemset X, if X is a low-frequency itemset, then no superset of X can be a high-frequency itemset.

For example, suppose that min_fre is 4. From Table 1 and Table 2, consider an itemset that occurs once in one transaction and twice in another, so that its support is 1 + 2 = 3. Since 3 < 4, by Property 2 no superset of this itemset can be an HFI.

Property 3 ().

In a given database, we denote any high transaction-weighted utility itemset as an HTWUI. Then, with the same user-specified min_util, if an itemset is an HUI, it must also be an HTWUI.

Proof.

Let D be the database and T(X) be the set of transactions containing an itemset X. For any HUI X, we have min_util ≤ u(X) = ∑_{tj∈T(X)} u(X, tj) ≤ ∑_{tj∈T(X)} TU(tj) = TWU(X). Therefore, X is also an HTWUI. ∎

For example, we still assume that min_util is $20. From Table 1 and Table 2, the utility of item B is computed as u(B) = $3 × (2 + 3 + 2 + 1) = $24. Because $24 ≥ $20, by Property 3 B must be an HTWUI.

Definition 4.1 ().

(Operation of connection) We define L_k as a set of k-itemsets, i.e., every itemset X ∈ L_k contains exactly k items. We denote the connection operation as ⋈. For two itemsets X = {x1, x2, …, xk} and Y = {y1, y2, …, yk} in L_k, if xi = yi for i = 1, 2, …, k−1 (i.e., X and Y share the same (k−1)-prefix), then Z = X ⋈ Y = {x1, x2, …, xk, yk}, which is a (k+1)-itemset.

Itemset TWU Frequency
$36 3
$52 8
$35 5
$26 3
$16 2
$22 3
$16 4
Table 3. Characteristics of candidate sets

Through the connection operation, we can generate new candidates level by level, and any itemset at level k is a k-itemset. From Table 1, we can obtain each item's TWU and frequency, as shown in Table 3. Now suppose that min_util and min_fre are $30 and 4, respectively. By scanning Table 3, we have to omit the three items for which neither the frequency nor the TWU satisfies its threshold; these items cannot participate in the connection operation. Hence, at the next level, we only obtain 2-itemsets connected from the four remaining items. In Phase I, we continue this operation until no candidate sets can take part in the connection operation to enter the next level.
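The following Java sketch illustrates the connection operation; representing itemsets as sorted lists of item names is an implementation assumption for illustration, not the authors' data structure.

import java.util.*;

// Minimal sketch of the connection operation (Definition 4.1): two k-itemsets
// sharing the same (k-1)-prefix are joined into a (k+1)-itemset.
public class ConnectionOperation {

    // join all pairs of k-itemsets in `level` that share their first k-1 items
    static List<List<String>> connect(List<List<String>> level) {
        List<List<String>> next = new ArrayList<>();
        for (int i = 0; i < level.size(); i++) {
            for (int j = i + 1; j < level.size(); j++) {
                List<String> x = level.get(i), y = level.get(j);
                int k = x.size();
                if (x.subList(0, k - 1).equals(y.subList(0, k - 1))) {
                    List<String> z = new ArrayList<>(x);
                    z.add(y.get(k - 1));        // append the last item of y
                    next.add(z);
                }
            }
        }
        return next;
    }

    public static void main(String[] args) {
        // 2-itemsets over the surviving items; {A,B} and {A,C} share the prefix {A}
        List<List<String>> level2 = List.of(
            List.of("A", "B"), List.of("A", "C"), List.of("B", "C"));
        System.out.println(connect(level2));   // [[A, B, C]]
    }
}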

Using the properties and definitions above, we develop pruning strategies to efficiently reduce the search space. The strategies are presented as follows:

  • Strategy 1. If the TWU of the current itemset is less than the utility threshold and its frequency is less than the frequency threshold, then the current itemset does not enter Phase II; it is automatically an LFLUI.

  • Strategy 2. If the transaction-weighted utility of the current itemset is higher than or equal to the utility threshold, or its frequency is higher than or equal to the frequency threshold, we pass the itemset to Phase II. We then apply the connection operation to all such itemsets in order to generate new candidates.

In Phase I, the initial inputs are I1 (the set of all 1-itemsets), min_util, and min_fre. We use cur to denote the set of current itemsets, each carrying its TWU and frequency, denoted twu and fre, respectively. When the length of cur is zero, the algorithm terminates. The set cand represents the final candidate itemsets to be classified into three different types of itemsets. Note that null means a set contains no itemset. In lines 3-9, we apply the pruning strategies: if the current itemset in cur satisfies neither threshold, we remove it from cur; otherwise we add it to cand. In lines 10-19, we perform the connection operation on all itemsets remaining in cur. Based on these strategies, all itemsets that satisfy the condition are added to the final candidate set cand, which will be classified in Phase II. Details of the procedure of Phase I are presented in Algorithm 1.

Input: I1: the set of all 1-itemsets; min_util: the threshold of utility; min_fre: the threshold of frequency.
Output: cand: the set of final candidates.
1 initialize cur ← I1, cand ← null;
2 while length(cur) ≠ 0 do
3       for each itemset e in cur do
4             if e.twu ≥ min_util or e.fre ≥ min_fre then
5                   add e to cand;
6             else
7                   remove e from cur;
8             end if
9       end for
10      tmp ← null;
11      for each itemset X in cur do
12            k ← length(X);
13            for each itemset Y after X in cur do
14                  if the first k−1 items in X and Y are equal then
15                        Z ← {x1, x2, …, xk, yk};
16                        add Z to tmp;
17                  end if
18            end for
19      end for
20      clear cur;
21      cur ← tmp;
22 end while
23 return cand
ALGORITHM 1 Phase I of UFC_gen

4.2. Phase II

An additional scan of the transaction database is required. This extra scan calculates the real utility of each candidate itemset. There are no pruning strategies in Phase II; its purpose is to collect the three types of itemsets according to the user-specified thresholds. Eventually, HFHUI, HFLUI, and LFHUI are presented as the output. In Phase II, util and fre denote the real utility and frequency of each itemset in the set cand. Details of Phase II are given in Algorithm 2.

Input: cand: the final candidates after Phase I; min_util: the threshold of utility; min_fre: the threshold of frequency.
Output: HFHUI: high frequency and high utility itemsets; HFLUI: high frequency and low utility itemsets; LFHUI: low frequency and high utility itemsets.
1 initialize HFHUI ← null, HFLUI ← null, LFHUI ← null;
2 scan the database again to obtain the real utility of each itemset;
3 for each pattern elem in cand do
4       if elem.util ≥ min_util and elem.fre ≥ min_fre then
5             add elem to HFHUI;
6       else if elem.util < min_util and elem.fre ≥ min_fre then
7             add elem to HFLUI;
8       else
9             add elem to LFHUI;
10      end if
11 end for
12 return HFHUI, HFLUI, LFHUI
ALGORITHM 2 Phase II of UFC_gen

5. UFC_fast Algorithm

In this section, a more efficient algorithm called UFC_fast is presented to solve the pattern classification problem. First, we briefly introduce a special data structure called the utility-list (liu2012mining). The basic definitions are presented as follows.

5.1. Frequency-utility-list

Definition 5.1 ().

(Revised transaction (liu2012mining)) A transaction is considered revised if (1) its items are sorted in ascending order of TWU, and (2) all items whose TWU is less than the specified utility threshold are deleted from the transaction.

Definition 5.2 ().

(Remaining utility (liu2012mining)) In a database D, given a revised transaction tj and an itemset X ⊆ tj, the remaining utility of X in tj is the sum of the utilities of the items sorted after X in tj. Let rutil(X, tj) (liu2012mining; lin2016fhn) denote this value, computed as rutil(X, tj) = ∑_{i∈tj ∧ i≻X} u(i, tj), where i ≻ X means that item i appears after all the items of X in the revised transaction tj.

For instance, assuming the utility threshold is $30, we can rearrange Table 1 according to the definition of a revised transaction; the new table is shown in Table 4. Considering an itemset in Table 4, its remaining utility in a transaction is the sum of the utilities of the items listed after it, e.g., $5 × 1 + $3 × 2 = $11. A list that stores (i) the transaction identifier, (ii) the utility of an itemset in that transaction, and (iii) its remaining utility is called a utility-list (liu2012mining). This structure is very useful because it contains the essential information, can easily be extended to a new itemset, and thus helps avoid generating unpromising candidates.

tid Transaction Utility
(, 1), (, 1), (, 2) $13
(, 2), (, 3) $19
(, 2) $6
(, 3), (, 2) $7
(, 1), (, 2), (, 1) $8
Table 4. Example database (revised transaction)

To solve the pattern classification problem, we slightly modify the utility-list by replacing its second column with the internal frequency of the item in each transaction; that is, the frequency of the item is recorded for each tid. For the above example, supposing the external utility of the item is $2, its list is transformed into the one illustrated in Fig. 3. We call this new data structure the frequency-utility-list, or FU-list for short.

Definition 5.3 ().

(Frequency-utility list (FU-list)) For a given database, the FU-list contains a set of tuples, and each tuple has three fields: tid, fre, rutil. The fre is the occurrence of an itemset in the current transaction.

To obtain the set of FU-lists, we have to change the definition of a revised transaction: instead of deleting only the items with low TWU, we remove the items with both low TWU and low frequency. Constructing the FU-lists of 1-itemsets requires two scans of the database. In the first scan, we calculate each 1-itemset's TWU and frequency in order to transform each transaction into a revised transaction. In the second scan, we construct the FU-lists from these revised transactions.

Figure 3. FU-list of a single item
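The following Java sketch shows one way the FU-lists of 1-itemsets could be built in two scans under the rules above (items failing both thresholds are dropped and the remaining items are sorted by ascending TWU); the toy database, the Entry record, and all names are illustrative assumptions rather than the authors' implementation.

import java.util.*;

// Minimal sketch of FU-list construction for 1-itemsets (Section 5.1).
public class FUListConstruction {

    record Entry(int tid, int fre, int rutil) {}          // one (tid, fre, rutil) tuple of an FU-list

    static Map<String, Integer> eu = Map.of("A", 5, "B", 3, "C", 2);
    static List<Map<String, Integer>> db = List.of(
        Map.of("A", 1, "B", 2),
        Map.of("A", 2, "B", 3, "C", 2),
        Map.of("B", 1, "C", 4));

    static Map<String, List<Entry>> buildFuLists(int minUtil, int minFre) {
        // first scan: TWU and frequency of each item
        Map<String, Integer> twu = new HashMap<>(), fre = new HashMap<>();
        for (Map<String, Integer> t : db) {
            int tu = t.entrySet().stream().mapToInt(x -> eu.get(x.getKey()) * x.getValue()).sum();
            for (var e : t.entrySet()) {
                twu.merge(e.getKey(), tu, Integer::sum);
                fre.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        // second scan: build revised transactions and append FU-list entries
        Map<String, List<Entry>> fuLists = new LinkedHashMap<>();
        for (int tid = 0; tid < db.size(); tid++) {
            Map<String, Integer> t = db.get(tid);
            List<String> revised = t.keySet().stream()
                .filter(i -> twu.get(i) >= minUtil || fre.get(i) >= minFre)  // Strategy 3: drop items failing both
                .sorted(Comparator.comparing(twu::get))                      // ascending TWU
                .toList();
            for (int p = 0; p < revised.size(); p++) {
                int rutil = 0;                                               // utility of items after position p
                for (int q = p + 1; q < revised.size(); q++)
                    rutil += eu.get(revised.get(q)) * t.get(revised.get(q));
                fuLists.computeIfAbsent(revised.get(p), k -> new ArrayList<>())
                       .add(new Entry(tid + 1, t.get(revised.get(p)), rutil));
            }
        }
        return fuLists;
    }

    public static void main(String[] args) {
        buildFuLists(20, 4).forEach((item, list) -> System.out.println(item + " -> " + list));
    }
}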

5.2. Frequency-utility-list of multiple itemsets

Building the FU-list of a k-itemset is similar to the connection operation of the UFC_gen algorithm: the intersection of the transaction sets must not be empty, and the first k−1 items of the two k-itemsets must be equal. We now introduce a new operation called 1-extension, which constructs the FU-list of a (k+1)-itemset from two different k-itemsets with the same prefix.

Definition 5.4 ().

(1-extension (liu2012mining)) Consider two different k-itemsets with the same first (k−1) items; denote them as eX and eY, where e denotes the shared (k−1)-prefix, and assume that in the revised transactions item Y comes after item X. By traversing their FU-lists, whenever a shared transaction is found, we create a new entry consisting of the shared tid, the smaller of the frequencies of eX and eY in this transaction, and the remaining utility of eY, and we add this entry to the FU-list of eXY. We repeat this operation until no more shared transactions are found.

Suppose FU-lists have been created for two new items, say x and y, whose TWU values are $50 and $100, respectively, and suppose they share two common tids. For the frequency of itemset xy in these transactions, we choose the smaller of the frequencies of x and y; in this example, the frequency of xy is 1 in the first shared transaction and 2 in the second. As for the remaining utility of xy, because the TWU of y is larger than that of x, y comes after x in the revised transactions; thus rutil(xy) equals rutil(y) in each shared transaction, namely $5 and $10. Therefore, the FU-list of xy is built. The process is shown in Fig. 4.

Figure 4. FU-list of the joined itemset xy
Property 4 ().

(Remaining utility downward closure property) This property is similar to Property 1. If the sum of an itemset's utility and its remaining utility is less than the utility threshold, then no 1-extension of the itemset can be an HUI; i.e., if u(X) + rutil(X) < min_util, then any 1-extension X′ of X cannot be an HUI.

Proof.

In the database D, for an itemset X, let X′ be a 1-extension of X. Define T(X) as the set of revised transactions containing X and T(X′) as the set of revised transactions containing X′. Since X′ is a 1-extension of X, T(X′) ⊆ T(X). Furthermore, every item appearing after X′ in a revised transaction must also appear after X. Therefore, u(X′) + rutil(X′) = ∑_{tj∈T(X′)} (u(X′, tj) + rutil(X′, tj)) ≤ ∑_{tj∈T(X)} (u(X, tj) + rutil(X, tj)) = u(X) + rutil(X) < min_util, so X′ cannot be an HUI. ∎

Input: e: an item (the shared prefix, possibly empty); ex.FU: the FU-list of itemset eX; ey.FU: the FU-list of itemset eY.
Output: exy.FU: the FU-list of itemset eXY.
1 initialize exy.FU ← null;
2 for each entry a ∈ ex.FU do
3       if ∃ an entry b ∈ ey.FU with a.tid == b.tid then
4             fre ← min(a.fre, b.fre);
5             create a new entry elem ← [a.tid, fre, b.rutil];
6             exy.FU ← exy.FU ∪ elem;
7       end if
8 end for
9 return exy.FU
ALGORITHM 3 Extend procedure

In the Extend procedure, we take e, ex.FU, and ey.FU as input, where e is the shared prefix (an item, possibly empty) and Y is the item that comes after X in the revised transactions. The remaining utility of an itemset in each transaction is denoted by rutil. In lines 2-3, we collect the transactions in which eX and eY both occur. In lines 4-6, for each such transaction we create a new entry: we take the smaller of the frequencies of eX and eY as the frequency of the entry, and the remaining utility of eY as the new remaining utility of eXY. Finally, we append the new entry to the FU-list of eXY. Details of the procedure for the 1-extension are presented in Algorithm 3.
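A minimal Java sketch of this join is given below; it omits the prefix parameter e, which the join itself does not use, and indexes the second FU-list by tid, so the layout and names are illustrative assumptions rather than the authors' implementation.

import java.util.*;

// Minimal sketch of the Extend procedure (Algorithm 3): joining the FU-lists of
// two itemsets eX and eY that share the prefix e, where Y follows X in the
// revised-transaction order. The Entry record mirrors the (tid, fre, rutil) tuple.
public class ExtendProcedure {

    record Entry(int tid, int fre, int rutil) {}

    // build the FU-list of eXY from the FU-lists of eX and eY
    static List<Entry> extend(List<Entry> exFU, List<Entry> eyFU) {
        // index eY's entries by tid so shared transactions are found in O(1)
        Map<Integer, Entry> byTid = new HashMap<>();
        for (Entry b : eyFU) byTid.put(b.tid(), b);

        List<Entry> exyFU = new ArrayList<>();
        for (Entry a : exFU) {
            Entry b = byTid.get(a.tid());
            if (b != null) {                                   // shared transaction found
                int fre = Math.min(a.fre(), b.fre());          // smaller frequency of eX and eY
                exyFU.add(new Entry(a.tid(), fre, b.rutil())); // keep eY's remaining utility
            }
        }
        return exyFU;
    }

    public static void main(String[] args) {
        // illustrative FU-lists for two itemsets sharing transactions t2 and t4
        List<Entry> x = List.of(new Entry(1, 2, 9), new Entry(2, 1, 7), new Entry(4, 3, 6));
        List<Entry> y = List.of(new Entry(2, 2, 5), new Entry(4, 1, 10));
        System.out.println(extend(x, y));   // [Entry[tid=2, fre=1, rutil=5], Entry[tid=4, fre=1, rutil=10]]
    }
}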

5.3. Pruning strategies in UFC_fast

The pruning strategies here are slightly different from those in UFC_gen because of the special structure of the FU-list. The extra remaining utility can be used to derive a new but similar pruning strategy. In the FU-list, the frequency is already stored in the second column; thus the utility of an itemset can easily be obtained by looking up its external utility in the corresponding utility table, and we can immediately identify its category. We now introduce a lemma, which is used as one of the pruning strategies in the UFC_fast algorithm.

Lemma 5.5 ().

Given the FU-list of an itemset X together with its external utility, if the product of its frequency and its external utility (i.e., the utility of the itemset) plus its remaining utility is less than the given minimum utility threshold, then no extension of X is an HUI.

Taking the FU-list of a 1-itemset as an instance, suppose its external utility is $1 and the utility threshold is $35. Its frequency multiplied by its external utility, plus its remaining utility, summed over the three transactions containing it, equals (2 × $1 + $6) + (2 × $1 + $7) + (2 × $1 + $12) = $31 < $35. Hence no extension of this itemset can be an HUI. With Lemma 5.5 and Properties 1 and 2 from the UFC_gen algorithm, the pruning strategies are developed as follows:

Strategy 3. In the first scan of the database, we select the 1-itemsets whose transaction-weighted utility is higher than or equal to the utility threshold or whose frequency is higher than or equal to the frequency threshold. If neither threshold is reached, the 1-itemset and its extensions must be LFLUIs, and we simply remove it from the current transactions.

For example, suppose that min_util is $25 and min_fre is 5. According to Table 1 and Table 2, the TWU and the frequency of a certain itemset are computed as $10 + $12 = $22 and 1 + 3 = 4, respectively. Because $22 < $25 and 4 < 5, this itemset is an LFLUI, and we delete it.

Strategy 4. For each itemset, we can immediately obtain its real frequency and utility through its FU-list and the utility table of the database, and thus the itemset can be classified at once.

For example, consider the FU-list of the item shown in Fig. 3. Assuming its external utility is $2, we obtain its frequency and utility, 5 and $10. If min_fre and min_util are 4 and $15 respectively, the item is an HFLUI.

Strategy 5. If the current itemset's real frequency is higher than or equal to the frequency threshold, or the sum of its utility and remaining utility is higher than or equal to the utility threshold, then we apply the 1-extension to the current itemset with the itemsets after it.

For example, assume that min_util is $30 and min_fre is 3. According to Table 2 and Table 4, the sum of a certain itemset's utility and its remaining utility is computed as $1 × (3 + 1) + ($2 × 2 + $2 × 2 + $3 × 1) = $15, and its support is 3 + 1 = 4. Because $15 < $30 but 4 ≥ 3, we should create the 1-extensions of this itemset with the itemsets after it.

5.4. Proposed UFC_fast algorithm

After two scans of the database, we build the initial FU-lists for all 1-itemsets. The 1-extension operation starts from this set of original FU-lists. Each time, we first obtain the current itemset's frequency and utility and classify it according to the given thresholds. Then, by applying Strategy 5, we extend the current itemset with the itemsets after it. The UFC_fast algorithm terminates when no more candidates can be generated. Details of the algorithm are presented in Algorithm 4.

Input: e: an itemset prefix, initially empty; FUs: the set of FU-lists of itemset e's 1-extensions; min_util: the threshold of utility; min_fre: the threshold of frequency.
Output: HFHUI: high frequency and high utility itemsets; HFLUI: high frequency and low utility itemsets; LFHUI: low frequency and high utility itemsets.
1 initialize HFHUI ← null, HFLUI ← null, LFHUI ← null;
2 for each FU-list x.FU in FUs do
3       x ← the itemset corresponding to x.FU;
4       classify itemset x;
5       if x.util + x.rutil ≥ min_util or x.fre ≥ min_fre then
6             exFUs ← null;
7             for each FU-list y.FU after x.FU in FUs do
8                   y ← the itemset corresponding to y.FU;
9                   exFUs ← exFUs ∪ Extend(e, x.FU, y.FU);
10            end for
11            call UFC_fast(x, exFUs, min_util, min_fre);
12      end if
13 end for
ALGORITHM 4 UFC_fast algorithm

In this algorithm, an itemset prefix e and the set of FU-lists FUs are used as input. At first, we scan every FU-list x.FU in FUs. The FU-lists of 1-itemsets are constructed only for the items whose transaction-weighted utility or frequency satisfies the corresponding minimum threshold, so we can immediately sort the current itemset into its category through its FU-list; this is exactly the process of Strategy 3 and Strategy 4. Then, in lines 5-11, an empty set of FU-lists exFUs is created, and Strategy 5 is applied: if the sum of the utility and remaining utility of x, or its frequency, meets the corresponding minimum value, we apply the Extend procedure to x and every itemset after it in order to create new FU-lists. Each newly created FU-list is added to exFUs. Finally, the algorithm is called recursively, and it terminates when there are no more itemsets to be classified.

According to Property 2 and Property 4, the completeness and correctness of the proposed UFC_fast algorithm are guaranteed; hence, it ensures that each itemset is sorted into the corresponding category. Furthermore, the 1-extension operation is applied to every itemset that satisfies the condition, so the algorithm guarantees that all possible itemsets are discovered.
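Putting the pieces together, the following compact Java sketch mimics the recursive search of Algorithm 4 under the assumptions of the earlier sketches; it recovers an itemset's utility from its total frequency and the external utilities of its items, following the FU-list convention described above, and all data and names are illustrative.

import java.util.*;

// Compact sketch of the recursive UFC_fast search (Algorithm 4). It assumes the
// FU-lists of the 1-itemsets have already been built (see the earlier sketch).
public class UfcFastSearch {

    record Entry(int tid, int fre, int rutil) {}
    record ItemsetFU(List<String> items, List<Entry> fu) {}

    static Map<String, Integer> eu = Map.of("A", 5, "B", 3, "C", 2);  // utility table
    static List<List<String>> hfhui = new ArrayList<>(), hflui = new ArrayList<>(), lfhui = new ArrayList<>();

    static void search(List<ItemsetFU> fus, int minUtil, int minFre) {
        for (int i = 0; i < fus.size(); i++) {
            ItemsetFU x = fus.get(i);
            int fre = x.fu().stream().mapToInt(Entry::fre).sum();
            int euX = x.items().stream().mapToInt(eu::get).sum();
            int util = fre * euX;                                    // FU-list convention from Lemma 5.5
            int rutil = x.fu().stream().mapToInt(Entry::rutil).sum();

            // classify the current itemset (Strategy 4)
            if (util >= minUtil && fre >= minFre) hfhui.add(x.items());
            else if (fre >= minFre) hflui.add(x.items());
            else if (util >= minUtil) lfhui.add(x.items());

            // extend only promising itemsets (Strategy 5 / Lemma 5.5)
            if (util + rutil >= minUtil || fre >= minFre) {
                List<ItemsetFU> exFus = new ArrayList<>();
                for (int j = i + 1; j < fus.size(); j++) {
                    ItemsetFU y = fus.get(j);
                    List<Entry> joined = extend(x.fu(), y.fu());
                    if (!joined.isEmpty()) {
                        List<String> items = new ArrayList<>(x.items());
                        items.add(y.items().get(y.items().size() - 1));  // append y's last item
                        exFus.add(new ItemsetFU(items, joined));
                    }
                }
                if (!exFus.isEmpty()) search(exFus, minUtil, minFre);    // recursive call
            }
        }
    }

    // 1-extension join of two FU-lists (Algorithm 3)
    static List<Entry> extend(List<Entry> exFU, List<Entry> eyFU) {
        Map<Integer, Entry> byTid = new HashMap<>();
        for (Entry b : eyFU) byTid.put(b.tid(), b);
        List<Entry> out = new ArrayList<>();
        for (Entry a : exFU) {
            Entry b = byTid.get(a.tid());
            if (b != null) out.add(new Entry(a.tid(), Math.min(a.fre(), b.fre()), b.rutil()));
        }
        return out;
    }

    public static void main(String[] args) {
        // FU-lists of 1-itemsets in ascending-TWU order (illustrative values)
        List<ItemsetFU> fus = List.of(
            new ItemsetFU(List.of("C"), List.of(new Entry(2, 2, 19), new Entry(3, 4, 3))),
            new ItemsetFU(List.of("B"), List.of(new Entry(1, 2, 5), new Entry(2, 3, 10), new Entry(3, 1, 0))),
            new ItemsetFU(List.of("A"), List.of(new Entry(1, 1, 0), new Entry(2, 2, 0))));
        search(fus, 20, 4);
        System.out.println("HFHUI=" + hfhui + " HFLUI=" + hflui + " LFHUI=" + lfhui);
    }
}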

6. Experiments

We performed extensive experiments to evaluate the UFC_gen and UFC_fast algorithms. Because the proposed model is the first to collect three types of itemsets according to frequency and utility, we can only compare these two algorithms with each other, on different datasets, in terms of execution time, memory consumption, and scalability.

Both algorithms were implemented in Java. All experiments were conducted on a personal computer with 4 GB of RAM running the 64-bit Microsoft Windows 10 operating system. In the experiments, the execution time, memory consumption, classification results, and scalability are evaluated respectively. To visualize the impact of the thresholds, min_util and min_fre are expressed as percentages in all experiments. For instance, if the total utility of a database is $1,000 and the minimum utility threshold is $150, then min_util is 15%, calculated as $150 / $1,000 = 15%.

Unlike typical experiments for HFIM and HUPM, the threshold parameters are set in the form of combinations, which means that both frequency and utility are taken into account in the same experiment. In each experiment, we choose several frequency thresholds and vary the utility threshold. For each dataset, both thresholds are set at a very low level (less than 1%), because in earlier tests we found that when the thresholds were set at a relatively high level, many itemsets could not meet the conditions, which made it difficult to compare the proposed algorithms. Our earlier tests also showed that small, evenly spaced increments between points are appropriate for showing the general trends of the proposed algorithms.

6.1. Experiment datasets

Both synthetic and real datasets (foodmart, retail, T40I10D100K, and yoochooseBuys) were used in the experiments. The datasets are available at http://www.philippe-fournier-viger.com/spmf and recsys.acm.org/recsys15/challenge/. The foodmart dataset has 21,556 transactions and around 1,559 distinct items. The retail and T40I10D100K datasets contain far more transactions. Moreover, transactions in foodmart and retail contain only 4.0 and 10.3 items on average, respectively, whereas transactions in T40I10D100K are much longer on average. For all datasets, both an internal utility and an external utility are assigned to each item. The detailed characteristics of all datasets are listed in Table 5, where |D| represents the number of transactions in the database and |I| indicates the number of distinct items.

Dataset |D| |I| AvgLen
retail 88,162 16,470 10.3
foodmart 21,556 1,559 4.0
T40I10D100K 100,000 942 39.6
yoochooseBuys 507,746 19,102 2.3
Table 5. Characteristics of test datasets
Figure 5. Execution time w.r.t. different combinations of frequency and utility thresholds.

6.2. Execution time

The running times of the UFC_gen and UFC_fast algorithms are illustrated in Fig. 5. A task was terminated if its running time exceeded 10,000 s. By setting a group of thresholds for both frequency and utility, we obtain the execution time of the proposed algorithms at each point. In general, the lower the thresholds are, the more computation time both algorithms require, because more itemsets satisfy the requirements. For instance, on the retail dataset, when min_fre and min_util are both 0.15%, the running times of UFC_gen and UFC_fast are 210 s and 6.5 s, respectively; when both thresholds drop to 0.1%, their execution times become 217 s and 23 s, respectively.

It is noticeable that UFC_fast always requires less time than UFC_gen. This is mainly because UFC_gen keeps generating new candidates level by level, and a large amount of time is consumed in producing new candidates through the connection operation. In particular, when the number of items is extremely large, combining the current itemsets can be very slow. When the number of itemsets becomes smaller, UFC_gen spends less time connecting old itemsets to generate new ones, which reduces the time difference between the two algorithms. For example, Fig. 5(b) shows that on the foodmart dataset the difference in running time between UFC_gen and UFC_fast is much smaller than on retail: both algorithms stay within 5 seconds, and the largest times consumed by the two algorithms are 2.6 s and 4.5 s when the thresholds of frequency and utility are 0.1% and 0.05%, respectively.

Fig. 5(c) shows the running time on T40I10D100K. Because the average transaction length of this database is large, UFC_gen has to deal with an enormous number of itemsets when the thresholds are set as for retail; in that case, its execution time becomes remarkably large and the task has to be terminated. By contrast, UFC_fast finishes the classification tasks, although it takes longer than on retail and foodmart, with the largest time consumed being 1,635 seconds. The overall trend of UFC_fast's execution time on this dataset is relatively smooth.

Overall, the proposed UFC_fast performs better than UFC_gen in terms of computation time under various thresholds of utility and frequency on every dataset. This is particularly evident when the database contains a large number of items and many transactions. On the other hand, the figures show that the execution time of both algorithms does not always decrease linearly as the thresholds increase, but the fluctuations are acceptable.

6.3. Memory consumption

Fig. 6 shows the peak memory consumption of both algorithms on the real and synthetic datasets. It can be seen that UFC_fast always consumes much less memory than UFC_gen. This is mainly because UFC_fast does not generate candidates, whereas UFC_gen repeatedly generates new candidates from the previous candidates that satisfy the conditions.

Figure 6. Memory consumption w.r.t. different thresholds of frequency and utility.

Generally, the trends of memory consumption of the two algorithms are not linear. The peak memory of UFC_gen fluctuates significantly on retail and foodmart, whereas the memory consumption of UFC_fast tends to be more stable. For instance, on retail, as one of the thresholds increases, the memory consumption of UFC_fast remains at approximately 200 MB, reaching its highest value (271.78 MB) when the thresholds of frequency and utility are 0.3% and 0.25%, respectively. By contrast, the memory of UFC_gen fluctuates significantly across thresholds: when the utility threshold is 0.05%, its memory consumption is dramatically higher than in the other settings; as the utility threshold increases, the memory consumed drops noticeably, but it does not keep declining and rises again in other settings. Large fluctuations can also be observed on T40I10D100K.

It can be concluded that the UFC_gen algorithm may not fit datasets containing a large number of items or transactions with a long average length, because this implies a remarkable memory consumption for storing the new itemsets generated by connecting the current candidate sets. For instance, when we apply UFC_gen to T40I10D100K with a utility threshold of 0.1%, the memory consumption is so large that it exceeds the memory capacity of the JVM. The UFC_fast algorithm, however, stores the information of all itemsets in FU-lists and directly classifies the current itemset using the pruning strategies without generating new candidates. Therefore, its memory consumption on the different datasets is always much lower than that of UFC_gen, and it only requires a little extra memory to store the remaining utility of each itemset. Overall, as in the execution-time experiments, UFC_fast dramatically outperforms UFC_gen, mainly because the former does not generate new candidates while the latter does.

6.4. Classification results

By selecting different thresholds of frequency and utility, various classification results are obtained. Both algorithms produce the same classification results on all datasets. Table 6 compares the numbers of HFHUIs, HFLUIs, and LFHUIs on the tested datasets, with a fixed frequency threshold and varying utility thresholds. The chosen frequency threshold for each dataset is given in brackets in the first column of the table.

For the retail dataset, the number of HFHUIs decreases only from 97 to 54 as the utility threshold increases, while the number of LFHUIs drops dramatically from 293 to 24. Similarly, the number of LFHUIs on T40I10D100K is 23,347 when min_util is 0.1%, which is significantly large; the quantity drops rapidly to 3,086 when the threshold rises to 0.15%. As for foodmart, with an increase in the utility threshold, the numbers of HFHUIs and LFHUIs both decline, while the number of HFLUIs increases, indicating that more and more of the frequent itemsets fall into the high-frequency, low-utility category as the utility requirement becomes stricter.

Dataset min_util HFHUI HFLUI LFHUI
Foodmart (0.07%) 0.070% 224 200 400
0.075% 211 213 333
0.080% 187 237 273
0.085% 170 254 214
Retail (0.1%) 0.10% 97 22 293
0.15% 79 40 107
0.20% 66 53 47
0.25% 54 65 24
T40I10D100K (0.1%) 0.10% 243 150 23,347
0.15% 188 205 3,086
0.20% 138 255 361
0.25% 105 288 59
Table 6. Classification results on tested datasets

6.5. Scalability analysis

We use two datasets (foodmart and retail) to perform a scalability analysis of the proposed algorithms. We fix the thresholds of utility and frequency at 0.1% while increasing the number of transactions each time. As the number of transactions increases, the number of candidates tends to grow, because the newly added transactions may contain itemsets that did not appear before. Fig. 7 and Fig. 8 illustrate the execution time and memory consumption of UFC_gen and UFC_fast on foodmart and retail.

Figure 7. Scalability w.r.t. execution time
Figure 8. Scalability w.r.t. memory consumption

Both proposed algorithms have excellent scalability with regard to execution time. On foodmart, the execution time of both algorithms increases almost linearly, and although UFC_gen takes more time than UFC_fast throughout, the gap is not noticeable. For retail, the number of transactions is incremented by 10,000 each time. The running time of UFC_gen changes considerably across different numbers of transactions, taking 69.637 s at 20,000 transactions and 187.717 s at 80,000. On the other hand, the time spent by UFC_fast follows a nearly linear trend, growing smoothly and slowly; in every setting it stays within 25 s, which is much less than the time spent by UFC_gen.

On the other hand, Fig. 8 shows the memory consumption of UFC_gen and UFC_fast on both datasets, with the same division of transactions as before. The memory of UFC_fast stays almost unchanged or grows steadily, whereas that of UFC_gen often grows significantly. For example, when running both algorithms on foodmart, the memory consumption of UFC_fast increases steadily in an almost linear trend, standing at 49.9 MB when the number of transactions is 20,000 and 114.64 MB when it is 80,000. By contrast, the memory consumption of UFC_gen is 1,006.88 MB at 20,000 transactions and becomes 2,173 MB at 80,000.

6.6. Case study on market analysis

A case study was also conducted with both proposed algorithms. The real-life dataset is yoochooseBuys, a collection of sequences of click events. It covers six months of activity of a large European e-commerce business selling all kinds of goods such as garden tools, toys, clothes, and electronics. There are 507,746 transactions and 19,102 items in total, and each transaction records the quantity and the price of each item. After testing different parameters, the frequency threshold is set to range from 0.1% to 0.25%, and the utility threshold from 0.05% to 0.3%. Table 7 shows the detailed classification results.

Dataset min_util HFHUI HFLUI LFHUI
yoochooseBuys 0.05 % 148 68 348
0.1 % 86 130 127
0.15 % 62 154 70
0.20 % 50 166 40
0.25 % 42 174 31
0.30 % 28 188 21
Table 7. Classification results on yoochooseBuys
Figure 9. Execution time on yoochooseBuys
Figure 10. Memory consumption on yoochooseBuys

Fig. 9 and Fig. 10 show the performance of both algorithms with regard to execution time and memory consumption, respectively. For the running time, both algorithms perform well, presenting generally smooth trends. When UFC_fast is applied, it only takes around 5 s when the utility threshold is 0.05%, for all four frequency thresholds; for the other combinations of utility and frequency, the running time remains below 3 s, ranging from 1.180 s to 2.731 s. On the other hand, UFC_gen costs more than 20 s at the lowest utility threshold, but its time drops sharply as the utility threshold increases, after which it follows a trend similar to that of UFC_fast, and the gap between them becomes smaller. Overall, there are no dramatic differences between the two algorithms on this real-world business dataset, and both of the proposed methods behave well on it.

7. Conclusion and Future Studies

In this paper, we studied the problem of pattern classification for smart systems. Based on the original properties of frequency and utility, we developed two algorithms, UFC_gen and UFC_fast, to efficiently collect three types of patterns: HFHUIs, HFLUIs, and LFHUIs. The proposed level-wise UFC_gen algorithm adopts a level-based structure, and a number of new candidates are generated each time the connection operation is applied. UFC_fast, in contrast, utilizes a list-based structure and reduces the computational time, since the database is scanned only twice. Compared with UFC_fast, UFC_gen may be less suitable because it requires generating a large number of candidates and scanning the database multiple times during the mining process. Experimental results on real and synthetic datasets show that both UFC_gen and UFC_fast are effective for pattern classification in smart systems, and that UFC_fast performs significantly better on different types of datasets under various parameters in terms of both execution time and memory consumption. In practice, UFC_fast is more suitable for processing large-scale datasets.

For future studies, we plan to further improve the performance of both proposed algorithms. Currently, research on this kind of pattern classification is relatively scarce, and how to better combine the properties of frequency and utility still requires further exploration. Moreover, studies of pattern classification based on different models in smart systems, such as smart manufacturing, will be carried out in the future.

8. Acknowledgment

This research was supported in part by the National Natural Science Foundation of China (Grant Nos. 62002136 and 61902079), Guangzhou Basic and Applied Basic Research Foundation (Grant Nos. 202102020277 and 202102020928), Guangdong Basic and Applied Basic Research Foundation (Grant No. 2019B1515120010), Guangdong Key R&D Plan2020 (Grant No. 2020B0101090002), and National Key R&D Plan2020 (Grant No. 2020YFB1005600).

References