Peer to peer marketplaces enable both “obtaining” and “providing” valuable services, in a temporary or permanent fashion, through direct interaction between people. Travel peer to peer marketplaces such as AirBnB, HouseTrip, HomeAway, and Vayable (airbnb.com; housetrip.com; homeaway.com; vayable.com), work and service peer to peer marketplaces such as UpWork, FreeLancer, PivotDesk, ShareDesk, and Breather (upwork.com; freelancer.com; pivotdesk.com; sharedesk.net; breather.com), car sharing marketplaces such as BlaBlaCar (blablacar.com), education peer to peer marketplaces such as PopExpert (popexpert.com), and pet peer to peer marketplaces such as DogVacay (dogvacay.com) are a few examples of such marketplaces. In travel peer to peer marketplaces, for example, the service caters to accommodation rental; hosts are those providing the service (service providers), and guests, who are looking for temporary rentals, are receiving the service (service receivers). Hosts list properties, along with a set of amenities for each, while guests utilize the search interface to identify suitable properties to rent. Figure 1 presents a sample set of rental accommodations. Each row corresponds to a property and each column represents an amenity. For instance, the first property offers Breakfast, TV, and Internet as amenities but does not offer Washer.
Although sizeable effort has been devoted to designing user-friendly search tools that assist service receivers in the search process, little effort has been recorded to date to build tools to assist service providers. Consider for example a host in a travel peer to peer marketplace; while listing a property in the service for (temporary) rent, the host is faced with various choices. Although some amenities in the property are relatively fixed, such as the number of rooms for rent or the existence of an elevator, others are relatively flexible, for example offering Breakfast or TV as an amenity. Flexible amenities can be added without significant effort. Although amenities make sense in the context of travel peer to peer marketplaces (as part of the standard terminology used in the service), for a general peer to peer marketplace we use the term attribute and refer to the subsequent choice of attributes as flexible attributes.
Service providers participate in the service with specified objectives; for instance hosts may want to increase overall occupancy and/or optimize their anticipated revenue. Since there is a cost (e.g., monetary base cost to the host to offer internet) associated with each flexible attribute, it is challenging for service providers to choose the set of flexible attributes to offer given some budget limitations (constraints). An informed choice of attributes to offer should maximize the objectives of the service provider in each case subject to any constraints. Objectives may vary by application; for example an objective could be to maximize the number of times a listing appears in search results, or to improve its position in the search result ranking. This necessitates the existence of functions that relate flexible attributes to such objectives in order to aid the service provider’s decision. We refer to the service provider’s objectives in a generic sense as gain and to the functions that relate attributes to gain as gain functions in what follows.
In this paper, we aim to assist service providers in peer to peer marketplaces by suggesting those flexible attributes which maximize their gain. Given a service with known flexible attributes and budget limitation, our objective is to identify a set of attributes to suggest to service providers in order to maximize the gain. We refer to this problem as Gain Maximization over Flexible Attributes (GMFA). Since the target applications involve mainly ordinal attributes, in this paper, we focus our attention on ordinal attributes and we assume that numeric attributes (if any) are suitably discretized. Without loss of generality, we first design our algorithms for binary attributes, and provide the extension to ordinal attributes in § V-A.
Our contribution in this paper is twofold. First, we formally define the general problem of Gain Maximization over Flexible Attributes (GMFA) in peer to peer marketplaces and, as our main contribution, propose a general solution which is applicable to a general class of gain functions. Second, without making any assumption on the existence of extra information other than the dataset itself, we introduce the notion of frequent-item based count as a simple yet compelling gain function in the absence of other sources of information. As our first contribution, using a reduction from the quadratic knapsack problem [18, 8], we prove that (i) the general GMFA is NP-hard, and (ii) there is no approximate algorithm with a fixed ratio for GMFA unless there is one for the quadratic knapsack problem. We provide a (practically) efficient exact algorithm for the GMFA problem for a general class of monotonic gain functions (monotonicity of the gain function simply means that adding a new attribute does not reduce the gain). This generic proposal is due to the fact that gain function design is application specific and depends on the available information. Thus, instead of limiting the solution to a specific application, the proposed algorithm gives the freedom to easily plug in any arbitrary gain function. In other words, it works for any arbitrary monotonic gain function no matter how and based on what data it is designed. In a rational setting in which attributes on offer add value, we expect that all gain functions will be monotonic. More specifically, given any user defined monotonic gain function, we first propose an algorithm called I-GMFA (Improved GMFA) that exploits the properties of the monotonic function to suggest efficient ways to explore the solution space. Next, we introduce several techniques to speed up this algorithm, both theoretically and practically, changing the way the traversal of the search space is performed.
To do so, we propose the G-GMFA (General GMFA) algorithm which transforms the underlying problem structure from a lattice to a tree, preorders the attributes, and amortizes the computation cost over different nodes during the traversal.
The next part of our contribution focuses on the gain function design. Gain functions can vary depending on the underlying objectives and on extra information, such as a weighting of attributes based on some criteria (e.g., importance), that can be naturally incorporated in our framework without changes to the algorithm. The gain function design, as discussed in Appendix V-B, is application specific and may vary depending on the availability of information such as query logs or reviews; thus, rather than assuming the existence of any specific extra information, we introduce the notion of frequent-item based count (FBC), which utilizes nothing but the existing tuples in the database to define the notion of gain in the absence of extra information. Therefore, even applications with limited access to the data, such as a third party service for assisting the service providers that may only have access to the dataset tuples, can utilize G-GMFA while applying FBC inside it. The motivation behind the definition of FBC is that (rational) service providers provide attributes based on demand. For example, in Figure 1 the existence of TV and Internet together in more than half of the rows indicates the demand for this combination of amenities. Also, as shown in the real case study provided in § VI-C, the popularity of Breakfast in the rentals located in Paris indicates the demand for this amenity there. Since counting the number of frequent itemsets is #P-complete, computing the FBC is challenging. In contrast with a simple algorithm that is an adaptation of the Apriori algorithm, we propose a practical output-sensitive algorithm for computing FBC that runs in time linear in its output value. The algorithm uses an innovative approach that avoids iterating over the frequent attribute combinations by partitioning them into disjoint sets and calculating the FBC as the summation of their cardinalities.
In summary, we make the following contributions in this paper.
We introduce the notion of flexible attributes and the novel problem of gain maximization over flexible attributes (GMFA) in peer to peer marketplaces.
We prove that the general GMFA problem is NP-hard and we prove the difficulty of designing an approximate algorithm.
For the general GMFA problem, we propose an algorithm called I-GMFA (Improved GMFA) that exploits the properties of the monotonic function to suggest efficient ways to explore the solution space.
We propose the G-GMFA (General GMFA) algorithm which transforms the underlying problem structure from a lattice to a tree, preorders the attributes, and amortizes the computation cost over nodes during the traversal. Given the application specific nature of the gain function design, G-GMFA is designed such that any arbitrary monotonic gain function can simply get plugged into it.
While not promoting any specific gain function, without any assumption on the existence of extra information other than the dataset itself, we propose frequent-item based count (FBC) as a simple yet compelling gain function in the absence of other sources of data.
In contrast with the simple Apriori-based algorithm, we propose and present the algorithm FBC to efficiently assess gain and demonstrate its practical significance.
We present the results of a comprehensive performance study on a real dataset from AirBnB to evaluate the proposed algorithms. Also, in a real case study, we illustrate the practicality of the approaches.
This paper is organized as follows. § II provides formal definitions and introduces notation, formally stating the problem we focus on and its associated complexity. We propose the exact algorithm for the general class of monotonic gain functions in § III. In § IV, we study the gain function design and propose an alternative gain function for the absence of user preferences. The experimental results are provided in § VI, related work is discussed in § VII, and the paper is concluded in § VIII.
Dataset Model: We model the entities under consideration in a peer to peer marketplace as a dataset $\mathcal{D}$ with $n$ tuples and $m$ attributes $\mathcal{A} = \{A_1, \dots, A_m\}$. For a tuple $t \in \mathcal{D}$, we use $t[A_i]$ to denote the value of the attribute $A_i$ in $t$. Figure 1 presents a sample set of rental accommodations. Each row corresponds to a tuple (property) and each column represents an attribute. For example, the first property offers Breakfast, TV, and Internet as amenities but does not offer Washer. Note that, since the target applications involve mainly ordinal attributes, we focus our attention on such attributes and we assume that numeric attributes (if any) are suitably discretized. Without loss of generality, throughout the paper, we consider the attributes to be binary and defer the extension of the algorithms to ordinal attributes to § V-A. We use $\mathcal{A}_t$ to refer to the set of attributes for which $t[A_i]$ is non zero, i.e., $\mathcal{A}_t = \{A_i \in \mathcal{A} \mid t[A_i] \neq 0\}$, and $|\mathcal{A}_t|$ denotes its size.
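Query Model: Given the dataset $\mathcal{D}$ and a set of binary attributes $\mathcal{A}_i \subseteq \mathcal{A}$, the query $Q(\mathcal{A}_i, \mathcal{D})$ returns the set of tuples in $\mathcal{D}$ that contain $\mathcal{A}_i$ as their attributes; formally, $Q(\mathcal{A}_i, \mathcal{D}) = \{t \in \mathcal{D} \mid \mathcal{A}_i \subseteq \mathcal{A}_t\}$, where $\mathcal{A}_t$ is the set of non-zero attributes of $t$.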
Similarly, the query model for the ordinal attributes is as follows: given the dataset , the set of ordinal attributes , and values where is a value in the domain of , returns the tuples in that for attribute , .
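The binary query model can be illustrated with a minimal sketch. It assumes each tuple is stored as a Python dict from attribute name to 0/1; the function name `query` and the toy rows are illustrative assumptions (only the first row mirrors the property described in the text, the others are made up).

```python
def query(dataset, attrs):
    """Return the tuples whose non-zero attributes include every attribute in attrs."""
    wanted = set(attrs)
    result = []
    for t in dataset:
        # The set of non-zero attributes of tuple t.
        nonzero = {a for a, value in t.items() if value != 0}
        if wanted <= nonzero:
            result.append(t)
    return result


# Hypothetical toy dataset; it is not the content of Figure 1.
toy = [
    {"Breakfast": 1, "TV": 1, "Internet": 1, "Washer": 0},
    {"Breakfast": 0, "TV": 1, "Internet": 1, "Washer": 1},
    {"Breakfast": 0, "TV": 0, "Internet": 1, "Washer": 0},
]

# Tuples that offer both TV and Internet.
matches = query(toy, {"TV", "Internet"})
```

An empty attribute set matches every tuple, which is the behaviour one would expect from a containment query.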
Flexible Attribute Model: In this paper, we assume an underlying cost (which, depending on the application, may represent a monetary value) associated with each attribute , i.e., a flexible attribute can be added to a tuple by incurring . For example, the costs of providing attributes Breakfast, TV, Internet, and Washer, in Figure 1, on an annual basis, are . For the ordinal attributes, represents the cost of changing the value of from to . Our approach places no restrictions on the number of flexible attributes in . For ease of explanation, in the rest of the paper we assume all the attributes in are flexible.
We also assume the existence of a gain function , that given the dataset , for a given attribute combination , provides a score showing how desirable is. For example in a travel peer to peer marketplace, given a set of amenities, such a function could quantify the anticipated gain (e.g., visibility) for a host if a subset of these amenities are provided by the host on a certain property.
Table I presents a summary of the notation used in this paper. Additional notation for § IV is provided in Table II. Next, we formally define the general Gain Maximization over Flexible Attributes (GMFA) problem in peer to peer marketplaces.
|The set of the attributes in database|
|The size of|
|The number of tuples in database|
|The value of attribute in tuple|
|The set of non-zero attributes in tuple|
|The cost to change the binary attribute to|
|The gain function|
|The lattice of attribute combinations|
|The set of nodes in|
|The bit representative of the node|
|The node with attribute combination|
|The node with the bit representative|
|The level of the node|
|The cost associated with the node|
|parents(,)||The parents of the node in|
|The index of the right-most zero in|
|parent||The parent of in the tree data structure|
II-A General Problem Definition
We define the general problem of Gain Maximization over Flexible Attributes (GMFA) in peer to peer marketplaces as a constrained optimization problem. The general problem is agnostic to the choice of the gain function. Given $gain(\cdot)$, a service provider with a certain budget $B$ strives to maximize $gain(\cdot)$ by considering the addition of flexible attributes to the service. For example, in a travel peer to peer marketplace, a host who owns an accommodation ($t$) and has a limited (monetary) budget $B$ aims to determine which amenities should be offered in the property such that the costs to the host of offering the amenities are within the budget $B$, and the gain resulting from offering the amenities is maximized. (In addition to the input set of attributes, the gain function may depend on other variables such as the number of attributes; one such function is discussed in § IV.) Formally, our GMFA problem is defined as an optimization problem as shown in Figure 2.
We next discuss the complexity of the general GMFA problem and we show the difficulty of designing an approximation algorithm with a constant approximate ratio for this problem.
II-B Computational Complexity
We prove that GMFA is NP-hard by reduction from the quadratic knapsack problem [18, 8], which is NP-complete. (Please note that GMFA is NP-complete even for polynomial time gain functions.) The reduction from QKP shows that QKP can be modeled as an instance of GMFA; thus, a solution for QKP cannot be used to solve GMFA.
Quadratic 0/1 knapsack (QKP): Given a set of items , each with a weight , a knapsack with capacity , and the profit function , determine a subset of items such that their cumulative weights do not exceed and the profit is maximized. Note that the profit function is defined for the choice of individual items but also considers an extra profit that can be earned if two items are selected jointly. In other words, for items, is a matrix of , where the diagonal of matrix () depicts the profit of item and an element in () represents the profit of picking item , and item together.
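To make the profit function concrete, here is a minimal sketch of how a QKP profit can be evaluated and maximized by brute force. The names `qkp_profit` and `qkp_brute_force`, the variables `P`, `weights` and `capacity`, and the numbers are illustrative assumptions; the sketch also assumes that the joint profit of a pair is stored once, in the upper triangle of the matrix. In the reduction discussed next, a function of this shape plays the role of the gain function, with items standing in for attributes.

```python
from itertools import combinations


def qkp_profit(P, items):
    """Profit of a set of item indices: individual profits plus pairwise joint profits."""
    chosen = sorted(items)
    individual = sum(P[i][i] for i in chosen)
    # Assumption: the joint profit of items i and j (i < j) is stored once, at P[i][j].
    joint = sum(P[i][j] for i, j in combinations(chosen, 2))
    return individual + joint


def qkp_brute_force(P, weights, capacity):
    """Try every subset of items; exponential, for illustration only."""
    n = len(weights)
    best_items, best_profit = (), 0
    for r in range(1, n + 1):
        for items in combinations(range(n), r):
            if sum(weights[i] for i in items) <= capacity:
                profit = qkp_profit(P, items)
                if profit > best_profit:
                    best_items, best_profit = items, profit
    return best_items, best_profit


# Hypothetical instance with three items.
P = [[3, 2, 0],
     [0, 4, 5],
     [0, 0, 1]]
weights = [2, 3, 4]
best_items, best_profit = qkp_brute_force(P, weights, capacity=7)
```

With these numbers, items 1 and 2 fit exactly in the knapsack and their joint profit outweighs the individually more profitable item 0.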
Theorem 1. The problem of Gain Maximization over Flexible Attributes (GMFA) is NP-hard.
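Proof. The decision version of GMFA is defined as follows: given a decision value $k$, a dataset $\mathcal{D}$ with the set of flexible attributes $\mathcal{A}$, associated costs $cost[\cdot]$, a gain function $gain(\cdot)$, a budget $B$, and a tuple $t$, decide if there is an $\mathcal{A}' \subseteq \mathcal{A} \setminus \mathcal{A}_t$ such that $\sum_{A_i \in \mathcal{A}'} cost[A_i] \le B$ and $gain(\mathcal{A}_t \cup \mathcal{A}') \ge k$.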
We reduce the decision version of quadratic knapsack problem (QKP) to the decision version of GMFA and argue that the solution to QKP exists, if and only if, a solution to our problem exists. The decision version of QKP, in addition to , , , and , accepts the decision variable and decides if there is a subset of that can be accommodated to the knapsack with profit .
A mapping between QKP to GMFA is constructed as follows: The set of items is mapped to flexible attributes , the weight of items is mapped to the cost of the attributes , the capacity of the knapsack to budget , and the decision value in QKP to the value in GMFA. Moreover, we set and ; the gain function, in this instance of GMFA, can be constructed based on the profit matrix as follows:
The answer to the QKP is yes (resp. no) if the answer to its corresponding GMFA is yes (resp. no).
Note that GMFA belongs to the NP-complete class only for the functions that are polynomial. In those cases the verification of whether or not a given subset , has cost less than or equal to and a gain at least equal to can be performed in polynomial time.
In addition to the complexity, the reduction from the quadratic knapsack problem shows the difficulty of designing an approximate algorithm for GMFA. Rader et al. prove that QKP does not have a polynomial time approximation algorithm with fixed approximation ratio unless P=NP. Even for the cases that , it is an open problem whether or not there is an approximate algorithm with a fixed approximate ratio for QKP. In Theorem 2, we show that a polynomial approximate algorithm with a fixed approximate ratio for GMFA guarantees a fixed approximate ratio for QKP, and its existence contradicts the result of Rader et al. Furthermore, studies on constrained set function optimization also acknowledge that maximizing a monotone set function up to an acceptable approximation, even subject to simple constraints, is not possible.
Theorem 2. There is no polynomial-time approximate algorithm with a fixed approximate ratio for GMFA unless there is an approximate algorithm with a constant approximate ratio for QKP.
Suppose there is an approximate algorithm with a constant approximate ratio for GMFA. Let be the attribute combination returned by the approximate algorithm and be the optimal solution. Since the approximate ratio is :
Based on the mapping provided in the proof of Theorem 1, we first show that is the corresponding set of items for in the optimal solution QKP. If there is a set of items for which the profit is higher than , due to the choice of in the mapping, its corresponding attribute combination in GMFA has a higher than , which contradicts the fact that is the optimal solution of GMFA. Now since :
Thus, the profit of the optimal set of items () is at most times the profit of the set of items () returned by the approximate algorithm, giving the approximate ratio of for the quadratic knapsack problem.
III Exact Solution
Considering the negative result of Theorem 2, we turn our attention to the design of an exact algorithm for the GMFA problem; even though this algorithm will be exponential in the worst case, we will demonstrate that it is efficient in practice. In this section, our focus is on providing a solution for GMFA over any monotonic gain function. A gain function $gain(\cdot)$ is monotonic if, given two sets of attributes $\mathcal{A}_i$ and $\mathcal{A}_j$ where $\mathcal{A}_j \subseteq \mathcal{A}_i$, $gain(\mathcal{A}_j) \le gain(\mathcal{A}_i)$. As a result, this section provides a general solution that works for any monotonic gain function, no matter how and based on what data it is designed. In fact, considering a non-monotonic function for gain is not reasonable here, because adding more attributes to a tuple (service) should not decrease the gain. For ease of explanation, we first provide the following definitions and notation. Then we discuss an initial solution in § III-A, which leads to our final algorithm in § III-B.
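Definition 1 (Lattice of Attribute Combination): Given an attribute combination $\mathcal{A}_i$, the lattice of $\mathcal{A}_i$ is defined as $\mathcal{L}_{\mathcal{A}_i} = (V, E)$, where the nodeset $V$, depicted as $V(\mathcal{L}_{\mathcal{A}_i})$, corresponds to the set of all subsets of $\mathcal{A}_i$; thus, for every $\mathcal{A}_j \subseteq \mathcal{A}_i$, there exists a one to one mapping between each node $v_j \in V$ and each $\mathcal{A}_j$. Each node $v_j$ is associated with a bit representative $\mathcal{B}(v_j)$ of length $m$ in which bit $k$ is 1 if $A_k \in \mathcal{A}_j$ and 0 otherwise. For consistency, for each node $v_j$ in $V$, the index $j$ is the decimal value of $\mathcal{B}(v_j)$. Given the bit representative $\mathcal{B}(v_j)$ we define the function $v(\mathcal{B}(v_j))$ to return $v_j$. In the lattice an edge $\langle v_j, v_k \rangle \in E$ exists if $\mathcal{A}_k \subset \mathcal{A}_j$ and $\mathcal{B}(v_j)$, $\mathcal{B}(v_k)$ differ in only one bit. Thus, $v_j$ (resp. $v_k$) is a parent (resp. child) of $v_k$ (resp. $v_j$) in the lattice. For each node $v_j \in V$, the level of $v_j$, denoted by $\ell(v_j)$, is defined as the number of 1’s in the bit representative of $v_j$. In addition, every node $v_j$ is associated with a cost defined as $cost(v_j) = \sum_{A_k \in \mathcal{A}_j} cost[A_k]$.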
Definition 2 (Maximal Affordable Node): A node $v_i$ is affordable iff $cost(v_i) \le B$, where $B$ is the budget; otherwise it is unaffordable. An affordable node $v_i$ is maximal affordable iff every node in the parents of $v_i$ is unaffordable.
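The two definitions above can be sketched with bit strings as node representatives. The helper names, the attribute costs and the budget below are illustrative assumptions (the costs are not the values used in the paper's running example).

```python
ALL_ATTRS = ["Breakfast", "TV", "Internet", "Washer"]
COSTS = [5, 3, 2, 4]   # hypothetical costs, aligned with ALL_ATTRS
BUDGET = 7             # hypothetical budget


def bit_rep(attrs, all_attrs=ALL_ATTRS):
    """Bit representative: position k is '1' iff the k-th attribute is in attrs."""
    return "".join("1" if a in attrs else "0" for a in all_attrs)


def level(bits):
    """Level of a node: the number of 1's in its bit representative."""
    return bits.count("1")


def cost(bits, costs=COSTS):
    """Cost of a node: the sum of the costs of its attributes."""
    return sum(c for b, c in zip(bits, costs) if b == "1")


def parents(bits):
    """Lattice parents: flip exactly one 0 to 1."""
    return [bits[:k] + "1" + bits[k + 1:] for k, b in enumerate(bits) if b == "0"]


def is_maximal_affordable(bits, costs=COSTS, budget=BUDGET):
    """Affordable, and every lattice parent is unaffordable."""
    if cost(bits, costs) > budget:
        return False
    return all(cost(p, costs) > budget for p in parents(bits))


node = bit_rep({"Breakfast", "Internet"})
```

Under these assumed costs, the node for {Breakfast, Internet} costs exactly the budget and both of its parents exceed it, so it is maximal affordable, while the node for {Internet} alone is affordable but not maximal.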
Example 1: As a running example throughout the paper, consider as shown in Figure 1, defined over the set of attributes :Breakfast, :TV, :Internet, :Washer with cost to provide these attributes as . Assume the budget is and that the property does not offer these attributes/amenities, i.e., .
Figure 3 presents over these four attributes. The bit representative for the highlighted node in the figure is representing the set of attributes :Breakfast, : Internet; The level of is , and it is the parent of nodes and with the bit representatives and . Since and the cost of is , is an affordable node; however, since its parent the cost and is affordable, is not a maximal affordable node. and with bit representatives and , the parents of , are unaffordable; thus is a maximal affordable node.
A baseline approach for the GMFA problem is to examine all the nodes of . Since for every node the algorithm needs to compute the gain, its running time is in , where is the computation cost associated with the gain function.
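A minimal sketch of such an exhaustive baseline follows. The function name `baseline_gmfa`, the costs, the budget and the additive toy gain function are illustrative assumptions; this sketch evaluates the gain only on affordable combinations, and any monotonic gain function could be passed in instead.

```python
from itertools import combinations


def baseline_gmfa(costs, budget, gain):
    """Enumerate every attribute combination and keep the affordable one with maximum gain."""
    attrs = list(costs)
    best, best_gain = None, float("-inf")
    for r in range(len(attrs) + 1):
        for combo in combinations(attrs, r):
            if sum(costs[a] for a in combo) <= budget:
                g = gain(frozenset(combo))
                if g > best_gain:
                    best, best_gain = frozenset(combo), g
    return best, best_gain


# Hypothetical costs and a hypothetical additive (hence monotonic) gain function.
TOY_COSTS = {"Breakfast": 5, "TV": 3, "Internet": 2, "Washer": 4}
TOY_WEIGHTS = {"Breakfast": 4, "TV": 2, "Internet": 3, "Washer": 3}


def toy_gain(attrs):
    return sum(TOY_WEIGHTS[a] for a in attrs)


best, best_gain = baseline_gmfa(TOY_COSTS, 7, toy_gain)
```

The loop visits all subsets of the flexible attributes, which is why the approach is exponential in the number of attributes.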
As a first algorithm, we improve upon this baseline by leveraging the monotonicity of the gain function, which enables us to prune some of the branches in the lattice while searching for the optimal solution. This algorithm is described in the next subsection as improved GMFA (I-GMFA). Then, we discuss drawbacks and propose a general algorithm for the GMFA problem in Section III-B.
An algorithm for GMFA can identify the maximal affordable nodes and return the one with the maximum gain. Given a node $v_i$, due to the monotonicity of the gain function, for any child $v_j$ of $v_i$, $gain(\mathcal{A}_j) \le gain(\mathcal{A}_i)$. Consequently, when searching for an affordable node that maximizes gain, one can ignore all the nodes that are not maximal affordable. Thus, our goal is to efficiently identify the maximal affordable nodes, while pruning all the nodes in their sublattices. Algorithm 1 presents a top-down BFS (breadth first search) traversal of the lattice starting from its root (i.e., the node that contains all attributes). One could instead design a bottom-up algorithm that starts from the bottom of the lattice, keeps ascending it in a BFS manner, and stops at the maximal affordable nodes; we did not include it due to its similarity to the top-down approach. To determine whether a node should be pruned or not, the algorithm checks if the node has an affordable parent, and if so, prunes it. In Example 1, since (:TV, :Internet, :Washer) is (and it does not have any affordable parents), is a maximal affordable node; thus the algorithm prunes the sublattice under it. For the nodes with cost more than the budget, the algorithm generates their children and, if they are not already in the queue, adds them (lines 14 to 16).
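The pruning idea can be sketched as follows. This is a simplified illustration and not the paper's Algorithm 1: nodes are frozensets, the affordability of a parent is checked by recomputing its cost rather than by looking it up in a stored feasible set, and costs are assumed to be non-negative. The function name `i_gmfa` and the toy numbers are assumptions.

```python
from collections import deque


def i_gmfa(costs, budget, gain):
    """Top-down BFS that evaluates the gain only on maximal affordable nodes."""
    all_attrs = frozenset(costs)

    def cost(node):
        return sum(costs[a] for a in node)

    queue, seen = deque([all_attrs]), {all_attrs}
    best, best_gain = None, float("-inf")
    while queue:
        node = queue.popleft()
        if cost(node) > budget:
            # Unaffordable: descend to the children that are not queued yet.
            for a in node:
                child = node - {a}
                if child not in seen:
                    seen.add(child)
                    queue.append(child)
        elif all(cost(node | {a}) > budget for a in all_attrs - node):
            # Affordable with no affordable parent: maximal affordable.
            g = gain(node)
            if g > best_gain:
                best, best_gain = node, g
        # Otherwise the node has an affordable parent and is pruned.
    return best, best_gain


# Hypothetical costs and a hypothetical additive (hence monotonic) gain function.
TOY_COSTS = {"Breakfast": 5, "TV": 3, "Internet": 2, "Washer": 4}
TOY_WEIGHTS = {"Breakfast": 4, "TV": 2, "Internet": 3, "Washer": 3}


def toy_gain(attrs):
    return sum(TOY_WEIGHTS[a] for a in attrs)


best, best_gain = i_gmfa(TOY_COSTS, 7, toy_gain)
```

Because affordable nodes are never expanded, the sublattice below each of them is skipped unless one of its nodes is reached through some other unaffordable parent.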
III-B General GMFA Solution
Algorithm 1 conducts a BFS over the lattice of attribute combinations, which makes the time complexity of the algorithm dependent on the number of edges of the lattice. For every node in the lattice, Algorithm 1 may generate all of its children and parents. Yet for every generated child (resp. parent), the algorithm checks if the queue (resp. feasible set) contains it. Moreover, it stores the set of feasible nodes to determine the feasibility of their parents. In this section, we discuss these drawbacks in detail and propose a general approach as algorithm G-GMFA.
III-B1 The problem with multiple children generation
Algorithm 1 generates all children of an unaffordable node. Thus, if a node has multiple unaffordable parents, Algorithm 1 will generate the children multiple times, even though they will be added to the queue once. In Figure 3, node (:Breakfast, :Washer and )) will be generated twice by the unaffordable parents and with the bit representatives and ; the children will be added to the queue once, while subsequent attempts for adding them will be ignored as they are already in the queue.
A natural question is whether it is possible to design a strategy that, for each level in the lattice, (i) makes sure we generate all the non-pruned children and (ii) guarantees to generate the children of each node only once.
Tree construction: To address this, we adopt the one-to-all broadcast algorithm in a hypercube, constructing a tree that guarantees to generate each node in the lattice only once. As a result, since the generation of each node is unique, further checks before adding the node to the queue are not required. The algorithm works as follows: Considering the bit representation of a node , let be the right-most zero in . The algorithm first identifies ; then it complements the bits in the right side of one by one to generate the children of . Figure 6 demonstrates the resulting tree for the lattice of Figure 3 for this algorithm. For example, consider the node () in the figure; is (for attribute ). Thus, nodes and with the bit representatives and are generated as its children.
As shown in Figure 6, i) the children of a node are generated once, and ii) all nodes in are generated; that is because every node has one (and only one) parent in the tree structure, identified by flipping the bit in to one. We use parent to refer to the parent of the node in the tree structure. For example, for in Figure 3, since is , its parent in the tree is parent (). Also, note that in order to identify there is no need to search in to identify it. Based on the way is constructed, is the bit that has been flipped by its parent to generate it.
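The child-generation rule can be sketched with integers as bit representatives, where the low-order bit of the integer is the right-most bit of the string. The function names `tree_children` and `tree_parent` are illustrative assumptions.

```python
def tree_children(mask, m):
    """Children in the tree: clear, one at a time, each bit to the right of the right-most zero."""
    children = []
    k = 0
    # Bits 0..k-1 are all ones; the first zero bit (if any) stops the loop.
    while k < m and (mask >> k) & 1:
        children.append(mask & ~(1 << k))
        k += 1
    return children


def tree_parent(mask, m):
    """Unique parent in the tree: set the right-most zero bit to one (None for the root)."""
    for k in range(m):
        if not (mask >> k) & 1:
            return mask | (1 << k)
    return None


# Enumerate the whole tree for m = 4 attributes, starting from the root 1111.
m = 4
root = (1 << m) - 1
generated = []
stack = [root]
while stack:
    v = stack.pop()
    generated.append(v)
    stack.extend(tree_children(v, m))
```

For instance, the node 1011 has its right-most zero in the second position from the left, so its children in the tree are 1010 and 1001, and the traversal above reaches each of the sixteen nodes exactly once.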
Thus, by transforming the lattice to a tree, some of the nodes that could be generated by Algorithm 1, will not be generated in the tree. Figure 6 represents an example where the node will be generated in the lattice (Figure 6(a)) but it will be immediately pruned in the tree (Figure 6(b)). In the lattice, node will be generated by the unaffordable parent , whereas in the tree, Figure 6(b), will be pruned, as parent is affordable. According to Definition 2, a node that has at least one affordable parent is not maximal affordable. Since in these cases parent is affordable, there is no need to generate them. Note that we only present one step of running Algorithm 1 in the lattice and the tree. Even though node is generated (and added to the queue) in Algorithm 1, it will be pruned in the next step (lines 7 to 9 in Algorithm 1).
III-B2 The problem with checking all parents in the lattice
The problem with multiple generations of children has been resolved by constructing a tree. The pruning strategy is to stop traversing a node in the lattice when a node has at least one affordable parent. In Figure 6, if for a node , parent is affordable, will not be generated. However, this does not imply that if a is generated in the tree it does not have an affordable parent in the lattice. For example, consider ( and :Breakfast) in Figure 6. We enlarge that part of the tree in Figure 6. As presented in Figure 6, parent is unaffordable and thus is generated in the tree. However, by consulting the lattice, has the affordable parent (:Breakfast, :Internet); thus is not maximal affordable.
In order to decide if an affordable node is maximal affordable, one has to check all its parents in the lattice (not the tree). If at least one of its parents in the lattice is affordable, it is not maximal affordable. Thus, even though we construct a tree to avoid generating the children multiple times, we may end up checking all edges in the lattice since we have to check the affordability of all parents of a node in the lattice. To tackle this problem we exploit the monotonicity of the cost function to construct the tree such that for a given node we only check the affordability of the node’s parent in the tree (not the lattice).
In the lattice, each child has one less attribute than its parents. Thus, for a node , one can simply determine the parent with the minimum cost (cheapest parent) by considering the cheapest attribute in that does not belong to . In Figure 6, the cheapest parent of is because Internet is the cheapest missing attribute in . The key observation is that, for a node , if the parent with minimum cost is not affordable, none of the other parents is affordable; on the other hand, if the cheapest parent is affordable, there is no need to check the other parents as this node is not maximal affordable. In the same example, is not maximal affordable since its cheapest parent has a cost less than the budget, i.e. ().
Consequently, one only has to identify the least cost missing attribute and check if its cost plus the cost of attributes in the combination is at most . Identifying the missing attribute with the smallest cost is in . For each node , is the bit that has been flipped by parent to generate it. For example, consider in Figure 6; since , parent. We can use this information to reorder the attributes and instantly get the cheapest missing attribute in . The key idea is that if we originally order the attributes from the most expensive to the cheapest, is the index of the cheapest attribute. Moreover, adding the cheapest missing attribute generates parent. Therefore, if the attributes are sorted on their cost in descending order, a node with an affordable parent in the lattice will never be generated in the tree data structure. Consequently, after presorting the attributes, there is no need to check if a node in the queue has an affordable parent.
Sorting the attributes is in . In addition, computing the cost of a node is thus performed in constant time, using the cost of its parent in the tree. For each node , is . Applying all these ideas the final algorithm is in and . Algorithm 2 presents the pseudo-code of our final approach, G-GMFA.
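Putting the tree construction and the presorting together, the overall idea can be sketched as below. This is a simplified illustration under stated assumptions, not the paper's Algorithm 2: the name `g_gmfa`, the dict-based interface and the toy numbers are hypothetical, and costs are assumed to be non-negative. Attributes are ordered so that the right-most (low-order) bit is the cheapest one, and each child's cost is derived from its tree parent's cost in constant time.

```python
from collections import deque


def g_gmfa(costs, budget, gain):
    """Tree traversal over presorted attributes; every affordable node reached is maximal affordable."""
    # by_cost[k] is the attribute of bit k; bit 0 (right-most) is the cheapest,
    # so reading the bit string left to right goes from most expensive to cheapest.
    by_cost = sorted(costs, key=costs.get)
    m = len(by_cost)
    root = (1 << m) - 1
    best, best_gain = None, float("-inf")
    queue = deque([(root, sum(costs.values()))])
    while queue:
        mask, cost = queue.popleft()
        if cost <= budget:
            # The tree parent (the cheapest lattice parent) was unaffordable,
            # so no further parent check is needed before evaluating the gain.
            attrs = frozenset(by_cost[k] for k in range(m) if (mask >> k) & 1)
            g = gain(attrs)
            if g > best_gain:
                best, best_gain = attrs, g
            continue
        k = 0
        while k < m and (mask >> k) & 1:
            # Child cost is computed from the parent's cost in constant time.
            queue.append((mask & ~(1 << k), cost - costs[by_cost[k]]))
            k += 1
    return best, best_gain


# Hypothetical costs and a hypothetical additive (hence monotonic) gain function.
TOY_COSTS = {"Breakfast": 5, "TV": 3, "Internet": 2, "Washer": 4}
TOY_WEIGHTS = {"Breakfast": 4, "TV": 2, "Internet": 3, "Washer": 3}


def toy_gain(attrs):
    return sum(TOY_WEIGHTS[a] for a in attrs)


best, best_gain = g_gmfa(TOY_COSTS, 7, toy_gain)
```

Any monotonic gain function with the same call signature can be passed in place of `toy_gain`, which reflects the plug-in design described in the text.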
IV Gain Function Design
As the main contribution of this paper, in § III, we proposed a general solution that works for any arbitrary monotonic gain function. We conducted our presentation for a generic gain function because the design of the gain function is application specific and depends on the available information. The application specific nature of the gain function design, motivated the consideration of the generic gain function, instead of promoting a specific function. Consequently, applying any monotonic gain function in Algorithm 2 is as simple as calling it in line 8.
In our work, the focus is on understanding which subsets of attributes are attractive to users. Based on the application, in addition to the dataset , some extra information (such as query logs and user ratings) may be available that helps in understanding the desire for combinations of attributes and could be a basis for the design of such a function. However, such comprehensive information that reflects user preferences is not always available. Consider a third party service for assisting the service providers. Such services have a limited view of the data and may only have access to the dataset tuples. An example of such third party services is AirDNA (www.airdna.co), which is built on top of AirBnB. Therefore, instead of focusing on a specific application and assuming the existence of extra information, in the rest of this section we focus on a simple, yet compelling variant of a practical gain function that only utilizes the existing tuples in the dataset to define the notion of gain in the absence of other sources of information. We provide a general discussion of gain functions with extra information in Appendix V-B.
IV-A Frequent-item Based Count (FBC)
In this section, we propose a practical gain function that only utilizes the existing tuples in the dataset. It hinges on the observation that the bulk of market participants are expected to behave rationally. Thus, goods on offer are expected to follow a basic supply and demand principle. For example, based on the case study provided in § VI-C, while many of the properties in Paris offer Breakfast, offering it is not popular in New York City. This indicates a relatively high demand for such an amenity in Paris and relatively low demand in New York City. As another example, there are many accommodations that provide washer, dryer, and iron together; providing dryer without a washer and iron is rare. This reveals a need for the combination of these attributes. Utilizing this intuition, we define a frequent node in as follows:
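Frequent Node: Given a dataset $\mathcal{D}$ and a threshold $\tau$, a node $v_i$ is frequent if and only if the number of tuples in $\mathcal{D}$ containing the attributes $\mathcal{A}(v_i)$ is at least $\tau$ times $|\mathcal{D}|$, i.e., $|Q(v_i,\mathcal{D})| \ge \tau\,|\mathcal{D}|$, where $Q(v_i,\mathcal{D})$ denotes the set of tuples in $\mathcal{D}$ that contain the attributes $\mathcal{A}(v_i)$.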
For instance, in Example 1 let be . In Figure 1, is frequent because Accom. 2, Accom. 5, and Accom. 9 contain the attributes :Internet, :Washer; thus . However, since is , is not frequent. The additional notation used in § IV is listed in Table II.
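The frequency test in the definition above can be illustrated with a minimal Python sketch. The dataset below is a hypothetical toy example (it is not the data of Figure 1), and the names `TOY_DATASET`, `support`, and `is_frequent` are assumptions introduced only for illustration.

```python
from typing import FrozenSet, List

# Hypothetical toy dataset (NOT the paper's Figure 1): each tuple is the
# set of attributes that one listing offers.
TOY_DATASET: List[FrozenSet[str]] = [
    frozenset({"Breakfast", "TV", "Internet"}),
    frozenset({"Internet", "Washer"}),
    frozenset({"TV", "Internet", "Washer"}),
    frozenset({"Breakfast"}),
    frozenset({"Internet", "Washer"}),
]


def support(node: FrozenSet[str], dataset: List[FrozenSet[str]]) -> int:
    """Number of tuples that contain every attribute of the node."""
    return sum(1 for t in dataset if node <= t)


def is_frequent(node: FrozenSet[str], dataset: List[FrozenSet[str]], tau: float) -> bool:
    """A node is frequent iff its support is at least tau times the dataset size."""
    return support(node, dataset) >= tau * len(dataset)


if __name__ == "__main__":
    # Three of the five toy tuples contain both Internet and Washer.
    print(support(frozenset({"Internet", "Washer"}), TOY_DATASET))
    print(is_frequent(frozenset({"Internet", "Washer"}), TOY_DATASET, tau=0.5))
    print(is_frequent(frozenset({"TV", "Washer"}), TOY_DATASET, tau=0.5))
```

With a threshold of 0.5 on these five toy tuples, a node needs support of at least 2.5, so {Internet, Washer} (support 3) qualifies while {TV, Washer} (support 1) does not.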
Consider a tuple and a set of attributes to be added to . Let be and be . After adding to , for any node in , belongs to . However, according to Definition 3, only the frequent nodes in are desirable. Using this intuition, Definition 4 provides a practical gain function utilizing nothing but the tuples in the dataset.
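Frequent-item Based Count (FBC): Given a dataset $\mathcal{D}$ and a node $v_i$, the Frequent-item Based Count (FBC) of $v_i$ is the number of frequent nodes in $\mathcal{L}_{v_i}$, the sublattice of $v_i$. Formally,
$$\mathrm{FBC}_\tau(v_i,\mathcal{D}) = \left|\{\, v_j \in \mathcal{L}_{v_i} \;:\; |Q(v_j,\mathcal{D})| \ge \tau\,|\mathcal{D}| \,\}\right|,$$
where $Q(v_j,\mathcal{D})$ is the set of tuples in $\mathcal{D}$ that contain the attributes of $v_j$ and $\tau$ is the frequency threshold.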
For simplicity, throughout the paper we use FBC() to refer to FBC. In Example 1, consider . In Figure 9, we have colored the frequent nodes in . Counting the number of colored nodes in Figure 9, FBC() is .
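As a rough illustration of the definition, the following Python sketch computes FBC by brute force, enumerating every subset of a node's attributes and counting the frequent ones. It assumes that the empty attribute set (the bottom of the lattice) is counted as a node; the function names and the toy dataset are hypothetical and do not come from the paper.

```python
from itertools import combinations
from typing import FrozenSet, List


def support(node: FrozenSet[str], dataset: List[FrozenSet[str]]) -> int:
    """Number of tuples that contain every attribute of the node."""
    return sum(1 for t in dataset if node <= t)


def fbc_bruteforce(node: FrozenSet[str], dataset: List[FrozenSet[str]], tau: float) -> int:
    """Count the frequent nodes in the sublattice of `node` by full enumeration.

    Assumption: the empty attribute set is part of the sublattice.
    The loop visits all 2^len(node) subsets, so this is only viable for small nodes.
    """
    min_count = tau * len(dataset)
    attrs = sorted(node)
    count = 0
    for k in range(len(attrs) + 1):
        for combo in combinations(attrs, k):
            if support(frozenset(combo), dataset) >= min_count:
                count += 1
    return count


if __name__ == "__main__":
    # Hypothetical toy dataset (NOT the paper's Figure 1).
    toy = [
        frozenset({"Breakfast", "TV", "Internet"}),
        frozenset({"Internet", "Washer"}),
        frozenset({"TV", "Internet", "Washer"}),
        frozenset({"Breakfast"}),
        frozenset({"Internet", "Washer"}),
    ]
    # Frequent subsets at tau = 0.5: {}, {Internet}, {Washer}, {Internet, Washer}.
    print(fbc_bruteforce(frozenset({"TV", "Internet", "Washer"}), toy, tau=0.5))  # 4
```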
Such a definition of a gain function has several advantages: (i) it requires knowledge only of the existing tuples in the dataset; (ii) it naturally captures changes in the joint demand for certain attribute combinations; and (iii) it is robust and adaptive to changes in the underlying data. However, counting the number of frequent itemsets is known to be #P-complete. Consequently, no polynomial-time method is known for counting the number of frequent subsets of a subset of attributes (i.e., counting the number of frequent nodes in the sublattice of a node), and straightforward approaches take time exponential in the size of the subset. Therefore, for this gain function, even the verification version of GMFA is likely not solvable in polynomial time.
Thus, in the rest of this section, we design a practical output-sensitive algorithm for computing FBC. Next, we discuss an observation that leads to an effective computation of FBC, followed by a negative result against such a design.
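Table II: Additional notation used in § IV
|Notation|Meaning|
|$\tau$|The frequency threshold|
|$Q(v_i,\mathcal{D})$|The set of tuples in $\mathcal{D}$ that contain the attributes $\mathcal{A}(v_i)$|
|FBC$(v_i)$|The frequent-item based count of $v_i$|
|$\mathcal{F}_{v_i}$|The set of maximal frequent nodes in $\mathcal{L}_{v_i}$|
|$\mathcal{U}_{v_i}$|The set of frequent nodes in $\mathcal{L}_{v_i}$|
|$P$|The pattern $P$ that is a string of size $m$|
|COV$(P)$|The set of nodes covered by the pattern $P$|
|$k_x(P)$|Number of $X$s in $P$|
| |The bipartite graph of the node |
| |The set of nodes of |
| |The set of edges of |
| |The edge from the node to in |
| |The adjacent nodes to the in |
| |The binary vector showing the nodes where |
| |Part of assigned to while applying Rule 1 on |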
IV-B FBC Computation: Initial Observations
Given a node $v_i$, to identify FBC($v_i$), the baseline solution traverses the lattice under $v_i$, counting the number of nodes for which at least a fraction $\tau$ of the tuples in the dataset match the attributes corresponding to the node. Thus, the cost of this baseline is always exponential in the number of attributes of $v_i$. An improved method to compute the FBC of $v_i$ is to start from the bottom of the lattice and follow the well-known Apriori algorithm to discover the frequent nodes. This algorithm utilizes the fact that any superset of an infrequent node is also infrequent. The algorithm combines pairs of frequent nodes at level $k$ that share $k-1$ attributes to generate the candidate nodes at level $k+1$. It then checks the frequency of the candidates at level $k+1$ to identify the frequent nodes of size $k+1$, and continues until no more candidates are generated. Since generating the candidate nodes at level $k+1$ involves combining the frequent nodes at level $k$, the running time of this algorithm is bounded in terms of FBC($v_i$) rather than the size of the full lattice.
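The level-wise procedure described above can be sketched in Python as follows. This is a simplified illustration of Apriori-style counting under stated assumptions (the empty attribute set is counted as a frequent node, and the threshold is at most 1), not the authors' implementation; `apriori_fbc` and the toy dataset are hypothetical names and data.

```python
from itertools import combinations
from typing import FrozenSet, List, Set


def support(node: FrozenSet[str], dataset: List[FrozenSet[str]]) -> int:
    """Number of tuples that contain every attribute of the node."""
    return sum(1 for t in dataset if node <= t)


def apriori_fbc(node: FrozenSet[str], dataset: List[FrozenSet[str]], tau: float) -> int:
    """Count frequent nodes under `node` level by level, pruning infrequent branches."""
    min_count = tau * len(dataset)
    count = 1  # assumption: the empty node is always frequent (tau <= 1)
    # Level 1: frequent single attributes of the node.
    level: Set[FrozenSet[str]] = {
        frozenset({a}) for a in node if support(frozenset({a}), dataset) >= min_count
    }
    while level:
        count += len(level)
        # Join pairs of frequent k-sets that share k-1 attributes.
        candidates = {
            a | b for a, b in combinations(level, 2) if len(a | b) == len(a) + 1
        }
        # Keep a candidate only if all its k-subsets are frequent and it is frequent itself.
        level = {
            c
            for c in candidates
            if all(c - {x} in level for x in c) and support(c, dataset) >= min_count
        }
    return count


if __name__ == "__main__":
    # Hypothetical toy dataset (NOT the paper's Figure 1).
    toy = [
        frozenset({"Breakfast", "TV", "Internet"}),
        frozenset({"Internet", "Washer"}),
        frozenset({"TV", "Internet", "Washer"}),
        frozenset({"Breakfast"}),
        frozenset({"Internet", "Washer"}),
    ]
    print(apriori_fbc(frozenset({"TV", "Internet", "Washer"}), toy, tau=0.5))  # 4
```

In this toy run only {Internet} and {Washer} survive level 1, so the single candidate {Internet, Washer} is the only support check at level 2; the infrequent attribute TV is never extended.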
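Consider a node $v_i$ that is frequent. In this case, Apriori generates all the frequent nodes, which here are all the nodes of the lattice under $v_i$, so it is on par with the baseline solution. One interesting observation is that if $v_i$ itself is frequent, then all nodes in its sublattice are also frequent, so FBC($v_i$) is simply the number of nodes in that sublattice, $2^{\ell(v_i)}$, where $\ell(v_i)$ is the number of attributes of $v_i$. As a result, in such cases, FBC can be computed in constant time. In Example 1, since node with bit representative is frequent, FBC ().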
This motivates us to compute the number of frequent nodes in a lattice without generating all the nodes. First we define the set of maximal frequent nodes as follows:
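Set of Maximal Frequent Nodes: Given a node $v_i$, a dataset $\mathcal{D}$, and a threshold $\tau$, the set of maximal frequent nodes is the set of frequent nodes in $\mathcal{L}_{v_i}$ (the sublattice of $v_i$) that do not have a frequent parent. Formally,
$$\mathcal{F}_\tau(v_i,\mathcal{D}) = \{\, v_j \in \mathcal{L}_{v_i} \;:\; v_j \text{ is frequent and no parent of } v_j \text{ in } \mathcal{L}_{v_i} \text{ is frequent} \,\}.$$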
In the rest of the paper, we ease the notation with . In Example 1, the set of maximal frequent nodes of with bit representative is , where , , and .
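To make the definition concrete, the sketch below computes the maximal frequent nodes by brute force: it first collects all frequent nodes under a given node and then keeps those with no frequent parent. The dataset over attributes A, B, C, D is a hypothetical example (not Example 1), and the function names are assumptions made for illustration.

```python
from itertools import combinations
from typing import FrozenSet, List, Set


def frequent_nodes(node: FrozenSet[str], dataset: List[FrozenSet[str]], tau: float) -> Set[FrozenSet[str]]:
    """All frequent nodes in the sublattice of `node` (brute-force enumeration)."""
    min_count = tau * len(dataset)
    attrs = sorted(node)
    result: Set[FrozenSet[str]] = set()
    for k in range(len(attrs) + 1):
        for combo in combinations(attrs, k):
            s = frozenset(combo)
            if sum(1 for t in dataset if s <= t) >= min_count:
                result.add(s)
    return result


def maximal_frequent_nodes(node: FrozenSet[str], dataset: List[FrozenSet[str]], tau: float) -> Set[FrozenSet[str]]:
    """Frequent nodes under `node` that have no frequent parent (one extra attribute)."""
    freq = frequent_nodes(node, dataset, tau)
    return {s for s in freq if not any((s | {a}) in freq for a in node - s)}


if __name__ == "__main__":
    # Hypothetical dataset over attributes A, B, C, D (NOT the paper's Example 1).
    data = (
        [frozenset("AB")] * 2
        + [frozenset("BC")] * 2
        + [frozenset("CD")] * 2
        + [frozenset("A"), frozenset("D")]
    )
    root = frozenset("ABCD")
    # With tau = 0.25 (at least 2 of 8 tuples), the maximal frequent
    # nodes are {A,B}, {B,C}, and {C,D}.
    print(sorted("".join(sorted(s)) for s in maximal_frequent_nodes(root, data, 0.25)))
```

In this hypothetical dataset the root {A, B, C, D} is infrequent, and its eight frequent nodes are spread over three overlapping sublattices.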
Unfortunately, unlike the case where the node itself is frequent, calculating the FBC of infrequent nodes is challenging. That is because the sublattices of the maximal frequent nodes overlap, i.e., their intersections are not empty. Due to space limitations, further details on this negative result are given in Appendix IV-C. Therefore, in § IV-D, we propose an algorithm that breaks the frequent nodes in these sublattices into disjoint partitions.
IV-C A negative result on computing FBC using the set of maximal frequent nodes
As discussed in § IV-B, if a node $v_i$ is frequent, all the nodes in its sublattice are also frequent, and FBC($v_i$) is $2^{\ell(v_i)}$, where $\ell(v_i)$ is the number of attributes of $v_i$. Considering this, given $v_i$, suppose $v_j$ is a node in the set of maximal frequent nodes of $v_i$; the FBC of any node $v_k$ whose attributes are a subset of those of $v_j$ can be simply calculated as FBC($v_k$) $= 2^{\ell(v_k)}$. In Example 1, node with bit representative is in , thus FBC; for the node also, since is a subset of , FBC is ().
Unfortunately, this does not hold for the nodes whose attributes are a superset of a maximal frequent node; calculating the FBC of those nodes is challenging. Suppose we wish to calculate the FBC of in Example 1. The set of maximal frequent nodes of is , where , , and . Figure 9 presents the sublattice of each of the maximal frequent nodes in a different color. The nodes in are colored green, while the nodes in are orange and the nodes in are blue. Several nodes in (including itself) are not frequent; thus FBC is less than (). In fact, FBC is equal to the size of the union of colored sublattices. Note that the intersection between the sublattices of the maximal frequent nodes is not empty. Thus, even though for each maximal frequent node , the FBC is , computing the FBC is not (computationally) simple. If we simply add the FBC of all maximal frequent nodes in , we are overestimating FBC, because we are counting the intersecting nodes multiple times.
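More formally, given an infrequent node $v_i$ with the set of maximal frequent nodes $\mathcal{F}_{v_i}$, FBC($v_i$) is equal to the size of the union of the sublattices of its maximal frequent nodes, which, using the inclusion-exclusion principle, is given by Equation 4:
$$\mathrm{FBC}(v_i) = \Big|\bigcup_{v_j \in \mathcal{F}_{v_i}} \mathcal{L}_{v_j}\Big| = \sum_{\emptyset \neq \mathcal{F}' \subseteq \mathcal{F}_{v_i}} (-1)^{|\mathcal{F}'|+1}\, \Big|\bigcap_{v_j \in \mathcal{F}'} \mathcal{L}_{v_j}\Big| \qquad (4)$$
In this equation, $\mathcal{F}'$ ranges over the non-empty subsets of $\mathcal{F}_{v_i}$ and $\mathcal{L}_{v_j}$ is the sublattice of node $v_j$.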
For example, in Figure 9:
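Computing FBC based on Equation 4 requires adding (or subtracting) $2^{|\mathcal{F}_{v_i}|}-1$ terms, one for each non-empty subset of the maximal frequent nodes; thus, its running time is exponential in the number of maximal frequent nodes.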
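The inclusion-exclusion computation, and its exponential number of terms, can be illustrated with a short Python sketch. It assumes nodes are encoded as integer bitmasks (one bit per attribute) and relies on the fact that the intersection of several sublattices is the sublattice of the bitwise AND of their nodes, whose size is 2 raised to the number of set bits. The bitmasks used are hypothetical and do not correspond to Figure 9.

```python
from functools import reduce
from itertools import combinations
from operator import and_
from typing import List


def fbc_inclusion_exclusion(maximal: List[int]) -> int:
    """Size of the union of the sublattices of the given maximal frequent nodes.

    Each node is a bitmask. The loops visit all 2^len(maximal) - 1 non-empty
    subsets, which is why this approach does not scale.
    """
    total = 0
    for k in range(1, len(maximal) + 1):
        sign = 1 if k % 2 == 1 else -1
        for group in combinations(maximal, k):
            common = reduce(and_, group)  # attributes shared by the whole group
            total += sign * 2 ** bin(common).count("1")
    return total


if __name__ == "__main__":
    # Hypothetical maximal frequent nodes over four attributes.
    maximal = [0b1100, 0b0110, 0b0011]
    # 4 + 4 + 4 - (2 + 1 + 2) + 1 = 8
    print(fbc_inclusion_exclusion(maximal))  # 8
```

Simply summing the three sublattice sizes would give 12; the subtraction and addition of the overlap terms bring the count down to the 8 distinct frequent nodes.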
IV-D Computing FBC using the set of maximal frequent nodes
As elaborated in § IV-C, for a given infrequent node $v_i$, the intersection between the sublattices of its maximal frequent nodes in $\mathcal{F}_{v_i}$ is not empty. Let $\mathcal{U}_{v_i}$ be the set of all frequent nodes in $\mathcal{L}_{v_i}$, i.e., $\mathcal{U}_{v_i} = \bigcup_{v_j \in \mathcal{F}_{v_i}} \mathcal{L}_{v_j}$. In Example 1, this set consists of all colored nodes in Figure 9.
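Our goal is to partition $\mathcal{U}_{v_i}$, the set of frequent nodes under $v_i$, into a collection $\mathcal{S}$ of disjoint sets such that (i) $\bigcup_{S \in \mathcal{S}} S = \mathcal{U}_{v_i}$ and (ii) the partitions are pairwise disjoint, i.e., $\forall S_j, S_k \in \mathcal{S}$ with $j \neq k$, $S_j \cap S_k = \emptyset$; given such a partition, FBC($v_i$) is $\sum_{S \in \mathcal{S}} |S|$. Such a partition for Example 1 is shown in Figure 9, where each color (e.g., blue) represents a set of nodes that is disjoint from the other sets designated by different colors (e.g., orange and green). In the rest of this section, we propose an algorithm for such disjoint partitioning.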
Let us first define a “pattern” $P$ as a string of size $m$ (the number of attributes), where $\forall\, 1 \le k \le m$: $P[k] \in \{0, 1, X\}$. Specifically, we refer to the pattern generated by replacing all 1s in the bit representative of a node with $X$ as the initial pattern for that node. For example, in Figure 9, there are four attributes, is a pattern (which is the initial pattern for ). The pattern covers all nodes whose bit representatives start with (the nodes with green color in Figure 9). More formally, the coverage of a pattern is defined as follows:
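Given the set of attributes $\mathcal{A} = \{A_1, \dots, A_m\}$ and a pattern $P$, the coverage of pattern $P$ is
$$COV(P) = \{\, v_j \;:\; \forall\, 1 \le k \le m,\ \big(P[k] = 1 \Rightarrow A_k \in \mathcal{A}(v_j)\big) \wedge \big(P[k] = 0 \Rightarrow A_k \notin \mathcal{A}(v_j)\big) \,\}.$$
Note that if $P[k] = X$, $A_k$ may or may not belong to $\mathcal{A}(v_j)$.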
In Figure 9, all nodes with green color are in and all nodes with blue color are in . Specifically, node with bit representative is in because its first bit is . Note that a node may be covered by multiple patterns, e.g., node with bit representative is in and . We refer to patterns with disjoint coverage as disjoint patterns. For example, and are disjoint patterns.
Figure 9 provides a set of disjoint patterns (also presented in the fourth column of Figure 9) that partition the set of frequent nodes. The nodes in the coverage of each pattern are colored with a different color. Given a pattern $P$, let $k_x(P)$ be the number of $X$s in $P$; the number of nodes covered by the pattern $P$ is $2^{k_x(P)}$. Thus, given a set $\mathcal{P}$ of disjoint patterns that partition the set of frequent nodes, FBC is simply $\sum_{P \in \mathcal{P}} 2^{k_x(P)}$. For example, considering the set of disjoint patterns in Figure 9, the last column of Figure 9 presents the number of nodes in the coverage of each pattern (i.e., $2^{k_x(P)}$); thus FBC in this example is the sum of the numbers in the last column.
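The notions of coverage, disjoint patterns, and counting through patterns can be sketched in Python as follows. Patterns are written as strings over '0', '1', and 'X'; the three patterns in the usage example are hypothetical (they partition the frequent nodes under hypothetical maximal frequent nodes 1100, 0110, and 0011, not those of Figure 9), and the function names are assumptions made for illustration.

```python
from itertools import product
from typing import List, Set


def coverage(pattern: str) -> Set[str]:
    """All bit representatives matched by a pattern ('X' matches either bit)."""
    choices = [("0", "1") if c == "X" else (c,) for c in pattern]
    return {"".join(bits) for bits in product(*choices)}


def are_disjoint(p: str, q: str) -> bool:
    """Two patterns are disjoint iff some position fixes different bits in each."""
    return any(a != b and "X" not in (a, b) for a, b in zip(p, q))


def fbc_from_disjoint_patterns(patterns: List[str]) -> int:
    """Sum of coverage sizes; valid only when the patterns are pairwise disjoint."""
    return sum(2 ** p.count("X") for p in patterns)


if __name__ == "__main__":
    # Initial patterns of the hypothetical maximal frequent nodes overlap:
    print(are_disjoint("XX00", "0XX0"))  # False: both cover 0000 and 0100
    # A hypothetical set of disjoint patterns covering the same frequent nodes:
    disjoint = ["XX00", "0X10", "00X1"]
    print(fbc_from_disjoint_patterns(disjoint))  # 4 + 2 + 2 = 8
```

Because the patterns are disjoint, the count needs one term per pattern and never materializes the covered nodes; `coverage` is included only to check small examples by hand.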
In order to identify the set of disjoint patterns that partition the set of frequent nodes, a baseline solution may need to compare the initial pattern of every maximal frequent node with all the patterns discovered so far, to determine the set of patterns that are disjoint from them. As a result, because every pattern covers at least one frequent node, the baseline solution may generate up to FBC patterns and (comparing all patterns with each other) its running time in the worst case is quadratic in its output (i.e., in FBC). As a more comprehensive example, let us introduce Example 2; we will use this example in what follows.
Example 2. Consider a case where and we want to compute FBC for the root of , i.e. . Let be as shown in Figure 10.
In Example 2, consider the two initial patterns and , for the nodes and , respectively. Note that for each initial pattern for a node , is equal to . In order to partition the space in