In many scientific fields, there is a growing need to understand measured or observed data, to find different regularities or anomalies, groups of instances (patterns) for which they occur and their descriptions in order to get an insight into the underlying phenomena.
This is addressed by redescription mining (Ramakrishnan et al., 2004), a type of knowledge discovery that aims to find different descriptions of similar sets of instances by using one or more disjoint sets of descriptive attributes, called views. It is applicable in a variety of scientific fields such as biology, economy, pharmacy, ecology, social science and others, where it is important to understand connections between different descriptors and to find regularities that are valid for different subsets of instances. Redescriptions are tuples of logical formulas which are called queries. An example redescription contains two queries:
The first query describes a set of instances (geospatial locations) by using a set of attributes related to temperature and precipitation in a given month as the first view (in the example, average temperature in July and average precipitation in June). The second query describes a very similar set of locations by using a set of attributes specifying animal species inhabiting these locations as the second view (in this instance, polar bear). The queries in this example contain only the conjunction operator, though the approach also supports the negation and disjunction operators.
We first describe the fields of data mining and knowledge discovery closely related to redescription mining. Next, we describe recent research in redescription mining, relevant to the approach we propose. We then outline our approach positioned in the context of related work.
1.1 Fields related to redescription mining
Redescription mining is related to association rule mining (Agrawal et al., 1996; Hipp et al., 2000; Zhang & He, 2010), two-view data association discovery (van Leeuwen & Galbrun, 2015), clustering (Cox, 1957; Fisher, 1958; Ward, 1963; Jain et al., 1999; Xu & Tian, 2015) and its special form conceptual clustering (Michalski, 1980; Fisher, 1987), subgroup discovery (Klösgen, 1996; Wrobel, 1997; Novak et al., 2009; Herrera et al., 2010), emerging patterns (Dong & Li, 1999; Novak et al., 2009), contrast set mining (Bay & Pazzani, 2001; Novak et al., 2009) and exceptional model mining (Leman et al., 2008). The most important relations can be seen in Figure 1.
Association rule mining (Agrawal et al., 1996) is related to redescription mining in the aim to find queries describing similar sets of instances which reveal associations between attributes used in these queries. The main difference is that association rule mining produces one-directional associations while redescription mining produces bi-directional associations. Two-view data association discovery (van Leeuwen & Galbrun, 2015) aims at finding a small, non-redundant set of associations that provide insight into how two views are related. The produced associations are both uni-directional and bi-directional, as opposed to redescription mining, which only produces bi-directional connections providing interesting descriptions of instances.
The main goal of clustering is to find groups of similar instances with respect to a set of attributes. However, it does not provide understandable and concise descriptions of these groups, which are often complex and hard to find. This is resolved in conceptual clustering (Michalski, 1980; Fisher, 1987), which finds clusters and concepts that describe them. Redescription mining shares this aim but requires each discovered cluster to be described by at least two concepts. Clustering is extended by multi-view (Bickel & Scheffer, 2004; Wang et al., 2013) and multi-layer clustering (Gamberger et al., 2014) to find groups of instances that are strongly connected across multiple views.
Subgroup discovery (Klösgen, 1996; Wrobel, 1997) differs from redescription mining in its goals. It finds queries describing groups of instances having unusual and interesting statistical properties on their target variable which are often unavailable in purely descriptive tasks. Exceptional model mining (Leman et al., 2008) extends subgroup discovery to more complex target concepts searching for subgroups such that a model trained on this subgroup is exceptional based on some property.
Emerging Patterns (Dong & Li, 1999) aim at finding itemsets that are statistically dependent on a specific target class while Contrast Set Mining (Bay & Pazzani, 2001) identifies monotone conjunctive queries that best discriminate between instances containing one target class from all other instances.
1.2 Related work in redescription mining
The field of redescription mining was introduced by Ramakrishnan et al. (2004), who present an algorithm to mine redescriptions based on decision trees, called CARTwheels. The algorithm works by building two decision trees (one for each view) that are joined in the leaves. Redescriptions are found by examining the paths from the root node of the first tree to the root node of the second. The algorithm uses multi-class classification to guide the search between the two views. Other approaches to mine redescriptions include the one proposed by Zaki & Ramakrishnan (2005), which uses a lattice of closed descriptor sets to find redescriptions; the algorithm for mining exact and approximate redescriptions by Parida & Ramakrishnan (2005), which uses a relaxation lattice; and the greedy and the MID algorithm based on frequent itemset mining by Gallo et al. (2008). All these approaches work only on Boolean data.
Galbrun & Miettinen (2012b) extend the greedy approach by Gallo et al. (2008) to work on numerical data. Redescription mining was extended by Galbrun & Kimmig (2013) to a relational and by Galbrun & Miettinen (2012a) to an interactive setting. Recently, two tree-based algorithms have been proposed by Zinchenko (2014), which explore the use of decision trees in a non-Boolean setting and present different methods of layer-by-layer tree construction, which make informed splits at each level of the tree. Mihelčić et al. (2015a, b) proposed a redescription mining algorithm based on multi-target predictive clustering trees (PCTs) (Blockeel & De Raedt, 1998; Kocev et al., 2013). This algorithm typically creates a large number of redescriptions by executing PCTs iteratively: it uses rules created for one view of attributes in one iteration, as target attributes for generating rules for the other view of attributes in the next iteration. A redescription set of a given size is improved over the iterations by introducing more suitable redescriptions which replace the ones that are inferior according to predefined quality criteria.
In this work, we introduce a redescription mining framework that allows creating multiple redescription sets of user defined size, based on user defined importance levels of one or more redescription quality criteria. The underlying redescription mining algorithm uses multi-target predictive clustering trees (Kocev et al., 2013) and allows the main steps of rule creation and redescription construction explained in (Mihelčić et al., 2015b). This is in contrast to current state of the art approaches that return all constructed redescriptions that satisfy accuracy and support constraints (Ramakrishnan et al., 2004; Zaki & Ramakrishnan, 2005; Parida & Ramakrishnan, 2005), a smaller number of accurate and significant redescriptions that satisfy support constraints (Galbrun & Miettinen, 2012b; Zinchenko, 2014; Gallo et al., 2008) or optimize one redescription set of user defined size (Mihelčić et al., 2015b). This algorithm supports a broader process which involves the creation and effective utilization of a possibly large redescription set.
From the expert systems perspective, the framework allows creating large and heterogeneous knowledge bases for use by domain experts. It also allows fully automated construction of specific subsets of the obtained knowledge based on predefined user criteria. The system is modular and allows using the redescription set construction procedure as an independent querying system on the database created by merging multiple redescription sets produced by many different redescription mining approaches. The obtained knowledge can be used, for example, as a basis or complement in decision support systems.
The framework provides means to explore and compare multiple redescription sets, without the need to expensively experiment with tuning the parameters of the underlying redescription mining algorithm. This is achieved with (i) an efficient redescription mining algorithm with a new conjunctive refinement procedure that produces large, heterogeneous and accurate redescription sets and (ii) a redescription set construction procedure that produces one or more reduced redescription sets tailored to specific user preferences in a multi-objective optimization manner.
After introducing the necessary notation in Section 2, we present the framework for redescription set construction in Section 3. First, we briefly describe the CLUS-RM algorithm, then we introduce the conjunctive refinement procedure and explain the generalized redescription set construction process. Next, we introduce the variability index, which supports a refined treatment of redescription accuracy in the presence of missing values. We describe the datasets and an application involving redescription sets produced by the framework in Section 4 and perform theoretical and empirical evaluation of the framework’s performance in Section 5. The empirical evaluation includes quality analysis of representative sets and comparison to the set containing all discovered redescriptions, evaluation of the conjunctive refinement procedure, and quality comparison of redescriptions produced by our framework to those produced by several state-of-the-art redescription mining algorithms, on three datasets with different properties. We conclude the paper in Section 6.
2 Notation and definitions
The input dataset is a quintuple of the two attribute (variable) sets (), an element (instance) set , and the two views corresponding to these attribute sets. Views ( and ) are data matrices such that if an element has a value for attribute .
A query is a logical formula that can contain the conjunction, disjunction and negation logical operators. These operators describe logical relations between different attributes, from attribute sets and , that constitute a query. The set of all valid queries is called a query language. The set of elements described by a query , denoted , is called its support. A redescription is defined as a pair of queries, where and contain variables from and respectively. The support of a redescription is the set of elements supported by both queries that constitute this redescription . We use to denote the multi-set of all occurrences of attributes in the queries of a redescription . The corresponding set of attributes is denoted . The set containing all produced redescriptions is denoted . User-defined constraints are typically limits on various redescription quality measures.
Given a dataset , a query language over a set of attributes , and a set of constraints , the task of redescription mining (Galbrun, 2013) is to find all redescriptions satisfying constraints in .
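To make the notion of query support concrete, the following minimal Python sketch evaluates a conjunctive query over a single view. The dictionary-based data layout, the function name `query_support`, the interval form of each condition, and all attribute names and values are illustrative assumptions and are not taken from the framework itself.

```python
def query_support(query, view):
    """Return the set of elements described by a conjunctive query.

    query: dict mapping attribute name -> (low, high) closed interval (assumed form)
    view:  dict mapping element id -> dict of attribute values
    """
    described = set()
    for element, values in view.items():
        # An element is described only if every condition holds. In this
        # sketch a missing attribute value simply fails the condition.
        if all(a in values and lo <= values[a] <= hi
               for a, (lo, hi) in query.items()):
            described.add(element)
    return described


# Hypothetical climate view: July temperature and June precipitation.
climate = {
    "loc1": {"t_jul": 2.0, "p_jun": 30.0},
    "loc2": {"t_jul": 18.5, "p_jun": 80.0},
    "loc3": {"t_jul": 4.1, "p_jun": 25.0},
}
cold_dry = {"t_jul": (-5.0, 5.0), "p_jun": (0.0, 40.0)}
print(sorted(query_support(cold_dry, climate)))
```

The support of a redescription is then the intersection of the supports of its two queries, each evaluated on its own view.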
2.1 Individual redescription quality measures
The accuracy of a redescription is measured with the Jaccard similarity coefficient (Jaccard index).
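As an illustration, the Jaccard index of a redescription can be computed from the supports of its two queries as the size of their intersection divided by the size of their union. The function name `jaccard` and the convention of returning 0 for two empty supports are assumptions of this sketch.

```python
def jaccard(supp_q1, supp_q2):
    """Jaccard index of two query supports given as sets of element ids."""
    union = supp_q1 | supp_q2
    if not union:
        return 0.0  # convention assumed here for two empty supports
    return len(supp_q1 & supp_q2) / len(union)


# Two queries sharing two of the four elements they describe in total.
print(jaccard({1, 2, 3}, {2, 3, 4}))  # 0.5
```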
The problem with this measure is that redescriptions describing large subsets of instances often have a large intersection which results in high value of Jaccard index. As a result, the obtained knowledge is quite general and often not very useful to the domain expert. It is thus preferred to have redescriptions that reveal more specific knowledge about the studied problem and are harder to obtain by random sampling from the underlying data distribution.
This is why we compute the statistical significance (p-value) of each obtained redescription. We denote the marginal probability of a query and with and , respectively and the set of elements described by both as . The corresponding p-value (Galbrun, 2013) is defined as
The p-value represents the probability that a subset of elements of the observed size or larger is obtained by joining two random queries with marginal probabilities equal to the fractions of covered elements. It is an optimistic criterion, since the assumption that all elements can be sampled with equal probability need not hold for all datasets.
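The verbal description above corresponds to a binomial tail probability. The sketch below is reconstructed from that description only: it assumes that each element is covered by both random queries independently with probability equal to the product of the two marginal probabilities. The function and parameter names are hypothetical.

```python
from math import comb


def redescription_p_value(n_q1, n_q2, n_both, n_elements):
    """Binomial tail: probability of an overlap of size n_both or larger.

    n_q1, n_q2: support sizes of the two queries
    n_both:     number of elements described by both queries
    n_elements: total number of elements in the dataset
    """
    # Probability that a random element is covered by both queries,
    # assuming the queries are independent.
    p = (n_q1 / n_elements) * (n_q2 / n_elements)
    return sum(
        comb(n_elements, k) * p**k * (1 - p) ** (n_elements - k)
        for k in range(n_both, n_elements + 1)
    )


# A larger observed overlap is less likely under independence.
print(redescription_p_value(5, 5, 2, 10) > redescription_p_value(5, 5, 5, 10))
```

For large datasets the direct sum can underflow or become slow, so a library routine for the binomial survival function would be a reasonable substitute.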
Since it is important to provide understandable and short descriptions, it is interesting to measure the number of attributes occurring in redescription queries .
Redescription with its queries defined as:
describes locations which are inhabited by the polar bear. The query describes the average temperature () and the average precipitation () conditions of these locations in June and July. The redescription has a Jaccard index value of and a p-value smaller than . The multi-set and its corresponding set . The query size of , denoted , equals .
2.2 Redescription quality measures based on redescription set properties
We use two redescription quality measures based on properties of redescriptions contained in a corresponding redescription set.
The measure providing information about the redundancy of elements contained in the redescription support is called the average redescription element Jaccard index and is defined as:
Analogously, the measure providing information about the redundancy of attributes contained in redescription queries, called the average redescription attribute Jaccard index, is defined as:
We illustrate the average attribute Jaccard index on the redescription example from the previous subsection. If we assume that our redescription set contains only two redescriptions where equals:
The corresponding average attribute Jaccard index of the redescription equals showing a high level of redundancy in the used attributes between redescription and the only other redescription available in the set . On the other hand, in the redescription set , where contains queries:
the average attribute Jaccard index of the redescription equals showing no redundancy in the used attributes.
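Both set-based measures can be sketched with the same helper. The sketch assumes that the average is taken over the Jaccard indices between the chosen redescription and every other redescription in the set, computed on supports for the element version and on attribute sets for the attribute version. The normalization by the number of other redescriptions and all attribute names are assumptions of this example.

```python
def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def average_jaccard_to_others(index, item_sets):
    """Average Jaccard index between item_sets[index] and all other sets.

    item_sets holds one set per redescription: element supports for the
    element-based measure, attribute sets for the attribute-based measure.
    """
    target = item_sets[index]
    others = [s for i, s in enumerate(item_sets) if i != index]
    if not others:
        return 0.0
    return sum(jaccard(target, s) for s in others) / len(others)


# Hypothetical attribute sets of two redescriptions.
same_attrs = [{"t_jul", "p_jun", "polar_bear"}, {"t_jul", "p_jun", "polar_bear"}]
disjoint_attrs = [{"t_jul", "p_jun", "polar_bear"}, {"t_jan", "p_dec", "arctic_fox"}]
print(average_jaccard_to_others(0, same_attrs))      # 1.0, fully redundant
print(average_jaccard_to_others(0, disjoint_attrs))  # 0.0, no redundancy
```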
3 Redescription mining framework
In this section, we present a redescription mining framework. It first creates a large set of redescriptions and then uses it to create one or more smaller sets that are presented to the user. This is done by taking into account the relative user preferences regarding importance of different redescription quality criteria.
3.1 The CLUS-RM algorithm
The framework generates redescriptions with the CLUS-RM algorithm (Mihelčić et al., 2015b), presented in Algorithm 1. It uses multi-target Predictive Clustering Trees (PCT) (Kocev et al., 2013) to construct conjunctive queries which are used as building blocks of redescriptions. Queries containing disjunctions and negations are obtained by combining and transforming queries containing only the conjunction operator.
The algorithm is able to produce a large number of highly accurate redescriptions, many of which contain only the conjunction operator in the queries. This is in part the consequence of using PCTs in a multi-target setting, which is known to outperform single class classification or regression trees due to the property of inductive transfer (Piccart, 2012). This distinguishes the CLUS-RM redescription mining algorithm from other state-of-the-art solutions, which in general create a smaller number of redescriptions with the majority of redescription queries containing the disjunction operator.
3.1.1 Rule construction and redescription creation
The initial task in the algorithm is to create one PCT per view of the original data, constructed for performing unsupervised tasks, to obtain different subsets of instances (referred to as initial clusters) and the corresponding queries that describe them. To create initial clusters (line 2 in Algorithm 1), the algorithm transforms an unsupervised problem to a supervised problem by constructing an artificial instance for each original instance in the dataset. These instances are obtained by shuffling attribute values among original instances, thus breaking any existing correlations between the attributes. Each artificial instance is assigned a target label while each original instance is assigned a target label . One such dataset is created for each view considered in the redescription mining process. A PCT is constructed on each dataset, with the goal of distinguishing between the original and the artificial instances, and transformed to a set of rules. This transformation is achieved by traversing the tree, joining all attributes used in splits into a rule and computing its support. Each node in a tree forms one query containing the conjunction and possibly negation operators (lines 3 and 7 in Algorithm 1).
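The construction of artificial instances can be sketched as a column-wise shuffle. In the sketch below, each attribute column is permuted independently, which preserves the marginal distribution of every attribute while breaking correlations between attributes. The label values (1 for original, 0 for artificial), the function name, and the list-of-rows layout are assumptions made for illustration.

```python
import random


def make_supervised_view(rows, seed=0):
    """Pair every original row with an artificial one built by shuffling.

    rows: list of equally long lists, one per original instance.
    Returns a list of (row, label) pairs; the labels 1 (original) and
    0 (artificial) are assumed values for this sketch.
    """
    rng = random.Random(seed)
    # Transpose to columns and permute each attribute independently.
    columns = [list(col) for col in zip(*rows)]
    for col in columns:
        rng.shuffle(col)
    artificial = [list(row) for row in zip(*columns)]
    return [(list(r), 1) for r in rows] + [(r, 0) for r in artificial]


view = [[1.0, "a"], [2.0, "b"], [3.0, "c"]]
labeled = make_supervised_view(view)
print(len(labeled))  # 6: three original and three artificial instances
```

A tree learner trained to separate the two labels on such data is then pushed towards splits that capture attribute combinations typical of the original instances.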
After the initial queries are created, the algorithm connects different views by assigning target labels to instances based on their coverage by queries constructed from the opposing view (line 5 in Algorithm 1). To construct queries containing attributes from , each instance is assigned a target label if it is described by a query containing the attributes from , otherwise it is assigned a value . The process is iteratively repeated a predefined number of steps (line 4 in Algorithm 1).
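The target construction that links the two views can be sketched as follows: every query from the opposing view contributes one binary target column that marks which instances it describes. The label values 1 and 0 and the function name `coverage_targets` are assumptions of this sketch.

```python
def coverage_targets(elements, query_supports):
    """Build one binary target per opposing-view query for each element.

    elements:       iterable of element ids
    query_supports: list of sets, the supports of the opposing-view queries
    """
    return {
        e: [1 if e in supp else 0 for supp in query_supports]
        for e in elements
    }


targets = coverage_targets(["a", "b", "c"], [{"a", "b"}, {"c"}])
print(targets["a"])  # [1, 0]
```

A multi-target tree trained on these columns in the other view then yields queries whose supports resemble those of the opposing-view queries.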
Redescriptions are created as a Cartesian product of a set of queries formed on and a set of queries formed on (line 8 in Algorithm 1). All redescriptions that satisfy the user-defined constraints (the minimal Jaccard index, the maximal p-value, and the minimal and the maximal support) are added to the redescription set. The algorithm can produce redescriptions containing conjunction, negation and disjunction operators.
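A minimal sketch of this pairing step is given below. It assumes that each query is already represented by its support set and, for brevity, applies only the Jaccard and support constraints; the p-value constraint is omitted. The function name, the dictionary layout, and the query strings are hypothetical.

```python
from itertools import product


def create_redescriptions(queries_1, queries_2, min_jaccard,
                          min_support, max_support):
    """Pair every first-view query with every second-view query.

    queries_1, queries_2: dicts mapping a query string to its support set.
    Returns (query_1, query_2, jaccard) triples satisfying the constraints.
    """
    accepted = []
    for (q1, s1), (q2, s2) in product(queries_1.items(), queries_2.items()):
        union = s1 | s2
        if not union:
            continue
        support = s1 & s2
        j = len(support) / len(union)
        if j >= min_jaccard and min_support <= len(support) <= max_support:
            accepted.append((q1, q2, j))
    return accepted


climate_queries = {"t_jul < 5": {1, 2, 3}, "p_jun > 60": {4, 5}}
species_queries = {"polar_bear": {1, 2, 3, 6}, "red_fox": {4, 5, 7, 8}}
print(create_redescriptions(climate_queries, species_queries, 0.6, 2, 10))
```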
The initialization, rule construction and various types of redescription creation are thoroughly described in Mihelčić et al. (2015b).
3.1.2 Conjunctive refinement
In this subsection, we present an algorithmic improvement to the redescription mining process presented in Algorithm 1. The aim of this method is to improve the overall accuracy of redescriptions in the redescription set by combining newly created redescriptions with redescriptions already present in redescription set .
Combining existing redescription queries with an attribute by using the conjunction operator has been used in greedy-based redescription mining algorithms (Gallo et al., 2008; Galbrun & Miettinen, 2012b) to construct redescriptions. The idea is to expand each redescription query in turn by using a selected attribute and the selected logical operator. Such a procedure, if used with the conjunction operator, leads to an increase of the Jaccard index but usually also reduces the support size of a redescription. Zaki & Ramakrishnan (2005) combine closed descriptor sets by using the conjunction operator to construct a closed lattice of descriptor sets which are used to construct redescriptions. They conclude that combining descriptor sets and describing element sets and respectively, such that , can be done by constructing a descriptor set . They also conclude that the newly created descriptor set describes the same set of elements as the set . This procedure works only with attributes containing Boolean values and does not use the notion of views.
Instead of extending redescription queries with attributes connected using the conjunction operator (which is usually constrained by the number of expansions), the conjunctive refinement procedure compares the support of each redescription in the redescription set with the selected redescription . It merges the queries of these two redescriptions with the conjunction operator to obtain a new redescription if and only if . We extend and prove the property described in Zaki & Ramakrishnan (2005) in a more general setting, combining redescriptions with arbitrary types of attributes and a finite number of different views. We demonstrate how to use it efficiently with numerical attributes and show that this procedure does not decrease the accuracy of a redescription. In fact, if such that , then .
If the attributes contain numerical values, we can transform the redescription , given an arbitrary redescription such that , to redescription such that has tighter numerical bounds on all attributes contained in the queries, and that . By doing this, we increase the probability of finding the element or as described above, which leads to improving the accuracy of redescription . The construction procedure of such redescription is explained in Section S1.1 (Online Resource 1). The redescription is used as a refinement redescription when numerical attributes are present in the data.
We can now state and prove the following lemma:
For every redescription , for every redescription , where , and , , and . If then for a redescription it holds that and .
The proof of Lemma 3.1 for redescription mining problems containing two views can be seen in Section S1.1 (Online Resource 1). The general formulation with an arbitrary number of views is proven by mathematical induction. It is easily seen from the proof that if such that then thus ultimately .
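The effect of conjunctive refinement can be illustrated on support sets. The sketch below assumes that the merge condition is that the support of the redescription being refined is contained in the support of the refining redescription; this condition is an assumption of the example, chosen because it guarantees the two properties discussed above (unchanged support, non-decreasing Jaccard index). The class, field names and query strings are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Redescription:
    q1: str
    q2: str
    supp_q1: frozenset
    supp_q2: frozenset

    @property
    def support(self):
        return self.supp_q1 & self.supp_q2

    @property
    def jaccard(self):
        union = self.supp_q1 | self.supp_q2
        return len(self.support) / len(union) if union else 0.0


def refine(r, other):
    """Conjoin the queries of `other` onto `r` when supp(r) is within supp(other)."""
    if not r.support <= other.support:
        return r  # merge condition (assumed) not met, keep r unchanged
    # The support of a conjunction is the intersection of the supports.
    return Redescription(
        f"({r.q1}) AND ({other.q1})",
        f"({r.q2}) AND ({other.q2})",
        r.supp_q1 & other.supp_q1,
        r.supp_q2 & other.supp_q2,
    )


r = Redescription("a", "b", frozenset({1, 2, 3, 4}), frozenset({1, 2, 3, 5}))
other = Redescription("c", "d", frozenset({1, 2, 3, 6}), frozenset({1, 2, 3, 6}))
refined = refine(r, other)
print(r.jaccard, refined.jaccard)  # 0.6 1.0
```

In this example the refining queries exclude elements 4 and 5, which were described by only one of the original queries, so the union shrinks while the intersection stays the same.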
The conjunctive refinement is demonstrated in Figure 2.
The procedure described in Algorithm 2 and demonstrated in Figure S1 applies conjunctive refinement by using redescriptions that satisfy the user defined constraints and redescriptions that satisfy looser constraints on the Jaccard index (, ). These constraints determine the amount and variability of redescriptions used to improve the redescription set.
The refinement procedure, in combination with redescription query minimization explained in Mihelčić et al. (2015b), provides grounds for mining more accurate yet compact redescriptions.
3.2 Generalized redescription set construction
The redescription set obtained by Algorithm 1 contains redescriptions satisfying hard constraints described in the previous subsections. It is often very large and hard to explore. For this reason, we extract one or more smaller sets of redescriptions that satisfy additional preferential properties on objective redescription evaluation measures, set up by the user, and present them for exploration. This process is demonstrated in Figure 3.
Producing summaries and compressed rule set representations is important in many fields of knowledge discovery. In the field of frequent itemset mining such dense representations include closed itemsets (Pasquier et al., 1999) and free sets (Boulicaut & Bykowski, 2000). The approaches using set pattern mining construct a set by enforcing constraints on different pattern properties, such as support, overlap or coverage (Guns et al., 2011). Methods developed in information theory consider sets that provide the best compression of a larger set of patterns. These techniques use properties like the Information Bottleneck (Tishby et al., 1999) or the Minimum description length (Grünwald, 2007). The work on statistical selection of association rules developed by Bouker et al. (2012) presented techniques to eliminate irrelevant rules based on dominance, which is computed on several possibly conflicting criteria. If some rule is not strictly dominated by any other rule already in the set, the minimal similarity with some representative rule is used to determine if it should be added to the set.
Redescriptions are highly overlapping with respect to described instances and attributes used in the queries. It is often very hard to find fully dominated redescriptions, and the number of dominated redescriptions that can be safely discarded is relatively small compared to the set of all created redescriptions. Our approach, which creates a set of user-defined (small) size, does not use a representative rule to compute the similarity. Instead, it adds redescriptions to the final redescription set by using the scalarization technique (Caramia & Dell’Olmo, 2008) developed in multi-objective optimization to find the optimal solution when faced with many conflicting criteria. If the corresponding optimization function is minimized, given positive weights, the solution is a strict Pareto optimum, otherwise it is a weak Pareto optimum (Caramia & Dell’Olmo, 2008) of a multi-objective optimization problem. A similar aggregation technique is used in multi-attribute utility theory (MAUT) (Winterfeldt & Fischer, 1975) to rank the alternatives in decision making problems.
Each redescription is evaluated with a set of criteria known from the literature or defined by the user. The final quality score is obtained by aggregating these criteria with user-defined importance weights to produce a final numerical score. Based on this score, the method selects one non-dominated redescription, based on utilised quality criteria, at each step of redescription set construction.
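The scalarization step can be sketched as a weighted sum over criterion scores. The sketch assumes that every criterion has already been normalized to a cost in [0, 1] where lower is better, so that the aggregated score is minimized. The criterion names, values and weights are hypothetical.

```python
def weighted_score(criteria, weights):
    """Aggregate criterion costs (lower is better) into one number."""
    return sum(weights[name] * criteria[name] for name in weights)


def select_best(candidates, weights):
    """Return the candidate with the smallest aggregated cost."""
    return min(candidates, key=lambda c: weighted_score(c["criteria"], weights))


candidates = [
    {"name": "R1", "criteria": {"inaccuracy": 0.2, "redundancy": 0.8}},
    {"name": "R2", "criteria": {"inaccuracy": 0.4, "redundancy": 0.4}},
]
# Equal importance favours the balanced candidate R2,
# accuracy-only weights favour R1.
print(select_best(candidates, {"inaccuracy": 0.5, "redundancy": 0.5})["name"])
print(select_best(candidates, {"inaccuracy": 1.0, "redundancy": 0.0})["name"])
```

Changing only the weight vector changes which redescription is selected, which is what allows several reduced sets to be built from one mined set.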
The procedure generalizes the current redescription set construction approaches in two ways: 1) it allows defining importance weights to different redescription quality criteria and adding new ones to enable constructing redescription sets with different properties which provides different insight into the data, 2) it allows creating multiple redescription sets by using different weight vectors, support levels, Jaccard index thresholds or redescription set sizes. Thus, it in many cases eliminates the need to make multiple runs of a redescription mining algorithm.
One extremely useful property of the procedure is that it can be used by any existing redescription mining algorithm, or a combination thereof. In general, a larger number of diverse, high-quality redescriptions allows the construction of higher-quality reduced sets.
Are there any elements in the data that share many common properties? Can we find a subset of elements that allows multiple different redescriptions? Can we find very diverse but accurate redescriptions? What is the effect of reducing redescription query size on the overall accuracy on the observed data? What are the effects of missing values on the redescription accuracy? What is our confidence that these redescriptions will remain accurate if missing values are added to our set? This is only a subset of the questions that can be addressed by observing redescription sets produced by the proposed procedure. The goal is not to make redescription mining subjective in the sense of interestingness (Tuzhilin, 1995) or unexpectedness (Padmanabhan & Tuzhilin, 1998), but to enable exploration of mined patterns in a more versatile manner.
The input to the procedure is a set of redescriptions produced by Algorithm 1 and an importance weight matrix defined by the user. The rows of the importance weight matrix define the users’ importance for various redescription quality criteria. The procedure creates one output redescription set for each row in the importance weight matrix (line 3 in Algorithm 3). The procedure works in two parts: first it computes element and attribute occurrence in redescriptions from the original redescription set (line 2 in Algorithm 3). This information is used to find the redescription that satisfies the user defined criteria and describes elements by using attributes that are found in a small number of redescriptions from the redescription set. When found (line 4 in Algorithm 3), it is placed in the redescription set being constructed (line 5 in Algorithm 3). Next, the procedure iteratively adds non-dominated redescriptions (lines 7-9 in Algorithm 3) until the maximum allowed number of redescriptions is placed in the newly constructed set (line 6 in Algorithm 3).
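The iterative part of the procedure can be sketched as a greedy loop. This is a strongly simplified illustration with only two criteria: inaccuracy (one minus the Jaccard index) and redundancy with respect to the redescriptions already selected. Measuring redundancy as the maximum element Jaccard similarity to the selected set, the two weights, and the tuple layout are assumptions of this sketch and do not reproduce Algorithm 3.

```python
def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_reduced_set(redescriptions, size, w_acc=0.5, w_red=0.5):
    """Greedily pick `size` redescriptions with the lowest weighted cost.

    redescriptions: list of (name, jaccard_index, support_set) triples.
    """
    remaining = list(redescriptions)
    selected = []
    while remaining and len(selected) < size:
        def cost(r):
            # Redundancy: highest support overlap with anything selected so far.
            redundancy = max((jaccard(r[2], s[2]) for s in selected), default=0.0)
            return w_acc * (1.0 - r[1]) + w_red * redundancy
        best = min(remaining, key=cost)
        selected.append(best)
        remaining.remove(best)
    return selected


mined = [
    ("A", 0.90, {1, 2, 3}),
    ("B", 0.88, {1, 2, 3}),  # nearly as accurate as A, same elements
    ("C", 0.70, {7, 8, 9}),  # less accurate, different elements
]
print([name for name, _, _ in build_reduced_set(mined, 2)])  # ['A', 'C']
```

With the redundancy weight set to zero the same call would return A and B, which shows how the weights steer the trade-off between accuracy and diversity.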
In the current implementation, we use six redescription quality criteria; however, more can be added. Five of these criteria are general redescription quality criteria; the last one is used when the underlying data contains missing values and will be described in the following section.
The procedure findSpecificRed uses the information about the redescription Jaccard index, p-value, query size and the occurrence of elements described by the redescription and attributes found in redescription queries in redescriptions from the redescription set. The p-value quality score of a redescription is computed as:
The logarithm is applied to linearise the p-values and the normalization is used because is the smallest possible p-value that we can compute.
The element occurrence score of a redescription is computed as: . The attribute occurrence score is computed in the same way as: . We also compute the score measuring query size in redescriptions:
The user-defined constant denotes redescription complexity normalization factor. In this work we use , because redescriptions containing more than variables in the queries are highly complex and hard to understand.
The first redescription is chosen by computing: . Each following redescription is evaluated with a score function that computes redescription similarity to each redescription contained in the redescription set. The similarity is based on described elements and attributes used in redescription queries. This score thus allows controlling the level of redundancy in the redescription set. For a redescription we compute: and .
Several different approaches to reducing redundancy among redescriptions have been used before; however, no exact measure was used to select redescriptions or to assess the overall level of redundancy in the redescription set. Zaki & Ramakrishnan (2005) developed an approach for non-redundant redescription generation based on a lattice of closed descriptor sets. Ramakrishnan et al. (2004) used a parameter defining the number of times one class or descriptor is allowed to participate in a redescription. This is used to make a trade-off between exploration and redundancy. Parida & Ramakrishnan (2005) computed non-redundant representations of sets of redescriptions containing some selected descriptor (set of Boolean attributes). Galbrun & Miettinen (2012b) defined a minimal contribution parameter each literal must satisfy to be incorporated in a redescription query. This enforces control over redundancy on the redescription level. Redundancy between different redescriptions is tackled in the Siren tool (Galbrun & Miettinen, 2012c) as a post-processing (filtering) step. Mihelčić et al. (2015b) use weighting of attributes occurring in redescription queries and element occurrence in redescription supports based on work in subgroup discovery (Gamberger & Lavrac, 2002; Lavrač et al., 2004).
We combine the redescription p-value score with its support to first add highly accurate, significant redescriptions with smaller support, and then incrementally add accurate redescriptions with larger support size. Candidate redescriptions are found by computing: , where denotes the number of redescriptions contained in the set under construction at this step.
3.3 Missing values
There are several possible ways of computing the redescription Jaccard index when the data contains missing values. The approach that assumes that all elements from redescription support containing missing values are distributed in a way to increase the redescription Jaccard index is called optimistic. Similarly, the approach that assumes that all elements from redescription support containing missing values are distributed in a way to decrease the redescription Jaccard index is called pessimistic. The rejective Jaccard index evaluates redescriptions only by observing elements that do not contain missing values for attributes contained in redescription queries. These measures are discussed in Galbrun & Miettinen (2012b). The query non-missing Jaccard index, introduced in Mihelčić et al. (2015b), is an approach that gives a more conservative estimate than the optimistic Jaccard index but a more optimistic estimate than the pessimistic Jaccard index. The main evaluation criterion for this index is that a query (containing only the conjunction operator) cannot describe an element that contains missing values for attributes in that query. This index is by its value closer to the optimistic than to the pessimistic Jaccard index. However, as opposed to the optimistic approach, redescriptions evaluated by this index contain in their support only elements that have defined values for all attributes in redescription queries and that satisfy query constraints. The index does not penalize the elements containing missing values for attributes in both queries, which are penalized in the pessimistic Jaccard index.
In this paper, we introduce a natural extension to the presented measures: the redescription variability index. This index measures the maximum possible variability in redescription accuracy due to missing values. It allows finding redescriptions whose accuracy varies only slightly regardless of the actual values of the missing entries. It also allows relaxing the very strict constraints imposed by the pessimistic Jaccard index, which might lead to the elimination of some useful redescriptions.
The redescription variability index is defined as: .
Formal definitions of pessimistic and optimistic Jaccard index can be seen in Section S1.2 (Online resource 1).
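As a minimal illustration of the measures described in this section (not the CRM-GRS implementation), the following Python sketch computes the four Jaccard variants from per-element query outcomes. It assumes that each query evaluates to true, false or "cannot be evaluated" (`None`) on each element, that the optimistic and pessimistic variants resolve every missing outcome in the most and least favourable way respectively, and that the variability index is the difference between the optimistic and pessimistic values. The last assumption is consistent with "maximum possible variability" but is not confirmed by the text above; the function name and the toy data are hypothetical.

```python
from typing import Dict, Iterable, Optional, Tuple

# One (q1, q2) pair per element: True / False = the query holds / fails,
# None = the query cannot be evaluated because an attribute value is missing.
Status = Tuple[Optional[bool], Optional[bool]]


def _ratio(intersection: int, union: int) -> float:
    """Jaccard ratio, defined as 0.0 when the union is empty."""
    return intersection / union if union else 0.0


def jaccard_variants(pairs: Iterable[Status]) -> Dict[str, float]:
    pairs = list(pairs)
    both = sum(1 for a, b in pairs if a is True and b is True)
    one_only = sum(1 for a, b in pairs if {a, b} == {True, False})
    true_missing = sum(1 for a, b in pairs if {a, b} == {True, None})
    false_missing = sum(1 for a, b in pairs if {a, b} == {False, None})
    both_missing = sum(1 for a, b in pairs if a is None and b is None)

    # Optimistic: missing outcomes are resolved to raise the index
    # (True/None and None/None join the intersection, False/None is ignored).
    optimistic = _ratio(both + true_missing + both_missing,
                        both + one_only + true_missing + both_missing)
    # Pessimistic: every element with a missing outcome only enlarges the union.
    pessimistic = _ratio(both, both + one_only + true_missing
                         + false_missing + both_missing)
    # Rejective: only elements fully evaluable by both queries are considered.
    rejective = _ratio(both, both + one_only)
    # Query non-missing (assumed reading): a query never describes an element
    # on which it cannot be evaluated, i.e. None is treated as False.
    query_non_missing = _ratio(both, both + one_only + true_missing)

    return {
        "optimistic": optimistic,
        "pessimistic": pessimistic,
        "rejective": rejective,
        "query_non_missing": query_non_missing,
        # Assumed definition of the variability index (see lead-in).
        "variability": optimistic - pessimistic,
    }


toy = [(True, True), (True, True), (True, False), (True, None),
       (None, None), (False, None), (False, False)]
scores = jaccard_variants(toy)
```

On this toy input the query non-missing value lies between the pessimistic and optimistic values, as the description above requires.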
The scores used to find the first and the best redescription in generalized redescription set construction (Section 3.2) are extended to include the variability score.
Our framework optimizes the query non-missing Jaccard index but reports all Jaccard index measures when mining redescriptions on data containing missing values. In principle, with the generalized redescription set construction, we can return reduced sets containing accurate redescriptions found with respect to each Jaccard index. Also, with the use of the variability index, the framework allows finding redescriptions whose accuracy is affected only to a very small degree by the missing values, which is not possible with other redescription mining algorithms in the literature. ReReMi, the only other approach that works with missing values, requires performing multiple runs of the algorithm to make any comparison between redescriptions mined by using different versions of the Jaccard index.
4 Data description and applications
We describe three datasets used to evaluate CRM-GRS and demonstrate its application on a Country dataset.
4.1 Data description
The evaluation and comparisons are performed on three datasets with different characteristics: the Country dataset (UNCTAD, 2014; WorldBank, 2014; Gamberger et al., 2014), the Bio dataset (Mitchell-Jones, 1999; Hijmans et al., 2005; Galbrun, 2013) and the DBLP dataset (DBLP, 2010; Galbrun, 2013). Detailed description of each dataset can be seen in Section S2 (Online resource 1).
|Dataset (elements)||First view||Second view|
|Country countries||Numerical () World Bank Year: 2012 Country info||Numerical () UNCTAD Year: 2012 Trade Info|
|Bio geographical locations||Numerical () Climate conditions||Boolean mammal species|
|DBLP authors||Boolean () author-conference bipartite graph||Boolean () co-authorship network|
Descriptions of all attributes used in the datasets are provided in the document (Online Resource 2).
4.2 Application on the Country dataset
The aim of this study is to discover regularities and interesting descriptions of world countries with respect to their trading properties and general country information (such as various demographic, banking and health-related descriptors). We focus on redescriptions describing four European countries: Germany, Czech Republic, Austria and Italy, discovered as a relevant cluster in a study performed by Gamberger et al. (2014). That study investigated country and trade properties of EU countries with potential implications for a free trade agreement with China. This or a similar use case may be a topic of investigation for economic experts, but the results of such an analysis could also be of interest to policymakers and people involved in the export or import business.
The first step in the exploration process involves specifying various constraints on the produced redescriptions. Determining parameters such as the minimal Jaccard index or the minimal support usually requires extensive experimentation. With CRM-GRS, these experiments can be performed with only one run of the redescription mining algorithm by using minimal Jaccard index of , minimal support of countries (if smaller subsets are not desired) and p-value of . Parameters specifying the reduced set construction can then be tuned to explore different redescription set sizes, minimal Jaccard thresholds or minimal and maximal support intervals. Results of such a meta-analysis (presented in Section S2.2.2 (Online resource 1)) show little influence of the minimal Jaccard threshold on this dataset; however, the right choice of minimal support is important. Redescription sets using minimal support threshold of countries show superior properties and may contain useful knowledge.
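The idea of mining once with loose constraints and then exploring stricter settings offline can be sketched as a simple filter over the stored pool of redescriptions. This is only an illustration under stated assumptions: the `Redescription` record, the `filter_pool` function and all numeric values below are hypothetical and are not taken from the paper or from the CRM-GRS code.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Redescription:
    """Hypothetical record holding the quality measures of one redescription."""
    name: str
    jaccard: float
    support: int
    p_value: float


def filter_pool(pool: Sequence[Redescription],
                min_jaccard: float,
                min_support: int,
                max_support: int,
                max_p_value: float) -> List[Redescription]:
    """Keep redescriptions satisfying the stricter, user-chosen constraints."""
    return [
        r for r in pool
        if r.jaccard >= min_jaccard
        and min_support <= r.support <= max_support
        and r.p_value <= max_p_value
    ]


# Pool produced by a single mining run with loose constraints (made-up values).
pool = [
    Redescription("R1", jaccard=0.95, support=6, p_value=1e-6),
    Redescription("R2", jaccard=0.70, support=40, p_value=1e-4),
    Redescription("R3", jaccard=0.85, support=3, p_value=1e-3),
    Redescription("R4", jaccard=0.55, support=12, p_value=0.02),
]

# Two different settings explored without re-running the mining algorithm.
strict = filter_pool(pool, min_jaccard=0.8, min_support=5,
                     max_support=50, max_p_value=0.01)
loose = filter_pool(pool, min_jaccard=0.5, min_support=3,
                    max_support=50, max_p_value=0.05)
```

Each call only re-reads the stored pool, so trying a new threshold combination costs a single pass over already mined redescriptions.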
We present three different redescriptions describing specified countries and revealing their similarity to several other countries (demonstrated in Figure 4).
Redescriptions and are defined as:
(, , , )
(, , )
(, , , )
|% of population aged [0,14]|
|% of population aged 65+|
|Mortality under years per|
|% of population growth|
|% of population living in rural area|
|% of GDP spent on worker’s remittances and compensation|
|% of adults listed by private credit bureau|
|% of GDP as (quasi) money|
|export, import, export to import ratio|
|Miscellaneous manufactured articles|
|Medium - skill, technology - intensive manufactures|
|Electrical machinery, apparatus and appliances|
|All allocated products|
|Cereals and cereal preparations|
|Beverages and tobacco|
The presented redescriptions (attribute descriptions are available in Table 2) confirm several findings reported in (Gamberger et al., 2014), namely high export of medium-skill and technology-intensive manufactures, export of beverages and tobacco, and a low percentage of young population. Additionally, these redescriptions reveal a high percentage of elderly population (age and above), a lower (compared to world average of ) but still present mortality rate of children under years of age (per living) and a small to medium percentage of rural population. The credit coverage (percentage of adults registered for having unpaid debts, repayment history etc.) varies between countries but is no less than adult population. The money and quasi money (M2 - sum of currency outside banks etc.) is between substantial and very large of total country’s GDP. For additional examples see Section S2.2.3, Figure S11 (Online resource 1).
Output of CRM-GRS can be further analysed with visualization and exploration tools such as the Siren (Galbrun & Miettinen, 2012c) (available at http://siren.gforge.inria.fr/main/) or the InterSet (Mihelčić & Šmuc, 2016) (available at http://zel.irb.hr/interset/). In particular, the InterSet tool allows exploration of different groups of related redescriptions, discovery of interesting associations, multi-criteria filtering and redescription analysis on the individual level.
5 Evaluation and comparison
In this section we present the results of different evaluations. First, we perform a theoretical comparison of our approach with other state-of-the-art solutions, which includes a description of the advantages and drawbacks of our method. Next, we apply the generalized redescription set construction procedure to these datasets starting from redescriptions created by the CLUS-RM algorithm. We evaluate the conjunctive refinement procedure and perform a thorough comparison of our reduced sets with the redescription sets obtained by several state-of-the-art redescription mining algorithms. The comparisons use measures on individual redescriptions (Section 2.1) as well as measures on redescription sets (Section 2.2). We also use the normalized query size defined in Section 3.2.
The execution time analysis, showing significant time reduction when using generalized redescription set construction instead of multiple CLUS-RM runs, is described in Section S2.4 (Online resource 1).
5.1 Theoretical algorithm comparison
We compare the average case time and space complexity of the CRM-GRS with state of the art approaches and present the strengths and weaknesses of our framework.
|Algorithm||Time comp.||Space comp.|
|CRM-GRS||(No refinement) (refinement)|
The term in Table 3 denotes the number of nodes in the tree and is constrained by the tree depth . denotes the set of produced maximal closed frequent itemsets, denotes the length of the longest itemset, a set of produced biclusters, and denotes a set of produced redescriptions.
We can see from Table 3 that CRM-GRS has slightly higher computational complexity than other tree-based approaches (whose complexity is based on the time complexity of the C4.5 algorithm), caused by the complexity of the underlying redescription mining algorithm CLUS-RM. Optimizations proposed in (Mihelčić et al., 2015b) lower the average time complexity of the basic algorithm to and of the algorithm with refinement to . Worst-case complexity with the use of refinement is . It is the result of a very optimistic estimate that the number of produced redescriptions satisfying user constraints grows quadratically with the number of nodes in the tree (this is only the case if no constraints on redescriptions are enforced). In reality, it has at most linear growth. Furthermore, term is only dominating if . Since redescription queries become very hard to understand if they contain more than attributes, even with attributes in each of two views, this term is dominated when instances.
Greedy approaches (Gallo et al., 2008; Galbrun & Miettinen, 2012b) are less affected by the increase in number of instances than the tree-based approaches, but are more sensitive to the increase in number of attributes.
Complexity of approaches based on closed and frequent itemset mining (Gallo et al., 2008; Zaki & Ramakrishnan, 2005) depends on the number of produced frequent or closed itemsets which in worst case equals . Similarly, the complexity of approach proposed by Parida & Ramakrishnan (2005) depends on the number of created biclusters and their size.
One property of our generalized redescription set construction procedure (GRSC) is that it can be used to replace multiple runs of expensive redescription mining algorithms. Analysis from Table 3 and in S2.6 (Online resource 1) shows that it has substantially lower time complexity than all state-of-the-art approaches except the MID and the Closed Dset. However, even for these approaches, it might be beneficial to use GRSC instead of multiple runs of these algorithms when .
Since a trade-off between space and time complexity can be made for each of the analysed algorithms, we write the space complexity as a function of stored itemsets, rules, redescriptions or clusters. To reduce execution time, these structures can be stored in memory together with the corresponding instances, which increases space complexity to for all approaches.
One drawback of our method is increased memory consumption ( in the worst case). Since we memorize all distinct created redescriptions that satisfy user constraints, it is among the more memory-expensive approaches. Although the estimate is greatly exaggerated, and is in real applications at most , it is currently the only approach that memorizes and uses all created redescriptions to create diverse and accurate redescription sets for the end users. If the memory limit is reached, we use the GRSC procedure (called in line 8 of Algorithm 1) to create reduced redescription sets of predefined properties. Only redescriptions from these sets are retained, allowing further execution of the framework.
Greedy and the MID approaches are very memory efficient since they store only a small number of candidate redescriptions in memory. Other tree-based approaches store two decision trees at each iteration, Closed Dset (Zaki & Ramakrishnan, 2005) approach saves a closed lattice of descriptor sets and the relaxation lattice approach (Parida & Ramakrishnan, 2005) saves produced biclusters.
The main advantages of our approach are that it produces a large number of diverse, highly accurate redescriptions which enables our multi-objective optimization procedure to generate multiple, high quality redescription sets of differing properties that are presented to the end user.
5.2 Experimental procedure
In this section we explain all parameter settings used to perform evaluations and comparisons with various redescription mining algorithms.
For all algorithms, we used the maximal p-value threshold of (the strictest significance threshold). The minimal Jaccard index was set to for the DBLP dataset based on results presented in Galbrun (2013), Table 6.1, p. 46. The same is set to for the Bio dataset based on results in Galbrun (2013) Table 7, p. 301. The threshold for the Country dataset was experimentally determined. Minimal support was set to elements for the DBLP, based on Galbrun (2013) p.48, and the same is used for the Bio dataset. The Country dataset is significantly smaller, thus we set this threshold to elements. The impact of changing the minimal Jaccard index and the minimal support is data dependent. Increasing these thresholds causes a drop in diversity of the produced redescriptions, resulting in high redundancy and in some cases an inadequate number of produced redescriptions. However, it also increases the minimal and average redescription Jaccard index and support size. Lowering these thresholds has the opposite effect, increasing diversity but potentially reducing overall redescription accuracy or support size. Increasing the maximal p-value threshold allows more redescriptions (although less significant) to be considered as candidates for redescription set construction. The effects of changing the minimal Jaccard index and the minimal support size on the redescription set of size produced by our framework on the Country, Bio and DBLP datasets can be seen in Section S2.2.2 (Online resource 1).
We compared the CLUS-RM algorithm with the generalized redescription set construction procedure (CRM-GRS) to the ReReMi, the Split trees and the Layered trees algorithms implemented in the tool called Siren (Galbrun & Miettinen, 2012c). The specific parameter values used for each redescription mining algorithm can be seen in Section S2 (Online Resource 1).
5.3 Analysis of redescription sets produced with CRM-GRS
We analyse a set containing all redescriptions produced by CLUS-RM algorithm (referred to as a large set of redescriptions) and the corresponding sets of substantially smaller size constructed from this set by generalized redescription set construction procedure (referred to as reduced sets of redescriptions) on three different datasets.
For the purpose of this analysis, we create redescriptions without using the refinement procedure and disallow multiple redescriptions describing the same set of instances. To explore the influence of using different importance weights on properties of produced redescription sets, we use the different weight combinations given in Table 4.
In the rows and of matrix , we incrementally increase the importance weight for the Jaccard index and equally decrease the weight for the element and attribute Jaccard index in order to explore the effects of finding highly accurate redescriptions at the expense of diversity. The last row explores the opposite setting that completely disregards accuracy and concentrates on diversity.
By using importance weights in each row of matrices (Table 4) and (Table 5), we create redescription sets containing and redescriptions. We plot the change in element/attribute coverage, average redescription Jaccard index, average p-value, average element/attribute Jaccard index and average query size against the redescription set size. Information about redescriptions in the large set is used as a baseline and compared to the quality of reduced sets.
5.3.1 The analysis on the Bio dataset
We start the analysis by examining the properties of the large redescription set presented in Figure 5. In Figure 6, we compare the properties of redescriptions in the large redescription set against the properties of redescriptions in reduced sets based on different preference vectors. The results are presented only for the Bio dataset; a similar analysis for the DBLP and the Country dataset is presented in Section S2.2.3 (Online Resource 1).
Figure 5 shows distributions of quality measures for redescriptions in the large redescription set constructed with the CLUS-RM algorithm. The redescription Jaccard index is mostly in interval, though a noticeable number is in . The p-value is at most but mainly smaller than . The maximum average element Jaccard index equals and the maximum average attribute Jaccard index equals which shows a fair level of diversity among the produced redescriptions. Over of redescriptions contain fewer than attributes in both queries, and more than contain fewer than attributes in both queries, which is good for understandability.
Plots in Figure 6 contain graphs demonstrating a specific property of the reduced redescription set and its change with the increase of reduced redescription set size. The Reduced graph demonstrates properties of redescriptions contained in redescription set created with the preference weights from the -th row of . The graph labelled Large set demonstrates properties of redescriptions from a redescription set containing all produced redescriptions.
Increasing the importance weight for the redescription Jaccard index has the desired effect on redescription accuracy in reduced sets of various sizes. A large weight on this criterion leads to sets with many highly accurate but more redundant redescriptions (average element Jaccard ) with larger support (average support of the total number of elements in the dataset). A consequence of larger support is increased overall element coverage. The effect is in part a consequence of using the Bio dataset, which contains a number of accurate redescriptions with high support (also discussed in (Galbrun, 2013)). This effect is not observed on the Country and the DBLP dataset (Figures S4 and S5), where element and attribute coverage increases only with increasing diversity weights in the preference vector. The average redescription Jaccard index decreases as the reduced set size increases, which is expected since the total number of redescriptions with the highest possible accuracy is mostly smaller than .
Use of weights from the second row of the importance matrix largely reduces redundancy and moderately lowers redescription accuracy in produced redescription set compared to weights that highly favour redescription accuracy. The equal weight combination provides accurate redescriptions (above large set average) that describe different subsets of elements by using different attributes (both below large set average). The average redescription support is lower as a result, around of data elements. Despite this, the element coverage is between and with the sharp increase to for a set containing redescriptions. The element coverage reaches for sets containing at least redescriptions.
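For readers who want to reproduce the coverage figures on their own redescription sets, a minimal Python sketch follows. It assumes that element coverage is the fraction of dataset elements contained in the support of at least one redescription in the set; the formal definition is given in Section 2.2 and may differ in detail. The function name and the toy supports are hypothetical.

```python
from typing import Iterable, Set


def element_coverage(supports: Iterable[Set[int]], n_elements: int) -> float:
    """Fraction of the n_elements dataset elements described by any redescription."""
    if n_elements <= 0:
        raise ValueError("n_elements must be positive")
    covered: Set[int] = set().union(*supports)
    return len(covered) / n_elements


# Supports of three redescriptions over a dataset with 10 elements (made up).
toy_supports = [{1, 2, 3}, {3, 4}, {7}]
coverage = element_coverage(toy_supports, n_elements=10)
```

Because overlapping supports are counted once, a set of redundant redescriptions can have large average support and still low coverage, which is why the two quantities are reported separately.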
Depending on the application, it might be interesting to find different, highly accurate descriptions of the same or very similar sets of elements (thus the weights from the third row of from Table 4 would be applied). Higher redundancy provides different characteristics that define the group. It sometimes also provides more specific information about subsets of elements of a given group.
We found several highly accurate redescriptions describing very similar subsets of locations on the Bio dataset by using weights from the third row of the matrix . These locations are characterized as a co-habitat of the Arctic fox and one of several other animals with some specific climate conditions. We provide two redescriptions describing a co-habitat of the Arctic fox and the Wood mouse.
This redescription describes locations with Jaccard index . One very similar redescription describing locations from which are the same as above, with Jaccard index is:
Examples that are even more interesting can be found on the Country data where very similar sets of countries can be described by using different trading and general country properties. The example can be seen in Section S2.1.3, Figure S11 (Online Resource 1).
5.3.2 Using the redescription variability index on the Country dataset
We analyse the impact of missing values on redescription creation and use the newly defined redescription variability index (), in the context of generalized set generation, on the Country dataset with a weight matrix shown in Table 5. The variability weight is gradually increased while other weights are equally decreased to keep the sum equal to (which is convenient for interpretation).
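The construction of such weight rows can be sketched as follows. This is an illustration only: it assumes that the weights in each row sum to 1 and that the variability weight is placed last; the number of other criteria and the weight values used below are hypothetical and are not the entries of Table 5.

```python
from typing import List, Sequence


def variability_weight_rows(n_other_criteria: int,
                            variability_weights: Sequence[float]) -> List[List[float]]:
    """Build one weight row per requested variability weight.

    The remaining mass (1 - w) is split equally among the other criteria,
    so every row sums to 1 (assumed normalization).
    """
    if n_other_criteria <= 0:
        raise ValueError("n_other_criteria must be positive")
    rows = []
    for w in variability_weights:
        if not 0.0 <= w <= 1.0:
            raise ValueError("each variability weight must lie in [0, 1]")
        share = (1.0 - w) / n_other_criteria
        rows.append([share] * n_other_criteria + [w])
    return rows


# Hypothetical setting: four other criteria, three increasing variability weights.
rows = variability_weight_rows(4, [0.0, 0.2, 0.6])
```

Keeping the row sum fixed means that any gain in the variability weight is paid for equally by the other criteria, which makes rows directly comparable.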
The change in variability index depending on a reduced set size and comparison with the large set can be seen in Figure 7.
As expected, increasing the importance weight for redescription variability favours selecting redescriptions that are more stable with respect to changes in missing values.
To demonstrate the effect of the variability index on redescription accuracy, we plot graphs comparing averages of the optimistic, query non-missing and pessimistic Jaccard index for every row of the weight matrix for different reduced set sizes. The results for row and row can be seen in Figures 8 and 9. Plots for reduced sets obtained with importance weights from the , the and the row of are available in Figure S12 (Online resource 1).
Increasing the weight on the variability index has the desired effect of reducing the difference between values of different Jaccard index measures. However, the average optimistic and query non-missing Jaccard index values in the reduced sets drop as a result.
Redescription with :
is a highly accurate and stable redescription constructed by CRM-GRS with the importance weights from the fourth row of matrix . It is statistically significant, with a p-value smaller than .
Redescriptions exist for which and . In such cases, the drop in accuracy from to occurs because a number of elements exist in the dataset for which membership in the support of neither redescription query can be determined, due to missing values. Optimizing the pessimistic Jaccard index is very strict and can discard some potentially significant redescriptions such as:
. This redescription has and . With the variability index of it describes all elements that can be evaluated by at least one redescription query with the highest possible accuracy.
This example motivates optimizing the query non-missing Jaccard index with a positive weight on the variability index. It is especially useful when only a small number of highly accurate redescriptions can be found and when a large percentage of missing values is present in the data.
5.4 Evaluating the conjunctive refinement procedure
The next step is to evaluate the conjunctive refinement procedure and its effects on the overall redescription accuracy. We use the same experimental set-up as in Section 5.3 for both sets with the addition of the minimum refinement Jaccard index parameter, which was set to on the Bio dataset and on the Country and the DBLP dataset. The algorithm requires the initial clusters to start the mining process as explained in Section 3.1.1 and in (Mihelčić et al., 2015b). To maintain the initial conditions, we create one set of initial clusters and use them to create redescriptions with and without the conjunctive refinement procedure. Since we use PCTs with the same initial random generator seed in both experiments, the differences between sets are the result of applying the conjunctive refinement procedure. The effects of using conjunctive refinement are examined on sets containing all redescriptions produced by CLUS-RM and on reduced sets created with equal importance weights by the generalized redescription set construction procedure (Row in matrix ).
The effects of using the refinement procedure on redescription accuracy are demonstrated in a comparative histogram (Figure 10) showing the distribution of the redescription Jaccard index in a set created by CLUS-RM with and without the refinement procedure.
CLUS-RM produced redescriptions, satisfying constraints from Section 5.2, without the refinement procedure and redescriptions with the refinement procedure. The substantial increase in redescriptions satisfying user-defined constraints, when the conjunctive refinement procedure is used, is accompanied by significant improvement in redescription accuracy.
We performed the one-sided independent 2-group Mann-Whitney U test with the null hypothesis that there is a probability of that an arbitrary redescription () from a set obtained by using conjunctive refinement has the Jaccard index larger than an arbitrary redescription () from a set obtained without using the conjunctive refinement procedure (). The p-value of leads us to reject the null hypothesis with the level of significance and conclude that must be true.
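The test used above can be illustrated with a small, self-contained Python sketch. It computes the U statistic, the estimated probability that a value from the first sample exceeds a value from the second (ties counted as one half), and a one-sided p-value from the normal approximation with continuity correction. The tie correction of the variance is omitted, which is an assumed simplification; for small samples or heavily tied data an exact method or a statistics package is preferable. Under the usual formulation of the test, the null hypothesis corresponds to this probability being 0.5. The function name and the sample values are hypothetical and are not results from the paper.

```python
import math
from typing import Sequence, Tuple


def mann_whitney_greater(x: Sequence[float],
                         y: Sequence[float]) -> Tuple[float, float, float]:
    """One-sided Mann-Whitney U test, alternative: values in x tend to exceed y.

    Returns (U, estimated P(X > Y) with ties counted as 0.5, approximate p-value).
    """
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")
    # U counts the pairs in which the x value wins; ties contribute one half.
    u = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)
    prob_superiority = u / (n1 * n2)
    # Normal approximation of U under the null (no tie correction).
    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean_u - 0.5) / sd_u          # continuity correction
    p_value = 0.5 * math.erfc(z / math.sqrt(2.0))  # upper-tail probability
    return u, prob_superiority, p_value


# Made-up Jaccard values for sets mined with and without refinement.
with_refinement = [0.90, 0.80, 0.85, 0.95]
without_refinement = [0.60, 0.70, 0.65, 0.75]
u_stat, prob, p_val = mann_whitney_greater(with_refinement, without_refinement)
```

Reporting the estimated probability alongside the p-value is useful here because it directly matches the wording of the hypothesis: it states how often a redescription from one set beats a redescription from the other.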
Another useful property of the conjunctive refinement procedure is that it preserves the size of redescription support. The comparative distribution of redescription supports between the sets is shown in Figure 11.
The majority of redescriptions that entered the redescription set because of the improvements made by the conjunctive refinement have supports in the interval elements. Because of that, the average support size in the redescription set obtained by using the refinement procedure () is lower than that obtained without the refinement procedure (). The change in distribution is significant, as shown by the one-sided independent 2-group Mann-Whitney U test. The test rejects the hypothesis with the level of significance (p-value equals ), thus showing that .
Using the conjunctive refinement procedure improves redescription accuracy and adds many new redescriptions to the redescription set. However, since the reduced sets are presented to the user, it is important to see if higher quality reduced sets can be created from the large set by using the conjunctive refinement procedure compared to the set obtained without using the procedure.
We plot comparative distributions for all defined redescription measures for reduced sets extracted from the redescription set obtained with (CLRef) and without (CLNRef) the conjunctive refinement procedure. The comparison made on the sets containing redescriptions is presented in Figure 12. The boxplots representing distributions of supports show that the redescription construction procedure extracts redescriptions of various support sizes, which was intended to prevent focusing only on large or small redescriptions based on redescription accuracy.
We compute the one-sided independent 2-group Mann-Whitney U test on the reduced sets for the redescription Jaccard index () and the normalized redescription query size (), since there seems to be a difference in distributions as observed from Figure 12. For other measures, we compute the two-sided Mann-Whitney U test to assess if there is any notable difference in values between the sets.
The null hypothesis that is rejected with a p-value smaller than , thus the alternative hypothesis holds. The difference in support between the two sets is not statistically significant (p-value equals , obtained with the two-sided test). Distributions of redescription p-values are identical because all redescriptions have equal p-value: . The difference in average attribute/element Jaccard index is also not statistically significant (p-values and respectively obtained with the two-sided test). The p-value for the null hypothesis equals thus the alternative hypothesis