Detecting Important Patterns Using Conceptual Relevance Interestingness Measure

Discovering meaningful conceptual structures is a substantial task in data mining and knowledge discovery applications. While off-the-shelf interestingness indices defined in Formal Concept Analysis may provide an effective relevance evaluation in several situations, they frequently give inadequate results when faced with massive formal contexts (and concept lattices), and in the presence of irrelevant concepts. In this paper, we introduce the Conceptual Relevance (CR) score, a new scalable interestingness measurement for the identification of actionable concepts. From a conceptual perspective, the minimal generators provide key information about their associated concept intent. Furthermore, the relevant attributes of a concept are those that maintain the satisfaction of its closure condition. Thus, the guiding idea of CR exploits the fact that minimal generators and relevant attributes can be efficiently used to assess concept relevance. As such, the CR index quantifies both the amount of conceptually relevant attributes and the number of the minimal generators per concept intent. Our experiments on synthetic and real-world datasets show the efficiency of this measure over the well-known stability index.


1 Introduction

A wide range of crucial problems in different sub-disciplines, including data mining, knowledge discovery, social network analysis, bioinformatics, and machine learning, can be formulated as a pattern mining task. Within the mathematical framework of Formal Concept Analysis (FCA) (Ganter+1999), the lattice formalisation is typically rich in local conceptual structures that are important to data mining tasks (belohlavek2011selecting). In the lattice, each element captures a formal concept and the whole lattice represents a hierarchy of concepts. Unfortunately, the lattice can contain a large number of irrelevant local structures, which create a discrepancy between the real and the current settings of a given formal context. Irrelevant objects emerge for a number of reasons, such as imprecise or inaccurate data collection, e.g., erroneously inserting or omitting objects when describing some attributes. Irrelevant attributes, on the other hand, are redundant and frequently appear due to the completeness property of the lattice. They potentially lead to a high complexity even for small datasets (belohlavek2011selecting). In general, an irrelevant attribute or object inside a formal concept does not sufficiently contribute to the actionability of the pattern. That is, the irrelevant element often provides useless information when it is involved in patterns extracted from the formal context. Its removal from the concept therefore has no impact on the conceptual structure and significantly purifies the domain description and semantics.

Traditionally, there are three main strategies to mine interesting patterns in a concept lattice (Kuznetsov2018): (i) formal context reduction using methods such as Singular Value Decomposition or Non-Negative Matrix Decomposition, (ii) background knowledge (e.g., taxonomy or weight on attributes) or constraint consideration, and (iii) concept selection. In this paper we focus on the third strategy, which involves picking out a subset of concepts based on interestingness measures (Belohlavek2013). In the FCA literature, several selection measures have been proposed (see (Kuznetsov2018) for a detailed survey), such as concept probability, robustness, separation, Monocle (torim2008sorting), δ-tolerance closed frequent itemsets (cheng2006delta), margin-closed itemsets (moerchen2011efficient) and predictability (belohlavek2013basic), among others. On the basis of the Galois connection between the set of objects and the set of attributes, the closure and derivation conditions of formal concepts are often decisive properties for measuring the importance of these patterns. As such, the stability index (kuznetsov2007reducing), which depends primarily on these two properties, has been regarded as the most prominent index for assessing concept quality (Kuznetsov2018).

Although the stability index serves as a good indication of how much relevant detail is expected inside the concept, computing stability is known to be #P-complete (roth2008succinct), and it often requires time exponential in the size of the intent (or extent), i.e., $O(2^{n})$ for an intent (or extent) of size $n$. For example, more than a million computational comparisons may be needed to calculate the stability of a concept whose intent has size $20$, since $2^{20} = 1{,}048{,}576$. In (roth2008succinct), it was demonstrated that the stability can be computed based on the generators of the concept intent. However, this often has a time and space complexity that is at least quadratic in the size of the input lattice (zhi2014calculation; roth2008succinct). This is problematic even for small concepts, since the lattice size can be exponential in the size of the context. Furthermore, stability depends on generators that are not necessarily minimal, and it is widely known that using such generators causes an overestimation of concept quality, resulting in redundant association rules, including implications.

In this paper, to avoid the limitations of stability, we introduce the Conceptual Relevance (CR) index. At the conceptual level, our overall approach to the CR index consists of the following basic elements. First, we exploit the fact that using only the minimal generators of a concept intent frequently results in less redundant (or more important) patterns (e.g., implications and association rules). Second, we formulate what is meant by relevant attributes. More precisely, these attributes are often essential for maintaining stable conceptual structures. As a result, we design the Conceptual Relevance of a concept to quantify both the number of minimal generators and the number of relevant attributes in its intent, packed into one measure. The computation of the CR index requires polynomial time in the size of the concept's upper covers, and is therefore quite fast in practice.

The rest of the paper is organized as follows. Section 2 recalls some basic definitions of FCA and the stability index. Section 3 explains our proposed Conceptual Relevance index for measuring concept relevance in more detail. In Section 4 we conduct a thorough experimental study with a detailed discussion. Finally, Section 5 presents our conclusions.

2 Background

To illustrate the basic notions in Formal Concept Analysis (FCA) we use the Movies formal context with 11 objects and 7 attributes (see Figure 1).

Figure 1: (left) The Movies formal context and (right) its corresponding concept lattice.

2.1 Formal Concept Analysis

Formal concept analysis (FCA) is a general mathematical framework for building clusters defined as object sets sharing common attributes. Here, we briefly recall the FCA terminology (Ganter+1999). The starting point of an FCA analysis is a formal context, and the main output is a concept lattice (see Figure 1).

Definition 2.1 (Formal context).

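A formal context is a triple $\mathbb{K} = (G, M, I)$, where $G$ is the set of objects, $M$ is the set of attributes, and $I$ is a binary relation between $G$ and $M$ with $I \subseteq G \times M$. For $g \in G$ and $m \in M$, $(g, m) \in I$ holds iff the object $g$ has the attribute $m$.

Given arbitrary subsets $A \subseteq G$ and $B \subseteq M$, the Galois connection can be defined by the following derivation operators: $A' = \{m \in M \mid \forall g \in A, (g, m) \in I\}$ and $B' = \{g \in G \mid \forall m \in B, (g, m) \in I\}$, where $A'$ is the set of attributes common to all objects of $A$ and $B'$ is the set of objects sharing all attributes from $B$. The closure operator $(\cdot)''$ is the double application of $(\cdot)'$.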

Definition 2.2 (Formal concept).

The pair $c = (X, Y)$ is called a formal concept of $\mathbb{K}$ with extent $X \subseteq G$ and intent $Y \subseteq M$ iff $X' = Y$ and $Y' = X$.

For example, in Figure 1, is a formal concept with extent and intent . In the sequel, we use as our illustrative example to support the understanding of definitions and principles related to the Conceptual relevance index.

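A partial order $\leq$ exists between two concepts $c_1 = (X_1, Y_1)$ and $c_2 = (X_2, Y_2)$: $c_1 \leq c_2$ if $X_1 \subseteq X_2$ (equivalently, $Y_2 \subseteq Y_1$). The set of all concepts together with the partial order forms a concept lattice.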
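As an illustration, the derivation operators and the formal-concept condition can be sketched in a few lines of Python. The toy context and all function names below are assumptions made for this example; the data is hypothetical and is not the Movies context of Figure 1.

```python
# Toy formal context (hypothetical): each object maps to its attribute set.
CONTEXT = {
    "o1": {"a", "b"},
    "o2": {"a", "c"},
    "o3": {"a", "b", "c"},
    "o4": {"b"},
}
ATTRIBUTES = {"a", "b", "c"}


def common_attributes(objects):
    """Derivation of an object set: attributes shared by all given objects."""
    result = set(ATTRIBUTES)
    for g in objects:
        result &= CONTEXT[g]
    return result


def common_objects(attributes):
    """Derivation of an attribute set: objects having all given attributes."""
    return {g for g, row in CONTEXT.items() if set(attributes) <= row}


def closure(attributes):
    """Double application of the derivation operators to an attribute set."""
    return common_attributes(common_objects(attributes))


def is_formal_concept(extent, intent):
    """A pair is a formal concept iff each side is the derivation of the other."""
    return (common_attributes(extent) == set(intent)
            and common_objects(intent) == set(extent))
```

In this toy context, the closure of `{"c"}` is `{"a", "c"}`, because every object that has `c` also has `a`.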

Definition 2.3 (Covers and faces).

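For two formal concepts $c_1$ and $c_2$, if $c_1 < c_2$ and there is no concept $c_3$ such that $c_1 < c_3 < c_2$, then $c_1$ is a lower cover of $c_2$, and $c_2$ is an upper cover of $c_1$, denoted by $c_1 \prec c_2$ and $c_2 \succ c_1$ respectively. The intentional face of a concept $c = (X, Y)$ w.r.t. its $i$-th upper cover concept, $c_i = (X_i, Y_i)$, is the difference between their intent sets (pfaltz2002closed), i.e., $F_i = Y \setminus Y_i$ for each $c_i \in \mathcal{U}(c)$, where $\mathcal{U}(c)$ is the set of the upper covers of $c$.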

Definition 2.4 (Generator (bastide2000mining)).

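Given a concept $c = (X, Y)$ in a formal context $\mathbb{K}$, a subset $h \subseteq Y$ is called a generator of the intent $Y$ of $c$ iff $h'' = Y$, and it is a minimal generator when there is no $h_1 \subset h$ such that $h_1'' = Y$. We use $MG(c)$ to denote the set of minimal generators of the intent of concept $c$.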
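To make the notion of a minimal generator concrete, the following brute-force sketch enumerates the subsets of an intent by increasing size and keeps those whose closure equals the intent and that contain no smaller generator. The toy context and function names are assumptions for illustration, and the exhaustive enumeration is exponential in the intent size; it is not the face-based procedure the paper relies on.

```python
from itertools import combinations

# Toy formal context (hypothetical): each object maps to its attribute set.
CONTEXT = {
    "o1": {"a", "b"},
    "o2": {"a", "c"},
    "o3": {"a", "b", "c"},
    "o4": {"b"},
}
ATTRIBUTES = {"a", "b", "c"}


def closure(attributes):
    """Closure of an attribute set: attributes common to all objects having it."""
    result = set(ATTRIBUTES)
    for row in CONTEXT.values():
        if set(attributes) <= row:
            result &= row
    return result


def minimal_generators(intent):
    """Return the minimal subsets of `intent` whose closure is `intent`."""
    intent = frozenset(intent)
    found = []
    for size in range(len(intent) + 1):
        for subset in combinations(sorted(intent), size):
            candidate = frozenset(subset)
            if any(g <= candidate for g in found):
                continue  # contains a smaller generator, hence not minimal
            if closure(candidate) == intent:
                found.append(candidate)
    return found
```

For the intent `{"a", "c"}` of this toy context, the only minimal generator is `{"c"}`; the intent `{"a", "b"}` is its own (and only) minimal generator.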

2.2 Concept Interestingness

Interestingness measures of a formal concept are widely used to assess its relevance. In this context, the stability index has been found to be prominent for selecting actionable concepts (Kuznetsov2018).

Definition 2.5 (Stability Index).

Let $\mathbb{K} = (G, M, I)$ be a formal context and $c = (X, Y)$ a formal concept of $\mathbb{K}$. The intentional stability $\sigma_{in}(c)$ is:

$$\sigma_{in}(c) = \frac{|\{e \in \mathcal{P}(X) \mid e' = Y\}|}{2^{|X|}}$$ (1)

where $\mathcal{P}(X)$ is the power set of $X$. In Equation (1), the intentional stability measures the strength of dependency between the intent $Y$ and the objects of the extent $X$. More precisely, it expresses the probability that $Y$ remains closed when a subset of noisy objects in $X$ is deleted with equal probability. This measure quantifies the amount of noise that causes overfitting in the intent $Y$.
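The definition can be read directly as a counting procedure. The sketch below is a naive illustration (toy context and function names are assumptions) that enumerates every subset of the extent and counts those whose common attributes are exactly the intent; its cost grows as 2 to the power of the extent size, which is the scalability issue discussed in the introduction.

```python
from itertools import combinations

# Toy formal context (hypothetical): each object maps to its attribute set.
CONTEXT = {
    "o1": {"a", "b"},
    "o2": {"a", "c"},
    "o3": {"a", "b", "c"},
    "o4": {"b"},
}
ATTRIBUTES = {"a", "b", "c"}


def common_attributes(objects):
    """Attributes shared by all given objects (all attributes for no objects)."""
    result = set(ATTRIBUTES)
    for g in objects:
        result &= CONTEXT[g]
    return result


def intentional_stability(extent, intent):
    """Fraction of subsets of the extent whose derivation equals the intent."""
    extent, intent = sorted(extent), set(intent)
    hits = 0
    for size in range(len(extent) + 1):
        for subset in combinations(extent, size):
            if common_attributes(subset) == intent:
                hits += 1
    return hits / 2 ** len(extent)
```

For the toy concept with extent `{"o2", "o3"}` and intent `{"a", "c"}`, two of the four subsets of the extent (`{"o2"}` and `{"o2", "o3"}`) yield the intent, so the stability is 0.5.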

3 Conceptual Relevance Measure

To set the stage for how the Conceptual Relevance quantifies the relevance of the formal concept, the first thing we do is to define relevant attributes through the lens of FCA.

Definition 3.1 (Conceptually Relevant Attribute (Ganter+1999; ganter2012formal)).

For a formal concept $c = (X, Y)$, an attribute $m \in Y$ is conceptually relevant if

$$(Y \setminus \{m\})' \neq X$$ (2)

That is, the concept does not preserve its local conceptual structure after removing $m$ from its intent. From a statistical perspective, this means that the attribute has a significant statistical influence, and mainly depends on the distribution of the concept's intent parameter. Intuitively, this implies that the attribute carries relevant information in the concept. Thus, removing it from the concept intent results in the loss of essential conceptual information (which appears clearly through the expansion of the extent). For instance, the attribute 'drama' is conceptually relevant in our illustrative concept since its removal results in an expansion of the extent and a violation of the derivation condition of the intent, i.e., the condition in Eq. (2) holds. On the contrary, 'adventure' is not a conceptually relevant attribute in this concept since its removal from the intent does not violate the derivation condition, i.e., the condition in Eq. (2) does not hold. At this point, we have paved the way for the Conceptual Relevance index.
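The relevance test amounts to one derivation per attribute of the intent. As a minimal sketch, assuming a hypothetical toy context (not the Movies context) and illustrative function names, an attribute is kept when dropping it from the intent enlarges the extent:

```python
# Toy formal context (hypothetical): each object maps to its attribute set.
CONTEXT = {
    "o1": {"a", "b"},
    "o2": {"a", "c"},
    "o3": {"a", "b", "c"},
    "o4": {"b"},
}


def common_objects(attributes):
    """Objects that have every attribute in the given set."""
    return {g for g, row in CONTEXT.items() if set(attributes) <= row}


def relevant_attributes(extent, intent):
    """Attributes whose removal from the intent changes (expands) the extent."""
    extent, intent = set(extent), set(intent)
    return {m for m in intent if common_objects(intent - {m}) != extent}
```

In the toy concept with extent `{"o2", "o3"}` and intent `{"a", "c"}`, attribute `c` is relevant (without it the extent grows to three objects), whereas `a` is not, since every object having `c` already has `a`.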

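Definition 3.2 (Conceptual Relevance Index $CR$).

The intentional Conceptual Relevance $CR(c)$ of a concept $c = (X, Y)$, in the formal context $\mathbb{K}$, can be computed as:

$$CR(c) = f(\alpha(c), \beta(c))$$ (3)

where

$$\alpha(c) = \frac{|\{m \in Y \mid (Y \setminus \{m\})' \neq X\}|}{|Y|}$$ (4)

and

$$\beta(c) = \frac{|MG(c)|}{2^{|Y|} - 2}$$ (5)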

Algorithm 1 gives the pseudo-code for calculating the Conceptual Relevance score in Eqs. 3-5. It takes as input the concept $c$ and its set of upper covers. At the first step, it computes the term $\alpha(c)$ by iterating through the attributes of the intent to count the number of conceptually relevant ones, i.e., those that satisfy the condition in Eq. 4 (Lines 2-10). From a conceptual perspective, the term $\alpha(c)$ quantifies the ratio of relevant attributes in the concept intent $Y$. For the term $\beta(c)$ in Eq. (5), we calculate the set of minimal generators of $c$ as presented in (szathmary2014fast) (Line 11). We then relate the number of minimal generators to the number of all possible generators of the intent $Y$, as in Eq. 5 (Lines 12-14). Note that the purpose of subtracting $2$ in the denominator of Eq. 5 is to exclude the trivial empty set and the whole intent. From a pattern mining perspective, the term $\beta(c)$ quantifies the number of potential local relevant substructures inside the concept intent $Y$. Note also that, in the formal context $\mathbb{K}$, the attributes outside a given concept intent do not have any influence on its conceptual structure; therefore, only the attributes of the concept intent should be used to normalize its terms. As a result, the size of the intent and the size of its power set serve as normalization factors to scale $\alpha(c)$ and $\beta(c)$ respectively. Finally, the algorithm computes a function $f$ of both the $\alpha(c)$ and $\beta(c)$ relevance terms.

$f$ can be any activation function applied to squash additive, multiplicative, divisor, logarithmic or other linear or non-linear relationships between the $\alpha(c)$ and $\beta(c)$ terms. For instance, $f$ can simply be the arithmetic average, which computes the linear additive relationship of the two terms. In that case, the Conceptual Relevance score of our illustrative concept is $CR(c) = (\alpha(c) + \beta(c))/2$.

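Input: Concept $c = (X, Y)$, set of upper covers $\mathcal{U}(c)$, activation function $f$.

Output: Conceptual Relevance score $CR(c)$.

1: $\alpha \leftarrow 0$; $\beta \leftarrow 0$;
// 1. Calculate the $\alpha$ term.
2: if $|Y| > 0$ then
3:    $r \leftarrow 0$;
4:    for each attribute $m$ in $Y$ do
5:       if $(Y \setminus \{m\})' \neq X$ then
6:          $r \leftarrow r + 1$;
7:       end if
8:    end for
9:    $\alpha \leftarrow r / |Y|$;
10: end if
// 2. Calculate the $\beta$ term.
11: $MG \leftarrow$ MinimalGenerators($c$, $\mathcal{U}(c)$); // Calculate minimal generators.
12: if $|Y| > 1$ and $|MG| > 0$ then
13:    $\beta \leftarrow |MG| / (2^{|Y|} - 2)$;
14: end if
15: $CR(c) \leftarrow f(\alpha, \beta)$;
16: return $CR(c)$;
Algorithm 1 Computing Conceptual Relevance Index.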
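For readers who prefer executable code, the following is a minimal Python sketch of the computation described above, assuming a toy context and the arithmetic average as activation. All names (`conceptual_relevance`, `alpha`, `beta`, the toy data) are assumptions for illustration. In particular, the minimal generators are found here by brute-force subset enumeration rather than through the faces of the upper covers used in (szathmary2014fast), so this sketch does not have the polynomial behaviour claimed for the actual algorithm.

```python
from itertools import combinations

# Toy formal context (hypothetical): each object maps to its attribute set.
CONTEXT = {
    "o1": {"a", "b"},
    "o2": {"a", "c"},
    "o3": {"a", "b", "c"},
    "o4": {"b"},
}


def extent_of(attributes):
    """Objects that have every attribute in the given set."""
    return {g for g, row in CONTEXT.items() if set(attributes) <= row}


def intent_of(objects):
    """Attributes shared by all given objects."""
    attrs = set().union(*CONTEXT.values())
    for g in objects:
        attrs &= CONTEXT[g]
    return attrs


def minimal_generators(intent):
    """Brute-force stand-in: minimal subsets whose closure is the intent."""
    intent = frozenset(intent)
    found = []
    for size in range(len(intent) + 1):
        for subset in combinations(sorted(intent), size):
            cand = frozenset(subset)
            if any(g <= cand for g in found):
                continue  # a smaller generator is contained in cand
            if intent_of(extent_of(cand)) == intent:
                found.append(cand)
    return found


def conceptual_relevance(extent, intent, f=lambda a, b: (a + b) / 2):
    """Combine the relevant-attribute ratio and the minimal-generator ratio."""
    extent, intent = set(extent), set(intent)
    alpha = beta = 0.0
    if intent:
        relevant = sum(1 for m in intent if extent_of(intent - {m}) != extent)
        alpha = relevant / len(intent)
    generators = minimal_generators(intent)
    if len(intent) > 1 and generators:
        # 2**|Y| - 2 excludes the empty set and the whole intent.
        beta = len(generators) / (2 ** len(intent) - 2)
    return f(alpha, beta)
```

For the toy concept with extent `{"o2", "o3"}` and intent `{"a", "c"}`, one of two attributes is relevant and there is one minimal generator among the two non-trivial subsets, so both terms equal 0.5 and the averaged score is 0.5. Passing a different `f` (for example `max` or a product) changes only the final combination step.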

Complexity Analysis.

The calculation of the $\alpha$ term has time and space complexity linear in the size of the intent, since we store and process all of its attributes to check their relevance condition. The $\beta$ term needs time and space polynomial in the number of upper covers and minimal generators, because we have to store and process all upper covers of $c$ to calculate the faces, and then iteratively check their intersections with each element in the progressively built minimal generator set. Taking the dominant of the two terms, the CR index has a total time complexity that is polynomial in the number of upper covers and minimal generators of the concept.

4 Experimental Evaluation

The objective of our empirical evaluation is to address these two key questions:

  • (Q1.) Is the CR index empirically accurate compared to the state-of-the-art indices for assessing the relevance of formal concepts? We seek to empirically analyze the accuracy of the Conceptual Relevance index.

  • (Q2.) Is the CR index faster than the state-of-the-art interestingness indices? We want to validate the efficiency of the Conceptual Relevance index.

4.1 Methodology

In order to obtain robust answers to Questions Q1 and Q2, we first select the following (synthetic and real-life) datasets, which are publicly available (https://github.com/tomhanika/conexp-clj/tree/dev/testing-data; http://icfca2012.markuskirchberg.net/index.php?page=webDataSets):

  • Dirichlet (felde2019formal) is a random formal context generated using the Dirichlet model generator (publicly available: https://github.com/maximilian-felde/formal-context-generator)

  • CoinToss (felde2020null) is a random formal context generated by indirect Coin-Toss model generator

  • LinkedIn (nr) contains information about some computer science experts’ job experiences/skills gleaned from their interconnected LinkedIN profiles

  • Diagnosis (czerniak2003application) includes a set of symptoms to make a medical diagnosis of whether a patient has a bladder inflammation or pelvic nephritis

  • PediaLanguages (morsey2012dbpedia) involves the semantic web of official languages spoken by people living in different countries

  • Bottlenose Dolphins (lusseau2003bottlenose) describes a network of frequent community associations of dolphins living in Doubtful Sound, New Zealand (publicly available: http://www-personal.umich.edu/~mejn/netdata/).

Their brief descriptions are shown in Table 1.

Dataset          Objects   Attributes   Lattice size   Shared concepts
Dirichlet           2000           15          18166              2903
CoinToss             793           10            913               645
LinkedIn            1269           34           4847              1473
Diagnosis            120           17             88                81
PediaLanguages       316          169            188                22
Dolphins              62           62            282                16
Table 1: A description of the tested datasets, giving for each one the number of objects, the number of attributes, the lattice size, and the number of concepts shared between the reference and test subsets.

Subsequently, we compare the results of CR, using the arithmetic average as the activation, against the intentional stability (Kuznetsov2018), which is currently the state-of-the-art interestingness index. We then follow the traditional approach (buzmakov2014concept) to validate the two relevance measures. That is, we apply the following scheme:

  1. Divide the dataset horizontally (i.e., by objects) into two disjoint subsets, a reference set $\mathbb{K}_{ref}$ and a test set $\mathbb{K}_{test}$.

  2. Extract the two sets $\mathcal{C}_{ref}$ and $\mathcal{C}_{test}$ of the shared formal concepts from $\mathbb{K}_{ref}$ and $\mathbb{K}_{test}$, respectively. Note that a concept and its corresponding one are shared if they have the same intent but not necessarily the same extent.

  3. Use $\mathbb{K}_{ref}$ as a reference dataset to calculate the underlying relevance index (e.g., Conceptual Relevance and stability) of the shared concepts in $\mathcal{C}_{ref}$, while using $\mathbb{K}_{test}$ as a test dataset to evaluate the relevance index values of the corresponding shared concepts in $\mathcal{C}_{test}$. Since shared concepts are matched one-to-one by intent, $|\mathcal{C}_{ref}| = |\mathcal{C}_{test}| = n$.

  4. Record the score list $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i$ and $y_i$ are the relevance measures of the $i$-th concept in $\mathcal{C}_{ref}$ and its corresponding concept in $\mathcal{C}_{test}$, respectively.

  5. Draw each pair $(x_i, y_i)$ as a point in a 2D plot, so that the best case is $x_i = y_i$. This means that the tested relevance index produces correct results if there is a strong linear relationship between the relevance evaluations $x$ and $y$. We therefore consider the underlying interestingness measure to be accurate if its relevance values for the shared formal concepts of the reference set $\mathcal{C}_{ref}$ are close or equal to the relevance values of the corresponding formal concepts obtained from the test set $\mathcal{C}_{test}$.

Based on this scheme, we consider the following two metrics to assess the accuracy (i.e., the strength of the linear relationship) and the performance of the results:

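1) The Pearson correlation coefficient $r$ (and its scatter plot):

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$$ (6)

where $\bar{x}$ and $\bar{y}$ are the mean values of $x$ and $y$ respectively. We recall that $n$ is the number of the shared concepts.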
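The pairing of shared concepts and the correlation of Eq. 6 can be sketched as follows. This is an illustrative sketch only: the function names, the representation of an intent as a sorted tuple of attribute names, and the score values in the usage note are all hypothetical and are not results from the paper.

```python
from math import sqrt


def paired_scores(ref_scores, test_scores):
    """Pair the scores of concepts shared by intent between the two subsets.

    Both arguments map an intent (here a tuple of attribute names) to the
    relevance score computed on the reference or test subset.
    """
    shared = [intent for intent in ref_scores if intent in test_scores]
    xs = [ref_scores[intent] for intent in shared]
    ys = [test_scores[intent] for intent in shared]
    return xs, ys


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of the same length (at least 2)")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)
```

A call such as `pearson(*paired_scores(ref, test))` then yields a value in the range from -1 to 1; values close to 1 correspond to points lying near an increasing straight line in the scatter plot.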

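2) The average elapsed time $\bar{t}$:

$$\bar{t} = \frac{1}{2n} \sum_{i=1}^{n} \left(t_{x_i} + t_{y_i}\right)$$ (7)

where $t_{x_i}$ and $t_{y_i}$ are the elapsed times for computing the underlying index of the $i$-th shared concept and its corresponding one respectively.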

All the experiments were run on an Intel(R) Core-i7 CPU @2.6GHz computer with 16 GB of memory under macOS Mojave. We implemented the two relevance indices as an extension to the Python package Concepts 0.7.11, which is implemented by Sebastian Bank (publicly available: https://pypi.python.org/pypi/concepts).

4.2 Results

We conduct our experimental evaluations through two experiments.

Experiment I. The first experiment is dedicated to answering Q1. In line with the scheme explained above, we first divide each one of the six underlying datasets into reference and tested subsets. Two relevance indices, namely Conceptual Relevance and stability, are then computed on the extracted sets of shared concepts. On that basis, we calculate their Pearson correlation coefficients using their recorded score lists.

Figure 2 displays the Pearson correlation scatter plot of reference vs test on the shared concepts. The notation used to label each plot is as follows: name of the index - [name of the dataset] - (Pearson correlation coefficient value calculated as in Eq. 6). Overall, Conceptual Relevance is the more accurate of the two compared indices, achieving the best Pearson correlation coefficients on the six datasets. Stability comes close behind Conceptual Relevance on the Dirichlet, CoinToss and Diagnosis datasets, but falls considerably further behind otherwise: the margins in favour of CR are largest on the LinkedIn, PediaLanguages and Dolphins datasets, as shown by the coefficients reported in Figure 2.

Figure 2: The Pearson correlation scatter plot of reference vs test of the two relevancy measures: (left column) Conceptual relevance (arithmetic), and (right column) Stability on the shared concepts of the six tested datasets. The Pearson correlation coefficient appears between parentheses.

Experiment II. This experiment is performed to answer Q2. We are interested here in assessing the performance of the indices. That is, we rerun Experiment I while reporting their computational times as in Eq. 7. Figure 3 shows the average elapsed time of the two relevance indices, namely CR and stability, on the shared concepts of the six underlying datasets. CR dominates stability on all datasets tested. It performs four times faster than stability on the LinkedIn dataset, three times faster on the PediaLanguages dataset, and at least twice as fast as stability on the Dirichlet, CoinToss, Diagnosis and Dolphins datasets.

Figure 3: The average elapsed time of the two relevancy measures: Conceptual relevance (arithmetic), and stability on the shared concepts of the six underlying datasets.

4.3 Discussion

In terms of accuracy, the results of Experiment I in Subsection 4.2 suggest that the Conceptual Relevance index outperforms the stability index. It improves the assessment of concept quality in two ways. First, rather than relying on generators as stability does, which may result in redundant patterns, Conceptual Relevance quantifies both the actionable minimal generators and the relevant attributes in the concept intent, with the ability to encapsulate them in a single measurement and in a variety of ways using different activation functions. Second, and unlike the stability index, CR produces robust relevance scores that effectively discriminate between small and neighbour concepts (i.e., those with an immediate link). This is because neighbour concepts cannot have identical sets of minimal generators, and hence may have distinct CR values.

The results of Experiment II in Subsection 4.2 on the performance of the two indices show that the arithmetic CR is considerably faster than stability. This is because it addresses the main limitation of stability, namely the threat of exponential time and space complexity when validating the probability of satisfying the derivation condition of the concept intent. CR avoids this by counting the number of relevant attributes in the intent instead of subsets in the intent power set, and by leveraging faces w.r.t. the upper covers to efficiently identify the set of minimal generators. From a computational perspective, CR requires polynomial time and space complexity that does not depend on the size of the intent power set.

5 Conclusion

Our work here targeted the vital challenge of extracting interesting concepts from a concept lattice using a new relevance score. Since the comparative analysis of the local conceptual structures of the lattice often helps discover new knowledge, we believe that there is a clear gap in the existing FCA literature on how to identify important substructures in a concept intent, such as its relevant attributes and generators. On that basis, we proposed the Conceptual Relevance (CR) index, a scalable interestingness index to assess the quality of a concept. The novelty of the CR index is twofold: (i) it contrasts the attributes of a concept intent to mine the relevant ones that maintain its conceptual structure, and (ii) it leverages the strength of minimal generators to quantify the most relevant local substructures of the intent, taking advantage of the fact that minimal generators frequently lead to more relevant (less redundant) patterns than non-minimal ones. CR requires only polynomial time and space complexity in the number of the concept's upper covers and minimal generators. In addition, its formula is flexible and can be easily adjusted with several variants of its activation function. The empirical study on several synthetic and real-life concept lattices (see Section 4) shows that CR can assess the quality of concepts in a more accurate and efficient manner than state-of-the-art relevance indices like stability. We are presently conducting additional empirical tests on larger datasets to confirm the findings.

References