The problem of computing all functional dependencies (FDs) that hold on a given relational table has received much attention from the academia and the industry due to numerous application demands. FDs capture deterministic relations between attributes of a database. Typically, an FD with column sets and in a given table expresses that the combination of values in columns uniquely determines the value of every column in . This can be of critical use for the database design process, table decomposition, database normalization, and for many data management applications, such as data cleaning (Thirumuruganathan17), data profiling (ChuIP13), and query optimization (IlyasMHBA04; Paulley). Although FDs are essential for many aspects of database management, the problem of discovering FDs in an entire database with multiple joined tables has been considered so far only by computing FDs from single tables in isolation without much investigation on the effect of the join operation on FD discovery. Trivially, one can compute the join results between multiple tables beforehand and then run an existing method to discover FDs. However, such computation could be significantly reduced by inferring FDs from each separate table and computing the join result only when and with what is needed. In this paper, we address the problem of frugal computation of join FDs and rename it as the Join FD discovery problem that aims at finding FDs that hold on the join result of multiple relational tables without computing the full join operations.
The Join FD discovery problem is interesting for many reasons: (1) Join FDs can be used to assess whether the FDs discovered from a single table are meaningful for the entire database when they still hold in the join results of multiple tables. Inversely, join FDs can be used to prune the set of FDs discovered from a single table when they are not persistent across the joined tables. In certain cases, they may hold incidentally on a single table due to some data errors or table incompleteness; (2) New FDs can be discovered through join operations and they can be useful for database administration, query optimization (Paulley), and other application purposes. Therefore, it is beneficial to discover them efficiently; and (3) Most join FDs can be inferred from the single tables participating to the join. This can save computation time since there is no need to compute the join result entirely before FD discovery. To address the Join FD discovery problem, we combine three strategies in one approach: logical inference, selective mining (with vertical data partitions), and selective sampling (with horizontal partitions) for FD discovery and we propose JEDI, an efficient solution for automatically discovering multi-table join FDs with frugal computation, rather than the trivial and costly approach of computing the result of a join operation tables before FD discovery.
Challenges. The first challenge is the exponential complexity of join FD discovery. The straightforward approach is very costly for two reasons: (1) the numbers of attributes of the tables participating in the join are added. Many of the existing methods exhibit poor scalability as the number of attributes increases (despite the efficiency of their pruning methods to search over the lattice of attribute combinations). This problem is already well-known for FD discovery from a single table (Kruse18) but it is amplified in the join FD discovery problem; and (2) the execution times of the join and the FD discovery are also cumulated.
As a second challenge, real-world data sets are plagued with missing or erroneous values that may lead to spurious FDs or low recall in terms of genuine FDs (Berti-EquilleHN18). More particularly, the problem of fully missing records has not been much investigated although it may have a significant impact on the validity of the FD discovery results. We can also observe disparate value cardinalities in the join attributes of the participating tables. The assumption of Preservation of Join Value Sets (booksGarcia) is often violated due to many “dangling tuples”, that is, for , tuples of that join with no tuple of (and inversely). Moreover, some join attribute values (e.g., keys) may be repeated and joined multiple times. Because of these disparities and partial coverage, the union of the FD sets obtained from single tables is not equivalent to the FD set obtained from the join result.
Finally, pruning the join FD search space efficiently is challenging. Existing methods identify minimal FDs from a single table. However, minimality does not guarantee that the set of discovered FDs will be parsimonious (Kruse18; Zhang2020) and minimal FDs from single tables are not necessarily minimal multi-table FDs. In this paper, we propose the first approach that discover efficiently join FDs from multiple tables of a database.
The main contributions of our paper are the following:
[noitemsep, nolistsep, leftmargin=*,topsep=0pt]
We demonstrate several useful properties for inferring exact FDs that can be obtained from various join operators between two tables (i.e., inner join, left and right semi-joins, left and right outer joins) without computing the full join result beforehand;
We propose four algorithms that leverage these properties, and efficiently compute the exact FDs from two or more tables without computing full joins;
We propose JEDI, a system implementing our algorithms. JEDI is available at https://github.com/xxxx with code, scripts, and data sets for the reproducibility of our experiments;
We provide the results of an extensive experimental evaluation: we compare JEDI against three state-of-the-art FD discovery methods over a diverse array of real-world and synthetic data sets with varying numbers of tuples, attributes, domain sizes, types and coverage of the join operations. We find that JEDI outperforms the competing methods by one order of magnitude in terms of execution time for discovering exact FDs while preserving the smallest memory consumption on average.
Outline. Section 2 presents an illustrative example. Section presents the necessary background and notations. In Section 3, we formalize the Join FD discovery problem and provide an overview of JEDI. In Section 4, we present our main contributions as the algorithms at the core of JEDI. We describe our performance experiments evaluating the efficiency and precision of JEDI in Section 5. Finally, we discuss related work in Section 6 and conclude in Section 7.
2. An Illustrative Example
To motivate our approach, we consider the following example illustrated in Figure 1. Samples of two tables are extracted from the clinical database MIMIC-III111https://physionet.org/content/mimiciii/1.4/ (mimiciii): PATIENT table contains information of 5 patients: their identifier (subject_id), gender, date of birth (dob), date of death (dod), and expire_flag. ADMISSION table contains some administrative and clinical information about each patient such as the hospital admission time (admittime), the admission location (admission_location), the insurance, the diagnosis, and h_expire_flag indicating whether the patient died at the hospital. The table contains 10 admission records. We can opt for the NULL semantics where missing values are all equal and extract exact FDs from each table. We obtain 9 exact FDs for PATIENT and 14 exact FDs for ADMISSION. The figure also shows relevant approximate FDs (with degree ) from both tables. In addition, the figure presents the join result of the two tables over the join attribute subject_id. In our example, 42 exact minimal and canonical FDs can be discovered from the join result. Interestingly, we can observe that the 9 exact FDs from the PATIENT table in the left side of the join are preserved (in green), as well as 9 (out of the 14) exact FDs from the ADMISSION table (in pink). Additionally, 10 FDs discovered from the join result can be obtained by inference between the sets of exact FDs discovered respectively from each single table (in blue). For example, is obtained from in ADMISSION and in PATIENT. In the PATIENT table, the subject_id values are unique, whereas in the ADMISSION table, we can observe two repeated records for patients #252 and #249. Moreover, the patient #257 is absent in the ADMISSION table, whereas s/he is present in the table PATIENT, similarly patient #247 is absent in the PATIENT table but present ADMISSION table. Due to these cardinality disparities in the domain of the join attribute subject_id, some approximate FDs discovered from the single tables can become exact FDs in the join result (we will use the term “upstaged” AFDs in the rest of the paper) as it is the case for 4 join FDs in the join result (in grey). The approximation degree means that tuples violate the exact FD. In the case of the AFD in PATIENT table, removing one tuple, either patient #257 or patient #250 will make the AFD become exact. When the two tables are joined, patient #257 is removed due to the absence of the corresponding subject_id value in the ADMISSION table. The same phenomenon occurs for the 3 AFDs with approximation degree 1 discovered from the ADMISSION table, namely: , , and . We note that the 5 remaining exact FDs from the ADMISSION table are reduced to minimal exact FDs in the set of FDs discovered from the join result. Finally, 10 exact FDs (in orange) that hold over the entire join result have to be discovered from the join result. However, if we join the two tables partially, only with the following combinations of tuples: [(#249,#252) or (#249,#251)] and [(#250,#251) or (#250, #252)], we can obtain the remaining 10 join FDs without the full join computation. Note that finding semantically correct FDs is orthogonal to our work and still an open problem for FD discovery from single tables: e.g., discovered from PATIENT and from ADMISSION, both also present in the join result, are not semantically correct. All FDs (from single or multiple tables) must be validated by domain experts before they can be used by downstream applications. Similarly, AFDs can be either semantically incorrect (e.g., ) or correct (e.g., ) and some can only surface as exact FDs in a join result due to errors in the single tables. Moreover, we would like to point out that these observations can be make regardless of the NULL semantics. A better understanding of the mechanisms underlying the appearance of FDs in multi-table settings is hence needed. Based on these observations, the questions that motivated our work were the following: Instead of fully computing the join result before FD discovery, can we infer most of the FDs and accelerate the overall computation? Do we need to compute the full join anyway? How can we select the necessary tuples and attributes in a principled way?
The rest of the paper will attempt to answer these questions and propose efficient solutions as parts of our JEDI framework for discovering join FDs from multiple relational tables.
Next, we recall the necessary definitions of FDs and join operators with their application to our problem.
Definition 0 (Functional dependency satisfaction).
Let be an instance over a relation , and , be two sets of attributes from . satisfies a functional dependency , denoted by if and only if:
Definition 0 (Logical implication between FDs).
Let be a relation, and , , be three sets of attributes from . Then a functional dependency logically implies a functional dependency if, and only if, for every instance over :
Definition 0 (Approximate functional dependency).
The approximate functional dependency (AFD), also called partial FD, denoted is a functional dependency between and that holds if and only if the minimum fraction of tuples at most equal to is removed from the relation such that can hold.
The minimal fraction of tuples is computed as
with , the set of equivalence classes defining a partition of under , . The equivalence class of a tuple with respect to a given set is
Next, we provide the definition of the natural join. Let us consider and , two relations with at least one common attribute; and , two instances over the relations and , respectively; , a function taking an relational instance as input and returning its set of attributes. The set of common attributes between and is denoted .
Definition 0 (Natural join).
The natural join between and , denoted by , is the instance such that
there exists two tuples and such that
We recall the definitions of left and right semi-joins.
Definition 0 (Semi-join).
The left and right semi-joins between and , denoted by and respectively, are the instances such that and
In the case of relational instances without common attributes, the equi-join operator can be used. The equi-join operator is defined as follows.
Definition 0 (Equi-join).
Let and be two attributes. The equi-join between and on , denoted by , is the instance such that: , there exists two tuples and such that:
Finally, we recall the definitions of left and right outer joins as follows.
Definition 0 (Left outer join).
Let represent the null value. The left outer join between and , denoted by , is the instance such that
Definition 0 (Right outer join).
The right outer join between and , denoted by , is the instance such that:
From the previous definitions, we can define the notion of full outer join as follows.
Definition 0 (Full outer join).
The right outer join between and , denoted by , is the instance such that:
4. Efficient Discovery of Join FDs
In this section, we describe the problem we address and explain how we can infer and compute FDs that remain valid through a join operation with frugal computation. First, we focus on the FDs discovered from single relations separately, then we consider the FDs between attributes coming from the two (or more) joined relations and propose efficient methods to discover the exact FDs that cannot be inferred.
Problem statement. Let and be two instances of relations and , respectively; and the respective FD sets of the relational instances such that and ; , a join operator; and the result of joining and over the sets of attributes and with the operator . The resolution of the Join FD Discovery problem aims at producing a set of functional dependencies, denoted such that:
In the following theorem, we show that if the join operation is performed between instances then every FD over the initial instances stands over the joined instance, i.e., that is empty.
Theorem 1 ().
Let and be two instances over relations and , respectively; and be the two sets of all FDs such that and , respectively. Then, we have:
Note that this theorem also holds in the presence of null values. Given the semantics of null values, every null value can be considered as a particular constant value (i.e., when null values are considered as identical, all null values can be handled as one single constant value). In the rest of paper, we apply our algorithms regardless of the null semantics.
Our Solution. In order to compute the set of FDs over , we propose the workflow illustrated in Figure 2. It consists in three steps corresponding to: (1) Discovery of approximate single-table FDs that are upstaged and become exact FDs via the join operation (JEDI step 1); (2) Logical inference and minimality reduction of join FDs (JEDI step 2); and (3) Computation of the remaining join FDs from partial join over selective mining or sampling (JEDI step 3). The steps are detailed in the next sections.
4.1. From AFDs to Exact FDs Through Join
New FDs may appear mechanically due to the join operation when tuples from one table cannot be joined with their counterpart in the other table, i.e., when some join attribute values are missing in one of the tables. In certain cases, FDs that were approximate in a single table become exact in the join result. This mechanism is expressed more formally in the following theorem:
Theorem 2 (Join FDs from Upstaged AFDs).
Let and be two instances over relations and , respectively, and and be the two sets of all FDs such that and , respectively. Then the sets of upstaged FDs denoted and are the sets:
Example 0 ().
To illustrate the case of upstaged AFDs using the example of Figure 1, let us consider the NULL semantics where all null values are identical. The approximate FD in table PATIENT has an approximation degree of 1 as only one patient (#257) violates the FD. However, in the join result of , the violating tuple #257 has no counterpart in the ADMISSION table and it disappears from the join result. Consequently, the FD becomes exact in the join result.
To compute join FDs from upstaged AFDs, we propose Algorithm 1. Lines 2–10 handle the inputs of the user if s/he can provide FDs and/or AFDs for each table participating in the join operation. If not, exact FDs are computed from each single table. For each side of the join, if exact FDs have been provided, the subroutine upstagedFDs is executed (lines# 8 and 12) and computes partially the join only with the join attributes from the left side table (line# 1) to check the assumption of the join value set preservation (booksGarcia). If the assumption is violated (i.e., if some tuples have been deleted through the join operation (line# 1)), some upstaged join FDs are produced. The subroutine discovers the FDs in the input instance by taking into account the previously discovered FDs and improves the pruning (line# 1 of upstagedFDs). On the other hand, if AFDs are provided as inputs of Algorithm 1 (lines # 5–6 and 25), the subroutine upstagedAFDs will check for each AFD if the join of its set of violating tuples with the instance from the other side of the join leads to an empty instance (line# 1). In this case, the AFD becomes exact in the joined instance and thus is added to the output set of exact FDs. In this algorithm, the computation is performed over one table at a time, and not over the complete join result. Thus, the FD discovery focuses on FDs whose attributes belong to one joined table only. Next, we discover the FDs containing attributes from both instances by relying on the characteristics of the join and on the FDs discovered previously.
4.2. Exact FDs from Joined Tables
Other join FDs are FDs containing attributes from the result of a join operation between multiple tables. Compared to the previous join FDs from upstaged AFDs, they will include a mix of attributes coming from each table participating in the join. Their definition is formalized as follows.
Definition 0 ().
Let and be two instances over relations and , respectively. An FD is said to be multi-table and specific to the joined instance if: holds in ; and contains at least one attribute occurring in and one attribute occurring in .
Example 0 ().
In our example of Figure 1, FDs specific to the joined result are highlighted in orange. For example, gender, h_expire_flag insurance is specific to the join of Patient and Admission. It holds in PATIENT ADMISSION. Attributes gender and expire_flag come from PATIENT and attribute insurance comes from ADMISSION.
Next, we consider the case of FDs that have all attributes in their left-hand side coming from only one relational instance and we prove several interesting properties about these logically inferable FDs. Then, we examine the properties of multi-table join FDs with left-hand side attributes obtained from both initial instances.
4.2.1. Logically inferable Join FDs
We will now show that the join FDs with left-hand side (lhs) attributes coming from only one initial instance can be deduced from the sets of FDs over the initial instances. To prove the theorem, we first state the following lemma showing that an FD with lhs attributes coming from only one single instance cannot exist if their right-hand side (rhs) is not functionally defined by the set of join attributes.
Theorem 6 ().
Let and be two instances over relations and , respectively. Let be a join result with , . For all and :
Let being values over the attributes in , and being values over the attributes in . If then there exist two tuples:
in such that: and there exists a tuple in with , the values of the attributes in (otherwise, tuples and would have been filtered during the join operation). Thus, the join leads to the two tuples:
which violate the FD and ∎
Example 0 ().
To illustrate the property proved in Theorem 6, we observe that the diagnosis is not determined by the patient identifier in Figure 1, for example patient #249 has been admitted three times for a different pathology each time, i.e. : . From Theorem 6, we know that diagnosis in the join result cannot be determined by any set of attributes coming from PATIENT table. Such similar inferences may be trivial for the user, but they usually require the knowledge of the attribute semantics. If not encoded, they are difficult to capture by a system. However, the property proved in Lemma 6 can be used to drastically reduce the set of possible FDs that can appear after a join operation.
From our proof of Theorem 6, we can characterize a subset of the set of FDs with lhs attributes coming from only one initial instance and that hold in the joined instance.
Theorem 8 ().
Let and be two instances over relations and , respectively.
Let be a join result with , .
For all and ,
This is trivially proved by transitivity, with the use of Armstrong’s transitivity axiom. ∎
Example 0 ().
In result illustrated in Figure 1, we observe that the diagnosis determines the date of birth, i.e., The reason is that we have: admission_location, in ADMISSION and in PATIENT. Since these tables do not contain any null values for the lhs and rhs attributes, joining them with attribute subject_id leaves these FDs unchanged with no violation. By transitivity, we obtain:
To compute the set of FDs with lhs attributes coming from a single instance, as described in Theorem 8, we propose Algorithm 2. First, the subroutine infer extracts the FDs that can be retrieved by transitivity (lines# 2 and 2 in subroutine infer). Note that in the case of equijoins, equality of values might be enforced between sets of attributes with different names (i.e., and might be different), thus for the general case, the FD (line# 2) cannot be simplified into an FD . At the opposite, if we restrict our join operations to natural joins only, such a simplification can be made.
Then, for each FD returned by infer, the subroutine refine checks whether the FD is minimal or if a subset of its lhs leads to a minimal FD. To do so, subroutine refine uses an horizontal partition of the joined instances in which only the necessary attributes to perform the verification are considered (line# 2). These necessary attributes are the join attributes (to perform the join operation), and the lhs and rhs attributes of the refined FD (line# 2) as refine only considers candidates with subsets of as lhs and as rhs (lines# 2 and 2).
4.2.2. FDs with lhs attributes from multiple tables
Now, we characterize the set of FDs which hold on a join result and such that their lhs attributes come from both initial instances. We characterize two kinds of FDs with multi-table attributes in lhs: (1) FDs that can be deduced directly using a simple logical reasoning, and (2) FDs that need to be discovered and validated from the data. For example, gender, expire_flag insurance of our example has attributes from PATIENT in lhs and attributes from ADMISSION in rhs and it cannot be inferred logically. Other FDs with the same properties are illustrated in orange in Figure 1. In the following theorem, we show that if lhs attributes of an FD come from the instances participating in the join, then we cannot predict their validity without checking them directly with some representative (if not all) tuples of the join result:
Theorem 10 ().
Let and be two instances over relations and , respectively. Let be a join result with , . We cannot guarantee that all FDs over can be inferred from Armstrong’s axioms over the FDs over and taken separately.
In the two following instances and , we can see that only the FDs and hold.0 0 1 0 1 1 2 2 0 0 0 1 0 0 1 1 1 2 1 0 0 0 0 0 1 0 0 0 1 0 1 1 1 1 0 0 1 1 1 1 2 2 1 0
In the join result, the FD holds but it cannot be inferred using Armstrong’s axioms over the FDs discovered from each instance and . ∎
This theorem motivates the need for designing a new method for computing FDs from partial join results, as we cannot always infer all the FDs only using logical reasoning. However, we can rely on the following theorem to greatly reduce number of remaining FDs to check from the data:
Theorem 11 ().
Let and be two instances over relations and , respectively. Let be a join result with , . For all , and : If , Then .
In-line with Theorem 11, we propose Algorithm 3 for selective mining and use the FDs previously discovered with Algorithms 1 and 2 to compute the remaining join FDs. Intuitively, Theorem 11 shows that a given attribute can be a rhs of a remaining join FDs only if we have previously found an FD of the form with being the join attributes of the instance containing . Thus, it allows us to focus only on the plausible rhs (lines# 3 and 3 in subroutine discover) and explore their candidate lhs (lines# 3-3 and 3-3). In practice, there is no need to generate every candidate FDs initially. Instead, candidate FDs can be explored by generating a first level containing only the smallest candidates and by generating upper levels only when currently evaluated candidates are not valid. Moreover, we can avoid the computation of the full join by deleting a given lhs attribute if is not a possible rhs and, for every FD candidate such that , is logically implied by previously discovered FDs.
4.2.3. Sampling-based Discovery of Join FDs
As the computation of the join FDs by Algorithm 3 can be expensive both in time and space, we propose Algorithm 4 which relies on selective sampling. To this extent, Algorithm 4 starts with the micro_join subroutine that compute FDs from micro-joins only between the tuples from and that are selected by the selective_sampling subroutine (lines #6 and #19–25). The set of tuple ids to join is generated in the generate_ids_set subroutine (lines #23-24, #26-38). It builds a tree where the depth levels correspond to the order of the attributes with the fewest number of distinct values as the first level and the last level to consider (line #29); the nodes correspond to distinct values per attribute (i.e., level) (line #33). Then, if an attribute value corresponds to a unique tuple for a level less than , the tuple id is selected as the representative tuple of a branch to be considered for the micro-join; if multiple tuples share the same value, tuples will be selected. Then, for each sample, the instances are joined (line# 4) and the FDs over the resulting instance are computed and added to , the set of all computed FDs (lines# 4 and 4). Finally, subroutine computeLvLFDs (line# 4) extracts from the set the FDs that are valid in every sample. An FD can be considered as valid in a set of FD if either (1) there exists a logically equivalent FD in or (2) there exists an FD logically implying in . However, we cannot guarantee that an FD discovered from samples is exact in the joined instance without the full exploration. Indeed, for a given FD , in the worst case, the projection of over the attributes in needs to be explored to check the non-existence of a pair of tuples violating . We refer to (davies1994np) for a formal proof of the NP-completeness of finding minimal FDs.
4.3. The 4 C’s
We observed that the cardinalities and overlap of the join attribute values are rarely preserved through a join operation and this has a great impact on FD discovery depending on the join operator used. Various sets of join FDs can be computed and we summarized the different cases in Figure 3 that recaps when JEDI algorithms are executed. Our example corresponds to the last line and column of the table with cardinalities where the join attribute subject_id with value #257 in PATIENT has no counterpart in ADMISSION table; #247, 248, and #253 in ADMISSION have no counterpart in PATIENT table, and #252 and #249 in PATIENT are present multiple times in ADMISSION table. To quantify this phenomenon, we define the notion of join coverage and compute it as follows:
and denote the join attributes of and , respectively. is a considered instance and the considered join attribute. If , no tuple from can be joined with tuples in . For , some tuples in (or ) may be missing from the join result, as it is the case for patients #257 in PATIENT and #247, #248, and #253 in ADMISSION that do not have their counterparts in the other table in our example. For , there are as many tuples in both tables and as in the join result. For , there are more tuples in the join result than in tables or as some tuples may be repeated through the join: for example, one patient may have multiple admissions. In Figure 1, .
4.3.2. Completeness and correctness of Jedi with Selective Mining
Algorithm 1 either checks if the tuples violating the AFDs are excluded by the join operation or, if AFDs are not provided, it mines the exact FDs after the tuples are filtered by the join operation. Algorithm 2 retrieves the minimal FDs from the logically inferred FDs, thus no lhs subset of FDs remains unchecked. Theorem 11 indicates which part of the candidate FD lattice can be pruned. Then, Algorithm 3 explores the candidate FDs using a classic bottom-up approach, thus no minimal FD remains unchecked. Overall, our algorithms explore the lattice of candidate FDs until they find minimal FDs; they avoid only the parts of the lattice that do not contain valid candidate FDs, thus JEDI with selective mining retrieves the complete set of minimal candidate FDs.
In Algorithm 1, valid exact FDs are either discovered from the data or deduced (because the join operation filters the tuples violating some FDs that became consequently exact after the join). Both cases of FDs are guaranteed to hold on the joined instance and only minimal lhs are kept. Theorem 8 shows the correctness of the set of FDs inferred through logical inference. Then, the subroutine refine checks the correctness of its candidates FDs holding on the data. Therefore, the set of FDs discovered by Algorithm 2 is such that . In Algorithm 3, Theorem 11 enforces the retrieval of FDs based only on the attributes that can become rhs in the joined instance, then only plausible candidate FDs are explored. Therefore, every discovered FD holds in the joined instance. By construction, Algorithm 2 and 3 lead to the retrieval of FDs with minimal lhs only.
4.3.3. Completeness and Correctness of Jedi with sampling
JEDI with sampling uses Algorithm 1 and 2 with the same completeness and correctness guarantees exposed previously for JEDI with selective mining. Algorithm 4 uses Algorithm 3 over samples, the retrieved FDs are minimal exact FDs holding on the samples. If there exists a minimal FDs in a sample, an FD such that cannot become a minimal FD in the full instance. Therefore, the FDs retrieves in from the samples imply the set of FDs over the full joined instance. The correctness of Algorithm 4
cannot be guaranteed as there is a probability of not sampling a pair of counter-example tuples for a given AFD, and thus to confound the AFD with an exact FDs in the join results. Due to the NP-completeness of the minimal FD mining problem(davies1994np), every pair of tuples must be checked in order to find a possible counter-example, thus to guarantee that the discovered FDs from samples are exact FDs. We would like to note that developing a framework to guarantee the accuracy of the sampling-based FD discovery approach is a challenging problem and a focus of our future research.
Algorithm 1 subroutine updstagedFDs is based on a level-wise algorithm through the attributes lattice. Its complexity is exponential in the number of attributes of the considered table. It prunes candidates at each level when it is possible. In terms of memory, only two levels are required. The memory size is bounded by where is the number of attributes. The complexity of the second subroutine upstagedAFDs of Algorithm 1 is where is the maximal number of tuples and the maximal number of FDs either from the left or the right table. The join computation is linear because we use a merge join algorithm over indexed data. It should be noted that subroutine upstagedAFDs does not compute the full join but instead a subset of the full join (Algorithm 2 line# 2). Algorithm 2 infers and refines FDs coming from the previous step with complexity , where is the number of validated FDs and the maximal number of tuples in the left or right instance. The complexity of Algorithm 3 is where is the maximal number of validated FDs in the left or right instance and the number of validated FDs in the join instance. Algorithm 4 operates on micro-joins computed by micro_join subroutine with complexity with the number of tuples in the micro-join , and the number of tuples in and participating to the micro-join, and , the set of consistent FDs discovered from multiple micro-joins FD sets. The complexity of selective sampling is , with , the maximum number of attributes in or .
Evaluation Goal. We compare the two variants of our method which computes only necessary micro-joins on-the-fly against the straightforward approach that consists of computing first the full join of two or more tables and then mine all the FDs from the joined result. The two main points we seek to validate are: (1) Does our approach enable us to discover join FDs accurately (i.e,. high precision) in an efficient manner and faster than the straightforward approach? (2) What is the impact of different data and joins characteristics on JEDI performance?
Setup. We perform all experiments on a laptop Dell XPS machine with an Intel Core i7-7500U quad-core, 2.8 GHz, 16 GB RAM, powered by Windows 10 64-bit. Our algorithm implementation in Java use only one thread. Our code, scripts, and data sets are available at https://github.com/xxxx.
Methods. We compare JEDI algorithms for selective mining (JEDI _SM) and sampling-based join FD discovery (JEDI _SB) against with four state-of-the-art FD discovery methods: (1) TANE (HKPT98; HKPT99), (2) Fast_FDs (WGR01), and (3) FUN (NoCi01icdt), and (4) HyFD (HyFD), using Java implementation of Metanome (Papenbrock:2015). Data sets are stored in a Postgres DBMS. Join attributes are indexed with B-Tree and hash indexes.
|Data set||Table||(Att# ; Tup#)||FD#|
|Patients||(7 ; 46.52k)||11|
|MIMIC-III||Admissions||(18 ; 58.976k)||285|
|Diagnoses_icd||(4 ; 651.047k)||2|
|D_icd_Diagnoses||(3 ; 14.710k)||2|
|PTE||pte_bond||(4 ; 9.317k)||3|
|pte_atm||(5 ; 9.189k)||5|
|pte_drug||(1 ; 340)||0|
|PTC||connected||(3 ; 24.758k)||3|
|bond||(3 ; 12.379k)||2|
|molecule||(2 ; 343)||1|
|Supplier||(7 ; 10k)||34|
|TPC-H||Customer||(8 ; 150k)||51|
|Nation||(4 ; 23)||9|
|Region||(3 ; 5)||6|
|Part||(7 ; 200k)||99|
|Partsupp||(5 ; 800k)||11|
Evaluation Metrics. To measure the performances of join FD discovery, we use precision defined as the fraction of correctly discovered FDs on-the-fly by JEDI from partial joins that are common with the true FDs discovered from the full join result by the straightforward method over the total number of discovered true FDs. For each experiment, we consider ten runs per setting and report the average performance of each method. We examine precision, runtime, and memory consumption for different join operators with varying: (1) the size of the input tables and joins, (2) the cardinalities of the join attribute values and join coverage, and (3) the sample size.
Data. We use three real-world datasets and one synthetic data set in our experiments: (1) the clinical database MIMIC-3222https://physionet.org/content/mimiciii/1.4/ (mimiciii); (2) PTE333https://relational.fit.cvut.cz/dataset/PTE, a database for predictive toxicology evaluation, used to predict whether the compound is carcinogenic, and (3) PTC444https://relational.fit.cvut.cz/dataset/PTC, the data set from the Predictive Toxicology Challenge that consists of more than three hundreds of organic molecules marked according to their carcinogenicity on male and female mice and rats; and (4) the TPC-H Benchmark555http://www.tpc.org/tpch/ with scale-factor 1. Data sets characteristics are given in Table 1 and the characteristics of the join operations in Figure 4.
5.1. Efficiency and Effectiveness Evaluation
In a first set of experiments, we evaluate the runtime and memory consumption of JEDI algorithms compared to the state-of-the-art FD discovery methods that follow the straightforward approach over a total of 55 join operations on the real-world and synthetic data sets with various coverage values illustrated in Figure 4. Variants of the same join operation with different orderings of the participating tables have the same coverage. We want to evaluate the gain of the key components of JEDI for join FD discovery and therefore focus on JEDI Algorithm 1 (join FD discovery from upstaged AFDs), Algorithm 2 (logical inference), and Algorithm 3 (selective mining of the remaining FDs from the join result) for JEDI _SM and algorihms 1,2, and 4 (selective sampling) for JEDI _SB.
Figure 5 shows the average runtime (in seconds) for the two variants of JEDI . We log the average total runtime of join FD discovery (including data loading) for all methods over 10 runs. For the competing methods, we added the average execution time of each join operation over the indexed data. For PTE and PTC data sets with low coverage, JEDI _SM is much faster than the other methods operating on the pre-computed full join results with one order of magnitude on average. However, for MIMIC3 and joins with higher coverage, JEDI _SM is not as competitive as HyFD applied to the full join result. This result actually motivated the design of a sampling-based method that can be more efficient when join coverage is high () in large table sizes (i.e., when many tuples from one table are repeated in the join operation). JEDI _SB shows similar or better performances for low coverage joins PTE and PTC with a sampling size around 28% in average to reach precision of 1. However, when both join coverage and table size increase, JEDI _SB needs 63% of the sample size on average to reach precision 1 as shown for MIMIC3 to compete with HyFD that operates over the precomputed full join. Also, the figure shows that both variants of JEDI are sensitive to the table ordering in the join operation, even more when the join size increases.
5.1.2. Memory Consumption
As shown in Figure 6, average maximal memory consumption of JEDI _SM is the lowest for PTE and PTC data sets with low coverage. Memory consumption of JEDI _SB (when precision is 1) is between 23.4% and 34.7% lower than JEDI _SM on average for PTE and PTC and, which makes it perform better than HyFD in certain cases. However, for MIMIC3 as the number of attributes and the coverage increase, HyFD generally outperforms the other methods with JEDI _SB has the second position.
5.2. Evaluation of Jedi with Selective Mining
In Figure 7, we report the average runtime breakdown of each algorithm 1, 2, and 3 in JEDI
_SM (in the horizontal histograms) and their respective percentages of discovered FDs (in the adjacent pie charts with the same color coding). Error bars represent the standard deviation of the average total runtime of the algorithms over ten runs. The data distributions, cardinalities of the join attribute values, and the join coverage have very different characteristics across the data sets to illustrate the behavior of our algorithms on different join orderings and different coverage values for PTC and MIMIC3. We observe the same behavior in PTE which is not shown due to space limitation. Algorithm 1 manipulates different sets of upstaged AFDs per input table, as we can see with the 3-size joins in MIMIC3. The logical inference made by Algorithm 2 from Algorithm 1 output will lead to different resulting sets of inferred FDs. For MIMIC3,JEDI 1 alone retrieves 52% of the true FDs on average. Algorithm 2 helps only for patients [diagnoseicd dicddiagnoses], the only case when the logical inference can be triggered from the FDs discovered by Algorithm 1.
The figure shows that different orderings in the tables participating in the join operations change not only the runtime but also the contribution of each algorithm in retrieving the true FDs. The reason is that various orderings of the input tables lead to different sets of potential upstaged AFDs and trigger different logical inferences. Consequently, the remaining set of join FDs to discover (by Algorithm 3 or 4) will be different. Future work will be to find the optimal ordering of the input tables and predict the cases where logical inference could prevail over selective mining to discover FDs more efficiently.
5.3. Coverage Analysis
In this experiment reported in Figure 8, we use subsets of MIMIC3 database with 100, 500, and 1000 tuples of PATIENT joined with 100 to 1000 tuples of ADMISSION with (a) inner, (b) outer left, and (c) outer right join operators for PATIENTADMISSION. We evaluate the precision of the three algorithms of JEDI _SM (Y-axis) as the coverage rate (X-axis) increases across 30 joins for each join operator (90 joins). Coverage increases with the size of the rhs table participating in the join operation which indicates cardinalities as more and more tuples are repeated through the join operation. Precision of Algorithm 1 decreases as the coverage and sizes of both lhs and rhs tables of the join increase for the three types of join showing the difficulty of leveraging AFDs when the number of repetitions increases. JEDI 1 precision is relatively low for inner and outer right joins with best values (around .45) when coverage is below .5. Precision of Algorithm 2 is close to 0 because logical inference cannot be leveraged from Algorithm 1’s outputs, when the coverage and sizes of the joined table increase. We observed that Algorithms 1 and 2 tend to be more effective when the coverage is relatively low as shown in Figure 8 for the three types of join operations.
5.4. Evaluation of Jedi with Sampling
In this last set of experiments, we evaluate JEDI 1,2,4 and show that it outperforms JEDI _SM and other competing methods in terms of runtime for precision equals to 1, in particular for the worst case scenarios when coverage is high. First, we investigate the effect of the sample size on the precision of JEDI Algorithms 1,2, and 4.
Figure 9 shows precision of selective sampling with JEDI 1,2,4 compared to JEDI 1,2 followed by random sampling (averaged over 10 runs) for the inner join between 1,000 tuples of PATIENT and 1,000 tuples of ADMISSION with high coverage (¿1) (cf. Figure 8(a)). For this experiment, we used as input parameter for selective sampling to pick only one tuple as the representative tuple of a branch and we vary from 0 to to exclude the attributes with the highest number of distinct values. JEDI
1,2,4 reaches precision (and recall) of 1 when the size of the selected sample is 62.3% of the table input size. Interestingly, even for a sample of size of 10% with tuples selected by Algorithm 4, we can see an improvement of +0.153 overJEDI 1 and 2 precision.
|Average Runtime (seconds)|
|Database||Join||Coverage||Join Size Ratio||JEDI_SM||JEDI_SB||JOIN+HyFD||JOIN+FUN||JOIN+TANE||JOIN+FastFDs|
|MIMIC3||diagnoseicd patients||7.50||429,530/651,047 (66%)||79.862.33||12.73 0.85||10.230.53||61.233.35||80.144.07||¿1800|
|patients diagnoseicd||422,120/651,047 (65%)||62.353.16||10.12 0.34||62.783.26|
|dicddiagnoses diagnosesicd||22.84||430,355/634,709 (68%)||10.720.42||7.30 0.94||10.320.51||13.210.15||17.982.34|
|diagnosesicd dicddiagnoses||400,314/634,709 (63%)||19.250.66||10,17 1.58||18.41 0.43|
|[diagnoseicd patients] dicddiagnoses||410,452/634,709 (65%)||472.5731.64||15.64 1.73||16.321.33||140.0712.15||473.8527.23|
|patients [diagnoseicd dicddiagnoses]||424,273/634,709 (67%)||220.6712.88||17.55 1.18||221.4410.98|
|TPCH||part partsupp||2.5||525,752/800K (22%)||721.0847.41||3.37 0.52||3.520.54||149.0715.25||¿2100|
|nation region||3||17/25 (65%)||14.360.41||3.07 0.17||3.320.51||3.030.01||3.070.12||3.390.17|
|supplier nation||200.5||6,524/10K (66%)||3.060.01||3.02 0.28||3.620.57||3.060.21||3.330.27||3.430.27|
|customer nation||3000.5||101,076/150K (45%)||10.790.18||3.49 0.13||3.380.52||3.500.65||3.440.33||3.640.23|
|customer nation region||102,162/150K (77%)||13.59 0.69||3.550.25||45.521.44||3.560.85||3.720.24||3.750.14|
This shows the main advantage of our approach, reducing drastically the execution time of the join FD discovery. Table 2 completes the comparative study of the two variants of JEDI against the competing methods for the worst case scenarios when join coverage is high for large table size. Recall that JEDI _SM has shown the best performances over the state-of-the-art methods for PTE and PTC joins with relatively low join coverage and small table sizes (from 2 to 5 attributes and 300 to 24k tuples). So, we focus on MIMIC3 and TPC-H joins and report for each join: its coverage, the Join size ratio (expressed as the sample size over the full join size, and as a percentage in parentheses, and the average runtime of JEDI _SM, JEDI _SB at precision 1, FUN, TANE, HyFD, and FastFDs.
For MIMIC3, we observe that JEDI _SB outperforms all the methods with at least one of the variants of the join ordering, otherwise HyFD applied to the full join result is slightly faster. This shows that the ordering of large tables is critical for size 3 and more joins but most importantly, high coverage impacts the sampling strategy. The reason is that many tuples from each input table are repeated through the join operation and the choice of more than one representative tuples jeopardizes the sampling result, in particular when it does not preserve the balance and distributions of the repeated tuples from the full join to the multiple micro-joins.
For TPC-H, we observe the same phenomenon for the 3-size join where the straightforward approach with HyFD is slighlty more efficient (or equal with TANE). We observe that the join size ratio to reach precision 1 is influenced by the coverage and also the number of distinct values for each attributes of the joined tables. For example, for the join part, the average number of distinct values per attribute is 52,984 for part and 223,971 for partsub which facilitates the sampling strategy in selecting violating tuples and reducing the space of candidates FDs. This can explain the relatively low ratio (22%) despite high join coverage.
These observations show that there is room for improvement in the two following directions: (1) to find the optimal ordering of the tables to maximize the performance of FD discovery (in particular in Algorithm 1) and and (2) to improve the sampling strategy and better tune the sample size to maximize the number of violations and reduce the space of candidate join FDs.
6. Related Work
In the last three decades, numerous approaches from the database and the data mining communities have been proposed to extract automatically valid exact and approximate FDs from single relational tables (KiMa95; caruccio2015relaxed). Liu et al. (Liu2012) have shown that the complexity of FD discovery is in where is the number of attributes and
the number of records considered. To find FDs efficiently, existing approaches can be classified into three categories: (1) Tuple-oriented methods (e.g., FastFDs(WGR01), DepMiner (lopes2000efficient)) that exploit the notion of tuples agreeing on the same values to determine the combinations of attributes of an FD; (2) Attribute-oriented methods (e.g., Tane (HKPT98; HKPT99), Fun (NoCi01icdt; NoCi01is), FDMine (YH08)) that use pruning techniques and reduce the search space to the necessary set of attributes of the relation to discover exact and approximate FDs. HyFD (PapenbrockN16) exploits simultaneously the tuple- and attribute-oriented approaches to outperform the previous approaches; and more recently (3) Structure learning methods relying on sparse regression (Zhang2020), or on entropy-based measures (kenig2019mining) to score candidate constraints (not limited to FDs alone). More particularly, FDX (Zhang2020) performs structure learning over a sample constructed by taking the value differences over sampled pairs of tuples from the raw data. In addition, incremental approaches (SchirmerP0NHMN19; CaruccioCDP19) have been developed to tackle data volume and velocity with updating all valid FDs when new tuples are inserted outperforming classical approaches that recalculate all FDs after each data update. Extensive evaluations of FD discovery algorithms can be found in (DurschSWFFSBHJP19; PapenbrockEMNRZ15). To the best of our knowledge, previous work on FD discovery did not attempt to address the problem join FD discovery in an efficient manner. Our approach combining logical inference, partitioning, and selective sampling is the first solution in this direction.
We introduced JEDI, a framework to solve the problem of FD discovery from multiple joined tables without the full computation of the join result beforehand. The salient features of our work are the following: (1) We leverage single-table approximate FDs that become exact join FDs due to the join operation when the join value sets are not preserved; (2) We leverage logical inference to discover join FDs from the sets of single-table FDs without computing the full join result; and (3) We find new multi-table join FDs from partial join and micro-joins using respectively selective mining on the necessary attributes and selective sampling on the necessary tuples. We empirically show that JEDI outperforms, both in terms of runtime and memory consumption, the state-of-the-art FD discovery methods applied to the join results that have to be computed beforehand. We hope that this will open a new line of research where the community will examine FDs across multiple tables and an entire database. Since join FDs are resilient to the join operators, they can be leveraged for pruning and assessing the validity and prevalence of a large set of discovered single-table FDs.