Learning Functional Dependencies with Sparse Regression

05/04/2019
by   Zhihan Guo, et al.

We study the problem of discovering functional dependencies (FDs) from a noisy dataset. We focus on FDs that correspond to statistical dependencies in a dataset and draw connections between FD discovery and structure learning in probabilistic graphical models. We show that discovering FDs from a noisy dataset is equivalent to learning the structure of a graphical model over binary random variables, where each random variable corresponds to a functional of the dataset attributes. We build upon this observation to introduce AutoFD, a conceptually simple framework in which learning functional dependencies corresponds to solving a sparse regression problem. We show that our methods can recover true functional dependencies across a diverse array of real-world and synthetic datasets, even in the presence of noisy or missing data. We find that AutoFD scales to large data instances with millions of tuples and hundreds of attributes, while yielding an average F1 improvement of 2x over state-of-the-art FD discovery methods.


1. Introduction

Functional dependencies (FDs) are an integral part of data management systems. They are used in database normalization to reduce data redundancy and improve data integrity (Garcia-Molina:1999:DSI:553977). FDs are also critical in data preparation tasks, such as data profiling and data cleaning. For instance, FDs can help guide feature engineering in machine learning pipelines (PhysRevLett.114.105503) or can serve as a means to identify and repair erroneous values in a given dataset (hc; holistic). Unfortunately, FDs are typically unknown, and significant effort and domain expertise are required to identify them.

Various works have focused on automating FD discovery, both in the database community (kruse2018efficient; tane; Papenbrock:2015:FDD:2794367.2794377) and the data mining community (mandros2017discovering; reimherr). The works in the database community study how to infer FDs that a dataset instance does not violate. These approaches are well-suited for database normalization purposes and for applications where strong closed-world assumptions on the given dataset hold. In contrast, the data mining community views FDs as statistical dependencies manifested in a dataset and has focused on information theoretic measures to estimate FDs. These approaches are more suited for data profiling and data cleaning applications. In this paper, we focus on FDs that correspond to statistical dependencies in the generating distribution of a given dataset.

Challenges

Inferring FDs from data observations poses many challenges. First, to discover FDs one needs to identify an appropriate order of the attributes that captures the directionality of functional dependencies in a dataset. This leads to a computational complexity that scales exponentially with the number of attributes in a dataset. To address the exponential complexity of FD discovery, existing methods rely on pruning techniques to search over the lattice of attribute combinations (kruse2018efficient; mandros2017discovering). Despite the use of pruning, many of the existing methods exhibit poor scalability as the number of columns increases (kruse2018efficient; mandros2017discovering).

Second, FDs capture deterministic relations between attributes. However, in real-world datasets, missing or erroneous values introduce uncertainty into these relations. This poses a challenge as noise can lead to the discovery of spurious FDs or to low recall with respect to the true FDs in a dataset. To deal with missing values and erroneous data, existing FD discovery methods focus on identifying approximate FDs, i.e., dependencies that hold with high probability in a given dataset. To identify approximate FDs, existing methods either limit their search to clean subsets of the data (papenbrock2015functional) or employ a combination of sampling methods with error modeling (papenbrock2016hybrid; kruse2018efficient). These methods are robust to noisy data. However, their performance, in terms of runtime and accuracy, is sensitive to factors such as sample sizes, prior assumptions on error rates, and the number of records available in the input dataset. This makes these methods cumbersome to tune and apply to heterogeneous datasets with varying numbers of attributes, records, and errors.

Finally, most dependency measures used in FD discovery, such as co-occurrence counts (kruse2018efficient) or criteria based on mutual information (Cavallo:1987:TPD:645914.671645), promote complex dependency structures (mandros2017discovering). The use of such measures leads to the discovery of spurious FDs in which the determinant set contains a large number of attributes. Such FDs are hard for humans to interpret and validate, especially when the goal is to use these FDs in downstream data preparation tasks. To avoid overfitting to complex FDs, existing methods rely on post-processing procedures to simplify the structure of discovered FDs or on ranking-based solutions. The most common approach is to identify minimal FDs (papenbrock2015functional). An FD X → Y is said to be minimal if no subset of X determines Y. In many cases, this criterion is also integrated with the search over the set of possible FDs for efficient pruning of the search space (papenbrock2016hybrid; kruse2018efficient). Minimality is shown to be effective in practice; however, it does not guarantee that the overall set of discovered FDs will be parsimonious (kruse2018efficient).

Our Contributions

We propose AutoFD, a framework that relies on structure learning (Koller:2009:PGM:1795555) to solve FD discovery. Specifically, we leverage the strong dependencies that FDs introduce among attributes, introduce a probabilistic graphical model to capture these dependencies, and show that discovering FDs is equivalent to learning the graph structure of this model.

A key result in our work is to model the distribution that FDs impose over pairs of records instead of the joint distribution over the attribute-values of the input dataset.

AutoFD’s model has one binary random variable for each attribute in the input dataset and expresses correlations amongst random variables via a graph that relates random variables in a linear way. We leverage linear dependencies to recover the directionality of FDs. Given a noisy dataset, AutoFD proceeds in two steps: First, it estimates the undirected form of the graph that corresponds to the FD model of the input dataset. This is done by estimating the inverse covariance matrix of the joint distribution of the random variables that correspond to our FD model. Second, our FD discovery method finds a factorization of the inverse covariance matrix that imposes a sparse linear structure to the FD model, and thus, allows us to obtain parsimonious FDs.

We present an extensive experimental evaluation of AutoFD. First, we compare our method against state-of-the-art methods from both the database and data mining literature over a diverse array of synthetic and real-world datasets with varying numbers of attributes, domain sizes, records, and amounts of errors. We find that AutoFD scales to large data instances with hundreds of attributes and yields an average improvement of more than 2x in discovering true FDs compared to competing methods.

We also examine the effectiveness of AutoFD on downstream data preparation tasks. Specifically, we apply our FD discovery method on the task of weakly supervised data repairing. Recent work (hc) showed that integrity constraints (including functional dependencies) can be used to obtain noisy labeled data which can in turn be used to obtain state-of-the-art machine learning-based data repairing systems. We show that dependencies discovered via our method lead to high-quality repairs that are comparable to manually specified dependencies. This demonstrates that our FD discovery method offers a viable solution to automating weakly supervised data preparation tasks.

Outline

In Section 2, we discuss necessary background. In Section 3, we formalize the problem of FD discovery and provide an overview of AutoFD. In Section 4, we introduce the probabilistic model at the core of AutoFD and the structure learning method we use to infer its graphical structure. Finally, in Section 5, we present an experimental evaluation of AutoFD, and conclude in Section 6.

2. Preliminaries

We review basic background material and introduce notation for the structure learning problem studied in this paper.

2.1. Functional Dependencies

We review the concept of functional dependencies and related probabilistic interpretations. We consider a dataset D that follows a relational schema R. An FD X → Y is a statement over a set of attributes X ⊆ R and an attribute Y ∈ R denoting that, for all tuples in D, the values of X uniquely determine the value of Y (Garcia-Molina:1999:DSI:553977; papenbrock2015functional). Formally, we consider t[A] to be the value of tuple t for attribute A; the FD X → Y holds iff for all pairs of tuples t1, t2 in D the following holds: if t1[X] = t2[X] then t1[Y] = t2[Y]. A functional dependency X → Y is minimal if no subset of X determines Y, and it is non-trivial if Y is not in X. Under this logic-based interpretation, to discover all FDs in a dataset, it suffices to discover all minimal, non-trivial FDs. This interpretation makes strong closed-world assumptions and aims to find all FDs that hold in D. It does not aim to find FDs that hold in the generating distribution of D.

To relax these closed-world assumptions, a probabilistic interpretation of FDs can be adopted. Let each attribute A in R have a domain dom(A), and let the domain of a set of attributes X be defined as the cartesian product of the domains of the attributes in X, denoted dom(X). Also, assume that every instance x of dom(R) is associated with a probability density P(x) such that these densities form a valid probability distribution P. Given the distribution P, we say that an FD X → Y, with X ⊆ R and Y ∈ R, holds if there is a function f: dom(X) → dom(Y) such that for all x in dom(X):

(1)  P(Y = f(x) | X = x) = 1

This probabilistic definition represents a hard constraint that is not robust to noisy data. To relax this, a series of works have adopted information theoretic measures for FDs (Cavallo:1987:TPD:645914.671645; mandros2017discovering) by considering the ratio of the mutual information between X and Y, I(X; Y) = H(Y) - H(Y | X) (where H(Y | X) is the conditional entropy of Y given X), and the entropy H(Y) of Y. To discover FDs one needs to identify sets of attributes X in R such that I(X; Y) / H(Y) is close to one. This requires estimating the entropy and conditional entropy from a given instance of D. We also adopt a probabilistic interpretation of FDs but build upon the framework of probabilistic graphical models to define FD discovery.
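To make the information-theoretic view concrete, the following sketch computes a plug-in estimate of the ratio I(X; Y) / H(Y) from a pandas DataFrame. It is an illustrative estimator only, not the bias-corrected reliable-fraction-of-information score of (mandros2017discovering); the function and column names are our own.

import numpy as np
import pandas as pd

def entropy(series: pd.Series) -> float:
    """Plug-in (empirical) entropy in bits."""
    p = series.value_counts(normalize=True).to_numpy()
    return float(-(p * np.log2(p)).sum())

def fd_score(df: pd.DataFrame, lhs: list, rhs: str) -> float:
    """Fraction of information I(X; Y) / H(Y); values near 1.0 suggest X -> Y."""
    h_y = entropy(df[rhs])
    if h_y == 0.0:
        return 1.0  # constant column: trivially determined
    # H(Y | X) = sum over x of P(x) * H(Y | X = x)
    h_y_given_x = sum(
        (len(g) / len(df)) * entropy(g[rhs])
        for _, g in df.groupby(lhs)
    )
    return (h_y - h_y_given_x) / h_y

# Hypothetical usage on a toy dataset:
df = pd.DataFrame({"zip": [1, 1, 2, 2, 3], "city": ["a", "a", "b", "b", "c"]})
print(fd_score(df, ["zip"], "city"))  # 1.0: zip -> city holds in this sample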

2.2. Probabilistic Graphical Models

We review key concepts in probabilistic graphical models (Koller:2009:PGM:1795555).

Undirected Graphs

Let P be a probability distribution over random variables X = (X_1, ..., X_m) and let G = (V, E) be an undirected graph with vertex set V = {1, ..., m} and edge set E ⊆ V × V. We say that G is a conditional independence graph for P if: for all disjoint triples (A, B, S) of vertex subsets such that S separates A from B in G, we have that X_A and X_B are independent given X_S, where X_S = {X_i : i ∈ S} for any subset S ⊆ V. We also say that G represents the distribution P. When P is a strictly positive distribution (i.e., P(x) > 0 for all x), then we have that P factorizes as a product of potential functions defined over the set of cliques of G. Undirected graphical models are also known as Markov Random Fields.

Directed Acyclic Graphs

We now consider a directed graph G = (V, E). We say that G is a directed acyclic graph (DAG) if there are no directed paths starting and ending at the same node. For each node j in V we define Pa(j) = {i : (i, j) ∈ E} to be the parent set of j, and write Pa_G(j) to emphasize the dependence on the structure of G. A DAG G represents a distribution P if P factorizes as the product over nodes j of P(x_j | x_Pa(j)). This factorization implies that, given an observation for all parent nodes of j, X_j is independent of all non-descendant nodes (i.e., nodes that cannot be reached via a directed path from j) excluding Pa(j).

Learning Parsimonious Graph Structures

Graphical models can encode complex as well as simple, low-dimensional models. The complexity of a graphical model is related to the number of edges in G. It is easier to understand this notion of complexity if one considers the connection between graphical models and generalized linear models (GLIMs). An example of this connection is the Gaussian Markov Random Field model (Koller:2009:PGM:1795555; rue2005gaussian). In GLIMs, parsimony is achieved by forcing the inverse covariance matrix (a.k.a. precision matrix) of the model to be sparse. This is because the conditional dependencies amongst the variables in the model are captured in the off-diagonal entries of the inverse covariance matrix. Zero off-diagonal entries in the inverse covariance matrix represent conditional independencies amongst the variables of the model. Given this observation and the connection of graphical models to GLIMs, one can learn a parsimonious structure for a graphical model by obtaining a sparse estimate of the model's inverse covariance matrix from observed data. Many techniques have been proposed to obtain such a sparse estimate (pourahmadi2011), ranging from optimization methods (meinshausen2006high) to regression methods (friedman2008sparse).
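As a concrete illustration, the sketch below uses scikit-learn's GraphicalLasso on a toy three-variable chain (the data-generating process and the regularization value are made up) to recover a sparse precision matrix whose zero off-diagonal entries encode conditional independencies.

import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)

# Toy chain X1 -> X2 -> X3: X1 and X3 are conditionally independent given X2,
# so the (1, 3) entry of the precision matrix should be (near) zero.
n = 5000
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.3 * rng.normal(size=n)
x3 = 0.8 * x2 + 0.3 * rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

model = GraphicalLasso(alpha=0.05).fit(X)  # alpha controls the L1 sparsity penalty
precision = model.precision_

print(np.round(precision, 2))
print("Estimated edges:", np.argwhere(np.abs(np.triu(precision, k=1)) > 1e-2))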

3. The AutoFD Framework

We formalize the problem of functional dependency discovery and provide an overview of AutoFD.

3.1. Problem Statement

We consider a relational schema R associated with a probability distribution P. We assume access to a noisy dataset D that follows schema R and is generated by the following process: first a clean dataset D* is sampled from P, and then a noisy channel model introduces noise in D* to obtain D. We assume that D and D* have the same cells but cells in D may have different values than their clean counterparts. We consider an error in D to correspond to a cell whose value differs from that of the corresponding cell in D*. This generative process is also considered in the database literature to model the creation of noisy datasets (icdt).

Given a noisy data instance D, our goal is to identify the functional dependencies that characterize the distribution P that generated the clean version of D. In our work, we combine the probability-based and logic-based interpretations of FDs (see Section 2). For any pair of tuples t1 and t2 sampled from P, we denote I_X(t1, t2) = 1[t1[X] = t2[X]], where 1[.] is the indicator function and t[X] denotes the value assignment for attributes X in tuple t. We say that t1[X] = t2[X] iff t1[A] = t2[A] for every attribute A in X. Given a distribution P, we say that an FD X → Y, with X ⊆ R and Y ∈ R, holds for P if for all pairs of tuples (t1, t2) drawn from P we have that

(2)  P(I_Y(t1, t2) = 1 | I_X(t1, t2) = 1) = 1  and  P(I_Y(t1, t2) = 1 | I_X(t1, t2) = 0) = P(I_Y(t1, t2) = 1).

This condition states that the two random events t1[X] = t2[X] and t1[Y] = t2[Y] are deterministically correlated when the FD holds, otherwise they are independent. Under this interpretation, the problem of FD discovery corresponds to learning the structural dependencies amongst attributes of R that satisfy the above condition.

Figure 1. An overview of our structure learning framework for FD discovery

3.2. Solution Overview

We leverage the above probabilistic definition of FDs and build upon structure learning to solve FD discovery. An overview of our framework is shown in Figure 1. The input to our framework is a noisy dataset and the output of our framework is a set of discovered FDs. The workflow of our framework follows three steps:

Dataset Transformation. First, we use the input dataset D to generate a collection of samples that correspond to outcomes of the random events t1[A] = t2[A] for every attribute A in R. The output of this process is a new dataset that has one attribute for each attribute in D but, in contrast to D, only contains binary values. We describe this step in Section 4.1.

Structure Learning. The transformed dataset contains samples from the distribution of the events described above. We consider a probabilistic graphical model associated with a graph that represents these events (see Section 4.1) and use these samples to learn the structure of that graph. Here, we leverage the fact that our model corresponds to a generalized linear model, and learn its structure by obtaining a sparse estimate of its inverse covariance matrix. We describe our structure learning method in Section 4.2.

FD generation. Finally, we use the estimated inverse covariance matrix to generate a collection of FDs. We do so by considering the non-zero off-diagonal entries of the estimated inverse covariance matrix. The final output of our model is a collection of discovered FDs of the form X → Y where X ⊆ R and Y ∈ R.

4. FD Discovery in AutoFD

We first introduce the probabilistic graphical model that AutoFD uses to represent FDs and then describe our approach to learning its structure. Finally, we discuss how our approach compares to a naive application of structure learning to FD discovery.

4.1. The AutoFD Model

AutoFD’s probabilistic graphical model is inspired by the FD definition described in Equation 2 and aims to capture the distribution of the random events t1[X] = t2[X] and t1[Y] = t2[Y]. AutoFD’s model consists of random variables that model these two random events. The edges in the model represent statistical dependencies that capture the relation in Equation 2.

We have one random variable per attribute in R. For each attribute A, we denote by I_A the random event of sampling two tuples t1, t2 from distribution P such that they have the same value for attribute A. In other words, for any sample (t1, t2) from P, the binary random variable I_A takes the value one iff t1[A] = t2[A]. We now define the edges over the set of binary random variables {I_A : A in R}. Assume that the FD X → Y holds and hence the correlation defined in Equation 2 holds. We represent the dependency between attributes X and Y by having a directed edge from each attribute in X to attribute Y. Each true FD in the data generating distribution corresponds to a directed subgraph with a V-structure. Let I_X denote the event that I_A = 1 for every A in X. For Equation 2 to hold, the entries of the conditional probability table for the subgraph corresponding to FD X → Y should be such that P(I_Y = 1 | I_X = 1) = 1 and P(I_Y = 1 | I_X = 0) = P(I_Y = 1), and all other entries should be set such that they force an independence structure. We assume acyclic FDs, i.e., we do not allow for sets of FDs such as X → Y and Y → X. As a result, the graphical structure of this model corresponds to a directed probabilistic graphical model where each FD introduces a V-structure subgraph. We assume a global order over the FDs which also defines the global order of the random variables in the above model.

Our goal is to learn the graphical structure of the model described above. However, learning the structure of a directed graphical model with V-structure patterns is NP-hard (Chickering:2004:LLB:1005332.1044703). In fact, it is only for tree-based directed graphical models that one can obtain guarantees for graph-based structure learning methods (Koller:2009:PGM:1795555). Given this hardness result, we turn our attention to structure learning for parsimonious generalized linear models (see Section 2). Specifically, we relax our initial model to a linear structural equation model that approximates the condition in Equation 2. This is the actual model that AutoFD uses for FD discovery. We next describe this relaxed model.

First, we relax the random variables to take values in [0, 1] instead of {0, 1}. Second, we have that when I_A = 1 for all A in X it must be that I_Y = 1. To represent this condition for real-valued random variables we rely on soft logic (Bach:2017:HMR:3122009.3176853). Soft logic allows continuous truth values from the interval [0, 1] instead of {0, 1}, and the Boolean logic operators are reformulated as: A AND B = max(0, A + B - 1), A OR B = min(1, A + B), and NOT A = 1 - A. Based on this formulation of conjunction, we can approximate the condition in Equation 2 by requiring that the value of I_Y is (approximately) a linear function of the values of the variables I_A for A in X when the FD X → Y holds. We leverage this relaxed condition to derive AutoFD’s model for FD discovery.

We consider the random vector Z = (I_A1, ..., I_Am) that collects the random variables associated with the m attributes in schema R. Based on the aforementioned relaxed condition, FDs force this random vector to follow a linear structured equation model. Hence, we can write that:

(3)  Z = B^T Z + eps

where we assume that E[eps] = 0 and eps_j is independent of Z_Pa(j) for all j, where Pa(j) denotes the parents of variable j. Since our model corresponds to a directed graphical model, matrix B is a strictly upper triangular matrix (under the global order of the random variables). B is known as the autoregression matrix of the system (loh2014high). For a DAG G with vertex set V and edge set E, the joint distribution factorizes as the product over nodes j of P(Z_j | Z_Pa(j)). Given samples of Z, our goal is to infer the unknown matrix B.
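For intuition, the following toy sketch (dimensions, coefficients, and noise scale are all made up) draws samples from a linear structured equation model of the form in Equation 3 with a strictly upper triangular autoregression matrix B, and checks empirically that the inverse covariance matrix has the factored form used in Section 4.2.

import numpy as np

rng = np.random.default_rng(1)
m, n = 4, 10000

# Strictly upper triangular autoregression matrix: the last variable depends on
# the first two (e.g., an FD {A1, A2} -> A4), and A3 depends on A2.
B = np.zeros((m, m))
B[0, 3] = 0.5
B[1, 3] = 0.5
B[1, 2] = 0.9

eps = rng.normal(scale=0.1, size=(n, m))
Z = np.zeros((n, m))
for j in range(m):  # variables are in topological order, so fill left to right
    Z[:, j] = Z @ B[:, j] + eps[:, j]

# Empirical check of the decomposition in Equation 4:
# inv(cov(Z)) should be close to (I - B) inv(Omega) (I - B)^T with Omega = cov(eps).
theta_hat = np.linalg.inv(np.cov(Z, rowvar=False))
theta_true = (np.eye(m) - B) @ (np.eye(m) / 0.1**2) @ (np.eye(m) - B).T
print(np.round(theta_hat, 1))
print(np.round(theta_true, 1))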

4.2. Structure Learning in AutoFD

Our structure learning algorithm follows from results in statistical learning theory. We build upon a recent result of Loh and Buehlmann (loh2014high) on learning the structure of linear causal networks via inverse covariance estimation. Given a linear model as the one shown in Equation 3, it can be shown that the inverse covariance matrix Theta of the model can be written as:

(4)  Theta = (I - B) Omega^{-1} (I - B)^T

where I is the identity matrix, B is the autoregression matrix of the model, and Omega = cov(eps), with cov(.) denoting the covariance matrix. This decomposition of Theta is also commonly used in generalized linear models for learning parsimonious models (pourahmadi2011).

Given Equation 4, FD discovery in AutoFD proceeds as follows: First, we transform the data records in the input dataset D to samples for the linear model in Equation 3 (see Algorithm 2); Second, we obtain an estimate of the inverse covariance matrix Theta and factorize the estimate to obtain an estimate of the autoregression matrix B; Third, we use the estimated matrix B to generate FDs (see Algorithm 3).

Input: A noisy relational dataset D following schema R.
Output: A set of FDs of the form X → Y on R.
Set D_T to the transformed version of D (See Alg. 2);
Obtain an estimate Theta of the inverse covariance matrix of D_T (e.g., using Graphical Lasso);
Factorize Theta = U Lambda U^T with U upper triangular and Lambda diagonal;
Set B = I - U;
Set F to the FDs generated from B and R (See Alg. 3);
return Discovered FDs F
Algorithm 1 FD discovery with AutoFD

An overview of AutoFD’s FD discovery method is shown in Algorithm 1. The structure learning part of this algorithm proceeds as follows: Suppose we have n observations of the random vector Z and let S be the empirical covariance matrix of these observations. It is a standard result (meinshausen2006high) that a sparse inverse covariance estimate can be obtained by maximizing, over positive definite matrices Theta, the penalized log-likelihood log det(Theta) - tr(S Theta) - lambda ||Theta||_1, where lambda controls sparsity. Friedman et al. (friedman2008sparse) have shown that one can approximate the solution to this problem by solving a series of LASSO problems. This method is known as Graphical Lasso and is one of the de-facto algorithms for structure learning. Graphical Lasso is shown to scale favorably to large instances and hence is appropriate for our setting. In our experimental evaluation, we show that our methods can scale to datasets with millions of records and tens of attributes. Given the estimated inverse covariance matrix Theta, we use the Bunch-Kaufman algorithm to obtain a factorization of Theta and an estimate of the autoregression matrix B. To generate FDs from B we use Algorithm 3.
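A minimal sketch of this structure learning step is shown below. The function name, regularization value, and the assumption that SciPy's Bunch-Kaufman routine returns an identity permutation (i.e., that the attribute order is already a valid topological order, as assumed in Section 4.1) are ours.

import numpy as np
from scipy.linalg import ldl
from sklearn.covariance import GraphicalLasso

def estimate_autoregression(samples: np.ndarray, alpha: float = 0.01) -> np.ndarray:
    """Estimate the autoregression matrix B from the binary pairwise samples
    produced by the data transformation step (Algorithm 2)."""
    # Step 1: sparse estimate of the inverse covariance matrix (Graphical Lasso).
    theta = GraphicalLasso(alpha=alpha).fit(samples).precision_

    # Step 2: Bunch-Kaufman / LDL factorization theta = U diag(d) U^T with U
    # unit upper triangular; we assume the returned permutation is the identity.
    U, d, perm = ldl(theta, lower=False)

    # From Equation 4, U corresponds to (I - B), hence B = I - U.
    return np.eye(theta.shape[0]) - U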

Input: A dataset D with n rows and k columns
Output: A dataset D_T with n * k rows and k columns
columns(D_T) = columns(D);
shuffle rows of D;
D_T = empty;
for i = 1 : k do
        sort D by attribute A_i;
        D' = circular shift of rows in D by 1;
        for j = 1 : n do
               for l = 1 : k do
                      D_T[(i - 1) * n + j][l] = 1[D[j][l] = D'[j][l]];
               end for
        end for
end for
return D_T
Algorithm 2 Data Transformation

We now turn our attention to how we transform the input dataset D into a collection of observations for the linear model of AutoFD (see Algorithm 2). We use the differences of pairs of tuples in dataset D to generate these observations. As shown in Algorithm 2, we perform a self-join over the input dataset and consider the value differences between the generated pairs of tuples to obtain observations for the random variables in AutoFD’s probabilistic model. Our method can support diverse data types (e.g., categorical, real-valued, text data, binary data, or mixtures of those) as we can use a different difference operation for each of these types.
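One possible pandas/numpy implementation of this transformation is sketched below; the function name and the plain equality comparison per attribute are ours, and richer difference operators would replace the equality test for other data types.

import numpy as np
import pandas as pd

def transform(D: pd.DataFrame, seed: int = 0) -> np.ndarray:
    """Sketch of Algorithm 2: build binary observations from pairs of tuples.
    For each attribute we sort by that attribute and pair every tuple with the
    previous one (circular shift by 1); each pair yields one binary row whose
    l-th entry is 1 iff the two tuples agree on the l-th attribute."""
    D = D.sample(frac=1.0, random_state=seed)  # shuffle rows
    n = len(D)
    blocks = []
    for col in D.columns:
        S = D.sort_values(by=col)
        shifted = S.iloc[np.roll(np.arange(n), 1)]  # circular shift of rows by 1
        blocks.append((S.to_numpy() == shifted.to_numpy()).astype(float))
    return np.vstack(blocks)  # (n * k) x k binary matrix

# Hypothetical usage on a toy dataset:
D = pd.DataFrame({"zip": [1, 1, 2, 2], "city": ["a", "a", "b", "b"]})
print(transform(D))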

Input: An autoregression matrix B of dimensions m x m, a schema R
Output: A collection of FDs F
F = empty;
for j = 1 : m do
        Set the column vector b = B[:, j];
        Take the set X of attributes in R that correspond to non-zero entries in b;
        Let Y be the attribute in R with coordinate j;
        if X is not empty then
               F = F union {X → Y};
        end if
end for
return FDs F
Algorithm 3 FD generation
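A compact sketch of this generation step follows; the zero threshold tol is an arbitrary choice of ours.

import numpy as np

def generate_fds(B: np.ndarray, schema: list, tol: float = 1e-6):
    """Sketch of Algorithm 3: read FDs X -> Y off the columns of the estimated
    autoregression matrix B (one column per attribute in the schema)."""
    fds = []
    for j, rhs in enumerate(schema):
        col = B[:, j]
        lhs = [schema[i] for i in np.flatnonzero(np.abs(col) > tol) if i != j]
        if lhs:
            fds.append((lhs, rhs))
    return fds

# Hypothetical usage with a toy 3x3 autoregression matrix:
B = np.array([[0.0, 0.0, 0.9],
              [0.0, 0.0, 0.8],
              [0.0, 0.0, 0.0]])
print(generate_fds(B, ["zip", "state", "city"]))  # [(['zip', 'state'], 'city')]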

4.3. Discussion

There are certain benefits that AutoFD’s model offers when compared to applying structure learning directly on D.

Our transformation allows us to solve a structure learning problem in which we have access to an increased amount of training data. As we show in Section 5, existing methods are not robust when the sample size is small. Information-theoretic approaches, such as the one by Mandros et al. (mandros2017discovering), tend to assign a low-confidence score to FDs for small sample sizes. Hence, they exhibit limited recall.

Structure learning for the model described in Section 4.1 enjoys better sample complexity than applying structure learning on the raw input dataset. We focus on the case of discrete random variables to explain this argument. Let d be the size of the domain of the variables. The sample complexity of state-of-the-art structure learning algorithms grows with d (wu2018sparse). Our model restricts the domain of the random variables to be binary, and hence, yields better sample complexity than applying structure learning directly on the raw input. We demonstrate this experimentally in Section 5.

5. Experiments

We compare AutoFD against several FD discovery methods on diverse datasets. The main points we seek to validate are: (1) does structure learning enable us to discover FDs accurately (i.e., with high precision and recall), (2) can AutoFD scale to large datasets, and (3) can AutoFD provide FDs that are useful for downstream data preparation tasks. We also perform micro-benchmark experiments to examine the effectiveness and sensitivity of our model.

5.1. Experimental Setup

Datasets: We use both synthetic and real-world datasets in our experiments. Our synthetic datasets aim to capture different data properties with respect to four key factors that affect the performance of FD discovery algorithms: (1) Noise Rate (denoted by n), which stresses the robustness of FD discovery methods; (2) Number of Tuples (denoted by t), which affects the sample size available to the FD discovery methods; (3) Number of Attributes (denoted by r), which stresses the scalability of FD discovery methods; (4) Domain Cardinality (denoted by d) of the left-hand side of an FD, which evaluates the sample complexity of FD methods. For our end-to-end evaluation (see Section 5.2), we consider 24 different setting combinations for these four dimensions (summarized in Table 1). For each setting we use a mixture of FDs for which the cardinality of the left-hand side X ranges from one to three.

Property Settings
Noise Rate (n) 0% (Zero), 1% (Low), 30% (High)
Tuples (t) 1,000 (Small), 100,000 (Large)
Attributes (r) 8-16 (Small), 40-80 (Large)
Domain Cardinality for FD (d) 64-216 (Small), 1,000-1,728 (Large)
Table 1. The different settings we consider for synthetic datasets. We use the description in parentheses to denote each of these settings in our experiments.

We follow the next process to generate synthetic data. Given a schema with r attributes, our generator first assigns a global order to these attributes and splits the ordered attributes into consecutive attribute sets whose size is between two and four (so that we obey the FD cardinality discussed above). Let (X, Y) be the attributes in such a split. Our generator samples a value from the range associated with the setting for Domain Cardinality and assigns a domain to each attribute in X such that the cartesian product of the attribute values corresponds to that value. It also assigns a corresponding domain size to Y.

To simulate real-world data, we introduce FD dependencies as well as correlations in the splits obtained by the above process. For half of the (X, Y) groups generated via the above process, we introduce FD-based dependencies that satisfy the property in Equation 1. We do so by assigning each value in the domain of X to a value in the domain of Y uniformly at random and generating t samples, where t is the value of the Tuples parameter. For the remainder of those groups we force a conditional probability distribution that correlates X and Y without making the relation deterministic: we assign each value x in the domain of X to a value y in the domain of Y, and then generate t samples in which Y takes the value y with probability p given X = x and another value otherwise. Here, p is a hyper-parameter that is sampled uniformly at random. This process allows us to mix FDs with other correlations, and hence, evaluate the ability of FD discovery mechanisms to differentiate between true FDs and strong correlations. Finally, to test how robust FD discovery algorithms are to noise, we randomly flip cells that correspond to attributes that participate in true FDs to a different value from their domain. The percentage of flipped cells is controlled by the Noise Rate setting.
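As an illustration, a simplified generator for one FD-based (X, Y) group might look like the sketch below; the function name, the single-attribute left-hand side, and the noise-flip mechanism are our own simplifications of the process described above.

import numpy as np
import pandas as pd

def generate_fd_group(n_tuples: int, lhs_card: int, noise_rate: float, seed: int = 0):
    """Generate tuples for one group with a true FD X -> Y, then flip a fraction
    of the Y cells to a different value to simulate noise."""
    rng = np.random.default_rng(seed)
    fd_map = rng.integers(0, lhs_card, size=lhs_card)   # each x value maps to one y value
    x = rng.integers(0, lhs_card, size=n_tuples)
    y = fd_map[x]
    noisy = rng.random(n_tuples) < noise_rate
    y = np.where(noisy, (y + rng.integers(1, lhs_card, size=n_tuples)) % lhs_card, y)
    return pd.DataFrame({"X": x, "Y": y})

# e.g., 1,000 tuples, domain size 64, 1% noise:
print(generate_fd_group(1000, 64, 0.01).head())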

Dataset Size Attributes Errors (# of cells)
Hospital 1,000 19 504
Food 170,945 15 31,296
Physician 2,071,849 18 174,557
Table 2. Real-world datasets for our experiments.

For real-world datasets, we use three noisy datasets. Table 2 provides information for these datasets. (1) The Hospital dataset is a small benchmark dataset used in several data cleaning papers (hc; holistic); errors are artificially introduced by injecting typos. (2) The Food dataset contains information on food establishments in Chicago; errors correspond to typos. (3) The Physician dataset is from Medicare.gov (https://data.medicare.gov/data/physician-compare); errors correspond to typos and null values.

Methods: We compare AutoFD against:

PYRO (kruse2018efficient): PYRO is the state-of-the-art FD discovery method in the database community. The code we used for the experiments is released by the authors (https://github.com/HPI-Information-Systems/pyro/releases). The scalability of the algorithm is controlled via an error rate hyper-parameter.

Reliable Fraction of Information (RFI) (mandros2017discovering): This method is the state-of-the-art FD discovery approach in the data mining community. It relies on an information theoretic score to identify FDs and uses an approximation scheme to optimize performance. The approximation ratio is controlled by a user-specified hyper-parameter alpha. We evaluate RFI for alpha in {0.3, 0.5, 1.0}, where a value of 1.0 corresponds to no approximation. The code we used is released by the authors (http://eda.mmci.uni-saarland.de/prj/dora/). This implementation discovers FDs for one attribute at a time. To discover all FDs in a dataset, we run the provided method once per attribute.

Graphical Lasso (GL): We also evaluate a state-of-the-art structure learning algorithm on the raw input dataset D. Graphical Lasso provides us with an estimate of the inverse covariance matrix for that problem. Graphical Lasso is shown to recover the true structure of the undirected graphical model that represents the distribution corresponding to D (wu2018sparse). In this case we cannot factorize the estimate to generate FDs as in AutoFD. To find FDs that determine an attribute Y, we take the neighborhood (as defined by the estimated inverse covariance matrix) of the corresponding random variable and perform a local graph search to find high-score directed structures (Koller:2009:PGM:1795555).

Evaluation Setup: To measure accuracy, we use Precision (P), defined as the fraction of correctly discovered FDs over the total number of discovered FDs; Recall (R), defined as the fraction of correctly discovered FDs over the total number of true FDs in the dataset; and the F1-score, defined as F1 = 2PR / (P + R). For synthetic datasets, each setting has five corresponding dataset instances. To ensure that we maintain the coupling amongst Precision, Recall, and F1, we report the median performance. For all methods, we fine-tuned their hyper-parameters to optimize performance. In the case of PYRO we consulted the authors for this process. All experiments without specific description were executed on a machine with two Intel Xeon Silver 4114 10-core CPUs at 2.20 GHz and 192 GB of memory. We run two datasets in parallel at a time, with each dataset assigned 16 isolated threads and 93 GB of memory.

5.2. End-to-end Performance

We evaluate the performance of AutoFD against competing approaches on the synthetic and real-world data described above. We first present quantitative results on the synthetic data (since we know the exact FDs) and then present qualitative results on the real-world datasets.

n t r d AutoFD GL PYRO RFI (alpha=0.3) RFI (alpha=0.5) RFI (alpha=1.0)
High l l l P 0.500 0.143 0.001 - - -
R 1.000 0.100 0.200 - - -
0.667 0.118 0.001 - - -
s P 0.435 0.353 0.001 - - -
R 1.000 0.600 0.300 - - -
0.606 0.444 0.002 - - -
s l P 0.400 0.000 0.005 - - -
R 0.500 0.000 0.250 - - -
0.500 0.000 0.009 - - -
s P 0.500 0.333 0.006 - - -
R 0.500 0.500 0.500 - - -
0.500 0.400 0.013 - - -
s l l P 0.600 0.000 0.001 - - -
R 0.400 0.000 0.400 - - -
0.471 0.000 0.002 - - -
s P 0.304 0.000 0.001 - - -
R 0.700 0.000 0.200 - - -
0.424 0.000 0.001 - - -
s l P 0.250 0.000 0.000 0.000 0.000 0.000
R 0.500 0.000 0.000 0.000 0.000 0.000
0.333 0.000 0.000 0.000 0.000 0.000
s P 0.400 0.000 0.000 0.000 0.000 0.000
R 1.000 0.000 0.000 0.000 0.000 0.000
0.571 0.000 0.000 0.000 0.000 0.000
Low l l l P 0.400 0.364 0.000 - - -
R 1.000 0.400 0.200 - - -
0.571 0.381 0.000 - - -
s P 0.714 0.353 0.000 - - -
R 1.000 0.600 1.000 - - -
0.833 0.444 0.000 - - -
s l P 0.667 0.333 0.008 0.375 - -
R 1.000 0.500 0.500 0.750 - -
0.800 0.400 0.016 0.500 - -
s P 1.000 0.500 0.002 1.000 - -
R 0.500 1.000 1.000 1.000 - -
0.667 0.667 0.004 1.000 - -
s l l P 0.533 0.017 0.000 - - -
R 0.700 0.100 0.300 - - -
0.640 0.029 0.000 - - -
s P 0.909 0.167 0.000 - - -
R 1.000 0.100 1.000 - - -
0.952 0.143 0.000 - - -
s l P 0.667 0.000 0.008 0.250 0.250 0.250
R 1.000 0.000 0.500 0.500 0.500 0.500
0.800 0.000 0.016 0.333 0.333 0.333
s P 1.000 0.000 0.005 0.143 0.286 0.286
R 1.000 0.000 1.000 0.500 1.000 1.000
1.000 0.000 0.010 0.222 0.444 0.444
Zero l l l P 0.667 0.214 - - - -
R 0.600 0.300 - - - -
0.632 0.250 - - - -
s P 0.667 0.421 - - - -
R 1.000 0.800 - - - -
0.800 0.552 - - - -
s l P 1.000 0.667 0.000 - - -
R 1.000 0.500 0.000 - - -
1.000 0.667 0.000 - - -
s P 1.000 0.400 0.006 1.000 1.000 -
R 1.000 1.000 0.500 0.500 0.500 -
1.000 0.500 0.012 0.667 0.667 -
s l l P 0.714 0.017 0.000 - - -
R 0.500 0.100 0.200 - - -
0.588 0.029 0.000 - - -
s P 0.769 0.143 - - - -
R 1.000 0.100 - - - -
0.870 0.118 - - - -
s l P 0.667 0.000 0.001 0.000 0.000 0.000
R 1.000 0.000 0.500 0.000 0.000 0.000
0.800 0.000 0.003 0.000 0.000 0.000
s P 1.000 0.100 0.001 0.200 0.200 -
R 1.000 0.500 0.500 0.500 0.500 -
1.000 0.167 0.003 0.286 0.286 -
  • ’-’: method exceeds runtime limit (8 hours), or runs out of memory, or output is more than 7 GB.

Table 3. Precision, Recall and F1-score of different methods for different synthetic settings. A description of the different settings is provided in Table 1.

5.2.1. Accuracy

Table 3 shows the precision, recall, and F1-score obtained by the different methods. As shown, AutoFD consistently outperforms all other methods in terms of F1-score across all settings, with an improvement of more than 2X on average. More importantly, we find that AutoFD is less affected by limited sample sizes and high-cardinality domains compared to other FD discovery methods. In detail, we find that AutoFD maintains good precision and recall for datasets with low amounts of noise (at most 1%), with an average precision of 85.52% and an average recall of 99.75%. Despite the fact that it exhibits an average drop of 27.38% for datasets with a high noise rate, AutoFD still yields better precision and recall than competing methods. This verifies our hypothesis that structure learning, along with the data transformation step introduced in Section 4.1, leads to a more accurate FD discovery solution.

We now focus on the results for the competing methods. We start with PYRO. To optimize PYRO’s performance we set its error rate hyper-parameter to the noise level of each dataset. For low noise rates PYRO may not terminate. We see that in most cases PYRO obtains high recall but low precision. This behavior is expected as PYRO follows a logic-based interpretation of FDs (see Section 2) and aims to discover all FDs that hold for a given dataset instance. It is not designed to find the true FDs in the data generating distribution or interpretable FDs for data preparation tasks. For example, for datasets with a small number of attributes (8-16), PYRO finds 446 FDs on average (excluding outputs ranging from 7.8 GB to 10 GB that we could not handle), which may hurt performance in downstream data preparation tasks.

We now turn our attention to RFI. As shown, RFI exhibits poor scalability: in many cases it fails to terminate within 8 hours, and in others it runs out of memory. For the cases where RFI terminates, we find that it exhibits high precision for small-cardinality domains when a large number of samples is available and the noise rate is low. As the sample size decreases or the noise rate increases, the performance of RFI drops significantly. We further investigated the performance of RFI for partial executions. Recall that due to the implementation of RFI, we have to run it for each attribute separately. We evaluated RFI’s accuracy for each of the attributes processed within the 8-hour time window. Our findings are consistent with the aforementioned observation. The precision of RFI is very high but its recall is lower than AutoFD’s. The main takeaway is that RFI has high sample complexity.

Finally, we see that the high sample complexity of structure learning on the raw input (see Section 4.3) leads to GL exhibiting low accuracy. This becomes clearer if we compare the performance of GL with a large number of tuples to that with a small number of tuples while keeping all other variables constant. We see a consistent drop in performance when the data sample becomes limited. This validates our modeling choices for AutoFD.

n t r d AutoFD GL PYRO RFI (alpha=0.3) RFI (alpha=0.5) RFI (alpha=1.0)
High l l l 305.451 5.027 9.165 - - -
s 259.571 4.370 6.608 - - -
s l 8.821 0.740 1.974 15879.989 40814.085 -
s 10.147 0.799 1.662 7212.395 17868.892 21866.164
s l l 3.050 0.280 1.741 - - -
s 3.064 0.253 1.590 - - -
s l 0.290 0.096 0.505 869.717 1450.224 1720.670
s 0.287 0.077 0.578 434.343 713.357 650.564
Low l l l 285.167 4.993 69.377 - - -
s 256.525 4.432 458.153 - - -
s l 8.762 0.721 1.665 20763.900 24873.611 -
s 10.156 0.720 4.135 8784.491 6108.177 27178.642
s l l 3.001 0.281 3.906 - - -
s 3.061 0.284 40.593 - - -
s l 0.285 0.075 0.508 747.225 859.139 1610.464
s 0.307 0.085 0.752 361.877 586.522 522.050
Zero l l l 287.191 4.898 - - - -
s 259.578 4.350 - - - -
s l 8.714 0.737 6.995 24068.404 24868.802 45127.042
s 10.027 0.799 7.590 8136.108 6511.796 24328.727
s l l 3.006 0.289 965.906 - - -
s 3.110 0.245 - - - -
s l 0.294 0.079 0.800 731.388 928.043 1204.162
s 0.294 0.091 1.260 309.829 547.799 669.768
  • ’-’: method either exceeds runtime limit (8 hours) or runs out of memory.

Table 4. Average runtime (in seconds) of different methods for different synthetic settings.

5.2.2. Runtime

We measure the total wall-clock runtime of each FD discovery method on all datasets. The results are shown in Table 4. AutoFD and GL are Python-based, non-parallelized programs, while RFI and PYRO are Java-based, parallelized programs. Since most methods finish within hundreds of seconds, we limit the maximum runtime to eight hours. Overall, we see that AutoFD’s runtime is better than RFI’s, and AutoFD has better column-wise scalability than both RFI and PYRO, though worse row-wise scalability than PYRO.

5.3. Performance on Real-World Data

We evaluate the performance of all methods on the real-world datasets described in Section 5.1. We first report the runtime of the different methods and then present a qualitative analysis of the FDs they discover. A summary of our findings is shown in Table 5. We first focus on runtime. As shown, both AutoFD and PYRO can scale to large real-world noisy data instances. We see that AutoFD requires only 79 seconds to analyze a dataset with over two million tuples and 18 attributes. As with the synthetic data, RFI scales poorly. We next focus on the FDs discovered by the different methods.

Dataset AutoFD GL PYRO RFI(.3) RFI(.5) RFI(1.0)
Hospital runtime (sec) 0.318 - 1.029 3249.8 10272.8 17712.8
# of FDs 9 - 434 16 16 16
Food runtime (sec) 14.433 0.924 5.059 - - -
# of FDs 11 16 156 - - -
Physician runtime (sec) 79.068 5.920 55.978 - - -
# of FDs 4 6 528 - - -
  • ’-’ for GL: too few data samples make the matrix too ill-conditioned to solve

  • ’-’ for RFI: did not complete within eight hours.

  • *: this experiment was executed on a different machine with 4 CPUs (each a 20-core Intel(R) Xeon(R) Gold 6148 with hyper-threading) and 0.5 TB of RAM

Table 5. Quantitative results over the real-world datasets.
Figure 2. The autoregression matrix estimated by AutoFD for the Hospital dataset.

We see that AutoFD, GL, and RFI find a number of FDs that is always less than the number of attributes in the input dataset. On the other hand, PYRO finds hundreds of FDs for each dataset. These results are consistent with the FD interpretation adopted by each system (see Section 2). We now analyze some of the FDs discovered by the different systems, focusing on the FDs discovered for Hospital. We first consider the FDs discovered by AutoFD. A heatmap of the autoregression matrix of AutoFD’s model is shown in Figure 2. We find that the discovered FDs are meaningful. For example, we see that attributes ‘Provider Number’ and ‘Hospital Name’ determine most other attributes. We also see that ‘Address1’ determines location-related attributes such as ‘City’, ‘Zip code’ and ‘County’. We also find that attribute ‘Measure Code’ determines ‘Measure Name’ and that they both determine ‘StateAvg’. In fact, ‘StateAvg’ corresponds to the concatenation of the ‘State’ and ‘Measure Code’ attributes. The reader may wonder why the ‘State’ attribute is found to be independent of every other attribute. We attribute this to the fact that the Hospital dataset only contains two states, with one appearing nearly 89% of the time. By enforcing a sparse structure, AutoFD weakens the role of ‘State’ in deterministic relations. These results show that AutoFD can identify meaningful FDs in real-world datasets.

Figure 3. The FDs discovered by RFI for Hospital.

We now consider the competing methods. For RFI, the results are consistent across all three values of alpha, so we pick the one with the highest alpha (lowest approximation rate). RFI outputs 18 FDs, shown in Figure 3. The value in parentheses is the reliable fraction of information, the score proposed by RFI to select approximate FDs. After eliminating FDs with low scores, we find that most of the FDs discovered by RFI are also meaningful. However, RFI has the problem of overfitting to the dataset. Specifically, the FD ‘ZipCode’ → ‘EmergencyService’ holds for the given dataset instance, but does not convey any real-world meaning. We attribute this behavior to the fact that the domain of ‘ZipCode’ is very large while ‘EmergencyService’ only has a binary domain. This makes it more likely to observe a spurious FD when the number of data samples is limited. This finding matches RFI’s performance on the synthetic datasets. For PYRO, we find that it discovers hundreds of FDs that are not particularly meaningful for data preparation tasks. For instance, PYRO finds 24 FDs that determine the attribute ‘Address1’.

5.4. Using AutoFD to Automate Data Cleaning

Recent work (hc) showed that integrity constraints such as FDs can be used to train machine learning models for data cleaning in a weakly supervised manner. A limitation of this work is that it relies on users to specify these constraints. Here, we test whether AutoFD can be used to automate this process and address this pain point. For our experiments, we use the open-source version of the system from (hc), as it provides a collection of manually specified FDs for the Hospital dataset. We perform the following experiment: we compare the manual FDs in that repository with the FDs discovered by AutoFD. The precision, recall, and F1 reported by the data cleaning system for the manual constraints are 0.91, 0.70, and 0.79 respectively, while the corresponding metrics for the FDs discovered by AutoFD are 0.93, 0.72, and 0.81. We see that this performance is comparable to that of the manually specified FDs, thus providing evidence of the applicability of AutoFD for discovering FDs that are useful in downstream data preparation tasks.

5.5. Micro-benchmark Results

Finally, we report micro-benchmarking results: (1) we evaluate the scalability of AutoFD and demonstrate its quadratic computational complexity with respect to the number of attributes; (2) we evaluate the effect of increasing noise rates on the performance of AutoFD.

5.5.1. Column-wise Scalability

Based on our discussion in Section 4, AutoFD exhibits quadratic complexity instead of exponential complexity with respect to the number of columns in a dataset. We experimentally demonstrate AutoFD’s scalability. We generate a collection of synthetic datasets where we keep all settings fixed except for the number of attributes, which we vary from 4 to 190 with an increase step of two. For each number of columns, we generate five datasets and report the average runtime for each column size. In addition, we log both the total runtime (including data loading and data transformation) and the structure learning runtime. The results are shown in Figure 4 and validate the quadratic scalability of AutoFD as the number of attributes increases.

Figure 4. Columns-wise Scalability of AutoFD.

5.5.2. Effect of Increasing Noise Rates

In this experiment, we evaluate how AutoFD performs as the noise rate increases. For this experiment we generate a new set of synthetic datasets that is different from that of Section 5.2. Again we generate five instances per dataset setting (see Table 1 for our settings) and measure the performance of AutoFD for a range of increasing noise rates. We report the median F1-score in Figure 5. As expected, the performance of AutoFD deteriorates as the noise increases; however, AutoFD is shown to be robust to high error rates.

Figure 5. Effect of Increasing Noise Rates. Dataset names correspond to the setting that was used (see Table 1).

6. Conclusions

We introduced AutoFD, a structure learning framework for solving the problem of FD discovery in relational data. A key result in our work is to model the distribution that FDs impose over pairs of records instead of the joint distribution over the attribute-values of the input dataset. Specifically, we introduce a method that converts FD discovery into a structure learning problem over a linear structured equation model. We empirically show that AutoFD outperforms state-of-the-art FD discovery methods and can produce meaningful FDs that are useful for downstream data preparation tasks.

References