Data is central to modern applications in a wide range of domains, including healthcare, transportation, finance, and many others. Such data-driven applications typically rely on learning components trained over large datasets, or components specifically designed to target particular data and workloads. While these data-driven approaches have seen wide adoption and success, their reliability and proper function hinges on the data’s continued conformance to the applications’ initial design. When data deviates from these initial parameters, application performance degrades and system behavior becomes unpredictable. In particular, as data-driven prediction models are employed in applications with great human and societal impact, we need to be able to assess the trustworthiness of their predictions.
The ability to understand and quantify the deviations of data from the ranges and distributions that an application can safely support is critical in determining whether the prediction of a machine-learning system can be trusted [DBLP:conf/hicons/TiwariDJCLRSS14, DBLP:journals/crossroads/Varshney19, DBLP:journals/corr/abs-1904-07204, DBLP:conf/kdd/Ribeiro0G16, DBLP:conf/nips/JiangKGG18], when a system needs to be retrained because of data drift [DBLP:conf/kdd/QahtanAWZ15, DBLP:journals/tnn/KunchevaF14, DBLP:journals/csur/GamaZBPB14, DBLP:journals/jss/BarddalGEP17], and when a database needs to be retuned [koch2013]. In this paper, we characterize and quantify such deviations with a new data-profiling primitive, a data invariant, that captures an implicit constraint over multiple numerical attributes that tuples in a reference dataset satisfy. We proceed to describe a real-world example of data invariants, drawn from our case-study evaluation on the problem of trusted machine learning (TML), which aims to quantify trust in the prediction of a machine-learned model over a new input tuple.
We used a dataset with flight information that includes data on departure and arrival times, flight duration, etc. (Figure 1) to train a linear regression model to predict flight delays. The model was trained on a subset of the data that happened to include only daytime flights (such as the first 4 tuples in Figure 1). In an empirical evaluation of the regression accuracy, we found that the mean absolute error of the regression output more than quadruples for overnight flights (such as the last tuple in Figure 1), compared to daytime flights. The reason is that data for overnight flights deviates from the profile of the training data. Specifically, daytime flights satisfy the invariant that “arrival time is later than departure time and their difference is very close to the flight duration”, which does not hold for overnight flights. Critically, even though this invariant is unaware of the target attribute (delay), it was still a good proxy for the regressor’s performance.
In this paper, we propose data invariants, a new data-profiling primitive that complements the existing literature on modeling data constraints. Specifically, data invariants capture arithmetic relationships over multiple numerical attributes in a possibly noisy dataset. For example, the data invariant of Example 1 corresponds to the constraint: ε₁ ≤ arrival_time − departure_time − duration ≤ ε₂, where ε₁ and ε₂ are small values. Data invariants can capture complex linear dependencies across attributes. For example, if the flight departure and arrival data reported the hours and the minutes across separate attributes, the invariant would involve a linear combination of those attributes: ε₁ ≤ (60 · arrival_hour + arrival_min) − (60 · departure_hour + departure_min) − duration ≤ ε₂. Existing constraint models, such as functional dependencies and denial constraints, do not capture such arithmetic dependencies, and are typically sensitive to noise and not well-suited for numerical attributes.
A key insight in our work is that learning systems implicitly rely on data invariants (e.g., by reducing the weight of an attribute that can be deduced from others); thus, we can use a tuple’s deviation from these invariants as a proxy for the trust in the system’s prediction for that tuple. We focus on quantitative semantics of data invariants, so that we not only capture the (Boolean) violation of data invariants by a new tuple, but we can also measure the degree of violation. Through this mechanism, data invariants can quantify trust in prediction outcomes, detect data drift, and specify when a database should be retuned.
We first proceed to discuss where data invariants fit with respect to the existing literature on data profiling: specifically, functional dependencies and denial constraints. Then, we provide the core intuition and insights for modeling and deriving data invariants.
Prior art on modeling arithmetic constraints
Data invariants, just like other constraint models, fall under the umbrella of data profiling, which refers to the task of extracting technical metadata about a given dataset [DBLP:journals/vldb/AbedjanGN15]. A key task in data profiling is to learn relationships among attributes. Denial constraints (DC) [DBLP:journals/pvldb/ChuIP13, DBLP:journals/pvldb/BleifussKN17, pena2019discovery] encapsulate a number of different data-profiling primitives such as functional dependencies (FD). However, most DC discovery techniques are restricted to hard constraints—(1) all tuples must satisfy the constraints, and (2) the constraints should be exactly satisfied—and are not suitable when the data is noisy.
DCs can adjust to noisy data by adding predicates until the constraint becomes exact over the entire dataset, but this can make the constraint extremely large, complex, and uninterpretable. Moreover, such a constraint might not even provide the desired generalization. For example, a finite DC—whose language is limited to universally-quantified first-order logic—cannot model the data invariant of Example 1, which involves an arithmetic expression (addition and multiplication with a constant). Expressing data invariants requires a richer language that includes linear arithmetic expressions. Pattern functional dependencies [qahtan2020pattern] move towards addressing this limitation of DCs using regular expressions, but they only focus on non-numerical attributes. While techniques for approximate DC discovery [pena2019discovery, huhtala1999tane, kruse2018efficient] exist, they rely on users to provide an acceptable error threshold. In contrast, data invariants do not require any error threshold.
Using an FD such as {departure_time, arrival_time} → duration to model the invariant of Example 1 suffers from several limitations. First, since the data is noisy, in the traditional setting, no FD would be learned. Metric FDs [koudas2009metric] allow small variations in the data (similar attribute values are considered identical); however, to model this invariant using a metric FD, one needs to specify non-trivial similarity metrics involving multiple attributes. In our example, such a metric would need to encode that (1) the attribute combination hour = 4, min = 59 is very similar to hour = 5, min = 1 and (2) the attribute combination hour = 4, min = 1 is not similar to hour = 5, min = 59. Moreover, existing work on metric FDs only focuses on their verification [koudas2009metric]; to the best of our knowledge, no prior work exists on the discovery of metric FDs.
Our work targets the following challenges.
- Numerical attributes.
Numerical attributes are inherently noisy and their domains can be infinite. Most existing approaches are restricted to categorical or discretized attributes, and typically either do not handle numerical attributes at all or perform poorly on them. Data invariants focus specifically on numerical attributes, and can model complex linear dependencies among them.
- Impact of violation.
In applications of constraint violations, some violations may be less critical than others. Our work considers a notion of invariant importance, and weighs violations against invariants accordingly. Intuitively, violating a stricter invariant (an invariant specifying low variability in the underlying data) is likely more critical than violating a looser one.
- Measuring violation.
Existing approaches typically require that constraints are exactly satisfied by all tuples. As a result, they discard inexact constraints that may still be useful and treat violation as Boolean (a tuple satisfies the constraint or not). Data invariants offer flexibility in the derived constraints, and specifically allow for measuring the degree of violation. This measure weighs the violation of each invariant based on a measure of its impact.
Unlike data invariants, most existing approaches focus on precise and exact constraints. When the size of the data grows, especially in the presence of noise, these approaches extract less useful information as exact constraints become more rare. Thus, scaling to larger data sizes can render these methods less effective.
The complexity of existing techniques grows exponentially with the number of attributes, making dependency discovery impractical even for a moderately large number of attributes. In contrast, the complexity of data-invariant computation is only cubic in the number of attributes.
In summary, we envision data invariants—which encode constraints that express approximate arithmetic relationships among numerical attributes—as an essential primitive that will enrich the data-profiling literature and is complementary to existing techniques in denial constraints.
Key insights of data invariants
Data invariants fall under the umbrella of the general task of data profiling. They specify constraints over numerical attributes, complementing existing work on data constraints. In this paper in particular, we focus on data invariants specifying linear arithmetic constraints. Data invariants focus on finding a closed-form (and thus explainable) function over the numerical attributes, such that the function, when evaluated on the tuples, results in low variance.
We present a method for learning linear invariants inspired by principal component analysis (PCA). Our key observation is that the principal components with low variance (on the dataset) yield strong data invariants. Note that this is different from—and in fact completely opposite to—the traditional approaches that perform multidimensional analysis after reducing dimensionality using PCA [DBLP:conf/kdd/QahtanAWZ15]. Beyond simple linear invariants—such as the one in Example 1—we also derive disjunctive linear invariants, which are disjunctions of linear invariants. We achieve this derivation by dividing the dataset into disjoint partitions, and learning simple linear invariants for each partition.
In this paper, we introduce a simple language for data invariants. Furthermore, given an invariant and a tuple, we derive a numerical score that measures how much the tuple violates the invariant: a score of zero indicates no violation and a positive score indicates that the tuple violates the invariant, with higher score indicating greater violation. We also provide a mechanism to aggregate the violation of a set of data invariants, by weighing violation of stricter (i.e., low-variance) invariants more and looser (i.e., high-variance) invariants less. Our experimental evaluation (Section 6) demonstrates that this violation score is an effective measure of confidence in the prediction of learned models and effectively captures data drift.
Contributions. Our work makes the following contributions:
We ground our motivation and our work with two case studies on trusted machine learning (TML) and data drift. (Section 2)
We introduce and formalize data invariants, describe a language for expressing them, and discuss their semantics. (Section 3)
We formally establish that strong data invariants are constructed from derived attributes with small variance and small mutual correlation on the given dataset. We provide an efficient, scalable, and highly parallelizable PCA-based algorithm for computing simple linear invariants and disjunctions over them. We also analyze their time and memory complexity. (Section 4)
We formalize the notion of non-conforming tuples in the context of trusted machine learning and provide a mechanism to detect whether a tuple is non-conforming using data invariants. To that end, we focus on weak invariants, whose violation is sufficient, but not necessary, to detect non-conformance. (Section 5)
We empirically analyze the effectiveness of data invariants in our two case-study applications—TML and data-drift quantification. We find that data invariants can reliably predict the trustworthiness of linear models and quantify data drift precisely, outperforming the state of the art. We also show how an intervention-centric explanation tool—built on top of data invariants—can explain causes for tuple non-conformance (by assigning responsibility to the attributes) on real-world datasets. (Section 6)
2 Case studies of data invariants
Like other data-profiling mechanisms, data invariants have general applicability in understanding and describing datasets. Within the scope of our work, we focus in particular on the utility of data invariants in detecting when applications may operate outside their safety envelope [DBLP:conf/hicons/TiwariDJCLRSS14], i.e., when the operation of a system may become untrustworthy or unreliable. We describe two case studies that motivate our work. We later provide an extensive evaluation over these two applications in Section 6.
Trusted Machine Learning. Trusted machine learning (TML) refers to the problem of quantifying trust in the prediction made by a machine-learned model on a new input tuple. This is particularly useful in the case of extreme verification latency [souzaSDM:2015], where ground-truth outputs for new test tuples are not immediately available to evaluate the performance of a learned model. If a model is trained using a dataset D, then data invariants for D specify a safety envelope [DBLP:conf/hicons/TiwariDJCLRSS14] that characterizes the tuples for which the learned model is expected to make trustworthy predictions. If a new tuple falls outside the safety envelope, i.e., it violates the invariants, then the learned model is likely to produce an untrustworthy prediction. Intuitively, the higher the violation, the lower the trust. While some machine-learning approaches return a confidence measure along with the prediction, these confidence measures are not well-calibrated, and it is well-known that there are several issues with interpreting them as a measure of trust in the prediction [DBLP:conf/nips/JiangKGG18, guo2017calibration].
In data-driven systems, feature drift [DBLP:journals/jss/BarddalGEP17] is one of the reasons for observing non-conforming tuples. In the context of trusted machine learning, we formalize the notion of non-conforming tuples. A key result that we present is that data invariants provide a sound and complete procedure for detecting whether a given tuple is non-conforming. Since our proof of this result is non-constructive, we present a second result that establishes sufficient, but not necessary, conditions for detecting non-conformance (Section 5). This result indicates that the search for invariants should be guided by the class of models considered by the corresponding machine-learning technique.
Data drift. Our second use case focuses on detecting and quantifying data drift [DBLP:conf/kdd/QahtanAWZ15, DBLP:journals/tnn/KunchevaF14, DBLP:journals/csur/GamaZBPB14, DBLP:journals/jss/BarddalGEP17]. Data drift is a significant change in the underlying data distribution that typically requires that systems be updated and models retrained. Our data-invariant-based approach here is simple: given two datasets D and D′, we first compute data invariants for D, and then evaluate the invariants on D′. If D′ satisfies the invariants, then we have no evidence of drift. In contrast, if D′ violates the invariants, then that serves as an indication of drift, and the degree of violation measures how much D′ has drifted from D.
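To illustrate this recipe on the flight example, the following is a minimal sketch. It is our own simplification: we hand-pick a single linear projection, use mean/std-based bounds with a cutoff of 4 standard deviations, and use 1 − e^(−z) as a monotone normalization; the full approach of Section 4 derives the projections automatically.

```python
import numpy as np

def learn_bounds(D, f, C=4.0):
    """Learn robust bounds [lb, ub] for projection f over reference data D."""
    vals = np.array([f(t) for t in D])
    mu, sigma = vals.mean(), vals.std()
    return mu - C * sigma, mu + C * sigma, sigma

def violation(t, f, lb, ub, sigma):
    """Degree of violation of lb <= f(t) <= ub; 0 means the tuple conforms."""
    gamma = 1.0 / sigma if sigma > 0 else 1e6      # scaling factor
    dist = max(0.0, lb - f(t), f(t) - ub)          # distance outside the bounds
    return 1.0 - np.exp(-gamma * dist)             # monotone map to [0, 1)

# Reference dataset D: daytime flights as (dep, arr, duration), all in minutes.
D = [(480, 600, 118), (540, 660, 122), (600, 720, 120)]
f = lambda t: t[1] - t[0] - t[2]                   # arr - dep - duration

lb, ub, sigma = learn_bounds(D, f)
print(max(violation(t, f, lb, ub, sigma) for t in D))   # 0.0: no drift on D
overnight = (1380, 60, 120)                        # arrival "before" departure
print(violation(overnight, f, lb, ub, sigma))      # close to 1: strong violation
```

Evaluating the invariant learned on D against a batch of new tuples, and aggregating the scores, yields the drift measure described above.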
While we focus on these two applications in this paper, we point to a few other applications of data invariants in Section 7.
3 Data Invariants
In this section, we define data invariants, a new data-profiling primitive that allows us to capture complex dependencies across numerical attributes. Intuitively, a data invariant is an implicit constraint that the data satisfies. We first provide the general definition of data invariants, and then propose a language for representing them. We then define quantitative semantics over data invariants, which allows us to quantify their violation.
Basic notations. We use R(A_1, …, A_m) to denote a relation schema, where A_j denotes the j-th attribute of R. We use Dom_j to denote the domain of attribute A_j. Then the set Dom = Dom_1 × ⋯ × Dom_m specifies the domain of all possible tuples. We use t ∈ Dom to denote a tuple in the schema R. A dataset D ⊆ Dom is a specific instance of the schema R. For ease of notation, we assume some order of tuples in D, and we use t_i to refer to the i-th tuple and t_i.A_j to denote the value of the j-th attribute of t_i.
We start with a strict definition of data invariants, and then explain how it generalizes to account for noise in the data.
Definition 1 (Data invariant (strict)).
A data invariant for a dataset D ⊆ Dom is another dataset I ⊆ Dom s.t. D ⊆ I.

Intuitively, a data invariant specifies a set of allowable tuples (I). The tightest invariant for D is D itself, whereas the loosest invariant is Dom. We represent an invariant I using its characteristic function (or formula) φ. By definition, t ∈ I if and only if φ(t) is true. A characteristic function φ is an invariant for D if φ(t) holds for every t ∈ D. We write t ⊨ φ and t ⊭ φ to denote that t satisfies and does not satisfy the invariant φ, respectively. In this paper, we do not distinguish between the invariant I and its representation φ, and use them interchangeably.
In practice, because of noise, some tuples in D may not satisfy an invariant, i.e., there may exist t ∈ D s.t. t ⊭ φ. To account for noise, we relax the definition of invariants as follows.
Definition 2 (Data invariant (relaxed)).
A relaxed data invariant for a dataset D ⊆ Dom is another dataset I ⊆ Dom s.t. D ∖ Δ ⊆ I, for some small subset Δ ⊆ D.

The set Δ denotes atypical points in D that do not satisfy the invariant (and thus are not in I). In our work, we do not need to know the set Δ, nor do we need to purge the atypical points from the dataset. Our techniques derive invariants in ways that ensure a small Δ (Section 4). In this paper, when we talk about data invariants, we refer to relaxed data invariants, unless otherwise specified.
3.1 A Language for Data Invariants
Projection. A central concept in our language for data invariants is that of a projection. A projection is a function F that maps a tuple t ∈ Dom to a real number F(t). We extend a projection F to a dataset D by defining F(D) to be the sequence of reals obtained by applying F on each tuple in D.
Our language for data invariants consists of formulas φ generated by the grammar:

  φ_S := lb ≤ F(t) ≤ ub | φ_S ∧ φ_S
  φ_C := (A = c) ⇒ φ_S | φ_C ∨ φ_C | φ_C ∧ φ_C
  φ := φ_S | φ_C

The language consists of (1) bounded-projection constraints lb ≤ F(t) ≤ ub, where F is a projection on R, t is the tuple of formal parameters (t.A_1, …, t.A_m), and lb and ub are reals; (2) equality constraints A = c, where A is an attribute and c is a constant in A’s domain; and (3) operators (⇒, ∧, and ∨) that connect the constraints. Intuitively, ⇒ is a switch operator that specifies which invariant applies based on the value of the attribute A, ∧ denotes conjunction, and ∨ denotes disjunction.

Invariant formulas generated by φ_S are called simple invariants and those generated by φ_C are called compound invariants. Note that a formula generated by φ_C only allows equality constraints on a single attribute, namely A, among all the disjuncts.
3.2 Quantitative Semantics of Data Invariants
Data invariants have a natural Boolean semantics: a tuple either satisfies an invariant or it does not. However, Boolean semantics is of limited use in practical applications, because it does not quantify the degree of invariant violation. We interpret data invariants using a quantitative semantics, which quantifies violations. Quantitative semantics has the additional benefit that it reacts to noise more gracefully than Boolean semantics.
Given a formula φ, the quantitative semantics ⟦φ⟧(t) is a measure of the violation of φ on a tuple t—with a value of 0 indicating no violation and a value greater than 0 indicating violation. If t ⊨ φ (in Boolean semantics), then ⟦φ⟧(t) will be 0. Formally, ⟦φ⟧ is a mapping from Dom to [0, 1].

Quantitative semantics of simple invariants. The quantitative semantics of simple invariants is defined as:

  ⟦lb ≤ F(t) ≤ ub⟧(t) = η(γ · max(0, lb − F(t), F(t) − ub))
  ⟦φ₁ ∧ φ₂⟧(t) = (w₁ · ⟦φ₁⟧(t) + w₂ · ⟦φ₂⟧(t)) / (w₁ + w₂)
The quantitative semantics uses the following parameters:
Scaling factor γ.

Projections are unconstrained functions and different projections can map the same tuple to vastly different values. We use a scaling factor γ to standardize the values computed by a projection F, and to bring the values of different projections to the same comparable scale. The scaling factor is automatically computed as the inverse of the standard deviation: γ = 1/σ(F(D)). (We set γ to a large positive number when σ(F(D)) = 0.)
Normalization function η.

The normalization function η maps values in the range [0, ∞) to the range [0, 1); any monotone mapping from [0, ∞) to [0, 1) can be used.
Importance factors w.

The weights w control the contribution of each bounded-projection invariant in a conjunctive formula. This allows for prioritizing invariants that may be more significant than others within the context of a particular application. In our work, we derive the importance factor of an invariant automatically, based on its projection’s standard deviation over D.
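Putting the three parameters together, one plausible instantiation of these semantics looks as follows (the concrete normalization η(z) = 1 − e^(−z) and the weighted-average aggregation of conjuncts are our own illustrative choices):

```python
import math

def bounded_violation(x, lb, ub, sigma):
    """Violation of lb <= x <= ub with gamma = 1/sigma and eta(z) = 1 - e^-z."""
    gamma = 1.0 / sigma if sigma > 0 else 1e6
    dist = max(0.0, lb - x, x - ub)          # how far x falls outside the bounds
    return 1.0 - math.exp(-gamma * dist)     # monotone map from [0, inf) to [0, 1)

def conjunction_violation(violations, weights):
    """Importance-weighted average of the per-conjunct violations."""
    return sum(w * v for w, v in zip(weights, violations)) / sum(weights)

# A strict conjunct (low sigma, high weight) and a loose one (high sigma, low weight):
v_strict = bounded_violation(15.0, -4.0, 4.0, sigma=1.0)     # violated by 11 sigma
v_loose = bounded_violation(15.0, -40.0, 40.0, sigma=10.0)   # within bounds: 0.0
score = conjunction_violation([v_strict, v_loose], weights=[0.9, 0.1])
print(round(score, 2))   # dominated by the heavily weighted strict conjunct
```

Note how violating the strict conjunct drives the aggregate score close to its weight, while the loose conjunct contributes nothing.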
Quantitative semantics of compound invariants. Compound invariants are first simplified into simple invariants, and they get their meaning from the simplified form. We define a function simp that takes a compound invariant and a tuple and returns a simple invariant. It is defined recursively: for a disjunction of the form (A = c₁ ⇒ φ₁) ∨ ⋯ ∨ (A = c_l ⇒ φ_l), simp returns the simple invariant φ_j for which t.A = c_j; conjunctions are simplified conjunct-wise.

If the condition t.A = c_j does not hold for any j, then simp(φ, t) is undefined and the Boolean semantics of φ on t is also undefined. If simp(φ, t) is undefined, then ⟦φ⟧(t) = 1. When simp(φ, t) is defined, the quantitative semantics of φ is given by: ⟦φ⟧(t) = ⟦simp(φ, t)⟧(t).
Since compound invariants simplify to simple invariants, we mostly focus on simple invariants. Even there, we pay special attention to bounded-projection invariants of the form lb ≤ F(t) ≤ ub, which lie at the core of simple invariants.
4 Synthesizing Data Invariants
In this section, we describe our techniques for deriving data invariants. We first focus on the synthesis of simple invariants, followed by compound invariants. Finally, we analyze the time and memory complexity of our algorithms.
4.1 Simple Invariants
We now describe how we discover simple invariants for a given dataset. We start by discussing how we synthesize bounds for a given projection. We then describe a principle for identifying effective projections. We establish that a strong data invariant for a dataset is made from projections that (1) do not have large correlations among each other and (2) have small standard deviations on that dataset. Finally, we provide a constructive procedure—based on principal component analysis—to pick the appropriate projections to use in a simple invariant, along with their importance factors. By putting all these pieces together, we get a procedure for synthesizing simple invariants.
4.1.1 Synthesizing Bounds for Projections
Fix a projection F and consider a bounded-projection invariant of the form lb ≤ F(t) ≤ ub. Given a dataset D, a trivial way to compute bounds is: lb = min(F(D)) and ub = max(F(D)). However, this choice is very sensitive to noise: adding a single “atypical” tuple to D can produce very different invariants. Hence, we instead use the following more robust choices: lb = μ(F(D)) − C · σ(F(D)) and ub = μ(F(D)) + C · σ(F(D)).

Here, μ(F(D)) and σ(F(D)) denote the mean and standard deviation of the values in F(D), respectively, and C is some positive constant. With these bounds, lb ≤ F(t) ≤ ub implies that F(t) is within C · σ(F(D)) from the mean μ(F(D)). In our experiments, we set C = 4, which ensures that lb ≤ F(t) ≤ ub holds for the vast majority of tuples under many distributions of the values in F(D). Specifically, if F(D) follows a normal distribution, 99.99% of the population is expected to lie within 4 standard deviations from the mean.
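A quick numeric check of this choice of bounds (assuming C = 4 and normally distributed projection values for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
vals = rng.normal(loc=50.0, scale=5.0, size=100_000)   # F(D) for some projection

C = 4.0
mu, sigma = vals.mean(), vals.std()
lb, ub = mu - C * sigma, mu + C * sigma                # robust bounds

inside = np.mean((vals >= lb) & (vals <= ub))
print(f"fraction within mean +/- {C} std: {inside:.4f}")   # ~0.9999
```

For heavier-tailed distributions the coverage is lower, but Chebyshev’s inequality still guarantees that at least 93.75% of any distribution lies within 4 standard deviations of its mean.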
Setting the bounds lb and ub as C · σ(F(D))-away from the mean, and the scaling factor γ = 1/σ(F(D)), guarantees the following property for our quantitative semantics:

Let D be a dataset and let φ be the invariant lb ≤ F(t) ≤ ub synthesized for a projection F as above. Then, for any tuples t₁ and t₂, if |F(t₁) − μ(F(D))| ≥ |F(t₂) − μ(F(D))|, then ⟦φ⟧(t₁) ≥ ⟦φ⟧(t₂).

Plainly, this means that larger deviation from an invariant (proportionally to the standard deviation) results in a higher degree of violation under our semantics.

The proof follows from the fact that the normalization function η is monotonically increasing, and hence, ⟦φ⟧(t) is a monotonically non-decreasing function of |F(t) − μ(F(D))|.
4.1.2 Principle for Synthesizing Projections
To understand how to derive the right projections, we need to first understand what makes an invariant more effective than others in a particular task. Primarily, an effective invariant: (1) should not overfit the data, but rather generalize by capturing the properties of the data, and (2) should not underfit the data, because it would be too permissive and fail to identify deviations effectively. Our flexible bounds (Section 4.1.1) serve to avoid overfitting. In this section, we focus on identifying the principles that help us avoid underfitting.
An effective invariant should help identify deviating tuples. To analyze what makes an invariant more effective than another, we formalize two terms: (1) the strength of invariants as it corresponds to the degree of violation, and (2) incongruous tuples, which are those tuples that deviate from the relative trend of two invariants.
Stronger. An invariant φ₁ is stronger than another invariant φ₂ on a subset of tuples if ⟦φ₁⟧(t) ≥ ⟦φ₂⟧(t) for every tuple t in that subset.
Incongruous. For a dataset D and a projection F, let μ_F denote the mean of F(D). For projections F₁ and F₂, the correlation coefficient ρ is defined as the Pearson correlation of F₁(D) and F₂(D). Informally, an incongruous tuple for F₁ and F₂ is one that does not follow the general trend of correlation between F₁ and F₂; e.g., if F₁ and F₂ are positively correlated (i.e., ρ > 0), an incongruous tuple deviates in opposite ways from the mean of each projection. More formally, a tuple t is incongruous w.r.t. a projection pair (F₁, F₂) on D if: ρ · (F₁(t) − μ_{F₁}) · (F₂(t) − μ_{F₂}) < 0.
Let be a dataset with two attributes and . The projections and are positively correlated (); hence, the tuples and are both incongruous, whereas and are not incongruous w.r.t. .
We proceed to show that when two projections are highly correlated, their linear combination leads to a projection with lower standard deviation and a stronger invariant. We will then generalize this result to multiple projections in Theorem 3. This provides the key insight of this analysis, which is that projections with low standard deviation define stronger invariants (and are thus preferable), and that an invariant with multiple highly-correlated projections is suboptimal (as highly-correlated projections can be linearly combined into one with lower standard deviation). We write φ_F to denote the invariant lb ≤ F(t) ≤ ub synthesized from the projection F. (Proofs are in the Appendix.)
Let D be a dataset and F₁, F₂ be two projections on D s.t. their correlation coefficient ρ ≠ 0. Then, there exist reals α and β s.t. for the new projection F = α F₁ + β F₂:

σ(F(D)) ≤ σ(F₁(D)) and σ(F(D)) ≤ σ(F₂(D)), and

φ_F is stronger than both φ_{F₁} and φ_{F₂} on the set of tuples that are incongruous w.r.t. (F₁, F₂).
We can now use this lemma in an inductive argument to generalize the result to multiple projections.
Theorem 3 (Low Standard Deviation Invariants).

Given a dataset D, let {F₁, …, F_K} denote a set of projections on D s.t. at least one pair has nonzero correlation. Then, there exist a nonempty subset S ⊆ {1, …, K} and a projection F that is a linear combination of the projections in S, s.t.

for every k ∈ S, φ_F is stronger than φ_{F_k} on the subset of tuples that are incongruous w.r.t. the combined projection pairs.

The theorem establishes that, to detect violations for certain tuples (the incongruous ones), (1) projections with low standard deviation are better and (2) an invariant with multiple highly-correlated projections may be suboptimal. Note that the set of incongruous tuples is a conservative estimate for the set of tuples where φ_F is stronger than each φ_{F_k}; there exist tuples outside of it for which φ_F is stronger.
Consider and projections and . On , both projections have the same mean and standard deviation . The correlation coefficient is since on . We derive a new projection , and note that and hence, . Furthermore, is stronger than and on all tuples s.t. , i.e., .
Note that there can be tuples outside of for which is not stronger than and . For example, for the tuple , and . Hence, the invariant is satisfied (violation score is ). However, the invariants and are not satisfied (violation score ). The intuition is that falls outside the observed trends for and ( and ), but it is still within the combined trend (), which better generalized the observed data in .
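The following sketch illustrates this numerically on synthetic data of our own construction: two highly correlated projections are combined into one with much lower standard deviation, which flags an incongruous tuple that each original projection, taken in isolation, would consider unremarkable.

```python
import numpy as np

rng = np.random.default_rng(1)
base = rng.normal(size=1000)
f1 = base + 0.1 * rng.normal(size=1000)   # F1(D): std ~1
f2 = base + 0.1 * rng.normal(size=1000)   # F2(D): std ~1, highly correlated with F1

f = f1 - f2                               # combined projection F = F1 - F2
print(np.std(f1), np.std(f2), np.std(f))  # std(F(D)) ~0.14, far below std(F1(D))

# An incongruous tuple deviates in opposite directions from the two means,
# against the positive correlation of F1 and F2:
t_f1, t_f2 = 1.5, -1.5        # only ~1.5 std away under F1 and F2 individually
t_f = t_f1 - t_f2             # = 3.0: over 10 std away under the combined F
print(abs(t_f - f.mean()) / np.std(f))
```

Neither bound on F1 nor on F2 alone would flag this tuple, but the invariant built from the low-variance combination F detects it strongly.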
4.1.3 PCA-inspired Projection Derivation
Theorem 3 sets the requirements for good projections: prefer projections with small standard deviation because they are more sensitive to change [tveten2019principal, tveten2019tailored, DBLP:journals/tnn/KunchevaF14], and avoid highly correlated projections. We now present Algorithm 1, inspired by principal component analysis (PCA), for generating linear projections over a dataset that meet these requirements:
- Step 1: Drop all non-numerical attributes from D to obtain the numerical dataset D_N.
- Step 2: Add a new column to D_N that consists of the constant 1, forming D₀, where the added column is the vector with 1 everywhere.
- Step 3: Compute the eigenvectors of the square matrix D₀ᵀD₀, which has size (m+1) × (m+1), where m denotes the number of columns of D_N.
- Steps 4–5: Remove the first element of each eigenvector and normalize the remainder to generate the projections (the resulting coefficient vectors have 2-norm 1).
- Step 6: Compute an importance factor for each projection.
- Step 7: Return the linear projections with corresponding normalized importance factors.
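A sketch of these steps in NumPy (our own rendering of Algorithm 1; the inverse-standard-deviation weighting is one plausible choice for the importance factors, and we mean-center the attributes in the example purely for numerical robustness):

```python
import numpy as np

def pca_projections(D_num):
    """Derive linear projections and importance factors from numerical data.

    D_num: (n x m) array. Returns (projections, weights), where each projection
    is a pair (coeffs, offset) representing F(t) = coeffs @ t + offset.
    """
    n, m = D_num.shape
    D0 = np.hstack([np.ones((n, 1)), D_num])   # add a constant-1 column
    _, eigvecs = np.linalg.eigh(D0.T @ D0)     # eigenvectors of (m+1)x(m+1) matrix
    projections, weights = [], []
    for k in range(m + 1):
        v = eigvecs[:, k]
        offset, coeffs = v[0], v[1:]
        norm = np.linalg.norm(coeffs)          # drop first element, normalize rest
        if norm < 1e-12:
            continue                           # skip a (near-)constant direction
        coeffs, offset = coeffs / norm, offset / norm
        std = (D_num @ coeffs + offset).std()
        projections.append((coeffs, offset))
        weights.append(1.0 / std if std > 0 else 1e6)
    total = sum(weights)
    return projections, [w / total for w in weights]

# Tiny example: arrival ~= departure + duration (minutes), plus small noise.
rng = np.random.default_rng(0)
dep = rng.uniform(300, 900, size=200)
dur = rng.uniform(60, 180, size=200)
arr = dep + dur + rng.normal(0.0, 1.0, size=200)
D = np.column_stack([dep, arr, dur])
D = D - D.mean(axis=0)        # mean-center attributes (our own simplification)

projs, ws = pca_projections(D)
best = min(range(len(projs)), key=lambda k: (D @ projs[k][0] + projs[k][1]).std())
print(projs[best][0])   # close to a multiple of (1, -1, 1): dep - arr + dur
```

The lowest-variance projection recovers the hidden linear relationship among the three attributes, and its bounds (Section 4.1.1) yield the strongest conjunct of the resulting simple invariant.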
We now claim that the projections returned by Algorithm 1 include the projection with minimum standard deviation.
Theorem 4 (Correctness of Algorithm 1).

Given a numerical dataset D, let P be the set of linear projections returned by Algorithm 1, and let F* be the projection in P with minimum standard deviation on D. Then,

σ(F*(D)) ≤ σ(F(D)) for every possible (unit-norm) linear projection F, and

for every pair of distinct projections F₁, F₂ ∈ P, their correlation coefficient on D is 0.
If Algorithm 1 returns projections F₁, …, F_K and importance factors w₁, …, w_K, then we generate the simple (conjunctive) invariant with K conjuncts: lb_k ≤ F_k(t) ≤ ub_k, where the bounds lb_k and ub_k are computed as described in Section 4.1.1; we use the importance factor w_k for the k-th conjunct in the quantitative semantics.
Note that the lowest-variance principal component of D is close to the ordinary least squares estimate for predicting one attribute from the others. However, PCA offers multiple projections at once that range from low to high variance and have low mutual correlation. For robustness, rather than discarding high-variance projections, we assign them very small importance factors.
4.2 Compound Invariants
The quality of our PCA-based linear invariants (simple invariants) depends on how many low-variance linear projections we are able to find on the given dataset. For many datasets, we may find very few such linear projections, or even none. In these cases, it is fruitful to search for compound invariants; we first focus on disjunctive invariants (the disjunctive form in our language grammar).
Our PCA-based approach fails in cases where there exist different piecewise linear trends within the data. If we apply PCA to learn invariants on the entire dataset of Figure 1(a), it will learn a low-quality invariant, with very high variance. In contrast, partitioning the dataset into three partitions (Figure 1(b)), and then learning invariants separately on each partition, will result in significant improvement of the learned invariants.
A disjunctive invariant is a compound invariant of the form , where each is not necessarily an invariant for all of , but for a specific partition of . Finding disjunctive invariants involves partitioning the dataset into smaller (disjoint) datasets , where each has the same attributes as but only a subset of the rows of .
Our strategy for partitioning is to use categorical attributes with a small domain in D; in our implementation, we use attributes whose number of distinct values is below a small threshold. If A is such an attribute with values a₁, …, a_L, we partition D into disjoint datasets D₁, …, D_L, where D_j contains exactly the tuples of D with A = a_j. Let φ₁, …, φ_L be the simple invariants we learn for D₁, …, D_L using Algorithm 1, respectively. We compute the following disjunctive invariant for D: (A = a₁ ⇒ φ₁) ∨ ⋯ ∨ (A = a_L ⇒ φ_L).
Under closed-world semantics (i.e., always takes one of the values ) and Boolean violations, we can express this disjunctive invariant using notation from traditional denial constraints [DBLP:journals/pvldb/ChuIP13]:
Note, however, that linear arithmetic inequalities are disallowed in denial constraints, which only allow atomic constraints involving one or two attributes (with no arithmetic). Our key contribution is discovering simple linear invariants involving multiple numerical attributes. Also note that under an open-world assumption, compound invariants are more conservative than denial constraints. For example, a new tuple whose value for the partitioning attribute was never observed in D will satisfy the denial constraint but not the compound invariant.
We repeat this process to partition $D$ over multiple categorical attributes and generate a compound disjunctive invariant for each of them. Finally, we generate a compound conjunctive invariant, the conjunction of all these compound disjunctive invariants, as the final data invariant for $D$.
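The partitioning strategy above can be sketched as follows. This is a minimal illustration of the idea, not DISynth's actual API: the function names and the variance threshold are our own, and we use PCA's low-variance principal components as the simple invariants of each partition.

```python
import numpy as np

def low_variance_projections(X, var_threshold=0.1):
    """Return (direction, mean, std) triples for the principal components
    of X whose variance falls below var_threshold (simple invariants)."""
    Xc = X - X.mean(axis=0)
    # eigh returns eigenvalues in ascending order, so the low-variance
    # (most invariant-like) components come first.
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    invariants = []
    for lam, v in zip(eigvals, eigvecs.T):
        if lam < var_threshold:
            proj = X @ v
            invariants.append((v, proj.mean(), proj.std()))
    return invariants

def disjunctive_invariant(X, cat_col, var_threshold=0.1):
    """Partition the rows of X by the values of a small-domain categorical
    column and learn simple invariants separately on each partition."""
    partitions = {}
    for val in np.unique(cat_col):
        mask = cat_col == val
        partitions[val] = low_variance_projections(X[mask], var_threshold)
    return partitions
```

On data with two opposing linear trends (as in Figure 1), the global call finds no low-variance projection, while per-partition learning recovers one tight invariant for each trend.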
4.3 Complexity Analysis
When computing simple invariants, there are two main computational steps: (1) computing $X^T X$, where $X$ is an $n \times m$ matrix with $n$ tuples (rows) and $m$ attributes (columns), which takes $O(n m^2)$ time, and (2) computing the eigenvalues and eigenvectors of an $m \times m$ positive definite matrix, which has complexity $O(m^3)$ [DBLP:conf/stoc/PanC99]. Once we obtain the linear projections using the above two steps, we need to compute the mean and variance of these projections on the original dataset, which takes time linear in $n$. In summary, the overall procedure is cubic in the number of attributes and linear in the number of tuples.
When computing disjunctive invariants, we greedily pick attributes that take at most a small number of distinct values, and then run the above procedure for simple invariants once per partition. This adds only a constant-factor overhead per attribute.
The procedure can be implemented in $O(m^2)$ space. The key observation is that $X^T X$ can be computed as $\sum_i t_i t_i^T$, where the $t_i$'s are the tuples (rows) of the dataset. Thus, $X^T X$ can be computed incrementally by loading only one tuple at a time into memory, computing $t_i t_i^T$, and adding it to a running sum, which can be stored in $O(m^2)$ space. Instead of such an incremental computation, this can also be done in an embarrassingly parallel way, where we partition the data (row-wise) and process each partition in parallel. Due to the low time and memory complexity, our approach scales gracefully to large datasets.
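The $O(m^2)$-space accumulation can be sketched as follows; this is a minimal illustration in our own notation, not the DISynth implementation. The iterable of tuples can come from disk or a stream, so the full dataset never needs to reside in memory.

```python
import numpy as np

def streaming_gram(tuples, m):
    """Accumulate X^T X one tuple at a time using O(m^2) memory.
    `tuples` is any iterable of length-m rows."""
    G = np.zeros((m, m))
    for t in tuples:
        t = np.asarray(t, dtype=float)
        G += np.outer(t, t)  # rank-1 update: t t^T
    return G
```

For the parallel variant, each worker computes `streaming_gram` over its row partition and the partial sums are simply added.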
5 Trusted Machine Learning
In this section, we investigate the use of data invariants in trusted machine learning (TML). In particular, we undertake a theoretical analysis by formalizing, in an ideal (noise-free) setting, the notion of “non-conformance” of a new tuple $t$ w.r.t. an existing dataset $D$. We show that data invariants provide a sound and complete check for non-conformance, which justifies using them to achieve trusted machine learning. Since we perform our theoretical study in a noise-free setting, we use the strict notion of data invariants throughout this section.
In trusted machine learning, we are interested in determining whether we can confidently use a prediction made by some machine-learned model on a new tuple. Since data can only provide an incomplete specification for most tasks, there is no certainty in predictions made using models learned from data, but some predictions are nevertheless more trustworthy than others. We now formalize the notion of non-conforming
tuples, on which a machine-learned model produces an untrustworthy prediction. We focus on the setting of supervised machine learning, but our problem definition and solution approach naturally generalize to the setting of unsupervised learning as well.
5.1 Non-conforming Tuples
Consider the task of predicting an output $y \in \mathcal{Y}$ for a tuple $t$, where $\mathcal{Y}$ is the output domain. Let $f^*$ be a function in $\mathcal{F}$ that represents the ground truth, i.e., the correct output for $t$ is $y = f^*(t)$, where $\mathcal{F}$ denotes some class of functions from tuples to $\mathcal{Y}$. Suppose that a machine learning procedure learns a function $f \in \mathcal{F}$, using $D$ as the training dataset and $Y = f^*(D)$ as the training output. (With slight abuse of notation, we write $Y = f^*(D)$ to denote that $y_i = f^*(t_i)$ for all $i$, where $y_i$ is the label for tuple $t_i$.) So, for tuple $t$, the learned model predicts $f(t)$ as output.
Assuming the dataset $D$ is an $n \times m$ matrix—containing $n$ tuples, each with $m$ attributes—we use $(D, Y)$ to denote the annotated dataset: the $n \times (m+1)$ matrix obtained by appending $Y$ as a new column to $D$. We denote the $i^{\text{th}}$ row of an annotated dataset by $(t_i, y_i)$.
Definition 3 (Non-conforming tuple).
Let $\mathcal{F}$ be a collection of functions mapping tuples to the output domain $\mathcal{Y}$, and let $(D, Y)$ be an annotated dataset. A tuple $t$ is a non-conforming tuple w.r.t. $\mathcal{F}$ and $(D, Y)$ if there exist $f, g \in \mathcal{F}$ s.t. $f(D) = g(D) = Y$ but $f(t) \neq g(t)$.
Intuitively, a tuple $t$ is non-conforming if it is possible to learn two different functions from the training data such that they both agree on all tuples in the training data, but disagree on $t$. This means that we cannot be sure whether $f$ or $g$ is the ground truth, because both are consistent with the observations $Y$ generated by the (unknown) ground truth on $D$. This would not have been a problem if both $f$ and $g$ predicted the same output for $t$. Therefore, when that is not the case, we classify $t$ as a non-conforming tuple, and argue that we should be cautious about the prediction on $t$ made by any model learned from $(D, Y)$.
If $t$ is a non-conforming tuple w.r.t. $\mathcal{F}$ and $(D, Y)$, then for any $f \in \mathcal{F}$ s.t. $f(D) = Y$, there exists a $g \in \mathcal{F}$ s.t. $g(D) = Y$ but $g(t) \neq f(t)$.
Note that even when we mark $t$ as non-conforming, it is possible that the learned model makes the correct prediction on $t$. However, it is still useful to be aware of non-conforming tuples, because a priori we do not know whether the learned model actually matches the ground truth on tuples that fall outside of $D$.
The key point, when deciding whether a tuple is non-conforming, is that we have access to the class $\mathcal{F}$ of functions (over which the learning procedure searches for a model) and the annotated dataset $(D, Y)$, but not to the actual learned model $f$ or the ground-truth function $f^*$ that generated $Y$ from $D$. Hence, the computational procedure for detecting whether a tuple is non-conforming can only use knowledge of the class $\mathcal{F}$, $D$, and $Y$.
Let $D$ be a dataset with two attributes, and let $\mathcal{F}$ be the class of linear functions over two variables. Consider a new tuple $t$ on which two different functions in $\mathcal{F}$—both of which agree with the outputs $Y$ on all of $D$—disagree; such a $t$ is non-conforming. In contrast, a tuple on which every function in $\mathcal{F}$ that fits $(D, Y)$ produces the same output is not non-conforming.
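The definition can be made concrete with hypothetical values of our own (for illustration only): two linear models that fit the same training data exactly, yet disagree on a new tuple.

```python
# Hypothetical training data: on every training tuple the two
# attributes happen to be equal.
D = [(1.0, 1.0), (2.0, 2.0)]
Y = [2.0, 4.0]

# Two distinct linear models that both fit (D, Y) exactly.
f = lambda x, y: x + y
g = lambda x, y: 2 * x

def is_non_conforming(t, models, D, Y):
    """t is non-conforming if at least two models fit (D, Y) exactly
    yet predict different outputs for t."""
    fitting = [h for h in models if all(h(*d) == y for d, y in zip(D, Y))]
    return len({h(*t) for h in fitting}) > 1

assert is_non_conforming((3.0, 5.0), [f, g], D, Y)      # f gives 8, g gives 6
assert not is_non_conforming((3.0, 3.0), [f, g], D, Y)  # both give 6
```

Tuples satisfying the implicit invariant of the training data (here, equal attributes) receive a single consistent prediction; tuples violating it do not.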
We start by providing some intuitions behind the use of data invariants in characterizing non-conforming tuples and then proceed to discuss the data invariant-based approach.
5.2 Data Invariants as Preconditions for TML
Let $\mathcal{F}$ be a fixed class of functions. Given a dataset $D$, suppose that a tuple $t$ is non-conforming. This means that there exist $f, g \in \mathcal{F}$ s.t. $f(D) = g(D)$, but $f(t) \neq g(t)$. Now, consider the logical claim $f(D) = g(D)$. Clearly, $f$ is not identical to $g$, since $f(t) \neq g(t)$. Therefore, there is a nontrivial “proof” (in some logic) of the fact that “$f(t') = g(t')$ for all tuples $t' \in D$”. This “proof” will use some facts (properties) about $D$; let $\phi$ be the formula denoting these facts. If $\mathbb{1}_D$ is the characteristic function for $D$, then the above argument can be written as
$$\mathbb{1}_D(t) \;\Rightarrow\; \phi(t) \;\Rightarrow\; \big(f(t) = g(t)\big),$$
where $\Rightarrow$ denotes logical implication. In words, $\phi$ is a data invariant for $D$ and it serves as a precondition in the “correctness proof” showing that $f$ (a potentially machine-learned model) is equal to $g$ (potentially a ground truth). If a tuple $t$ fails to satisfy the precondition $\phi$, then it is possible that the prediction $f(t)$ will not match the ground truth $g(t)$.
5.3 Non-conforming Tuple Detection: A Data Invariant-based Approach
Given a class $\mathcal{F}$ of functions, an annotated dataset $(D, Y)$, and a tuple $t$, our high-level procedure for determining whether $t$ is non-conforming involves the following two steps:
Learn a data invariant $\phi$ for the dataset $D$.
Declare $t$ as non-conforming if $t$ does not satisfy $\phi$.
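The two-step procedure can be sketched as follows, representing a learned invariant by a low-variance linear projection together with its training mean and standard deviation. This is a minimal sketch: the function names and the tolerance $k$ are our own, not DISynth's API.

```python
import numpy as np

def learn_invariant(X):
    """Step 1: learn the lowest-variance linear projection of X,
    together with its training mean and standard deviation."""
    Xc = X - X.mean(axis=0)
    _, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    v = eigvecs[:, 0]              # direction of minimum variance
    proj = X @ v
    return v, proj.mean(), proj.std()

def non_conforming(t, inv, k=3.0):
    """Step 2: declare t non-conforming if its projection falls outside
    k standard deviations of the training mean (small slack for
    floating-point noise)."""
    v, mu, sigma = inv
    return abs(np.dot(t, v) - mu) > k * sigma + 1e-12
```

For training data satisfying an exact linear trend, any tuple off that trend is flagged, while tuples on it pass.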
This approach is sound and complete, thanks to the following proposition, which establishes the existence of a data invariant that characterizes whether a tuple is non-conforming.
There exists an invariant $\phi$ for $D$ s.t. the following statement is true: “$t$ violates $\phi$ iff $t$ is non-conforming w.r.t. $\mathcal{F}$ and $(D, Y)$, for all tuples $t$”.
Proposition 6 establishes the existence of an ideal invariant, but does not yield a constructive procedure. In practice, there is also the common issue that the ideal invariant may not have a simple representation. Nevertheless, Proposition 6 motivates finding invariants that approximate this ideal invariant.
5.4 Sufficient Check for Non-conformance
In practice, finding invariants that are both necessary and sufficient for non-conformance is difficult. Hence, we focus on weaker invariants whose violation is sufficient, but not necessary, to classify a tuple as non-conforming. Using such invariants in the high-level approach of Section 5.3 yields a procedure that has false negatives (it fails to detect some non-conforming tuples), but no false positives (it never classifies a tuple as non-conforming when it is not).
Model Transformation using Equality Invariants
For certain invariants, we can prove that an invariant violation by $t$ implies non-conformance of $t$, by showing that those invariants can transform a model that works on $(D, Y)$ into a different model that also works on $(D, Y)$ but disagrees with the original model on $t$. We claim that equality invariants (of the form $\pi(t) = b$, for a projection $\pi$) are useful in this regard. First, we make the point using the scenario from Example 5.
Consider the class of functions $\mathcal{F}$ and the annotated dataset $(D, Y)$ from Example 5. Two different functions $f, g \in \mathcal{F}$ can be equal when restricted to $D$; that is, $f(D) = g(D)$. The property of $D$ that suffices to prove $f(D) = g(D)$ is exactly an equality invariant that all tuples of $D$ satisfy. Going the other way, given such an equality invariant, we can transform the model $f$ into the model $g$ in such a way that $g$ continues to match the behavior of $f$ on $D$. Thus, an equality invariant can be exploited to produce multiple different models starting from one given model. Moreover, if a tuple $t$ violates the equality invariant, then the models $f$ and $g$ do not agree on their prediction on $t$.
Let $\pi(t) = b$ be an equality invariant for the dataset $D$. If a learned model $f$ returns a real number, then it can be transformed into another model, e.g., $g = f + (\pi - b)$, which agrees with $f$ only on tuples where $\pi(t) = b$. Thus, in the presence of equality invariants, a learner can return either $f$ or its transformed version (if both models are in the class $\mathcal{F}$). This condition is a “relevancy” condition, saying that the invariant is “relevant” for the class $\mathcal{F}$. If the model does not return a real number, we can still use equality invariants to modify the model under some assumptions, which include “relevancy” of the invariant.
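The transformation can be sanity-checked in a few lines, using a hypothetical equality invariant and model of our own (the additive form $g = f + (\pi - b)$ is one admissible choice):

```python
# Hypothetical equality invariant pi(t) = b: on the training tuples the
# projection x - y is always 0.
pi = lambda x, y: x - y
b = 0.0

f = lambda x, y: x + y                     # a model that fits the data
g = lambda x, y: f(x, y) + (pi(x, y) - b)  # transformed model g = f + (pi - b)

assert f(2.0, 2.0) == g(2.0, 2.0)  # invariant satisfied: models agree
assert f(3.0, 5.0) != g(3.0, 5.0)  # invariant violated: models disagree
```

Note that when $f$ and $\pi$ are both linear, $g$ is again linear, so the transformed model stays in a linear model class.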
A Theorem for Sufficient Check for Non-conformance
We first formalize the notions of nontrivial datasets—which are annotated datasets such that at least two output labels differ—and relevant invariants—which are invariants that can be used to transform models in a class to other models in the same class.
Nontrivial. An annotated dataset $(D, Y)$ is nontrivial if there exist $i, j$ s.t. $y_i \neq y_j$.
Relevant. An invariant $\phi$ is relevant to a class $\mathcal{F}$ of models if, whenever $f \in \mathcal{F}$, the corresponding if-then-else function, for a constant tuple $t^*$ and real number $c$, is also in $\mathcal{F}$. The if-then-else function returns $f(t)$ when $\phi(t)$ holds, returns $c$ when $t = t^*$, and is free to return anything otherwise. If tuples admit addition, subtraction, and scaling, then one such if-then-else function can be built from $f$, $\phi$, $t^*$, and $c$.
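For an equality invariant $\phi: \pi(t) = b$, one concrete construction consistent with the stated semantics is the following (our own reconstruction; the choice is not unique):

```python
def make_ite(f, pi, b, t_star, c):
    """If-then-else function for the equality invariant pi(t) = b:
    returns f(t) whenever pi(t) == b, returns c at t == t_star, and is
    unconstrained elsewhere. Requires pi(t_star) != b."""
    denom = pi(*t_star) - b
    assert denom != 0, "t_star must violate the equality invariant"
    return lambda *t: f(*t) + (c - f(*t_star)) * (pi(*t) - b) / denom

pi = lambda x, y: x - y        # hypothetical projection
f = lambda x, y: x + y         # hypothetical model in the class
ite = make_ite(f, pi, b=0.0, t_star=(3.0, 5.0), c=42.0)

assert ite(2.0, 2.0) == f(2.0, 2.0)  # invariant holds: behaves like f
assert ite(3.0, 5.0) == 42.0         # at t_star: returns c
```

Since the construction only adds a scaled copy of $\pi - b$ to $f$, it keeps linear models linear, which illustrates why linear arithmetic invariants are relevant to linear model classes.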
We now state a sufficient condition for identifying a tuple as non-conforming. (Proofs are in the Appendix.)
Theorem 7 (Sufficient Check for Non-conformance).
Let $(D, Y)$ be an annotated dataset, $\mathcal{F}$ be a class of functions, and $\pi$ be a projection s.t.
$\pi(t) = b$ is a strict invariant for $D$,
the invariant is relevant to $\mathcal{F}$,
$(D, Y)$ is nontrivial, and
there exists $f \in \mathcal{F}$ s.t. $f(D) = Y$.
For a tuple $t$, if $\pi(t) \neq b$, then $t$ is non-conforming.
We caution that our definition of non-conforming is liberal: the existence of even one pair of functions $f, g \in \mathcal{F}$—that differ on $t$ but agree on the training set $(D, Y)$—is sufficient to classify $t$ as non-conforming. It ignores issues related to the probability of a learning procedure finding these models. Our intended use of Theorem 7 is to guide the choice of the class of invariants, given the class of models, so that we can use violation of an invariant in that class as an indication for caution. For most classes of models, linear arithmetic invariants are relevant.
Consider the annotated dataset $(D, Y)$ and the class $\mathcal{F}$ of linear models from the earlier example, along with an equality invariant $\pi(t) = b$ defined by a projection $\pi$. Every tuple of $D$ satisfies it, so it is a strict invariant for $D$, and it is relevant to the class of linear models $\mathcal{F}$. Moreover, $(D, Y)$ is nontrivial, since its output labels are not all equal, and there exists an $f \in \mathcal{F}$ s.t. $f(D) = Y$. Now, for any tuple $t$ with $\pi(t) \neq b$, Theorem 7 implies that $t$ is a non-conforming tuple.
6 Experimental Evaluation
We evaluate data invariants over our two case-study applications (Section 2): TML and data drift. Our experiments target the following research questions:
How effective are data invariants for trusted machine learning? Is there a relationship between invariant violation score and the ML model’s prediction accuracy? (Section 6.1)
Can data invariants be used to quantify data drift? How do they compare to other state-of-the-art drift-detection techniques? (Section 6.2)
Can data invariants be used to explain the causes for tuple non-conformance? (Section 6.3)
Efficiency. In all our experiments, our algorithms for deriving data invariants were extremely fast, taking only a few seconds even for datasets with 6 million rows. The number of attributes was reasonably small (40), which is true for most practical applications. As our theoretical analysis showed (Section 4.3), our approach is linear in the number of data rows and cubic in the number of attributes. Since the runtime performance of our techniques is straightforward, we do not discuss efficiency further and instead focus this empirical analysis on the techniques’ effectiveness.
We provide an open-source implementation of data invariants and our method for synthesizing them, DISynth, in Python 3. Experiments were run on a Windows 10 machine (3.60 GHz processor and 16 GB RAM).
Airlines [airlineSource] contains data about flights and has 14 attributes, such as departure and arrival time, carrier, delay, etc. We used a subset of the data containing all flight information for year 2008. The training and test set contain 5.4M and 0.4M rows, respectively.
Human Activity Recognition (HAR) [sztyler2016onbody] is a real-world dataset about activities of 15 individuals, 8 males and 7 females, with varying fitness levels and BMIs. We use data from two sensors—accelerometer and gyroscope—attached to 6 body locations—head, shin, thigh, upper arm, waist, and chest. We consider 5 activities—lying, running, sitting, standing, and walking. The dataset contains 36 numerical attributes (2 sensors × 6 body locations × 3 coordinates) and 2 categorical attributes—activity type and person ID. We pre-processed the dataset to aggregate the measurements over a small time window, resulting in 10,000 tuples per person and activity, for a total of 750,000 tuples.
Extreme Verification Latency (EVL) [souzaSDM:2015] is a widely used benchmark for evaluating drift-detection algorithms in non-stationary environments under extreme verification latency. It contains 16 synthetic datasets with incremental and gradual concept drifts. The number of attributes varies from 2 to 6, and each dataset has one categorical attribute.
Datasets for non-conformance explanation case studies. We evaluate the effectiveness of data invariants in explaining tuple non-conformance through an intervention-centric explanation tool built on top of DISynth, called ExTuNe [DBLP:conf/sigmod/FarihaTRG20]. We use four datasets for this evaluation: (1) Cardiovascular Disease [cardioSource] is a real-world dataset that contains information about cardiovascular patients with attributes such as height, weight, cholesterol level, glucose level, systolic and diastolic blood pressures, etc. (2) Mobile Prices [mobilePriceSource] is a real-world dataset that contains information about mobile phones with attributes such as ram, battery power, talk time, etc. (3) House Prices [housePriceSource] is a real-world dataset that contains information about houses for sale with attributes such as basement area, number of bathrooms, year built, etc. (4) LED (Light Emitting Diode) [DBLP:journals/jmlr/BifetHKP10] is a synthetic benchmark. The dataset has a digit attribute, ranging from 0 to 9, 7 binary attributes—each representing one of the 7 LEDs relevant to the digit attribute—and 17 irrelevant binary attributes. This dataset includes gradual concept drift every 25,000 rows.
6.1 Trusted Machine Learning
We now demonstrate the applicability of DISynth to the trusted machine learning problem. We show that test tuples that violate the data invariants derived from the training data are non-conforming, and, therefore, a machine-learned model is more likely to perform poorly on them.
Airlines. For the airlines dataset, we design a regression task of predicting the arrival delay and train a linear regression model for the task. Our goal is to observe how the mean absolute error (MAE) of the predicted value correlates to the invariant violation for the test tuples. In other words, we want to observe whether DISynth can correctly detect the non-conforming tuples.
In a process analogous to the one described in Example 1, our training dataset (train) comprises daytime flights, i.e., flights whose arrival time is later than their departure time. We design three test sets: (1) Daytime: flights whose arrival time is later than their departure time (similar to train), (2) Overnight: flights whose arrival time is earlier than their departure time (the dataset does not explicitly report the date of arrival), and (3) Mixed: a mixture of Daytime and Overnight.
Figure 3 shows the average violations of invariants derived by DISynth and the mean absolute errors (MAE) computed from the values predicted by a linear regression model. We note that invariant violation is a good proxy for prediction error. The reason is that DISynth derives invariants, such as “arrival time is later than departure time and their difference is very close to the flight duration,” for train; the regression model makes the implicit assumption that these invariants always hold. Thus, when this assumption fails, the data invariant is violated and the regression performance also degrades.
To investigate further at tuple granularity, we sample 1,000 tuples from Mixed, compute their invariant violations, and show them in descending order of violation (Figure 4). Tuples on the left incur high violations, and predictions for them also incur high absolute errors. Note that although DISynth was unaware of the target attribute (delay), it still correctly predicts when a tuple is non-conforming and the prediction is potentially untrustworthy.
HAR. On the HAR dataset, we design a supervised classification task: identifying persons from their activity data. We construct train_x with data for sedentary activities (lying, standing, and sitting), and train_y with the corresponding person IDs. We learn data invariants on train_x, and train a logistic regression classifier using the annotated dataset. During testing, we mix mobile activities (walking and running) with held-out data for sedentary activities and observe how the classifier’s mean accuracy drop relates to average invariant violation. Figure 4(a) depicts our findings: classification degradation has a clear positive correlation with violation (PCC = 0.99, with p-value = 0).
6.2 Data Drift
We now present results of using DISynth as a drift-detection tool; specifically, for quantifying drift in data. Given a baseline dataset $D$ and a new dataset $D'$, drift is measured as the average violation of tuples in $D'$ w.r.t. the invariants learned for $D$.
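This drift measure can be sketched as follows; a minimal illustration in our own notation (the normalization and threshold are our choices, not DISynth's), using the low-variance PCA projections of the baseline as its invariants:

```python
import numpy as np

def drift_score(baseline, new, var_threshold=0.1):
    """Quantify how much `new` drifts from `baseline`: learn low-variance
    projections on the baseline, then average the standardized deviation
    of the new tuples from each projection's training mean."""
    Xc = baseline - baseline.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    scores = []
    for lam, v in zip(eigvals, eigvecs.T):
        if lam < var_threshold:
            proj = baseline @ v
            mu, sigma = proj.mean(), proj.std() + 1e-9
            scores.append(np.mean(np.abs(new @ v - mu)) / sigma)
    return float(np.mean(scores)) if scores else 0.0
```

New data that follows the baseline's trend scores near zero even if its raw values shift; data that breaks the trend scores much higher.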
HAR. We perform three drift-quantification experiments on the HAR dataset which we discuss next.
Gradual drift. To observe how DISynth detects gradual drift, we introduce drift in an organic way. The initial training dataset contains data of exactly one activity for each person; this is a realistic scenario, as one can think of it as a snapshot of what a group of people are doing within a particular, reasonably small, time window. We introduce gradual drift by altering the activity of one person at a time, controlling the amount of drift with a parameter $p$. When $p = 1$, the first person switches her activity; when $p = 2$, the second person also switches her activity; and so on. As we increase $p$ from $0$ to $15$, we expect a gradual increase in the drift magnitude compared to the initial training data; when $p = 15$, all persons have switched their activities, and we expect to observe maximum drift. We repeat this experiment multiple times and display the average invariant violation in Figure 4(b). We note that the drift magnitude (violation) indeed increases as more people alter their activities.
In contrast, the baseline weighted-PCA (W-PCA) method fails to detect this drift, because it does not model local invariants (who is doing what) and instead learns global invariants from the overall data. Thus, it fails to detect the gradual local drift, as the global situation—“a group of people are performing some activities”—does not change. DISynth, in contrast, learns disjunctive invariants that encode which person is performing which activity, and hence is capable of detecting drift when some individuals switch their activities.
Inter-person drift. The goal of this experiment is to observe drift among persons. We use DISynth to learn disjunctive invariants for each person over all activities, and use the violation w.r.t. the learned invariants to measure how much the other persons drift. Figure 7 illustrates our findings: the violation score at row p1 and column p2 denotes how much p2 drifts from p1. We use half of each person’s data to learn the invariants, and compute violations on the held-out data. When computing drift between two persons, we compute activity-wise invariant violation scores and then average them. As one would expect, we observe very low self-drift along the diagonal. Interestingly, our results also show that some people differ more from others, which appears to correlate with the (hidden ground truth) fitness and BMI values. This confirms that the invariants we learn for each person are an accurate abstraction of that person’s activities, as people do not deviate much from their usual activity patterns.
Inter-activity drift. Similar to inter-person invariant violation, we also compute inter-activity invariant violation. Figure 7 shows our findings. Note the asymmetry of violation scores between activities: e.g., running violates the invariants of standing much more than the other way around. A closer look reveals that all mobile activities violate the invariants of the sedentary activities more than the other way around. This is because the mobile activities act as a “safety envelope” for the sedentary activities: for example, while a person walks, she also stands (for brief moments), but the opposite does not happen.
EVL. We now compare DISynth against other state-of-the-art drift detection approaches on the EVL benchmark.
Baseline Approaches. In our experiments, we use two drift-detection approaches as baselines which we describe below:
(1) PCA-SPLL [DBLP:journals/tnn/KunchevaF14], similar to our approach, argues that principal components with lower variance are more sensitive to general drift, and uses them for dimensionality reduction. It then models a multivariate distribution over the reduced dimensions and applies semi-parametric log-likelihood (SPLL) to detect drift between two multivariate distributions. However, PCA-SPLL discards all high-variance principal components and does not model disjunctive invariants.
(2) CD (Change Detection) [DBLP:conf/kdd/QahtanAWZ15] is another PCA-based approach for drift detection in data streams, but, unlike PCA-SPLL, it ignores low-variance principal components. CD projects the data onto the top high-variance principal components, which results in multiple univariate distributions. We compare against two variants of CD: CD-Area, which uses the intersection area under the curves of two density functions as its divergence metric, and CD-MKL, which uses maximum KL-divergence as a symmetric divergence metric.
Figure 8 depicts how DISynth compares against CD-MKL, CD-Area, and PCA-SPLL on the 16 datasets of the EVL benchmark. For PCA-SPLL, we retain principal components that contribute to a cumulative explained variance below 25%. Beyond drift detection, which merely checks whether drift exceeds some threshold, we focus on drift quantification. A point in the plots denotes the drift magnitude of the dataset at a given time window w.r.t. the dataset at the first time window. Since different approaches report drift magnitudes on different scales, we normalize the drift values to $[0, 1]$. Additionally, since different datasets have different numbers of time windows, for ease of exposition we normalize the time-window indices. Below we state our key findings from this experiment:
DISynth’s drift quantification matches the ground truth. In all of the datasets in the EVL benchmark, DISynth correctly quantifies the drift, matching the ground truth exceptionally well (see the EVL video: sites.google.com/site/nonstationaryarchive/home). In contrast, since CD focuses on detecting the drift point, it is ill-equipped to precisely quantify the drift; in several cases (e.g., 2CHT), CD fails to distinguish the deviation in drift magnitudes, while both PCA-SPLL and DISynth quantify the drift correctly. Since CD retains only high-variance principal components, it is more susceptible to noise and treats noise in the dataset as significant drift, which leads to incorrect drift quantification; PCA-SPLL and DISynth ignore the noise and capture only the general notion of drift. In all of the EVL datasets, we found CD-Area to work better than CD-MKL, which agrees with the original authors’ experiments.
DISynth models local drift. When a dataset contains instances from multiple classes, drift may be local rather than global. Figure 9 demonstrates such a scenario for the 4CR dataset: if we ignore the color/shape of the tuples, we observe no significant drift across time steps. In such cases (4CR, 4CRE-V2, and FG-2C-2D), PCA-SPLL fails to detect drift. In contrast, DISynth learns disjunctive invariants and quantifies local drift accurately.
6.3 Explaining Non-conformance
When a test dataset is determined to be sufficiently different or drifted from the training set, the next step often is to characterize the difference. A common way of characterizing these differences is to perform a causality or responsibility analysis to determine which attributes are most responsible for the observed drift (non-conformance). We use the violation values produced by data invariants, along with well-established principles of causality, to quantify responsibility for non-conformance.
ExTuNe. We built a tool ExTuNe [DBLP:conf/sigmod/FarihaTRG20], on top of DISynth, to compute the responsibility values as described next. Given a training dataset and a non-conforming tuple , we measure the responsibility of the