Detecting model drift using polynomial relations

by   Eliran Roffe, et al.

Machine learning (ML) models serve critical functions, such as classifying loan applicants as good or bad risks. Each model is trained under the assumption that the data used in training and the data used in the field come from the same underlying unknown distribution. Often this assumption is broken in practice. It is desirable to identify when this occurs in order to minimize the impact on model performance. We suggest a new approach to detect change in the data distribution by identifying polynomial relations between the data features. We measure the strength of each identified relation using its R-squared value. A strong polynomial relation captures a significant trait of the data which should remain stable if the data distribution does not change. We thus use a set of learned strong polynomial relations to identify drift. For each polynomial relation that is stronger than a given desired threshold, we calculate the amount of drift observed for that relation. The amount of drift is estimated by calculating the Bayes Factor for the polynomial relation likelihood of the baseline data versus field data. We empirically validate the approach by simulating a range of changes in three publicly-available data sets, and demonstrate the ability to identify drift using the Bayes Factor of the polynomial relation likelihood change.




1 Introduction

Machine learning (ML) models may serve critical business functions, such as classifying loan applicants as good or bad risks, or classifying mammogram images as malignant or benign. It is generally assumed that the data used in training and the data used in the field come from the same underlying unknown distribution. Often this assumption is broken in practice, and in these cases the distribution change can cause the model’s performance to degrade or change in some way. It is therefore desirable to be able to identify when the data distribution has changed. In addition, it may be difficult to characterize the training data distribution in a way that is general but will still capture important aspects of the data, like complex relations between features.

The contribution of this paper is a new approach to detect change in the data distribution by identifying polynomial relations between the dataset features. A relation consists of an estimated equation relating, say, feature $X_1$ to $X_2$ and $X_3$ along with an estimation of the error, where $X_1$, $X_2$, and $X_3$ are features in the dataset; these can be constructed by linear regression, for example. Relations that are particularly strong (e.g., low error) on the baseline data can be said to characterize an aspect of that data. We then apply ML and statistical techniques (e.g., linear likelihood, Bayes Factor) over these relations to identify when they change significantly between the baseline and field data. In this case, if a relation holds on the baseline data, and no longer applies when the model is deployed, it indicates a change in the data distribution (and may indicate a model drift). Our contribution is an end-to-end way to identify the most relevant relations and then use these to identify distribution change.

In this work, we do not experiment with any example ML prediction models to verify that the model’s performance changes due to distribution drift, since this will depend on the specific model used. Rather, we simply assume that the existence of significant distribution drift between a baseline and field data set has the potential to degrade the model performance; we consider this assumption highly intuitive without need for explicit demonstration.

As a motivating example of distribution drift, consider a tabular dataset where each observation is a cell phone model, and features include the phone’s price, technical aspects, and physical measurements. In a dataset of phones from the pre-smartphone era, say there is a particularly strong correlation (i.e., a relation) between the phone screen size and battery life. Because smartphones and non-smartphones are fundamentally different in nature, the same relation may not hold between these features if the same data were collected today on a sample of mostly smartphones. We would like to identify this kind of distribution change in data set features without domain knowledge; an ML model trained on the pre-smartphone data set to, say, predict phone price based on other features would likely have poor predictive performance today. As another example, PHP and Javascript experience used to be important in predicting a programmer’s salary. In recent years, PHP has become less crucial, so its effect on salary has decreased (or ceased to exist).

Our paper proceeds as follows: In Section 2, we mention some related work in drift detection which motivates this one. In Section 3, we outline the methodology to construct polynomial feature relations and select the relevant subset of them, and to detect drift on a relation’s strength using Bayes Factors. In Section 4, we induce simulated drift of various degrees, and show that greater drift can be detected by higher values in the Bayes Factor in the strong relations we select. These simulations (Section 4.2) represent drift in the form of row permutations to disrupt inter-feature correlations, or increasing ‘unfairness’ in the form of changes in relations between features. The results suggest that our method should be able to detect arbitrary forms of distribution drift, and that larger degrees of distortion should be easier to detect through their effect on the relations’ Bayes Factors. In Section 5, we discuss some challenges we intend to address in future work. Section 6 concludes.

2 Related Work

Several works capture challenges and best practices in detecting drift in the data distribution. Some of these practical issues have been mentioned in earlier works (e.g., [AFRZZ19], [ADFRZ20] and [ARZ20]).

These works tried to detect drift indirectly by non-parametrically testing the distribution of model prediction confidence for changes. This generalizes the method and sidesteps domain-specific feature representation. For example, [ADFRZ20] handled a subset of MNIST digits and then inserted drift (e.g., an unseen digit class) to deployment data in various settings (gradual/sudden changes in the drift proportion). A novel loss function was introduced to compare the performance (detection delay, Type-1 and Type-2 errors) of a drift detector under different levels of drift class contamination.

In addition to the drift detection works, there is an extensive literature on finding relations between variables/features in given data. This can be done for features in a data set, columns in a database and more. Among these works are [MAHall] and [CBK], which suggest correlation-based algorithms for feature selection, and work which detects correlated columns in a relational database. These works and others rely on approaches to calculate correlation between features (e.g., Pearson correlation, Spearman correlation), and approaches to measure the similarity between features (e.g., cosine similarity).

The work we report here is motivated by these works. We extend these earlier works by presenting a new automatic approach to identify potential relations between features in the data set, which are not necessarily linear and may contain combinations of these features. Then we determine the influence of changes in new data on each relation. This work automates the whole process of identifying changes in the data distribution, and does so in a more readable way, in the form of relations together with a specific value which quantifies the amount of calculated drift.

3 Methodology

For the remainder of this paper, we assume we have a tabular structured data set, where rows are observations and columns, of which there are $m$, are features. The features considered are all numeric (e.g., integer or real-valued) and not categorical. We will typically assume the observations are mutually independent, and not time-indexed or spatially-related, such as a matrix representation of image pixel values. We suggest a new approach for drift detection consisting of two main steps. The first step (Section 3.1) involves identifying relations between features and the strength of each relation. The second step (Section 3.2) consists of detecting a drift by applying ML and statistical techniques (e.g., linear regression, linear likelihood, Bayes Factor) over these relations.

3.1 Identifying relations between data set features

We define a search space and a possible high-level outline of how to search it for statistically-significant relations. We proceed to implement this approach for the case of polynomial relations. Specifically, for a given feature $Y$, we consider all polynomial relations of $k$ features and degree $d$, that is, all relations of the form $Y = f(X)$ where $X$ is a vector of some $k$ features from the data set.

The search space of all such polynomials is infinite. We assume there are $m$ numeric features. To heuristically make the search efficient, we focus on the $k$ most correlated (e.g., Pearson correlation) features to the feature $Y$. This is done for every feature $Y$.

Next, we create a square matrix of size $m \times m$ ($m$ is the number of features), and initialize every entry to 0. We calculate the correlation for each pair of features, $(X_i, X_j)$, and set the appropriate matrix entry to the absolute value of the correlation. The calculation is done for all pairs except the elements on the diagonal (which are perfectly correlated so we ignore them).
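
The correlation-matrix step above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `top_k_correlated`, the parameter `k`, and the use of NumPy's Pearson correlation are assumptions made for the example, and it assumes no feature column is constant.

```python
import numpy as np

def top_k_correlated(data, k):
    """Build the absolute-correlation matrix and pick each feature's top-k partners.

    `data` is an (observations x features) array; `k` is a hypothetical
    parameter for how many correlated features to keep per target.
    """
    m = data.shape[1]
    corr = np.zeros((m, m))                           # every entry starts at 0
    full = np.abs(np.corrcoef(data, rowvar=False))    # |Pearson correlation|
    off_diag = ~np.eye(m, dtype=bool)
    corr[off_diag] = full[off_diag]                   # diagonal is left at 0
    neighbors = {
        j: [int(i) for i in np.argsort(-corr[j])[:k]]
        for j in range(m)
    }
    return corr, neighbors
```

Leaving the diagonal at 0 means a feature is never selected as its own most correlated partner.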

For each feature $Y$, the algorithm finds its $k$ most correlated features; without loss of generality, we label these $X_1, \dots, X_k$. Then, we define the polynomial embedding of degree $d$ as:

$$\phi_d(X) = \left\{ \prod_{i=1}^{k} X_i^{e_i} \;:\; e_i \in \{0, 1, \dots, d\},\ \sum_{i=1}^{k} e_i \le d \right\}$$

The embedding generates the set of all possible candidate terms for inclusion in the polynomial that keep it at or under degree $d$. That is, all possible products (constant term, single features raised to exponents up to $d$, and multi-feature interactions) of the features raised to non-negative integer exponents, where the exponents in each product sum to at most $d$. Exponents where $e_i = 0$, which do not contribute to the sum, mean that feature $X_i$ is excluded. In practice, we use only a small $k$; for instance, to restrict the relations to consider 3-way feature combinations (including $Y$ itself), we set $k = 2$.

For instance, for a maximum of $k = 2$ features and a target feature $Y$, we have $X = (X_1, X_2)$, where $X_1, X_2$ are two different features. Thus, the polynomial embedding with degree $d = 2$ is $\phi_2(X) = \{X_1, X_2, X_1 X_2, X_1^2, X_2^2, 1\}$. For the first term, for instance, the exponent $e_2 = 0$, so $X_2$ is ignored there. For convenience, let $\phi_j$ denote the $j$-th term in the embedding.
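
The embedding described above (all products of the selected features whose exponents sum to at most the chosen degree) can be enumerated directly. The sketch below is illustrative only; the function names `exponent_tuples` and `polynomial_embedding` are hypothetical, and the ordering of terms is an arbitrary choice of this example.

```python
from itertools import product

import numpy as np

def exponent_tuples(k, degree):
    """All exponent vectors (e_1, ..., e_k) with non-negative entries summing to <= degree."""
    return [e for e in product(range(degree + 1), repeat=k) if sum(e) <= degree]

def polynomial_embedding(X, degree):
    """Map an (n x k) feature array to its polynomial terms, one column per exponent tuple.

    The all-zero tuple yields the constant term; an exponent of 0 drops that feature.
    """
    exps = exponent_tuples(X.shape[1], degree)
    columns = [np.prod(X ** np.array(e), axis=1) for e in exps]
    return np.column_stack(columns), exps
```

For two features and degree 2 this produces six terms: the constant, the two features, their squares, and their product.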

We now define another mapping prototype $f$. $f$ defines a univariate real-valued output from the real-valued vector embedding $\phi_d(X)$. In particular, the output of $f$ will be a prediction of the value of $Y$ given the set of numeric predictors $\phi_d(X)$.

A particular form of the embedding mapping prototype is a linear combination of the form

$$f(\phi_d(X)) = \sum_{j} \beta_j \phi_j$$

where $\beta_j$ are a set of real-valued constants. For instance, $f$ could represent ordinary least-squares (OLS) linear regression (the technique we use in this paper), or Least Absolute Shrinkage and Selection Operator (LASSO) or other forms of linear regression. Since $f$ is used to model the relationship between the feature $Y$ and the embedding features $\phi_d(X)$, its output represents a prediction of $Y$’s value, so we can say that a linear polynomial relation takes the form $Y \approx f(\phi_d(X))$.

Any particular choice of the form of $f$ will give a particular set of values for the constants $\beta_j$ (e.g., LASSO will tend to set $\beta_j = 0$ for non-significant regressors $\phi_j$) as well as a measurement of the error of the prediction $f(\phi_d(X))$ from the true value $Y$. For OLS linear regression, as noted in Section 3.2, this is done by estimating the variance $\sigma^2$ of a Gaussian distributed error term, and by the R-squared ($R^2$) coefficient of determination. The $R^2$ value is in $[0, 1]$, with higher values indicating that $f$ provides a more accurate prediction of $Y$, that is, that this relation is strong. In our method using linear regression, we measure the error of each relation over the field set by its $R^2$, and select only relations for which the $R^2$ exceeds a chosen threshold; this threshold was identified as an acceptable value in previous experiments. The threshold value thus serves as a hyper-parameter which influences the number of relations identified, and can be set as an input to the algorithm.
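
As a rough sketch of how a relation's strength could be measured, the code below fits OLS on an embedding matrix and computes the coefficient of determination. The names `fit_relation` and `is_strong` are hypothetical, and the threshold is left as a caller-supplied argument because its value is not stated here.

```python
import numpy as np

def fit_relation(Phi, y):
    """Fit OLS of y on the embedding columns Phi; return (coefficients, R-squared).

    Phi is assumed to already contain a constant column if an intercept is wanted.
    """
    beta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    resid = y - Phi @ beta
    ss_res = float(resid @ resid)                    # residual sum of squares
    ss_tot = float(((y - y.mean()) ** 2).sum())      # total sum of squares
    return beta, 1.0 - ss_res / ss_tot

def is_strong(r2, threshold):
    """A relation counts as strong when its R-squared reaches the chosen threshold."""
    return r2 >= threshold
```

Only relations that pass `is_strong` would be carried forward to the drift detection step.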

3.2 Detecting drift using identified relations

Here we explore the utilization of a set of generated relations for model drift detection. The likelihood function of a model or distribution is a function which takes as inputs the model parameters $\theta$ and data inputs $X$ (and possibly $y$ if it is a prediction model). Here, we use the general notation $X$ and $y$ for inputs and output of a regression problem, but in our case, $X$ is the polynomial embedding and $y$ is the target feature we are trying to relate linearly to $X$; the number of features in $X$, $p$, is thus the number of terms in the embedding. For an independent sample of values $x_1, \dots, x_n$ and some parametric distribution with parameters $\theta$, the likelihood is the product of the density function evaluated at each $x_i$. For a predictive model, such as linear regression, the likelihood relates the fit of the estimated regression parameters, given input $X$ (which produces a prediction $\hat{y}$), to the true values $y$.

For linear regression, the parameters are $\theta = (\beta, \sigma^2)$. Assume $X$ contains $n$ observations and $p$ feature columns; if an intercept is fit, the values in the first column of $X$ are all 1. Let $X_j$ be the $j$-th column (feature) of the data matrix $X$, and $x_{ij}$ be the value (for the $i$-th observation) for this feature; $y_i$ is the corresponding target feature value for this observation. $\beta$ are the regression parameters (including an intercept, if desired), which has length $p$. The general equation for regression, relating the inputs $X$ and $y$, is that

$$y_i = \sum_{j=1}^{p} \beta_j x_{ij} + \epsilon_i$$

where $\epsilon_i \sim N(0, \sigma^2)$. If there is an intercept, then $x_{i1} = 1$. For conciseness below, we will limit ourselves to cases where there are 1 or 2 features in $X$, not including the intercept; these correspond to 2-way and 3-way feature relations, including the dependent variable $y$ (which corresponds to the target $Y$ in our earlier discussion).

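The regression prediction error variance $\sigma^2$ is estimated as a normalized residual sum of squares, where the numerator is the dot product of the prediction residual ($y - X\hat{\beta}$) with itself. Given $\theta$ and inputs $X$ and $y$ (these can be the particular inputs used to fit the model, or other ones, $X'$ and $y'$, of the same form, such as a field set), the regression likelihood follows from the fact that the prediction errors are normally distributed and independent (an assumption of linear regression):

$$L(\theta \mid X, y) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{\left(y_i - \sum_{j=1}^{p} \beta_j x_{ij}\right)^2}{2\sigma^2} \right)$$

Taking the log of the above, and dropping unnecessary constants, we get the log-likelihood:

$$\ln L(\theta \mid X, y) = -\frac{n}{2} \ln \sigma^2 - \frac{1}{2\sigma^2} \sum_{i=1}^{n} \left(y_i - \sum_{j=1}^{p} \beta_j x_{ij}\right)^2$$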

The dimension of a model $M$ is the number of estimated parameters. In the case of linear regression, this is $p + 1$ (number of regression parameters plus 1 for the error variance). The Bayesian Information Criterion (BIC) for $M$ on a given input $X$ and $y$, that is, a given dataset, is $\mathrm{BIC}(M) = (p + 1) \ln n - 2 \ln L(\theta \mid X, y)$. The BIC takes into account the fit (via the log-likelihood) in addition to the number of parameters $p + 1$ and dataset size $n$. Lower BIC is better; this means a given fit (log-likelihood) achieved with more parameters is worse (i.e., model size is penalized).
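
The Gaussian log-likelihood and BIC described above can be written out in a few lines. This is a minimal sketch under stated assumptions: the function names are hypothetical, and the code keeps the constant term involving $2\pi$ for completeness, which does not affect comparisons between models evaluated on the same data.

```python
import numpy as np

def gaussian_loglik(beta, sigma2, X, y):
    """Log-likelihood of y under y = X @ beta + N(0, sigma2) independent errors."""
    n = len(y)
    resid = y - X @ beta
    return -0.5 * n * np.log(2 * np.pi * sigma2) - float(resid @ resid) / (2 * sigma2)

def bic(beta, sigma2, X, y):
    """BIC = (number of parameters) * ln(n) - 2 * log-likelihood; lower is better.

    The parameter count is len(beta) + 1, the extra one being the error variance.
    """
    n = len(y)
    return (len(beta) + 1) * np.log(n) - 2 * gaussian_loglik(beta, sigma2, X, y)
```

Note that `X` and `y` need not be the data the parameters were fitted on, which is what allows a baseline relation to be scored on field data.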

Now, given a dataset $X$ and target $y$, and two competing models $M_1$ and $M_2$ to explain $y$, we can decide which model fits better by calculating the Bayes Factor (BF), which is approximately

$$\mathrm{BF} \approx \exp\left( \frac{\mathrm{BIC}(M_1) - \mathrm{BIC}(M_2)}{2} \right)$$

If $\mathrm{BF} > 1$, this is more evidence in favor of $M_2$ being a better fit. If $\mathrm{BF} < 1$, the evidence favors $M_1$. The further the BF is from 1 (exactly equal evidence for both models), the more evidence there is in favor of the respective model. Typically the BF has to be above some threshold $c > 1$, or below $1/c$, to make a firm decision.

For conciseness below, we consider a relation between a set of 2 or 3 variables, say $\{y, x_1\}$ or $\{y, x_1, x_2\}$. $y$ will always stand for the target feature being modeled. As a shorthand, we will use the notation, say, $y \sim x_1 + x_2^2$ to mean $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2^2 + \epsilon$, where we implicitly include the intercept and disregard the exact values of the coefficients $\beta$ and of the error variance $\sigma^2$. Note this is very similar to when we said $Y \approx f(\phi_d(X))$, where $X = (X_1, X_2)$, except here we use only a subset of the possible terms in the relation, and couch it in the form of a regression. A two-way linear relation is, say, $y \sim x_1$ or $y \sim x_1^2$, where the relevant coefficients are likewise ignored. The notation $y \sim x_1 + x_2$, for instance, means “$y$, $x_1$ and $x_2$ are jointly linearly related”. We refer to this shorthand as the ‘structure’ of the relation, and we can say that the relation $y \sim x_1 + x_2^2$ has a different structure than does $y \sim x_1 + x_2$ because $x_2$ is not squared in the second.

Consider the case where we have two datasets $D_1 = (X^{(1)}, y^{(1)})$ and $D_2 = (X^{(2)}, y^{(2)})$, where the data matrices $X^{(1)}$ and $X^{(2)}$ have the same feature columns. $y^{(1)}$ and $y^{(2)}$ stand for the value of the target feature in each of the datasets; recall that in our search algorithm, each feature is rotated as being the target feature $y$, so the $x$’s are always all the other features that are not $y$. Consider a subset of variables, say $\{y, x_1\}$ or $\{y, x_1, x_2\}$, that the linear relations algorithm identified as having a strong relation in the two datasets. However, the relations (i.e., the linear models and coefficient values) discovered are different. For instance, in the first, maybe $y \sim x_1$ and in the second, $y \sim x_1 + x_2$. This corresponds to a case where the relation depends linearly on $x_1$ and $x_2$ rather than just $x_1$, along with changes in the coefficient values. Alternatively, the structure may stay the same while the coefficient values change.

We want to see which model fits the data set $D_2$ better. This means that $M_1$ is a relation that has been identified from the first dataset, and we want to replace it with $M_2$ on the second data if the latter fits better. When we refer to model $M$, we mean the set of parameters $\theta$ that are fit from the linear regression, as well as the particular structure (e.g., $y \sim x_1$ vs $y \sim x_1 + x_2$). We can compute the Bayes Factor using the approximation based on the BIC.

If the BF is far enough above 1, it means that $M_2$ is better, and we should reject $M_1$. This means the model has drifted.
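
The comparison described above can be sketched end to end on synthetic data. The sketch makes several assumptions that are not confirmed by the text: the second model is the same relation structure refit on the field data, the error variance is estimated by dividing the residual sum of squares by the number of observations, and the result is kept on the log scale to avoid numerical overflow. All function names and the synthetic data are hypothetical.

```python
import numpy as np

def fit_ols(X, y):
    """Return (beta, sigma2); sigma2 divides by n, which is an assumption of this sketch."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return beta, float(resid @ resid) / len(y)

def bic(beta, sigma2, X, y):
    """BIC of a fitted linear model evaluated on (X, y), which may be new data."""
    n = len(y)
    resid = y - X @ beta
    loglik = -0.5 * n * np.log(2 * np.pi * sigma2) - float(resid @ resid) / (2 * sigma2)
    return (len(beta) + 1) * np.log(n) - 2 * loglik

def log_bayes_factor(model_1, model_2, X, y):
    """Log of the BIC-approximated Bayes Factor of model_2 over model_1 on (X, y)."""
    return 0.5 * (bic(*model_1, X, y) - bic(*model_2, X, y))

def design(x):
    """Design matrix with an intercept column for the structure y ~ x."""
    return np.column_stack([np.ones_like(x), x])

# Synthetic baseline and field samples (illustrative only).
rng = np.random.default_rng(0)
x_base, x_field = rng.normal(size=500), rng.normal(size=500)
y_base = 1 + 2 * x_base + 0.1 * rng.normal(size=500)
y_same = 1 + 2 * x_field + 0.1 * rng.normal(size=500)    # no drift
y_drift = 1 - 2 * x_field + 0.1 * rng.normal(size=500)   # slope flipped

m1 = fit_ols(design(x_base), y_base)                     # relation from the baseline
log_bf_same = log_bayes_factor(
    m1, fit_ols(design(x_field), y_same), design(x_field), y_same)
log_bf_drift = log_bayes_factor(
    m1, fit_ols(design(x_field), y_drift), design(x_field), y_drift)
```

A log Bayes Factor near 0 corresponds to a BF near 1, so `log_bf_same` should stay small while `log_bf_drift` becomes very large.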

In addition to changes in the regression terms (e.g., using $x_1^2$ instead of $x_1$) in the equation of the linear relation, because the parameters $\theta$ include both the regression coefficients and the estimation error variance, the likelihood (and thus the Bayes Factor) can capture changes in the strength of the relationship. For instance, say in both cases the relation structure remains $y \sim x_1$, but either the regression coefficients change, or they remain the same but the relationship becomes weaker in that there is more noise around the regression line. The BF should still capture the changes in the model fit. If the data really does have much more variance around the regression line, the model with the higher estimation variance will fit it better than the one with lower variance. In this way we can judge that there has been a significant shift.

In the equation of the BF above, we have neglected that the prior odds in favor of $M_1$ or $M_2$ can be included, if we have reason to believe that one should fit better; or, we may want to favor the odds towards $M_1$ if we want to declare drift only if there is much more evidence in favor of $M_2$. The BF above assumes the odds are $1{:}1$, in which case the multiplier is 1. If, say, the BF above was 10, but we had prior odds of $2{:}1$ in favor of $M_1$ (corresponding to an odds ratio of $1{:}2$ in favor of $M_2$, reversing the order), we would multiply $10 \times \tfrac{1}{2}$ to get the posterior BF of 5, which would be used for the final decision.

3.3 Related work: using multiple Bayes Factors in hypothesis testing

In Section 3.2, we noted we will use Bayes Factors to measure whether the fit of a linear model changes significantly between two datasets $D_1$ and $D_2$. Specifically, as discussed in Section 4.2, we will actually use the set of Bayes Factors on a set of relations, rather than just one. In the field of multiple hypothesis testing, a set of (typically related) statistical hypothesis tests are conducted, and it is desired to make a single decision based on the combined results. For instance, each hypothesis may yield a p-value $p_i$, and we want to make a single decision based on the set $\{p_i\}$ rather than rejecting each hypothesis individually or not. While the literature on multiple testing is very extensive, it seems that similar use of multiple BFs in a decision is limited. For instance, [LCLHMGT17] use BFs on multiple genetic markers, rather than a single locus, to measure the association between genetic regions. This is an extension we will pursue in future work, but for now we simply examine the ability of an individual linear relation to detect drift through changes in its BF.

4 Experiments

In this section, we describe our experiments. First, we discuss the data sets on which the experiments were conducted, and then we detail the experiment design and results.

4.1 Data sets

We validate our algorithm on several data sets. Following is a description of the data sets being used in our experiments:

  • Rain in Australia ([Rain]) is a structured data set which contains daily weather observations from numerous Australian weather stations. It consists of 142k records with 15 numerical features.

  • London bike sharing ([LondonBike]) is a structured data set which contains historical data for bike sharing in London. It consists of 17k records with 12 numerical features.

  • Personal loan modeling ([PersonalLoan]) is a structured data set which predicts whether a customer will respond to a Personal Loan Campaign. It consists of 5k records with 14 features.

4.2 Experimental Design

The goal of the experiments is to estimate the effectiveness of the algorithm for identifying drift over the mentioned data sets. Using these data sets, we will simulate scenarios where there is a change in the data distribution over time as well as in the influence of existing features over other features. For simplicity, we generated polynomial relations under degree 2 based on 3-way feature combinations (including the target feature $Y$).

Our drift is induced by gradually changing the values of controlling parameters to induce a greater level of drift. Observations from this new ‘drift data’ should typically cause the model’s performance to change; here that is determined by the BF indicating it is no longer a good fit on the drifted data set.

First, we randomly split each data set into a baseline data set $D_B$ and a field data set $D_F$ of equal size. A set of linear relations is determined on the baseline data $D_B$; we can determine the strength of each relation by its regression model $R^2$ metric, where a higher value means the estimated relation is stronger between the chosen features.

Let $D_F'$ be the field data $D_F$ after a simulated alteration is performed to induce drift. When no drift is induced, we expect a given relation to perform roughly equivalently on $D_B$ and $D_F$; that is, its BF should be close to 1 when these data sets are compared, since they differ only by the randomness of the split. A relation is a good diagnostic sensor of drift if its BF reacts strongly to induction of drift, when $D_B$ and the drifted $D_F'$ are compared, but it does not (BF remains close to 1) when drift is not induced on $D_F$. Our contention here is that the stronger relations, based on features which are more strongly correlated with the target $Y$, will be better drift sensors than weaker relations.

We conduct two types of drift simulation, described as follows:

  1. Row permutations: This technique is described in full in [ADFRZ21] (Section III.C). For each column, a fixed proportion (‘perc’) of rows are randomly selected, and their values permuted within the column. This is random drift which maintains the typical marginal feature distributions, but disrupts inter-feature associations, which can affect linear relations between them identified on the baseline data $D_B$. If perc $= 0$, the rows of the full dataset are resampled with replacement (which should result in very similar marginal and inter-feature distributions), while higher values of perc correspond to greater amounts of shuffling of values.

  2. Relation unfairness: In contrast to the permutation drift, which has uncontrolled effects on associations, here we create a distortion of $D_F$ by changing linear relations in a targeted way. This is a synthetic example couched in the context of algorithmic fairness, using the personal loan dataset ([PersonalLoan]). There is a growing literature on the principle of algorithmic fairness in ML, in which, say, a certain prediction or decision by an ML model should not be affected by certain ‘sensitive’ features (e.g., a person’s race or sex), conditional on other features that are relevant. For instance, a black family (‘RACE’ is a sensitive feature) should not be denied a loan when a white family with the same financial (relevant) characteristics is granted one. However, there may be certain associations between the sensitive and relevant features; for instance, black families may, on average, be less likely to receive loans because they tend, on average, to have lower income, and not because of their race.

    Here, we use only numeric features and simulate a scenario in which a feature $y$ (mortgage loan, for which we use the feature for the current mortgage value) should only depend on a subset of relevant features $\mathcal{R}$, but not on certain sensitive features $\mathcal{S}$. Here, we have an unfairness parameter $\lambda \ge 0$. When $\lambda = 0$ (completely fair), $y$ should depend only on features in $\mathcal{R}$. This is done by generating a synthetic $y$ from a linear regression on only the relevant features, with Gaussian noise, so $y = \beta_0 + \sum_{j \in \mathcal{R}} \beta_j x_j + \epsilon$. Of course, $y$ may still be correlated with features in $\mathcal{S}$, due to their correlations with features in $\mathcal{R}$, but it isn’t determined by them (just like the bank loan shouldn’t be determined by race, given other features).

    When $\lambda > 0$, $y$ is generated by a regression of $y$ on the original features in $\mathcal{R}$ and the features of $\mathcal{S}$ scaled by $\lambda$ (this is done by pre-standardizing all features, saving the original standard deviations, then rescaling the standardized coefficients for $\mathcal{S}$ by $\lambda$, to generate $y$ in a similar domain space; see [G21]). Increasing $\lambda$ means the influence of the sensitive features on $y$ increases; when $\lambda = 1$, features in $\mathcal{R}$ and $\mathcal{S}$ should be equally influential on $y$, while for $\lambda > 1$, features in $\mathcal{S}$ have more influence than those in $\mathcal{R}$. In all cases, $y$ is centered and scaled to have a fixed mean and standard deviation.

    Thus, drift detection in this simulation entails first finding strong relations on $D_B$ when $\lambda = 0$, and then creating $D_F'$ for increasing levels of $\lambda$. Any relation involving the synthetic mortgage target $y$ and features in $\mathcal{R}$ and $\mathcal{S}$ should tend to change as $\lambda$ changes, because $y$ is synthetically generated in a different way depending on $\lambda$. We expect that BFs for these relations will change with $\lambda$.
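
The row-permutation drift (type 1 above) can be sketched as follows. This is an illustrative simplification, not the procedure from [ADFRZ21]: the function name `permute_rows` is hypothetical, and for a proportion of 0 this sketch simply returns an unchanged copy instead of resampling rows with replacement.

```python
import numpy as np

def permute_rows(data, perc, rng):
    """For each column, shuffle the values of a randomly chosen fraction `perc` of rows.

    Marginal distributions are preserved exactly (values only move within a column),
    while associations between columns are disrupted.
    """
    out = data.copy()
    n = data.shape[0]
    n_selected = int(round(perc * n))
    for j in range(data.shape[1]):
        rows = rng.choice(n, size=n_selected, replace=False)   # rows to shuffle in column j
        out[rows, j] = rng.permutation(out[rows, j])
    return out
```

Because a different set of rows is drawn for each column, larger values of `perc` break more of the row-wise pairing between features.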

We illustrate some examples of drift insertion for type 2. Here, $\mathcal{R}$ denotes the relevant (‘fair’) features and $\mathcal{S}$ the sensitive (‘unfair’) ones. Figures 1 and 2 show scatterplots of the synthetic target $y$ vs each of the regression features. Increasing $\lambda$ should increase the influence of features in $\mathcal{S}$, and decrease the influence of features in $\mathcal{R}$, on $y$. Although, for instance, the slope of the regression line of $y$ on the relevant feature INCOME actually increases with $\lambda$, Figure 3 shows that the regression $R^2$ decreases for $\mathcal{R}$ and increases for $\mathcal{S}$, and that the correlations of $y$ decrease in absolute value for features in $\mathcal{R}$, while increasing for features in $\mathcal{S}$. These correspond with a given relation between $y$ and $\mathcal{R}$, which may be very strong initially (high $R^2$ when $\lambda = 0$), losing predictive power with increasing $\lambda$; the reverse is true for relations involving $\mathcal{S}$.

Figure 1: Synthetic mortgage vs relevant features ($\mathcal{R}$), for increasing unfairness $\lambda$.
Figure 2: Synthetic mortgage vs sensitive features ($\mathcal{S}$), for increasing unfairness $\lambda$.
Figure 3: Regression $R^2$ and univariate correlation for increasing unfairness $\lambda$. In the right panel, the orange features belong to $\mathcal{S}$.

In either of the two ways of inducing drift discussed above, our experiments are based on the following procedure. First, $D_B$ is the baseline dataset, on which a given relation is identified. $D_F'$ is a corresponding dataset after drift is induced. We then calculate the Bayes Factor described in Section 3.2 to see if the model fit of the relation changes significantly between $D_B$ and $D_F'$.

By strong relations we mean relations with an $R^2$ metric higher than the threshold. For strong relations we expect a growth in the Bayes Factor value when drift is introduced (through the shuffling of values). In contrast, for weak relations the calculated Bayes Factor is not expected to be affected in a monotonic way by the drift.

These approaches significantly change the data distribution and the relations between features in the field data. Since an ML model was trained over the baseline set, the distributions and relations it learned are less accurate for the updated field data. This can strongly influence the ML model.

4.3 Experimental Results

In Section 4.2 we suggested a way to estimate the effectiveness of the approach for identifying drift and used it over the data sets mentioned in Section 4.1. We calculated the potential relations, and partitioned them into strong and weak relations as mentioned earlier. The next step was a random split of the data sets, and the introduction of drift to the field data by shuffling the rows.

Figure 4 shows the results for strong relations. Specifically, the selected relations have a very high $R^2$ metric value. For these relations, the Bayes Factor value changed rapidly with the row shuffle drift value. In contrast, Figure 5 shows the results for weak relations with a low $R^2$ metric. For these relations, the Bayes Factor value was always around 1 (no effect).

This demonstrates the ability to identify drift in the data according to the Bayes Factor value calculated over the strong relations. Moreover, it demonstrates a clear difference between the usefulness of strong relations for drift identification compared to weak relations.

(a) $R^2$ = 0.979
(b) $R^2$ = 0.928
Figure 4: Bayes Factor according to row shuffle drift value: strong relations
(a) $R^2$ = 0.182
(b) $R^2$ = 0.122
Figure 5: Bayes Factor according to row shuffle drift value: weak relations

For the relation unfairness experiment, we took the Personal Loan Modeling data set from Section 4.1 and calculated a relation for the synthetic target feature “Mortgage”. As mentioned in Section 4.2, we partitioned the data set into baseline and field, introduced increasing unfairness to the field data, and calculated the Bayes Factor. The features selected as correlated for the relation were “Income” and “CCAvg”. The experiment results in Figure 6 show that the Bayes Factor value increased rapidly with the amount of unfairness introduced. We also tried to predict the target feature from the features “Age” and “Experience”, which made a weak relation (low $R^2$ metric value). In this case the Bayes Factor was almost unchanged.

(a) $R^2$ = 0.9979
(b) $R^2$ = 0.019
Figure 6: Fairness experiment results: Bayes Factor increased due to unfairness

5 Discussion

We report on initial promising results of detecting drift in the data distribution using polynomial relations.

In our experiments we demonstrated a strong connection between the drift and the ability to detect it via the proposed method of polynomial relations. We plan to extend this to additional drift types and more types of potential relations. We believe that our technology has other useful use cases. One example for a use case may be comparing two existing data sets (with a similar structure) and returning a result which indicates the similarity of these data sets.

There are additional challenges in improving the polynomial relations approach. One important challenge is an improvement of the heuristic that selects the features which participate in the relation. Today, our heuristic works well mainly with features that are directly correlated. We plan to extend it for cases in which several features are only correlated together to a target feature. For example, a relation $Y = g(X_1, X_2)$, where $X_1$ and $X_2$ are not individually highly correlated to $Y$, but the function $g(X_1, X_2)$ is highly correlated with $Y$.

Another important challenge is the trade-off between the complexity of the relation, over-fitting, understanding the relation, and the ability to detect drift. We plan to experiment with techniques to address that. One example is the selection of the right degree of the polynomial relation. Improving the heuristic and accounting for the various trade-offs should result in a faster and more accurate detection of drift.

Also, we plan to experiment with using LASSO with a regularization penalty, rather than OLS linear regression, to determine the optimal relation $f$ on a given embedding $\phi_d(X)$ when $X$ is the set of $k$ most correlated features with the target $Y$. If the maximal number of features $k$ is larger, or if the maximal degree $d$ is larger, the embedding will grow exponentially, likely causing overfitting. LASSO can decide which embedding elements to actually include in the relation. Furthermore, we plan to examine the effect of standardizing the embedding features before determining the relation.

Lastly, following the method in [LCLHMGT17], we want to see whether we can use multiple hypothesis methods to decide on the basis of multiple relations’ Bayes Factors, if the entire dataset seems to have drifted, instead of examining whether each relation individually has changed when deciding if the dataset has drifted.

6 Conclusion

In this work, we addressed the challenge of detecting data distribution drift which may influence the performance of ML models. We automatically identified relations between data set features and statistically detected significant changes in the data which may degrade the model performance if applied on the drifted field data set. We simulated a change in the data over time, and demonstrated the ability to identify it according to the Bayes Factor value calculated over the strong relations.