Picket: Self-supervised Data Diagnostics for ML Pipelines

Data corruption is an impediment to modern machine learning deployments. Corrupted data can severely bias the learned model and can also lead to invalid inferences. We present Picket, a first-of-its-kind system that enables data diagnostics for machine learning pipelines over tabular data. Picket can safeguard against data corruptions that lead to degradation either during training or deployment. For the training stage, Picket identifies erroneous training examples that can result in a biased model; for the deployment stage, Picket flags corrupted query points to a trained machine learning model, i.e., points that due to noise will result in incorrect predictions. Picket is built around a novel self-supervised deep learning model for mixed-type tabular data. Learning this model is fully unsupervised to minimize the burden of deployment, and Picket is designed as a plugin that can increase the robustness of any machine learning pipeline. We evaluate Picket on a diverse array of real-world data considering different corruption models that include systematic and adversarial noise. We show that Picket offers consistently accurate diagnostics during both training and deployment of various models ranging from SVMs to neural networks, beating competing methods for data quality validation in machine learning pipelines.



1. Introduction

Data quality assessment is critical in all phases of the machine learning (ML) life cycle. Both in the training and deployment (inference) stages of ML models, erroneous data can have devastating effects. In the training stage, errors in the data can lead to biased ML models (Koh et al., 2018; Schelter et al., 2019; Breck et al.; Baylor et al., 2017), i.e., models that learn wrong decision boundaries. In the deployment stage, errors in the inference queries can result in wrong predictions, which in turn can be harmful to critical decision-making systems (Breck et al.; Steinhardt et al., 2017). ML pipelines need reliable data quality assessment throughout training and inference to be robust to data errors.

This work focuses on designing a data diagnostics system for ML pipelines over tabular data. The role of data diagnostics in ML pipelines is to identify and filter data that lead to degradation either during training or deployment. For the training stage, the goal of data diagnostics is to identify erroneous training examples that, if used for learning, will lead to a biased model, while for the deployment stage, the goal is to flag erroneous query points to a trained ML model, i.e., points that due to noise will result in incorrect predictions. This work introduces a unified data diagnostics solution for both the training and deployment stages of the ML life cycle.

Data diagnostics for ML exhibit several unique challenges related to data quality. In the training stage, data corruption should not be considered in isolation but always in the context of learning. During learning, low-magnitude noise in the data can be beneficial and act as implicit regularization (Bishop, 1995). Such errors may lead to improved generalization and should not be removed from a training set. On the other hand, not all data points with low-magnitude noise are benign. Adversarial data poisoning techniques (Steinhardt et al., 2017; Koh et al., 2018; Muñoz-González et al., 2017; Biggio et al., 2012) typically corrupt the training set using data points with small perturbations. Similarly, in deployment, not all corruptions will flip the prediction of a trained ML model. Different ML models exhibit different degrees of robustness to random corruption, and systematic or adversarial noise may target specific subsets of the data or classes in the ML pipeline (Koh et al., 2018). Detecting inference inputs that will result in a model misprediction due to corruption requires not only knowledge of the data quality but also knowledge of how the corruption connects to the decision of the model.

The above challenges require rethinking current solutions for identifying errors in data. Existing error detection methods (Heidari et al., 2019; Eduardo et al., 2019; Mahdavi et al., 2019) aim to remove all errors from a data set; thus, they might yield sub-optimal ML models by removing more points than necessary. Outlier detection methods either require access to labeled data (Xue et al., 2010), a tedious requirement that hinders the deployment of data diagnostics for diverse models and pipelines, or impose strong assumptions on the distribution (e.g., Gaussian) of the data (Eduardo et al., 2019; Diakonikolas et al., 2017). Moreover, standard outlier detection methods are not effective against adversarial corruptions (Koh et al., 2018) since these corruptions typically rely on low-magnitude perturbations and introduce erroneous points that are close to clean data. More advanced methods are required to defend against adversarial corruptions (Steinhardt et al., 2017). Nevertheless, current methods are typically limited to real-valued data and focus either on training or inference but not both. Finally, recent techniques for data validation in ML pipelines (Schelter et al., 2019; Breck et al.) rely on user-specified rule- or schema-based quality assertions evaluated over batches of data, and it is unclear if they can support on-the-fly, single-point validation, which is required during inference.

We present Picket, a data diagnostics framework for both the training and deployment stages of ML pipelines. Picket provides an offline diagnostics service for data that come in batches at training time and an online diagnostics service for data that come on the fly at inference time. Effectively, Picket acts as a safeguard against data corruptions that lead either to learning a biased model or to erroneous predictions. Picket introduces a new deep learning model, referred to as PicketNet, to learn the characteristics and the distribution of the data on which the ML pipeline operates. PicketNet models the distribution of mixed-type tabular data and is used in Picket to distinguish between clean data points and corrupted ones. We learn PicketNet without imposing any labeling burden on the user by using the paradigm of self-supervision, a form of unsupervised learning where the data provide the supervision. The design of Picket borrows from our prior extensive experience with ML-based solutions for autonomous data quality management (Wu et al., 2020; Heidari et al., 2019); however, there are many practical aspects of automating data diagnostics in ML pipelines that have not been previously considered. We have distilled three principles that have shaped Picket's design:

1. Bring All Data Types to Bear: The system should support mixed-type data ranging from numerical and categorical values to textual descriptions. It needs to properly represent and process different data types and capture the dependencies between them.

2. Robustness to Noise: Learning within the system should be robust to noisy values: we want to avoid supervision, and the inputs to Picket will not be clean. Further, the detection of problematic samples should depend not only on the input data but also on the downstream decision system, taking its robustness into account.

3. Plugin to an ML pipeline: The system should serve as a “plugin” to any ML pipeline. It needs to be unsupervised to minimize the burden of deployment; it should learn to distinguish corrupted data that lead to degradation from other data points in a fully unsupervised way. The system also needs to be model-agnostic, i.e., it should not impose any strict requirements or assumptions on the model in the ML pipeline or make any modification to it.

Our work makes the following technical contributions:

Self-Attention for Tabular Data   We propose a novel deep learning model for mixed-type tabular data. As discussed later, data diagnostics are designed as services over this model. The architecture of PicketNet corresponds to a new version of a multi-head self-attention module (Vaswani et al., 2017), a core component of state-of-the-art machine learning models in natural language processing. The new model builds upon our recent ideas (Wu et al., 2020) which demonstrate that attention is key to learning schema-level dependencies in tabular data. However, PicketNet introduces a novel stream-based (schema stream and value stream) architecture that captures not only the dependencies between attributes at the schema level but also the statistical relations between cell values. We find that, compared to schema-only models, our novel two-stream architecture is critical for obtaining accurate predictions across diverse data sets.

Robust Training over Arbitrary Corruptions   Because Picket relies on unsupervised learning, its training procedure must be robust to noisy inputs. We propose a self-supervised training procedure with this property: it monitors the loss of tuples in the input data over early training iterations and uses related statistics to identify suspicious data points. These points are then excluded from subsequent training iterations. We show that this procedure allows Picket to be robust even to adversarial noise.

Diagnostic Services for ML Pipelines  

Picket introduces a new architecture where data diagnostics for ML pipelines correspond to services over a pre-trained PicketNet. We show how to use the reconstruction loss of the pre-trained model as the basis of data diagnostic services for both the training and the test stages of supervised ML pipelines, viewing the downstream models as black boxes. We also demonstrate that Picket's diagnostic services generalize to and can benefit ML pipelines with different models ranging from support vector machines to neural networks. We conduct extensive experiments to evaluate the performance of Picket's diagnostics against a variety of baseline methods, both for training time and inference time. These baselines include state-of-the-art outlier detection methods built upon unsupervised autoencoder models (Eduardo et al., 2019) and defenses against adversarial corruption attacks (Roth et al., 2019; Grosse et al., 2017). For training time detection of corrupted data, Picket outperforms all competing methods in area under the receiver operating characteristic curve (AUROC), while it consistently achieves an AUROC above 80 points; an AUROC above 80 points is considered excellent discrimination (Hosmer and Lemeshow, 2000). For test time victim sample detection, Picket achieves the highest quality in most cases, with improvements of at least 2-5 points.

2. Background

We review basic background material for the problems and techniques discussed in this paper.

2.1. Data Quality in ML Pipelines

We first discuss different types of data corruption models that are considered in the literature. Then, we review existing approaches to deal with data corruptions, both during training and inference, that are related to the problems we consider in this paper.

Data Corruption Models   We consider data corruptions due to random, systematic, and adversarial noise.

1. Random noise is drawn from some unknown distribution that does not depend on the data. Random noise is not predictable and cannot be replicated by repeating the same experiment.

2. Systematic noise exhibits structure and typically leads to errors that appear repeatedly in a subset of data samples. This type of noise introduces bias to the data. Systematic noise is predictable and replicable since it depends on the data or external factors in a deterministic way. For example, systematic corruptions may be introduced after joining a clean and a noisy data set to obtain the final training data set for an ML model.

3. Adversarial noise contaminates the data to mislead ML models. At training time, adversarial noise forces a model to learn a bad decision boundary; at test time, adversarial noise fools a model so that it makes the wrong prediction. It usually depends on the data and the target model, although some types of adversarial noise may work well across different models.

Corruptions due to random or systematic noise do not directly lead to degradation of the ML model unless the majority of the data is corrupted. In fact, many ML models have been shown to be robust to these types of noise (Li et al., 2019). On the other hand, corruptions due to adversarial noise are designed to be harmful to ML systems and raise security concerns. Picket is able to handle all three types of noise and the corresponding data corruptions.

Dealing with Corrupted Data in ML   The most common approach to dealing with corrupted data during training is to identify samples that have missing values or erroneous feature values or labels and remove them from the training set. We refer to this process as filtering. Given a training data set, the goal of filtering is to identify a feasible set of data points to be used for training. The feasible set may still contain noisy data but should not contain corrupted points that introduce bias. Different mechanisms have been proposed for filtering: 1) learning-based outlier detection methods (Liu et al., 2008; Chen et al., 2001; Eduardo et al., 2019) leverage ML or statistical models to learn the distribution of clean data and detect out-of-distribution samples (including adversarially corrupted samples); 2) error-detection methods (Heidari et al., 2019; Qahtan et al., 2018; Mahdavi et al., 2019) rely on a combination of logic rules and ML models to identify corrupted samples; they do not consider adversarial corruptions. For corruptions that correspond to missing values alone, recent works propose imputing the missing values (Wu et al., 2020) before training a ML model (Karlaš et al., 2020) or performing statistical estimation (Liu et al., 2020). Finally, recent data validation modules for ML platforms (Baylor et al., 2017; Schelter et al., 2019; Breck et al.) rely on user-defined quality rules and measures as well as simple one-column statistics to check the quality of data batches. Such user-defined quality rules are out of the scope of this work. In this work, we consider corruptions that correspond to erroneous values due to random, systematic, or adversarial noise. We focus on filtering corrupted samples from the training data that lead to model degradation.

The above approaches, predominantly outlier detection, can also be used during inference. In addition, more advanced methods have been proposed to guard against adversarial corruption at inference time. These methods accept or reject a data point either via statistical tests that compare inference queries to clean data (Grosse et al., 2017) or by considering variations in a model's internal data representation (e.g., the neuron activations of a neural network or the prediction confidence of the downstream model) (Roth et al., 2019). Similar to these works, we also consider the online detection of inference queries that will result in wrong model predictions due to corruption.

2.2. Self-Supervision and Self-Attention

We review relevant ML concepts.


Self-Supervision   In self-supervised learning, the objective is to predict part of the input from the rest of it. A typical approach to self-supervision is to mask a portion of the input and let the model reconstruct the masked portion based on the unmasked parts. Through self-supervision, a model learns to capture dependencies between different parts of the data. Self-supervised learning has shown great success in natural language processing (Devlin et al., 2018; Su et al., 2020). Self-supervised learning is a subset of unsupervised learning in a broad sense, since it does not require any human supervision.
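To make this concrete, a self-supervised training pair over a tuple can be built by hiding one attribute and treating the hidden value as the reconstruction target. The sketch below is only illustrative; the function name and mask token are ours, not an API from the paper:

```python
import random

def make_self_supervised_example(tuple_values, mask_token="[MASK]"):
    # Pick one attribute to hide; the model must reconstruct it from the
    # remaining (unmasked) attributes. The data itself provides the
    # supervision signal, so no human labels are needed.
    i = random.randrange(len(tuple_values))
    masked = list(tuple_values)
    target = masked[i]
    masked[i] = mask_token
    return masked, i, target

masked, idx, target = make_self_supervised_example(["NY", "engineer", 42])
```

The masked tuple is the model input; the pair (idx, target) is the supervision that comes for free from the data.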

Multi-Head Self-Attention   Multi-head self-attention is a model for learning a representation of a structured input, e.g., a tuple or a text sequence, by capturing the dependencies between different parts of the input (Vaswani et al., 2017). One part can pay different levels of attention to other parts of the same structured input. For example, in the text sequence “the dog wears a white hat”, the token “wears” pays more attention to “hat” than to “white”, although “white” is closer in the sequence. The attention mechanism can also be applied to tuples that consist of different attributes (Wu et al., 2020). Multi-head self-attention takes an ensemble of different attention functions, with each head learning one.

We review the computations in multi-head self-attention. Let $x_1, \dots, x_n$ be the embeddings of a structured input with $n$ tokens. Each token $x_i$ is transformed into a query-key-value triplet $(q_i, k_i, v_i)$ by three learnable matrices $W^Q$, $W^K$, and $W^V$. The query $q_i$, key $k_i$, and value $v_i$ are real-valued vectors with the same dimension $d$. The output of a single head for token $x_i$ is $o_i = \sum_{j=1}^{n} a_{ij} v_j$, a weighted sum of all the values in the sequence, where $a_{ij} = \mathrm{softmax}_j\left(q_i^\top k_j / \sqrt{d}\right)$. The attention $x_i$ pays to $x_j$ is determined by the inner product of $q_i$ and $k_j$. Multiple heads share the same mechanism but have different transformation matrices. In the end, the outputs of all the heads are concatenated and transformed into the final output by an output matrix $W^O$, which is also learnable.
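The computation above can be sketched in a few lines of NumPy. This is a single-head toy version with random weights, not a reference implementation:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention_head(X, Wq, Wk, Wv):
    # X: one row per token embedding. Each token gets a query, key,
    # and value via the learnable matrices Wq, Wk, Wv.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d))  # A[i, j]: attention token i pays to token j
    return A @ V                       # output: weighted sum of all values

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))            # 4 tokens, embedding dimension 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
O = attention_head(X, Wq, Wk, Wv)      # multi-head: run several heads,
                                       # concatenate, then apply W^O
```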

3. Framework Overview

We formalize the problem of data diagnostics for ML pipelines and provide an overview of our solution to this problem.

3.1. Problem Statement

We use $D$ to denote a set of tabular data, and $x$ to denote a single sample (tuple) in $D$ with $m$ attributes. These attributes correspond to features and can optionally contain the correct label for the data sample. We consider a downstream decision system with a classifier $f$. For a sample $x$, let the prediction of the classifier be $f(x)$, and the correct label of $x$ be $y(x)$. For each $x$ we denote by $x^*$ its clean version; if $x$ is not corrupted then $x = x^*$.

Training Time Diagnostics   Before the training of classifier $f$, a training set $D$ containing $N$ tuples is collected. We assume that $D$ contains clean and corrupted samples and that the fraction of corrupted samples is always less than half. We term training time diagnostics the task of filtering out the corrupted samples in $D$ that bias the learning process. This task is different from that of error detection: we do not wish to remove all erroneous samples from $D$ but only those that harm the learning process. We formalize the task of training time diagnostics next. We denote by $\mathcal{X}$ the domain of the samples in $D$. Let $s : \mathcal{X} \to \mathbb{R}$ be a score function that takes in a data point and returns a number representing how corrupted the data point is. We define the feasible set of $D$ with respect to a score function $s$ to be $D_s = \{x \in D : s(x) \le \tau\}$. We assume that the threshold $\tau$ is chosen such that a desired fraction of points from the set $D$ is discarded (e.g., the user may choose to discard at most 5% of the points in $D$). The goal of training time diagnostics is to find a score function $s$ such that training over $D_s$ minimizes the generalization error of a model $f$.
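Given any such score function, materializing the feasible set reduces to discarding the top-scoring fraction of points. A minimal sketch follows, with an illustrative identity score rather than Picket's PicketNet-based score:

```python
def feasible_set(samples, score, discard_frac=0.05):
    # Sort by corruption score and drop the top discard_frac fraction;
    # the implied threshold tau is the score of the last kept point.
    ranked = sorted(samples, key=score)
    n_keep = len(ranked) - int(discard_frac * len(ranked))
    return ranked[:n_keep]

data = list(range(100))                       # toy "samples"
kept = feasible_set(data, score=lambda x: x)  # toy score: the value itself
```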

Inference Time Diagnostics   We consider a trained model $f$ that is deployed and serves incoming inference queries. Each inference query is accompanied by a data sample $x$. We define a victim sample to be a sample $x$ such that $f(x) \neq y(x)$ but $f(x^*) = y(x^*)$, i.e., it would have been correctly classified without the corruption. We term inference time diagnostics the following problem: given a classifier $f$ that is trained on clean data, for each sample $x$ that comes on the fly, we want to tell if it is a good sample that can be correctly classified or a victim sample, i.e., we want to detect if $f(x) \neq y(x)$.
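The victim-sample condition can be written as a small predicate. The helper and the toy classifier below are hypothetical stand-ins, not part of the system:

```python
def is_victim(x, x_clean, f, y):
    # A victim sample is misclassified as-is, yet its clean version
    # would have been classified correctly.
    return f(x) != y(x) and f(x_clean) == y(x_clean)

# Toy example: the true label is always 1; the classifier predicts 1 iff value >= 0.
y = lambda v: 1
f = lambda v: 1 if v >= 0 else 0
```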

We next describe how Picket solves the aforementioned tasks.

3.2. Picket Overview

Figure 1. The key components of a typical machine learning pipeline with Picket.

We now provide an overview of Picket and its data diagnostics services for ML; the key components are shown in Figure 1.

Training Time Diagnostics   For this phase, the input to Picket is a training data set $D$ and a threshold that corresponds to an upper bound on the fraction of points to be filtered from $D$. The output of Picket is a feasible set $D_s$ of data points to be used for training a downstream model. Recall that $D_s$ is determined by a score function $s$ that measures the corruption of points in $D$. This function should be such that the generalization error of the downstream classifier is minimized. We do not want to tie training time diagnostics to a specific ML model; hence, we approximate the generalization error via the reconstruction loss of Picket's model PicketNet (see Section 5). This design choice is motivated by results in the ML literature which show that unsupervised pre-training is linked to better generalization performance (Erhan et al., 2010; Devlin et al., 2018; Yang et al., 2019).

Picket follows the next steps to identify the score function $s$: First, Picket learns a self-supervised PicketNet model $M$ that captures how data features are distributed for the clean samples. Picket uses a novel unsupervised robust training mechanism to learn $M$, i.e., the system does not have access to training samples that are labeled as clean or corrupted. In this phase, we ignore any attribute in $D$ that corresponds to the labels of the samples. We assume that labels are either clean or that label-based noise is naturally handled by the downstream training procedure. During training, Picket records the reconstruction loss across training epochs for all points in $D$. After training of $M$, the reconstruction loss trajectories for the points in $D$ are analyzed to form a score function $s$ for training time diagnostics. Finally, the feasible set $D_s$ is identified and returned to proceed with training of the downstream supervised model.

Inference Time Diagnostics   We assume a trained classifier $f$ serving inference queries. Picket's inference time diagnostics service spans two phases, offline and online. For the offline phase, we assume access to a set of clean training samples $D_c$. We consider that inference samples follow the same distribution as those in $D_c$ before corruption. We follow the next steps during the offline phase: first, data set $D_c$ is used to learn a PicketNet $M$; second, $D_c$ is augmented by adding artificial noise and then extended with reconstruction loss features from $M$, and this new data set is used to learn a set of victim sample detectors (one for each class considered during inference). These detectors correspond to 0-1 classifiers. Note that if the diagnostics of the two stages are combined, we can take the feasible set from training time as the clean set $D_c$ and reuse the pre-trained PicketNet from the training time diagnostics as $M$. Model $M$ and the detectors are used during the online phase. During this phase, we assume a stream of incoming inference queries, each served by the downstream prediction service corresponding to classifier $f$ to provide predictions. To provide online inference time diagnostics, Picket performs the following: for each point $x$ it evaluates classifier $f$ on $x$ to obtain an initial prediction $f(x)$. Picket also uses $M$ to compute a feature vector of reconstruction losses for the features of $x$. This reconstruction loss feature vector together with the raw features in $x$ are given as input to the victim sample detector for the class that corresponds to $f(x)$. Using this input, the detector identifies if point $x$ corresponds to a victim sample; if the point is not marked as suspicious, the final prediction $f(x)$ is revealed.
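The online phase reduces to a short control flow. The names below (classifier, loss_features, detectors) are our stand-ins for the trained downstream model, the PicketNet loss featurizer, and the per-class victim detectors; they are not the system's actual interface:

```python
def online_diagnose(x, classifier, loss_features, detectors):
    # 1) tentative prediction from the downstream model
    y_hat = classifier(x)
    # 2) raw features plus PicketNet reconstruction-loss features
    feats = list(x) + list(loss_features(x))
    # 3) consult the victim-sample detector for the predicted class;
    #    only reveal the prediction if the point is not suspicious
    if detectors[y_hat](feats):
        return None, True      # flagged: withhold the prediction
    return y_hat, False

# Toy stand-ins: the class-0 detector flags points whose loss feature exceeds 1.0
classifier = lambda x: 0
loss_features = lambda x: [0.5]
detectors = {0: lambda feats: feats[-1] > 1.0}
pred, flagged = online_diagnose([1.0, 2.0], classifier, loss_features, detectors)
```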

4. The PicketNet Model

We introduce PicketNet, a two-stream self-attention model that learns the distribution of tabular data. The term stream refers to a path in a deep neural network that focuses on a specific view of the input data. For example, a standard multi-head attention path is one stream that learns value-based dependencies between the parts of the input data (see Section 2). Recent deep learning architectures have proposed combining multiple streams, where each stream focuses on learning a different view of the data. Such architectures achieve state-of-the-art results in natural language processing tasks (Yang et al., 2019) and computer vision tasks (Simonyan and Zisserman, 2014). PicketNet introduces a novel two-stream self-attention model for tabular data. We next describe the PicketNet architecture and then the robust, self-supervised training process we follow.

4.1. Model Architecture

PicketNet contains a schema stream and a value stream. The schema stream captures schema-level dependencies between the attributes of the data, while the value stream captures dependencies between specific data values. A design overview of PicketNet is shown in Figure 2. The input to the network is a mixed-type data tuple $t$ with $m$ attributes. We denote its attribute values as $t_1, \dots, t_m$. We discuss the different components of the PicketNet architecture next.

The first step is to obtain a numerical representation of tuple $t$. To capture the schema- and value-level information for $t$, we consider two numerical representations for each attribute $i$: 1) a real-valued vector that encodes the information in value $t_i$, denoted by $E^V_i$, and 2) a real-valued vector that encodes schema-level information of attribute $i$, denoted by $E^S_i$. For example, a tuple with two attributes is represented as $(E^V_1, E^S_1, E^V_2, E^S_2)$.

To convert $t_i$ to $E^V_i$, PicketNet uses the following process: the encoding for each attribute value is computed independently, and each attribute follows its own embedding process. We consider 1) categorical, 2) numerical, and 3) textual attributes. For categorical attributes, we use a learnable lookup table to get the embedding for each value in the domain; this lookup table is learned jointly with the subsequent components of PicketNet. For numerical attributes, we keep the raw value. For textual attributes, we train a fastText (Bojanowski et al., 2017) model over the corpus of all the texts and apply SIF (Arora et al., 2019) to aggregate the embeddings of all the words in a cell. The initial embedding vectors are fed as input to the value-level stream.
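A minimal sketch of this per-type dispatch, with a plain dict of random vectors standing in for the learnable embedding table and a mean of word vectors standing in for fastText + SIF:

```python
import numpy as np

def encode_value(value, attr_type, lookup, word_vec=None, dim=8):
    if attr_type == "categorical":
        # Stand-in for a learnable embedding table: one vector per value.
        if value not in lookup:
            lookup[value] = np.random.randn(dim)
        return lookup[value]
    if attr_type == "numerical":
        return np.array([float(value)])     # keep the raw value
    # textual: aggregate per-word embeddings (a mean here; SIF in the paper)
    return np.mean([word_vec(w) for w in value.split()], axis=0)

lookup = {}
e_cat = encode_value("NY", "categorical", lookup)
e_num = encode_value("3.5", "numerical", lookup)
e_txt = encode_value("white hat", "textual", lookup, word_vec=lambda w: np.ones(4))
```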

Each vector $E^S_i$ serves as a positional encoding of the attribute associated with index $i$. Positional encodings are used to capture high-level dependencies between attributes. For example, if there exists a strong dependency between two attributes $i$ and $j$, vectors $E^S_i$ and $E^S_j$ should be such that the attention score between attributes $i$ and $j$ is high. Each $E^S_i$ corresponds to a trainable vector that is initialized randomly and is fed as input to the schema stream.

We now describe the subsequent layers of our model. These layers consider the two attention streams and form a stack of self-attention layers; the output of each layer serves as the input to the next. Self-attention layer $l$ takes the value vector $H^l_i$ and positional encoding $E^S_i$ to learn a further representation for attribute $i$ and its value $t_i$ (at the first layer, $H^0_i = E^V_i$). After each attention layer, the outputs of the two streams are aggregated and fed as input to the value-level stream of the next layer, while the schema stream always takes as input the positional encoding. The output $H^{V,l}_i$ of the value stream and the output $H^{S,l}_i$ of the schema stream are computed as:

$$H^{V,l}_i = \mathrm{MHS}\big(Q(H^l_i),\, K(H^l),\, V(H^l)\big), \qquad H^{S,l}_i = \mathrm{MHS}\big(Q(E^S_i),\, K(H^l),\, V(H^l)\big),$$

where MHS is the multi-head attention function followed by a feed-forward network, $Q(\cdot)$, $K(\cdot)$, and $V(\cdot)$ are linear transformations that map their input into query, key, or value vectors by the corresponding matrices, and $K(H^l)$, $V(H^l)$ are matrices with the key and value vectors from all $m$ attributes packed together. The only difference between the two streams is that the query in the schema stream corresponds to the positional encoding, so this stream learns higher-level dependencies. For the value stream, the input to the next layer is the sum of the outputs from the two streams, $H^{l+1}_i = H^{V,l}_i + H^{S,l}_i$, while for the schema stream the input to the next layer corresponds to a new positional encoding that does not depend on the previous layers. If layer $l$ is the last layer, $H^{l+1}_i$ is the final representation for attribute value $t_i$ of the input data.
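The two-stream update can be sketched as follows, simplified to a single attention head with no feed-forward sublayer; the variable names are ours, not the paper's code:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attend(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def two_stream_layer(H, E_s, Wq, Wk, Wv):
    # Both streams attend over the same keys/values computed from H; they
    # differ only in the query: the value stream queries with H, the schema
    # stream with the positional (schema) encodings E_s.
    K, V = H @ Wk, H @ Wv
    H_val = attend(H @ Wq, K, V)     # value stream output
    H_sch = attend(E_s @ Wq, K, V)   # schema stream output
    return H_val + H_sch             # value-stream input to the next layer

rng = np.random.default_rng(1)
m, d = 5, 8                          # m attributes, embedding dimension d
H = rng.normal(size=(m, d))          # value-stream state (E^V at layer 0)
E_s = rng.normal(size=(m, d))        # schema encodings (fixed per layer)
H_next = two_stream_layer(H, E_s, *(rng.normal(size=(d, d)) for _ in range(3)))
```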

Figure 2. (a) Overview of the two-stream multi-head self-attention network. (b) An illustration of the schema stream for the first attribute. (c) An illustration of the value stream for the first attribute.

4.2. Training Process

We now describe how PicketNet is trained.

Multi-Task Self-Supervised Training   The training process is self-supervised, with the objective of reconstructing the input. In each iteration, we mask one of the attributes and try to reconstruct it based on the context, i.e., the other attributes in the same tuple. The attributes are masked in turn, in a specified or random order.

The training is also multi-task, since the reconstruction of each attribute forms one task. We use different types of losses for the three types of attributes to quantify the quality of reconstruction. Consider a sample $t$ whose original value of attribute $i$ is $t_i$. If attribute $i$ is numerical, $t_i$ is a one-dimensional value, and hence the reconstruction of the input value is a regression task: we apply a simple neural network on the output of PicketNet to get a one-dimensional reconstruction $\hat{t}_i$, and use the mean squared error (MSE) loss:

$$\mathcal{L}_{\mathrm{MSE}} = (\hat{t}_i - t_i)^2.$$

For categorical or textual attributes we use the following loss. Consider a tuple $t$ and its attribute $i$. For its attribute value $t_i$, let $E^V_i$ be the input embedding obtained from PicketNet, and $H_i$ the contextual encoding of value $t_i$ after pushing tuple $t$ through PicketNet. Given tuple $t$, we randomly select a set $N_i$ of other values from the domain of attribute $i$. We mask the value of attribute $i$ and consider the training loss associated with identifying $t_i$ as the correct completion value from the set of possible values $N_i \cup \{t_i\}$. To compute the training loss we use the cosine similarity between $H_i$ and the input encoding $E^V(v)$ for each $v \in N_i \cup \{t_i\}$, then we apply the softmax function over the similarities and calculate the cross-entropy (CE) loss:

$$\mathcal{L}_{\mathrm{CE}} = -\log \frac{\exp\big(\mathrm{sim}(H_i, E^V(t_i))\big)}{\sum_{v \in N_i \cup \{t_i\}} \exp\big(\mathrm{sim}(H_i, E^V(v))\big)},$$

where $\mathrm{sim}(\cdot, \cdot)$ is the cosine similarity.
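A sketch of this loss, assuming the contextual encoding and candidate embeddings are plain vectors (the names are ours, and the true value is placed at index 0 by convention):

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def reconstruction_ce_loss(H_i, true_emb, negatives):
    # Softmax over cosine similarities between the contextual encoding H_i
    # and the candidate value embeddings; cross-entropy with the true
    # (masked) value as the correct class (index 0 here).
    sims = np.array([cosine(H_i, c) for c in [true_emb] + list(negatives)])
    probs = np.exp(sims) / np.exp(sims).sum()
    return -np.log(probs[0])

rng = np.random.default_rng(0)
H_i = rng.normal(size=8)
negatives = [rng.normal(size=8) for _ in range(4)]
loss = reconstruction_ce_loss(H_i, H_i, negatives)  # perfect match: sim = 1
```

Since the true value has the maximum possible similarity in this toy call, its softmax probability is at least 1/5, so the loss is bounded by log 5.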

Loss-based Filtering   The input to PicketNet may not be clean, and self-supervised models are vulnerable to noisy training data since, in the presence of noise, the features and the labels (which are derived from the features) are both corrupted. Therefore, we need a mechanism that makes the training process robust to noisy input. We introduce a loss-based filtering mechanism into our system. Similar mechanisms have been explored in the context of image processing (Song et al., 2018; Yang et al., 2017).

In the early stages of training, we record reconstruction loss statistics for several epochs. We aggregate the losses of all the attributes over the recorded epochs and filter out the samples with abnormally high or low loss. The detailed steps are as follows:

  1. Train PicketNet over the training set $D$ for $T_w$ epochs to warm up.

  2. Train PicketNet over $D$ for $T_r$ more epochs and, for each sample $t$ in $D$, record the epoch-wise average loss $\ell_i(t)$ for each attribute $i$, $1 \le i \le m$.

  3. For each sample, aggregate the losses attribute-wise:

$$\ell(t) = \sum_{i=1}^{m} \frac{\ell_i(t)}{\mathrm{med}_i}, \quad (1)$$

    where $\mathrm{med}_i$ is the median of $\ell_i$ over all points in $D$.

  4. Put a sample $t$ into set $R$ if its aggregated loss $\ell(t)$ is less than $\tau_{\mathrm{low}}$ or greater than $\tau_{\mathrm{high}}$, where $\tau_{\mathrm{low}}$ and $\tau_{\mathrm{high}}$ are pre-specified thresholds. $R$ is the set of samples to be removed.

  5. Train PicketNet over $D \setminus R$ until convergence.

When we do the attribute-wise aggregation, we normalize by dividing the loss of each attribute by its median, which brings different types of losses to the same scale. The normalized loss characterizes how large a loss is relative to the typical loss for that attribute. Note that we use the median instead of the mean because the median is more robust to noise: when some samples have extremely high or low loss, the mean can shift substantially while the median remains stable.
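The filtering and normalization described above condense to a few lines; the thresholds here are illustrative, not the system's defaults:

```python
import numpy as np

def loss_based_filter(losses, tau_low=0.2, tau_high=5.0):
    # losses: (n_samples, n_attributes) epoch-averaged reconstruction losses.
    # Normalize each attribute by its median so different loss types share
    # a scale, aggregate attribute-wise, then flag two-sided outliers:
    # abnormally HIGH loss -> random/systematic corruption,
    # abnormally LOW loss  -> possible adversarial poisoning.
    med = np.median(losses, axis=0)
    agg = (losses / med).sum(axis=1)
    return np.flatnonzero((agg < tau_low) | (agg > tau_high))

losses = np.array([[1.0, 1.0],     # typical sample
                   [1.1, 0.9],     # typical sample
                   [9.0, 8.0],     # high loss: likely corrupted
                   [0.01, 0.02]])  # suspiciously low loss: possible poison
removed = loss_based_filter(losses)
```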

Figure 3. The reconstruction loss distributions of clean samples, randomly corrupted samples, systematically corrupted samples and poisoned samples in the early training stage.

The filtering is two-sided because randomly or systematically corrupted samples and adversarially poisoned samples behave differently during the early training stage. Outliers with random or systematic noise are internally inconsistent and thus have high reconstruction loss in the early training stage of PicketNet. Adversarially poisoned samples, however, tend to have unusually low reconstruction loss. The reason is that poisoned data tend to be concentrated in a few locations in order to be effective while appearing normal, as pointed out by Koh et al. (2018). Such concentration allows deep networks such as PicketNet to fit them very quickly, and therefore their reconstruction loss in the early stage is lower than that of the clean samples. We confirm this hypothesis experimentally. In Figure 3, we show the distributions of the reconstruction loss for 1) clean samples, 2) randomly and systematically corrupted samples, and 3) adversarially poisoned samples for a real-world dataset. The loss distributions of the three types of samples have notable statistical distances. Therefore, we filter out samples with high loss to remove randomly or systematically corrupted samples, and samples with abnormally low loss to defend against poisoning attacks.

5. Data Diagnostics for ML in Picket

The key element that enables both training time and inference time diagnostics is the reconstruction loss of PicketNet, which provides a signal that indicates how inconsistent an attribute value is with respect to the other attribute values in a tuple. The internal inconsistency can be used to diagnose the data.

Filtering Corrupted Training Data   We use PicketNet’s reconstruction loss statistics (see Equation 1) from training to design the score function used to identify the feasible set (see Section 3.2). The score function follows immediately from the robust training procedure for PicketNet described in Section 4.2: we set the aggregated reconstruction loss $L(t)$ (see Equation 1) of each point $t$ in the training set to be its score for filtering corrupted points. We then define the feasible set to be the set of points whose score lies between two thresholds. We allow for two different thresholds to distinguish between points with random or systematic corruption and points with adversarial corruption (see Section 4.2). The user can choose to use the default thresholds $\tau_{\mathrm{low}}$ and $\tau_{\mathrm{high}}$ used when training PicketNet (this option corresponds to the robust training mechanism of PicketNet) or can adjust the thresholds to be more or less conservative (see Section 6.2).

Victim Sample Detection for Inference Data   We combine the reconstruction loss measurements from PicketNet with victim sample detectors to obtain Picket’s inference time diagnostics (see Section 3.2) for a trained classifier. For each class $y$ in the downstream classification task, we build a detector $M_y$ to identify victim samples, i.e., samples for which the classifier will provide a wrong prediction due to corruption of the feature values. Picket uses logistic regression as the detector model by default, but the user can easily switch to other classifiers. The training of these detectors is performed using synthetically generated training data.

During the offline phase of inference time diagnostics, we assume access to a set of clean training data points $D_c$. Notice that $D_c$ may be the set obtained by applying Picket’s training time diagnostics on a potentially corrupted data set. We first apply the downstream classifier on all points in $D_c$ and obtain the subset of points for which it returns the correct prediction; we denote this subset $D_c^+$. Moreover, we partition $D_c^+$ into sets $D_y^+$, one for each class $y$ of the downstream prediction task. For each partition, we use the points in $D_y^+$ to construct artificial victim samples and artificial noisy points for which the classifier returns the correct prediction despite the injection of noise. Let $V_y$ and $G_y$ be the set of artificial victim samples and the set of noisy but correctly classified samples generated from $D_y^+$, respectively. To construct these two data sets, we select a random point from $D_y^+$ and inject artificial noise to obtain a noisy version; we then evaluate the classifier on the noisy version and assign the generated point to $G_y$ if the prediction remains correct, otherwise to $V_y$. We iteratively perform the above process for randomly selected points in $D_y^+$ until the sets $V_y$ and $G_y$ each contain enough points. Given these three sets, we construct a new augmented data set $D_y^+ \cup G_y \cup V_y$. We extend the feature vector of each point in this set by concatenating it with the reconstruction loss vector obtained after passing the point through the trained PicketNet. We also assign to it a positive label (indicating that we will obtain a correct prediction) if it originated from $D_y^+$ or $G_y$, and a negative label (indicating that we will obtain a wrong prediction) if it originated from $V_y$. The output of this procedure is the training data for the victim sample detector $M_y$. We repeat the above process for each class $y$.
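A minimal sketch of this augmented-data construction for one class, assuming a generic `predict` function for the downstream classifier and a `loss_fn` stand-in for PicketNet's reconstruction-loss vector (all names are ours, not from Picket's code):

```python
import numpy as np

def build_detector_data(X, y, predict, loss_fn, noise_sigma=1.0,
                        n_each=50, seed=0):
    """Build training data for one victim-sample detector (sketch).

    predict -- the trained downstream classifier, x -> label
    loss_fn -- stand-in for PicketNet's reconstruction-loss vector, x -> array
    Returns features (raw features concatenated with reconstruction losses)
    and labels (1 = prediction will be correct, 0 = victim sample).
    """
    rng = np.random.default_rng(seed)
    # Keep only the points the classifier already predicts correctly.
    correct = [(x, t) for x, t in zip(X, y) if predict(x) == t]
    victims, noisy_good = [], []
    # Inject random noise until both artificial sets are populated.
    while len(victims) < n_each or len(noisy_good) < n_each:
        x, t = correct[rng.integers(len(correct))]
        x_noisy = x + rng.normal(0.0, noise_sigma, size=x.shape)
        if predict(x_noisy) == t:
            if len(noisy_good) < n_each:
                noisy_good.append(x_noisy)   # noisy but still correct
        elif len(victims) < n_each:
            victims.append(x_noisy)          # corruption flips the prediction
    feats, labels = [], []
    for x, lab in ([(x, 1) for x, _ in correct[:n_each]] +
                   [(x, 1) for x in noisy_good] +
                   [(x, 0) for x in victims]):
        # Extend each feature vector with the reconstruction-loss vector.
        feats.append(np.concatenate([x, loss_fn(x)]))
        labels.append(lab)
    return np.array(feats), np.array(labels)
```

The resulting features and labels would then be fed to a per-class logistic regression detector.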

Ideally, in the above process, the artificial noise that we inject should have the same distribution as that in the real-world case. However, it is impossible to know the exact noise distribution in advance. In practice, we add random noise to train the detectors and they work well in the presence of other types of noise. A possible reason is that the detectors learn an approximate boundary between good and victim samples regardless of noise type.

At inference time, the victim sample detectors are deployed along with the downstream model. Whenever a sample arrives, the downstream model gives its prediction and PicketNet gives its reconstruction losses. The corresponding detector then decides whether the sample should be marked as suspicious based on the sample’s features and its reconstruction loss vector. Note that since the reconstruction loss of categorical and textual attributes depends on the set of negative samples randomly selected from the domain of the corresponding attribute, we repeat the loss computation several times and take the mean to reduce the effect of randomness.

6. Experiments

We compare the performance of Picket against baseline methods on detecting corrupted points (i.e., outlier detection) at training time and the victim sample detection task at test time on a variety of datasets. We seek to validate the effectiveness of Picket’s data diagnostics on safeguarding against different types of corruption in ML pipelines. We consider both training and deployment of different ML models. Finally, we perform several micro-benchmarks to validate different design choices in Picket.

6.1. Experimental Setup

We introduce the datasets, noise generation process, downstream models, baselines, and metrics used in our evaluation. We also provide the hyper-parameters we use for PicketNet.

Datasets:   We consider six data sets, summarized in Table 1. The first five, Wine (Cortez et al., 2009), Adult (Dua and Graff, 2017), Marketing (Li et al., 2019), Restaurant (Das et al., ), and Titanic (Eaton and Haas, 1995), correspond to binary classification data sets from different domains. They cover numerical, categorical, and textual attributes and contain either a single attribute type or a mixture of types. These data sets were obtained from the UCI repository (Dua and Graff, 2017) and the CleanML benchmark (Li et al., 2019). The last data set, HTRU2 (Lyon et al., 2016), is purely numerical and also corresponds to binary classification; we use it in the context of adversarial noise because common adversarial attack methods apply only to numerical, not categorical, data. A detailed description of the data sets follows.

  • Wine: The data set consists of statistics about different types of wine based on physicochemical tests. The task is to predict if the quality of a type of wine is beyond average or not. The features are purely numerical.

  • Adult: The data set contains a set of US Census records of adults. The task is to predict if a person makes over $50,000 per year. The features are a mixture of categorical and numerical attributes.

  • Marketing: The data set comes from a survey on household income consisting of several demographic features. The task is to predict whether the annual gross income of a household is less than $25,000. The features are purely categorical.

  • Restaurant: The data set contains information about restaurants from Yelp. The task is to predict if the price range of a restaurant is “one dollar sign” on Yelp. The features are a mixture of categorical values and textual descriptions.

  • Titanic: The data set contains personal and ticket information of passengers. The task is to predict if a passenger survives or not. The features are a mixture of numerical, categorical and textual attributes.

  • HTRU2: The dataset contains statistics about a set of pulsar candidates collected during a pulsar survey. The task is to predict if a candidate is a real pulsar or not. The features are purely numerical.

We consider downstream ML pipelines over these data sets that use 80% of each data set as the training set, and the rest as test data. To reduce the effect of class imbalance, we undersample the unbalanced datasets where over 70% of the samples belong to one class. The numerical attributes are normalized to zero mean and unit variance before noise injection.
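The balancing step can be sketched as follows (the 70% threshold comes from the text; the function name and interface are our assumptions):

```python
import numpy as np

def undersample(X, y, threshold=0.7, seed=0):
    """Undersample the majority class when it exceeds `threshold` of the
    data (illustrative sketch of the balancing step described above)."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    major = classes[np.argmax(counts)]
    if counts.max() / len(y) <= threshold:
        return X, y                      # already balanced enough
    minor_n = counts.min()
    # Keep a random majority subset of the same size as the minority class.
    keep = np.concatenate([
        rng.choice(np.nonzero(y == major)[0], size=minor_n, replace=False),
        np.nonzero(y != major)[0],
    ])
    keep.sort()
    return X[keep], y[keep]
```

Normalization of the numerical attributes to zero mean and unit variance would then be applied before noise injection.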

Dataset Size Numerical Attributes Categorical Attributes Textual Attributes
Wine 4898 11 0 0
Adult 32561 5 9 0
Marketing 8993 0 13 0
Restaurant 12007 0 3 7
Titanic 891 2 5 3
HTRU2 17898 8 0 0
Table 1. Properties of the datasets in our experiments.

Noise Generation Process   We consider different types of noise. For random and systematic noise, we corrupt a fixed fraction of the cells in the noisy samples. For adversarial noise, we use data poisoning techniques at training time, and evasion attack methods at inference time:

  • Random Noise: For a categorical or textual attribute, the value of a corrupted cell is flipped to another value in the domain of that attribute. For a numerical attribute, we add Gaussian noise with zero mean and a constant standard deviation to the value of a corrupted cell.

  • Systematic Noise: For categorical and textual data, we randomly generate a predefined function which maps the value of a corrupted cell to another value in the same domain. The mapping depends on both the value of that attribute and the value of another pre-specified attribute. For a numerical attribute, we add a fixed, constant amount of noise to the value of a corrupted cell.

  • Adversarial Noise: At training time, the adversarial samples are generated by a back-gradient method (Muñoz-González et al., 2017). This type of noise is specific to the downstream model, and thus differs for each data-model combination. At inference time, we use the projected gradient descent (PGD) attack (Madry et al., 2018b), a white-box evasion attack method, to generate adversarial test samples. We use the implementation of the PGD attack from (Nicolae et al., 2018). The corruption injected by the PGD attack is bounded in infinity norm.

The default noise parameters (the fraction of corrupted cells and the noise magnitudes above) take one set of values at training time and another at inference time.
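A sketch of the random-noise injection described above (the function name and the `domains` interface are our assumptions; `sigma` is a placeholder for the constant noise level in the text):

```python
import numpy as np

def inject_random_noise(X, frac_cells=0.2, sigma=1.0, domains=None, seed=0):
    """Random cell-level corruption (illustrative sketch).

    X        -- (n, m) object array; numerical columns hold floats,
                categorical columns hold values from `domains[j]`
    domains  -- dict mapping categorical column index -> list of values;
                all other columns are treated as numerical
    """
    rng = np.random.default_rng(seed)
    X = X.copy()
    domains = domains or {}
    n, m = X.shape
    # Pick a random `frac_cells` fraction of cells to corrupt.
    mask = rng.random((n, m)) < frac_cells
    for i, j in zip(*np.nonzero(mask)):
        if j in domains:
            # Categorical: flip to a different value from the same domain.
            others = [v for v in domains[j] if v != X[i, j]]
            X[i, j] = others[rng.integers(len(others))]
        else:
            # Numerical: add zero-mean Gaussian noise.
            X[i, j] = float(X[i, j]) + rng.normal(0.0, sigma)
    return X
```

Systematic noise would replace the per-cell randomness with a fixed mapping (categorical) or a constant offset (numerical).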

Downstream Models   We consider the following downstream models:

  • Logistic regression (LR) with regularization parameter 1.

  • A Support Vector Machine (SVM) with a linear kernel and regularization parameter 1.

  • A fully-connected neural network (NN) with 2 hidden layers of size 100. We use a small model with 1 hidden layer of size 10 when we perform poisoning attacks due to the runtime complexity of the attack algorithm.

Baselines   We consider different baselines at training and inference time as the tasks are different. We compare against three unsupervised outlier detection methods for training time diagnostics:

  • Isolation Forest (IF): IF (Liu et al., 2008) is similar to Random Forest but targets outlier detection. The method randomly selects split points for each tree, and uses the average path length from a point to the root as its measure of normality.

  • One-Class SVM (OCSVM): OCSVM (Chen et al., 2001) is similar to SVM but only learns the boundary of the normal data, and detects outliers based on the learned boundary. We use an OCSVM model with a radial basis function kernel.

  • Robust Variational Autoencoder (RVAE): RVAE (Eduardo et al., 2019) is a state-of-the-art generative model that learns the joint distribution of the clean data. It explicitly assigns each cell a probability of being an outlier, and aggregates the probabilities of all cells in a sample to detect anomalous samples.

IF and OCSVM require the estimated proportion of outliers in the training set as a hyper-parameter, and we provide the actual proportion. For RVAE, we use the hyper-parameters from (Eduardo et al., 2019).

At test time, we compare against: 1) methods based on per class binary classifiers, 2) naïve confidence-based methods, and 3) adversarial example detection methods from adversarial learning.

Methods based on per class binary classifiers follow the same strategy as Picket but use different features.

  • Raw Feature (RF): The binary classifiers only have access to the raw features of the data.

  • RVAE: The binary classifiers use the cell-level outlier probabilities provided by RVAE as features.

  • RVAE+: The classifiers use a combination of the features from the two methods above.

We consider the following naïve methods:

  • Calibrated Confidence Score (CCS): CCS assumes that victim samples have lower confidence scores than correctly classified clean samples, and flags low-confidence samples as victims. The confidence scores are provided by the downstream models and calibrated by temperature scaling (Guo et al., 2017).

  • K-Nearest Neighbors (KNN): KNN assumes that a victim sample tends to receive a different prediction from its neighbors, and flags samples with a high fraction of disagreement, i.e., the fraction of the K nearest neighbors whose predictions differ. When searching for the nearest neighbors of a sample, we use different distances for the three attribute types: for numerical attributes, the distance is based on the difference between the two normalized values, capped beyond a threshold; for categorical attributes, the distance is 1 if the two values differ and 0 otherwise; for textual attributes, we use the cosine distance. K is set to a fixed constant.
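The mixed-type distance and disagreement fraction can be sketched as follows (the truncation cap for numerical attributes is a hypothetical choice, since the exact constants are not specified here; textual attributes are omitted for brevity):

```python
import numpy as np

def mixed_distance(a, b, num_idx, cat_idx, cap=1.0):
    """Mixed-type distance for the KNN baseline (illustrative sketch; the
    numerical truncation cap is a hypothetical value)."""
    d = 0.0
    for j in num_idx:
        # Numerical: absolute difference of normalized values, truncated.
        d += min(abs(float(a[j]) - float(b[j])), cap)
    for j in cat_idx:
        # Categorical: 0/1 mismatch distance.
        d += 0.0 if a[j] == b[j] else 1.0
    return d

def disagreement_fraction(x, X, preds, pred_x, k, num_idx, cat_idx):
    """Fraction of x's k nearest neighbors whose prediction differs."""
    dists = [mixed_distance(x, z, num_idx, cat_idx) for z in X]
    nearest = np.argsort(dists)[:k]
    return float(np.mean([preds[i] != pred_x for i in nearest]))
```

A sample whose disagreement fraction exceeds a chosen threshold would be flagged as a victim.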

We consider two methods of adversarial sample detection:

  • The Odds are Odd (TOAO): TOAO (Roth et al., 2019) assumes that adversarial samples are more sensitive to random noise than benign samples, i.e., adding random noise to an adversarial sample is more likely to change the prediction. It detects adversarial samples based on the change in the distribution of the logit values after the injection of random noise. It adds Gaussian, Bernoulli, and Uniform noise of different magnitudes and takes the majority vote over all noise sources.

  • Model with Outlier Class (MWOC) MWOC  (Grosse et al., 2017) assumes that the feature distribution of adversarial samples is different from that of benign samples. Therefore, it adds a new outlier class to the downstream model to characterize the distribution of adversarial samples.

Metrics   For training set outlier detection, we report the area under the receiver operating characteristic curve (AUROC). We use AUROC since we want to compare the performance of outlier detection in various threshold settings. AUROC is a measure of the accuracy of a diagnostic test. Typical AUROC values range from 0.5 (no diagnostic ability, i.e., random guess) to 1.0 (perfect diagnostic ability). Intuitively, the AUROC is the probability that the criterion value of a sample drawn at random from the population of those with a target condition (in our case the sample is an outlier) is larger than the criterion value of another sample drawn at random from the population where the target condition is not met. An AUROC greater than 0.8 is considered excellent for detection (Hosmer and Lemeshow, 2000). We also consider the test accuracy of downstream models.
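The probabilistic interpretation of AUROC given above translates directly into code (a naive O(n²) sketch; the function name is ours):

```python
def auroc(scores_pos, scores_neg):
    """AUROC via its probabilistic interpretation: the probability that a
    randomly drawn outlier scores higher than a randomly drawn clean
    sample (ties count half)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            wins += 1.0 if p > n else (0.5 if p == n else 0.0)
    return wins / (len(scores_pos) * len(scores_neg))
```

For example, a detector whose outlier scores never overlap the clean scores attains 1.0, while one that scores both groups identically attains 0.5 (random guessing).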

For victim sample detection at test time, we report the F1 score of the classification between correctly classified samples and victim samples. The F1 score is a good fit since most of the victim sample detection methods are based on classifiers and there is no explicit threshold.

All experiments are repeated five times with different random seeds that control train-test split and noise injection, and the means of the five runs are reported.

Hyper-parameters of PicketNet   PicketNet is not sensitive to hyper-parameters in most cases. The default hyper-parameters we use in the experiments are shown in Table 2. For purely numerical datasets, we reduce the encoding dimension to 8, and for HTRU2, we reduce the number of self-attention layers to 1. For the other datasets, we always use the default hyper-parameters. We use the Adam optimizer (Kingma and Ba, 2014) for training. The learning rate follows a warmup-then-decay schedule that depends on the encoding dimension and the index of the training step: it increases over the first few steps and then decreases. Typically, PicketNet takes 100 to 500 epochs to converge, depending on the dataset.
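The described schedule matches the warmup-then-decay form commonly used for transformers; a sketch (the warmup length is a hypothetical choice, not a value from the paper):

```python
def noam_lr(step, d_model=64, warmup=4000):
    """Warmup-then-decay learning-rate schedule (a sketch of the
    transformer-style schedule the text describes; `warmup` is a
    hypothetical choice). The rate grows linearly for the first `warmup`
    steps and then decays with the inverse square root of the step index."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```

The peak rate occurs at `step == warmup`, where the two branches of the `min` coincide.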

Hyper-Parameter Value
Number of Self-Attention Layers 6
Number of Attention Heads 2
Encoding Dimension 64
Number of Hidden Layers in Each Feedforward Network 1
Dimension of the Hidden Layers in Each Feedforward Network 64
Dropout 0.1
Size of the Negative Sample Set 4
Warm-up Epochs for Loss-Based Filtering 50
Loss Recording Epochs 20
Table 2. Default hyper-parameters for PicketNet.

6.2. Training Time Diagnostics Evaluation

We first evaluate the performance of different methods on detecting erroneous points in the training data, without considering a downstream model. This setup is similar to that of error detection. We then focus on training time diagnostics and consider the effectiveness of different methods at safeguarding against degradation of downstream ML pipelines due to data corruptions.

Figure 4. AUROC of outlier detection for random noise.
Figure 5. AUROC of outlier detection for systematic noise.
Figure 6. AUROC of outlier detection for adversarial noise.

Outlier Detection Performance   We evaluate the performance of Picket and the competing methods on training time outlier detection. Figures 4, 5, and 6 depict the AUROC obtained by the methods for different noise settings. For the settings with random and systematic noise, we corrupt 20% of the training set and perform the experiments on five data sets with various types of attributes. For the setting with a data poisoning attack, we add poisoned samples to the training set until they account for 20% of it. Since data poisoning is limited to numerical data, we only evaluate on two data sets.

Picket is the only approach that consistently achieves an AUROC of more than 0.8 for all data sets and all noise settings. Other methods achieve comparable performance in some settings, but they are not consistent and in many cases their diagnostic power drops close to that of random guessing. It is worth noting that our approach outperforms the others by over 0.4 AUROC on Restaurant in both the random and systematic noise settings. These results are expected since Restaurant contains several textual attributes and we employ techniques from natural language processing to build our model. For data sets that do not contain many textual attributes, IF and OCSVM work quite well in the random noise setting, but their performance drops when the noise is systematic. RVAE performs poorly for purely categorical data and data that contain text. For the other data sets, it performs well on random noise but still suffers in the case of systematic noise. In the presence of poisoned data, we filter out samples with abnormally low outlier scores for all the competing methods, due to the concentration property of the poisoned samples. IF performs well on Wine but poorly on HTRU2, while OCSVM shows the opposite behavior. A possible reason is that the two data sets have different types of structure, and each method is good at capturing only one of them. RVAE shows poor performance on both datasets.

Effect on Downstream Models   We also study the effect on the downstream models if we filter noisy samples based on the detection of the methods. For each method, we filter 20% of the samples with highest outlier scores, and train different downstream models on the resulting feasible set. For each data set the test set is fixed and contains only clean data. As reference points, we also include the test accuracy when 1) the training data is clean without corruption (CL), and 2) the training data is corrupted but no filtering (NF) is performed. These two reference baselines correspond to the desired best-case and worst-case performance.

Dataset Downstream Model IF OCSVM RVAE Picket CL NF
Wine LR 0.7261 0.6976 0.7051 0.7312 0.7349 0.6745
SVM 0.7286 0.6933 0.7082 0.7310 0.7386 0.6727
NN 0.7210 0.6894 0.7035 0.7320 0.7365 0.6722
HTRU2 LR 0.8884 0.9015 0.8811 0.9067 0.9396 0.8799
SVM 0.8884 0.8979 0.8887 0.9232 0.9424 0.8832
NN 0.8671 0.8707 0.8643 0.9000 0.9280 0.8646
Table 3. Test accuracy of downstream models under adversarial poisoning attacks and different filtering methods.

First, we consider the case of adversarial corruptions, since this type of noise corresponds to the worst case for the downstream trained model. We measure the test accuracy of the downstream models when poisoned data is injected into the training stage. The results are shown in Table 3. Comparing CL with NF, we see an average drop of six accuracy points if corruptions are ignored and no filtering is applied. We find that all methods reduce the negative impact of the poisoned data and bring up the test accuracy. Nevertheless, Picket outperforms all competing baselines and yields test accuracy improvements of more than three points in some cases. Picket is able to recover most of the accuracy loss for all models on the Wine data set and comes very close to CL on HTRU2. All other methods exhibit smaller accuracy improvements and do not behave consistently across data sets.

We also consider the cases of random and systematic noise. These types of noise do not directly attack the downstream model. Moreover, most ML models are somewhat robust to these types of noise. As a result, we expect to see a small gap in the test accuracy between CL and NF, and all methods to perform comparably. We consider these setups for completeness.

We first focus on random noise. The results are shown in Table 4. As expected, in the presence of random noise, the performance of the downstream models drops in some cases and remains roughly the same in others, comparing CL and NF. In the cases where the downstream accuracy drops, filtering helps most of the time. To better understand the performance of the different methods, we consider the average change in accuracy relative to the two reference points across all data sets and models. The aggregated result is shown in Table 5. Among all the methods, our model has the most positive effect on the downstream accuracy: its improvement over doing nothing is the largest, and its accuracy is closest to that obtained with clean data. This is expected since our method is the most powerful at detecting outliers.

If we compare the performance of Picket and NF in Table 4 for Neural Networks, we see that for Adult, Titanic, and Restaurant Picket exhibits slightly worse test accuracy. These results are attributed to the selected thresholds for filtering in Picket (see Section 5). In Figure 7, we show the test accuracy of the downstream neural network for different levels of the Picket threshold. We can see that for some datasets, random noise serves as regularization and improves the performance of the downstream model. Therefore, we need to tune the threshold to achieve the best performance.

Dataset Downstream Model IF OCSVM RVAE Picket CL NF
Wine LR 0.7322 0.7314 0.7327 0.7337 0.7349 0.6806
SVM 0.7347 0.7349 0.7322 0.7359 0.7386 0.6647
NN 0.7892 0.7943 0.7914 0.7808 0.8027 0.7808
Adult LR 0.8096 0.8224 0.8138 0.8221 0.8242 0.8031
SVM 0.7920 0.8129 0.7970 0.8092 0.8180 0.7885
NN 0.7800 0.7764 0.7749 0.7742 0.7945 0.7879
Restaurant LR 0.7294 0.7308 0.7341 0.7371 0.7396 0.7391
SVM 0.7329 0.7332 0.7311 0.7372 0.7369 0.7375
NN 0.7142 0.7107 0.7101 0.7175 0.7314 0.7213
Marketing LR 0.7718 0.7754 0.7751 0.7759 0.7777 0.7765
SVM 0.7702 0.7669 0.7771 0.7645 0.7779 0.7773
NN 0.7268 0.7341 0.7337 0.7331 0.7336 0.7315
Titanic LR 0.7855 0.7832 0.7788 0.7844 0.7855 0.7721
SVM 0.8112 0.8056 0.8078 0.8112 0.8156 0.8045
NN 0.7464 0.7486 0.7542 0.7542 0.7922 0.7654
Table 4. Test accuracy of downstream models under random noise and different filtering methods.
Reference Point IF OCSVM RVAE Picket
CL -0.0118 -0.0095 -0.0106 -0.0088
NF 0.0063 0.0087 0.0075 0.0093
Table 5. Average test time accuracy change after filtering with different methods against CL and NF for random noise.
Figure 7. Changes in test accuracy of a neural network when filtering different fractions of the points; random noise.

We then turn our attention to systematic noise. The results are shown in Table 6. We find that the corruption we inject does not have much effect on the downstream model, therefore any filtering is unnecessary and uninteresting in that case. All measurements are within the margin of statistical error.

Dataset Downstream Model IF OCSVM RVAE Picket CL NF
Wine LR 0.7353 0.7337 0.7378 0.7351 0.7349 0.7365
SVM 0.7331 0.7324 0.7359 0.7306 0.7386 0.7345
NN 0.7918 0.7802 0.7816 0.7882 0.8029 0.7933
Adult LR 0.8211 0.8148 0.8219 0.8200 0.8242 0.8204
SVM 0.8106 0.8092 0.8141 0.8103 0.8180 0.8119
NN 0.7760 0.7736 0.7755 0.7761 0.7880 0.7775
Restaurant LR 0.7333 0.7351 0.7333 0.7337 0.7396 0.7377
SVM 0.7355 0.7324 0.7344 0.7344 0.7369 0.7402
NN 0.7200 0.7169 0.7102 0.7161 0.7239 0.7267
Marketing LR 0.7721 0.7718 0.7722 0.7739 0.7777 0.7755
SVM 0.7568 0.7584 0.7703 0.7598 0.7779 0.7709
NN 0.7347 0.7420 0.7392 0.7337 0.7396 0.7316
Titanic LR 0.7821 0.7754 0.7866 0.7899 0.7855 0.7788
SVM 0.8067 0.8067 0.8101 0.8112 0.8156 0.8112
NN 0.7844 0.7676 0.7777 0.7665 0.7832 0.7832
Table 6. Test accuracy of downstream models under systematic noise and different filtering methods.

6.3. Inference Time Diagnostics Evaluation

We compare Picket against the competing methods on victim sample detection under different types of noise at inference time. The F1 scores under random, systematic, and adversarial noise are reported in Tables 7, 8, and 9, respectively. The test set consists of victim samples and good samples that are correctly classified by the downstream model; the good samples can either be clean or noisy. Victim samples account for 50% of the test set, and the good samples for the other 50%. The task is to pick out the victim samples. We provide an augmented training set to all the methods. The augmented set is generated from a clean training set; it has the same composition as the test set but contains only random noise, since we assume that the types of systematic and adversarial noise are unknown when the system is deployed. For the defense against adversarial noise, we augment the datasets with low-magnitude random noise to match the noise level of the adversarial noise. Some methods use the augmented training set to train their models (RF, RVAE, RVAE+, MWOC, Picket), while the others use it to find a good threshold (CCS, KNN, TOAO). RVAE is not applicable to textual attributes at this stage since the lookup table it learns from the training set cannot serve new data it has never seen.

Wine LR 0.7698 0.8159 0.8357 0.6666 0.6843 0.7322 0.7458 0.8307
SVM 0.7793 0.8227 0.8421 0.6665 0.6987 0.5494 0.7606 0.8271
NN 0.7088 0.7616 0.7843 0.5156 0.6662 0.3176 0.7970 0.7682
Adult LR 0.8427 0.7110 0.8578 0.6634 0.7679 0.2895 0.6895 0.8636
SVM 0.8643 0.7247 0.8738 0.6650 0.8153 0.3991 0.7357 0.8769
NN 0.8234 0.6895 0.8417 0.4944 0.6861 0.2562 0.7895 0.8454
Restaurant LR 0.7734  –# 0.7514 0.6678 0.6071 0.7478 0.8266
SVM 0.7734 0.7514 0.6678 0.5186 0.7478 0.8266
NN 0.7565 0.7052 0.6600 0.6624 0.7042 0.8149
Marketing LR 0.8486 0.6493 0.8502 0.7659 0.7593 0.6288 0.8216 0.8623
SVM 0.8491 0.6833 0.8536 0.7687 0.7866 0.6402 0.8245 0.8630
NN 0.7958 0.6397 0.8065 0.6706 0.7107 0.5266 0.7194 0.8183
Titanic LR 0.8417 0.6860 0.6584 0.6537 0.7922 0.8590
SVM 0.8389 0.6431 0.6457 0.5533 0.7846 0.8482
NN 0.8298 0.6619 0.6512 0.3306 0.7798 0.8391
  • *DM is short for Downstream Model. #RVAE is not applicable to textual attributes.

Table 7. F1 scores of victim sample detection at inference time under random noise.
Wine LR 0.7641 0.5971 0.7720 0.7634 0.6646 0.5360 0.7501 0.8025
SVM 0.7730 0.6011 0.7747 0.7703 0.6656 0.5629 0.7705 0.8001
NN 0.6686 0.5903 0.6643 0.6587 0.6643 0.6639 0.6803 0.7109
Adult LR 0.8022 0.5827 0.7855 0.6669 0.7198 0.0077 0.7160 0.8086
SVM 0.8285 0.5759 0.8110 0.2339 0.7777 0.6667 0.7228 0.8306
NN 0.7886 0.4606 0.7798 0.5799 0.6821 0.1181 0.6558 0.7949
Restaurant LR 0.7813  –# 0.7623 0.6680 0.4819 0.7619 0.8234
SVM 0.7892 0.7470 0.6702 0.6664 0.7762 0.8307
NN 0.7559 0.7100 0.6602 0.6244 0.6919 0.7968
Marketing LR 0.8553 0.6355 0.8600 0.7744 0.7617 0.6288 0.8336 0.8679
SVM 0.8499 0.7003 0.8564 0.7861 0.7877 0.6387 0.8323 0.8677
NN 0.7881 0.6080 0.7996 0.6792 0.7135 0.5322 0.6826 0.8077
Titanic LR 0.8138 0.7109 0.6491 0.3598 0.7741 0.8272
SVM 0.8518 0.6281 0.6317 0.6367 0.8328 0.8495
NN 0.8099 0.6668 0.6436 0.3202 0.6780 0.8176
  • *DM is short for Downstream Model. #RVAE is not applicable to textual attributes.

Table 8. F1 scores of victim sample detection at inference time under systematic noise.
Wine LR 0.8282 0.6975 0.8314 0.8273 0.6621 0.6142 0.8089 0.8246
SVM 0.8344 0.6899 0.8372 0.8017 0.6655 0.6450 0.8164 0.8361
NN 0.7683 0.6795 0.7653 0.6083 0.6637 0.6679 0.6559 0.7763
HTRU2 LR 0.9250 0.6859 0.9238 0.8878 0.6660 0.6557 0.9148 0.9291
SVM 0.9626 0.7055 0.9622 0.8531 0.6643 0.6577 0.9275 0.9677
NN 0.9441 0.6854 0.9444 0.7417 0.6654 0.5387 0.9275 0.9543
  • *DM is short for Downstream Model.

Table 9. F1 scores of victim sample detection at inference time under adversarial noise.

From the tables, we can see that Picket has the best performance in most cases. By comparing RF with our method, we show that the reconstruction loss features provided by PicketNet are good signals for identifying victim samples. Such signals are better than those provided by RVAE, since our method outperforms RVAE+ most of the time. TOAO performs consistently poorly since the assumption it relies on does not hold for the downstream models and data sets we consider. It works for image classification with complex convolutional neural networks under adversarial settings, since adding random noise to images can eliminate the effect of adversarial noise. However, for tabular data sets and models that are not as complex, especially when the noise is not adversarial, adding random noise does not make a big difference. Another method from the adversarial learning literature (MWOC) works well in some cases even when the noise is not adversarial.

6.4. Micro-Benchmarks

Effectiveness of the Two-Stream Self-Attention   We perform an ablation study to validate the effectiveness of the two-stream self-attention. We evaluate the performance of outlier detection with only one stream and with both. The results are depicted in Figure 8. In the one-stream case, we simply let the output of each self-attention layer be the output of the value stream alone or of the schema stream alone, instead of the mixture of the two. We use three setups: Wine with adversarial noise, Adult with systematic noise, and Marketing with random noise.

Figure 8. Outlier detection under different stream settings.

From Figure  8, we see that for Adult and Marketing, PicketNet with two streams outperforms both one-stream options. For Wine, the value stream alone works well, but combining the two streams does not impair the performance of the model. Neither one-stream option is clearly superior to the other: in some cases the value stream performs better than the schema stream, and in others the opposite holds.
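The ablation settings can be sketched with a simplified NumPy version of the two streams; the sum used to combine the streams and the single attention head are our simplifications, and the actual PicketNet architecture is more elaborate.

```python
import numpy as np

def attention(q, k, v):
    """Scaled dot-product attention over a sample's attributes."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over attributes
    return weights @ v

def two_stream_output(value_emb, schema_emb, mode="both"):
    """Per-attribute outputs under the ablation settings.

    value_emb:  (n_attrs, d) embeddings that carry the cell values
    schema_emb: (n_attrs, d) embeddings that carry only attribute identity
    """
    value_out = attention(value_emb, value_emb, value_emb)    # value stream
    schema_out = attention(schema_emb, value_emb, value_emb)  # schema stream
    if mode == "value":
        return value_out
    if mode == "schema":
        return schema_out
    return value_out + schema_out  # both streams combined (sum is a simplification)
```

The key difference between the streams is the query: the value stream lets an attribute's own value attend to the other cells, while the schema stream queries only with the attribute's identity, so it captures column-level structure independent of the (possibly corrupted) cell value.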

Effectiveness of the Early Filtering Mechanism   We validate the effectiveness of early filtering by comparing the performance of outlier detection at the early stage of PicketNet’s training to that after convergence. The results are shown in Figure  9. We use the same setups as in the previous micro-benchmark.

Figure  9 shows that filtering at early stages consistently outperforms filtering after convergence. The reason is that in the early stage of training, the model is less likely to have overfit to the input, so the reconstruction loss of outliers differs more from that of clean samples. However, the gaps between early and late filtering are not excessively large, which indicates that PicketNet is robust and tends not to overfit in the presence of noise.

Figure 9. Early vs. after-convergence filtering.
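The filtering step itself can be sketched as follows, assuming per-attribute reconstruction losses recorded at an early checkpoint; the `contamination` parameter and the sum aggregation are our illustrative choices, not necessarily those used by Picket.

```python
import numpy as np

def flag_outliers_early(per_attr_losses, contamination=0.1):
    """Flag training samples with the highest aggregate reconstruction loss.

    per_attr_losses: (n_samples, n_attributes) reconstruction losses
    recorded at an early training checkpoint, before the model can
    overfit to corrupted entries.
    """
    sample_loss = per_attr_losses.sum(axis=1)      # attribute-wise aggregation
    k = int(np.ceil(contamination * len(sample_loss)))
    threshold = np.partition(sample_loss, -k)[-k]  # k-th largest loss
    return sample_loss >= threshold                # True = filter out
```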

Effectiveness of Per-Class Victim Sample Detectors   We compare the performance of our per-class detectors against a unified detector and a score-based detector. The unified detector uses a single logistic regression model over the same features to distinguish between good and victim samples, regardless of the downstream prediction. The score-based detector follows the logic of the training-time outlier detector, i.e., it aggregates the reconstruction losses attribute-wise and considers samples with high loss as victims. We perform the comparison on three datasets with all three downstream models: Wine with adversarial noise, Adult with systematic noise, and Marketing with random noise.

The results are shown in Table  10. The per-class detectors outperform the other two, which validates the effectiveness of having one detector per class. The unified detector performs poorly because the victim samples in one class differ statistically from those in the other. The score-based detector does not work well since it only has access to the noise level of the samples and does not consider the connection between corruptions and the downstream prediction.

Dataset Downstream Model Per-Class Detectors Unified Detector Score-based Detector
Wine LR 0.8195 0.7473 0.6938
Wine SVM 0.8366 0.7787 0.7160
Wine NN 0.7777 0.5687 0.6771
Adult LR 0.8637 0.7362 0.7666
Adult SVM 0.8765 0.7428 0.7703
Adult NN 0.8455 0.7196 0.7511
Marketing LR 0.8681 0.7640 0.7197
Marketing SVM 0.8685 0.7797 0.7287
Marketing NN 0.8073 0.7076 0.7007
Table 10. A comparison between the per-class detectors, the unified detector, and the score-based detector on inference time victim sample detection.
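The routing that distinguishes the per-class detectors from the unified detector can be sketched as below. A tiny gradient-descent logistic regression stands in for the paper's logistic regression detectors; the helper names are ours.

```python
import numpy as np

class LogisticDetector:
    """Minimal logistic-regression detector trained with gradient descent."""
    def __init__(self, lr=0.5, epochs=500):
        self.lr, self.epochs = lr, epochs

    def fit(self, X, y):
        X = np.hstack([X, np.ones((len(X), 1))])  # append bias term
        self.w = np.zeros(X.shape[1])
        for _ in range(self.epochs):
            p = 1.0 / (1.0 + np.exp(-X @ self.w))
            self.w -= self.lr * X.T @ (p - y) / len(y)
        return self

    def predict(self, X):
        X = np.hstack([X, np.ones((len(X), 1))])
        return (X @ self.w > 0).astype(int)

def fit_per_class_detectors(features, downstream_preds, is_victim):
    """Train one victim-sample detector per downstream predicted class."""
    return {c: LogisticDetector().fit(features[downstream_preds == c],
                                      is_victim[downstream_preds == c])
            for c in np.unique(downstream_preds)}

def predict_victims(detectors, features, downstream_preds):
    """Route each sample to the detector of its predicted class."""
    out = np.zeros(len(features), dtype=int)
    for c, det in detectors.items():
        idx = downstream_preds == c
        if idx.any():
            out[idx] = det.predict(features[idx])
    return out
```

When the victim samples of one class occupy the feature region of the clean samples of another class, a single unified model cannot separate them, while the per-class routing above can; this mirrors the gap between the first two columns of Table 10.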

Runtime and Scalability   We report the training time of PicketNet for each dataset in Table  11. The device we use is an NVIDIA Tesla V100-PCIE GPU with 32GB of memory. Note that the current runtime has not been fully optimized.

We also study the attribute-wise scalability of PicketNet using synthetic datasets. The datasets have a number of attributes ranging from 2 to 20 in steps of one, while all other settings are the same (the relevant embedding dimensions are fixed to 8). We report the training time of 100 epochs in Figure  10. The runtime grows roughly quadratically as the number of attributes increases. This is expected, since modeling the dependencies between each attribute and all the others yields quadratic complexity. When the number of attributes is excessively large, we can apply simple methods, such as computing the correlations between attributes, to split the attributes into groups so that only attributes within the same group exhibit correlations. Then, we can apply PicketNet to learn the structure of each group separately.

Dataset Wine Adult Restaurant Marketing Titanic HTRU2
Training Time (sec) 1953 8256 3794 4581 1693 189
Table 11. Training time of PicketNet for each dataset.
Figure 10. Attribute-wise scalability of PicketNet.
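The correlation-based grouping mentioned above can be sketched as follows: build an adjacency relation from the absolute pairwise correlations and take connected components, so each group can be handled by a separate PicketNet instance. The threshold value is an illustrative assumption.

```python
import numpy as np

def group_attributes(X, threshold=0.3):
    """Split attributes into groups of mutually correlated columns.

    Attributes whose absolute Pearson correlation exceeds `threshold`
    are linked, and each connected component becomes one group, so
    PicketNet can be trained on each smaller group independently.
    """
    corr = np.abs(np.corrcoef(X, rowvar=False))
    n = corr.shape[0]
    adjacency = corr > threshold
    groups, seen = [], set()
    for start in range(n):
        if start in seen:
            continue
        stack, comp = [start], []
        while stack:  # depth-first search over correlated attributes
            i = stack.pop()
            if i in seen:
                continue
            seen.add(i)
            comp.append(i)
            stack.extend(j for j in range(n) if adjacency[i, j] and j not in seen)
        groups.append(sorted(comp))
    return groups
```

Since the runtime is quadratic in the number of attributes per group, splitting d attributes into g roughly equal groups cuts the attention cost by about a factor of g.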

7. Related Work

Data Validation Systems for ML   TFX  (Baylor et al., 2017; Breck et al., ) and Deequ  (Schelter et al., 2019) propose data validation modules that rely on user-defined constraints and simple anomaly detection. CleanML  (Li et al., 2019) studies how the quality of training data affects the performance of downstream models. These works focus on simple constraints such as data types, value ranges, and one-column statistics, and ignore the structure of the data. NaCL  (Khosravi et al., 2019) and CPClean  (Karlaš et al., 2020) propose algorithms to deal with missing entries, and the effect of missing entries is analyzed theoretically in (Liu et al., 2020). These works are orthogonal to ours since they only consider missing entries.

Learning Dependencies with Attention Mechanisms   Attention mechanisms have been widely used in the field of natural language processing to learn the dependencies between tokens (Vaswani et al., 2017; Yang et al., 2019). Recently, we introduced AimNet (Wu et al., 2020), which demonstrates that attention mechanisms are also effective in learning the dependencies between attributes in structured tabular data. AimNet employs attention techniques to impute missing values in tabular data and achieves state-of-the-art performance. However, AimNet is comparatively simple and captures only schema-level dependencies. Furthermore, training AimNet requires access to clean training data, and AimNet does not employ any robust-training mechanism to tolerate noise.

Outlier Detection Methods   Outlier detection for tabular data has been studied for years, and many rule-based methods have been proposed  (Ilyas and Chu, 2015; Rahm and Do, 2000; Fan and Geerts, 2012). Learning-based outlier detection has become popular recently and focuses on semi-supervised or unsupervised approaches. Semi-supervised methods such as those proposed in  (Heidari et al., 2019; Xue et al., 2010; Mahdavi et al., 2019) still require a human in the loop to explicitly label some data. Isolation Forest (Liu et al., 2008) and One-Class SVM (Chen et al., 2001) are simple unsupervised methods that are widely used. Autoencoder-based outlier detection methods  (An and Cho, 2015; Sabokrou et al., 2016; Eduardo et al., 2019) are most relevant to our work since they also rely on the reconstruction of the input; among them, RVAE  (Eduardo et al., 2019) works best for mixed-type tabular data.

Adversarial Attacks and Defenses   Training time attacks  (Koh et al., 2018; Muñoz-González et al., 2017; Biggio et al., 2012) add poisoned samples to corrupt the target model. Filtering-based defenses  (Steinhardt et al., 2017; Diakonikolas et al., 2018) remove suspicious samples during training based on training statistics. Inference time attacks  (Madry et al., 2018b; Carlini and Wagner, 2017; Moosavi-Dezfooli et al., 2016) add small perturbations to test samples to fool the classifier. Efforts have been made to improve the robustness of the model through training data augmentation  (Goodfellow et al., 2014; Madry et al., 2018a) or modifications to the model  (Xiao et al., 2019; Pang et al., 2020, 2019). Another group of defenses, which detect adversarial samples at inference time, is more directly related to our work. Roth et al.  (Roth et al., 2019) and Hu et al.  (Hu et al., 2019) add random noise to input samples and detect suspicious ones based on the changes in the logit values. Grosse et al.  (Grosse et al., 2017) assume that adversarial samples have a different distribution from benign samples and add another class to the downstream classifier to detect them.

8. Conclusion

We introduced Picket, a first-of-its-kind system that enables data diagnostics for machine learning pipelines over tabular data. We showed that Picket can safeguard against data corruptions that lead to degradation either during training or deployment. To design Picket, we introduced PicketNet, a novel self-supervised deep learning model that corresponds to a Transformer network for tabular data. Picket is designed as a plugin that can increase the robustness of any machine learning pipeline.


  • J. An and S. Cho (2015) Variational autoencoder based anomaly detection using reconstruction probability. Special Lecture on IE 2 (1). Cited by: §7.
  • S. Arora, Y. Liang, and T. Ma (2019) A simple but tough-to-beat baseline for sentence embeddings. (English (US)). Note: 5th International Conference on Learning Representations, ICLR 2017 ; Conference date: 24-04-2017 Through 26-04-2017 Cited by: §4.1.
  • D. Baylor, E. Breck, H. Cheng, N. Fiedel, C. Y. Foo, Z. Haque, S. Haykal, M. Ispir, V. Jain, L. Koc, C. Y. Koo, L. Lew, C. Mewald, A. N. Modi, N. Polyzotis, S. Ramesh, S. Roy, S. E. Whang, M. Wicke, J. Wilkiewicz, X. Zhang, and M. Zinkevich (2017) TFX: a TensorFlow-based production-scale machine learning platform. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’17, New York, NY, USA, pp. 1387–1395. External Links: ISBN 9781450348874, Link, Document Cited by: §1, §2.1, §7.
  • B. Biggio, B. Nelson, and P. Laskov (2012) Poisoning attacks against support vector machines. In Proceedings of the 29th International Coference on International Conference on Machine Learning, ICML’12, Madison, WI, USA, pp. 1467–1474. External Links: ISBN 9781450312851 Cited by: §1, §7.
  • C. M. Bishop (1995) Training with noise is equivalent to tikhonov regularization. Neural Comput. 7 (1), pp. 108–116. External Links: ISSN 0899-7667, Link, Document Cited by: §1.
  • P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov (2017) Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5, pp. 135–146. External Links: ISSN 2307-387X Cited by: §4.1.
  • [7] E. Breck, N. Polyzotis, S. Roy, S. E. Whang, and M. Zinkevich Data validation for machine learning. In MLSys-19, Cited by: §1, §1, §2.1, §7.
  • N. Carlini and D. Wagner (2017) Towards evaluating the robustness of neural networks. In 2017 ieee symposium on security and privacy (sp), pp. 39–57. Cited by: §7.
  • Y. Chen, X. S. Zhou, and T. S. Huang (2001) One-class SVM for learning in image retrieval. In Proceedings 2001 International Conference on Image Processing (Cat. No. 01CH37205), Vol. 1, pp. 34–37. Cited by: §2.1, 2nd item, §7.
  • P. Cortez, A. Cerdeira, F. Almeida, T. Matos, and J. Reis (2009) Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems 47 (4), pp. 547–553. Cited by: §6.1.
  • [11] S. Das, A. Doan, P. S. G. C., C. Gokhale, P. Konda, Y. Govind, and D. Paulsen The magellan data repository. University of Wisconsin-Madison. Note: https://sites.google.com/site/anhaidgroup/projects/data Cited by: §6.1.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §2.2, §3.2.
  • I. Diakonikolas, G. Kamath, D. M. Kane, J. Li, A. Moitra, and A. Stewart (2017) Being robust (in high dimensions) can be practical. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 999–1008. Cited by: §1.
  • I. Diakonikolas, G. Kamath, D. M. Kane, J. Li, J. Steinhardt, and A. Stewart (2018) Sever: a robust meta-algorithm for stochastic optimization. arXiv preprint arXiv:1803.02815. Cited by: §7.
  • D. Dua and C. Graff (2017) UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences. External Links: Link Cited by: §6.1.
  • J. P. Eaton and C. A. Haas (1995) Titanic, triumph and tragedy. WW Norton & Company. Cited by: §6.1.
  • S. Eduardo, A. Nazábal, C. K. I. Williams, and C. Sutton (2019) Robust variational autoencoders for outlier detection in mixed-type data. ArXiv abs/1907.06671. Cited by: §1, §1, §2.1, 3rd item, §6.1, §7.
  • D. Erhan, Y. Bengio, A. Courville, P. Manzagol, P. Vincent, and S. Bengio (2010) Why does unsupervised pre-training help deep learning?. Journal of Machine Learning Research 11 (Feb), pp. 625–660. Cited by: §3.2.
  • W. Fan and F. Geerts (2012) Foundations of data quality management. Morgan & Claypool Publishers. Cited by: §7.
  • I. J. Goodfellow, J. Shlens, and C. Szegedy (2014) Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572. Cited by: §7.
  • K. Grosse, P. Manoharan, N. Papernot, M. Backes, and P. McDaniel (2017) On the (statistical) detection of adversarial examples. arXiv preprint arXiv:1702.06280. Cited by: §1, §2.1, 2nd item, §7.
  • C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger (2017) On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1321–1330. Cited by: 1st item.
  • A. Heidari, J. McGrath, I. F. Ilyas, and T. Rekatsinas (2019) Holodetect: few-shot learning for error detection. In Proceedings of the 2019 International Conference on Management of Data, pp. 829–846. Cited by: §1, §1, §2.1, §7.
  • D. W. Hosmer and S. Lemeshow (2000) Applied logistic regression. John Wiley and Sons. External Links: ISBN 0471356328, 9780471356325 Cited by: §1, §6.1.
  • S. Hu, T. Yu, C. Guo, W. Chao, and K. Q. Weinberger (2019) A new defense against adversarial images: turning a weakness into a strength. In Advances in Neural Information Processing Systems, pp. 1633–1644. Cited by: §7.
  • I. F. Ilyas and X. Chu (2015) Trends in cleaning relational data: consistency and deduplication. Foundations and Trends in Databases 5 (4), pp. 281–393. Cited by: §7.
  • B. Karlaš, P. Li, R. Wu, N. M. Gürel, X. Chu, W. Wu, and C. Zhang (2020) Nearest neighbor classifiers over incomplete information: from certain answers to certain predictions. External Links: 2005.05117 Cited by: §2.1, §7.
  • P. Khosravi, Y. Liang, Y. Choi, and G. Van den Broeck (2019) What to expect of classifiers? reasoning about logistic regression with missing features. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pp. 2716–2724. External Links: Document, Link Cited by: §7.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §6.1.
  • P. W. Koh, J. Steinhardt, and P. Liang (2018) Stronger data poisoning attacks break data sanitization defenses. arXiv preprint arXiv:1811.00741. Cited by: §1, §1, §1, §4.2, §7.
  • P. Li, X. Rao, J. Blase, Y. Zhang, X. Chu, and C. Zhang (2019) CleanML: a benchmark for joint data cleaning and machine learning [experiments and analysis]. arXiv preprint arXiv:1904.09483. Cited by: §2.1, §6.1, §7.
  • F. T. Liu, K. M. Ting, and Z. Zhou (2008) Isolation forest. In 2008 Eighth IEEE International Conference on Data Mining, pp. 413–422. Cited by: §2.1, 1st item, §7.
  • Z. Liu, J. Park, N. Palumbo, T. Rekatsinas, and C. Tzamos (2020) Robust mean estimation under coordinate-level corruption. External Links: 2002.04137 Cited by: §2.1, §7.
  • R. J. Lyon, B. Stappers, S. Cooper, J. Brooke, and J. Knowles (2016) Fifty years of pulsar candidate selection: from simple filters to a new principled real-time classification approach. Monthly Notices of the Royal Astronomical Society 459 (1), pp. 1104–1123. Cited by: §6.1.
  • A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu (2018a) Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, External Links: Link Cited by: §7.
  • A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu (2018b) Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, External Links: Link Cited by: 3rd item, §7.
  • M. Mahdavi, Z. Abedjan, R. Castro Fernandez, S. Madden, M. Ouzzani, M. Stonebraker, and N. Tang (2019) Raha: a configuration-free error detection system. In Proceedings of the 2019 International Conference on Management of Data, SIGMOD ’19, New York, NY, USA, pp. 865–882. External Links: ISBN 9781450356435, Link, Document Cited by: §1, §2.1, §7.
  • S. Moosavi-Dezfooli, A. Fawzi, and P. Frossard (2016) DeepFool: a simple and accurate method to fool deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2574–2582. Cited by: §7.
  • L. Muñoz-González, B. Biggio, A. Demontis, A. Paudice, V. Wongrassamee, E. C. Lupu, and F. Roli (2017) Towards poisoning of deep learning algorithms with back-gradient optimization. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pp. 27–38. Cited by: §1, 3rd item, §7.
  • M. Nicolae, M. Sinn, M. N. Tran, B. Buesser, A. Rawat, M. Wistuba, V. Zantedeschi, N. Baracaldo, B. Chen, H. Ludwig, I. Molloy, and B. Edwards (2018) Adversarial robustness toolbox v1.2.0. CoRR 1807.01069. External Links: Link Cited by: 3rd item.
  • T. Pang, K. Xu, Y. Dong, C. Du, N. Chen, and J. Zhu (2020) Rethinking softmax cross-entropy loss for adversarial robustness. In International Conference on Learning Representations, External Links: Link Cited by: §7.
  • T. Pang, K. Xu, C. Du, N. Chen, and J. Zhu (2019) Improving adversarial robustness via promoting ensemble diversity. arXiv preprint arXiv:1901.08846. Cited by: §7.
  • A. Qahtan, A. Elmagarmid, R. C. Fernandez, M. Ouzzani, and N. Tang (2018) FAHES: a robust disguised missing values detector. In KDD 2018 - Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 2100–2109 (English). Note: 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2018 ; Conference date: 19-08-2018 Through 23-08-2018 External Links: Document, ISBN 9781450355520 Cited by: §2.1.
  • E. Rahm and H. H. Do (2000) Data cleaning: problems and current approaches. IEEE Data Eng. Bull. 23 (4), pp. 3–13. Cited by: §7.
  • K. Roth, Y. Kilcher, and T. Hofmann (2019) The odds are odd: a statistical test for detecting adversarial examples. arXiv preprint arXiv:1902.04818. Cited by: §1, §2.1, 1st item, §7.
  • M. Sabokrou, M. Fathy, and M. Hoseini (2016) Video anomaly detection and localisation based on the sparsity and reconstruction error of auto-encoder. Electronics Letters 52 (13), pp. 1122–1124. Cited by: §7.
  • S. Schelter, F. Biessmann, D. Lange, T. Rukat, P. Schmidt, S. Seufert, P. Brunelle, and A. Taptunov (2019) Unit testing data with deequ. In Proceedings of the 2019 International Conference on Management of Data, SIGMOD ’19, New York, NY, USA, pp. 1993–1996. External Links: ISBN 9781450356435, Link, Document Cited by: §1, §1, §2.1, §7.
  • K. Simonyan and A. Zisserman (2014) Two-stream convolutional networks for action recognition in videos. In Advances in neural information processing systems, pp. 568–576. Cited by: §4.
  • J. Song, T. Ma, M. Auli, and Y. Dauphin (2018) Better generalization with on-the-fly dataset denoising. Cited by: §4.2.
  • J. Steinhardt, P. W. W. Koh, and P. S. Liang (2017) Certified defenses for data poisoning attacks. In Advances in neural information processing systems, pp. 3517–3529. Cited by: §1, §1, §1, §7.
  • W. Su, X. Zhu, Y. Cao, B. Li, L. Lu, F. Wei, and J. Dai (2020) VL-bert: pre-training of generic visual-linguistic representations. In International Conference on Learning Representations, External Links: Link Cited by: §2.2.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §1, §2.2, §7.
  • R. Wu, A. Zhang, I. F. Ilyas, and T. Rekatsinas (2020) Attention-based learning for missing data imputation in holoclean. Proceedings of Machine Learning and Systems, pp. 307–325. Cited by: §1, §1, §2.1, §2.2, §7.
  • C. Xiao, P. Zhong, and C. Zheng (2019) Resisting adversarial attacks by k-winners-take-all. arXiv preprint arXiv:1905.10510. Cited by: §7.
  • Z. Xue, Y. Shang, and A. Feng (2010) Semi-supervised outlier detection based on fuzzy rough c-means clustering. Mathematics and Computers in simulation 80 (9), pp. 1911–1921. Cited by: §1, §7.
  • C. Yang, Q. Wu, H. Li, and Y. Chen (2017) Generative poisoning attack method against neural networks. arXiv preprint arXiv:1703.01340. Cited by: §4.2.
  • Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le (2019) Xlnet: generalized autoregressive pretraining for language understanding. In Advances in neural information processing systems, pp. 5754–5764. Cited by: §3.2, §4, §7.