Adversarial Learning and Explainability in Structured Datasets

We theoretically and empirically explore the explainability benefits of adversarial learning in logistic regression models on structured datasets. In particular, we focus on improved explainability due to significantly higher _feature-concentration_ in adversarially-learned models: compared to natural training, adversarial training tends to shrink the weights of non-predictive and weakly-predictive features much more efficiently, while model performance on natural test data degrades only slightly (and sometimes even improves) relative to that of a naturally trained model. We provide theoretical insight into this phenomenon via an analysis of the expectation of the logistic model weight updates by an SGD-based adversarial learning algorithm, where examples are drawn from a random binary data-generation process. We empirically demonstrate the feature-pruning effect on a synthetic dataset, some datasets from the UCI repository, and real-world large-scale advertising response-prediction datasets from MediaMath. Several of the MediaMath datasets contain tens of millions of data points and on the order of 100,000 sparse categorical features; on these, adversarial learning often reduces model size by a factor of 20 or more, and yet the model performance on natural test data (measured by AUC) is comparable to (and sometimes even better than) that of the naturally trained model. We also show that traditional ℓ_1 regularization does not even come close to achieving this level of feature-concentration. We measure "feature concentration" using the Integrated Gradients-based feature-attribution method of Sundararajan et al. (2017), and derive a new closed-form expression for 1-layer networks, which substantially speeds up computation of aggregate feature attributions across a large dataset.



1 Introduction

While deep learning models have been wildly successful in a variety of perceptual (images, audio, video) and text domains, a number of authors have recently highlighted two important concerns with such models:

  • Adversarial Examples: Many of these models are vulnerable to adversarial attacks: it is possible for an adversary to perturb the inputs in such a way that humans perceive no significant change (and in particular they would assign the same "label" as the original example), and yet the model’s label or prediction can be markedly different Szegedy2013-tx ; Goodfellow2014-cy . Adversarial training has recently been proposed as a method for training models that are robust to such attacks, by applying techniques from the area of Robust Optimization Papernot2015-pg ; Madry2017-zz ; Sinha2018-mj .

  • Feature Attribution: Especially for complex models, but even for simple linear models, it is often not clear how to quantify the impact of an input feature on the model’s final output. Such a quantification is crucial in order to understand which features are relevant (and irrelevant or marginally relevant) in determining a model’s behavior on specific inputs, or to understand which features are important in aggregate. This understanding can highlight a model’s strengths or weaknesses and can aid in improving the models or help instill trust in the models (especially in domains such as health-care and finance). A few different feature attribution techniques have been proposed recently Leino2018-ga ; Ancona2017-tw , and Guidotti2018-zk ; Ras2018-sl are two recent surveys of various explanation methods for black-box models. In particular Sundararajan2017-rz formulate a set of axioms that any good feature-attribution method should satisfy and identify a specific method, which they call Integrated Gradients (IG) that satisfies all of these axioms.

Much of the recent work on Adversarial Machine Learning (AML) has focused on perceptual domains or text, and "structured" datasets have been largely ignored. (For all practical purposes, structured datasets can be thought of as tabular datasets such as those arising from health-care, advertising, or a variety of other application areas, where the values in the table columns are numerical or categorical.) Moreover, AML research has been primarily concerned with producing adversarially robust models, and the conventional wisdom is that adversarial robustness comes at the cost of (often significant) loss in classification performance on natural examples Kurakin2016-lf ; Tsipras2018-fu ; Kannan2018-fl . For example, in Tsipras2018-fo the authors create a synthetic dataset where they demonstrate a tradeoff between standard accuracy (i.e. accuracy on natural unperturbed examples) and adversarial accuracy (i.e. accuracy on adversarially perturbed examples). In particular, on their synthetic dataset, they show that as standard accuracy approaches 100%, adversarial accuracy falls to zero. Other tradeoffs have also been considered in the literature, such as the need for more training data to achieve adversarial robustness Schmidt2018-rb .

The emphasis in this paper is different in a few ways. We explore adversarial training, not in perceptual or text domains, but in structured datasets. Our primary interest in this paper is not adversarial robustness or adversarial accuracy, but rather an intriguing side-benefit of adversarial training:

Adversarial training (with ℓ_∞-bounded perturbations) can be used to produce models significantly more concentrated than their naturally-trained counterparts, with minimal or no impact on standard accuracy (i.e. the performance on natural data), particularly in logistic probability-prediction models applied to structured datasets. Unlike regularization techniques such as ℓ_1- or ℓ_2-regularization (which do not guarantee adversarial robustness, as discussed below), adversarial training offers adversarial robustness and in addition produces more concentrated models.

We use the term feature-concentration (or "model compression", or "model concentration") informally to refer to the extent to which the learned model differentiates between "relevant" and "less relevant" or "irrelevant" features. (See below for more on measuring feature concentration). An aspect of adversarial training that we are especially interested in, is its ability to weed out (the weights of) non-predictive features that exhibit spurious correlation with the label on finite datasets. This is clearly important from an explainability and feature-attribution viewpoint: We would like our trained model to not place significant weights on such irrelevant features.

We should point out that there has been prior work on the regularization benefits of adversarial training Xu2009-gz ; Szegedy2013-tx ; Goodfellow2014-cy ; Shaham2015-lg ; Sankaranarayanan2017-vy ; Tanay2018ANA , primarily in image-classification applications: when a model is adversarially trained, its classification accuracy on natural (i.e. un-perturbed) test data can improve. All of this prior work has focused on the performance-improvement (on natural test data) aspect of regularization, but none of it has examined the feature-pruning benefits explicitly. In contrast to this work, our primary interest is in the explainability benefits of adversarial training, and specifically the ability of adversarial training to significantly improve feature-concentration while maintaining (and often improving) performance on natural test data. Although our theoretical analysis is generally applicable to logistic regression models applied to any type of data, our experimental results are on structured datasets. We focus on structured datasets here because logistic regression models (and variants such as Poisson regression) have been successfully used in a variety of domains involving structured datasets, such as advertising Brendan_McMahan2013-uw ; Chapelle2014-yn , health-care Sperandei2014-ai ; Anderson2003-tz ; Tolles_undated-aw ; El_Emam2013-qd , and finance Emfevid2018-hs ; Mironiuc2013-wu ; Bolton2009-yx ; Dong2010-zn . In all of these domains, it is crucial to understand which input features are truly important in determining the model’s output (either on a single data point or in aggregate across a dataset), and our aim is to show that adversarially-trained logistic regression models yield significantly better feature-attribution by shrinking the weights on irrelevant or weakly-relevant features to negligible levels, with minimal impact on (and often improvement in) performance on natural test data.

It might be argued that traditional ℓ_1 regularization can potentially also provide a similar explainability benefit by shrinking the weights of irrelevant features. This does seem to hold for our synthetic dataset (Sec. 4) and the toy UCI datasets (Sec. C) with simple optimizers such as pure SGD (Stochastic Gradient Descent). However, practitioners typically use more sophisticated optimizers such as FTRL Brendan_McMahan2013-uw to achieve better model performance on test data, and with this optimizer (especially on the real-world advertising data, as described in Sec. 6.2), the model-concentration produced by adversarial training is significantly better than with ℓ_1-regularization (and ℓ_2-regularization is even worse than ℓ_1-regularization in this respect). Moreover, ℓ_1 or ℓ_2 regularization will not necessarily yield the robustness guarantees provided by adversarial training Goodfellow2014-cy . In other words, adversarial training simultaneously provides robustness to adversarial attacks, as well as explainability benefits, while maintaining or improving model performance on natural test data. These advantages make adversarial training especially compelling for logistic regression models since there is a simple closed-form formula for the adversarial perturbation in this case (as shown in Lemma 3.1 and also pointed out in Goodfellow2014-cy ), and so from a computational standpoint it is no more demanding than ℓ_1 or ℓ_2 regularization.

Since our results show that adversarial training can effectively shrink the weights of irrelevant or weakly-relevant features (while preserving weights on relevant features), a legitimate counter-proposal might be that one could weed out such features beforehand via a pre-processing step where features with negligible label-correlations are "removed" from the training process. Besides the fact that this scheme has no guarantees whatsoever with regard to adversarial robustness, there are some practical reasons why correlation-based feature selection is not as effective as adversarial training in producing pruned models: (a) With adversarial training, one simply needs to try different values of the adversarial strength parameter ε and find a level where accuracy (or another metric such as AUC, see Sec. 2.7) is not impacted much but model-weights are significantly more concentrated; on the other hand, with the correlation-based feature-pruning method, one needs to set up an iterative loop with gradually increasing correlation thresholds, and each time the input pre-processing pipeline needs to be re-executed with a reduced set of features. (b) When there are categorical features with large cardinalities, where just some of the categorical values have negligible feature-correlations, it is not even clear how one can "remove" these specific feature-values, since the feature itself must still be used; at the very least it would require a re-encoding of the categorical feature each time a subset of its values is "dropped" (for example if a one-hot encoding or hashing scheme is used). Thus correlation-based feature-pruning is a much more cumbersome and inefficient process compared to adversarial training.

1.1 Description of Results

It is now standard Madry2017-zz to view adversarial training as a process similar to the usual model-training where a certain loss function is minimized, except that each d-dimensional input instance x is altered by an adversary by adding a d-dimensional perturbation δ to x, where the adversary chooses δ in the "worst" possible way given the current model state and certain constraints (a formal description appears in Section 3.1). The constraint on the adversary is typically in the form of a bound ε on the ℓ_∞-norm of δ. For brevity we will refer to such an adversary as an ℓ_∞(ε) adversary. The key difficulty in scaling up adversarial training to large datasets is the computation of this worst-case δ at each input sample point. For a general model there is often no choice but to apply an iterative optimization procedure, such as Projected Gradient Descent (PGD), at each sample point to compute the worst-case perturbation δ*. However, for linear models (such as logistic or Poisson) we derive a simple closed-form expression (Lemma 3.1 in Section 3.1) for δ* under ℓ_∞-norm and ℓ_2-norm constraints. This result makes it practical to incorporate adversarial training into a learning system based on logistic or other linear models.

Theorem 3.1 in Section 3.2 is our main theoretical result that sheds light on the feature-concentration effect of adversarial training. This result gives bounds on the expectation of the logistic weight updates when a Stochastic Gradient Descent (SGD) algorithm is applied to data points from an idealized data-generation process that are subjected to the worst-case ℓ_∞(ε)-adversarial perturbation according to the closed-form expression mentioned above. These bounds show a clear link between the expected SGD update of a feature’s weight and the relative magnitudes of ε and the (absolute) label-correlation of the feature (i.e. the correlation between the feature and the label). Roughly speaking, for features whose absolute label-correlations are less than ε, the SGD updates on average tend to shrink their weights toward zero, at a rate proportional to the difference between ε and the label-correlation. On the other hand, for features whose absolute label-correlations are sufficiently larger than ε, the SGD updates on average tend to expand their weights in their current direction.

This suggests that ℓ_∞(ε)-adversarial training with an appropriate value of ε acts as a relevance filter that weeds out the weights of irrelevant or weakly-relevant features, while preserving weights of relevant features, such that standard accuracy is only minimally sacrificed. In fact, if there is a large gap between the absolute label-correlations of "relevant" features and those of "irrelevant" features, then this result suggests the existence of a "goldilocks zone" of ε values that are "just right": large enough to suppress irrelevant or weakly-relevant features, and yet small enough to preserve relevant features such that standard accuracy is not impacted significantly. This is precisely the phenomenon we observe in several experimental studies on synthetic, toy, and large-scale real-world datasets, as described below. A more detailed description of the implications of Theorem 3.1 appears in Section 3.3.

Inspired by the synthetic dataset of Tsipras2018-fo , we define a synthetic dataset (Section 4) expressly designed to dramatically highlight the contrast between adversarial training and natural training in their ability to ignore irrelevant features. In contrast to the theoretical analysis of Tsipras2018-fo , we conduct SGD-based adversarial training experiments on this dataset. Our synthetic dataset contains a mix of predictive and purely random non-predictive features, and natural training tends to learn significant weights on the non-predictive features (with the weights sometimes being bigger than those of the predictive features), whereas adversarial training acts as an efficient sieve that weeds out the non-predictive features by pushing their weights close to zero, while still leaving the predictive features with significant weights, hence only minimally impacting (and sometimes even improving) standard accuracy.

How do we quantify feature concentration? One possibility, applicable to linear models, is based on the weights of the features (see the measures Wts.L1 and Wts.1Pct defined in Section 6.1). We also consider concentration measures using the more principled Integrated Gradients (IG) based approach of Sundararajan2017-rz . Specifically we compute each feature’s "importance" using the IG methodology, and then measure "concentration" based on how this measure varies across features (see Section 6.1 for details).

The core of the IG method is an expression for the contribution of a given feature-dimension i to a specific prediction F(x) on an input sample x. This expression involves a line integral over gradients computed along the straight-line path from a certain baseline input to the actual input x. Computing this integral in general requires an approximation based on gradients computed at equally spaced points on this straight-line path. This is reasonably efficient to compute for a single datapoint x and a single dimension i. However, in this paper we are interested in repeatedly computing this quantity across all dimensions (which could number into the hundreds of thousands when there are high-cardinality categorical features), over an entire dataset (which could contain millions of records). Fortunately, we are able to show a closed-form formula for the IG-based attribution for 1-layer neural networks with an arbitrary differentiable activation function (Lemma 5.1 in Section 5.1).
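To make the cost of the general approximation concrete, the following is a minimal sketch of Integrated Gradients computed by a Riemann sum over equally spaced points on the straight-line path, following the definition in Sundararajan2017-rz (the attribution of dimension i is the difference between input and baseline in that dimension, times the path integral of the i'th partial derivative). The bias-free logistic model, the all-zeros baseline, the midpoint rule, and all function names are assumptions made for illustration; this is not the closed form of Lemma 5.1.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def logistic_predict(w, x):
    """F(x) = sigmoid(<w, x>) for a bias-free logistic model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))


def logistic_gradient(w, x):
    """Gradient of F with respect to the input x."""
    p = logistic_predict(w, x)
    return [p * (1.0 - p) * wi for wi in w]


def integrated_gradients(w, x, baseline, steps=1000):
    """Midpoint Riemann-sum approximation of IG along the straight-line path."""
    d = len(x)
    grad_sums = [0.0] * d
    for k in range(steps):
        alpha = (k + 0.5) / steps  # equally spaced points in (0, 1)
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grad = logistic_gradient(w, point)
        for i in range(d):
            grad_sums[i] += grad[i]
    return [(x[i] - baseline[i]) * grad_sums[i] / steps for i in range(d)]


# Hypothetical weights and input, chosen only for illustration.
w = [2.0, -1.0, 0.05]
x = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]

attributions = integrated_gradients(w, x, baseline)
gap = logistic_predict(w, x) - logistic_predict(w, baseline)
print(attributions, sum(attributions), gap)
```

Each call costs `steps` gradient evaluations, and each gradient touches all d dimensions. Repeating this over millions of records and hundreds of thousands of dimensions is what motivates a closed form. The attributions sum (up to approximation error) to F(x) minus F(baseline), which is the completeness property of IG.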

While the IG methodology of Sundararajan2017-rz focuses on feature-attribution for a single sample point, for our purposes we also want to quantify the overall importance of a feature in a model, aggregated over a specific dataset. In Section 5.2 we introduce a simple, intuitive methodology to compute such an aggregate feature-importance metric we call the Feature Impact.

In Sections C and 6.2 we describe our experiments on toy datasets from the UCI repository Dua:2017 and on large-scale real-world advertising datasets from MediaMath, respectively. In these experiments we compare results from natural training and adversarial training where the adversary’s perturbations δ are constrained such that ‖δ‖_∞ ≤ ε for some positive ε. The common theme that emerges from these experiments is that there is almost always a "goldilocks zone" for ε where ε is large enough to cause significant feature-concentration, yet small enough not to damage performance (as measured by AUC or accuracy) on natural test data. We also study whether this type of model-concentration/performance-preservation effect can be achieved using traditional ℓ_1 regularization. We find that on the toy UCI or synthetic datasets we are forced to use a very large regularization parameter λ (more than 200 for example) to achieve this effect, at least with the FTRL optimizer, whereas on the real-world MediaMath datasets, this effect cannot be achieved at all with ℓ_1 regularization: when λ is large enough to produce a model concentration comparable to adversarial training, the AUC is much worse, while a smaller λ that preserves AUC produces a much inferior model concentration.

To summarize, our main contributions are:

  • A novel theoretical analysis (Theorem 3.1) of the expectation of SGD updates of logistic regression model weights during adversarial training (on data points from an idealized random data generation process), which shows that adversarial learning acts as an aggressive relevance filter: on average the weight of a feature shrinks or expands depending on whether its absolute label-correlation is smaller than or (sufficiently) larger than ε, respectively. One implication of the result is that when there is a large gap between the absolute label-correlations of relevant and irrelevant (or weakly-relevant) features, then ε values in this gap region constitute a goldilocks zone for model-concentration: in this zone ε is "just right": large enough to shrink weights of irrelevant features, and small enough to preserve or expand weights of relevant features, thus only minimally impacting standard accuracy (or AUC-ROC).

  • Experimental results (Sections 4, C, 6.2) showing that ℓ_∞(ε)-adversarial training of logistic regression models (on synthetic data, UCI data, and large-scale real-world advertising datasets from MediaMath) exhibits the above relevance-filtering phenomenon, with a goldilocks zone of ε values. When there are a large number of low-relevance features, this aggressive relevance-filtering results in models that are significantly more concentrated than their naturally-trained counterparts, without significantly sacrificing standard accuracy. In particular, on some of the MediaMath datasets we show that with ℓ_∞(ε)-adversarial training for a suitable value of ε, the number of significant absolute weights (i.e. those within 1% of the largest absolute weight) drops by a factor of 20, while the AUC-ROC on natural test data is nearly the same as with ε = 0 (i.e. natural training). We also show that traditional ℓ_1-regularization cannot be used to achieve this type of model-concentration while preserving standard accuracy.

  • A closed form formula (Lemma 5.1) for the Integrated Gradients-based Feature Attribution of Sundararajan2017-rz for 1-layer neural networks with an arbitrary differentiable activation function (of which logistic models are a special case), and the use of this formula to efficiently compute new measures of feature-importance and model-concentration on a dataset.

2 Background, Definitions and Notation

For ease of reference, in this section we collect some basic terminology, concepts and known results pertaining to Machine Learning, Neural Network models, Adversarial Training, and Feature Attribution.

2.1 Machine Learning Datasets and Binary Probability-Prediction

We are interested in machine-learning tasks in domains where the primary emphasis is on developing models that predict the probability of an event. Such models are useful in a wide variety of areas, for instance in advertising to predict response-rates from ad exposure Brendan_McMahan2013-uw ; Chapelle2014-yn , and in health-care to predict the probability of adverse events Zhao2015-mk . In advertising, the probability is directly used to determine the best price to bid for an ad opportunity in a real-time-bidding system. This is in contrast to domains considered in much of the adversarial learning literature, where the emphasis is primarily on the classification produced by the model.

In this paper we focus on structured datasets. For our purposes a structured dataset is one which can be thought of as a table with columns, where the values in the columns can be numerical or categorical. We assume categorical features are "1-hot encoded" (as described in the next subsection) at least implicitly if not explicitly.

A training dataset D consists of example-label pairs (x, y) where x is a d-dimensional example vector and y ∈ {-1, 1} is the corresponding binary label, where y = 1 indicates the occurrence of an event of interest. (For mathematical convenience we use -1/1 as the labels in the theoretical results, whereas in the experiments we often use 0/1 labels, but these are trivially transformed to -1/1 and the same results apply.) When a model is trained on dataset D, the output of the model for an input x is interpreted as the predicted probability that y = 1. Throughout the paper, it is assumed that the input vector x represents the "exploded" feature vector where each categorical feature in the original example has been 1-hot encoded, as described in the next subsection.

2.2 One-hot Encoding of Categorical Features

In ad conversion prediction and other domains involving structured datasets, several (or even most) features can be categorical. We introduce some notation and terminology to simplify the description of some of our results involving categorical features in Section 5.2.

An example vector in original form (i.e. prior to one-hot encoding of categorical features) is denoted , and refers to the ’th feature (which could be numerical or categorical). If feature is categorical, then denotes the set of possible values of the feature, i.e. represents the "vocabulary" of this feature. The cardinality of feature is . We assume that is in index form, i.e. regardless of what the actual feature values are, we assume they are represented as indices in . For example if is the dayOfWeek feature, it has the 7 possible values , and its cardinality is 7. Sometimes cardinalities can be much larger. For example in the advertising datasets we study in our experiments, a website URL is encoded as an integer siteID, which has cardinality around .

For a categorical feature , its 1-hot representation is denoted and is obtained by starting with an -dimensional vector of all zeros, and setting dimension to 1, where is the value of . For example if and its cardinality , then is the vector . For notational convenience, for a numerical feature , we let denote its "one-hot encoding", which is identical to . For a feature-vector (in original form) , we let represent the vector resulting from concatenating the 1-hot encoding of each in the natural way:


We then say that is in exploded form. For example suppose is a 3-dimensional feature vector in original form, is categorical with cardinality , is numerical, and is categorical with cardinality . If , then the exploded form after 1-hot transformation is:


We use to denote the (contiguous) set of dimensions corresponding to the original feature , in the exploded vector . In the above example , , and .
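The exploded form and the per-feature dimension sets described above can be sketched as follows. This is an illustrative sketch only: the function name `explode`, the use of `None` to mark a numerical feature, the 0-based category indices, and the example values are all assumptions, not notation or data from the paper.

```python
def explode(original, cardinalities):
    """Concatenate the one-hot encodings of the features of one example.

    original      : feature values; categorical values are 0-based indices
                    (an assumption made for this sketch).
    cardinalities : vocabulary size per categorical feature, or None for a
                    numerical feature, which is copied through unchanged.
    Returns the exploded vector and, for each original feature, the
    contiguous range of exploded dimensions it occupies.
    """
    exploded = []
    spans = []
    for value, card in zip(original, cardinalities):
        start = len(exploded)
        if card is None:
            exploded.append(float(value))
        else:
            one_hot = [0.0] * card
            one_hot[value] = 1.0
            exploded.extend(one_hot)
        spans.append(range(start, len(exploded)))
    return exploded, spans


# Hypothetical 3-feature example: categorical (3 values), numerical,
# categorical (2 values).
vec, spans = explode([1, 3.5, 0], [3, None, 2])
print(vec)
print([list(s) for s in spans])
```

Keeping the dimension ranges alongside the exploded vector makes it straightforward to aggregate any per-dimension quantity back to the level of the original feature.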

2.3 1-Layer Neural Networks

A 1-layer neural network is defined by learnable parameters w (the weight vector) and b (the bias), and a differentiable scalar activation function A, and computes the following function for any input feature-vector x (in exploded space):

$$F(x) = A(\langle w, x \rangle + b)$$

where ⟨w, x⟩ denotes the dot product of w and x. Although in all of our experiments we train (logistic) models that include the bias b, we do not include the bias (also called the "intercept") in most of our results; this is without loss of generality since the effect of the bias term can be captured by setting one of the dimensions of the input vector x to 1.0, and the presence of the bias term adds notational clutter but does not meaningfully alter our results.

Two commonly used models fall within the class of 1-layer networks: logistic regression models, where the activation function is the sigmoid (or logistic) function σ(z) = 1/(1 + exp(-z)), and Poisson regression models, where the activation is A(z) = exp(z). In the case of a logistic model, the value of F(x) will lie in (0, 1) and can be interpreted as a probability prediction. Logistic and Poisson models are members of the more general class of Generalized Linear Models mccullagh1989generalized .
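As a small illustration of this model class, the sketch below builds a 1-layer network from a weight vector, a bias, and an activation, and instantiates the logistic and Poisson cases. It also checks the remark that the bias can be folded into the weights by appending a constant 1.0 to the input. The helper name `one_layer` and the numeric values are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def one_layer(w, b, activation):
    """Return the function x -> activation(<w, x> + b)."""
    def predict(x):
        return activation(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return predict


x = [2.0, 4.0]

# Logistic and Poisson regression differ only in the activation.
logistic = one_layer([0.5, -0.25], 0.0, sigmoid)
poisson = one_layer([0.5, -0.25], 0.0, math.exp)
print(logistic(x), poisson(x))

# Folding the bias into the weights: append 1.0 to the input and the
# bias value to the weight vector.
with_bias = one_layer([0.5, -0.25], 0.3, sigmoid)
folded = one_layer([0.5, -0.25, 0.3], 0.0, sigmoid)
print(with_bias(x), folded(x + [1.0]))
```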

2.4 Natural Training of Logistic Models

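Logistic models are typically trained by minimizing the empirical risk, expressed as the expected Negative Log Likelihood (NLL) loss:

$$\mathbb{E}_{(x,y) \sim D} \left[ L(x, y; w) \right] \tag{4}$$

where the NLL loss is given by:

$$L(x, y; w) = -\ln \sigma(y \langle w, x \rangle) = \ln\left(1 + \exp(-y \langle w, x \rangle)\right) \tag{5}$$

Note that the prediction F(x) is a function of x and w, so the loss on a given example (x, y) is a function of (x, y) and w.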
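For concreteness, here is a minimal sketch of the NLL loss for -1/1 labels, written as ln(1 + exp(-y⟨w, x⟩)), together with its average over a small dataset. The function names and the two-example dataset are hypothetical, and the branch on the sign of the margin is a standard numerical-stability device rather than something taken from the paper.

```python
import math


def nll_loss(w, x, y):
    """ln(1 + exp(-y * <w, x>)) for a label y in {-1, +1}."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    if margin > 0:
        return math.log1p(math.exp(-margin))
    # Equivalent form that avoids overflow for large negative margins.
    return -margin + math.log1p(math.exp(margin))


def empirical_risk(w, data):
    """Average NLL loss over a list of (x, y) pairs."""
    return sum(nll_loss(w, x, y) for x, y in data) / len(data)


def to_plus_minus_one(y01):
    """Map a 0/1 label to the -1/+1 convention used in the loss."""
    return 2 * y01 - 1


data = [([1.0, 0.0], 1), ([0.0, 1.0], -1)]
print(empirical_risk([0.0, 0.0], data))   # all-zero weights
print(empirical_risk([1.0, -1.0], data))  # weights aligned with the labels
```

With all-zero weights every prediction is 0.5, so the loss on each example is ln 2; weights that agree in sign with the labels give a lower average loss.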

2.5 Adversarial Training of Logistic Models

While Empirical Risk Minimization (ERM) has been successfully used in a variety of domains to yield classifiers that perform well on examples drawn from the natural data distribution D, such classifiers have repeatedly been shown to be vulnerable to adversarially crafted examples (see Madry2017-zz ; Szegedy2013-tx ; Yuan2017-vz and references therein). As a result there has been a growing interest in adversarial training, i.e. training models that are robust to adversarial examples. As mentioned in Section 1, our primary interest in adversarial training is not adversarial robustness but rather the resulting improved model-concentration (and hence improved model-explainability).

Following Madry2017-zz , for our logistic models and NLL loss, we formulate adversarial learning as the problem of minimizing the expected adversarial loss:

$$\mathbb{E}_{(x,y) \sim D} \left[ \max_{\delta \in S} L(x + \delta, y; w) \right] \tag{6}$$

where S is the set of possible perturbations to the input x that the adversary is allowed to make.

It is common practice to consider S to be the set of ℓ_∞-bounded perturbations. For brevity we will use the phrase "ℓ_∞(ε)-adversary" to refer to an adversary who is allowed to choose perturbations δ such that ‖δ‖_∞ ≤ ε, and we refer to such perturbations as ℓ_∞(ε) perturbations. In the rest of the paper, whenever we use the term "adversarial" perturbation, it should be understood that we are referring to the worst-case perturbation by an ℓ_∞(ε) adversary, for the loss function under consideration (i.e. this perturbation achieves the inner max in Eq. (6)).

Thus minimizing expected adversarial loss results in a min-max (or saddle-point) optimization problem which is computationally demanding for general loss functions (since for each example one must find the worst-case perturbation that maximizes the loss on that example). Fortunately, for logistic models we show in Section 3.1 that the worst-case δ that achieves the inner maximization has a simple closed-form expression when adversarial perturbations are bounded under the ℓ_∞- and ℓ_2-norms.
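The sketch below illustrates why the inner maximization is cheap for a logistic model when every coordinate of the perturbation is bounded by ε in absolute value. It uses the well-known sign-based perturbation for linear models (each coordinate is set to ε times the sign of the weight, multiplied by the negated label; see Goodfellow2014-cy). Treating this as the content of the closed form derived in Section 3.1 is an assumption, since that lemma is not reproduced here; all names and numeric values are hypothetical.

```python
import math
import random


def sign(v):
    return (v > 0) - (v < 0)


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def nll_loss(w, x, y):
    """ln(1 + exp(-y * <w, x>)) for y in {-1, +1} (small margins only)."""
    return math.log1p(math.exp(-y * dot(w, x)))


def worst_case_linf(w, y, eps):
    """Sign-based perturbation that maximizes the NLL loss under the
    constraint that every coordinate has absolute value at most eps."""
    return [-y * eps * sign(wi) for wi in w]


w = [1.5, -0.7, 0.0, 0.2]
x = [0.4, 1.0, -2.0, 0.3]
y = 1
eps = 0.1

delta_star = worst_case_linf(w, y, eps)
adv_loss = nll_loss(w, [xi + di for xi, di in zip(x, delta_star)], y)

# Compare against many random perturbations inside the same box.
random.seed(0)
random_losses = [
    nll_loss(w, [xi + random.uniform(-eps, eps) for xi in x], y)
    for _ in range(1000)
]
print(adv_loss, max(random_losses))
```

The effect of this perturbation is to reduce the margin y⟨w, x⟩ by ε times the ℓ_1-norm of w, so no per-example iterative search such as PGD is needed.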

2.6 Stochastic Gradient Descent (SGD) in Adversarial Training

Whether to minimize the expected natural loss (4) or adversarial loss (6), it is standard to use variants of Stochastic Gradient Descent (SGD). In this paper we assume the following canonical SGD setup for adversarial training of logistic models with an ℓ_∞(ε)-adversary (for ε = 0 this reduces to the standard SGD for natural examples).

The weight vector w is initialized to all zeros. After each example (x, y) is encountered (where x is in exploded form, i.e. after 1-hot-encoding all categorical features),

  1. The input vector x is subjected to an adversarial perturbation, i.e. x is replaced by x + δ*, where δ* is the vector that achieves the inner maximum in (6), which we show (Section 3.1) for logistic models is given by Eq. (12) for an ℓ_∞(ε)-adversary.

  2. The logistic NLL loss is computed using Eq. (5),

  3. Each component w_i of w is updated according to the gradient of the NLL loss w.r.t. w_i:

    $$w_i \leftarrow w_i - \eta \, \frac{\partial L(x + \delta^*, y; w)}{\partial w_i}$$

    where η is the learning rate.
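The three steps above can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the perturbation in step 1 is the standard sign-based one for linear models (used here as a stand-in for Eq. (12), which is not reproduced in this section), the model has no bias, and the function name, learning rate, ε values, and the tiny synthetic dataset are all hypothetical.

```python
import math
import random


def sign(v):
    return (v > 0) - (v < 0)


def sigmoid(z):
    # Numerically stable for both signs of z.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def adversarial_sgd(data, eps, lr, epochs=1):
    """SGD on the NLL loss of adversarially perturbed examples.

    data : list of (x, y) with x a list of floats and y in {-1, +1}.
    eps  : bound on each coordinate of the perturbation (0 = natural SGD).
    """
    w = [0.0] * len(data[0][0])
    for _ in range(epochs):
        for x, y in data:
            # Step 1: replace x by x + delta*, with delta* = -y * eps * sign(w).
            x_adv = [xi - y * eps * sign(wi) for xi, wi in zip(x, w)]
            # Step 2: the loss is ln(1 + exp(-margin)) on the perturbed input.
            margin = y * sum(wi * xa for wi, xa in zip(w, x_adv))
            # Step 3: dL/dw_i = -y * x_adv_i * sigmoid(-margin), holding the
            # perturbation fixed, so the descent step adds the term below.
            scale = lr * y * sigmoid(-margin)
            w = [wi + scale * xa for wi, xa in zip(w, x_adv)]
    return w


# Hypothetical data: feature 0 agrees with the label 90% of the time,
# feature 1 is pure noise that is independent of the label.
random.seed(0)
data = []
for _ in range(2000):
    y = random.choice([-1, 1])
    strong = y if random.random() < 0.9 else -y
    noise = random.choice([-1.0, 1.0])
    data.append(([float(strong), noise], y))

w_nat = adversarial_sgd(data, eps=0.0, lr=0.05)
w_adv = adversarial_sgd(data, eps=0.3, lr=0.05)
print("natural:    ", w_nat)
print("adversarial:", w_adv)
```

Printing both weight vectors lets one compare how each setting treats the noise feature. No particular output is claimed here, since the numbers depend on the seed, the learning rate, and ε.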

2.7 Model Performance Measures: Accuracy, ROC-AUC

In this paper we are mainly interested in probability-prediction models, and so we use the Area Under the ROC Curve (AUC-ROC, or AUC in brief) Fawcett2006-ws as our primary measure of model performance. In the synthetic datasets, we also use accuracy to measure model performance, and there it should be understood that we treat any prediction above (below) 0.5 as a "positive" ("negative") classification, and accuracy is calculated as the fraction of correctly classified examples in the test dataset.
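Both measures are simple to compute from predicted probabilities. The sketch below uses the pairwise-ranking characterization of AUC (the probability that a randomly chosen positive example is scored above a randomly chosen negative one, with ties counted as one half) and the 0.5 threshold for accuracy. The function names and the four-example dataset are hypothetical, and treating a prediction of exactly 0.5 as negative is an assumption, since the text leaves that case unspecified.

```python
def accuracy(scores, labels, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label.

    A score strictly above the threshold counts as a positive prediction;
    a score exactly at the threshold is treated as negative (assumption).
    """
    correct = sum((s > threshold) == (y == 1) for s, y in zip(scores, labels))
    return correct / len(labels)


def auc_roc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    This quadratic-time form is meant for illustration; a sort-based
    computation is preferable on large datasets.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y != 1]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


scores = [0.9, 0.8, 0.4, 0.3]
labels = [1, -1, 1, -1]
print(accuracy(scores, labels), auc_roc(scores, labels))
```

Unlike accuracy, AUC depends only on the ordering of the scores, which is one reason it suits probability-prediction models whose outputs are not thresholded.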

2.8 Feature Attribution in Neural Networks using Integrated Gradients

The main thrust of this paper is to demonstrate the ability of adversarial training to produce models that are much more concentrated than naturally-trained ones, without sacrificing accuracy on natural test data. Model concentration can be measured in terms of the model weights (for example by computing the ℓ_1-norm of the weights), which is reasonable if there are only numerical features of the same scale, and the model is a 1-layer neural network. However, in general the weight of a feature, even in a 1-layer neural network, is not necessarily a good measure of its importance to a model: for instance, a numerical feature may have a relatively large weight, but on a typical dataset its absolute value may be much smaller than that of other numerical features; or, if it is a categorical feature-value, it may occur very infrequently (and hence its overall importance on a dataset may be relatively small). Therefore a more principled approach to measuring the "importance" of a feature in a model is needed.

Feature attribution refers to the general area of understanding how the output of a neural network is impacted by its input features, and this is important in a variety of contexts. For a neural network that computes a function $F$, some examples of relevant questions are:

  • On a specific input $x$, which features of $x$ are "most responsible" for the value of the output $F(x)$, relative to $F(u)$ for some baseline input $u$ (such as a black image for image models, or a zero vector for logistic prediction models)? Such an understanding can aid in debugging or improving a model’s performance. Sample-level attribution can also be used as a rationale for a specific output. Such explanations can help the end-user (e.g. a doctor) understand the strengths and weaknesses of the model.

  • In aggregate over some dataset, what are the relative importances of the features in determining the network output? If there are categorical features, then this question can be asked either at the level of feature-values (i.e. the individual values of each categorical feature, such as the different possible values of the "country" feature), or features (i.e. each categorical feature in aggregate, such as "country"). Understanding aggregate-level feature importance can help prune features, or identify bugs where a feature that is expected to be important is not turning out to be important (as measured by the specific attribution method), or vice versa.

The key to a useful feature-attribution is a sound methodology which does not have quirks which obscure the impact of features on the model output. In general it is difficult to evaluate an attribution method, so in a recent paper Sundararajan2017-rz the authors identify several axioms that any sound attribution method must satisfy, and in particular propose a specific method that satisfies these axioms, which they call Integrated Gradients (IG). We adopt this IG method in this paper, and it works as follows. Suppose $F$ represents the function computed by a neural network. Let $x$ be a specific input, and $u$ be the baseline input. The IG is defined as the path integral of the gradients along the straight-line path from the baseline $u$ to the input $x$. The IG along the $i$’th dimension for an input $x$ and baseline $u$ is defined as:

$$IG_i^F(x, u) := (x_i - u_i) \int_{0}^{1} \partial_i F\big(u + \alpha (x - u)\big)\, d\alpha, \tag{8}$$

where $\partial_i F(z)$ denotes the gradient of $F$ along the $i$’th dimension, at $z$.
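The path integral in the IG definition can be approximated numerically by averaging gradients at points along the straight line from the baseline to the input. The sketch below uses a midpoint Riemann sum; the logistic model, its weights, and the input are hypothetical values chosen only for illustration.

```python
import numpy as np

def integrated_gradients(grad_f, x, u, steps=1000):
    """Midpoint Riemann-sum approximation of IG for input x and baseline u."""
    alphas = (np.arange(steps) + 0.5) / steps        # midpoints of [0, 1]
    path = u + alphas[:, None] * (x - u)             # points on the straight-line path
    avg_grad = np.mean([grad_f(p) for p in path], axis=0)
    return (x - u) * avg_grad

# Hypothetical example: a logistic model F(x) = sigmoid(<w, x>).
w = np.array([1.5, -2.0, 0.5])
F = lambda v: 1.0 / (1.0 + np.exp(-np.dot(w, v)))
grad_F = lambda v: F(v) * (1.0 - F(v)) * w

x = np.array([1.0, -0.5, 1.0])
u = np.zeros(3)                                      # zero-vector baseline
ig = integrated_gradients(grad_F, x, u)
```

A useful sanity check is the completeness property of IG: the attributions should sum to F(x) - F(u), up to the discretization error of the Riemann sum.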

3 Analysis of SGD Updates in Adversarial Training of Logistic Models

Our aim is to understand the nature of the solution to the $\ell_\infty(\varepsilon)$-adversarial learning problem (6) for logistic regression models. The biggest difficulty in this optimization problem is the inner maximization, which requires finding the worst-case $\ell_\infty(\varepsilon)$-adversarial perturbation $\delta^*$ on each input $x$. In general one needs to run a separate optimization procedure (such as Projected Gradient Descent Madry2017-zz ) to find this $\delta^*$, but fortunately for logistic regression models there is a simple closed form expression for $\delta^*$, which we show in Lemma 3.1.

While the closed form expression for $\delta^*$ makes the $\ell_\infty(\varepsilon)$-adversarial training of logistic regression models computationally as simple as natural training, gaining an analytical insight into the nature of the optimum of (6) is still difficult, especially since there is no known closed-form solution for this optimization problem. Rather than analyzing the final optimum of (6), we instead analyze how an SGD-based optimizer updates the model weights under adversarial perturbations. We assume the standard SGD setup for adversarial training described in Section 2.6.

As a first step toward analyzing the SGD-based weight-updates, we show in Proposition 3.1 a simple expression for the gradient of the logistic NLL loss (5) on an example perturbed by an $\ell_\infty(\varepsilon)$-adversary. Analyzing SGD-based weight updates is still challenging since SGD is a stateful, sequential process, so in Section 3.2 we define an idealized data-generation process (which we call the Biased Coin Process, or BCP) and instead analyze the expectations of the SGD updates on $\ell_\infty(\varepsilon)$-perturbed data points drawn from the BCP, which leads to our main theoretical result, Theorem 3.1.

3.1 Adversarial Perturbations for Logistic Models

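The following Lemma gives a closed form formula for the $\ell_p(\varepsilon)$ perturbations under the logistic NLL loss (5) for $p \in \{2, \infty\}$.

Lemma 3.1 (Adversarial perturbations for logistic models)

For a fixed $x \in \mathbb{R}^d$, $y \in \{-1, +1\}$, positive integer $p$, and $\varepsilon > 0$, define $S_p(\varepsilon)$ as the set of perturbations whose $\ell_p$ norm is bounded by $\varepsilon$:

$$S_p(\varepsilon) := \{\delta \in \mathbb{R}^d : \|\delta\|_p \le \varepsilon\}, \tag{9}$$

and define the adversarial perturbation $\delta^*$ as:

$$\delta^* := \operatorname{argmax}_{\delta \in S_p(\varepsilon)} \mathcal{L}(x + \delta, y; w), \tag{10}$$

where $\mathcal{L}$ is the logistic NLL loss defined in (5). Then

$$\delta^* = -y\,\varepsilon\,\frac{w}{\|w\|_2} \quad \text{for } p = 2, \tag{11}$$

$$\delta^* = -y\,\varepsilon\,\mathrm{sgn}(w) \quad \text{for } p = \infty. \tag{12}$$

Proof. Consider the case $y = 1$ (the case $y = -1$ is analogous). In this case the NLL simplifies to $-\ln \sigma(\langle w, x + \delta \rangle)$, which is a monotonically decreasing function of $\langle w, \delta \rangle$. Hence maximizing $\mathcal{L}$ is equivalent to minimizing $\langle w, \delta \rangle$. When the $\ell_2$-norm of $\delta$ is bounded by $\varepsilon$, $\langle w, \delta \rangle$ is minimized when $\delta$ has $\ell_2$-norm $\varepsilon$ and points in the direction opposite to $w$, which implies the first result by noting that $w / \|w\|_2$ is the unit vector in the direction of $w$. When the $\ell_\infty$-norm of $\delta$ is bounded by $\varepsilon$, the lowest value of $\langle w, \delta \rangle$ is achieved when each component of $\delta$ has magnitude $\varepsilon$ but sign opposite that of the corresponding component of $w$, which implies the second result.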
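The closed forms argued for above can be checked numerically on a small example. The sketch below assumes the NLL form -ln sigmoid(y <w, x>) and the closed forms -y * eps * sgn(w) (l_inf case) and -y * eps * w / ||w||_2 (l_2 case); the weights, input, and eps are hypothetical. Because the loss is convex in the perturbation, its maximum over the l_inf box is attained at a corner, so enumerating corners suffices for that case.

```python
import itertools
import numpy as np

def nll(w, x, y):
    return np.log1p(np.exp(-y * np.dot(w, x)))

def linf_perturbation(w, y, eps):
    return -y * eps * np.sign(w)

def l2_perturbation(w, y, eps):
    return -y * eps * w / np.linalg.norm(w)

rng = np.random.default_rng(0)
w = np.array([0.8, -1.2, 0.3])        # hypothetical weights (all nonzero)
x = np.array([1.0, 1.0, -1.0])
y, eps = 1, 0.25

# l_inf case: brute-force over the 2^d corners of the box.
corners = [eps * np.array(s) for s in itertools.product([-1.0, 1.0], repeat=3)]
best_corner_loss = max(nll(w, x + d, y) for d in corners)
closed_form_linf_loss = nll(w, x + linf_perturbation(w, y, eps), y)

# l_2 case: compare against random points on the sphere of radius eps.
dirs = rng.normal(size=(2000, 3))
dirs = eps * dirs / np.linalg.norm(dirs, axis=1, keepdims=True)
best_random_l2_loss = max(nll(w, x + d, y) for d in dirs)
closed_form_l2_loss = nll(w, x + l2_perturbation(w, y, eps), y)
```

The closed-form l_inf loss should match the best corner, and the closed-form l_2 loss should be at least as large as the loss at any sampled point on the sphere.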

The following Proposition shows an expression for the gradient of the logistic NLL loss $\mathcal{L}(x + \delta^*, y; w)$ where $\delta^*$ is the $\ell_\infty(\varepsilon)$ perturbation (12). The proof, shown in Appendix A, involves simple algebra, applications of the chain rule, and properties of the sigmoid function.

Proposition 3.1 (Gradient of Logistic NLL Loss under adversarial perturbation)

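If $\mathcal{L}$ is the logistic NLL loss given by (5), and $x' = x + \delta^*$ where $\delta^*$ is the $\ell_\infty(\varepsilon)$ adversarial perturbation (12), then for each $i \in \{1, \ldots, d\}$ the gradient $\partial \mathcal{L}(x', y; w) / \partial w_i$ is given by:

$$\frac{\partial \mathcal{L}(x', y; w)}{\partial w_i} = -\sigma\big(-y \langle w, x' \rangle\big)\,\big[\,y\,x_i - \varepsilon\,\mathrm{sgn}(w_i)\,\big]. \tag{13}$$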
3.2 Expectation of SGD Updates on Adversarially Perturbed BCP data

Since SGD is inherently a stateful process, it is challenging to analyze its sequential dynamics. Instead we analyze the expectation of the gradient updates from an arbitrary state, when the training data points are drawn from an idealized, general data-generation process we call the Biased Coins Process (BCP), defined below. The BCP can be viewed as a generative model that is simulating the distribution of points when drawn with replacement from an actual (finite) dataset. Our main result is Theorem 3.1 which characterizes the expected gradient updates on -adversarially perturbed BCP inputs. This result yields several insights that help explain our results on synthetic datasets (Section 4), the UCI datasets (Section C) and real-world advertising datasets (Section 6.2).

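In the BCP, a data-point $(x, y)$ is generated as follows. The label $y$ is chosen uniformly at random from $\{-1, +1\}$, and $x \in \{-1, +1\}^d$ is a $d$-dimensional feature vector where for each $i \in \{1, \ldots, d\}$,

$$x_i = \begin{cases} y & \text{with probability } (1 + b_i)/2, \\ -y & \text{with probability } (1 - b_i)/2, \end{cases} \tag{14}$$

where $b_i \in [-1, 1]$ is the bias of feature $i$. Note that $E(x_i y) = b_i$ and $E(x_i) = 0$ for any $i$, and $E(y) = 0$, and the variances of $x_i$ and $y$ are both 1.0, so the correlation of $x_i$ and $y$ is $b_i$. This fact will be useful when interpreting the implications of Theorem 3.1 in Section 3.3.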
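A sampler for this process is short to write down. The sketch below assumes the reading in which feature i equals the label with probability (1 + b_i)/2 and its negation otherwise, which is the form consistent with the label-correlation being b_i. The function name and bias values are hypothetical.

```python
import numpy as np

def sample_bcp(n, biases, rng):
    """Draw n points: y uniform in {-1, +1}; x_i = y w.p. (1 + b_i)/2, else -y."""
    biases = np.asarray(biases, dtype=float)
    y = rng.choice([-1, 1], size=n)
    agree = rng.random((n, biases.size)) < (1.0 + biases) / 2.0
    x = np.where(agree, y[:, None], -y[:, None])
    return x, y

rng = np.random.default_rng(0)
biases = [0.5, -0.2, 0.0]                          # hypothetical bias values
x, y = sample_bcp(200_000, biases, rng)
empirical_corr = (x * y[:, None]).mean(axis=0)     # estimates E[x_i * y] for each feature
```

On a large sample, `empirical_corr` should land close to the chosen biases, illustrating that the bias of a feature is its correlation with the label.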

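Consider an arbitrary stage of the SGD algorithm where the current weight vector is $w$, and we generate a data-point $(x, y)$ according to the BCP, and after replacing the feature-vector $x$ with $x' = x + \delta^*$ where $\delta^*$ is the $\ell_\infty(\varepsilon)$ adversarial perturbation in Eq. (12), we present $(x', y)$ to the SGD algorithm. We then ask: what is the expectation of the gradient-based update of the weight $w_i$? Each $w_i$ will be updated to a new value $w_i - \eta\,\partial \mathcal{L}(x', y; w) / \partial w_i$, where $\eta$ is the (current) learning rate, and $\mathcal{L}$ is the logistic NLL loss defined in Eq. (5). For brevity we denote the change in $w_i$ by $\Delta w_i$, i.e.

$$\Delta w_i := -\eta\,\frac{\partial \mathcal{L}(x', y; w)}{\partial w_i}. \tag{15}$$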
Given the expression (13) for the gradient of from Proposition 3.1, it turns out that the following conditional expectation will be useful when computing :


where the expectation is over data points generated by the BCP. Note that the value of represents whether or not the value of feature is "aligned" with the label , and means that the weight of feature is "consistent" with its label-alignment: i.e. either and , or and . Conversely, means that the weight of feature is "inconsistent" with its label-alignment. In general when we say that "feature has consistency ". Further note that is the model’s predicted probability of the wrong label (on an -adversarially perturbed data-point ). We can therefore interpret as the "expected mis-prediction, given consistency of feature ". Intuitively, the value of is inversely related to the model’s performance on the "slice" of the BCP data where the feature has consistency . Therefore in the initial stages of the SGD algorithm, we expect to be relatively large (i.e. closer to 1.0) and in the later stages, it will be relatively small. This intuition will be useful later in Section 3.3 when we interpret the results of Theorem 3.1, which we state below (see Appendix B for the proof).

Theorem 3.1 (Expectation of logistic gradient update for the BCP)

Given a weight vector , assuming a learning rate (only to avoid notational clutter), if a data point is drawn according to the BCP above, and where is the adversarial perturbation given by Eq. (12), then for each , the expectation (over random draws from the BCP) of the SGD update (defined in Eq. (15)) satisfies the following properties:

  1. If then

  2. If , then




It is easy to verify from Eq. (15) that when the learning rate , it shows up as a multiplicative factor on the right hand side of equation (17), and the bounds (18) and (19) in the Theorem.

3.3 Implications of the Expected Gradient Results

Theorem 3.1 has a few interesting implications. It will help to recall from Section 3.2 that the conditional expectation can be interpreted as the "expected mis-prediction of the model when feature has consistency ", which in turn means that in the initial stages of SGD we expect to be relatively large, and it will shrink as SGD progresses toward a better model. Since this factor appears in all of the bounds of Theorem 3.1, the various effects discussed below are more pronounced in the initial stages of SGD and less so during later stages. Note also that the quantity appearing in Theorem 3.1 (and in the implications below) is the absolute correlation between and , as noted at the beginning of Section 3.2. We will refer to as the absolute label-correlation of .

In the following interpretations, we say the weight of a feature is aligned, if , and otherwise we say it is mis-aligned. We also informally say that the feature is "weight-aligned" and "weight-mis-aligned" respectively. Also note that signifies that the expected update of weight shrinks it toward zero. Conversely, signifies an expansion of the weight in the current direction.

  • (1). Weights grow in the right direction starting from zero. Property 1 of Theorem 3.1 implies that for any (i.e. for natural as well as adversarial training) if a feature has weight , the SGD update will on average "grow" its weight in the "correct" direction, i.e. in the same direction as its bias , and the expected magnitude of the update is .

  • (2). Mis-aligned weights are shrunk. When , the upper bound (18) in Property 2 of Theorem 3.1 simplifies to

    which shows that if the weight of a feature is mis-aligned, then its weight is shrunk toward zero on average, and the magnitude of the shrinkage is proportional to . This makes it clear that adversarial training with a positive value of shrinks mis-aligned weights more aggressively than natural training (i.e. with ), and this effect is even more pronounced for features with large absolute label-correlations.

  • (3). Aligned weights are shrunk by a sufficiently large . When , the upper bound (18) simplifies to


    which means that for a weight-aligned feature with bias , if the adversarial exceeds the feature’s absolute label-correlation , then its weight is shrunk toward zero in expectation, and the expected magnitude of the shrinkage is proportional to .

  • (4). Aligned weights are expanded up to a point, for sufficiently small . When , the lower bound (19) simplifies to

    and this lower bound is positive if and


    In other words, if is weight-aligned, and is sufficiently smaller than its absolute label-correlation (to account for the term in Eq. (21)), then its weight is expanded or preserved on average.

Implications (3) and (4) are specific to adversarial training since they apply only for a certain range of non-zero values of . Indeed, these two implications are key to the feature-concentration effect of adversarial training which we explore in detail in the experiment sections. Suppose for example that there are two features (in the BCP-generated data) with biases , and respectively, and we adversarially train a logistic model with . Then by implication (2) a negative weight on either feature will be shrunk toward zero on average. Implication (3) means that a positive weight on the "weakly biased" feature will be shrunk toward zero with a magnitude proportional to almost . Implication (4) means that a positive weight say on the "strongly biased" feature will be preserved or expanded since , and can be verified to satisfy the bound (21).
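The shrink-or-expand behavior described in the implications above can be explored with a small Monte Carlo experiment. This is a sketch under stated assumptions, not a reproduction of Theorem 3.1: it assumes the BCP form in which feature i equals the label with probability (1 + b_i)/2, the NLL form -ln sigmoid(y <w, x>), the perturbation -y * eps * sgn(w), a learning rate of 1, and nonzero weights. The biases, weights, and eps below are hypothetical, with the weights kept tiny so that the model's mis-prediction probability is close to 0.5.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def expected_signed_update(w, biases, eps, n, rng):
    """Monte Carlo estimate of E[sgn(w_i) * (SGD change in w_i)], learning rate 1.

    A negative entry means the weight shrinks toward zero on average;
    a positive entry means it expands in its current direction.
    """
    w = np.asarray(w, dtype=float)
    biases = np.asarray(biases, dtype=float)
    y = rng.choice([-1, 1], size=n)
    agree = rng.random((n, w.size)) < (1.0 + biases) / 2.0
    x = np.where(agree, y[:, None], -y[:, None])
    x_adv = x - eps * y[:, None] * np.sign(w)      # l_inf(eps) perturbation
    mispred = sigmoid(-y * (x_adv @ w))            # predicted prob. of the wrong label
    # Negative NLL gradient w.r.t. w_i, multiplied by sgn(w_i) (w_i assumed nonzero).
    signed = mispred[:, None] * (x * y[:, None] * np.sign(w) - eps)
    return signed.mean(axis=0)

rng = np.random.default_rng(0)
biases = [0.5, 0.02, 0.5]        # strong, weak, strong
w = [0.01, 0.01, -0.01]          # aligned, aligned, mis-aligned
est = expected_signed_update(w, biases, eps=0.1, n=200_000, rng=rng)
```

In this setup the strongly biased, aligned feature (bias above eps) should show a positive expected signed update, while the weakly biased aligned feature (bias below eps) and the mis-aligned feature should both show negative ones.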

More generally, if there are truly random non-predictive features in a data-set, that just happen to show spurious correlations with the label in a finite sample, these features will have much weaker biases than the truly predictive features, and so adversarial training with an appropriate would tend to weed out the non-predictive features. We see this phenomenon clearly in the synthetic datasets in the next section, where we intentionally construct a dataset with a mix of predictive and non-predictive random features. In the experiments on the UCI datasets (Section C) and on real-world advertising datasets (Section 6.2) we also see this feature-pruning effect. In these non-synthetic experiments, it is possible that the features being pruned are either truly non-predictive ones, or very weakly biased ones. In datasets with hundreds of thousands of sparse categorical features, pruning even weakly predictive features can be valuable due to the explainability and model-size compression benefits.

Finally, implications (3) and (4) above hint at the possibility of a goldilocks zone of values which are large enough to weed out irrelevant or weakly-relevant features (i.e. those with tiny biases) and yet small enough to preserve the truly predictive features (i.e those with significant biases), thus maintaining model performance. For instance if the ordered sequence of absolute feature-label correlations has a gap that separates "relevant" features from "irrelevant" or "weakly relevant", then any that happens to be in this gap will induce the following behaviors. Eq. (20) suggests that this would be large enough that it exceeds the absolute label-correlations of the weakly-relevant features, and so their weights will tend to shrink even if the features are weight-aligned (and if they are not, then their weights would tend to shrink anyway due to implication (2)). On the other hand Eq. (21) suggests that if the is sufficiently below the lowest absolute label-correlation of the "relevant" features (to account for the term in the bound in Eq. (21)), then the weights of the relevant features are preserved or expanded. In other words, adversarial training with an appropriate acts as an aggressive correlation-based feature-filter, or what we referred to as a relevance-filter in Section 1.

The fact that the model-concentration behavior of -adversarial training is realized over a range of is important from a practical perspective: it implies that we can find a suitable more easily and that the desired behavior does not just occur for one "lucky" value.

In case there are a large number of features which are irrelevant or weakly-relevant, then the results of Theorem 3.1 suggest that -adversarial training with an appropriate choice of can aggressively weed out the weights of these features, and produce a model which has significantly better feature-concentration compared to a naturally-trained one (i.e. with ).

It should be pointed out that in Theorem 3.1 we do not analyze how adversarial training impacts model accuracy, and we leave this as an open question for future research.

In closing this section we make an observation about $\ell_1$-regularization. As noted by Goodfellow2014-cy , in the context of logistic regression, the gradient of the loss function under $\ell_1$-regularization has some similarities with the gradient of the loss under $\ell_\infty(\varepsilon)$-adversarial training as in Eq. (13). In fact in our experiments using a pure SGD optimizer, we find the behavior of $\ell_1$-regularization and $\ell_\infty(\varepsilon)$-adversarial training to be somewhat similar on the synthetic and toy UCI datasets (but not on the real-world MediaMath advertising datasets). However as mentioned in Section 1 most practitioners do not use a simple SGD and instead use optimizers such as FTRL Brendan_McMahan2013-uw to obtain improved accuracy or AUC on test data. Specifically when using FTRL we no longer see this similarity, and $\ell_\infty(\varepsilon)$-adversarial training outshines $\ell_1$-regularization in its ability to home in on relevant features without significantly impacting AUC or accuracy on natural test data.

4 Experiments on Synthetic Datasets

The results of Theorem 3.1 and the discussion in Section 3.3 imply that $\ell_\infty(\varepsilon)$-adversarial training can perform aggressive relevance-filtering when learning a logistic regression model, but the results only apply to data points drawn from the BCP, and only characterize the expectation of SGD weight updates. Nevertheless, these results prompt us to ask: Is it possible to reproduce the relevance-filtering behavior of adversarial training on a finite synthetic dataset with a mix of predictive and non-predictive features, without sacrificing standard accuracy? Specifically, we want to design a synthetic dataset where:

  • Natural training places non-negligible weights on the predictive features, and on at least some non-predictive ones.

  • $\ell_\infty(\varepsilon)$-adversarial training with some $\varepsilon > 0$ places significant weight on the predictive features, but negligible weights on the non-predictive features.

  • $\ell_\infty(\varepsilon)$-adversarial training achieves accuracy (or AUC-ROC) comparable to that of natural training, on natural test data.

It turns out that the following synthetic dataset demonstrates the above phenomena very well.

4.1 Synthetic Data Generation and Training Process

The label is -1 or +1 with equal probability. The input vector consists of 2 kinds of features:

Correlated, strongly predictive features

: identical features such that:

I.I.D, random non-predictive features:

i.i.d. features each taking the values +1/-1 with equal probability.

For our experiments we generate 1000 data-points according to these specifications, with 8 identical predictive features and 8 random non-predictive features. The rationale behind these choices is to ensure that there are "sufficiently many" features of each type, and 8 of each suffices to highlight some of the behaviors we are discussing. The rationale for needing "sufficiently many" is the following: When there are sufficiently many random non-predictive features, some of them will accidentally "look" like they are correlated with the label: this is because in a set of $n$ tosses of a fair coin, the expectation of the fractional imbalance between heads and tails is proportional to $1/\sqrt{n}$, and the distribution of the imbalance is highly concentrated around this expectation. As a result, if there are sufficiently many random non-predictive features, at least some of them will appear correlated with the label, and so natural training will put non-negligible weights on them. Many real-world structured datasets are moderate-sized (say under 100,000 data points), and even if the number of data points is very large, there may be a very large number of sparse categorical features, and thus this "spurious" correlation can occur with higher likelihood. When there are several identical, predictive features, natural model training will force these features to "share" their weight roughly equally, thus pushing down the weights to a level comparable to or even less than that of non-predictive features.
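A dataset of this shape can be generated in a few lines. The sketch below is illustrative only: the agreement probability of 0.6 for the predictive features is an assumption (it corresponds to the bias of 0.2 used for the predictive feature in Section 4.4), the size of 1000 is taken from the 700/300 train/test split, and the function name is hypothetical.

```python
import numpy as np

def make_synthetic(n, n_pred, n_rand, p_agree, rng):
    """Labels in {-1, +1}; n_pred identical predictive features; n_rand random ones."""
    y = rng.choice([-1, 1], size=n)
    # One coin flip per example decides whether the (identical) predictive
    # features agree with the label.
    agree = rng.random(n) < p_agree
    x_single = np.where(agree, y, -y)
    x_pred = np.repeat(x_single[:, None], n_pred, axis=1)
    x_rand = rng.choice([-1, 1], size=(n, n_rand))      # independent of y
    return np.hstack([x_pred, x_rand]), y

rng = np.random.default_rng(0)
X, y = make_synthetic(n=1000, n_pred=8, n_rand=8, p_agree=0.6, rng=rng)
X_train, y_train = X[:700], y[:700]
X_test, y_test = X[700:], y[700:]
# Empirical label-correlation of each feature on the training split.
train_corr = (X_train * y_train[:, None]).mean(axis=0)
```

Inspecting `train_corr` shows the spurious correlations discussed above: the last 8 entries are not exactly zero even though those features are generated independently of the label.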

The experimental training and testing methodology is as follows. We train a logistic model (with a non-zero bias term) on the first 700 data-points and test on the remaining 300. For training we use a mini-batch size of 20 and train for 200 epochs, with the FTRL optimizer with a learning-rate of 0.01, and all weights initialized to zero. While the results are qualitatively similar with other optimizers such as ADAM, simple SGD or AdaGrad, the best results are obtained using FTRL, as stated in Sec. 1. During training, we perturb the input vector $x$ using the worst-case $\ell_\infty(\varepsilon)$ adversarial perturbation from Eq. (12) with a specific choice of $\varepsilon$, before taking gradients:

$$x' = x - y\,\varepsilon\,\mathrm{sgn}(w).$$

4.2 Comparison of Natural and -Adversarial Training with

Our first experiment is to compare the models resulting from adversarial training with $\varepsilon = 0$ (which is equivalent to natural training), versus a fixed positive $\varepsilon$. As shown in Table 1, the AUC and accuracy of the models (on natural test data) are similar, but the learned weights show a dramatic difference. To make it easier to distinguish the random non-predictive features from the predictive features $x_1, \ldots, x_8$, we relabel the non-predictive features as $r_1, \ldots, r_8$, and refer to these two feature-groups as "$x$-features" and "$r$-features" respectively. Table 1 shows that after natural training, the $\ell_1$-norm of the predictive features is comparable to that of the non-predictive features, whereas adversarial training reduces the $\ell_1$-norm of the non-predictive features to a negligible amount, while maintaining a significant $\ell_1$-norm on the predictive features.

This contrast between natural and adversarial training is seen clearly in the bar-chart of Figure 1. As expected, natural training results in equal weights on the 8 predictive features (and this is true for adversarial training as well). However natural training also places a significant weight on the 8 random non-predictive features, with 4 of them having higher weights than the predictive features. This is clearly problematic from a model-explanation point of view: when explaining the predicted probability on a specific example, we would like to not see the non-relevant features contributing a meaningful amount, and especially not more than the truly predictive features.

In sharp contrast to natural training, adversarial training does not suffer from this problem: the only significant weights are on the 8 predictive features, and all of the non-predictive feature-weights are selectively killed-off, with weights close to zero. This is precisely the aggressive relevance-filtering effect of adversarial training that we wanted to demonstrate.

training      AUC     accuracy   Wts_x_L1   Wts_r_L1
natural       0.668   0.676      0.724      0.661
adversarial   0.676   0.676      0.205      0.001

Table 1: AUC, accuracy and model-concentration metrics for natural and adversarial training with a fixed $\varepsilon$ on the synthetic dataset. The $\ell_1$-norm of the weights of the predictive features is denoted Wts_x_L1, and the $\ell_1$-norm of the weights of the random non-predictive features is denoted Wts_r_L1.
Figure 1: Bar chart of feature weights of logistic model trained on the synthetic dataset with natural training (training = nat, colored grey) and -adversarial training with (training = adv, colored orange). The features are shown left to right in decreasing order of absolute value of naturally-trained weights. The 8 predictive features are , and the random non-predictive features are relabeled to make it easy to distinguish them. Both training modes result in equal weights on the predictive features. However natural training results in all of the non-predictive features having significant weights, with 4 of them having higher weights than the predictive features, whereas adversarial training selectively "kills off" all the non-predictive feature-weights, while retaining a significant weight on the predictive ones.

4.3 Adversarial Vs Natural Training for a Range of values

We gain further insight into the impact of $\varepsilon$ on the adversarially-learned model weights, by repeating the above adversarial training (and natural testing) with a range of $\ell_\infty$-bounds $\varepsilon$ from 0.0 to 2.0. For each value of $\varepsilon$ we separately compute the $\ell_1$-norm of the learned weights of the predictive $x$-features, and the $\ell_1$-norm of the learned weights of the non-predictive random $r$-features. These are plotted in Figure 2, along with the AUC on the 300 natural test data points. The figure shows that as $\varepsilon$ increases, the $\ell_1$-norm of the weights of the $r$-features approaches zero much more rapidly than the $\ell_1$-norm of the $x$-features. Moreover at the level of $\varepsilon$ where the $r$-feature weights approach zero, the AUC (on natural test data) is nearly as high as the AUC with natural training ($\varepsilon = 0$). Notice that there is a range of $\varepsilon$ values (shown by the blue band in the figure) which are "just right": i.e. large enough to de-weight non-predictive features, yet small enough to preserve sufficient weight on the predictive features and hence have minimal performance impact on natural test data. This is an instance of the goldilocks zone of $\varepsilon$ values which we referred to in Section 3.3: $\ell_\infty(\varepsilon)$-adversarial training with an $\varepsilon$ value in this zone yields both good model explanations (due to the aggressive relevance-filtering behavior where non-predictive features are given negligible weight) and good model performance (since standard accuracy/AUC is maintained).

Figure 2: Variation of three quantities in adversarial training as the -bound of the adversary increases: The top plot shows the -norm of the predictive feature-weights. The middle plot shows the -norm of random non-predictive feature-weights. The bottom plot shows the AUC on the natural (i.e. unperturbed) test dataset. Note that corresponds to training on natural examples, and as we increase , the learned weights of the non-predictive -features approach zero much more rapidly than those of the predictive -features, and the where the -weights approach zero still maintains the AUC of a naturally-trained model. The blue band represents the "goldilocks zone" of values that are "just right", i.e. adversarial training with this bound yields models that have both good explanations (since they eliminate non-predictive features) and good model performance (since AUC is maintained close to the naturally trained level).

It is natural to wonder whether the feature-weight concentration effect of adversarial training can be achieved using traditional $\ell_1$-regularization. On this specific synthetic dataset we find that indeed it is possible to achieve a similar de-weighting of the non-predictive features using $\ell_1$-regularization, but this requires a very large weight on the regularization term in the loss function: it needs to be at least 200, as shown in Figure 3. However as we show in Section 6.2, in large real-world datasets, such a high regularization weight damages the AUC of the model considerably, and small values do not achieve a model-concentration as significant as that achieved by adversarial training.

Figure 3: Similar to Figure 2, but keeping fixed at 0 (i.e. natural training), and varying the -regularization parameter .

4.4 Explanation of Adversarial Training Behavior on the Synthetic Dataset

We can now use Theorem 3.1 to at least partially explain some of the observations we made on our synthetic dataset of $n$ points. Our experiments used a variant of SGD with a mini-batch size of 20 on a shuffled dataset, and so the behavior of the learning algorithm can be reasonably approximated by a modified process where at each stage, a single data-point is drawn uniformly at random with replacement from the $n$-point dataset and perturbed by the adversary, and presented to the SGD algorithm for a gradient-based update. The random draws of points from the dataset can therefore be modeled as a BCP, with appropriate choices of the biases of each feature. Recall that in the synthetic dataset (Section 4) the 8 identical predictive features all have the same bias. Since these will share weights equally in the optimal solution, we can reasonably approximate the overall effect of these identical predictive features using a single feature in the BCP with bias 0.2. The 8 non-predictive features are each independently uniformly chosen from $\{-1, +1\}$. Thus each non-predictive feature has a probability 0.5 of agreeing or disagreeing with the label $y$. However in a finite sample of $n$ data points, for a non-predictive feature, the expectation of the fractional absolute imbalance between agreements and disagreements (with the label) is of the order of $1/\sqrt{n}$, and moreover the distribution of this imbalance is highly concentrated around the expectation. Therefore we can model each of these non-predictive features as a feature in the BCP with an absolute bias of the order of $1/\sqrt{n}$.
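The order-of-magnitude claim about the imbalance of a non-predictive feature is easy to check by simulation. The sketch below treats the number of agreements with the label as a Binomial(n, 0.5) draw and compares the simulated mean absolute imbalance with the normal-approximation value sqrt(2 / (pi * n)); the choices n = 1000 and 20,000 trials are hypothetical.

```python
import numpy as np

def mean_abs_imbalance(n, trials, rng):
    """Average of |#agree - #disagree| / n over many fair-coin datasets of size n."""
    agree = rng.binomial(n, 0.5, size=trials)
    return np.mean(np.abs(2 * agree - n) / n)

rng = np.random.default_rng(0)
n = 1000
estimate = mean_abs_imbalance(n, trials=20_000, rng=rng)
normal_approx = np.sqrt(2.0 / (np.pi * n))     # E|Z| / sqrt(n) for Z ~ N(0, 1)
```

Both quantities are a constant multiple of 1/sqrt(n), which is the sense in which a purely random feature behaves like a BCP feature with a small but nonzero absolute bias.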

In other words, we are approximating the synthetic dataset of Section 4 with a BCP where there is one predictive feature with bias 0.2, and 8 features with absolute bias of the order of $1/\sqrt{n}$. Theorem 3.1 and its implications (Section 3.3) then help us explain the following observations. (It will be helpful to see Figure 1, which shows the weights from natural and adversarial training, and Figure 2, which shows the variation of the $\ell_1$-norms of the two feature-types with varying $\varepsilon$.)

Observation 1: Natural training (i.e. -adversarial training with ) places non-negligible weights on at least some of the non-predictive random features, with some of them having weights even larger than predictive features. This is seen in Figure 1. As mentioned before, since the 8 predictive features are identical, natural (and adversarial with any ) learning results in a model where the weights are shared equally amongst them. On the other hand, many of the random non-predictive features have a bias of . Due to implication (1) in Section 3.3, their weights grow from 0 in the direction of their bias, and due to implication (4) they are maintained or expanded in their current direction since for natural training. This effect, combined with the fact that the weights of the 8 predictive features are shared equally, causes the weights of some of the non-predictive features to exceed that of the predictive features.

Observation 2: With adversarial training () the learned weights on the 8 non-predictive features are close to 0 (Figure 1), but the 8 predictive features retain significant weights. This is the key relevance-filtering effect highlighted by these experiments. Note that far exceeds the absolute bias of the random-non-predictive features, so by implication (2) in Section 3.3, any mis-aligned weights of these features are shrunk toward zero with an expected change proportional to , and by implication (3) any aligned weights are also shrunk toward zero, this time with an expected change proportional to . Both these expected shrinkage rates are close to since the absolute biases of these non-predictive features are and therefore negligible compared to . Note that the single predictive feature has bias 0.2, so by implication (2), if its weight is mis-aligned, it shrinks toward zero with an expected change proportional to or , which is a much more aggressive rate than that of the non-predictive features. If the predictive feature has an aligned weight, by implication (3) it shrinks toward zero since , and the expected change is proportional to , which is a much slower shrinkage rate than that of the non-predictive features (which is around 1.0). Presumably, after reaching a certain level, the weight of the predictive feature enters the regime of implication (3), which preserves its weight. This can explain why even with the weights on the predictive features remain significant.

5 Feature Attribution in 1-Layer Networks

We now turn our attention to measuring feature concentration, which we argue in this paper is a key benefit of -adversarial training on logistic regression models. As mentioned in Section 2.8, feature concentration could potentially be measured in terms of the model weights, but this is not always the best approach. This is particularly the case for real-world datasets such as those considered in the advertising response prediction task in Section 6.2, where there are many high-cardinality categorical features, and we need an appropriate way to measure an aggregate importance of a categorical feature over a dataset. It is therefore worth considering some more principled ways of measuring feature-importance which are amenable to a natural aggregate importance definition.

We find that the feature-attribution method of Integrated Gradients (IG) Sundararajan2017-rz (described in Section 2.8) is very well suited to our purposes: We derive a closed-form expression for the IG-based feature-attributions for a 1-layer neural network (Lemma 5.1), and this makes it computationally very efficient to compute the attributions of all features (which could number into the hundreds of thousands since there are potentially many high-cardinality categorical features). Moreover, in Section 5.2 we propose natural ways to aggregate the IG-based feature-importance metrics across a dataset (which we call Feature Impact and Feature-Value Impact), for categorical as well as numerical features.

5.1 Closed Form for IG in 1-Layer Networks

For general neural networks, the authors of Sundararajan2017-rz show how to approximate the IG integral (8) by a summation involving gradients at equally-spaced points along the straight-line path from the baseline $u$ to the input $x$. While this approximation is reasonably efficient for a fixed example $x$ and dimension $i$, it can be prohibitively expensive for computing the IG values across a dataset of millions of examples and thousands of (sparse) features. Closed-form expressions for the IG would therefore be of significant interest, especially if the goal is to compute the IG over an entire dataset in order to glean aggregate feature importances.

We first show an exact closed-form expression for the IG when $F$ is a single-layer network.

Lemma 5.1 (IG Attribution for 1-layer Networks)

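If $F$ is computed by a 1-layer neural network (3) with weights vector $w$, then the Integrated Gradients for all dimensions of $x$ relative to a baseline $u$ are given by:

$$IG^F(x, u) = [F(x) - F(u)] \, \frac{(x - u) \odot w}{\langle x - u, w \rangle} \qquad (23)$$

where the operator $\odot$ denotes the entry-wise product of vectors.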


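Consider the partial derivative $\partial_i F(u + \alpha (x - u))$ in the definition (8) of $IG_i^F(x, u)$. For a given $x$, $u$ and $\alpha$, let $v$ denote the vector $u + \alpha (x - u)$. Then $\partial_i F(v) = \partial A(\langle w, v \rangle) / \partial v_i$, and by applying the chain rule we get:

$$\partial_i F(v) = A'(z) \, \frac{\partial \langle w, v \rangle}{\partial v_i} = w_i A'(z),$$

where $A'(z)$ is the gradient of the activation $A$ at $z = \langle w, v \rangle$. This implies that:

$$\frac{\partial F(v)}{\partial \alpha} = \sum_i \frac{\partial F(v)}{\partial v_i} \frac{\partial v_i}{\partial \alpha} = \sum_i w_i A'(z) (x_i - u_i) = \langle x - u, w \rangle A'(z).$$

We can therefore write

$$dF(v) = \langle x - u, w \rangle A'(z) \, d\alpha,$$

and since $\langle x - u, w \rangle$ is a scalar, this yields

$$A'(z) \, d\alpha = \frac{dF(v)}{\langle x - u, w \rangle}.$$

Using this equation the integral in the definition of $IG_i^F(x, u)$ can be written as

$$\int_{\alpha=0}^{1} \partial_i F(v) \, d\alpha = \int_{\alpha=0}^{1} w_i A'(z) \, d\alpha = \int_{\alpha=0}^{1} w_i \frac{dF(v)}{\langle x - u, w \rangle} = \frac{w_i}{\langle x - u, w \rangle} \int_{\alpha=0}^{1} dF(v) \qquad (24)$$

$$= \frac{w_i}{\langle x - u, w \rangle} [F(x) - F(u)],$$

where (24) follows from the fact that $(x - u)$ and $w$ do not depend on $\alpha$. Therefore from the definition (8) of $IG_i^F(x, u)$:

$$IG_i^F(x, u) = [F(x) - F(u)] \, \frac{(x_i - u_i) w_i}{\langle x - u, w \rangle},$$

and this yields the expression (23) for $IG^F(x, u)$.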

Note that the closed form expression (23) does not depend on the activation derivative at all, as long as the activation is differentiable. There is a natural interpretation of the closed form expression (23): when the input changes from the baseline value $u$ to $x$, the dot product $\langle w, x \rangle$ changes by $\langle x - u, w \rangle$, the fractional contribution of dimension $i$ to this change is $(x_i - u_i) w_i / \langle x - u, w \rangle$, and $IG_i^F(x, u)$ is this fraction times the total function value change $F(x) - F(u)$.
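The closed form can be checked numerically against a Riemann-sum approximation of the path integral. The following is a minimal sketch, assuming a sigmoid activation and using the hypothetical names `ig_closed_form` and `ig_riemann` (neither appears in the paper); `w`, `x` and `u` stand for the weight vector, the input and the baseline.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ig_closed_form(w, x, u, activation=sigmoid):
    """Closed-form IG for F(x) = activation(<w, x>) relative to baseline u."""
    delta = (x - u) * w              # per-dimension share of the dot-product change
    total = delta.sum()              # <x - u, w>
    if total == 0.0:
        # The ratio in the closed form is undefined when the dot product does not change.
        raise ValueError("closed form is undefined when <x - u, w> == 0")
    f_change = activation(w @ x) - activation(w @ u)
    return f_change * delta / total

def ig_riemann(w, x, u, steps=2000, activation=sigmoid, h=1e-6):
    """Midpoint Riemann-sum approximation of the IG path integral (illustrative only)."""
    alphas = (np.arange(steps) + 0.5) / steps
    z = w @ u + alphas * (w @ (x - u))                           # <w, v> along the path
    act_grad = (activation(z + h) - activation(z - h)) / (2.0 * h)  # numerical A'(z)
    return (x - u) * w * act_grad.mean()

# Illustrative values, not taken from the paper.
w = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 1.0, 1.0])
u = np.zeros(3)
exact = ig_closed_form(w, x, u)
approx = ig_riemann(w, x, u)
```

The closed form needs only two activation evaluations per example, whereas the approximation needs one gradient evaluation per path step, which is the source of the speed-up when attributions are computed over a large dataset.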

5.2 Aggregation of IG Over a Dataset

The IG methodology of Sundararajan2017-rz only considers feature attribution for a single example . In order to understand the relative importance of features over a (possibly large) dataset, it would be helpful to somehow aggregate the IG values across multiple examples. We propose here a simple method to do this in structured datasets.

We can now describe our IG aggregation procedure. As mentioned before, we assume that the neural network input is an exploded-form vector $x$. Note that the $i$'th dimension of $x$ corresponds either to a numerical feature in the original feature-vector, or some specific value of a categorical feature. The IG for each dimension of $x$ can be computed from Eq. (8) for a general neural network, or from Eq. (23) for a 1-layer network. Informally, $IG_i^F(x, u)$ measures the extent to which the $i$'th dimension of $x$ contributed to "moving" the network output from its baseline value $F(u)$ to $F(x)$. In other words $IG_i^F(x, u)$ represents the impact of the $i$'th dimension on the output for example $x$. A reasonable measure of the "importance" of dimension $i$ in some suitable dataset $\mathcal{D}$ is therefore the simple average of $|IG_i^F(x, u)|$ over all $x \in \mathcal{D}$. We call this the feature-value impact FVI (since the $i$'th dimension in exploded space corresponds to a specific value of a categorical feature, or a numeric feature):

$$FVI_i[\mathcal{D}] = \frac{1}{|\mathcal{D}|} \sum_{x \in \mathcal{D}} |IG_i^F(x, u)| \qquad (25)$$
In the case of a categorical feature, we are also interested in the overall impact of that feature. For example we may want to know the overall importance of the dayOfWeek feature in some dataset $\mathcal{D}$. A reasonable way to compute the feature-impact FI, i.e. the overall importance of a categorical feature in original form, is to add the FVI values over all dimensions corresponding to this feature in the exploded space. Writing $S_j$ for the set of exploded dimensions that correspond to original feature $j$:

$$FI_j[\mathcal{D}] = \sum_{i \in S_j} FVI_i[\mathcal{D}] \qquad (26)$$
The FI metric is particularly useful to gain an understanding of the aggregate importance of high-cardinality categorical features. For example we measure the feature-concentration of models trained on the MediaMath datasets (which have categorical features with cardinalities in the 100,000 range) in terms of the FI metric (see the FI.L1 and FI.1Pct metrics in Section 6.1).
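The two-level aggregation can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the function names, the `dim_to_feature` mapping and the toy IG matrix are hypothetical, and the use of absolute IG values is an assumption based on the remark in Section 6.1 that FI components are positive by definition.

```python
import numpy as np

def feature_value_impact(ig_matrix):
    """FVI per exploded dimension: mean absolute IG over examples (one row per example)."""
    return np.abs(np.asarray(ig_matrix, dtype=float)).mean(axis=0)

def feature_impact(fvi, dim_to_feature):
    """FI per original feature: sum of FVI over that feature's exploded dimensions."""
    fi = {}
    for dim, feature in enumerate(dim_to_feature):
        fi[feature] = fi.get(feature, 0.0) + float(fvi[dim])
    return fi

# Toy example: two exploded dimensions for dayOfWeek, one numeric feature "price".
ig_matrix = [[0.2, -0.1, 0.0],
             [0.0,  0.3, 0.4]]
dim_to_feature = ["dayOfWeek", "dayOfWeek", "price"]

fvi = feature_value_impact(ig_matrix)
fi = feature_impact(fvi, dim_to_feature)
```

For a numeric feature the exploded space has a single dimension, so its FI equals its FVI.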

6 Experiments

In Section 3.2 we analyzed the expectation of the SGD weight-updates of a logistic regression model during adversarial training, on an idealized random data-generation process (the BCP). Theorem 3.1 in that section suggested that adversarial training can have an aggressive relevance-filtering (or model-concentration) behavior: it can aggressively shrink the weights of irrelevant or weakly-relevant features while maintaining significant weights on relevant features, and hence without significantly impacting performance on natural test data. In Section 4 we showed that this behavior can be realized on a specific synthetic dataset. A natural next question is whether the model-concentration benefits of adversarial training can be seen in real-world structured datasets. This is the question we explore in this section.

Specifically, we describe the results of experiments intended to answer the following questions for real-world datasets:

  1. Is there a "goldilocks zone" of values of the adversarial bound $\varepsilon$ for which significant model-concentration (as measured by various metrics defined in Section 6.1 below) is achieved by adversarial training, while AUC (on natural test data) is no more than 0.01 below that of natural training (i.e. with $\varepsilon = 0$)?

  2. Fixing $\varepsilon$ at a value in the goldilocks zone, how do the model-weights and IG-based Feature Impact metrics (defined in Section 5.2) in the adversarially trained model compare with the corresponding metrics in a naturally trained model?

  3. If we train the model on natural data, but with $\ell_1$-regularization using a regularization penalty factor $\lambda$, can we achieve a similar effect, i.e. produce model-concentration comparable to adversarial training, and yet maintain AUC within 0.01 of the AUC with $\lambda = 0$?

To answer the above questions, we performed experiments on two kinds of datasets: (a) Two datasets ("mushroom" and "spambase") from the UCI ML data repository Dua:2017 , and (b) Large-scale real-world ad conversion-prediction datasets from MediaMath. Although the UCI datasets are "real" in the sense that they are derived from real domains, their size has been kept relatively small (typically no larger than a few thousand data records) to facilitate benchmarking, and quick testing or demonstration of ideas. The MediaMath datasets on the other hand have millions of records and hundreds of thousands of (sparse) features, and are actually used to train models that determine bids in the real-time bidding system that the company operates (More details are in Section 6.2).

The findings from our experiments, corresponding to the three questions above, are as follows:

  1. For all the datasets studied, there is indeed a "goldilocks zone" of good $\varepsilon$ values for which adversarial training produces significantly more concentrated models with AUC drop (on natural test data) of no more than 0.01 relative to natural training.

  2. Examining feature weights or Feature Impact in adversarially trained models (for a fixed $\varepsilon$ in the goldilocks zone) reveals that these models are significantly more concentrated than their naturally-trained counterparts. There are often cases where features given importance by a naturally-trained model are much less important in an adversarially trained model, and vice versa.

  3. On a MediaMath dataset, natural logistic model training with $\ell_1$-regularization using a regularization penalty factor $\lambda$ achieves some model concentration, but significantly less than adversarial training, and as $\lambda$ is increased beyond 0.2, the AUC degrades rapidly. A similar effect is seen on the UCI datasets: as $\lambda$ is increased, the model concentration improves and AUC (on natural test data) drops, but at the point where AUC is 0.01 below the AUC for $\lambda = 0$, model concentration is much inferior to that produced by adversarial training.

Section 6.1 describes model training and evaluation methodology. Section 6.2 describes the results from experiments on the MediaMath datasets, and the results from the UCI datasets are in Appendix C.

6.1 Model Training and Evaluation Methodology

We describe here the common aspects of the training and evaluation methodology for the UCI and MediaMath datasets. Any variations specific to the datasets are described in the respective subsections. All the tasks we consider are probability prediction tasks as described in Section 2.1, where the prediction target is a binary +1/-1 variable, with +1 indicating a positive example and -1 indicating a negative example. (The specific implementations may actually use a 0/1 label instead, but we keep the -1/1 description here as it simplifies some of the analytical expressions.) Our code is implemented in Python using the high-level TensorFlow Estimators and Dataset APIs.

It is important to note that all categorical variables are 1-hot encoded (as described in Section 2.2) prior to being fed to the model-training and evaluation code. In other words we apply the adversarial perturbation (given by (12)) to the input vector in exploded form. A reasonable question is whether such perturbations are semantically meaningful, and whether they represent legitimate perturbations by an adversary. One could also make the argument that a real adversary would only be able to perturb the original input vector, and so the set of allowed perturbations of the exploded vector should be restricted to legitimate 1-hot encodings. Indeed some authors have considered this type of restriction in the domain of malware detection Al-Dujaili2018-yp. We set aside this issue in this paper, since our interest is more in the model-concentration effect of adversarial training, and less in robustness to real attacks.

Each dataset is divided into train and test subsets. For training on natural examples we use the FTRL optimizer in TensorFlow with the $\ell_1$-regularization strength ($\lambda$) set to 0. (We vary $\lambda$ to evaluate the effect of $\ell_1$-regularization.) We use FTRL mainly because in TensorFlow the FTRL optimizer has an optional argument that controls the strength of $\ell_1$-regularization. As mentioned in Sec. 4, our results are qualitatively similar with other optimizers such as Adam, AdaGrad or simple SGD, but the best results are obtained using the FTRL optimizer. All model weights are initialized to zero in the case of the synthetic and toy UCI datasets, whereas they are initialized using a Gaussian initializer (with mean 0 and variance 0.001) in the case of the MediaMath ad response-prediction models. Our results remain the same whether we use zero or Gaussian initializers. For adversarial training we also use the FTRL optimizer, except that in each mini-batch the examples are perturbed according to the worst-case perturbation given by Eq. 12, as described in the canonical SGD setup in Section 2.6. Some authors train adversarially robust models by first training on natural examples and then training on adversarial examples. In our experiments, however, we find that the initial pre-training on natural examples does not make a difference, at least for the model-concentration effects which we are studying.
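The per-mini-batch perturbation step can be illustrated with a small NumPy sketch. It rests on two assumptions that are not confirmed by the text above: that the worst-case perturbation of Eq. 12 is the standard $\ell_\infty$-bounded form for a linear model with -1/+1 labels, which shifts each input by $\varepsilon$ against the sign of its label times the sign of each weight, and that plain SGD stands in for the FTRL optimizer used in the experiments. The function names are hypothetical.

```python
import numpy as np

def adversarial_batch(w, X, y, eps):
    """Worst-case perturbation of each row of X within an eps-ball (assumed l_inf form).

    Labels y are in {-1, +1}. Dimensions with zero weight are left unperturbed,
    since perturbing them does not change the loss.
    """
    return X - eps * y[:, None] * np.sign(w)[None, :]

def sgd_step(w, X, y, eps, lr):
    """One plain-SGD step on the logistic loss log(1 + exp(-y <w, x>)) over perturbed inputs."""
    X_adv = adversarial_batch(w, X, y, eps)
    margins = y * (X_adv @ w)
    coeff = -y / (1.0 + np.exp(margins))          # d loss / d <w, x> for each example
    grad = (coeff[:, None] * X_adv).mean(axis=0)  # perturbation held fixed for the gradient
    return w - lr * grad

# Illustrative data only.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))
y = np.where(rng.random(8) < 0.5, -1.0, 1.0)
w = np.array([0.3, -0.2, 0.0, 1.0])
eps = 0.1
X_adv = adversarial_batch(w, X, y, eps)
w_next = sgd_step(w, X, y, eps, lr=0.1)
```

Under this form the perturbation lowers every example's margin by exactly $\varepsilon$ times the $\ell_1$ norm of the weights, which is one way to see why it pressures small weights toward zero.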

Once a model is trained (adversarially or naturally) we compute two types of metrics:

  • An ML performance metric, the AUC-ROC (Area Under the ROC Curve) on the held-out natural test dataset.

  • A few feature concentration metrics, defined as follows, where the linear model weight-vector is $w \in \mathbb{R}^d$ (and $d$ is the dimension of the exploded feature-space, i.e. after 1-hot encoding).

    Wts.L1: $\|w\|_1 / \|w\|_\infty$, which is a measure of the overall magnitude of the weights, scaled by the biggest absolute weight. Note that if we multiply all weights by a constant factor, then Wts.L1 does not change.

    Wts.1Pct: The percent of the weights in $w$ whose absolute value is at least 1% of the maximum absolute weight. This can be thought of as a measure of how many feature-weights are "significant", where the threshold of significance is 1% of the biggest absolute weight.

    FI.L1: $\sum_j FI_j[\mathcal{D}] / \max_j FI_j[\mathcal{D}]$, where $FI$ stands for the vector of Feature Impact values (defined by Eq (26)), and $j$ ranges over the dimensions in the original feature-space (i.e. before 1-hot encoding), and the dataset $\mathcal{D}$ is the natural training dataset.

    FI.1Pct: The percent of components of $FI$ (which are all positive by definition) that are at least 1% of the biggest component of $FI$, again over the natural training dataset.
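The two kinds of concentration metric can be sketched with one pair of helper functions, applied either to the weight vector (giving Wts.L1 and Wts.1Pct) or to the vector of FI values (giving FI.L1 and FI.1Pct). The function names and the example vector are hypothetical.

```python
import numpy as np

def l1_concentration(values):
    """Sum of absolute values scaled by the largest absolute value (the '.L1' metrics)."""
    a = np.abs(np.asarray(values, dtype=float))
    return a.sum() / a.max()

def pct_significant(values, threshold=0.01):
    """Percent of entries whose absolute value is at least `threshold` times the maximum."""
    a = np.abs(np.asarray(values, dtype=float))
    return 100.0 * np.mean(a >= threshold * a.max())

# Illustrative weights: two significant entries, one tiny entry, one exact zero.
weights = np.array([2.0, -0.01, 0.5, 0.0])
wts_l1 = l1_concentration(weights)      # (2.0 + 0.01 + 0.5 + 0.0) / 2.0
wts_1pct = pct_significant(weights)     # 2 of 4 entries are >= 0.02
```

Both metrics are invariant to rescaling all entries by a positive constant, so they compare the shape of the weight distribution rather than its overall scale.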

In all our experiments, when we use the closed-form formula in Eq. (23) to compute the FVI (Feature-Value Impact) values (Eq (25)) of the dimensions of the exploded feature-vector, for the baseline input we use the all-zeros vector.

In the various tables of results, we use the abbreviation nat to refer to metrics for the naturally-trained model, and adv to refer to metrics for the adversarially-trained model.

6.2 Experiments with MediaMath Datasets: Ad Conversion Prediction

MediaMath provides a software platform that operates a real-time bidding (RTB) engine which responds to bid-opportunities sent by ad-exchanges. The RTB engine bids on behalf of advertisers who set up ad-campaigns on the platform. A key component in determining bid prices is a prediction of the probability that a consumer exposed to the advertiser’s campaign would subsequently perform a certain designated action (called a "conversion"). MediaMath currently trains a logistic regression model for each campaign to generate these conversion probability predictions. The models are trained on a dataset collected over a number of days, where each record contains various features related to the ad opportunity (such as device type, browser, location, time of day etc), as well as a 0/1 label indicating whether or not a conversion occurred subsequent to ad exposure. The model for each campaign is trained on a sequence of 18 days of data, and validated/tested on the subsequent 3 days of data. The total number of records in each dataset can range from half a million to 50 million depending on the campaign. Each record has around 100 features, mostly categorical, and some (such as "siteID") have cardinalities as high as 100,000, so the dimension of the exploded feature-space (i.e. after 1-hot encoding) is on the order of 400,000. (We use feature-hashing rather than explicit 1-hot encoding to map some of the high-cardinality features to a lower-dimensional vector. The net effect is similar to 1-hot encoding, except that each dimension of the resulting vector may correspond to multiple feature-values, due to hash collisions.)
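The feature-hashing step can be illustrated as follows. This is a generic sketch of the hashing trick, not MediaMath's encoder: the hash function (MD5 truncated to 8 bytes), the bucket count and the record contents are all assumptions chosen for the example.

```python
import hashlib

def hashed_index(feature, value, num_buckets):
    """Map a (feature, value) pair to a bucket index in [0, num_buckets)."""
    key = f"{feature}={value}".encode("utf-8")
    digest = hashlib.md5(key).digest()           # deterministic across runs and machines
    return int.from_bytes(digest[:8], "big") % num_buckets

def hash_record(record, num_buckets):
    """Return the sorted active indices of the hashed sparse vector for one record."""
    return sorted({hashed_index(f, v, num_buckets) for f, v in record.items()})

# Hypothetical record with three categorical features.
record = {"siteID": "s-12345", "browser": "firefox", "dayOfWeek": "tue"}
active = hash_record(record, num_buckets=1024)

# With fewer buckets than distinct values, collisions are unavoidable:
# 5 distinct siteID values cannot occupy more than 4 buckets.
tiny = {hashed_index("siteID", i, 4) for i in range(5)}
```

When two feature-values collide, they share a single weight, so a weight-based or IG-based importance for that dimension describes the colliding values jointly.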

Given the extremely high dimensionality of the exploded feature-space, it is of considerable practical importance to understand which features have a truly significant impact on the predictions. Specifically, we wish to explore whether adversarial training can yield models that have significantly better feature concentration, while maintaining the AUC within, say, 0.01 of the naturally-trained model. We have seen strong evidence that this is indeed possible, both on synthetic datasets (Section 4) and on some UCI datasets (Appendix C). We show below that we see a similar phenomenon in the conversion-prediction models.

To study the impact of adversarial training, we performed experiments with a wide range of values of the adversarial bound $\varepsilon$ and found that for most campaigns, adversarial training with $\varepsilon = 0.001$ or $\varepsilon = 0.01$ results in feature-concentrations significantly better than with natural training, while maintaining AUC (on the validation set) within 0.01 of the AUC of a naturally-trained model. We also experimented with keeping $\varepsilon = 0$ and varying the $\ell_1$-regularization parameter $\lambda$ in the FTRL optimizer, and found that sufficiently large values of $\lambda$ significantly lower the AUC of the resulting model, while lower values do not yield a feature-concentration as strong as that achieved by adversarial training. Indeed we find that the effects of adversarial training and $\ell_1$-regularization are complementary: when an appropriate value of $\lambda$ is used in conjunction with adversarial training, $\ell_1$-regularization helps to "clean up" the very low feature-weights produced by adversarial training by pushing them to zero.

Table 2 shows a summary of results on 9 campaigns (all campaign IDs and feature names are masked for client confidentiality reasons). In some cases the AUC of the adversarially-trained model is better than that of the naturally-trained model. Recall that the Wts.1Pct metric measures what percent of dimensions (in the exploded space, after 1-hot encoding) have absolute weights at least 1 percent of the highest absolute weight. Since most features are categorical, Wts.1Pct is therefore a measure of what percent of feature-values are significant to the model. This metric (as well as Wts.L1) falls drastically with adversarial training in all cases, which indicates that many of the feature-values are simply not relevant to predicting the label. There is thus a potentially massive model-compression that can be done, and this can have benefits in storing, updating and serving models (MediaMath periodically trains around 40,000 models). Table 2 also shows the FI.1Pct and FI.L1 metrics, which are aggregate feature-impact concentration metrics over the natural training dataset. Note that these are at the feature level and not the feature-value level. Since the FI measure of a categorical feature aggregates the FVI metric over all values of this feature, the drop in these metrics (when we go from natural to adversarial training) is not as dramatic as in the case of Wts.L1 or Wts.1Pct (and sometimes they are higher than with natural training).

Campaign  Training                      AUC    Wts.1Pct  Wts.L1  FI.1Pct  FI.L1
285       nat                           0.560  12.93     165.05  0.78     5.80
285       adv ($\varepsilon = 0.01$)    0.556   0.39      13.27  0.37     2.87
479       nat                           0.697  11.23      92.02  3.20     12.52
479       adv ($\varepsilon = 0.001$)   0.694   6.17      66.96  3.13     12.34
622       nat                           0.565  27.11     110.12  6.63     5.73
622       adv ($\varepsilon = 0.01$)    0.561   4.21      18.18  2.87     3.55
594       nat                           0.702  16.91     172.55  0.86     12.54
594       adv ($\varepsilon = 0.001$)   0.702   1.09      19.97  0.77     11.19
473       nat                           0.683  28.02     177.36  3.14     14.69
473       adv ($\varepsilon = 0.001$)   0.673   3.58      55.68  2.84     15.69
070       nat                           0.622  18.53     158.15  4.94     26.21
070       adv ($\varepsilon = 0.001$)   0.625   7.55     107.63  4.46     24.00
645       nat                           0.573  16.26     251.37  2.78     31.78
645       adv ($\varepsilon = 0.01$)    0.627   1.07      34.45  1.12     10.60
733       nat                           0.658  27.35     203.73  4.04     11.36
733       adv ($\varepsilon = 0.001$)   0.667   9.91     108.03  4.13     11.60
735       nat                           0.758  12.20     220.97  1.87     21.82
735       adv ($\varepsilon = 0.01$)    0.765   0.51      21.02  0.75     16.36
Table 2: Comparison of AUC and feature-concentration between natural and adversarial training on 9 advertising campaigns. The 4 concentration metrics are defined in Section 6.1. Note that while the AUC is computed on the natural validation set, the concentration metrics FI.1Pct and FI.L1 are computed on the natural training dataset. In some campaigns, such as 285 and 735, the Wts.1Pct metric improves by a factor of more than 20 (about 33 and 24 respectively).

To illustrate the effect of adversarial training in more detail, we focus on campaign number 735 (the last campaign in Table 2) and compare the results from natural training and adversarial training (with $\varepsilon = 0.01$). Figure 4 compares the Feature Impact (FI) values between these models; Figure 5 compares the feature-weights drop-off curves of these models; and Figure 6 compares the FI drop-off curves.

Figure 4: Comparison of aggregate Feature Impact (FI defined in Eq. (26)) for a naturally-trained model, and an adversarially-trained model with $\varepsilon = 0.01$, on the dataset for Campaign 735. The features are arranged left to right in decreasing order of their FI in the naturally-trained model.
Figure 5: Comparison of drop-off of absolute feature weights in natural and adversarial training (with $\varepsilon = 0.01$), for Campaign 735. In each curve the vertical coordinate of the point corresponding to a percent $p$ equals the absolute weight of the feature at the $p$'th percentile when the weights are arranged in decreasing absolute value. Each curve is truncated when the weight reaches 1% of the highest weight in the respective model. The adversarially trained model has a much steeper weight drop-off, with only 0.5% of weights being above the 1% threshold, compared to 12% with natural training (this is consistent with Table 2).
Figure 6: Comparison of drop-off of aggregate Feature Impact (FI) (computed over the natural training dataset) for a naturally-trained and adversarially-trained model (with $\varepsilon = 0.01$), for Campaign 735. In each curve the vertical coordinate of the point corresponding to a percent $p$ equals the FI of the feature at the $p$'th percentile when the features are arranged in decreasing order of FI. Each curve is truncated when the FI reaches 1% of the highest FI in the respective model. The adversarially trained model has a much steeper FI drop-off, with only 0.75% of features being above the 1% threshold, compared to 1.87% with natural training (this is consistent with Table 2).

Figures 7 and 8 contrast the ability of adversarial training and $\ell_1$-regularization to improve model concentration while maintaining AUC (on natural test data): adversarial training with $\varepsilon = 0.01$ improves the concentration metric Wts.1Pct to as low as 0.5% (compared to 12% for a naturally trained model, an improvement factor of about 24), and yet achieves an AUC slightly higher than with natural training. With $\ell_1$-regularization, on the other hand, the largest strength $\lambda$ that does not degrade the AUC improves the concentration only to 5% (significantly worse than 0.5% for adversarial training) while slightly improving upon the naturally-trained AUC; any higher value of $\lambda$ significantly degrades the AUC, and the Wts.1Pct concentration metric does not go below 2.5%.

Figure 7: Variation of Wts.1Pct and natural test AUC with increasing bound $\varepsilon$ on perturbations in adversarial training, for campaign 735. Note that $\varepsilon = 0$ corresponds to natural training, which results in AUC=0.758. The horizontal green band lower-bounded by AUC=0.748 represents the range of AUCs within 0.01 of the AUC of the naturally-trained model. The blue vertical band represents the range of $\varepsilon$ values (0.01 to 0.03) that are high enough to produce significant model-concentration (i.e. reduction in Wts.1Pct), yet low enough that AUC is maintained within the green band. For these values of $\varepsilon$, the Wts.1Pct metric is around 0.5% or lower, meaning that only about 0.5% of absolute weights are at least 1% of the highest absolute weight.
Figure 8: Variation of Wts.1Pct and natural test AUC for naturally-trained models, with increasing $\ell_1$-regularization parameter $\lambda$, for campaign 735. At the largest $\lambda$ that does not degrade the AUC, the Wts.1Pct concentration metric is as high as 5% (significantly worse than the 0.5% for adversarial training), and any higher value of $\lambda$ significantly degrades the AUC.

7 Conclusion and Future Work

We considered the question of whether adversarial learning can be used as a mechanism to trim logistic models significantly, while maintaining performance (as measured by AUC or accuracy) on natural (unperturbed) test data. From an explainability standpoint, it is highly desirable that models do not heavily weigh features that are irrelevant or marginally influential. We explored this possibility of feature-concentration both theoretically and empirically, in the context of logistic regression models and $\ell_\infty$-bounded adversaries. On the theory side, we derived results showing bounds on the expectation of the weight updates, in terms of the adversarial bound, the feature’s bias, its current weight, and the current overall learning stage of the model. Our results suggest there is often a goldilocks zone of adversarial bounds that are "just right": large enough to weed out irrelevant features, yet small enough to maintain reasonable weights on truly predictive ones and hence not impact model performance on natural test data. The practical implication of the goldilocks zone is that it makes it easy to find a suitable $\varepsilon$, since the desired behavior is not restricted to just one (or a few) "lucky" values of $\varepsilon$.

Our theory both motivates and at least partially explains our experimental studies. We designed a synthetic dataset containing a mix of predictive features and random non-predictive features, and showed that natural learning tends to learn significant weights on the non-predictive features (some of which are higher than that of the predictive features) simply because they show spurious correlations with the label in a finite sample of data. By contrast adversarial training with a large enough bound can weed out these noise features while maintaining weights on the predictive features, hence minimally impacting (if at all) the model’s performance on natural test data.

We demonstrated the feature-pruning effect of adversarial training on two toy UCI datasets and real-world advertising response-prediction datasets from MediaMath. On the latter datasets we showed that adversarial training with perturbation bounds as small as 0.001 or 0.01 can achieve as much as a factor of 20 reduction in the number of "significant" weights (defined as the number of weights whose magnitudes are at least 1% of the maximum magnitude), and yet the models' performance on natural test data is not impacted, and sometimes even improves upon natural training.

We also showed that this effect is not easily replicated with $\ell_1$-regularization. In particular we showed that in the synthetic datasets, one needs to use an unusually large value of the regularization weight $\lambda$, whereas there is no $\lambda$ that achieves a comparable effect in the UCI datasets and real-world MediaMath datasets. As mentioned in the experiments sections, we obtain the best results using the FTRL optimizer, compared to other optimizers such as Adam, AdaGrad or simple SGD. We leave for future work an explanation of why this is the case.

It is worth pointing out that on specific datasets, it may well be possible to reproduce this model-concentration behavior with natural model training, using carefully custom-designed hyper-parameters (such as a learning-rate schedule customized to the dataset). We emphasize, however, that it is simpler to achieve this behavior using adversarial training: it merely involves trying a set of $\varepsilon$ values, without the need to specially customize the other hyper-parameters.

To speed up adversarial training we relied on our closed-form formula for the worst-case adversarial perturbation for logistic models. We quantified feature-concentration in a few different ways, including some that are based on the model weights, and some derived from the Integrated-Gradient (IG) based feature-attribution method. We derived a closed-form formula for the IG-based feature-attribution, for 1-layer neural networks, which we leverage to be able to compute a new metric of aggregate feature importance we introduced, called Feature Impact.

It would be of significant interest to expand our theoretical analysis of adversarial learning, and in particular show a link to accuracy (which is something we did not do, unlike the analysis of Madry2017-zz ). Extending the model-concentration analysis to models with one or more hidden layers would also be interesting. Deriving closed form formulas (or efficient approximations) for adversarial perturbations, as well as feature-attributions, for more complex networks would also help make adversarial training and measurement of aggregate feature impact more practical.

Another direction for exploration is to consider more carefully the notion of an adversarial perturbation in structured datasets (a point that was alluded to in Section 6.1). In image domains, an adversarial perturbation is one that preserves perceptual similarity (from a human observer’s perspective) and yet causes a model to misclassify the example. We have side-stepped this issue in this paper since our primary motivation was to study the model-concentration effects of adversarial learning. However even for this specific purpose, it may be useful to consider other classes of permissible perturbations, such as perturbations that are constrained to be valid inputs when the features are categorical (for example Al-Dujaili2018-yp consider this type of restriction for adversarially robust malware detection). In our experiments, we perturb the 1-hot encoded vector along all dimensions, which will in general result in a vector that is not a valid representation of any input vector (since multiple dimensions corresponding to a single categorical feature may be non-zero). It is possible that perturbations constrained to valid inputs would produce even better results from a model-concentration point of view.