1 Introduction
While deep learning models have been wildly successful in a variety of perceptual (images, audio, video) and text domains, a number of authors have recently highlighted two important concerns with such models:

Adversarial Examples: Many of these models are vulnerable to adversarial attacks: it is possible for an adversary to perturb the inputs in such a way that humans perceive no significant change (and in particular they would assign the same "label" as the original example), and yet the model’s label or prediction can be markedly different Szegedy2013tx ; Goodfellow2014cy . Adversarial training has recently been proposed as a method for training models that are robust to such attacks, by applying techniques from the area of Robust Optimization Papernot2015pg ; Madry2017zz ; Sinha2018mj .

Feature Attribution: Especially for complex models, but even for simple linear models, it is often not clear how to quantify the impact of an input feature on the model’s final output. Such a quantification is crucial in order to understand which features are relevant (and irrelevant or marginally relevant) in determining a model’s behavior on specific inputs, or to understand which features are important in aggregate. This understanding can highlight a model’s strengths or weaknesses and can aid in improving the models or help instill trust in the models (especially in domains such as healthcare and finance). A few different feature-attribution techniques have been proposed recently Leino2018ga ; Ancona2017tw , and Guidotti2018zk ; Ras2018sl are two recent surveys of various explanation methods for blackbox models. In particular, Sundararajan2017rz formulate a set of axioms that any good feature-attribution method should satisfy, and identify a specific method, which they call Integrated Gradients (IG), that satisfies all of these axioms.
Much of the recent work on Adversarial Machine Learning (AML) has focused on perceptual domains or text, and "structured" datasets have been largely ignored. (For all practical purposes, structured datasets can be thought of as tabular datasets such as those arising from healthcare, advertising, or a variety of other application areas, where the values in the table columns are numerical or categorical). Moreover, AML research has been primarily concerned with producing adversarially robust models, and the conventional wisdom is that adversarial robustness comes at the cost of (often significant) loss in classification performance on natural examples
Kurakin2016lf ; Tsipras2018fu ; Kannan2018fl . For example, in Tsipras2018fo the authors create a synthetic dataset where they demonstrate a tradeoff between standard accuracy (i.e. accuracy on natural unperturbed examples) and adversarial accuracy (i.e. accuracy on adversarially perturbed examples). In particular, on their synthetic dataset, they show that as standard accuracy approaches 100%, adversarial accuracy falls to zero. Other tradeoffs have also been considered in the literature, such as the need for more training data to achieve adversarial robustness Schmidt2018rb .
The emphasis in this paper is different in a few ways. We explore adversarial training, not in perceptual or text domains, but in structured datasets. Our primary interest in this paper is not adversarial robustness or adversarial accuracy, but rather an intriguing side-benefit of adversarial training:
Adversarial training (with bounded perturbations) can be used to produce models significantly more concentrated than their naturally-trained counterparts, with minimal or no impact on standard accuracy (i.e. the performance on natural data), particularly in logistic probability-prediction models applied to structured datasets. Unlike regularization techniques such as $\ell_1$ or $\ell_2$ regularization (which do not guarantee adversarial robustness, as discussed below), adversarial training offers adversarial robustness and in addition produces more concentrated models.
We use the term feature-concentration (or "model compression", or "model concentration") informally to refer to the extent to which the learned model differentiates between "relevant" and "less relevant" or "irrelevant" features. (See below for more on measuring feature concentration). An aspect of adversarial training that we are especially interested in is its ability to weed out (the weights of) non-predictive features that exhibit spurious correlation with the label on finite datasets. This is clearly important from an explainability and feature-attribution viewpoint: we would like our trained model to not place significant weights on such irrelevant features.
We should point out that there has been prior work on the regularization benefits of adversarial training Xu2009gz ; Szegedy2013tx ; Goodfellow2014cy ; Shaham2015lg ; Sankaranarayanan2017vy ; Tanay2018ANA , primarily in image-classification applications: when a model is adversarially trained, its classification accuracy on natural (i.e. unperturbed) test data can improve. All of this prior work has focused on the performance-improvement (on natural test data) aspect of regularization, but none have examined the feature-pruning benefits explicitly. In contrast to this work, our primary interest is in the explainability benefits of adversarial training, and specifically the ability of adversarial training to significantly improve feature-concentration while maintaining (and often improving) performance on natural test data. Although our theoretical analysis is generally applicable to logistic regression models applied to any type of data, our experimental results are on structured datasets. We focus on structured datasets here because logistic regression models (and variants such as Poisson regression) have been successfully used in a variety of domains involving structured datasets, such as advertising Brendan_McMahan2013uw ; Chapelle2014yn , healthcare Sperandei2014ai ; Anderson2003tz ; Tolles_undatedaw ; El_Emam2013qd , and finance Emfevid2018hs ; Mironiuc2013wu ; Bolton2009yx ; Dong2010zn . In all of these domains, it is crucial to understand which input features are truly important in determining the model’s output (either on a single data point or in aggregate across a dataset), and our aim is to show that adversarially-trained logistic regression models yield significantly better feature-attribution by shrinking the weights on irrelevant or weakly-relevant features to negligible levels, with minimal impact on (and often improvement in) performance on natural test data.
It might be argued that traditional $\ell_1$ regularization can potentially also provide a similar explainability benefit by shrinking the weights of irrelevant features. This does seem to hold for our synthetic dataset (Sec. 4) and the toy UCI datasets (Sec. C) with simple optimizers such as pure SGD (Stochastic Gradient Descent). However practitioners typically use more sophisticated optimizers such as FTRL Brendan_McMahan2013uw to achieve better model performance on test data, and with this optimizer (especially on the real-world advertising data, as described in Sec. 6.2), the model-concentration produced by adversarial training is significantly better than with $\ell_1$ regularization (and $\ell_2$ regularization is even worse than $\ell_1$ regularization in this respect). Moreover, $\ell_1$ or $\ell_2$ regularization will not necessarily yield the robustness guarantees provided by adversarial training Goodfellow2014cy . In other words, adversarial training simultaneously provides robustness to adversarial attacks as well as explainability benefits, while maintaining or improving model performance on natural test data. These advantages make adversarial training especially compelling for logistic regression models since there is a simple closed-form formula for the adversarial perturbation in this case (as shown in Lemma 3.1 and also pointed out in Goodfellow2014cy ), and so from a computational standpoint it is no more demanding than $\ell_1$ or $\ell_2$ regularization.
Since our results show that adversarial training can effectively shrink the weights of irrelevant or weakly-relevant features (while preserving weights on relevant features), a legitimate counter-proposal might be that one could weed out such features beforehand via a preprocessing step where features with negligible label-correlations are "removed" from the training process. Besides the fact that this scheme has no guarantees whatsoever with regard to adversarial robustness, there are some practical reasons why correlation-based feature selection is not as effective as adversarial training in producing pruned models: (a) With adversarial training, one needs to simply try different values of the adversarial strength parameter $\epsilon$ and find a level where accuracy (or another metric such as AUC, see Sec 2.7) is not impacted much but model weights are significantly more concentrated; on the other hand, with the correlation-based feature-pruning method, one needs to set up an iterative loop with gradually increasing correlation thresholds, and each time the input preprocessing pipeline needs to be re-executed with a reduced set of features. (b) When there are categorical features with large cardinalities, where just some of the categorical values have negligible feature-correlations, it is not even clear how one can "remove" these specific feature-values, since the feature itself must still be used; at the very least it would require a re-encoding of the categorical feature each time a subset of its values is "dropped" (for example if a one-hot encoding or hashing scheme is used). Thus correlation-based feature-pruning is a much more cumbersome and inefficient process compared to adversarial training.
1.1 Description of Results
It is now standard Madry2017zz to view adversarial training as a process similar to the usual model-training where a certain loss function is minimized, except that each $d$-dimensional input instance $x$ is altered by an adversary by adding a $d$-dimensional perturbation $\delta$ to $x$, where the adversary chooses $\delta$ in the "worst" possible way given the current model state and certain constraints (a formal description appears in Section 3.1). The constraint on the adversary is typically in the form of a bound $\epsilon$ on the $\ell_\infty$ norm of $\delta$. For brevity we will refer to such an adversary as an $\ell_\infty(\epsilon)$-adversary. The key difficulty in scaling up adversarial training to large datasets is the computation of this worst-case $\delta$ at each input sample point. For a general model there is often no choice but to apply an iterative optimization procedure, such as Projected Gradient Descent (PGD), at each sample point to compute the worst-case perturbation $\delta$. However for linear models (such as logistic or Poisson) we derive a simple closed-form expression (Lemma 3.1 in Section 3.1) for $\delta$ under $\ell_2$-norm and $\ell_\infty$-norm constraints. This result makes it practical to incorporate adversarial training into a learning system based on logistic or other linear models.
Theorem 3.1 in Section 3.2 is our main theoretical result that sheds light on the feature-concentration effect of adversarial training. This result gives bounds on the expectation of the logistic weight updates when a Stochastic Gradient Descent (SGD) algorithm is applied to data points from an idealized data-generation process, that are subjected to the worst-case adversarial perturbation according to the closed-form expression mentioned above. These bounds show a clear link between the expected SGD update of a feature’s weight, and the relative magnitudes of $\epsilon$ and the (absolute) label-correlation of the feature (i.e. the correlation between the feature and the label). Roughly speaking, for features whose absolute label-correlations are less than $\epsilon$, the SGD updates on average tend to shrink their weights toward zero, at a rate proportional to the difference between $\epsilon$ and the label-correlation. On the other hand, for features whose absolute label-correlations are sufficiently larger than $\epsilon$, the SGD updates on average tend to expand their weights in their current direction.
This suggests that adversarial training with an appropriate value of $\epsilon$ acts as a relevance filter that weeds out the weights of irrelevant or weakly-relevant features, while preserving weights of relevant features, such that standard accuracy is only minimally sacrificed. In fact if there is a large gap between the absolute label-correlations of "relevant" features and those of "irrelevant" features, then this result suggests the existence of a "goldilocks zone" of $\epsilon$ values that are "just right": large enough to suppress irrelevant or weakly-relevant features, and yet small enough to preserve relevant features such that standard accuracy is not impacted significantly. This is precisely the phenomenon we observe in several experimental studies on synthetic, toy, and large-scale real-world datasets, as described below. A more detailed description of the implications of Theorem 3.1 appears in Section 3.3.
Inspired by the synthetic dataset of Tsipras2018fo , we define a synthetic dataset (Section 4) expressly designed to dramatically highlight the contrast between adversarial training and natural training in their ability to ignore irrelevant features. In contrast to the theoretical analysis of Tsipras2018fo , we conduct SGD-based adversarial training experiments on this dataset. Our synthetic dataset contains a mix of predictive and purely random non-predictive features, and natural training tends to learn significant weights on the non-predictive features (with the weights sometimes being bigger than those of the predictive features), whereas adversarial training acts as an efficient sieve that weeds out the non-predictive features by pushing their weights close to zero, while still leaving the predictive features with significant weights, hence only minimally impacting (and sometimes even improving) standard accuracy.
How do we quantify feature concentration? One possibility, applicable to linear models, is based on the weights of the features (see the measures Wts.L1 and Wts.1Pct defined in Section 6.1). We also consider concentration measures using the more principled Integrated Gradients (IG)-based approach of Sundararajan2017rz . Specifically, we compute each feature’s "importance" using the IG methodology, and then measure "concentration" based on how this measure varies across features (see Section 6.1 for details).
The core of the IG method is an expression for the contribution of a given feature-dimension $i$ to a specific prediction $F(x)$ on an input sample $x$. This expression involves a line integral over gradients computed along the straight-line path from a certain baseline input $u$ to the actual input $x$. Computing this integral in general requires an approximation based on gradients computed at a number of equally spaced points on this straight-line path. This is reasonably efficient to compute for a single data point and single dimension $i$. However in this paper we are interested in repeatedly computing this quantity across all dimensions (which could number in the hundreds of thousands when there are high-cardinality categorical features), over an entire dataset (which could contain millions of records). Fortunately, we are able to show a closed-form formula for the IG-based attribution for 1-layer neural networks with an arbitrary differentiable activation function (Lemma 5.1 in Section 5.1).
While the IG methodology of Sundararajan2017rz focuses on feature-attribution for a single sample point, for our purposes we also want to quantify the overall importance of a feature in a model, aggregated over a specific dataset. In Section 5.2 we introduce a simple, intuitive methodology to compute such an aggregate feature-importance metric, which we call the Feature Impact.
In Sections C and 6.2 we describe our experiments on toy datasets from the UCI repository Dua:2017 , and on large-scale real-world advertising datasets from MediaMath, respectively. In these experiments we compare results from natural training and adversarial training where the adversary’s perturbations $\delta$ are constrained such that $\|\delta\|_\infty \le \epsilon$ for some positive $\epsilon$. The common theme that emerges from these experiments is that there is almost always a "goldilocks zone" for $\epsilon$: large enough to cause significant feature-concentration, yet small enough not to damage performance (as measured by AUC or accuracy) on natural test data. We also study whether this type of model-concentration/performance-preservation effect can be achieved using traditional $\ell_1$ regularization. We find that on the toy UCI or synthetic datasets we are forced to use a very large regularization parameter $\lambda$ (more than 200, for example) to achieve this effect, at least with the FTRL optimizer, whereas on the real-world MediaMath datasets, this effect cannot be achieved at all with $\ell_1$ regularization: when $\lambda$ is large enough to produce a model concentration comparable to adversarial training, the AUC is much worse, while a smaller $\lambda$ that preserves AUC produces a much inferior model concentration.
To summarize, our main contributions are:

A novel theoretical analysis (Theorem 3.1) of the expectation of SGD updates of logistic regression model weights during adversarial training (on data points from an idealized random data-generation process), which shows that adversarial learning acts as an aggressive relevance filter: on average the weight of a feature shrinks or expands depending on whether its absolute label-correlation is smaller than $\epsilon$ or (sufficiently) larger than $\epsilon$, respectively. One implication of the result is that when there is a large gap between the absolute label-correlations of relevant and irrelevant (or weakly-relevant) features, then $\epsilon$ values in this gap region constitute a goldilocks zone for model-concentration: in this zone $\epsilon$ is "just right": large enough to shrink weights of irrelevant features, and small enough to preserve or expand weights of relevant features, thus only minimally impacting standard accuracy (or AUC-ROC).

Experimental results (Sections 4, C, 6.2) showing that $\ell_\infty(\epsilon)$ adversarial training of logistic regression models (on synthetic data, UCI data, and large-scale real-world advertising datasets from MediaMath) exhibits the above relevance-filtering phenomenon, with a goldilocks zone of $\epsilon$ values. When there are a large number of low-relevance features, this aggressive relevance-filtering results in models that are significantly more concentrated than their naturally-trained counterparts, without significantly sacrificing standard accuracy. In particular on some of the MediaMath datasets we show that with adversarial training at a suitable $\epsilon$, the number of significant absolute weights (i.e. those within 1% of the largest absolute weight) drops by a factor of 20, while the AUC-ROC on natural test data is nearly the same as with $\epsilon = 0$ (i.e. natural training). We also show that traditional regularization cannot be used to achieve this type of model-concentration while preserving standard accuracy.

A closed-form formula (Lemma 5.1) for the Integrated Gradients-based Feature Attribution of Sundararajan2017rz for 1-layer neural networks with an arbitrary differentiable activation function (of which logistic models are a special case), and the use of this formula to efficiently compute new measures of feature-importance and model-concentration on a dataset.
2 Background, Definitions and Notation
For ease of reference, in this section we collect some basic terminology, concepts and known results pertaining to Machine Learning, Neural Network models, Adversarial Training, and Feature Attribution.
2.1 Machine Learning Datasets and Binary ProbabilityPrediction
We are interested in machine-learning tasks in domains where the primary emphasis is on developing models that predict the probability of an event. Such models are useful in a wide variety of areas, for instance in advertising to predict response-rates from ad exposure Brendan_McMahan2013uw ; Chapelle2014yn , and in healthcare to predict the probability of adverse events Zhao2015mk . In advertising, the probability is directly used to determine the best price to bid for an ad opportunity in a real-time bidding system. This is in contrast to domains considered in much of the adversarial learning literature, where the emphasis is primarily on the classification produced by the model.
In this paper we focus on structured datasets. For our purposes a structured dataset is one which can be thought of as a table, where the values in the columns can be numerical or categorical. We assume categorical features are "1-hot encoded" (as described in the next subsection), at least implicitly if not explicitly.
A training dataset $D$ consists of example-label pairs $(x, y)$, where $x$ is a $d$-dimensional example vector and $y \in \{-1, 1\}$ is the corresponding binary label (for mathematical convenience we use -1/1 as the labels in the theoretical results, whereas in the experiments we often use 0/1 labels, but these are trivially transformed to -1/1 and the same results apply), where $y = 1$ indicates the occurrence of an event of interest. When a model is trained on dataset $D$, the output $F(x)$ of the model for an input $x$ is interpreted as the predicted probability that $y = 1$. Throughout the paper, it is assumed that the input vector $x$ represents the "exploded" feature vector where each categorical feature in the original example has been 1-hot encoded, as described in the next subsection.
2.2 One-hot Encoding of Categorical Features
In ad conversion prediction and other domains involving structured datasets, several (or even most) features can be categorical. We introduce some notation and terminology to simplify the description of some of our results involving categorical features in Section 5.2.
An example vector in original form (i.e. prior to one-hot encoding of categorical features) is denoted $x = (x_1, \ldots, x_k)$, and $x_j$ refers to the $j$'th feature (which could be numerical or categorical). If feature $x_j$ is categorical, then $V_j$ denotes the set of possible values of the feature, i.e. $V_j$ represents the "vocabulary" of this feature. The cardinality of feature $x_j$ is $m_j = |V_j|$. We assume that $x_j$ is in index form, i.e. regardless of what the actual feature values are, we assume they are represented as indices in $\{0, 1, \ldots, m_j - 1\}$. For example if $x_j$ is the dayOfWeek feature, it has the 7 possible values $\{0, 1, \ldots, 6\}$, and its cardinality is 7. Sometimes cardinalities can be much larger. For example in the advertising datasets we study in our experiments, a website URL is encoded as an integer siteID, which has a very large cardinality.
For a categorical feature $x_j$, its 1-hot representation is denoted $e(x_j)$ and is obtained by starting with an $m_j$-dimensional vector of all zeros, and setting dimension $v$ to 1, where $v$ is the value of $x_j$. For example if $x_j = 2$ and its cardinality is $m_j = 4$, then $e(x_j)$ is the vector $(0, 0, 1, 0)$. For notational convenience, for a numerical feature $x_j$, we let $e(x_j)$ denote its "one-hot encoding", which is identical to $x_j$ itself. For a feature-vector (in original form) $x$, we let $e(x)$ represent the vector resulting from concatenating the 1-hot encodings of each $x_j$ in the natural way:
(1) $e(x) = \big(\, e(x_1),\, e(x_2),\, \ldots,\, e(x_k)\, \big)$
We then say that $e(x)$ is in exploded form. For example suppose $x = (x_1, x_2, x_3)$ is a 3-dimensional feature vector in original form, where $x_1$ is categorical with cardinality 4, $x_2$ is numerical, and $x_3$ is categorical with cardinality 3. If $x = (2, 0.7, 1)$, then the exploded form after 1-hot transformation is:
(2) $e(x) = (0, 0, 1, 0,\; 0.7,\; 0, 1, 0)$
We use $S_j$ to denote the (contiguous) set of dimensions corresponding to the original feature $x_j$, in the exploded vector $e(x)$. In the above example $S_1 = \{1, 2, 3, 4\}$, $S_2 = \{5\}$, and $S_3 = \{6, 7, 8\}$.
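The explosion described above can be sketched in a few lines of Python; the function name `explode` and the 0-based index convention are ours, and the feature specs are illustrative:

```python
def explode(x, cardinalities):
    """Concatenate 1-hot encodings of a mixed feature vector (as in Eq. (1)).

    x: feature values; categorical values are given as 0-based indices,
       numerical values as floats.
    cardinalities: per-feature vocabulary size, or None for a numerical
       feature (which is passed through unchanged).
    """
    out = []
    for value, card in zip(x, cardinalities):
        if card is None:
            # numerical feature: its "one-hot encoding" is the value itself
            out.append(float(value))
        else:
            # categorical feature: all-zeros vector with a single 1
            onehot = [0.0] * card
            onehot[int(value)] = 1.0
            out.extend(onehot)
    return out

# x1 categorical (cardinality 4), x2 numerical, x3 categorical (cardinality 3)
exploded = explode([2, 0.7, 1], [4, None, 3])
```

The length of the exploded vector is the sum of the categorical cardinalities plus the number of numerical features.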
2.3 1Layer Neural Networks
A 1-layer neural network is defined by learnable parameters $w \in \mathbb{R}^d$ (the weight vector) and $b \in \mathbb{R}$ (the bias), and a differentiable scalar activation function $g$, and computes the following function for any input feature-vector $x$ (in exploded space):
(3) $F(x) = g(w \cdot x + b)$
where $w \cdot x$ denotes the dot product of $w$ and $x$. Although in all of our experiments we train (logistic) models that include the bias $b$, we do not include the bias (also called the "intercept") in most of our results; this is without loss of generality since the effect of the bias term can be captured by setting one of the dimensions of the input vector to 1.0, and the presence of the bias term adds notational clutter but does not meaningfully alter our results.
Two commonly used models fall within the class of 1-layer networks: Logistic regression models, where the activation function is the sigmoid (or logistic) function $\sigma(z) = 1/(1 + e^{-z})$, and Poisson regression models, where the activation is $g(z) = e^z$. In the case of a logistic model, the value of $F(x)$ will lie in $(0, 1)$ and can be interpreted as a probability prediction. Logistic and Poisson models are members of the more general class of Generalized Linear Models mccullagh1989generalized .
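As a concrete sketch (function names are ours), a 1-layer network reduces to a dot product followed by the activation; with the sigmoid activation this is exactly a logistic model:

```python
import math

def sigmoid(z):
    """Logistic activation: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x, activation=sigmoid):
    """1-layer network F(x) = g(w.x + b), with g = sigmoid by default."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return activation(z)
```

Passing `activation=math.exp` instead gives the Poisson-regression variant mentioned above.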
2.4 Natural Training of Logistic Models
Logistic models are typically trained by minimizing the empirical risk, expressed as the expected Negative Log Likelihood (NLL) loss:
(4) $\min_{w}\; \mathbb{E}_{(x,y) \sim D}\, \mathcal{L}(x, y; w)$
where the NLL loss is given by:
(5) $\mathcal{L}(x, y; w) = \log\big(1 + \exp(-y\, w \cdot x)\big)$
Note that the prediction $F(x)$ is a function of $x$ and $w$, so the loss on a given example is a function of $(x, y)$ and $w$.
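A minimal sketch of this loss, assuming labels in {-1, +1} (the function name is ours):

```python
import math

def nll_loss(w, x, y):
    """Logistic NLL: log(1 + exp(-y * w.x)), with labels y in {-1, +1}.

    This equals -log(sigmoid(y * w.x)), i.e. the negative log-likelihood
    of the observed label under the logistic model.
    """
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    # log1p is used for numerical stability when the margin is large
    return math.log1p(math.exp(-margin))
```

At $w = 0$ the loss is $\log 2$ regardless of the label, and it decreases monotonically as the margin $y\, w \cdot x$ grows.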
2.5 Adversarial Training of Logistic Models
While Empirical Risk Minimization (ERM) has been successfully used in a variety of domains to yield classifiers that perform well on examples drawn from the natural data distribution $D$, such classifiers have repeatedly been shown to be vulnerable to adversarially crafted examples (see Madry2017zz ; Szegedy2013tx ; Yuan2017vz and references therein). As a result there has been a growing interest in adversarial training, i.e. training models that are robust to adversarial examples. As mentioned in Section 1, our primary interest in adversarial training is not adversarial robustness but rather the resulting improved model-concentration (and hence improved model-explainability).
Following Madry2017zz , for our logistic models and NLL loss, we formulate adversarial learning as the problem of minimizing the expected adversarial loss:
(6) $\min_{w}\; \mathbb{E}_{(x,y) \sim D}\Big[\max_{\delta \in S}\, \mathcal{L}(x + \delta, y; w)\Big]$
where $S$ is the set of possible perturbations to the input $x$ that the adversary is allowed to make.
It is common practice to take $S$ to be the set of $\ell_\infty$-bounded perturbations. For brevity we will use the phrase "$\ell_\infty(\epsilon)$-adversary" to refer to an adversary who is allowed to choose perturbations $\delta$ such that $\|\delta\|_\infty \le \epsilon$, and we refer to such perturbations as $\ell_\infty(\epsilon)$-perturbations. In the rest of the paper, whenever we use the term "adversarial" perturbation, it should be understood that we are referring to the worst-case perturbation by such an adversary, for the loss function under consideration (i.e. this perturbation achieves the inner max in Eq. (6)).
Thus minimizing expected adversarial loss results in a min-max (or saddle-point) optimization problem which is computationally demanding for general loss functions (since for each example one must find the worst-case perturbation that maximizes the loss on that example). Fortunately, for logistic models we show in Section 3.1 that the worst-case $\delta$ that achieves the inner maximization has a simple closed-form expression, when adversarial perturbations are bounded under the $\ell_2$ and $\ell_\infty$ norms.
2.6 Stochastic Gradient Descent (SGD) in Adversarial Training
Whether minimizing the expected natural loss (4) or the adversarial loss (6), it is standard to use variants of Stochastic Gradient Descent (SGD). In this paper we assume the following canonical SGD setup for adversarial training of logistic models with an $\ell_\infty(\epsilon)$ adversary (for $\epsilon = 0$ this reduces to standard SGD on natural examples).
The weight vector $w$ is initialized to all zeros. After each example $(x, y)$ is encountered (where $x$ is in exploded form, i.e. after 1-hot encoding all categorical features),

The logistic NLL loss is computed using Eq. (5),

Each component $w_i$ of $w$ is updated according to the gradient of the NLL loss w.r.t. $w_i$:
(7) $w_i \leftarrow w_i - \alpha\, \dfrac{\partial \mathcal{L}(x, y; w)}{\partial w_i}$, where $\alpha$ is the learning rate.
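For the logistic NLL the gradient has a simple closed form, $\partial \mathcal{L}/\partial w_i = -y\, x_i\, \sigma(-y\, w \cdot x)$, obtained by differentiating $\log(1 + \exp(-y\, w \cdot x))$. A sketch of a single SGD step (function names ours):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sgd_step(w, x, y, lr):
    """One SGD update on the logistic NLL loss, with labels y in {-1, +1}.

    The gradient of log(1 + exp(-y w.x)) w.r.t. w_i is
    -y * x_i * sigmoid(-y * w.x), so the descent step adds
    lr * y * x_i * sigmoid(-y * w.x) to each weight.
    """
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    g = sigmoid(-margin)  # common scalar factor of the gradient
    return [wi + lr * y * xi * g for wi, xi in zip(w, x)]
```

In the adversarial-training variant, the same step would be applied to the perturbed example $x + \delta^*$ rather than $x$.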
2.7 Model Performance Measures: Accuracy, ROC-AUC
In this paper we are mainly interested in probability-prediction models, and so we use the Area Under the ROC Curve (AUC-ROC, or AUC in brief) Fawcett2006ws as our primary measure of model performance. On the synthetic datasets, we also use accuracy to measure model performance, and there it should be understood that we treat any prediction above (below) 0.5 as a "positive" ("negative") classification; accuracy is calculated as the fraction of correctly classified examples in the test dataset.
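For reference, the 0.5-threshold accuracy just described can be computed as follows (function name ours, labels assumed in {-1, +1}):

```python
def accuracy(probs, labels):
    """Fraction of correct classifications at the 0.5 threshold.

    probs: predicted P(y = 1) for each example; labels: true labels
    in {-1, +1}. A prediction above 0.5 is treated as "positive".
    """
    correct = sum((p > 0.5) == (y == 1) for p, y in zip(probs, labels))
    return correct / len(labels)
```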
2.8 Feature Attribution in Neural Networks using Integrated Gradients
The main thrust of this paper is to demonstrate the ability of adversarial training to produce models that are much more concentrated than naturally-trained ones, without sacrificing accuracy on natural test data. Model concentration can be measured in terms of the model weights (for example by computing the $\ell_1$ norm of the weights), which is reasonable if there are only numerical features of the same scale, and the model is a 1-layer neural network. However in general the weight of a feature, even in a 1-layer neural network, is not necessarily a good measure of its importance to a model: for instance a numerical feature may have a relatively large weight, but on a typical dataset its absolute value may be much smaller than that of other numerical features; or, in the case of a categorical feature-value, it may occur very infrequently (and hence its overall importance on a dataset may be relatively small). Therefore a more principled approach to measuring the "importance" of a feature in a model is needed.
Feature attribution refers to the general area of understanding how the output of a neural network is impacted by its input features, and this is important in a variety of contexts. For a neural network that computes a function , some examples of relevant questions are:

On a specific input $x$, which features of $x$ are "most responsible" for the value of the output $F(x)$, relative to $F(u)$ for some baseline input $u$ (such as a black image for image models, or a zero vector for logistic prediction models)? Such an understanding can aid in debugging or improving a model’s performance. Sample-level attribution can also be used as a rationale for a specific output. Such explanations can help the end-user (e.g. a doctor) understand the strengths and weaknesses of the model.

In aggregate over some dataset, what are the relative importances of the features in determining the network output? If there are categorical features, then this question can be asked either at the level of feature-values (i.e. the individual values of each categorical feature, such as the different possible values of the "country" feature), or features (i.e. each categorical feature in aggregate, such as "country"). Understanding aggregate-level feature importance can help prune features, or identify bugs where a feature that is expected to be important is not turning out to be important (as measured by the specific attribution method), or vice versa.
The key to a useful feature-attribution is a sound methodology which does not have quirks that obscure the impact of features on the model output. In general it is difficult to evaluate an attribution method, so in a recent paper Sundararajan2017rz the authors identify several axioms that any sound attribution method must satisfy, and in particular propose a specific method that satisfies these axioms, which they call Integrated Gradients (IG). We adopt this IG method in this paper, and it works as follows. Suppose $F$ represents the function computed by a neural network. Let $x$ be a specific input, and $u$ be the baseline input. The IG is defined as the path integral of the gradients along the straight-line path from the baseline $u$ to the input $x$. The IG along the $i$'th dimension for an input $x$ and baseline $u$ is defined as:
(8) $\mathrm{IG}_i(x, u) = (x_i - u_i) \displaystyle\int_{0}^{1} \frac{\partial F\big(u + t\,(x - u)\big)}{\partial x_i}\, dt$
where $\frac{\partial F(z)}{\partial x_i}$ denotes the gradient of $F$ along the $i$'th dimension, at $z$.
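The line integral above is typically approximated by a Riemann sum over the straight-line path. The sketch below (function names ours) uses the midpoint rule and assumes the caller supplies a gradient function:

```python
def integrated_gradients(grad_f, x, baseline, steps=100):
    """Midpoint-rule approximation of the IG line integral.

    grad_f(z) must return the gradient of the network output F at point z;
    the attribution along dimension i is (x_i - u_i) times the average
    gradient over the straight-line path from the baseline u to x.
    """
    d = len(x)
    avg_grad = [0.0] * d
    for k in range(steps):
        t = (k + 0.5) / steps  # midpoint of the k-th slice of the path
        z = [u + t * (xi - u) for u, xi in zip(baseline, x)]
        g = grad_f(z)
        for i in range(d):
            avg_grad[i] += g[i] / steps
    return [(xi - u) * ag for xi, u, ag in zip(x, baseline, avg_grad)]
```

For a linear $F$ the gradient is constant along the path, so the approximation is exact and the attributions sum to $F(x) - F(u)$ (the completeness property).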
3 Analysis of SGD Updates in Adversarial Training of Logistic Models
Our aim is to understand the nature of the solution to the adversarial learning problem (6) for logistic regression models. The biggest difficulty in this optimization problem is the inner maximization, which requires finding the worst-case adversarial perturbation $\delta^*$ on each input $x$. In general one needs to run a separate optimization procedure (such as Projected Gradient Descent Madry2017zz ) to find this $\delta^*$, but fortunately for logistic regression models there is a simple closed-form expression for $\delta^*$, which we show in Lemma 3.1.
While the closed-form expression for δ makes the adversarial training of logistic regression models computationally as simple as natural training, gaining analytical insight into the nature of the optimum of (6) is still difficult, especially since there is no known closed-form solution for this optimization problem. Rather than analyzing the final optimum of (6), we instead analyze how an SGD-based optimizer updates the model weights under adversarial perturbations. We assume the standard SGD setup for adversarial training described in Section 2.6.
As a first step toward analyzing the SGD-based weight-updates, we show in Proposition 3.1 a simple expression for the gradient of the logistic NLL loss (5) on an example perturbed by an adversary. Analyzing SGD-based weight updates is still challenging since SGD is a stateful, sequential process, so in Section 3.2 we define an idealized data-generation process (which we call the Biased Coins Process, or BCP) and instead analyze the expectations of the SGD updates on perturbed data points drawn from the BCP. This leads to our main theoretical result, Theorem 3.1.
3.1 Adversarial Perturbations for Logistic Models
The following Lemma gives a closed-form formula for the worst-case perturbations δ under the logistic NLL loss (5), for the ℓ2 and ℓ∞ norm-bounds.
Lemma 3.1 (Adversarial perturbations for logistic models)
For a fixed weight vector w ∈ R^d, input x ∈ R^d, label y ∈ {−1, +1}, and ε > 0, for p ∈ {2, ∞} define Δ_p(ε) as the set of perturbations δ whose ℓ_p norm is bounded by ε:

Δ_p(ε) ::= {δ ∈ R^d : ‖δ‖_p ≤ ε},   (9)

and define the adversarial δ*_p as:

δ*_p ::= argmax_{δ ∈ Δ_p(ε)} L(w; x + δ, y),   (10)

where L is the logistic NLL loss defined in (5). Then

δ*_2 = −ε y w / ‖w‖₂,   (11)

δ*_∞ = −ε y sgn(w).   (12)
Proof:
Consider the case y = 1 (the case y = −1 is analogous). In this case the NLL simplifies to log(1 + exp(−⟨w, x + δ⟩)), which is a monotonically decreasing function of ⟨w, x + δ⟩. Hence maximizing the loss is equivalent to minimizing ⟨w, δ⟩. When the ℓ2 norm of δ is bounded by ε, ⟨w, δ⟩ is minimized when δ has norm ε and points in the direction opposite to w, which implies the first result by noting that w/‖w‖₂ is the unit vector in the direction of w. When the ℓ∞ norm of δ is bounded by ε, the lowest value of ⟨w, δ⟩ is achieved when each component of δ has magnitude ε but sign opposite that of the corresponding component of w, which implies the second result.
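The closed-form perturbations of Lemma 3.1 are simple to compute. The following sketch (in NumPy, writing the logistic NLL as log(1 + exp(−y·⟨w, x⟩)) with y ∈ {−1, +1}) illustrates both cases; the function names are ours, not the paper's:

```python
import numpy as np

def adv_delta_l2(w, y, eps):
    """Worst-case perturbation under an L2 bound (cf. Eq. (11)):
    move by eps against the signed weight direction y * w."""
    w = np.asarray(w, float)
    return -eps * y * w / np.linalg.norm(w)

def adv_delta_linf(w, y, eps):
    """Worst-case perturbation under an L-infinity bound (cf. Eq. (12)):
    push every coordinate by eps against the sign of y * w_i."""
    w = np.asarray(w, float)
    return -eps * y * np.sign(w)
```

Either perturbation can be checked numerically: the logistic NLL at x + δ* is at least as large as at x + δ for any other feasible δ.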
The following Proposition gives an expression for the gradient of the logistic NLL loss at the perturbed input x + δ*_∞, where δ*_∞ is the perturbation (12):

∇_w L(w; x + δ*_∞, y) = (−y x + ε sgn(w)) · σ(ε‖w‖₁ − y⟨w, x⟩).   (13)

The proof, shown in Appendix A, involves simple algebra, applications of the chain rule, and properties of the sigmoid function σ.
3.2 Expectation of SGD Updates on Adversarially Perturbed BCP data
Since SGD is inherently a stateful process, it is challenging to analyze its sequential dynamics. Instead we analyze the expectation of the gradient updates from an arbitrary state, when the training data points are drawn from an idealized, general data-generation process we call the Biased Coins Process (BCP), defined below. The BCP can be viewed as a generative model simulating the distribution of points drawn with replacement from an actual (finite) dataset. Our main result is Theorem 3.1, which characterizes the expected gradient updates on adversarially perturbed BCP inputs. This result yields several insights that help explain our results on synthetic datasets (Section 4), the UCI datasets (Section C), and real-world advertising datasets (Section 6.2).
In the BCP, a datapoint (x, y) is generated as follows. The label y is chosen uniformly at random from {−1, +1}, and x is a d-dimensional feature vector where for each i,

x_i = y with probability (1 + b_i)/2, and x_i = −y with probability (1 − b_i)/2,   (14)

where b_i ∈ [−1, 1] is the bias of feature i. Note that E[y] = 0 and E[x_i] = 0 for any i, E[x_i y] = b_i, and the variances of x_i and y are both 1.0, so the correlation of x_i and y is b_i. This fact will be useful when interpreting the implications of Theorem 3.1 in Section 3.3.

Consider an arbitrary stage of the SGD algorithm where the current weight vector is w. We generate a datapoint (x, y) according to the BCP and, after replacing the feature-vector x with x + δ, where δ is the adversarial perturbation in Eq. (12), we present (x + δ, y) to the SGD algorithm. We then ask: what is the expectation of the gradient-based update of the weight w_i? Each w_i will be updated to a new value w_i − η ∂L(w; x + δ, y)/∂w_i, where η is the (current) learning rate, and L is the logistic NLL loss defined in Eq. (5). For brevity we denote the change in w_i by Δ_i, i.e.

Δ_i ::= −η ∂L(w; x + δ, y)/∂w_i.   (15)
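For concreteness, the BCP can be sampled in a few lines of NumPy. The parameterization below (feature i agrees with the label with probability (1 + b_i)/2) is our reading of Eq. (14); it is the unique choice for which the correlation of x_i and y equals the bias b_i:

```python
import numpy as np

def sample_bcp(biases, n, rng=None):
    """Draw n points from the Biased Coins Process: the label y is a
    fair coin over {-1, +1}, and feature i agrees with y (x_i = y)
    with probability (1 + b_i) / 2, where b_i is the bias of feature i."""
    rng = rng or np.random.default_rng()
    b = np.asarray(biases, dtype=float)
    y = rng.choice([-1, 1], size=n)
    agree = rng.random((n, b.size)) < (1 + b) / 2
    x = np.where(agree, y[:, None], -y[:, None]).astype(float)
    return x, y
```

A quick empirical check: the sample correlation of each feature with the label should be close to its bias.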
Given the expression (13) for the gradient of L from Proposition 3.1, it turns out that the following conditional expectation will be useful when computing E[Δ_i]:
(16) 
where the expectation is over data points generated by the BCP. Note that the value of x_i y represents whether or not the value of feature i is "aligned" with the label y, and sgn(x_i y w_i) = 1 means that the weight of feature i is "consistent" with its label-alignment: i.e. either x_i = y and w_i > 0, or x_i = −y and w_i < 0. Conversely, sgn(x_i y w_i) = −1 means that the weight of feature i is "inconsistent" with its label-alignment. In general when sgn(x_i y w_i) = c we say that "feature i has consistency c". Further note that the sigmoid term in (13) is the model's predicted probability of the wrong label (on an adversarially perturbed datapoint x + δ). We can therefore interpret (16) as the "expected misprediction, given consistency of feature i". Intuitively, this value is inversely related to the model's performance on the "slice" of the BCP data where feature i has consistency c. Therefore in the initial stages of the SGD algorithm, we expect it to be relatively large (i.e. closer to 1.0), and in the later stages it will be relatively small. This intuition will be useful later in Section 3.3 when we interpret the results of Theorem 3.1, which we state below (see Appendix B for the proof).
Theorem 3.1 (Expectation of logistic gradient update for the BCP)
Given a weight vector w, assuming a learning rate η = 1 (only to avoid notational clutter), if a data point (x, y) is drawn according to the BCP above, and x is replaced by x + δ where δ is the adversarial perturbation given by Eq. (12), then for each i, the expectation (over random draws from the BCP) of the SGD update Δ_i (defined in Eq. (15)) satisfies the following properties:

If w_i = 0, then
(17) 
If w_i ≠ 0, then
(18) and
(19)
3.3 Implications of the Expected Gradient Results
Theorem 3.1 has a few interesting implications. It will help to recall from Section 3.2 that the conditional expectation (16) can be interpreted as the "expected misprediction of the model when feature i has consistency c", which in turn means that in the initial stages of SGD we expect it to be relatively large, and it will shrink as SGD progresses toward a better model. Since this factor appears in all of the bounds of Theorem 3.1, the various effects discussed below are more pronounced in the initial stages of SGD and less so during later stages. Note also that the quantity |b_i| appearing in Theorem 3.1 (and in the implications below) is the absolute correlation between x_i and y, as noted at the beginning of Section 3.2. We will refer to |b_i| as the absolute label-correlation of feature i.
In the following interpretations, we say the weight w_i of feature i is aligned if sgn(w_i) = sgn(b_i), and otherwise we say it is misaligned. We also informally say that the feature is "weight-aligned" or "weight-misaligned" respectively. Also note that sgn(E[Δ_i]) = −sgn(w_i) signifies that the expected update of weight w_i shrinks it toward zero. Conversely, sgn(E[Δ_i]) = sgn(w_i) signifies an expansion of the weight in the current direction.

(1). Weights grow in the right direction starting from zero. Property 1 of Theorem 3.1 implies that for any ε ≥ 0 (i.e. for natural as well as adversarial training), if a feature i has weight w_i = 0, the SGD update will on average "grow" its weight in the "correct" direction, i.e. in the same direction as its bias b_i, and the expected magnitude of the update is proportional to |b_i|.

(2). Misaligned weights are shrunk. When sgn(w_i) ≠ sgn(b_i), the upper bound (18) in Property 2 of Theorem 3.1 shows that the weight of a misaligned feature is shrunk toward zero on average, with a shrinkage magnitude proportional to |b_i| + ε. This makes it clear that adversarial training with a positive value of ε shrinks misaligned weights more aggressively than natural training (i.e. with ε = 0), and this effect is even more pronounced for features with large absolute label-correlations.

(3). Aligned weights are shrunk by a sufficiently large ε. When sgn(w_i) = sgn(b_i), the upper bound (18) simplifies to (20), which means that for a weight-aligned feature i with bias b_i, if the adversarial ε exceeds the feature's absolute label-correlation |b_i|, then its weight is shrunk toward zero in expectation, and the expected magnitude of the shrinkage is proportional to ε − |b_i|.

(4). Aligned weights are expanded up to a point, for sufficiently small ε. When sgn(w_i) = sgn(b_i), the lower bound (19) simplifies to a bound which is positive if ε satisfies the condition (21). In other words, if feature i is weight-aligned, and ε is sufficiently smaller than its absolute label-correlation |b_i| (to account for the additional term in Eq. (21)), then its weight is expanded or preserved on average.
Implications (3) and (4) are specific to adversarial training since they apply only for a certain range of nonzero values of ε. Indeed, these two implications are key to the feature-concentration effect of adversarial training which we explore in detail in the experiment sections. Suppose for example that there are two features (in the BCP-generated data), one with a tiny bias and one with a large bias, and we adversarially train a logistic model with an ε that lies between the two biases. Then by implication (2) a misaligned weight on either feature will be shrunk toward zero on average. Implication (3) means that an aligned weight on the "weakly biased" feature will be shrunk toward zero with a magnitude proportional to almost the full ε. Implication (4) means that an aligned weight on the "strongly biased" feature will be preserved or expanded, provided ε is sufficiently below its bias so as to satisfy the bound (21).
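The two-feature example above can be checked by Monte-Carlo simulation. The sketch below estimates the expected SGD update E[Δ] for one adversarially perturbed step on BCP data, using the ℓ∞ perturbation of Eq. (12) and the gradient of the logistic NLL; the biases (0.05 and 0.5) and ε = 0.2 are hypothetical values chosen purely for illustration:

```python
import numpy as np

def expected_sgd_update(w, biases, eps, n=200_000, lr=1.0, seed=0):
    """Monte-Carlo estimate of E[Delta w] for a single adversarially
    perturbed SGD step on BCP data (an illustrative sketch)."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w, float)
    b = np.asarray(biases, float)
    y = rng.choice([-1, 1], size=n)
    agree = rng.random((n, b.size)) < (1 + b) / 2
    x = np.where(agree, y[:, None], -y[:, None]).astype(float)
    x_adv = x - eps * y[:, None] * np.sign(w)   # x + delta* from Eq. (12)
    margin = y * (x_adv @ w)
    mispred = 1.0 / (1.0 + np.exp(margin))      # sigmoid(-margin)
    grad = -(y * mispred)[:, None] * x_adv      # gradient of the NLL
    return -lr * grad.mean(axis=0)              # expected update E[Delta]

# Aligned weights: the weakly biased feature (0.05 < eps) shrinks on
# average, while the strongly biased one (0.5 > eps) expands.
d = expected_sgd_update([0.1, 0.1], [0.05, 0.5], eps=0.2)
```

With these hypothetical numbers the estimate reproduces implications (2)-(4): the weak aligned weight shrinks, the strong aligned weight expands, and a misaligned weight is pushed back toward zero.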
More generally, if there are truly random nonpredictive features in a dataset that just happen to show spurious correlations with the label in a finite sample, these features will have much weaker biases than the truly predictive features, and so adversarial training with an appropriate ε would tend to weed out the nonpredictive features. We see this phenomenon clearly in the synthetic datasets in the next section, where we intentionally construct a dataset with a mix of predictive and nonpredictive random features. In the experiments on the UCI datasets (Section C) and on real-world advertising datasets (Section 6.2) we also see this feature-pruning effect. In these nonsynthetic experiments, it is possible that the features being pruned are either truly nonpredictive, or very weakly predictive. In datasets with hundreds of thousands of sparse categorical features, pruning even weakly predictive features can be valuable due to the explainability and model-size compression benefits.
Finally, implications (3) and (4) above hint at the possibility of a goldilocks zone of ε values which are large enough to weed out irrelevant or weakly-relevant features (i.e. those with tiny biases) and yet small enough to preserve the truly predictive features (i.e. those with significant biases), thus maintaining model performance. For instance, if the ordered sequence of absolute feature-label correlations has a gap that separates "relevant" features from "irrelevant" or "weakly relevant" ones, then any ε that happens to lie in this gap will induce the following behaviors. Eq. (20) suggests that such an ε is large enough to exceed the absolute label-correlations of the weakly-relevant features, so their weights will tend to shrink even if the features are weight-aligned (and if they are not, their weights will tend to shrink anyway, due to implication (2)). On the other hand, Eq. (21) suggests that if ε is sufficiently below the lowest absolute label-correlation of the "relevant" features (to account for the additional term in the bound in Eq. (21)), then the weights of the relevant features are preserved or expanded. In other words, adversarial training with an appropriate ε acts as an aggressive correlation-based feature-filter, or what we referred to as a relevance-filter in Section 1.
The fact that the model-concentration behavior of adversarial training is realized over a range of ε is important from a practical perspective: it implies that we can find a suitable ε more easily, and that the desired behavior does not occur only for one "lucky" value.
If there are a large number of features which are irrelevant or weakly-relevant, then the results of Theorem 3.1 suggest that adversarial training with an appropriate choice of ε can aggressively weed out the weights of these features, and produce a model with significantly better feature-concentration than a naturally-trained one (i.e. with ε = 0).
It should be pointed out that in Theorem 3.1 we do not analyze how adversarial training impacts model accuracy, and we leave this as an open question for future research.
In closing this section we make an observation about L1 regularization. As noted by Goodfellow2014cy , in the context of logistic regression, the gradient of the loss function under L1 regularization has some similarities with the gradient of the loss under adversarial training as in Eq. (13). In fact, in our experiments using a pure SGD optimizer, we find the behavior of L1 regularization and adversarial training to be somewhat similar on the synthetic and toy UCI datasets (but not on the real-world MediaMath advertising datasets). However, as mentioned in Section 1, most practitioners do not use simple SGD and instead use optimizers such as FTRL Brendan_McMahan2013uw to obtain improved accuracy or AUC on test data. Specifically, when using FTRL we no longer see this similarity, and adversarial training outshines L1 regularization in its ability to home in on relevant features without significantly impacting AUC or accuracy on natural test data.
4 Experiments on Synthetic Datasets
The results of Theorem 3.1 and the discussion in Section 3.3 imply that adversarial training can perform aggressive relevance-filtering when learning a logistic regression model, but the results only apply to data points drawn from the BCP, and only characterize the expectation of SGD weight updates. Nevertheless, these results prompt us to ask: Is it possible to reproduce the relevance-filtering behavior of adversarial training on a finite synthetic dataset with a mix of predictive and nonpredictive features, without sacrificing standard accuracy? Specifically, we want to design a synthetic dataset where:

Natural training places nonnegligible weights on the predictive features, and on at least some nonpredictive ones.

Adversarial training with some ε > 0 places significant weight on the predictive features, but negligible weights on the nonpredictive features.

Adversarial training achieves accuracy (or AUC-ROC) comparable to that of natural training, on natural test data.
It turns out that the following synthetic dataset demonstrates the above phenomena very well.
4.1 Synthetic Data Generation and Training Process
The label y is −1 or +1 with equal probability. The input vector x consists of 2 kinds of features:
- Correlated, strongly predictive features: 8 identical features such that:
(22)
- I.I.D. random nonpredictive features: 8 i.i.d. features each taking the values +1/−1 with equal probability.
For our experiments we generate 1,000 datapoints according to these specifications, with 8 identical predictive features and 8 random nonpredictive features. The rationale behind these choices is to ensure that there are "sufficiently many" features of each type, and 8 of each suffices to highlight the behaviors we are discussing. The rationale for needing "sufficiently many" is the following. When there are sufficiently many random nonpredictive features, some of them will accidentally "look" like they are correlated with the label: this is because in a set of n tosses of a fair coin, the expectation of the fractional imbalance between heads and tails is proportional to 1/√n, and the distribution of the imbalance is highly concentrated around this expectation. As a result, if there are sufficiently many random nonpredictive features, at least some of them will appear correlated with the label, and natural training will put nonnegligible weights on them. Many real-world structured datasets are moderate-sized (say, under 100,000 examples), and even when the number of data points is very large, there may be a very large number of sparse categorical features, so this "spurious" correlation can occur with high likelihood. When there are several identical predictive features, natural model training will force these features to "share" their weight roughly equally, thus pushing down their weights to a level comparable to, or even less than, that of the nonpredictive features.
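The dataset just described can be generated in a few lines. In the sketch below, the predictive coin agrees with the label with probability (1 + bias)/2, where bias = 0.2 is a hypothetical parameterization consistent with the bias value discussed in Section 4.4:

```python
import numpy as np

def make_synthetic(n=1000, p=8, q=8, bias=0.2, seed=0):
    """Generate the synthetic dataset of Section 4.1: p identical
    predictive features that agree with the label y with probability
    (1 + bias)/2, plus q i.i.d. +1/-1 nonpredictive features."""
    rng = np.random.default_rng(seed)
    y = rng.choice([-1, 1], size=n)
    agree = rng.random(n) < (1 + bias) / 2
    pred = np.where(agree, y, -y)                 # one coin, copied p times
    X_pred = np.repeat(pred[:, None], p, axis=1)  # identical predictive cols
    X_rand = rng.choice([-1, 1], size=(n, q))     # nonpredictive cols
    return np.hstack([X_pred, X_rand]).astype(float), y
```

Because the p predictive columns are copies of a single coin, each has the same label-correlation, while the q random columns only show spurious correlations of order 1/√n.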
The experimental training and testing methodology is as follows. We train a logistic model (with a nonzero bias term) on the first 700 datapoints and test on the remaining 300. For training we use a minibatch size of 20 and train for 200 epochs with the FTRL optimizer, with a learning-rate of 0.01 and all weights initialized to zero. While the results are qualitatively similar with other optimizers such as ADAM, simple SGD, or AdaGrad, the best results are obtained using FTRL, as stated in Section 1. During training, we perturb the input vector x using the worst-case adversarial perturbation δ from Eq. (12) with a specific choice of ε, before taking gradients.
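As a minimal sketch of this procedure, the loop below uses plain mini-batch SGD with the closed-form ℓ∞ perturbation of Eq. (12) applied to each batch before the gradient step (the paper's experiments use FTRL, which we do not reproduce here):

```python
import numpy as np

def train_adversarial_logistic(X, y, eps, lr=0.01, epochs=200, batch=20, seed=0):
    """Adversarially train a logistic model: perturb each mini-batch with
    the closed-form L-infinity perturbation of Eq. (12), then take a
    gradient step on the logistic NLL. Plain-SGD sketch, not FTRL."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        order = rng.permutation(n)
        for start in range(0, n, batch):
            idx = order[start:start + batch]
            Xb, yb = X[idx], y[idx]
            Xa = Xb - eps * yb[:, None] * np.sign(w)   # worst-case inputs
            margin = yb * (Xa @ w + b)
            mispred = 1.0 / (1.0 + np.exp(margin))     # sigmoid(-margin)
            w += lr * ((yb * mispred) @ Xa) / len(idx)
            b += lr * np.mean(yb * mispred)
    return w, b
```

Even this simplified optimizer exhibits the relevance-filtering effect on data of the kind described above: with a moderate ε the weights on the random features collapse while the predictive features retain significant weight.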
4.2 Comparison of Natural and Adversarial Training with a Fixed ε
Our first experiment compares the models resulting from adversarial training with ε = 0 (which is equivalent to natural training) versus a fixed ε > 0. As shown in Table 1, the AUC and accuracy of the two models (on natural test data) are similar, but the learned weights show a dramatic difference. To make it easier to distinguish the random nonpredictive features from the predictive features, we relabel the nonpredictive features as r-features and the predictive features as x-features. Table 1 shows that after natural training, the L1 norm of the x-feature weights is comparable to that of the r-feature weights, whereas adversarial training reduces the L1 norm of the r-feature weights to a negligible amount, while maintaining a significant L1 norm on the x-feature weights.
This contrast between natural and adversarial training is seen clearly in the bar chart of Figure 1. As expected, natural training results in equal weights on the 8 predictive features (and this is true for adversarial training as well). However, natural training also places significant weight on the 8 random nonpredictive features, with 4 of them having higher weights than the predictive features. This is clearly problematic from a model-explanation point of view: when explaining the predicted probability on a specific example, we would prefer that the non-relevant features not contribute a meaningful amount, and especially not more than the truly predictive features.
In sharp contrast to natural training, adversarial training does not suffer from this problem: the only significant weights are on the 8 predictive features, and all of the nonpredictive feature-weights are selectively killed off, ending up close to zero. This is precisely the aggressive relevance-filtering effect of adversarial training that we wanted to demonstrate.
Table 1:

training    | AUC   | accuracy | Wts_x_L1 | Wts_r_L1
natural     | 0.668 | 0.676    | 0.724    | 0.661
adversarial | 0.676 | 0.676    | 0.205    | 0.001
4.3 Adversarial vs. Natural Training for a Range of ε Values
We gain further insight into the impact of ε on the adversarially-learned model weights by repeating the above adversarial training (and natural testing) for a range of ε bounds from 0.0 to 2.0. For each value of ε we separately compute the L1 norm of the learned weights of the predictive x-features, and the L1 norm of the learned weights of the nonpredictive r-features. These are plotted in Figure 2, along with the AUC on the 300 natural test data points. The figure shows that as ε increases, the norm of the r-feature weights approaches zero much more rapidly than the norm of the x-feature weights. Moreover, at the level of ε where the r-feature weights approach zero, the AUC (on natural test data) is nearly as high as the AUC with natural training (ε = 0). Notice that there is a range of ε values (shown by the blue band in the figure) which are "just right": i.e. large enough to de-weight the nonpredictive features, yet small enough to preserve sufficient weight on the predictive features and hence have minimal performance impact on natural test data. This is an instance of the goldilocks zone of ε values which we referred to in Section 3.3: adversarial training with an ε value in this zone yields both good model explanations (due to the aggressive relevance-filtering behavior where nonpredictive features are given negligible weight) and good model performance (since standard accuracy/AUC is maintained).
It is natural to wonder whether the feature-weight concentration effect of adversarial training can be achieved using traditional regularization. On this specific synthetic dataset we find that it is indeed possible to achieve a similar de-weighting of the nonpredictive features using L1 regularization, but this requires a very large weight on the regularization term in the loss function: it needs to be at least 200, as shown in Figure 3. However, as we show in Section 6.2, on large real-world datasets such a high regularization weight damages the AUC of the model considerably, while small values do not achieve a model-concentration as significant as that achieved by adversarial training.
4.4 Explanation of Adversarial Training Behavior on the Synthetic Dataset
We can now use Theorem 3.1 to at least partially explain some of the observations we made on our synthetic dataset of 1,000 points. Our experiments used a variant of SGD with a minibatch size of 20 on a shuffled dataset, so the behavior of the learning algorithm can be reasonably approximated by a modified process where at each stage a single datapoint is drawn uniformly at random with replacement from the 1,000-point dataset, perturbed by the adversary, and presented to the SGD algorithm for a gradient-based update. The random draws of points from the dataset can therefore be modeled as a BCP, with appropriate choices of the biases of each feature. Recall that in the synthetic dataset (Section 4) the 8 identical predictive features each have a bias of 0.2. Since these will share weights equally in the optimal solution, we can reasonably approximate the overall effect of these identical predictive features using a single feature in the BCP with bias 0.2. The 8 nonpredictive features are each independently and uniformly chosen from {−1, +1}. Thus each nonpredictive feature has probability 0.5 of agreeing or disagreeing with the label y. However, in a finite sample of n data points, for a nonpredictive feature the expectation of the fractional absolute imbalance between agreements and disagreements (with the label) is of the order of 1/√n, and moreover the distribution of this imbalance is highly concentrated around the expectation. Therefore we can model each of these nonpredictive features as a feature in the BCP with an absolute bias of order 1/√n.
In other words, we are approximating the synthetic dataset of Section 4 with a BCP where there is one predictive feature with bias 0.2, and 8 nonpredictive features with absolute bias of order 1/√n. Theorem 3.1 and its implications (Section 3.3) then help us explain the following observations. (It will be helpful to refer to Figure 1, which shows the weights from natural and adversarial training, and Figure 2, which shows the variation of the L1 norms of the two feature-types with varying ε.)
Observation 1: Natural training (i.e. adversarial training with ε = 0) places nonnegligible weights on at least some of the nonpredictive random features, with some of them having weights even larger than the predictive features. This is seen in Figure 1. As mentioned before, since the 8 predictive features are identical, natural (and adversarial, for any ε) learning results in a model where the weights are shared equally amongst them. On the other hand, many of the random nonpredictive features have an absolute bias of order 1/√n. Due to implication (1) in Section 3.3, their weights grow from 0 in the direction of their bias, and due to implication (4) they are maintained or expanded in their current direction, since ε = 0 in natural training. This effect, combined with the fact that the weights of the 8 predictive features are shared equally, causes the weights of some of the nonpredictive features to exceed those of the predictive features.
Observation 2: With adversarial training (ε > 0) the learned weights on the 8 nonpredictive features are close to 0 (Figure 1), but the 8 predictive features retain significant weights. This is the key relevance-filtering effect highlighted by these experiments. Note that the chosen ε far exceeds the absolute bias of the random nonpredictive features, so by implication (2) in Section 3.3 any misaligned weights of these features are shrunk toward zero, and by implication (3) any aligned weights are also shrunk toward zero. Both expected shrinkage rates are essentially those of a zero-bias feature, since the absolute biases of these nonpredictive features are of order 1/√n and therefore negligible compared to ε. The single (effective) predictive feature, on the other hand, has bias 0.2, so by implication (2), if its weight is misaligned it shrinks toward zero at a much more aggressive rate than that of the nonpredictive features. If the predictive feature has an aligned weight, by implication (3) it shrinks toward zero whenever ε > 0.2, but at a much slower rate than that of the nonpredictive features. Presumably, after the weight reaches a certain level, the expected shrinkage becomes negligible, which preserves the weight. This can explain why even with adversarial training the weights on the predictive features remain significant.
5 Feature Attribution in 1Layer Networks
We now turn our attention to measuring feature concentration, which we argue in this paper is a key benefit of adversarial training on logistic regression models. As mentioned in Section 2.8, feature concentration could potentially be measured in terms of the model weights, but this is not always the best approach. This is particularly the case for real-world datasets such as those considered in the advertising response-prediction task in Section 6.2, where there are many high-cardinality categorical features, and we need an appropriate way to measure the aggregate importance of a categorical feature over a dataset. It is therefore worth considering more principled ways of measuring feature-importance which are amenable to a natural aggregate importance definition.
We find that the feature-attribution method of Integrated Gradients (IG) Sundararajan2017rz (described in Section 2.8) is very well suited to our purposes: We derive a closed-form expression for the IG-based feature-attributions of a 1-layer neural network (Lemma 5.1), which makes it computationally very efficient to compute the attributions of all features (which could number in the hundreds of thousands, since there are potentially many high-cardinality categorical features). Moreover, in Section 5.2 we propose natural ways to aggregate the IG-based feature-importance metrics across a dataset (which we call Feature Impact and Feature-Value Impact), for categorical as well as numerical features.
5.1 Closed Form for IG in 1Layer Networks
For general neural networks, the authors of Sundararajan2017rz show how to approximate the IG integral (8) by a summation involving gradients at equally-spaced points along the straight-line path from x′ to x. While this approximation is reasonably efficient for a fixed example x and dimension i, it can be prohibitively expensive for computing the IG values across a dataset of millions of examples and thousands of (sparse) features. Closed-form expressions for the IG would therefore be of significant interest, especially if the goal is to compute the IG over an entire dataset in order to glean aggregate feature importances.
We first show a closed-form exact expression for the IG when F is computed by a single-layer network.
Lemma 5.1 (IG Attribution for 1layer Networks)
If F is computed by a 1-layer neural network (3) with weight vector w, then the Integrated Gradients for all dimensions of x relative to a baseline x′ are given by:

IG(x) = [(x − x′) ⊙ w / ⟨w, x − x′⟩] · (F(x) − F(x′)),   (23)

where the ⊙ operator denotes the entrywise product of vectors.
Proof:
Consider the partial derivative in the definition (8) of IG_i(x). For a given x, x′ and α ∈ [0, 1], let z denote the vector x′ + α(x − x′). Then F(z) = g(⟨w, z⟩), where g is the activation function, and by applying the chain rule we get:

dF(z)/dα = g′(⟨w, z⟩) · ⟨w, x − x′⟩,

where g′(⟨w, z⟩) is the gradient of the activation at ⟨w, z⟩. This implies that:

g′(⟨w, z⟩) = (dF(z)/dα) / ⟨w, x − x′⟩.

We can therefore write

∂F(z)/∂z_i = g′(⟨w, z⟩) · w_i,

and since ⟨w, x − x′⟩ is a scalar, this yields

∂F(z)/∂z_i = [w_i / ⟨w, x − x′⟩] · dF(z)/dα.

Using this equation the integral in the definition of IG_i(x) can be written as

∫₀¹ ∂F(z)/∂z_i dα = [w_i / ⟨w, x − x′⟩] ∫₀¹ (dF(z)/dα) dα = [w_i / ⟨w, x − x′⟩] (F(x) − F(x′)),   (24)

where (24) follows from the fact that w_i and ⟨w, x − x′⟩ do not depend on α. Therefore from the definition (8) of IG_i(x):

IG_i(x) = (x_i − x′_i) [w_i / ⟨w, x − x′⟩] (F(x) − F(x′)),

and this yields the expression (23) for IG(x).
Note that the closed-form expression (23) does not depend on the activation derivative g′ at all, as long as the activation is differentiable. There is a natural interpretation of the closed-form expression (23): when the input changes from the baseline value x′ to x, the dot product ⟨w, x⟩ changes by ⟨w, x − x′⟩; the fractional contribution of dimension i to this change is w_i(x_i − x′_i)/⟨w, x − x′⟩, and IG_i(x) is this fraction times the total function-value change F(x) − F(x′).
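The closed form of Lemma 5.1 is straightforward to implement. The sketch below assumes a sigmoid activation for concreteness, although, as noted above, the formula itself is activation-independent:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ig_one_layer(w, b, x, baseline, act=sigmoid):
    """Exact Integrated Gradients for F(x) = act(w . x + b), per Eq. (23):
    the output change F(x) - F(baseline) is apportioned to each dimension
    in proportion to its share of the change in the dot product w . x."""
    w = np.asarray(w, float)
    x = np.asarray(x, float)
    baseline = np.asarray(baseline, float)
    delta = x - baseline
    share = (w * delta) / np.dot(w, delta)   # fractional contributions
    return share * (act(w @ x + b) - act(w @ baseline + b))
```

Two quick sanity checks: the attributions sum exactly to F(x) − F(x′) (the completeness axiom), and they agree with a fine Riemann-sum approximation of the path integral (8).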
5.2 Aggregation of IG Over a Dataset
The IG methodology of Sundararajan2017rz only considers feature attribution for a single example x. In order to understand the relative importance of features over a (possibly large) dataset, it would be helpful to aggregate the IG values across multiple examples. We propose here a simple method to do this for structured datasets.
We can now describe our IG aggregation procedure. As mentioned before, we assume that the neural network input x is an exploded-form vector. Note that the j'th dimension of x corresponds either to a numerical feature in the original feature-vector, or to some specific value of a categorical feature. The IG value IG_j(x) for each dimension j of x can be computed from Eq. (8) for a general neural network, or from Eq. (23) for a 1-layer network. Informally, IG_j(x) measures the extent to which the j'th dimension of x contributed to "moving" the network output from its baseline value F(x′) to F(x). In other words, IG_j(x) represents the impact of the j'th dimension on the output for example x. A reasonable measure of the "importance" of dimension j in a suitable dataset D is therefore the simple average of IG_j(x) over all x ∈ D. We call this the feature-value impact FVI (since the j'th dimension in exploded space corresponds to a specific value of a categorical feature, or to a numerical feature):

FVI_j ::= (1/|D|) Σ_{x ∈ D} IG_j(x).   (25)
In the case of a categorical feature, we are also interested in the overall impact of that feature. For example, we may want to know the overall importance of the dayOfWeek feature in some dataset D. A reasonable way to compute the feature-impact FI, i.e. the overall importance of a categorical feature f in original form, is to add the FVI values over all dimensions corresponding to this feature in the exploded space:

FI_f ::= Σ_{j ∈ dims(f)} FVI_j,   (26)

where dims(f) denotes the set of exploded dimensions corresponding to feature f.
The FI metric is particularly useful to gain an understanding of the aggregate importance of highcardinality categorical features. For example we measure the featureconcentration of models trained on the MediaMath datasets (which have categorical features with cardinalities in the 100,000 range) in terms of the FI metric (see the FI.L1 and FI.1Pct metrics in Section 6.1).
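Both aggregation metrics reduce to a mean and a grouped sum over a matrix of per-example IG attributions. A minimal sketch (the `groups` mapping from each original feature to its exploded column indices is hypothetical bookkeeping, not the paper's notation):

```python
import numpy as np

def feature_value_impact(IG):
    """FVI (Eq. (25)): average IG attribution of each exploded
    dimension over the dataset. IG is an (n_examples, n_dims) array."""
    return IG.mean(axis=0)

def feature_impact(fvi, groups):
    """FI (Eq. (26)): overall impact of each original feature, obtained
    by summing FVI over the exploded dimensions belonging to it.
    `groups` maps feature name -> list of exploded column indices."""
    return {name: float(np.sum(fvi[cols])) for name, cols in groups.items()}
```

For a one-hot-encoded categorical feature such as "country", `groups["country"]` would list every column produced by its encoding, so FI reports the feature's impact in aggregate.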
6 Experiments
In Section 3.2 we analyzed the expectation of the SGD weight-updates of a logistic regression model during adversarial training, on an idealized random data-generation process (the BCP). Theorem 3.1 in that section suggested that adversarial training can have an aggressive relevance-filtering (or model-concentration) behavior: the possibility that it aggressively shrinks the weights of irrelevant or weakly-relevant features, while maintaining significant weights on relevant features, and hence does not significantly impact performance on natural test data. In Section 4 we showed that this behavior can be realized on a specific synthetic dataset. A natural next question is whether the model-concentration benefits of adversarial training can be seen in real-world structured datasets. This is the question we explore in this section.
Specifically, we describe the results of experiments intended to answer the following questions for real-world datasets:

Is there a "goldilocks zone" of ε values for which significant model-concentration (as measured by various metrics defined in Section 6.1 below) is achieved by adversarial training, while AUC (on natural test data) drops by no more than 0.01 relative to natural training (i.e. with ε = 0)?

Fixing ε at a value in the goldilocks zone, how do the model-weights and IG-based Feature Impact metrics (defined in Section 5.2) of the adversarially trained model compare with the corresponding metrics of a naturally trained model?

If we train the model on natural data, but with L1 regularization using a regularization penalty factor λ, can we achieve a similar effect, i.e. produce model-concentration comparable to adversarial training, and yet maintain AUC within 0.01 of the AUC with λ = 0?
To answer the above questions, we performed experiments on two kinds of datasets: (a) two datasets ("mushroom" and "spambase") from the UCI ML data repository Dua:2017 , and (b) large-scale real-world ad conversion-prediction datasets from MediaMath. Although the UCI datasets are "real" in the sense that they are derived from real domains, their size has been kept relatively small (typically no larger than a few thousand data records) to facilitate benchmarking and quick testing or demonstration of ideas. The MediaMath datasets, on the other hand, have millions of records and hundreds of thousands of (sparse) features, and are actually used to train models that determine bids in the real-time bidding system that the company operates (more details are in Section 6.2).
The findings from our experiments, corresponding to the three questions above, are as follows:

For all the datasets studied, there is indeed a "goldilocks zone" of ε values for which adversarial training produces significantly more concentrated models, with an AUC drop (on natural test data) of no more than 0.01 relative to natural training.

Examining feature weights or Feature Impact in adversarially trained models (for a fixed ε in the goldilocks zone) reveals that these models are significantly more concentrated than their naturally-trained counterparts. There are often cases where features given importance by a naturally-trained model are much less important in an adversarially trained model, and vice versa.

On a MediaMath dataset, natural logistic model training with L1 regularization using a penalty factor λ achieves some model concentration, but significantly less than adversarial training, and as λ is increased beyond 0.2, the AUC degrades rapidly. A similar effect is seen on the UCI datasets: as λ is increased, the model concentration improves and the AUC (on natural test data) drops, but at the point where the AUC is 0.01 below the AUC for λ = 0, the model concentration is much inferior to that produced by adversarial training.
Section 6.1 describes model training and evaluation methodology. Section 6.2 describes the results from experiments on the MediaMath datasets, and the results from the UCI datasets are in Appendix C.
6.1 Model Training and Evaluation Methodology
We describe here the common aspects of the training and evaluation methodology for the UCI and MediaMath datasets. Any variations specific to the datasets are described in the respective subsections. All the tasks we consider are probability prediction tasks as described in Section 2.1, where the prediction target is a binary +1/−1 variable, with +1 indicating a positive example and −1 indicating a negative example. (The specific implementations may actually use a 0/1 label instead, but we keep the ±1 description here as it simplifies some of the analytical expressions). Our code is implemented in Python using the high-level TensorFlow Estimator and Dataset APIs.
It is important to note that all categorical variables are 1-hot encoded (as described in Section 2.2) prior to being fed to the model-training and evaluation code. In other words, we apply the adversarial perturbation (given by (12)) to the input vector in exploded form. A reasonable question is whether such perturbations are semantically meaningful, and whether they represent legitimate perturbations by an adversary. One could argue that a real adversary would only be able to perturb the original input vector, and so the set of allowed perturbations should be restricted to legitimate 1-hot encodings. Indeed, some authors have considered this type of restriction in the domain of malware detection AlDujaili2018yp . We set this issue aside in this paper, since our interest is more in the model-concentration effect of adversarial training, and less in robustness to real attacks.

Each dataset is divided into train and test subsets. For training on natural examples we use the FTRL optimizer in TensorFlow with L1 regularization strength (λ) set to 0. (We vary λ to evaluate the effect of L1 regularization). We use FTRL mainly because in TensorFlow the FTRL optimizer has an optional argument that controls the strength of L1 regularization; as mentioned in Sec. 4, our results are qualitatively similar with other optimizers such as ADAM, AdaGrad or simple SGD, but the best results are obtained with FTRL. All model weights are initialized to zero in the case of the synthetic and toy UCI datasets, whereas they are initialized with a Gaussian initializer (mean 0, variance 0.001) in the case of the MediaMath ad response-prediction models; our results remain the same with either initializer. For adversarial training we also use the FTRL optimizer, except that in each minibatch the examples are perturbed according to the worst-case perturbation given by Eq. (12), as described in the canonical SGD setup in Section 2.6.
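For a linear logistic model attacked by an l-infinity-bounded adversary, the worst-case perturbation has a simple closed form: each exploded-space dimension is pushed by ε in the direction that most decreases the margin y(w·x). The sketch below reflects our reading of Eq. (12) (which is not reproduced here); the function names are ours, and it uses plain numpy rather than our TensorFlow/FTRL implementation:

```python
import numpy as np

def worst_case_perturbation(x, y, w, eps):
    """Worst-case l-infinity perturbation of a single example for a
    linear logistic model: shift each coordinate by eps against the
    margin y * (w . x). Labels y are +/-1."""
    return x - eps * y * np.sign(w)

def adversarial_minibatch_grad(X, y, w, eps):
    """One adversarial-SGD-style step: perturb the minibatch at the
    current weights, then return the gradient of the mean logistic
    loss log(1 + exp(-margin)) evaluated at the perturbed inputs."""
    X_adv = X - eps * y[:, None] * np.sign(w)[None, :]
    margins = y * (X_adv @ w)
    coef = -y / (1.0 + np.exp(margins))  # d/d(margin) of the loss, times y
    return (coef[:, None] * X_adv).mean(axis=0)
```

In the actual experiments this perturbation is applied inside each minibatch before the FTRL update, so no inner maximization loop is needed.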
Some authors train adversarially robust models by first training on natural examples and then training on adversarial examples. In our experiments, however, we find that the initial pre-training on natural examples does not make a difference, at least for the model-concentration effects we study.
Once a model is trained (adversarially or naturally) we compute two types of metrics:

An ML performance metric, the AUC-ROC (Area Under the ROC Curve) on the held-out natural test dataset.

A few feature-concentration metrics, defined as follows, where w denotes the linear model weight-vector and d the dimension of the exploded feature-space (i.e. after 1-hot encoding).
 Wts.L1:

||w||_1 / ||w||_inf, which is a measure of the overall magnitude of the weights, scaled by the biggest absolute weight. Note that if we multiply all weights by a constant factor, Wts.L1 does not change.
 Wts.1Pct:

The percent of the d weights in w whose absolute value is at least 1% of the maximum absolute weight. This can be thought of as a measure of how many feature-weights are "significant", where the threshold of significance is 1% of the biggest absolute weight.
 FI.L1:

||FI||_1 / ||FI||_inf, where FI stands for the vector of Feature Impact values (defined by Eq. (26)), the index ranges over the dimensions of the original feature-space (i.e. before 1-hot encoding), and the dataset D is the natural training dataset.
 FI.1Pct:

The percent of components of FI (which are all positive by definition) that are at least 1% of the biggest component of FI, again over the natural training dataset.
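The four concentration metrics above reduce to two functions of a vector: one applied to the weight-vector w (Wts.L1, Wts.1Pct) and the same one applied to the FI vector (FI.L1, FI.1Pct). A small sketch under that reading (function names are ours):

```python
import numpy as np

def conc_l1(v):
    """Wts.L1 / FI.L1: l1 norm of the vector scaled by its largest
    absolute component. Invariant to rescaling all components by a
    constant; ranges from 1 (one dominant component) up to len(v)."""
    a = np.abs(v)
    return a.sum() / a.max()

def conc_1pct(v):
    """Wts.1Pct / FI.1Pct: percent of components whose absolute value
    is at least 1% of the maximum absolute component."""
    a = np.abs(v)
    return 100.0 * (a >= 0.01 * a.max()).mean()
```

A highly concentrated model thus has both metrics low: few significant components, and a small l1 mass relative to the dominant weight.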
In all our experiments, when we use the closed-form formula in Eq. (23) to compute the FVI (Feature-Value Impact) values (Eq. (25)) of the dimensions of the exploded feature-vector, we use the all-zeros vector as the baseline input.
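Eq. (23) itself is not reproduced in this excerpt, but for a 1-layer network F(x) = sigmoid(w·x + c), IG along the straight-line path from the baseline b to x collapses to a closed form: the path gradient is sigmoid'(·)·w, and the integral of sigmoid' along the line equals [F(x) − F(b)] / (w·(x − b)). A sketch of this closed form (our own function names, zero baseline by default as in the experiments):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ig_one_layer(x, w, c=0.0, baseline=None):
    """Closed-form Integrated Gradients for F(x) = sigmoid(w.x + c),
    along the straight line from `baseline` (default all-zeros) to x:
        IG_i = (x_i - b_i) * w_i * [F(x) - F(b)] / (w . (x - b))
    The shared factor is the sigmoid difference divided by the change
    in the pre-activation, so attributions satisfy completeness:
    they sum exactly to F(x) - F(b)."""
    b = np.zeros_like(x) if baseline is None else baseline
    dot = w @ (x - b)
    if np.isclose(dot, 0.0):
        return np.zeros_like(x)
    factor = (sigmoid(w @ x + c) - sigmoid(w @ b + c)) / dot
    return (x - b) * w * factor
```

This is exact (no path discretization), which is what makes computing FVI over millions of examples practical.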
In the various tables of results, we use the abbreviation nat to refer to metrics for the naturally-trained model, and adv to refer to metrics for the adversarially-trained model.
6.2 Experiments with MediaMath Datasets: Ad Conversion Prediction
MediaMath provides a software platform that operates a real-time bidding (RTB) engine which responds to bid-opportunities sent by ad-exchanges. The RTB engine bids on behalf of advertisers who set up ad-campaigns on the platform. A key component in determining bid prices is a prediction of the probability that a consumer exposed to the advertiser's campaign will subsequently perform a certain designated action (called a "conversion"). MediaMath currently trains a logistic regression model for each campaign to generate these conversion probability predictions. The models are trained on a dataset collected over a number of days, where each record contains various features related to the ad opportunity (such as device type, browser, location, time of day, etc.), as well as a 0/1 label indicating whether or not a conversion occurred subsequent to ad exposure. The model for each campaign is trained on a sequence of 18 days of data, and validated/tested on the subsequent 3 days of data. The total number of records in each dataset can range from half a million to 50 million, depending on the campaign. Each record has around 100 features, mostly categorical, and some (such as "siteID") have cardinalities as high as 100,000, so the dimension of the exploded feature-space (i.e. after 1-hot encoding) is on the order of 400,000. (We use feature-hashing rather than explicit 1-hot encoding to map some of the high-cardinality features to a lower-dimensional vector, but the net effect is similar to 1-hot encoding, except that each dimension in the 1-hot encoding vector may now correspond to multiple features, due to hash collisions.)
Given the extremely high dimensionality of the exploded feature-space, it is of considerable practical importance to understand which features have a truly significant impact on the predictions. Specifically, we wish to explore whether adversarial training can yield models with significantly better feature concentration, while maintaining the AUC within, say, 0.01 of the naturally-trained model. We have seen strong evidence that this is indeed possible, both on synthetic datasets (Section 4) and on some UCI datasets (Section C). We show below that a similar phenomenon holds in the conversion-prediction models.
To study the impact of adversarial training, we performed experiments with a wide range of values of ε and found that for most campaigns, adversarial training with ε = 0.001 or ε = 0.01 results in feature-concentrations significantly better than with natural training, while maintaining AUC (on the validation set) within 0.01 of the AUC of a naturally-trained model. We also experimented with keeping ε = 0 and varying the L1 regularization parameter λ in the FTRL optimizer, and found that any λ beyond 0.2 significantly lowers the AUC of the resulting model, while lower values do not yield a feature-concentration as strong as that achieved by adversarial training. Indeed, we find that the effects of adversarial training and L1 regularization are complementary: when an appropriate value of λ is used in conjunction with adversarial training (say, with ε = 0.01), the L1 regularization helps to "clean up" the very low feature-weights produced by adversarial training by pushing them to zero.
Table 2 shows a summary of results on 9 campaigns. (All campaign IDs and feature names are masked for client confidentiality reasons.) In some cases the AUC of the adversarially-trained model is better than that of the naturally-trained model. Recall that the Wts.1Pct metric measures what percent of dimensions (in the exploded space, after 1-hot encoding) have absolute weights at least 1 percent of the highest absolute weight. Since most features are categorical, Wts.1Pct is therefore a measure of what percent of feature-values are significant to the model. This metric (as well as Wts.L1) falls drastically with adversarial training in all cases, which indicates that many of the feature-values are simply not relevant to predicting the label. There is thus a potentially massive model-compression opportunity, which can have benefits in storing, updating and serving models (MediaMath periodically trains around 40,000 models). Table 2 also shows the FI.1Pct and FI.L1 metrics, which are aggregate feature-impact concentration metrics over the natural training dataset. Note that these are at the feature level and not the feature-value level. Since the FI measure of a categorical feature aggregates the FVI metric over all values of that feature, the drop in this metric (when we go from natural to adversarial training) is not as dramatic as in the case of Wts.L1 or Wts.1Pct (and sometimes these metrics are higher than with natural training).
Campaign | training         | AUC   | Wts.1Pct | Wts.L1 | FI.1Pct | FI.L1
285      | nat              | 0.560 | 12.93    | 165.05 | 0.78    | 5.80
         | adv (ε = 0.01)   | 0.556 | 0.39     | 13.27  | 0.37    | 2.87
479      | nat              | 0.697 | 11.23    | 92.02  | 3.20    | 12.52
         | adv (ε = 0.001)  | 0.694 | 6.17     | 66.96  | 3.13    | 12.34
622      | nat              | 0.565 | 27.11    | 110.12 | 6.63    | 5.73
         | adv (ε = 0.01)   | 0.561 | 4.21     | 18.18  | 2.87    | 3.55
594      | nat              | 0.702 | 16.91    | 172.55 | 0.86    | 12.54
         | adv (ε = 0.001)  | 0.702 | 1.09     | 19.97  | 0.77    | 11.19
473      | nat              | 0.683 | 28.02    | 177.36 | 3.14    | 14.69
         | adv (ε = 0.001)  | 0.673 | 3.58     | 55.68  | 2.84    | 15.69
070      | nat              | 0.622 | 18.53    | 158.15 | 4.94    | 26.21
         | adv (ε = 0.001)  | 0.625 | 7.55     | 107.63 | 4.46    | 24.00
645      | nat              | 0.573 | 16.26    | 251.37 | 2.78    | 31.78
         | adv (ε = 0.01)   | 0.627 | 1.07     | 34.45  | 1.12    | 10.60
733      | nat              | 0.658 | 27.35    | 203.73 | 4.04    | 11.36
         | adv (ε = 0.001)  | 0.667 | 9.91     | 108.03 | 4.13    | 11.60
735      | nat              | 0.758 | 12.20    | 220.97 | 1.87    | 21.82
         | adv (ε = 0.01)   | 0.765 | 0.51     | 21.02  | 0.75    | 16.36
To illustrate the effect of adversarial training in more detail, we focus on campaign number 735 (the bottom row in Table 2) and compare the results from natural training and adversarial training (with ε = 0.01). Figure 4 compares the Feature Impact (FI) values between these models; Figure 5 compares the feature-weight drop-off curves of these models; and Figure 6 compares the FI drop-off curves.
Figures 7 and 8 contrast the ability of adversarial training and L1 regularization to improve model concentration while maintaining AUC (on natural test data): adversarial training with ε = 0.01 improves the concentration metric Wts.1Pct to as low as 0.5% (compared to 12% for a naturally trained model, an improvement factor of roughly 24), and yet achieves an AUC slightly higher than with natural training. On the other hand, with L1 regularization, a strength of λ = 0.2 improves the concentration to 5% (significantly worse than the 0.5% for adversarial training) and slightly improves upon the naturally-trained AUC, but any higher value of λ significantly degrades the AUC, and the Wts.1Pct concentration metric does not go below 2.5%.
7 Conclusion and Future Work
We considered the question of whether adversarial learning can be used as a mechanism to trim logistic models significantly, while maintaining performance (as measured by AUC or accuracy) on natural (unperturbed) test data. From an explainability standpoint, it is highly desirable that models do not heavily weigh features that are irrelevant or marginally influential. We explored this possibility of feature-concentration both theoretically and empirically, in the context of logistic regression models and bounded adversaries. On the theory side, we derived bounds on the expectation of the weight updates, in terms of the adversarial bound ε, the feature's bias, its current weight, and the current overall learning stage of the model. Our results suggest there is often a goldilocks zone of adversarial bounds ε that are "just right": large enough to weed out irrelevant features, yet small enough to maintain reasonable weights on truly predictive ones, and hence not impact model performance on natural test data. The practical implication of the goldilocks zone is that it makes it easy to find a suitable ε, so the desired behavior is not restricted to just one (or a few) "lucky" value of ε.
Our theory both motivates and at least partially explains our experimental studies. We designed a synthetic dataset containing a mix of predictive features and random non-predictive features, and showed that natural learning tends to learn significant weights on the non-predictive features (some higher than those of the predictive features) simply because they show spurious correlations with the label in a finite sample of data. By contrast, adversarial training with a large enough bound ε can weed out these noise features while maintaining weights on the predictive features, hence minimally impacting (if at all) the model's performance on natural test data.
We demonstrated the feature-pruning effect of adversarial training on two toy UCI datasets and on real-world advertising response-prediction datasets from MediaMath. On the latter datasets we showed that adversarial training with perturbation bounds ε as small as 0.001 or 0.01 can achieve as much as a factor-of-20 reduction in the number of "significant" weights (defined as the number of weights whose magnitude is at least 1% of the maximum magnitude), yet performance on natural test data is not impacted, and sometimes even improves upon natural training.
We also showed that this effect is not easily replicated with L1 regularization. In particular, in the synthetic datasets one needs to use an unusually large value of the regularization weight λ, whereas no value of λ achieves a comparable effect in the UCI datasets and the real-world MediaMath datasets. As mentioned in the experiments sections, we obtain the best results using the FTRL optimizer, compared to other optimizers such as ADAM, AdaGrad or simple SGD. We leave an explanation of why this is the case for future work.
It is worth pointing out that on specific datasets, it may well be possible to reproduce this model-concentration behavior with natural model training, using carefully custom-designed hyperparameters (such as a learning-rate schedule customized to the dataset). We emphasize, however, that it is simpler to achieve this behavior using adversarial training: it merely involves trying a set of ε values, without the need for specially customizing the other hyperparameters.
To speed up adversarial training we relied on our closed-form formula for the worst-case adversarial perturbation for logistic models. We quantified feature-concentration in a few different ways, including some based on the model weights, and some derived from the Integrated-Gradients (IG) feature-attribution method. We derived a closed-form formula for the IG-based feature-attribution for 1-layer neural networks, which we leverage to compute a new metric of aggregate feature importance that we introduced, called Feature Impact.
It would be of significant interest to expand our theoretical analysis of adversarial learning, and in particular to show a link to accuracy (something we did not do, unlike the analysis of Madry2017zz ). Extending the model-concentration analysis to models with one or more hidden layers would also be interesting. Deriving closed-form formulas (or efficient approximations) for adversarial perturbations, as well as feature-attributions, for more complex networks would also help make adversarial training and the measurement of aggregate feature impact more practical.
Another direction for exploration is to consider more carefully the notion of an adversarial perturbation in structured datasets (a point alluded to at the beginning of Section 6). In image domains, an adversarial perturbation is one that preserves perceptual similarity (from a human observer's perspective) and yet causes a model to misclassify the example. We have sidestepped this issue in this paper, since our primary motivation was to study the model-concentration effects of adversarial learning. However, even for this specific purpose, it may be useful to consider other classes of permissible perturbations, such as perturbations constrained to be valid inputs when the features are categorical (for example, AlDujaili2018yp consider this type of restriction for adversarially robust malware detection). In our experiments, we perturb the 1-hot encoded vector along all dimensions, which will in general result in a vector that is not a valid representation of any input vector (since multiple dimensions corresponding to a single categorical feature may be nonzero). It is possible that such constrained perturbations produce even better results from a model-concentration point of view.
References
 (1) D. Dheeru and E. Karra Taniskidou, "UCI machine learning repository," 2017. http://archive.ics.uci.edu/ml.
 (2) M. Sundararajan, A. Taly, and Q. Yan, "Axiomatic attribution for deep networks," arXiv:1703.01365 [cs.LG]. http://arxiv.org/abs/1703.01365.
 (3) C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, "Intriguing properties of neural networks," arXiv:1312.6199 [cs.CV]. http://arxiv.org/abs/1312.6199.