An emerging problem in trustworthy machine learning is to train models that produce robust interpretations for their predictions. We take a step towards solving this problem through the lens of axiomatic attribution of neural networks. Our theory is grounded in the recent work, Integrated Gradients (IG), in axiomatically attributing a neural network's output change to its input change. We propose training objectives in classic robust optimization models to achieve robust IG attributions. Our objectives give principled generalizations of previous objectives designed for robust predictions, and they naturally degenerate to classic soft-margin training for one-layer neural networks. We also generalize previous theory and prove that the objectives for different robust optimization models are closely related. Experiments demonstrate the effectiveness of our method, and also point to intriguing problems which hint at the need for better optimization techniques or better neural network architectures for robust attribution training.


## 1 Introduction

Trustworthy machine learning has received considerable attention in recent years. An emerging problem in this domain is to train models that produce reliable interpretations for their predictions. For example, a pathology prediction model may predict that certain images contain a malignant tumor. One would then hope that under visually indistinguishable perturbations of an image, similar sections of the image, rather than entirely different ones, account for the prediction. However, as Ghorbani, Abid, and Zou GAZ17 convincingly demonstrated, for existing models one can generate minimal perturbations that substantially change model interpretations while keeping their predictions intact. Unfortunately, while the robust prediction problem for machine learning models is well known and has been extensively studied in recent years (for example, MMSTV17; SND18; WK18, and also the tutorial by Madry and Kolter KM-tutorial), there has been only limited progress on the problem of robust interpretations.

In this paper we take a step towards solving this problem by viewing it through the lens of axiomatic attribution of neural networks, and propose Robust Attribution Regularization. Our theory is grounded in the recent work on Integrated Gradients (IG) STY17, which axiomatically attributes a neural network’s output change to its input change. Specifically, given a model $f$, two input vectors $\mathbf{x}, \mathbf{x}'$, and an input coordinate $i$, $\mathrm{IG}^f_i(\mathbf{x},\mathbf{x}')$ defines a path integration (parameterized by a curve from $\mathbf{x}$ to $\mathbf{x}'$) that assigns a number to the $i$-th input as its “contribution” to the change of the model’s output from $f(\mathbf{x})$ to $f(\mathbf{x}')$. IG enjoys several natural theoretical properties (such as the Axiom of Completeness, which says that summing up the attributions of all components should give $f(\mathbf{x}') - f(\mathbf{x})$) that other related methods violate.

We briefly overview our approach. Given a loss function $\ell$ and a data generating distribution $P$, our Robust Attribution Regularization objective contains two parts: (1) achieving a small loss over the distribution $P$, and (2) keeping the attributions of the loss over $P$ “close” to the attributions over $Q$ whenever the distributions $P$ and $Q$ are close to each other. We can naturally encode these two goals in two classic robust optimization models. (1) In the uncertainty set model BGN09-robust-optimization, we treat sample points as “nominal” points and assume that the true sample points come from a certain vicinity around them, which gives:

$$\operatorname*{minimize}_{\theta}\ \mathbb{E}_{(\mathbf{x},y)\sim P}\big[\rho(\mathbf{x},y;\theta)\big], \quad \text{where}\ \ \rho(\mathbf{x},y;\theta)=\ell(\mathbf{x},y;\theta)+\lambda\max_{\mathbf{x}'\in N(\mathbf{x},\varepsilon)} s\big(\mathrm{IG}^{\ell_y}_{\mathbf{h}}(\mathbf{x},\mathbf{x}';r)\big)$$

where $\mathrm{IG}^{\ell_y}_{\mathbf{h}}(\cdot)$ is the attribution w.r.t. neurons in an intermediate layer $\mathbf{h}$, and $s(\cdot)$ is a size function (e.g., a norm such as $\|\cdot\|_1$) measuring the size of the attribution. (2) In the distributional robustness model SND18; EK15, closeness between $P$ and $Q$ is measured using metrics such as the Wasserstein distance, which gives:

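$$\operatorname*{minimize}_{\theta}\ \mathbb{E}_{P}[\ell(P;\theta)]+\lambda\sup_{Q;\,M\in\prod(P,Q)}\Big\{\mathbb{E}_{Z,Z'}\big[d_{\mathrm{IG}}(Z,Z')\big]\ \ \text{s.t.}\ \ \mathbb{E}_{Z,Z'}\big[c(Z,Z')\big]\le\rho\Big\}.$$

In this formulation, $\prod(P,Q)$ is the set of couplings of $P$ and $Q$, and $M=(Z,Z')$ is one coupling. $c(\cdot,\cdot)$ is a metric, such as an $\ell_p$ distance, that measures the cost of an adversary perturbing $z$ to $z'$. $\rho$ is an upper bound on the expected perturbation cost, thus constraining $P$ and $Q$ to be “close” to each other. $d_{\mathrm{IG}}$ is a metric that measures the change of attributions from $Z$ to $Z'$, where we want a large $d_{\mathrm{IG}}$-change under a small $c$-change. The supremum is taken over $Q$ and $M\in\prod(P,Q)$.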

We provide theoretical characterizations of our objectives. First, we show that they give principled generalizations of previous objectives designed for robust predictions. Specifically, under weak instantiations of the size function $s(\cdot)$, and of how we estimate IG computationally, we can leverage axioms satisfied by IG to recover the robust prediction objective of MMSTV17, the input gradient regularization objective of RDV18, and also the distributional robust prediction objective of SND18. These results provide theoretical evidence that robust prediction training can provide some control over robust interpretations. Second, for one-layer neural networks, we prove that instantiating $s(\cdot)$ as the 1-norm coincides with instantiating $s(\cdot)$ as the summation function, and further coincides with classic soft-margin training, which implies that for generalized linear classifiers, soft-margin training will robustify both predictions and interpretations. Finally, we generalize previous theory on distributional robust prediction SND18 to our objectives, and show that they are closely related.

Through detailed experiments we study the effect of our method in robustifying attributions. On MNIST, GTSRB and Flower datasets, we report encouraging improvement in attribution robustness. Compared with naturally trained models, we show significantly improved attribution robustness, as well as prediction robustness. Compared with Madry et al.’s model MMSTV17 trained for robust predictions, we demonstrate comparable prediction robustness (sometimes even better), while consistently improving attribution robustness. We observe that even when our training stops, the attribution regularization term remains much more significant compared to the natural loss term. We discuss this problem and point out that current optimization techniques may not have effectively optimized our objectives. These results hint at the need for better optimization techniques or new neural network architectures that are more amenable to robust attribution training.

The rest of the paper is organized as follows: Section 2 briefly reviews necessary background. Section 3 presents our framework for robustifying attributions, and proves theoretical characterizations. Section 4 presents instantiations of our method and their optimization, and we report experimental results in Section 5. Finally, Section 6 concludes with a discussion on the intriguing open problems.

## 2 Preliminaries

Axiomatic attribution and Integrated Gradients. Let $f:\mathbb{R}^d\to\mathbb{R}$ be a real-valued function, and let $\mathbf{x}$ and $\mathbf{x}'$ be two input vectors. Given that the function value changes from $f(\mathbf{x})$ to $f(\mathbf{x}')$, a basic question is: “How to attribute the function value change to the input variables?” A recent work by Sundararajan, Taly and Yan STY17 provides an axiomatic answer to this question. Formally, let $r:[0,1]\to\mathbb{R}^d$ be a curve such that $r(0)=\mathbf{x}$ and $r(1)=\mathbf{x}'$. Integrated Gradients (IG) for input variable $i$ is defined as the following integral:

$$\mathrm{IG}^{f}_{i}(\mathbf{x},\mathbf{x}';r)=\int_0^1\frac{\partial f(r(t))}{\partial \mathbf{x}_i}\,r_i'(t)\,dt, \tag{1}$$

which formalizes the contribution of the $i$-th variable as the integration of the $i$-th partial derivative as we move along the curve $r$. Let $\mathrm{IG}^f(\mathbf{x},\mathbf{x}';r)$ be the vector whose $i$-th component is $\mathrm{IG}^f_i$; then $\mathrm{IG}^f$ satisfies some natural axioms. For example, the Axiom of Completeness says that summing all coordinates gives the change of function value: $\mathrm{sum}(\mathrm{IG}^f(\mathbf{x},\mathbf{x}';r))=f(\mathbf{x}')-f(\mathbf{x})$. We refer readers to the paper STY17 for the other axioms IG satisfies.
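To make definition (1) concrete, the following is a minimal numerical sketch, not the authors' implementation. It assumes a hypothetical toy function `f` with a hand-written gradient, the straight-line curve $r(t)=\mathbf{x}+t(\mathbf{x}'-\mathbf{x})$, and a midpoint Riemann sum (a choice made here for accuracy), and it checks the Axiom of Completeness numerically.

```python
import numpy as np

def f(x):
    # Hypothetical toy function f: R^3 -> R, used only for illustration.
    return np.tanh(x[0] * x[1]) + x[2] ** 2

def grad_f(x):
    # Analytic gradient of the toy function above.
    s = 1.0 - np.tanh(x[0] * x[1]) ** 2
    return np.array([s * x[1], s * x[0], 2.0 * x[2]])

def integrated_gradients(grad, x, x_prime, m=200):
    # Straight-line curve r(t) = x + t (x' - x), hence r'(t) = x' - x.
    # Midpoint Riemann sum over m equal slices of [0, 1].
    ts = (np.arange(m) + 0.5) / m
    delta = x_prime - x
    avg_grad = np.mean([grad(x + t * delta) for t in ts], axis=0)
    return delta * avg_grad

x = np.array([0.2, -0.5, 1.0])
x_prime = np.array([0.7, 0.3, 0.4])
ig = integrated_gradients(grad_f, x, x_prime)

# Completeness: the attributions should sum to f(x') - f(x).
completeness_gap = abs(ig.sum() - (f(x_prime) - f(x)))
```

The gap shrinks as `m` grows, since the only error source in this sketch is the discretization of the integral.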

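Integrated Gradients for an intermediate layer. We can generalize the theory of IG to an intermediate layer of neurons. The key insight is to leverage the fact that Integrated Gradients is a curve integration. Therefore, given some hidden layer $h=[h_1,\dots,h_k]$, computed by a function $h(\mathbf{x})$ induced by the previous layers, one can naturally view the previous layers as inducing a curve $h\circ r$ which moves from $h(\mathbf{x})$ to $h(\mathbf{x}')$ as we move from $\mathbf{x}$ to $\mathbf{x}'$ along the curve $r$. Viewed this way, we can naturally compute IG for $h$ in a way that leverages all layers of the network (proofs are deferred to Appendix B.2).

Lemma. Under a curve $r:[0,1]\to\mathbb{R}^d$ such that $r(0)=\mathbf{x}$ and $r(1)=\mathbf{x}'$ for moving $\mathbf{x}$ to $\mathbf{x}'$, and the function $h$ induced by the layers before $h$, the attribution for $h_i$ for a differentiable $f$ is

$$\mathrm{IG}^{f}_{h_i}(\mathbf{x},\mathbf{x}')=\sum_{j=1}^{d}\left\{\int_0^1\frac{\partial f(h(r(t)))}{\partial h_i}\,\frac{\partial h_i(r(t))}{\partial \mathbf{x}_j}\,r_j'(t)\,dt\right\}. \tag{2}$$

The corresponding summation approximation is:

$$\mathrm{IG}^{f}_{h_i}(\mathbf{x},\mathbf{x}')\approx\frac{1}{m}\sum_{j=1}^{d}\left\{\sum_{k=0}^{m-1}\frac{\partial f(h(r(k/m)))}{\partial h_i}\,\frac{\partial h_i(r(k/m))}{\partial \mathbf{x}_j}\,r_j'(k/m)\right\} \tag{3}$$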
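The intermediate-layer formula can be sketched numerically as follows. This is an illustrative example under stated assumptions: a hypothetical hidden layer `h(x) = tanh(W x)`, a hypothetical top function `g` standing for the layers after `h`, the straight-line curve, and midpoints in place of the left endpoints $k/m$ of the summation approximation. It checks that the hidden-neuron attributions sum to the output change.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 4, 3
W = rng.normal(size=(k, d))   # hypothetical hidden-layer weights
v = rng.normal(size=k)        # hypothetical top-layer weights

def h(x):
    # Hidden layer h: R^d -> R^k.
    return np.tanh(W @ x)

def jac_h(x):
    # Jacobian dh_i/dx_j of the hidden layer, shape (k, d).
    return (1.0 - h(x) ** 2)[:, None] * W

def g(hid):
    # Layers after h, so the full model is f = g(h(x)).
    return v @ hid + 0.5 * hid @ hid

def grad_g(hid):
    # Partial derivatives of the output w.r.t. each hidden neuron.
    return v + hid

def ig_hidden(x, x_prime, m=400):
    # Attribution to each hidden neuron h_i along r(t) = x + t (x' - x).
    delta = x_prime - x
    ts = (np.arange(m) + 0.5) / m
    total = np.zeros(k)
    for t in ts:
        point = x + t * delta
        # (jac_h @ delta)_i = sum_j dh_i/dx_j * r_j'(t): the inner sum over j.
        total += grad_g(h(point)) * (jac_h(point) @ delta)
    return total / m

x = rng.normal(size=d)
x_prime = x + 0.3 * rng.normal(size=d)
ig_h = ig_hidden(x, x_prime)
gap = abs(ig_h.sum() - (g(h(x_prime)) - g(h(x))))
```

When `h` is the identity, `jac_h` is the identity matrix and the computation reduces to input-layer Integrated Gradients.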

## 3 Robust Attribution Regularization

In this section we propose objectives for achieving robust attribution, and study their connections with existing robust training objectives. At a high level, given a loss function $\ell$ and a data generating distribution $P$, our objectives contain two parts: (1) achieving a small loss over the data generating distribution $P$, and (2) keeping the attributions of the loss over $P$ “close” to the attributions over a distribution $Q$ whenever $P$ and $Q$ are close to each other. We can naturally encode these two goals in existing robust optimization models. Below we do so for two popular models: the uncertainty set model and the distributional robustness model.

### 3.1 Uncertainty Set Model

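In the uncertainty set model, for any sample $(\mathbf{x},y)\sim P$ from a data generating distribution $P$, we think of it as a “nominal” point and assume that the real sample comes from a neighborhood $N(\mathbf{x},\varepsilon)$ around $\mathbf{x}$. In this case, given any intermediate layer $\mathbf{h}$, we propose the following objective function:

$$\operatorname*{minimize}_{\theta}\ \mathbb{E}_{(\mathbf{x},y)\sim P}\big[\rho(\mathbf{x},y;\theta)\big], \quad \text{where}\ \ \rho(\mathbf{x},y;\theta)=\ell(\mathbf{x},y;\theta)+\lambda\max_{\mathbf{x}'\in N(\mathbf{x},\varepsilon)} s\big(\mathrm{IG}^{\ell_y}_{\mathbf{h}}(\mathbf{x},\mathbf{x}';r)\big) \tag{4}$$

where $\lambda\ge 0$ is a regularization parameter, $\ell_y$ is the loss function with the label $y$ fixed: $\ell_y(\mathbf{x};\theta)=\ell(\mathbf{x},y;\theta)$, $r:[0,1]\to\mathbb{R}^d$ is a curve parameterization from $\mathbf{x}$ to $\mathbf{x}'$, and $\mathrm{IG}^{\ell_y}_{\mathbf{h}}$ is the integrated gradients of $\ell_y$ with respect to the layer $\mathbf{h}$, which therefore gives the attribution of changes of $\ell_y$ as we go from $\mathbf{x}$ to $\mathbf{x}'$. $s(\cdot)$ is a size function that measures the “size” of the attribution. We stress that this regularization term depends on the model parameters $\theta$ through the loss function $\ell_y$.

We now study some particular instantiations of the objective (4). Specifically, we recover existing robust training objectives under weak instantiations (such as choosing $s(\cdot)$ as the summation function, which is not a metric, or using a crude approximation of IG), and also derive new instantiations that are natural extensions of existing ones.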

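Proposition (Madry et al.’s robust prediction objective). If we set $\lambda=1$, and let $s(\cdot)$ be the $\mathrm{sum}$ function (sum all components of a vector), then for any curve $r$ and any intermediate layer $\mathbf{h}$, (4) is exactly the objective proposed by Madry et al. MMSTV17, where $\rho(\mathbf{x},y;\theta)=\max_{\mathbf{x}'\in N(\mathbf{x},\varepsilon)}\ell(\mathbf{x}',y;\theta)$. We note that: (1) $\mathrm{sum}$ is a weak size function which does not give a metric. (2) As a result, while this robust prediction objective falls within our framework and regularizes robust attributions, it allows a small regularization term in cases where attributions actually change significantly but cancel each other in the summation. Therefore, the control over robust attributions can be weak.

Proposition (Input gradient regularization). For any $\lambda'>0$ and $q\ge 1$, if we set $\lambda=\lambda'/\varepsilon^q$, $s(\cdot)=\|\cdot\|_1^q$, and use only the first term of the summation approximation (3) to approximate IG, then (4) becomes exactly the input gradient regularization of Drucker and LeCun DL92, where we have $\rho(\mathbf{x},y;\theta)=\ell(\mathbf{x},y;\theta)+\lambda'\|\nabla_{\mathbf{x}}\ell(\mathbf{x},y;\theta)\|_q^q$ (with the neighborhood taken to be the $\ell_p$ ball where $1/p+1/q=1$, as in Appendix B.4).

In the above we have considered instantiations of a weak size function (the summation function), which recovers Madry et al.’s objective, and of a weak approximation of IG (picking the first term), which recovers input gradient regularization. In the next example, we pick a nontrivial size function, the 1-norm $\|\cdot\|_1$, and use the precise IG, but then we use a trivial intermediate layer, the output loss $\ell_y$.

Proposition (Regularizing by attribution of the loss output). Let us set $\lambda=1$, $s(\cdot)=\|\cdot\|_1$, and $\mathbf{h}=\ell_y$ (the output layer of the loss function!). Then we have $\rho(\mathbf{x},y;\theta)=\ell_y(\mathbf{x})+\max_{\mathbf{x}'\in N(\mathbf{x},\varepsilon)}\{|\ell_y(\mathbf{x}')-\ell_y(\mathbf{x})|\}$. We note that this loss function is a “surrogate” loss function for Madry et al.’s loss function because $\ell_y(\mathbf{x})+\max_{\mathbf{x}'}\{|\ell_y(\mathbf{x}')-\ell_y(\mathbf{x})|\}\ge\ell_y(\mathbf{x})+\max_{\mathbf{x}'}\{\ell_y(\mathbf{x}')-\ell_y(\mathbf{x})\}=\max_{\mathbf{x}'}\ell_y(\mathbf{x}')$. Therefore, even at such a trivial instantiation, robust attribution regularization provides interesting guarantees.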

### 3.2 Distributional Robustness Model

A different but popular model for robust optimization is the distributional robustness model. In this case we consider a family of distributions $\mathcal{P}$, each of which is supposed to be a “slight variation” of a base distribution $P$. The goal of robust optimization is then that certain objective functions obtain stable values over this entire family. Here we apply the same underlying idea to the distributional robustness model: one should get a small loss value over the base distribution $P$, and for any distribution $Q\in\mathcal{P}$, the IG-based attributions should change only a little if we move from $P$ to $Q$. This is formalized as:

$$\operatorname*{minimize}_{\theta}\ \mathbb{E}_{P}[\ell(P;\theta)]+\lambda\sup_{Q\in\mathcal{P}}\big\{W_{d_{\mathrm{IG}}}(P,Q)\big\},$$

where $W_{d_{\mathrm{IG}}}(P,Q)$ is the Wasserstein distance between $P$ and $Q$ under a distance metric $d_{\mathrm{IG}}$. We use the subscript IG to highlight that this metric is related to integrated gradients. (For a supervised learning problem where $P$ is of the form $Z=(X,Y)$, we use the same treatment as in SND18, so that the cost function is defined as $c(z,z')=c_x(x,x')+\infty\cdot\mathbf{1}\{y\ne y'\}$. All our theory carries over to such $c$, which has range $\mathbb{R}_+\cup\{\infty\}$.)

We propose again $d_{\mathrm{IG}}(z,z')=s\big(\mathrm{IG}^{\ell}_{\mathbf{h}}(z,z')\big)$. We are particularly interested in the case where $\mathcal{P}$ is a Wasserstein ball around the base distribution $P$, using a “perturbation” cost metric $c(\cdot)$. This gives the regularization term $\lambda\sup_{Q:\,W_c(P,Q)\le\rho}\{W_{d_{\mathrm{IG}}}(P,Q)\}$. An unsatisfying aspect of this objective, as one can observe now, is that $W_{d_{\mathrm{IG}}}$ and $W_c$ can take two different couplings, while intuitively we want to use only one coupling to transport $P$ to $Q$. For example, this objective allows us to pick a coupling $M_1$ under which we achieve $W_{d_{\mathrm{IG}}}$ (recall that the Wasserstein distance is an infimum over couplings), and a different coupling $M_2$ under which we achieve $W_c$, while under $M_1$ we have $\mathbb{E}_{(z,z')\sim M_1}[c(z,z')]>\rho$, violating the constraint. This motivates the following modification:

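$$\operatorname*{minimize}_{\theta}\ \mathbb{E}_{P}[\ell(P;\theta)]+\lambda\sup_{Q;\,M\in\prod(P,Q)}\Big\{\mathbb{E}_{Z,Z'}\big[d_{\mathrm{IG}}(Z,Z')\big]\ \ \text{s.t.}\ \ \mathbb{E}_{Z,Z'}\big[c(Z,Z')\big]\le\rho\Big\}, \tag{5}$$

In this formulation, $\prod(P,Q)$ is the set of couplings of $P$ and $Q$, and $M=(Z,Z')$ is one coupling. $c(\cdot,\cdot)$ is a metric, such as an $\ell_p$ distance, that measures the cost of an adversary perturbing $z$ to $z'$. $\rho$ is an upper bound on the expected perturbation cost, thus constraining $P$ and $Q$ to be “close” to each other. $d_{\mathrm{IG}}$ is a metric that measures the change of attributions from $Z$ to $Z'$, where we want a large $d_{\mathrm{IG}}$-change under a small $c$-change. The supremum is taken over $Q$ and $M\in\prod(P,Q)$.

Proposition (Wasserstein prediction robustness). Let $s(\cdot)$ be the summation function and $\lambda=1$. Then for any curve $r$ and any layer $\mathbf{h}$, (5) reduces to $\sup_{Q:\,W_c(P,Q)\le\rho}\{\mathbb{E}_{Q}[\ell(Q;\theta)]\}$, which is the objective proposed by Sinha, Namkoong, and Duchi SND18 for robust predictions.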

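Lagrange relaxation. For any $\gamma\ge 0$, the Lagrange relaxation of (5) is

$$\operatorname*{minimize}_{\theta}\ \Big\{\mathbb{E}_{P}[\ell(P;\theta)]+\lambda\sup_{Q;\,M\in\prod(P,Q)}\Big\{\mathbb{E}_{M=(Z,Z')}\big[d_{\mathrm{IG}}(Z,Z')-\gamma c(Z,Z')\big]\Big\}\Big\} \tag{6}$$

where the supremum is taken over $Q$ (unconstrained) and all couplings of $P$ and $Q$, and we want to find a coupling under which attributions change a lot while the perturbation cost from $P$ to $Q$ with respect to $c$ is small. Recall that $g:\mathcal{Z}\times\mathcal{Z}\to\mathbb{R}$ is a normal integrand if for each $\alpha$, the mapping $z\mapsto\{z' \mid g(z,z')\le\alpha\}$ is closed-valued and measurable rockafellar2009variational.

Our next two theorems generalize the duality theory in SND18 to a much larger, but natural, class of objectives.

Theorem. Suppose $c(z,z)=0$ and $d_{\mathrm{IG}}(z,z)=0$ for any $z\in\mathcal{Z}$, and suppose $\gamma c(z,z')-d_{\mathrm{IG}}(z,z')$ is a normal integrand. Then,

$$\sup_{Q;\,M\in\prod(P,Q)}\mathbb{E}_{M=(Z,Z')}\big[d_{\mathrm{IG}}(Z,Z')-\gamma c(Z,Z')\big]=\mathbb{E}_{z\sim P}\Big[\sup_{z'}\big\{d_{\mathrm{IG}}(z,z')-\gamma c(z,z')\big\}\Big].$$

Consequently, (6) is equal to the following:

$$\operatorname*{minimize}_{\theta}\ \mathbb{E}_{z\sim P}\Big[\ell(z;\theta)+\lambda\sup_{z'}\big\{d_{\mathrm{IG}}(z,z')-\gamma c(z,z')\big\}\Big] \tag{7}$$

The assumption $d_{\mathrm{IG}}(z,z)=0$ is true for what we propose, and $c(z,z)=0$ is true for any typical cost such as $\ell_p$ distances. The normal integrand assumption is also very weak; e.g., it is satisfied when $d_{\mathrm{IG}}$ is continuous and $c$ is closed convex.

Note that (7) and (4) are very similar, and so we use (4) for the rest of the paper. Finally, given Theorem 3.2, we are also able to connect (5) and (7) with the following duality result.

Theorem. Suppose $c(z,z)=0$ and $d_{\mathrm{IG}}(z,z)=0$ for any $z\in\mathcal{Z}$, and suppose $\gamma c(z,z')-d_{\mathrm{IG}}(z,z')$ is a normal integrand. For any $\rho>0$, there exists $\gamma\ge 0$ such that the optimal solutions of (7) are optimal for (5).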

### 3.3 One Layer Neural Networks

We now consider the special case of one-layer neural networks, where the loss function takes the form $\ell(\mathbf{x},y;\mathbf{w})=g(-y\langle\mathbf{w},\mathbf{x}\rangle)$, $\mathbf{w}$ is the model parameters, $\mathbf{x}$ is a feature vector, $y\in\{-1,+1\}$ is a label, and $g$ is nonnegative. We take $s(\cdot)$ to be $\|\cdot\|_1$, which corresponds to a strong instantiation that does not allow attributions to cancel each other. Interestingly, we prove that for natural choices of $g$, this is nevertheless exactly Madry et al.’s objective MMSTV17, which corresponds to $s(\cdot)=\mathrm{sum}(\cdot)$. That is, the strong ($s(\cdot)=\|\cdot\|_1$) and weak ($s(\cdot)=\mathrm{sum}(\cdot)$) instantiations coincide for one-layer neural networks. This says that for generalized linear classifiers, “robust interpretation” coincides with “robust predictions,” and further with classic soft-margin training.

Theorem. Suppose that $g$ is differentiable, non-decreasing, and convex. Then for $\lambda=1$, $s(\cdot)=\|\cdot\|_1$, and the $\ell_\infty$ neighborhood, (4) reduces to Madry et al.’s objective:

$$\sum_{i=1}^{m}\max_{\|\mathbf{x}_i'-\mathbf{x}_i\|_\infty\le\varepsilon}g\big(-y_i\langle\mathbf{w},\mathbf{x}_i'\rangle\big)\ \ \text{(Madry et al.'s objective)}\ =\ \sum_{i=1}^{m}g\big(-y_i\langle\mathbf{w},\mathbf{x}_i\rangle+\varepsilon\|\mathbf{w}\|_1\big)\ \ \text{(soft-margin)}.$$

Natural losses, such as Negative Log-Likelihood and softplus hinge loss, satisfy the conditions of this theorem.
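The one-layer result can be checked numerically on a single sample. The sketch below is illustrative only: it assumes the softplus loss $g(u)=\log(1+e^{u})$, a label in $\{-1,+1\}$, randomly drawn hypothetical weights and features, and the candidate maximizer $\mathbf{x}^*=\mathbf{x}-y\,\varepsilon\,\mathrm{sign}(\mathbf{w})$ of the $\ell_\infty$ ball. It compares three quantities: the loss at $\mathbf{x}^*$, the soft-margin value, and the loss plus the 1-norm of IG at $\mathbf{x}^*$.

```python
import numpy as np

def softplus(u):
    return np.logaddexp(0.0, u)

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

rng = np.random.default_rng(1)
d, eps = 5, 0.1
w = rng.normal(size=d)   # hypothetical model parameters
x = rng.normal(size=d)   # hypothetical feature vector
y = 1.0                  # label in {-1, +1}

def loss(x_in):
    return softplus(-y * (w @ x_in))

# Worst-case point in the l_inf ball: push every coordinate against the margin.
x_star = x - y * eps * np.sign(w)

madry_value = loss(x_star)
soft_margin_value = softplus(-y * (w @ x) + eps * np.abs(w).sum())

# IG of the loss w.r.t. the input along the straight line from x to x_star
# (midpoint Riemann sum with m slices).
m = 500
ts = (np.arange(m) + 0.5) / m
delta = x_star - x
avg_grad = np.mean(
    [-y * sigmoid(-y * (w @ (x + t * delta))) * w for t in ts], axis=0
)
ig = delta * avg_grad
ig_norm_value = loss(x) + np.abs(ig).sum()

# Random points in the ball should never exceed the closed-form maximum.
samples = x + rng.uniform(-eps, eps, size=(1000, d))
sampled_max = max(loss(s) for s in samples)
```

At `x_star` every coordinate of `ig` has the same sign, so no cancellation occurs and the 1-norm of the attribution equals its sum; this is the mechanism behind the coincidence of the strong and weak instantiations.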

## 4 Instantiations and Optimizations

In this section we discuss instantiations of (4) and how to optimize them. We start by presenting two objectives instantiated from our method: (1) IG-NORM, and (2) IG-SUM-NORM. Then we discuss how to use gradient descent to optimize these objectives.

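IG-NORM. As our first instantiation, we pick $s(\cdot)=\|\cdot\|_1$, $\mathbf{h}$ to be the input layer, and $r$ to be the straight line connecting $\mathbf{x}$ and $\mathbf{x}'$. This gives:

$$\operatorname*{minimize}_{\theta}\ \mathbb{E}_{(\mathbf{x},y)\sim P}\Big[\ell(\mathbf{x},y;\theta)+\lambda\max_{\mathbf{x}'\in N(\mathbf{x},\varepsilon)}\big\|\mathrm{IG}^{\ell_y}(\mathbf{x},\mathbf{x}')\big\|_1\Big]$$

IG-SUM-NORM. In the second instantiation we combine the sum size function and the norm size function, and define $s(\cdot)=\mathrm{sum}(\cdot)+\beta\|\cdot\|_1$, where $\beta\ge 0$ is a regularization parameter. With the same $\mathbf{h}$ and $r$ as above, and putting $\lambda=1$, our method simplifies to:

$$\operatorname*{minimize}_{\theta}\ \mathbb{E}_{(\mathbf{x},y)\sim P}\Big[\max_{\mathbf{x}'\in N(\mathbf{x},\varepsilon)}\Big\{\ell(\mathbf{x}',y;\theta)+\beta\big\|\mathrm{IG}^{\ell_y}(\mathbf{x},\mathbf{x}')\big\|_1\Big\}\Big]$$

which can be viewed as appending an extra robust IG term to $\ell(\mathbf{x}',y;\theta)$.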
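As a worked check of the IG-SUM-NORM simplification (a sketch that uses only objective (4) and the Axiom of Completeness, and that assumes the combined size function reads $s(\mathbf{v})=\mathrm{sum}(\mathbf{v})+\beta\|\mathbf{v}\|_1$ with $\lambda=1$):

$$\begin{aligned}
\rho(\mathbf{x},y;\theta) &= \ell(\mathbf{x},y;\theta)+\max_{\mathbf{x}'\in N(\mathbf{x},\varepsilon)}\Big\{\mathrm{sum}\big(\mathrm{IG}^{\ell_y}(\mathbf{x},\mathbf{x}')\big)+\beta\big\|\mathrm{IG}^{\ell_y}(\mathbf{x},\mathbf{x}')\big\|_1\Big\} \\
&= \ell(\mathbf{x},y;\theta)+\max_{\mathbf{x}'\in N(\mathbf{x},\varepsilon)}\Big\{\ell(\mathbf{x}',y;\theta)-\ell(\mathbf{x},y;\theta)+\beta\big\|\mathrm{IG}^{\ell_y}(\mathbf{x},\mathbf{x}')\big\|_1\Big\} \\
&= \max_{\mathbf{x}'\in N(\mathbf{x},\varepsilon)}\Big\{\ell(\mathbf{x}',y;\theta)+\beta\big\|\mathrm{IG}^{\ell_y}(\mathbf{x},\mathbf{x}')\big\|_1\Big\}.
\end{aligned}$$

The second line applies completeness to the sum term, and the third uses that $\ell(\mathbf{x},y;\theta)$ does not depend on $\mathbf{x}'$, so it cancels outside the maximum. Setting $\beta=0$ recovers the robust prediction objective.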

Gradient descent optimization. We propose the following gradient descent framework to optimize the objectives. The framework is parameterized by an adversary $\mathcal{A}$ which is supposed to solve the inner max by finding a point $\mathbf{x}^*$ that changes attribution significantly. Specifically, given a point $(\mathbf{x},y)$ at time step $t$ during SGD training, we have the following two steps (this can be easily generalized to mini-batches):

Attack step. We run $\mathcal{A}$ on $(\mathbf{x},y)$ to find an $\mathbf{x}^*$ that produces a large inner max term (that is, $\|\mathrm{IG}^{\ell_y}(\mathbf{x},\mathbf{x}^*)\|_1$ for IG-NORM, and $\ell(\mathbf{x}^*,y;\theta)+\beta\|\mathrm{IG}^{\ell_y}(\mathbf{x},\mathbf{x}^*)\|_1$ for IG-SUM-NORM).

Gradient step. Fixing $\mathbf{x}^*$, we can then compute the gradient of the corresponding objective with respect to $\theta$, and then update the model.
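The attack step can be sketched as follows for the IG-NORM inner term. This is a minimal illustration under several assumptions that are not taken from the text: the adversary is a projected sign-gradient ascent (the framework leaves the adversary open), gradients are obtained by central finite differences as a stand-in for automatic differentiation, and the model is a hypothetical one-layer softplus classifier so that the result can be compared against the closed-form bound implied by the one-layer theorem. The names `ig_l1` and `attack_step` are hypothetical helpers. The gradient step is omitted.

```python
import numpy as np

def softplus(u):
    return np.logaddexp(0.0, u)

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def ig_l1(w, x, x_adv, y, m=50):
    # ||IG^{l_y}(x, x_adv)||_1 for the loss softplus(-y <w, x>), straight-line
    # curve, midpoint summation approximation with m slices.
    delta = x_adv - x
    ts = (np.arange(m) + 0.5) / m
    margins = -y * ((x[None, :] + ts[:, None] * delta[None, :]) @ w)
    avg_grad = -y * sigmoid(margins).mean() * w
    return np.abs(delta * avg_grad).sum()

def attack_step(w, x, y, eps, steps=10, step_size=0.02, fd=1e-5, seed=0):
    # Hypothetical adversary: projected sign-gradient ascent on the IG-NORM
    # inner term inside the l_inf ball of radius eps around x.
    rng = np.random.default_rng(seed)
    x_adv = x + rng.uniform(-eps, eps, size=x.shape)
    for _ in range(steps):
        grad = np.zeros_like(x)
        for i in range(x.size):
            e = np.zeros_like(x)
            e[i] = fd
            # Central finite difference w.r.t. coordinate i of x_adv.
            grad[i] = (ig_l1(w, x, x_adv + e, y) - ig_l1(w, x, x_adv - e, y)) / (2 * fd)
        # Ascent step followed by projection back onto the ball.
        x_adv = np.clip(x_adv + step_size * np.sign(grad), x - eps, x + eps)
    return x_adv

rng = np.random.default_rng(2)
w, x, y, eps = rng.normal(size=4), rng.normal(size=4), -1.0, 0.1
x_adv = attack_step(w, x, y, eps)
found = ig_l1(w, x, x_adv, y)

# Closed-form maximum of the inner term for this one-layer model.
upper = softplus(-y * (w @ x) + eps * np.abs(w).sum()) - softplus(-y * (w @ x))
```

For deep networks no closed-form bound is available, and each evaluation of the inner term requires differentiating through the summation approximation of IG.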

Important objective parameters. In both the attack and gradient steps, we need to differentiate IG (in the attack step, $\theta$ is fixed and we differentiate w.r.t. $\mathbf{x}$, while in the gradient step, this is reversed), and this induces a set of parameters of the objectives to tune for optimization, which is summarized in Table 1. Differentiating the summation approximation of IG amounts to computing second partial derivatives. We rely on the auto-differentiation capability of TensorFlow.

## 5 Experiments

We now perform experiments using our method. We ask the following questions: (1) Comparing models trained by our method and naturally trained models at test time, do we maintain the accuracy on unperturbed test inputs? (2) At test time, if we use attribution attacks mentioned in GAZ17 to perturb attributions while keeping predictions intact, how does the attribution robustness of our models compare with that of the naturally trained models? (3) Finally, how do we compare attribution robustness of our models with weak instantiations for robust predictions?

To answer these questions, we perform experiments on three classic datasets: MNIST mnist, GTSRB stallkamp2012man, and Flower nilsback2006visual. In summary, our findings are the following: (1) Our method results in a very small drop in test accuracy compared with naturally trained models. (2) On the other hand, our method gives significantly better attribution robustness, as measured by correlation analyses. (3) Finally, our models yield comparable prediction robustness (sometimes even better), while consistently improving attribution robustness. In the rest of the section we give more details.

Evaluation setup. In this work we use IG to compute attributions (i.e., feature importance maps), which, as demonstrated by GAZ17, is more robust compared to other related methods (note that IG also enjoys other theoretical properties). To attack attributions while retaining model predictions, we use the Iterative Feature Importance Attacks (IFIA) proposed by GAZ17. Due to lack of space, we defer details of parameters and other settings to the appendix. We use two metrics to measure attribution robustness (i.e., how similar the attributions are between original and perturbed images):

Kendall’s tau rank order correlation. Attribution methods rank all of the features in order of their importance; we thus use the rank correlation kendall1938new to compare the similarity between interpretations.

Top-k intersection. We compute the size of the intersection of the $k$ most important features before and after perturbation.
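Both metrics are straightforward to compute. The sketch below is an illustration rather than the evaluation code used in the experiments: it assumes that feature importance is ranked by attribution magnitude, that there are no ties (tau-a), and it uses hypothetical helper names `kendall_tau` and `topk_intersection`.

```python
import numpy as np

def kendall_tau(a, b):
    # Tau-a: (concordant pairs - discordant pairs) / total pairs; assumes no ties.
    # Uses O(n^2) memory, so it is only suitable for small attribution maps.
    a, b = np.asarray(a, float), np.asarray(b, float)
    n = a.size
    iu = np.triu_indices(n, k=1)
    sign_a = np.sign(a[:, None] - a[None, :])[iu]
    sign_b = np.sign(b[:, None] - b[None, :])[iu]
    return float(np.sum(sign_a * sign_b)) / (n * (n - 1) / 2)

def topk_intersection(a, b, k):
    # Number of shared indices among the k largest-magnitude features.
    top_a = set(np.argsort(-np.abs(a))[:k])
    top_b = set(np.argsort(-np.abs(b))[:k])
    return len(top_a & top_b)

# Hypothetical attributions before and after a perturbation.
original = np.array([0.9, 0.1, 0.5, 0.3])
perturbed = np.array([0.8, 0.2, 0.4, 0.6])

tau = kendall_tau(np.abs(original), np.abs(perturbed))
shared = topk_intersection(original, perturbed, k=2)
```

In this toy example five of the six feature pairs keep their relative order, so `tau` is 2/3, and the two top-2 sets share one feature. For image-sized inputs, a library routine such as `scipy.stats.kendalltau` is a more practical choice than the quadratic-memory version above.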

Compared with GAZ17 , we use Kendall’s tau correlation, instead of Spearman’s rank correlation. The reason is that we found that on the GTSRB and Flower datasets, Spearman’s correlation is not consistent with visual inspection, and often produces too high correlations. In comparison, Kendall’s tau correlation consistently produces lower correlations and aligns better with visual inspection.

Comparing with natural models. Figure 3 presents experimental results comparing our models with naturally trained models. We report natural accuracy (NA, accuracy on unperturbed test images) and attribution robustness (TopK intersection and Kendall’s correlation). When computing attribution robustness, we only consider the test samples that are correctly classified by the model. Figures (a), (b), and (c) show significant improvements in attribution robustness (measured by either median or confidence intervals). Table 3 shows a very small drop in natural accuracy.

Ineffective optimization. We observe that even when our training stops, the attribution regularization term remains much more significant compared to the natural loss term. For example, for IG-NORM, when training stops on MNIST, the natural loss term typically stays at about 1, while the regularization term stays far larger (typically around tens to hundreds, as noted in Section 6). This indicates that optimization has not been very effective in minimizing the regularization term. There are two possible reasons for this: (1) Because we use the summation approximation of IG, we are forced to compute second derivatives, which may not be numerically stable for deep networks. (2) The network architecture may be inherently unsuitable for robust attributions, making it hard for the optimization to converge.

Comparing with robust prediction models. Finally we compare with Madry et al.’s models, which are trained for robust prediction. We use AA to denote adversarial accuracy (prediction accuracy on perturbed inputs). Again, IN denotes the average top-$k$ intersection (with one value of $k$ for the MNIST and GTSRB datasets and another for Flower), and CO denotes the average Kendall’s rank order correlation. Table 2 gives the details of the results. As we can see, our models give comparable adversarial accuracy, and are sometimes even better (on the Flower dataset). On the other hand, we are consistently better in terms of attribution robustness.

## 6 Conclusion

This paper builds a theory to robustify model interpretations through the lens of axiomatic attributions of neural networks. We show that our theory gives principled generalizations of previous formulations for robust predictions, and we characterize our objectives for one-layer neural networks. Experiments demonstrate the effectiveness of our method, although we observe that when training stops, the attribution regularization term remains significant (typically around tens to hundreds), which indicates ineffective optimization of the objectives (this might be partially due to the fact that optimizing IG amounts to computing second partial derivatives). We believe that our work opens many intriguing avenues for research, such as better optimization techniques or better architectures for robust attribution training.

## Appendix A Code

Code for this paper is publicly available at the following repository:

## Appendix B Proofs

Let $P,Q$ be two distributions. A coupling $M=(Z,Z')$ is a joint distribution such that, if we marginalize $M$ to the first component, $Z$, it is identically distributed as $P$, and if we marginalize $M$ to the second component, $Z'$, it is identically distributed as $Q$. Let $\prod(P,Q)$ be the set of all couplings of $P$ and $Q$, and let $c(\cdot,\cdot)$ be a “cost” function that maps $(z,z')$ to a real value. The Wasserstein distance between $P$ and $Q$ w.r.t. $c$ is defined as

$$W_c(P,Q)=\inf_{M\in\prod(P,Q)}\Big\{\mathbb{E}_{(z,z')\sim M}\big[c(z,z')\big]\Big\}.$$

Intuitively, this is to find the “best transportation plan” (a coupling $M$) to minimize the expected transportation cost (transporting $z$ to $z'$, where the cost is $c(z,z')$).
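For intuition, the definition can be evaluated exactly in a simple special case. The sketch below assumes two empirical distributions on the real line with the same number of equally weighted samples and the cost $c(z,z')=|z-z'|$; in that setting an optimal coupling matches sorted samples, which the code cross-checks by brute force over all one-to-one couplings. The function names are hypothetical.

```python
import numpy as np
from itertools import permutations

def wasserstein_1d(p_samples, q_samples):
    # Optimal plan for equal-size 1-D empirical distributions with cost
    # |z - z'|: pair the i-th smallest sample of P with the i-th smallest of Q.
    p, q = np.sort(p_samples), np.sort(q_samples)
    return float(np.mean(np.abs(p - q)))

def wasserstein_bruteforce(p_samples, q_samples):
    # Enumerate every one-to-one coupling (permutation) and keep the cheapest.
    # Only feasible for a handful of points.
    p = np.asarray(p_samples, float)
    q = np.asarray(q_samples, float)
    return min(
        float(np.mean(np.abs(p - q[list(perm)])))
        for perm in permutations(range(len(p)))
    )

p_samples = [0.0, 1.0, 3.0]
q_samples = [2.0, 0.5, 4.0]
w_sorted = wasserstein_1d(p_samples, q_samples)
w_brute = wasserstein_bruteforce(p_samples, q_samples)
```

Restricting the brute-force search to permutations is enough here because, for uniform weights and equal sample counts, the infimum over couplings is attained at a one-to-one matching.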

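### B.2 Integrated Gradients for an Intermediate Layer

In this section we show how to compute Integrated Gradients for an intermediate layer of a neural network. Let $h:\mathbb{R}^d\to\mathbb{R}^k$ be a function that computes a hidden layer of a neural network, where we map a $d$-dimensional input vector to a $k$-dimensional output vector. Given two points $\mathbf{x}$ and $\mathbf{x}'$ for computing attribution, again we consider a parameterization $r$ (which is a mapping $r:[0,1]\to\mathbb{R}^d$) such that $r(0)=\mathbf{x}$ and $r(1)=\mathbf{x}'$.

The key insight is to leverage the fact that Integrated Gradients is a curve integration. Therefore, given some hidden layer, one can naturally view the previous layers as inducing a curve which moves from $h(\mathbf{x})$ to $h(\mathbf{x}')$ as we move from $\mathbf{x}$ to $\mathbf{x}'$ along the curve $r$. Viewed this way, we can naturally compute IG for $h$ in a way that leverages all layers of the network. Specifically, consider another curve $\gamma:[0,1]\to\mathbb{R}^k$, defined as $\gamma(t)=h(r(t))$, to compute a curve integral. Writing $f=g\circ h$, where $g$ denotes the function computed by the layers after $h$ (and, with a slight abuse of notation, writing $\partial f/\partial h_i$ for the partial derivative of $g$ with respect to $h_i$), by definition we have

$$\begin{aligned}
f(\mathbf{x}')-f(\mathbf{x}) &= g(h(\mathbf{x}'))-g(h(\mathbf{x})) \\
&= g(\gamma(1))-g(\gamma(0)) \\
&= \int_0^1\sum_{i=1}^{k}\frac{\partial f(\gamma(t))}{\partial h_i}\,\gamma_i'(t)\,dt \\
&= \sum_{i=1}^{k}\int_0^1\frac{\partial f(\gamma(t))}{\partial h_i}\,\gamma_i'(t)\,dt
\end{aligned}$$

Therefore we can define the attribution of $h_i$ naturally as

$$\mathrm{IG}^{f}_{h_i}(\mathbf{x},\mathbf{x}')=\int_0^1\frac{\partial f(\gamma(t))}{\partial h_i}\,\gamma_i'(t)\,dt$$

Let’s unpack this a little more. By the chain rule, $\gamma_i'(t)=\sum_{j=1}^{d}\frac{\partial h_i(r(t))}{\partial \mathbf{x}_j}\,r_j'(t)$, so

$$\begin{aligned}
\int_0^1\frac{\partial f(\gamma(t))}{\partial h_i}\,\gamma_i'(t)\,dt &= \int_0^1\frac{\partial f(h(r(t)))}{\partial h_i}\sum_{j=1}^{d}\frac{\partial h_i(r(t))}{\partial \mathbf{x}_j}\,r_j'(t)\,dt \\
&= \sum_{j=1}^{d}\left\{\int_0^1\frac{\partial f(h(r(t)))}{\partial h_i}\,\frac{\partial h_i(r(t))}{\partial \mathbf{x}_j}\,r_j'(t)\,dt\right\}
\end{aligned}$$

This gives the lemma: under a curve $r:[0,1]\to\mathbb{R}^d$ where $r(0)=\mathbf{x}$ and $r(1)=\mathbf{x}'$, the attribution for $h_i$ for a differentiable function $f$ is

$$\mathrm{IG}^{f}_{h_i}(\mathbf{x},\mathbf{x}',r)=\sum_{j=1}^{d}\left\{\int_0^1\frac{\partial f(h(r(t)))}{\partial h_i}\,\frac{\partial h_i(r(t))}{\partial \mathbf{x}_j}\,r_j'(t)\,dt\right\} \tag{8}$$

Note that (8) nicely recovers attributions for the input layer, in which case $h$ is the identity function.

Summation approximation. Similarly, we can approximate the above Riemann integral using a summation. Suppose we slice $[0,1]$ into $m$ equal segments; then (8) can be approximated as:

$$\mathrm{IG}^{f}_{h_i}(\mathbf{x},\mathbf{x}')\approx\frac{1}{m}\sum_{j=1}^{d}\left\{\sum_{k=0}^{m-1}\frac{\partial f(h(r(k/m)))}{\partial h_i}\,\frac{\partial h_i(r(k/m))}{\partial \mathbf{x}_j}\,r_j'(k/m)\right\} \tag{9}$$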

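### B.3 Proof of Proposition 3.1

If we put $\lambda=1$ and let $s(\cdot)$ be the $\mathrm{sum}$ function (sum all components of a vector), then for any curve $r$ and any intermediate layer $\mathbf{h}$, (4) becomes:

$$\begin{aligned}
\rho(\mathbf{x},y;\theta) &= \ell(\mathbf{x},y;\theta)+\max_{\mathbf{x}'\in N(\mathbf{x},\varepsilon)}\big\{\mathrm{sum}\big(\mathrm{IG}^{\ell_y}_{\mathbf{h}}(\mathbf{x},\mathbf{x}';r)\big)\big\} \\
&= \ell(\mathbf{x},y;\theta)+\max_{\mathbf{x}'\in N(\mathbf{x},\varepsilon)}\big\{\ell(\mathbf{x}',y;\theta)-\ell(\mathbf{x},y;\theta)\big\} \\
&= \max_{\mathbf{x}'\in N(\mathbf{x},\varepsilon)}\ell(\mathbf{x}',y;\theta)
\end{aligned}$$

where the second equality is due to the Axiom of Completeness of IG.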

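### B.4 Proof of Proposition 3.1

Input gradient regularization is an old idea proposed by Drucker and LeCun [DL92], and was recently used by Ross and Doshi-Velez [RD18] in the adversarial training setting. Basically, for $q\ge 1$, they propose $\rho(\mathbf{x},y;\theta)=\ell(\mathbf{x},y;\theta)+\lambda\|\nabla_{\mathbf{x}}\ell(\mathbf{x},y;\theta)\|_q^q$, where they want a small gradient at $\mathbf{x}$. To recover this objective from robust attribution regularization, let us pick $s(\cdot)$ as the function $\|\cdot\|_1^q$ (1-norm to the $q$-th power), and consider the simplest curve $r(t)=\mathbf{x}+t(\mathbf{x}'-\mathbf{x})$. With the naïve summation approximation of the integral we have $\mathrm{IG}^{\ell_y}_i(\mathbf{x},\mathbf{x}';r)\approx\frac{\mathbf{x}'_i-\mathbf{x}_i}{m}\sum_{k=0}^{m-1}\frac{\partial\ell(\mathbf{x}+\frac{k}{m}(\mathbf{x}'-\mathbf{x}),y;\theta)}{\partial\mathbf{x}_i}$, where the larger $m$ is, the more accurately we approximate the integral. Now, if we put $m=1$, which is the coarsest approximation, this becomes $(\mathbf{x}'_i-\mathbf{x}_i)\frac{\partial\ell(\mathbf{x},y;\theta)}{\partial\mathbf{x}_i}$, and we have $\mathrm{IG}^{\ell_y}(\mathbf{x},\mathbf{x}';r)\approx(\mathbf{x}'-\mathbf{x})\odot\nabla_{\mathbf{x}}\ell(\mathbf{x},y;\theta)$. Therefore (4) becomes:

$$\begin{aligned}
\rho(\mathbf{x},y;\theta) &= \ell(\mathbf{x},y;\theta)+\lambda\max_{\mathbf{x}'\in N(\mathbf{x},\varepsilon)}\big\{\big\|\mathrm{IG}^{\ell_y}(\mathbf{x},\mathbf{x}';r)\big\|_1^q\big\} \\
&\approx \ell(\mathbf{x},y;\theta)+\lambda\max_{\mathbf{x}'\in N(\mathbf{x},\varepsilon)}\big\{\big\|(\mathbf{x}'-\mathbf{x})\odot\nabla_{\mathbf{x}}\ell(\mathbf{x},y;\theta)\big\|_1^q\big\}
\end{aligned}$$

Put the neighborhood as $\|\mathbf{x}'-\mathbf{x}\|_p\le\varepsilon$ where $p\in[1,\infty]$ and $\frac{1}{p}+\frac{1}{q}=1$. By Hölder’s inequality, $\|(\mathbf{x}'-\mathbf{x})\odot\nabla_{\mathbf{x}}\ell(\mathbf{x},y;\theta)\|_1\le\|\mathbf{x}'-\mathbf{x}\|_p\,\|\nabla_{\mathbf{x}}\ell(\mathbf{x},y;\theta)\|_q\le\varepsilon\|\nabla_{\mathbf{x}}\ell(\mathbf{x},y;\theta)\|_q$, with equality attainable inside the neighborhood, which means that $\max_{\mathbf{x}'\in N(\mathbf{x},\varepsilon)}\|(\mathbf{x}'-\mathbf{x})\odot\nabla_{\mathbf{x}}\ell(\mathbf{x},y;\theta)\|_1^q=\varepsilon^q\|\nabla_{\mathbf{x}}\ell(\mathbf{x},y;\theta)\|_q^q$. Thus by putting $\lambda=\lambda'/\varepsilon^q$, we recover gradient regularization with regularization parameter $\lambda'$.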

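### B.5 Proof of Proposition 3.1

Let us put $\lambda=1$, $s(\cdot)=\|\cdot\|_1$, and $\mathbf{h}=\ell_y$ (the output layer of the loss function!). Then we have

$$\begin{aligned}
\rho(\mathbf{x},y;\theta) &= \ell_y(\mathbf{x})+\max_{\mathbf{x}'\in N(\mathbf{x},\varepsilon)}\big\{\big\|\mathrm{IG}^{\ell_y}_{\ell_y}(\mathbf{x},\mathbf{x}';r)\big\|_1\big\} \\
&= \ell_y(\mathbf{x})+\max_{\mathbf{x}'\in N(\mathbf{x},\varepsilon)}\big\{|\ell_y(\mathbf{x}')-\ell_y(\mathbf{x})|\big\}
\end{aligned}$$

where the second equality is because $\mathrm{IG}^{\ell_y}_{\ell_y}(\mathbf{x},\mathbf{x}';r)=\ell_y(\mathbf{x}')-\ell_y(\mathbf{x})$.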

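### B.6 Proof of Proposition 3.2

Specifically, again, let $s(\cdot)$ be the summation function and $\lambda=1$. Then we have $\mathbb{E}_{Z,Z'}[d_{\mathrm{IG}}(Z,Z')]=\mathbb{E}_{Z,Z'}[\ell(Z';\theta)-\ell(Z;\theta)]$ by the Axiom of Completeness. Because $P$ and $Z$ are identically distributed, the objective reduces to

$$\begin{aligned}
&\sup_{Q;\,M\in\prod(P,Q)}\Big\{\mathbb{E}_{Z,Z'}\big[\ell(Z;\theta)+\ell(Z';\theta)-\ell(Z;\theta)\big]\ \ \text{s.t.}\ \ \mathbb{E}_{Z,Z'}\big[c(Z,Z')\big]\le\rho\Big\} \\
&\quad= \sup_{Q;\,M\in\prod(P,Q)}\Big\{\mathbb{E}_{Z'}\big[\ell(Z';\theta)\big]\ \ \text{s.t.}\ \ \mathbb{E}_{Z,Z'}\big[c(Z,Z')\big]\le\rho\Big\} \\
&\quad= \sup_{Q:\,W_c(P,Q)\le\rho}\big\{\mathbb{E}_{Q}[\ell(Q;\theta)]\big\},
\end{aligned}$$

which is exactly the Wasserstein prediction robustness objective.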

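### B.7 Proof of Theorem 3.2

The proof largely follows that for Theorem 5 in [SND18], and we provide it here for completeness. Write $d^{\gamma}_{\mathrm{IG}}(z,z')=d_{\mathrm{IG}}(z,z')-\gamma c(z,z')$. Since we have a joint supremum over $Q$ and $M\in\prod(P,Q)$, we have that

$$\begin{aligned}
\sup_{Q;\,M\in\prod(P,Q)}\Big\{\mathbb{E}_{M=(Z,Z')}\big[d^{\gamma}_{\mathrm{IG}}(Z,Z')\big]\Big\} &= \sup_{Q;\,M\in\prod(P,Q)}\int\big[d_{\mathrm{IG}}(z,z')-\gamma c(z,z')\big]\,dM(z,z') \\
&\le \int\sup_{z'}\big\{d_{\mathrm{IG}}(z,z')-\gamma c(z,z')\big\}\,dP(z) \\
&= \mathbb{E}_{z\sim P}\Big[\sup_{z'}\big\{d^{\gamma}_{\mathrm{IG}}(z,z')\big\}\Big].
\end{aligned}$$

We would like to show equality in the above.

Let $\mathcal{Q}$ denote the space of regular conditional probabilities from $\mathcal{Z}$ to $\mathcal{Z}$. Then

$$\sup_{Q;\,M\in\prod(P,Q)}\int\big[d_{\mathrm{IG}}(z,z')-\gamma c(z,z')\big]\,dM(z,z')\ \ge\ \sup_{Q\in\mathcal{Q}}\int\big[d_{\mathrm{IG}}(z,z')-\gamma c(z,z')\big]\,dQ(z'\mid z)\,dP(z).$$

Let $\mathcal{Z}'$ denote all measurable mappings $z\mapsto z'(z)$ from $\mathcal{Z}$ to $\mathcal{Z}$. Using the measurability result in Theorem 14.60 in [RW09], we have

$$\sup_{z'\in\mathcal{Z}'}\int\big[d_{\mathrm{IG}}(z,z'(z))-\gamma c(z,z'(z))\big]\,dP(z)=\int\sup_{z'}\big[d_{\mathrm{IG}}(z,z')-\gamma c(z,z')\big]\,dP(z)$$

since $\gamma c-d_{\mathrm{IG}}$ is a normal integrand.

Let $z'(z)$ be any measurable function that is $\epsilon$-close to attaining the supremum above, and define the conditional distribution $Q(z'\mid z)$ to be supported on $z'(z)$. Then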

 supQ;M∈∏(P,Q)∫[d\IG(z,z′)−γc(z,z′)]dM(z,z′) ≥∫[d\IG(z,z′)−γc(z,z′)]dQ(z′|z)dP(z) =∫[d\IG(z,z′(z))−γc(z,z′(z))]dP(z) ≥∫supz′[d\IG(z,z′)−γc(z,z′)]dP(z)−ϵ ≥supQ;M∈∏(P,Q)∫[d\IG