Robust Machine Learning via Privacy/Rate-Distortion Theory

07/22/2020 · Ye Wang, et al. · Tufts University, University of Illinois at Urbana-Champaign, MERL

Robust machine learning formulations have emerged to address the prevalent vulnerability of deep neural networks to adversarial examples. Our work draws the connection between optimal robust learning and the privacy-utility tradeoff problem, which is a generalization of the rate-distortion problem. The saddle point of the game between a robust classifier and an adversarial perturbation can be found via the solution of a maximum conditional entropy problem. This information-theoretic perspective sheds light on the fundamental tradeoff between robustness and clean data performance, which ultimately arises from the geometric structure of the underlying data distribution and perturbation constraints. Further, we show that under mild conditions, the worst case adversarial distribution with Wasserstein-ball constraints on the perturbation has a fixed point characterization. This is obtained via the first order necessary conditions for optimality of the derived maximum conditional entropy problem. This fixed point characterization exposes the interplay between the geometry of the ground cost in the Wasserstein-ball constraint, the worst-case adversarial distribution, and the given reference data distribution.


1 Introduction

The widespread susceptibility of neural networks to adversarial examples Szegedy et al. (2014); Goodfellow et al. (2015) has been demonstrated through a wide variety of practical attacks Sharif et al. (2016); Kurakin et al. (2017); Moosavi-Dezfooli et al. (2017); Eykholt et al. (2018); Athalye et al. (2018a); Van Ranst et al. (2019); Li et al. (2019). This has motivated much research towards mitigating these vulnerabilities, although many earlier defenses have been shown to be ineffective Carlini and Wagner (2017a, b); Athalye et al. (2018b). We focus our attention on robust learning formulations that aim for guaranteed resiliency against worst-case input perturbations, either per instance or in a distributional sense. Our work draws the information-theoretic connections between optimal robust learning and the privacy-utility tradeoff problem. We utilize this perspective to shed light on the fundamental tradeoff between robustness and clean data performance, and to inspire novel algorithms for optimizing robust models.

The influential approach of Madry et al. (2018) proposes the robust optimization formulation given by

\[ \min_{\theta} \; \mathbb{E}_{(X,Y)\sim P_0}\Big[ \max_{x' : d(X, x') \le \epsilon} \ell\big(f_\theta(x'), Y\big) \Big], \]

where the inner maximization represents the worst case over a set of small perturbations applied to the input of the model (parameterized by θ), since the maximization is applied for each instance within the expectation over the pair (X, Y). This formulation has inspired a plethora of defenses: some that tackle the problem directly (albeit with limitations to scalability) Huang et al. (2017a); Katz et al. (2017); Ehlers (2017); Cheng et al. (2017); Tjeng et al. (2019) and others that employ approximate bounding Wong and Kolter (2018); Wong et al. (2018); Raghunathan et al. (2018a, b); Wong et al. (2019) or noise injection Lecuyer et al. (2018); Li et al. (2018); Cohen et al. (2019) to provide certified robustness guarantees.
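To make the inner maximization concrete, the following sketch approximates it with a standard projected gradient descent (PGD) attack under an ℓ∞ constraint; the PyTorch model, loss function, and hyperparameters here are illustrative placeholders rather than the setup of any particular cited defense.

```python
import torch

def pgd_attack(model, loss_fn, x, y, eps=0.03, alpha=0.007, steps=10):
    """Approximate the per-instance inner maximization over the
    l-infinity ball ||x' - x||_inf <= eps by projected gradient ascent."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = loss_fn(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()        # gradient ascent step
            x_adv = x + (x_adv - x).clamp(-eps, eps)   # project back onto the eps-ball
            x_adv = x_adv.clamp(0.0, 1.0)              # keep inputs in a valid range
    return x_adv.detach()

# Usage (hypothetical model): x_adv = pgd_attack(net, torch.nn.functional.cross_entropy, x, y)
```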

In order to both generalize this formulation and to establish the connection to the privacy problem, we consider a strengthened adversary by allowing mixed strategies, which is captured by modeling the perturbation as a channel P_{X'|XY} in the formulation

\[ \min_{\theta} \max_{P_{X'|XY} \in \mathcal{A}} \mathbb{E}\big[ \ell\big(f_\theta(X'), Y\big) \big], \]

where the constraint set represents the set of channels that produce small perturbations. In the case of training a classifier via the cross-entropy loss, the model provides an approximation Q_{Y|X'} of the posterior, and the loss is ℓ = −log Q(Y|X'). We study the fundamentally optimal value of the ideal robust learning game by considering the minimization over all decision rules Q_{Y|X'}, rather than a particular parametric family. Under this perspective, we show the following minimax result, with Theorems 1 and 2, that reduces the problem to a maximum conditional entropy problem,

\[ \min_{Q_{Y|X'}} \max_{P_{X'|XY} \in \mathcal{A}} \mathbb{E}\big[ -\log Q(Y|X') \big] \;=\; \max_{P_{X'|XY} \in \mathcal{A}} H(Y|X'). \]

This maximum entropy perspective is equivalent to the information-theoretic treatment of the privacy-utility tradeoff problem Rebollo-Monedero et al. (2010); Calmon and Fawaz (2012); Makhdoumi et al. (2014); Salamatian et al. (2015); Basciftci et al. (2016), where the aim is to design a distortion-constrained data perturbation mechanism (corresponding to the channel) that maximizes the uncertainty about sensitive information (represented by Y). The equivalence between the maximin problem and maximum conditional entropy is used by Calmon and Fawaz (2012) to argue, from an adversarial perspective in which the classifier represents a privacy attacker that aims to infer the sensitive data, that conditional entropy (or, equivalently, mutual information) measures privacy against an inference attack. This perspective is adopted in the learning frameworks of Tripathy et al. (2017); Huang et al. (2017b), where adversarial networks are trained toward solving this maximin problem. Figure 1 illustrates the connection between the robustness and privacy problems.

Ensuring distributional robustness provides even more general guarantees by considering the worst-case data distribution over some set of joint distributions, for which we similarly have

\[ \min_{Q_{Y|X}} \max_{P \in \mathcal{P}} \mathbb{E}_P\big[ -\log Q(Y|X) \big] \;=\; \max_{P \in \mathcal{P}} H_P(Y|X), \]

reducing again to a constrained maximum conditional entropy problem. Distributional robustness subsumes the earlier expected distortion constraint as a special case when the constraint set is a Wasserstein-ball with a suitably chosen ground metric. In Theorem 3, we show that the maximum conditional entropy problem over a Wasserstein-ball constraint has a fixed point characterization, which exposes the interplay between the geometry of the ground cost in the Wasserstein-ball constraint, the worst-case adversarial distribution, and the given reference data distribution.

We also examine the fundamental tradeoff between model robustness and clean data performance from our information-theoretic perspective. This tradeoff ultimately arises from the geometric structure of the underlying data distribution and the adversarial perturbation constraints. We illustrate these tradeoffs with the numerical analysis of a toy example.

Additional Related Work

In Farnia and Tse (2016), a similar minimax theorem is derived; however, technical conditions prevent its direct applicability to adversarial data perturbation, and much of their development focuses on the case where the marginal distribution of the data remains fixed. The similarities between the robust learning and privacy problems are noted by Hamm and Mehra (2017); however, they only state the minimax inequality relating the two. The fundamental tradeoff between clean data and adversarial loss was first theoretically addressed by Tsipras et al. (2019). This theory was further expanded upon by Zhang et al. (2019) and leveraged to develop an improved adversarial training defense.


Figure 1: The Robust Learning and Privacy-Utility Tradeoff problems both involve a game between a classifier and a perturbation that can change its input, but within some constraints. In robust learning, the goal is a classifier that is robust to the adversarial perturbation, and is posed as a minimax problem. The alternative maximin optimization captures the privacy-utility tradeoff problem, where the goal is a perturbation mechanism that hides sensitive information from an adversarial classifier that aims to recover it. Our minimax result shows that these two problems are equivalent.

Notation

We write 𝒫(𝒳′ | 𝒳 × 𝒴) for the set of conditional probability distributions over 𝒳′ given variables over the sets 𝒳 and 𝒴; other such sets of conditional distributions are defined similarly.

2 Robust Machine Learning

The influential robust learning formulation of Madry et al. (2018) addresses the worst-case attack, as given by

\[ \min_{\theta} \; \mathbb{E}_{(X,Y)\sim P_0}\Big[ \max_{x' : d(X, x') \le \epsilon} \ell\big(f_\theta(x'), Y\big) \Big], \tag{1} \]

where d is some suitably chosen distortion metric (e.g., often an ℓ_p distance such as ℓ_∞ or ℓ_2), and ε represents the allowable perturbation. The robust learning formulation in (1) can be viewed as a two-player zero-sum game, where the adversary (corresponding to the inner maximization) plays second, using a pure strategy by picking a fixed perturbed input x' subject to the distortion constraint. We will instead consider an adversary that utilizes a mixed strategy, where the perturbed input X' can be a randomized function of (X, Y), as specified by a conditional distribution P_{X'|XY}. This is expressed by a revised formulation given by

\[ \min_{\theta} \max_{P_{X'|XY} \in \mathcal{A}_{\max}(\epsilon)} \mathbb{E}\big[ \ell\big(f_\theta(X'), Y\big) \big], \tag{2} \]

where the expectation is over (X, Y) ∼ P_0 and X' drawn from the channel, and the distortion limit constraint is given by

\[ \mathcal{A}_{\max}(\epsilon) := \big\{ P_{X'|XY} : d(X, X') \le \epsilon \ \text{almost surely} \big\}. \tag{3} \]

Note that under this maximum distortion constraint, allowing mixed strategies does not actually strengthen the adversary, i.e., the games in (1) and (2) have the same value. However, if we replace the distortion limit constraint of (3) with an average distortion constraint, given by

\[ \mathcal{A}_{\mathrm{avg}}(D) := \big\{ P_{X'|XY} : \mathbb{E}[d(X, X')] \le D \big\}, \tag{4} \]

then the adversary is potentially strengthened, i.e., the value of the game in (2) under the constraint (4) is at least as large as that of (1).

2.1 Distributional Robustness

Since the objective E[ℓ(f_θ(X'), Y)] depends only on the joint distribution of the variables (X', Y), the robust learning formulation is straightforward to generalize by instead considering the maximization over an arbitrary set of joint distributions. With a change of variable (replacing X' with X to simplify the presentation), this formulation becomes

\[ \min_{\theta} \max_{P_{XY} \in \mathcal{P}} \mathbb{E}_P\big[ \ell\big(f_\theta(X), Y\big) \big], \tag{5} \]

which includes the scenarios considered in (1) through (4) as special cases. However, unlike these earlier formulations, (5) allows for the label variable Y to be potentially changed as well.

Another particular case for the constraint set is the Wasserstein-ball of radius ε around a reference distribution P_0, as given by

\[ \mathcal{P} := \big\{ P_{XY} : W_1(P_{XY}, P_0) \le \epsilon \big\}, \tag{6} \]

where W_1 is the 1-Wasserstein distance Santambrogio (2015); Villani (2009); Peyré and Cuturi (2019) for some ground metric (or, in general, a cost) c on the space 𝒳 × 𝒴. Recall that the 1-Wasserstein distance is given by

\[ W_1(P, Q) := \min_{\pi \in \Pi(P, Q)} \mathbb{E}_{\pi}\big[ c\big((X, Y), (X', Y')\big) \big], \]

where the set of couplings Π(P, Q) is defined as all joint distributions with the marginals P and Q. Note that maximizing over this Wasserstein-ball is equivalent to maximizing over channels subject to the expected distortion constraint E[c((X, Y), (X', Y'))] ≤ ε. Unlike the formulation considered in (2), this channel may also change the label. However, if modifying the label is prohibited by a distortion metric of the form

\[ c\big((x, y), (x', y')\big) = \begin{cases} d(x, x'), & y = y', \\ \infty, & y \ne y', \end{cases} \tag{7} \]

then the 1-Wasserstein distributionally robust learning formulation is equivalent to the earlier formulation in (2) with the average distortion constraint given by (4). Robust-ML with Wasserstein-ball constraints is also referred to as Distributionally Robust Optimization (DRO), which appeared in the seminal works of Blanchet and Murthy (2016); Blanchet et al. (2018, 2019); Gao and Kleywegt (2016); Gao et al. (2017) and was used, e.g., in Sinha et al. (2017); Lee and Raginsky (2018) for robust-ML applications. In essence, it was shown that DRO is approximately equivalent to imposing Lipschitz constraints on the classifier Cranko et al. (2020); Gao et al. (2017), which can be incorporated into the optimization routine. There is, however, no characterization of the optimal value of the min-max problem in this setting.
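For finite alphabets, the 1-Wasserstein distance above is a linear program over couplings. The sketch below computes it directly from that definition using scipy; the distributions and the squared-distance ground cost are arbitrary illustrative choices.

```python
import numpy as np
from scipy.optimize import linprog

def wasserstein1(p, q, C):
    """1-Wasserstein distance between discrete distributions p and q
    under ground cost matrix C, via the coupling linear program."""
    m, n = C.shape
    A_eq = np.zeros((m + n, m * n))
    for i in range(m):                     # row sums of the coupling equal p
        A_eq[i, i * n:(i + 1) * n] = 1.0
    for j in range(n):                     # column sums of the coupling equal q
        A_eq[m + j, j::n] = 1.0
    b_eq = np.concatenate([p, q])
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun

# Illustrative example: two distributions on a 3-point alphabet, squared-distance cost.
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])
xs = np.arange(3.0)
C = (xs[:, None] - xs[None, :]) ** 2
print(wasserstein1(p, q, C))
```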

2.2 Optimal Robust Learning

The specifics of the loss function and model are crucial to the analysis. Hence, we will focus specifically on learning classification models, where X represents the data features, Y represents the class labels, and the model can be viewed as producing an approximate posterior Q_{Y|X} that aims to approximate the underlying posterior P_{Y|X}. When cross-entropy is the loss function, i.e., ℓ = −log Q(Y|X), the expected loss, with respect to some distribution P_{XY}, is given by

\[ \mathbb{E}_P\big[ -\log Q(Y|X) \big] = H_P(Y|X) + \mathbb{E}_P\big[ D_{\mathrm{KL}}\big( P_{Y|X} \,\|\, Q_{Y|X} \big) \big]. \tag{8} \]

Thus, the principle of learning via minimizing the expected cross-entropy loss optimizes the approximate posterior Q_{Y|X} toward the underlying posterior P_{Y|X}, and the loss is lower bounded by the conditional entropy H_P(Y|X), which is arguably nonzero for nontrivial classification problems.
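To make the decomposition in (8) concrete, the following sketch numerically checks, for an arbitrary small joint distribution and an arbitrary approximate posterior, that the expected cross-entropy equals the conditional entropy plus the expected KL divergence.

```python
import numpy as np

rng = np.random.default_rng(0)
P_xy = rng.random((4, 3)); P_xy /= P_xy.sum()              # arbitrary joint P(x, y)
P_x = P_xy.sum(axis=1, keepdims=True)
P_y_given_x = P_xy / P_x                                    # true posterior P(y | x)

Q = rng.random((4, 3)); Q /= Q.sum(axis=1, keepdims=True)   # arbitrary approximate posterior Q(y | x)

cross_entropy = -(P_xy * np.log(Q)).sum()                   # E_P[-log Q(Y|X)]
H_Y_given_X = -(P_xy * np.log(P_y_given_x)).sum()           # H_P(Y|X)
kl_term = (P_xy * np.log(P_y_given_x / Q)).sum()            # E_P[ D_KL(P_{Y|X} || Q_{Y|X}) ]

print(cross_entropy, H_Y_given_X + kl_term)                 # the two values coincide
```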

The robust learning problem, given by

\[ \min_{\theta} \max_{P \in \mathcal{P}} \mathbb{E}_P\big[ -\log Q_\theta(Y|X) \big], \tag{9} \]

still critically depends on the specific parametric family (e.g., neural network architecture) chosen for the model, which determines the corresponding parametric family of approximate posteriors {Q_θ : θ ∈ Θ}. Motivated by the ultimate meta-objective of determining the best architectures for robust learning, we consider the idealized optimal robust learning formulation where the minimization is performed over all conditional distributions Q_{Y|X}, as given by

\[ \min_{Q_{Y|X}} \max_{P \in \mathcal{P}} \mathbb{E}_P\big[ -\log Q(Y|X) \big], \tag{10} \]

which clearly lower bounds (9), since (9) is restricted to the particular parametric family.

3 The Privacy-Utility Tradeoff Problem

In the information-theoretic treatment of the privacy-utility tradeoff problem, the random variables X and Y respectively denote the useful and sensitive data, and the aim is to release data X' produced from a randomized algorithm specified by a channel, while simultaneously preserving privacy with respect to the sensitive variable Y and maintaining utility with respect to the useful variable X. Although privacy can be quantified in various ways (cf. Issa et al. (2016); Liao et al. (2018); Rassouli and Gunduz (2019)), we will focus on a particular information-theoretic approach (see Rebollo-Monedero et al. (2010); Calmon and Fawaz (2012); Makhdoumi et al. (2014); Salamatian et al. (2015); Basciftci et al. (2016)) that utilizes mutual information to measure the privacy leakage, with the aim of making this small in order to preserve privacy. Utility is quantified with respect to a distortion function d(X, X'), which is suitably chosen for the particular application. Minimizing (or limiting) the distortion captures the objective of maintaining the utility of the data release. Since the useful and sensitive data are correlated (and indeed the problem is uninteresting if they are independent), a tradeoff naturally emerges between the two objectives of preserving privacy and utility.
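As a minimal illustration of the two competing quantities, the sketch below computes the privacy leakage I(Y; X') and the expected Hamming distortion of a fixed release channel on binary alphabets; the joint distribution and the channel are hypothetical.

```python
import numpy as np

def mutual_information(P_joint):
    """I(Y; X') computed from a joint distribution P(x', y)."""
    Px = P_joint.sum(axis=1, keepdims=True)
    Py = P_joint.sum(axis=0, keepdims=True)
    mask = P_joint > 0
    return (P_joint[mask] * np.log(P_joint[mask] / (Px @ Py)[mask])).sum()

# Hypothetical joint P(x, y) and release channel P(x' | x).
P_xy = np.array([[0.35, 0.05],
                 [0.15, 0.45]])
channel = np.array([[0.9, 0.1],
                    [0.2, 0.8]])

P_xpy = channel.T @ P_xy                                        # joint of release X' and sensitive Y
leakage = mutual_information(P_xpy)                             # privacy leakage I(Y; X')
distortion = (P_xy.sum(axis=1) * (1 - np.diag(channel))).sum()  # E[1{X' != X}] (Hamming distortion)
print(leakage, distortion)
```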

3.1 Optimal Privacy-Utility Tradeoff

The optimal privacy-utility tradeoff problem is formulated as an information-theoretic optimization problem in Rebollo-Monedero et al. (2010); Calmon and Fawaz (2012), as given by

\[ \min_{P_{X'|XY} \in \mathcal{A}_{\mathrm{avg}}(D)} I(Y; X') \;\equiv\; \max_{P_{X'|XY} \in \mathcal{A}_{\mathrm{avg}}(D)} H(Y|X'), \tag{11} \]

where the constraint set, as given in (4), captures the expected distortion budget, and the equivalence follows from I(Y; X') = H(Y) − H(Y|X'), since H(Y) is constant. Similarly, one could consider the alternative maximum distortion constraint given in (3).
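A minimal sketch of the optimization in (11) for binary alphabets: assuming (for simplicity) that the release channel is a symmetric bit-flip of X and that distortion is Hamming distance, a grid search over the flip probability finds the channel that maximizes H(Y|X') within the distortion budget. This is an illustrative simplification, not the algorithm of the cited works.

```python
import numpy as np

def cond_entropy(P_joint):
    """H(Y | X') for a joint distribution P(x', y)."""
    P_xp = P_joint.sum(axis=1, keepdims=True)
    post = np.where(P_joint > 0, P_joint / np.maximum(P_xp, 1e-300), 1.0)
    return -(P_joint * np.log(post)).sum()

P_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])               # hypothetical joint of useful X and sensitive Y
D = 0.2                                     # expected Hamming distortion budget on X

best = (-np.inf, None)
for delta in np.linspace(0.0, min(D, 0.5), 201):    # symmetric bit-flip channels X -> X'
    channel = np.array([[1 - delta, delta],
                        [delta, 1 - delta]])        # P(x' | x); E[1{X' != X}] = delta
    P_xpy = channel.T @ P_xy                         # P(x', y)
    H = cond_entropy(P_xpy)
    if H > best[0]:
        best = (H, delta)

print("max H(Y|X') = %.4f nats at flip probability %.3f" % best)
```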

3.2 Adversarial Formulation of Privacy

In Calmon and Fawaz (2012), the privacy-utility problem in (11) is derived from a broader perspective that poses privacy as maximizing the loss of an adversary that mounts a statistical inference attack attempting to recover the sensitive Y from the release X'. Their framework considers an adversary that can observe the release and choose a conditional distribution Q_{Y|X'} to minimize its expected loss. As observed in Calmon and Fawaz (2012), when cross-entropy (or "self-information") is the loss, we have that

\[ \min_{Q_{Y|X'}} \mathbb{E}\big[ -\log Q(Y|X') \big] = H(Y|X'), \tag{12} \]

with the optimum attained at Q_{Y|X'} = P_{Y|X'}, which follows from a derivation similar to (8). Thus, the optimal privacy-utility tradeoff given in (11) is equivalent to a maximin problem, as stated in Lemma 1.

Lemma 1 (equivalence of privacy formulations Calmon and Fawaz (2012)).

For any joint distribution P_{XY} and closed, convex constraint set of channels, e.g., as given by (3) or (4), we have

\[ \max_{P_{X'|XY} \in \mathcal{A}} \min_{Q_{Y|X'}} \mathbb{E}\big[ -\log Q(Y|X') \big] \;=\; \max_{P_{X'|XY} \in \mathcal{A}} H(Y|X'), \]

where the expectation and conditional entropy are with respect to the joint distribution induced by P_{XY} and the channel.

3.3 Connections to Rate-Distortion Theory

The privacy-utility tradeoff problem is also highly related to rate-distortion theory, which considers the efficiency of lossy data compression. When the sensitive data coincides with the useful data, i.e., Y = X, the optimization problem in (11) immediately reduces to the single-letter characterization of the optimal rate-distortion tradeoff. However, the privacy problem considers an inherently single-letter scenario, where we deal with just a single instance of the variables (X, Y), which could naturally be high-dimensional, but with no restrictions placed on their statistical structure across these dimensions. Another related approach Yamamoto (1983); Sankar et al. (2013) considers an asymptotic coding formulation that replaces (X, Y) with vectors of iid samples and also adds coding efficiency into the consideration of a three-way rate-privacy-utility tradeoff.

4 Main Results: Duality between Optimal Robust Learning and Privacy-Utility Tradeoffs

The solution to the optimal minimax robust learning problem can be found via a maximum conditional entropy problem related to the privacy-utility tradeoff problem.

Theorem 1.

For any finite sets 𝒳 and 𝒴, and closed, convex set of joint distributions over 𝒳 × 𝒴, we have

\[ \min_{Q_{Y|X}} \max_{P \in \mathcal{P}} \mathbb{E}_P\big[ -\log Q(Y|X) \big] \tag{13} \]
\[ = \max_{P \in \mathcal{P}} \min_{Q_{Y|X}} \mathbb{E}_P\big[ -\log Q(Y|X) \big] \tag{14} \]
\[ = \max_{P \in \mathcal{P}} H_P(Y|X), \tag{15} \]

where the expectations and entropy are with respect to P. Further, the solutions for Q_{Y|X} that minimize (13) are given by

(16)
Proof.

See Appendix in the supplementary material. ∎

Intuitively, the optimal minimax robust decision rule that solves (13) must be consistent with the posterior corresponding to the solution of the maximum conditional entropy problem in (15). However, a given posterior is well-defined only over the support of the marginal distribution of X, whereas the robust decision rule needs to be defined over the entire space 𝒳. Hence, generally, determining the robust decision rule over the entirety of 𝒳 requires considering the solution set in (16), which seems cumbersome, but can be simplified in many cases via the following corollary.

Corollary 1.

Under the paradigm of Theorem 1, let

For all , the corresponding terms of (16) are given by

Further, if

then the solution set given by (16), for the minimization of (13), contains exactly one point and is given by

In the simplest case, if there exists a maximizing distribution P* whose marginal for X has full support over 𝒳, then the optimal robust decision rule that solves the minimization of (13) is simply given by the posterior P*_{Y|X}, which is defined for all of 𝒳.
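The following sketch illustrates Theorem 1 and Corollary 1 on finite alphabets, assuming a simple one-parameter constraint set (the segment between two hypothetical joint distributions, which is closed and convex): it locates the maximum conditional entropy point by a grid search, takes its posterior as the robust decision rule, and checks that this rule's worst-case cross-entropy over the segment matches the maximum conditional entropy.

```python
import numpy as np

def cond_entropy(P):
    """H(Y|X) of a joint distribution P(x, y)."""
    Px = P.sum(axis=1, keepdims=True)
    return -(P * np.log(P / Px)).sum()

def loss(Q, P):
    """Expected cross-entropy E_P[-log Q(Y|X)] of decision rule Q(y|x)."""
    return -(P * np.log(Q)).sum()

rng = np.random.default_rng(1)
P0 = rng.random((3, 2)); P0 /= P0.sum()         # two hypothetical joint distributions
P1 = rng.random((3, 2)); P1 /= P1.sum()

# Constraint set: the segment {(1-t) P0 + t P1}; find its max conditional entropy point.
ts = np.linspace(0.0, 1.0, 2001)
Hs = [cond_entropy((1 - t) * P0 + t * P1) for t in ts]
t_star = ts[int(np.argmax(Hs))]
P_star = (1 - t_star) * P0 + t_star * P1

# Robust decision rule: the posterior of the worst-case distribution (Corollary 1).
Q_star = P_star / P_star.sum(axis=1, keepdims=True)

worst_loss = max(loss(Q_star, (1 - t) * P0 + t * P1) for t in ts)
print(worst_loss, max(Hs))                       # the two values agree (up to grid error)
```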

4.1 Generalization to Arbitrary Alphabets

Extending the result in the previous section to a continuous alphabet 𝒳 requires one to expand the set of allowable Markov kernels, i.e., conditional probabilities, to what is referred to as the set of generalized decision rules in statistical decision theory Strasser (2011); LeCam (1955); Cam (1986); Vaart (2002). This is because the set of Markov kernels is not compact, while the set of generalized decision rules is. For any function in the set of bounded continuous functions and any bounded signed measure, given a mapping (interpreted as a measurable function for each fixed argument), define a bilinear functional via

(17)
Definition 1.

Strasser (2011) A generalized decision function is a bilinear function that satisfies three properties, (a), (b), and (c).

Define the set of generalized decision rules as the set of bilinear functions defined via (17) and satisfying properties (a), (b), and (c) above.

(18)

Applying these results, we obtain the following theorem for the case of general alphabets. Note that, in contrast to Theorem 1, here the results hold with the set of generalized decision rules in place of the set of Markov kernels.

Theorem 2.

Under the paradigm of Theorem 1, for a continuous alphabet 𝒳 and discrete 𝒴,

(19)
Proof.

Using the fact that the set of generalized decision rules is convex and compact in the weak topology (Theorem 42.3, Strasser (2011)), that the objective is convex in the decision rule for every fixed distribution, and applying the minimax theorem of Pollard (2003),

(20)

The result then follows by noting that the inner minimization again evaluates to the conditional entropy, as in (8). This result implies that, even in the case of continuous alphabets, the worst-case algorithm-independent adversarial perturbation can be computed by solving the maximum conditional entropy problem. ∎

5 Implications of the main results

5.1 A fixed point characterization of the worst case perturbation

We consider the particular case when the constraint set is the Wasserstein-ball of radius ε around a reference distribution P_0, as in (6), and derive the necessary conditions for optimality of the solution to the maximum conditional entropy problem max_{P ∈ 𝒫} H_P(Y|X), where the subscript on the conditional entropy highlights the fact that it is computed under the joint distribution P. To this end, we adopt a Lagrangian viewpoint and assume that 𝒳 and 𝒴 are continuous, bounded, and compact sets, but the result can be seen to hold true when 𝒳 is continuous and 𝒴 is discrete. The result is summarized in the theorem below.

Theorem 3.

If the cost c is continuous with a continuous first derivative and the reference distribution P_0 is supported on the whole of the domain, then the optimal solution to the maximum conditional entropy problem over the Wasserstein-ball, for some Lagrange multiplier, satisfies

(21)

where φ is the Kantorovich potential¹ corresponding to the optimal solution of the transport problem from P_0 to the optimizing distribution under the ground cost c, K is a constant, U is a uniform distribution over 𝒴, and the marginal distribution of X is taken under the optimizing joint distribution.

¹The Kantorovich potential is the variable of optimization in the dual problem to the optimal transport problem. We refer the reader to Santambrogio (2015); Villani (2009) and Peyré and Cuturi (2019) for these definitions and notions related to the theory of optimal transport.

Proof.

See Appendix in the supplementary material. ∎

This characterization closely ties the geometry of the perturbations (as reflected via the Kantorovich potential) to the worst-case distribution that maximizes the conditional entropy.

The algorithmic implications of this fixed point relation will be explored in an upcoming manuscript.
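The Kantorovich potential entering the fixed point relation can be obtained numerically as a dual variable of a discrete optimal transport problem. The sketch below uses the POT library (ot) on hypothetical one-dimensional marginals; the potential is read from the dual solution returned by the exact solver.

```python
import numpy as np
import ot  # POT: Python Optimal Transport

# Hypothetical reference marginal p0 and candidate worst-case marginal p on a 1-D grid.
xs = np.linspace(0.0, 1.0, 50)
p0 = np.exp(-((xs - 0.3) ** 2) / 0.02); p0 /= p0.sum()
p = np.exp(-((xs - 0.5) ** 2) / 0.05); p /= p.sum()

C = (xs[:, None] - xs[None, :]) ** 2         # squared-distance ground cost

# Exact discrete optimal transport; the log dict carries the dual potentials.
plan, log = ot.emd(p0, p, C, log=True)
phi = log['v']                                # Kantorovich potential paired with the candidate marginal
print("transport cost:", (plan * C).sum())
print("potential (first entries):", phi[:5])
```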

5.2 Robustness vs Clean Data Loss Tradeoffs

A natural question to ask is whether robustness comes at a price. It has been observed empirically that robust models will underperform on clean data in comparison to conventional, non-robust models. To understand why this is fundamentally unavoidable, we examine the loss for robust and non-robust models in combination with clean data or under adversarial attack.

Let P_0 ∈ 𝒫 denote the unperturbed (clean data) distribution within the set 𝒫 of potential adversarial attacks. For a given decision rule Q_{Y|X} and distribution P, recall that the cross-entropy loss is given by (8) as

\[ L(Q, P) := \mathbb{E}_P\big[ -\log Q(Y|X) \big] = H_P(Y|X) + \mathbb{E}_P\big[ D_{\mathrm{KL}}\big( P_{Y|X} \,\|\, Q_{Y|X} \big) \big]. \]

The baseline loss of the ideal non-robust model for clean data is given by

\[ \min_{Q_{Y|X}} L(Q, P_0) = H_{P_0}(Y|X). \]

Under adversarial attack, the ideal loss of the robust model is given by Theorem 1 as

\[ \min_{Q_{Y|X}} \max_{P \in \mathcal{P}} L(Q, P) = \max_{P \in \mathcal{P}} H_P(Y|X). \]

For a robust model Q* that solves (13), which is characterized by (16), its loss under the clean data distribution is given by

\[ L(Q^*, P_0) = H_{P_0}(Y|X) + \mathbb{E}_{P_0}\big[ D_{\mathrm{KL}}\big( P_{0,Y|X} \,\|\, Q^*_{Y|X} \big) \big]. \]

The KL-divergence term must be finite, since we have

\[ \mathbb{E}_{P_0}\big[ D_{\mathrm{KL}}\big( P_{0,Y|X} \,\|\, Q^*_{Y|X} \big) \big] = L(Q^*, P_0) - H_{P_0}(Y|X) \le \max_{P \in \mathcal{P}} L(Q^*, P) - H_{P_0}(Y|X) \le \max_{P \in \mathcal{P}} H_P(Y|X) - H_{P_0}(Y|X) < \infty, \]

where the second inequality follows from Q* being the minimax solution.

Figure 2: Left: Loss as a function of the decision rule (indexed by its design distortion), with each curve corresponding to an attack of fixed distortion. Right: Loss as a function of attack distortion, with each curve corresponding to a decision rule designed for a fixed distortion.

We numerically evaluate these tradeoffs by considering a family of Wasserstein-ball constraint sets, as given by (6), with varying radius around a reference distribution P_0 over finite alphabets. The ground metric is of the form given in (7), which effectively limits the perturbation to changing only X within an expected squared-distance distortion constraint, as equivalent to (4). The distribution P_0 was randomly chosen, with entropies H(Y) and H(Y|X) measured in nats.

Leveraging Theorem 1 and Corollary 1, we numerically solve for the robust decision rules across a range of design distortion constraints. In combination with each decision rule, we consider the loss under attacks at varying distortion limits.

Figure 2 plots the loss across the combinations of design and attack distortions. On the left of Figure 2, each curve corresponds to a fixed attack distortion, over which the decision rule (indexed by its design distortion) is varied, with the optimal loss obtained when the design distortion matches the attack distortion. As the design distortion increases, the losses for all curves converge to H(Y). On the right of Figure 2, the dotted black curve is the maximum conditional entropy at each attack distortion, which corresponds to the ideal robust loss when the design and attack distortions match. The other curves each correspond to a fixed decision rule, over which the attack distortion is varied, exhibiting suboptimal loss for mismatched distortions. The beginning of each curve, at zero attack distortion, is the clean data loss for that rule, and we can see that clean data loss is degraded as robustness to higher distortions is improved. In the extreme of a decision rule designed to be robust to very high distortion, the loss is uniformly equal to H(Y) across all attack distortions, since this robust decision rule simply guesses the prior P_Y.
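As a sketch of how such a toy experiment could be computed (this is not the authors' implementation), note that H(Y|X') is concave in a perturbation channel P(x'|x), so the worst-case channel under an expected distortion budget can be found with the Frank-Wolfe method, using a linear program as the oracle over the constraint polytope. For simplicity, the channel below depends on X only; the joint distribution, distortion, and hyperparameters are illustrative.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
nx, ny = 5, 3
P0 = rng.random((nx, ny)); P0 /= P0.sum()        # clean joint P0(x, y) on finite alphabets
xs = np.arange(nx, dtype=float)
dist = (xs[:, None] - xs[None, :]) ** 2           # squared-distance distortion d(x, x')
D = 1.0                                           # expected distortion budget

def cond_entropy_and_grad(Q):
    """H(Y|X') and its gradient w.r.t. the channel Q(x'|x)."""
    Pxy = Q.T @ P0                                # P(x', y) = sum_x Q(x'|x) P0(x, y)
    Px = Pxy.sum(axis=1, keepdims=True)
    post = np.where(Pxy > 0, Pxy / np.maximum(Px, 1e-300), 1.0)   # P(y | x')
    H = -(Pxy * np.log(post)).sum()
    grad = -(P0 @ np.log(post).T)                 # grad[x, x'] = -sum_y P0(x, y) log P(y|x')
    return H, grad

# Polytope constraints on the flattened channel q[x*nx + x']:
# each row of Q sums to one, and the expected distortion stays within the budget.
A_eq = np.kron(np.eye(nx), np.ones((1, nx)))
b_eq = np.ones(nx)
A_ub = (P0.sum(axis=1)[:, None] * dist).reshape(1, -1)
b_ub = np.array([D])

Q = np.eye(nx)                                    # start from the identity (no perturbation)
for k in range(200):                              # Frank-Wolfe iterations
    H, G = cond_entropy_and_grad(Q)
    res = linprog(-G.ravel(), A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, 1), method="highs")
    S = res.x.reshape(nx, nx)
    Q = Q + (2.0 / (k + 2)) * (S - Q)             # standard Frank-Wolfe step size

print("worst-case H(Y|X') ~= %.4f nats" % cond_entropy_and_grad(Q)[0])
```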

Broader Impact

As this is a theory paper regarding the problem of robust learning, which addresses the threat posed by adversarial example attacks, short-term ethical or societal consequences are not expected. The potential long-term upside of our work is that better theoretical understanding of these issues may lead to the development and application of more resilient machine learning technology to better address safety, security, and reliability concerns. A corresponding risk is that progress toward expanding fundamental knowledge may also be leveraged to realize more sophisticated attacks that may undermine already widely deployed AI systems. However, the advancement of attacks is perhaps inevitable, and, hence, research into defenses must be conducted.

References

Supplementary Material for “Robust Machine Learning via Privacy/Rate-Distortion Theory”

6 Proof of Theorem 1

Proof.

The relations in (15) and the existence of the maxima and minimum in (14) and (15) follow from a straightforward generalization of Lemma 1. The rest of the proof follows the same general steps as the proof of a generalized minimax theorem given by Pollard [2003], except adapted for minima and maxima rather than infima and suprema.

For convenience, we define

(22)
(23)

Note that the objective is linear in the distribution for a fixed decision rule, and convex in the decision rule for a fixed distribution. Further, the constraint set is compact, convex, and nonempty.

We only need to show that (13) is less than or equal to (14), which would follow from , which is equivalent to (16). Since each is compact, it is sufficient to show that for every finite subset [Rudin, 1964, Thm. 2.36]. We will first show this for any two-point set, and later extend this to every finite set through an inductive argument.

Suppose , then a contradiction would occur if we can show that there exists such that for all ,

(24)

since then , where .

For , we immediately have (24), since both and . For (24) to hold for all , we must require

(25)

The supremum is , since and , from the assumption . For (24) to hold for all , we must also require

(26)

The infimum is , since and , from the assumption . Thus, an satisfying both (25) and (26) exists if and only if for all and ,

or equivalently,

(27)

Since (27) is immediate if either or , we need only consider when both and . Define such that

(28)

and let . Since is convex in , , which implies that hence (since we assumed that they are disjoint), which further implies that

(29)

Thus, by combining (28) and (29),

which implies (27) and the existence of , which contradicts the assumption that .

The pairwise result implies that for any finite set , for . Then, we can repeat the argument starting from (22) with further restricted to , i.e., replacing in subsequent steps with , which effectively redefines (23) with , and eventually leads to for . Thus, repeating this argument further yields that for any finite subset , which, as argued earlier, implies (16). ∎

7 Proof of Theorem 3

All of the proof steps assume continuous and compact alphabets, but it is easy to see that the steps hold true when 𝒴 is discrete and finite and 𝒳 is continuous. We begin with the following definition, which is taken from Chapter 7 of Santambrogio [2015].

Definition 2.

Given a functional F over probability measures, if ρ is a regular point² of F, then, for any perturbation χ, one calls g the first variation of F at ρ if

\[ \frac{d}{d\varepsilon} F(\rho + \varepsilon \chi)\Big|_{\varepsilon = 0} = \int g \, d\chi. \tag{30} \]

²See Chapter 7 of Santambrogio [2015] for the definition of a regular point.

It can be seen that the first variation is unique up to a constant. The proof then follows from the following two lemmas.

Lemma 2.

Santambrogio [2015] The first variation of the optimal transport cost with respect to one of its marginals is given by the Kantorovich potential φ, provided it is unique. A sufficient condition for uniqueness of φ is that the cost is continuous with a continuous first derivative and the other marginal is supported on the whole of the domain.

Lemma 3.

The first variation of the conditional entropy function defined by

(31)

is given by the corresponding expression, where U is a uniform distribution over 𝒴 and the marginal over 𝒳 is taken under the joint distribution P.

Proof.

Notation: In the following, to be concise and to avoid cumbersome notation, we will often suppress some arguments; on the other hand, we keep certain dependencies explicit so as not to lose sight of them.

By definition, consider a perturbation around the distribution of interest and let us look at

(32)
(33)
(34)

where . Let us focus on the first term.

(35)