The widespread susceptibility of neural networks to adversarial examples Szegedy et al. (2014); Goodfellow et al. (2015) has been demonstrated through a wide variety of practical attacks Sharif et al. (2016); Kurakin et al. (2017); Moosavi-Dezfooli et al. (2017); Eykholt et al. (2018); Athalye et al. (2018a); Van Ranst et al. (2019); Li et al. (2019). This has motivated much research toward mitigating these vulnerabilities, although many earlier defenses have been shown to be ineffective Carlini and Wagner (2017a, b); Athalye et al. (2018b). We focus on robust learning formulations that aim for guaranteed resilience against worst-case input perturbations, whether per instance or in a distributional sense. Our work draws information-theoretic connections between optimal robust learning and the privacy-utility tradeoff problem. We use this perspective to shed light on the fundamental tradeoff between robustness and clean data performance, and to inspire novel algorithms for optimizing robust models.
The influential approach of Madry et al. (2018) proposes the robust optimization formulation given by
$$\min_{\theta} \; \mathbb{E}\Big[ \max_{\delta \in \mathcal{S}} \ell\big(f_\theta(X + \delta), Y\big) \Big],$$
where $\max_{\delta \in \mathcal{S}}$ represents the worst case over some set $\mathcal{S}$ of small perturbations applied to the input of the model $f_\theta$ (parameterized by $\theta$), since the maximization is applied for each instance within the expectation over the pair $(X, Y)$. This formulation has inspired a plethora of defenses: some that tackle the problem directly (albeit with limitations to scalability) Huang et al. (2017a); Katz et al. (2017); Ehlers (2017); Cheng et al. (2017); Tjeng et al. (2019) and others that employ approximate bounding Wong and Kolter (2018); Wong et al. (2018); Raghunathan et al. (2018a, b); Wong et al. (2019) or noise injection Lecuyer et al. (2018); Li et al. (2018); Cohen et al. (2019) to provide certified robustness guarantees.
In order to both generalize this formulation and to establish the connection to the privacy problem, we consider a strengthened adversary by allowing mixed strategies, which is captured with the perturbation as a channel in the formulation
where represents the set of channels that produce small perturbations. In the case of training a classifier via cross-entropy loss, the model provides an approximation of the posterior and . We study the fundamentally optimal value for the ideal robust learning game by considering the minimization over all decision rules rather than over a particular parametric family. Under this perspective, Theorems 1 and 2 establish the following minimax result, which reduces the problem to a maximum conditional entropy problem,
This maximum entropy perspective is equivalent to the information-theoretic treatment of the privacy-utility tradeoff problem Rebollo-Monedero et al. (2010); Calmon and Fawaz (2012); Makhdoumi et al. (2014); Salamatian et al. (2015); Basciftci et al. (2016), where the aim is to design a distortion-constrained data perturbation mechanism (corresponding to ) that maximizes the uncertainty about sensitive information (represented by ). The equivalence between the maximin problem and maximum conditional entropy is used by Calmon and Fawaz (2012) to argue from an adversarial perspective, where represents a privacy attacker that aims to infer the sensitive data, that conditional entropy (or equivalently, mutual information) measures privacy against an inference attack. This perspective is adopted in the learning frameworks of Tripathy et al. (2017); Huang et al. (2017b), where adversarial networks are trained toward solving this maximin problem. Figure 1 illustrates the connection between the robustness and privacy problems.
Ensuring distributional robustness provides even more general guarantees, by considering the worst-case data distribution over some set , for which we similarly have
reducing again to a constrained maximum conditional entropy problem. Distributional robustness subsumes the earlier expected distortion constraint as a special case when is a Wasserstein-ball with a suitably chosen ground metric. In Theorem 3, we show that the maximum conditional entropy problem over a Wasserstein-ball constraint has a fixed point characterization, which exposes the interplay between the geometry of the ground cost in the Wasserstein-ball constraint, the worst-case adversarial distribution, and the given reference data distribution.
We also examine the fundamental tradeoff between model robustness and clean data performance from our information-theoretic perspective. This tradeoff ultimately arises from the geometric structure of the underlying data distribution and the adversarial perturbation constraints. We illustrate these tradeoffs with the numerical analysis of a toy example.
Additional Related Work
In Farnia and Tse (2016), a similar minimax theorem is derived; however, technical conditions prevent its direct application to adversarial data perturbation, and much of their development focuses on the case where the marginal distribution of the data remains fixed. The similarities between the robust learning and privacy problems are noted by Hamm and Mehra (2017); however, they only state the minimax inequality relating the two. The fundamental tradeoff between clean data loss and adversarial loss was first addressed theoretically by Tsipras et al. (2019). This theory was further expanded by Zhang et al. (2019) and leveraged to develop an improved adversarial training defense.
to denote the set of conditional probability distributions overgiven variables over the sets and , and is similarly defined.
2 Robust Machine Learning
The influential robust learning formulation of Madry et al. (2018) addresses the worst-case attack, as given by
where is some suitably chosen distortion metric (e.g., often , , or distance), and represents the allowable perturbation. The robust learning formulation in (1) can be viewed as a two-player zero-sum game, where the adversary (corresponding to the inner maximization) plays second using a pure strategy by picking a fixed subject to the distortion constraint. We will instead consider an adversary that utilizes a mixed strategy, where can be a randomized function of as specified by a conditional distribution . This is expressed by a revised formulation given by
where the expectation is over , and the distortion limit is given by
Note that under this maximum distortion constraint, allowing mixed strategies does not actually strengthen the adversary, i.e., the games in (1) and (2) have the same value. However, if we replace the distortion limit constraint of (3) with an average distortion constraint, given by
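then the adversary is potentially strengthened, i.e., the value of the game in (2) under the average distortion constraint (4) is at least as large as its value under the maximum distortion constraint (3), since every channel that satisfies (3) also satisfies (4).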
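To illustrate why an average distortion constraint can strengthen the adversary, the following minimal sketch evaluates a fixed decision rule on a toy binary problem. The alphabet, the rule `q`, the distortion `d`, the budget `eps`, and the two channels are hypothetical choices made purely for illustration; the sketch only exhibits one mixed strategy that is feasible under the average constraint but not under the maximum constraint, and does not solve either game.

```python
import math

# Toy setup (all values are illustrative assumptions):
# X is uniform on {0, 1}, the label is Y = X, and the adversary outputs Z in {0, 1}.
p_x = {0: 0.5, 1: 0.5}
eps = 0.5  # distortion budget


def d(x, z):
    """Distortion between the clean input x and the perturbed input z."""
    return abs(x - z)


def q(y, z):
    """Fixed decision rule q(y | z): probability 0.9 on the label matching z."""
    return 0.9 if y == z else 0.1


def expected_loss(channel):
    """Expected cross-entropy loss E[-log q(Y | Z)] for a channel P(z | x)."""
    return sum(
        p_x[x] * channel[x][z] * -math.log(q(x, z))  # Y = X in this toy problem
        for x in p_x
        for z in (0, 1)
    )


def expected_distortion(channel):
    """Average distortion E[d(X, Z)] induced by the channel."""
    return sum(p_x[x] * channel[x][z] * d(x, z) for x in p_x for z in (0, 1))


def max_distortion(channel):
    """Largest distortion that occurs with positive probability."""
    return max(d(x, z) for x in p_x for z in (0, 1) if channel[x][z] > 0)


# With d taking values in {0, 1} and eps = 0.5, the maximum distortion
# constraint only admits the identity channel Z = X.
identity = {0: {0: 1.0, 1: 0.0}, 1: {0: 0.0, 1: 1.0}}

# A mixed strategy that flips the input with probability 0.5 satisfies the
# average constraint but violates the maximum constraint.
flip_half = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.5, 1: 0.5}}

loss_max_constraint = expected_loss(identity)
loss_avg_constraint = expected_loss(flip_half)
```

In this toy case the randomized channel spends its whole budget in expectation and induces a strictly larger loss than the only channel allowed under the maximum constraint, which is the kind of gap the inequality above permits.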
2.1 Distributional Robustness
Since the objective
depends only on the joint distribution of the variables, the robust learning formulation is straightforward to generalize by instead considering the maximization over an arbitrary set of joint distributions . With a change of variable (replacing with to simplify presentation), this formulation becomes
Another particular case for is the Wasserstein-ball around a distribution , as given by
where is the 1-Wasserstein distance Santambrogio (2015); Villani (2009); Peyré and Cuturi (2019) for some ground metric (or in general a cost) on the space . Recall that the 1-Wasserstein distance is given by
where the set of couplings is defined as all joint distributions with the marginals and . Note that maximizing over is equivalent to maximizing over channels subject to the expected distortion constraint , where . Unlike the formulation considered in (2), this channel may also change the label . However, if modifying is prohibited by a distortion metric of the form
then the 1-Wasserstein distributionally robust learning formulation is equivalent to the earlier formulation in (2) with the average distortion constraint given by (4). Robust-ML with Wasserstein-ball constraints is also referred to as Distributionally Robust Optimization (DRO), which appeared in the seminal works of Blanchet and Murthy (2016); Blanchet et al. (2018, 2019); Gao and Kleywegt (2016); Gao et al. (2017) and has been used for Robust-ML applications by, e.g., Sinha et al. (2017); Lee and Raginsky (2018). In essence, it was shown that DRO is approximately equivalent to imposing Lipschitz constraints on the classifier Cranko et al. (2020); Gao et al. (2017), which can be incorporated into the optimization routine. However, there is no characterization of the optimal value of the min-max problem in this setting.
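For intuition about the Wasserstein-ball constraint, the sketch below computes the 1-Wasserstein distance in a deliberately simplified setting: two distributions on the integer grid 0, 1, ..., n-1 with ground metric |x - x'|, for which the optimal transport cost is known to equal the summed absolute difference of the two cumulative distribution functions. This one-dimensional shortcut, the function names, and the example distributions are assumptions for illustration; the joint-distribution setting in the text generally requires solving the coupling problem directly.

```python
from itertools import accumulate


def wasserstein_1d(p, q):
    """1-Wasserstein distance between two distributions on the grid 0, ..., n-1.

    Assumes the ground metric c(x, x') = |x - x'|, for which the optimal
    transport cost equals the summed absolute difference of the two CDFs.
    """
    if len(p) != len(q):
        raise ValueError("both distributions must share the same support")
    cdf_p = list(accumulate(p))
    cdf_q = list(accumulate(q))
    return sum(abs(a - b) for a, b in zip(cdf_p, cdf_q))


def in_wasserstein_ball(candidate, reference, radius):
    """Check whether `candidate` lies within `radius` of `reference`."""
    return wasserstein_1d(candidate, reference) <= radius


reference = [0.5, 0.5, 0.0]  # hypothetical reference distribution
shifted = [0.5, 0.0, 0.5]    # moves a mass of 0.5 by one grid step
```

Here `shifted` is at distance 0.5 from `reference`, so it belongs to a ball of radius 0.6 but not to one of radius 0.4.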
2.2 Optimal Robust Learning
The specifics of the loss function and model are crucial to the analysis. Hence, we will focus specifically on learning classification models, where represents the data features, represents class labels, and the model can be viewed as producing that aims to approximate the underlying posterior . When cross-entropy is the loss function, i.e., , the expected loss, with respect to some distribution , is given by
Thus, the principle of learning via minimizing the expected cross-entropy loss optimizes the approximate posterior toward the underlying posterior , and the loss is lower bounded by the conditional entropy , which is arguably nonzero for nontrivial classification problems.
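The lower bound by the conditional entropy follows from the standard decomposition of expected cross-entropy into the conditional entropy plus a conditional KL divergence between the true posterior and the approximate posterior. The following sketch checks this identity numerically; the joint distribution `p_xy` and the approximate posterior `q` are hypothetical values chosen only for illustration.

```python
import math

# Hypothetical joint distribution P(x, y) over two features and two labels.
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

# Hypothetical approximate posterior q(y | x), indexed as q[x][y].
q = {0: {0: 0.6, 1: 0.4}, 1: {0: 0.5, 1: 0.5}}


def marginal_x(p):
    """Marginal P(x) of a joint distribution P(x, y)."""
    m = {}
    for (x, _), mass in p.items():
        m[x] = m.get(x, 0.0) + mass
    return m


def posterior(p):
    """True posterior P(y | x), returned as a dict keyed by (x, y)."""
    m = marginal_x(p)
    return {(x, y): mass / m[x] for (x, y), mass in p.items()}


def cross_entropy(p, rule):
    """Expected loss E_P[-log rule(Y | X)] in nats."""
    return sum(mass * -math.log(rule[x][y]) for (x, y), mass in p.items())


def conditional_entropy(p):
    """Conditional entropy H(Y | X) in nats."""
    post = posterior(p)
    return sum(mass * -math.log(post[xy]) for xy, mass in p.items())


def conditional_kl(p, rule):
    """KL divergence between P(y | x) and rule(y | x), averaged over P(x)."""
    post = posterior(p)
    return sum(
        mass * math.log(post[(x, y)] / rule[x][y]) for (x, y), mass in p.items()
    )


loss = cross_entropy(p_xy, q)
entropy = conditional_entropy(p_xy)
gap = conditional_kl(p_xy, q)
```

Since the KL term is nonnegative and vanishes only when `q` matches the true posterior on the support of the data, `loss` equals `entropy + gap` and can never fall below `entropy`.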
The robust learning problem, given by
still critically depends on the specific parametric family (e.g., neural network architecture) chosen for the model , which determines the corresponding parametric family of approximate posteriors, i.e., . Motivated by the ultimate meta-objective of determining the best architectures for robust learning, we consider the idealized optimal robust learning formulation where the minimization is performed over all conditional distributions , as given by
which clearly lower-bounds (9), which is specific to the particular parametric family.
3 The Privacy-Utility Tradeoff Problem
In the information-theoretic treatment of the privacy-utility tradeoff problem, the random variables respectively denote useful and sensitive data, and the aim is to release data produced from a randomized algorithm specified by a channel , while simultaneously preserving privacy with respect to the sensitive variable and maintaining utility with respect to the useful variable . Although privacy can be quantified in various ways (cf. Issa et al. (2016); Liao et al. (2018); Rassouli and Gunduz (2019)), we will focus on a particular information-theoretic approach (see Rebollo-Monedero et al. (2010); Calmon and Fawaz (2012); Makhdoumi et al. (2014); Salamatian et al. (2015); Basciftci et al. (2016)) that utilizes mutual information to measure the privacy leakage, with the aim of making this small in order to preserve privacy. Utility is quantified with respect to a distortion function, , which is suitably chosen for the particular application. Minimizing (or limiting) the distortion captures the objective of maintaining the utility of the data release. Since the useful and sensitive data are correlated (and indeed the problem is uninteresting if they are independent), a tradeoff naturally emerges between the two objectives of preserving privacy and utility.
3.1 Optimal Privacy-Utility Tradeoff
where , the constraint , as given in (4), captures the expected distortion budget, and the equivalence follows from since is constant. Similarly, one could consider the alternative maximum distortion constraint , given in (3).
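The equivalence invoked above rests on the standard identity relating mutual information and conditional entropy. As a sketch, with $Y$ assumed to denote the sensitive variable, $Z$ the release, and $\mathcal{D}$ used as a placeholder name for the feasible set of channels $P$ (these symbol names are assumptions made for illustration):

```latex
\begin{align*}
I(Y; Z) &= H(Y) - H(Y \mid Z), \\
\arg\min_{P \in \mathcal{D}} I(Y; Z) &= \arg\max_{P \in \mathcal{D}} H(Y \mid Z).
\end{align*}
```

The second line holds because $H(Y)$ is fixed by the data distribution and does not depend on the choice of channel, so minimizing the leakage and maximizing the conditional entropy select the same mechanisms.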
3.2 Adversarial Formulation of Privacy
In Calmon and Fawaz (2012), the privacy-utility problem in (11) is derived from a broader perspective that poses privacy as maximizing the loss of an adversary that mounts a statistical inference attack attempting to recover the sensitive from the release . Their framework considers an adversary that can observe the release and choose a conditional distribution to minimize its expected loss. As observed in Calmon and Fawaz (2012), when cross-entropy (or “self-information”) is the loss, we have that
3.3 Connections to Rate-Distortion Theory
The privacy-utility tradeoff problem is also highly related to rate-distortion theory, which considers the efficiency of lossy data compression. When , the optimization problem in (11) immediately reduces to the single-letter characterization of the optimal rate-distortion tradeoff. However, the privacy problem considers an inherently single-letter scenario, where we deal with just a single instance of the variables , which could naturally be high-dimensional, but have no restrictions placed on their statistical structure across these dimensions. Another related approach Yamamoto (1983); Sankar et al. (2013) considers an asymptotic coding formulation that replaces
with vectors of iid samples and also adds coding efficiency into the consideration of a three-way rate-privacy-utility tradeoff.
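As a concrete instance of the single-letter rate-distortion characterization mentioned above, consider a Bernoulli source under Hamming distortion, for which the rate-distortion function has the well-known closed form R(D) = h(p) - h(D) for 0 <= D < min(p, 1 - p) and zero otherwise, where h is the binary entropy. The sketch below evaluates it in nats; the function names are hypothetical and the example is only meant to make the special case tangible.

```python
import math


def binary_entropy(p):
    """Binary entropy h(p) in nats, with h(0) = h(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log(p) - (1 - p) * math.log(1 - p)


def bernoulli_rate_distortion(p, distortion):
    """Rate-distortion function of a Bernoulli(p) source under Hamming distortion."""
    if not 0.0 <= p <= 1.0 or distortion < 0.0:
        raise ValueError("need 0 <= p <= 1 and distortion >= 0")
    if distortion >= min(p, 1 - p):
        return 0.0  # the distortion budget is met without conveying information
    return binary_entropy(p) - binary_entropy(distortion)
```

For a fair coin, `bernoulli_rate_distortion(0.5, 0.0)` recovers the full source entropy and the rate decreases to zero as the allowed distortion approaches one half.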
4 Main Results : Duality between Optimal Robust Learning and Privacy-Utility Tradeoffs
The solution to the optimal minimax robust learning problem can be found via a maximum conditional entropy problem related to the privacy-utility tradeoff problem.
For any finite sets and , and closed, convex set of joint distributions , we have
where the expectations and entropy are with respect to . Further, the solutions for that minimize (13) are given by
See Appendix in the supplementary material. ∎
Intuitively, the optimal minimax robust decision rule that solves (13) must be consistent with the posterior corresponding to the solution of the maximum conditional entropy problem in (15). However, a given posterior is well-defined only over the support of the marginal distribution of , whereas the robust decision rule needs to be defined over the entire space . Hence, generally, determining the robust decision rule over the entirety of requires considering the solution set in (16), which seems cumbersome, but can be simplified in many cases via the following corollary.
In the simplest case, if there exists a that has full support over (in the marginal distribution for ), then the optimal robust decision rule that solves the minimization of (13) is simply given by the posterior , which is defined for all .
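The statement above can be checked numerically on a small example. The sketch below takes the uncertainty set to be the convex hull of two hypothetical joint distributions `P0` and `P1`, finds the maximum conditional entropy distribution by grid search, and compares the worst-case loss of its posterior with that of a rule tuned only to `P0`. The distributions, the grid resolution, and all function names are assumptions for illustration; this is a numerical sanity check, not the proof technique of the paper.

```python
import math

# Two hypothetical joint distributions P(x, y); the uncertainty set is their
# convex hull {(1 - lam) * P0 + lam * P1 : 0 <= lam <= 1}.
P0 = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
P1 = {(0, 0): 0.1, (0, 1): 0.4, (1, 0): 0.4, (1, 1): 0.1}


def mix(lam):
    """Point of the uncertainty set at mixing weight lam."""
    return {xy: (1 - lam) * P0[xy] + lam * P1[xy] for xy in P0}


def posterior(p):
    """Posterior P(y | x) of a joint distribution, keyed by (x, y)."""
    marg = {}
    for (x, _), mass in p.items():
        marg[x] = marg.get(x, 0.0) + mass
    return {(x, y): mass / marg[x] for (x, y), mass in p.items()}


def cross_entropy(p, rule):
    """Expected loss E_p[-log rule(Y | X)] in nats."""
    return sum(mass * -math.log(rule[xy]) for xy, mass in p.items())


def conditional_entropy(p):
    """H(Y | X) under p, i.e. the loss of p's own posterior."""
    return cross_entropy(p, posterior(p))


def worst_case_loss(rule):
    """Max loss over the hull; the loss is linear in p, so an endpoint attains it."""
    return max(cross_entropy(P0, rule), cross_entropy(P1, rule))


# Grid search for the maximum conditional entropy distribution.
grid = [k / 1000 for k in range(1001)]
lam_star = max(grid, key=lambda lam: conditional_entropy(mix(lam)))
h_star = conditional_entropy(mix(lam_star))

robust_rule = posterior(mix(lam_star))  # posterior of the maximizer
nominal_rule = posterior(P0)            # rule tuned to P0 only

robust_worst = worst_case_loss(robust_rule)
nominal_worst = worst_case_loss(nominal_rule)
```

In this symmetric example the maximizer has full support, its posterior is the robust rule, and the worst-case loss of that rule coincides with the maximum conditional entropy, whereas the rule tuned to `P0` incurs a larger worst-case loss.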
4.1 Generalization to Arbitrary Alphabets
Extending the result in the previous section to continuous requires one to expand the set of allowable Markov kernels, i.e., conditional probabilities, to what is referred to as the set of generalized decision rules in statistical decision theory Strasser (2011); LeCam (1955); Cam (1986); Vaart (2002). This is because the set of Markov kernels is not compact, while the set of generalized decision rules is. For any , set of bounded continuous functions, and any bounded signed measure on , given a mapping (interpret this as a measurable function over for each fixed ), define a bilinear functional via,
Strasser (2011) A generalized decision function is a bilinear function that satisfies, (a) , (b) , (c) .
Define the set of generalized decision rules as the set of bi-linear functions defined via (17) and satisfying the properties (a), (b), (c) above.
Applying these results, we obtain the following theorem for the case of general alphabets . Note that in contrast to Theorem 1, here the results hold with instead of .
Under the paradigm of Theorem 1, for continuous alphabets and discrete ,
The result then follows by noting that, . This result implies that even in the case of continuous alphabets, the worst-case, algorithm-independent adversarial perturbation can be computed by solving for . ∎
5 Implications of the main results
5.1 A fixed point characterization of the worst case perturbation
We consider the particular case when is the Wasserstein-ball around a distribution :
and derive the necessary conditions for optimality for the solution to , where by the subscript in the conditional entropy we highlight the fact that the conditional entropy is computed under the joint distribution . To this end we adopt a Lagrangian viewpoint and we assume that and are continuous bounded and compact sets, but the result can be seen to hold true when is continuous and is discrete. The result is summarized in the Theorem below.
If the cost is continuous with continuous first derivative and the distribution is supported on the whole of the domain , the optimal solution to for some satisfies,
where is the Kantorovich potential (the variable of optimization in the dual problem to the optimal transport problem; we refer the reader to Santambrogio (2015); Villani (2009) and Peyré and Cuturi (2019) for these definitions and notions related to the theory of optimal transport) corresponding to the optimal solution to the transport problem from to under the ground cost , capital is a constant, is a uniform distribution over , and is the marginal distribution under the joint .
See Appendix in the supplementary material. ∎
This characterization ties closely the geometry of the perturbations (as reflected via the Kantorovich Potential) with the worst case distribution that maximizes the conditional entropy.
The algorithmic implications of this fixed point relation will be explored in an upcoming manuscript.
5.2 Robustness vs Clean Data Loss Tradeoffs
A natural question to ask is whether robustness comes at a price. It has been observed empirically that robust models will underperform on clean data in comparison to conventional, non-robust models. To understand why this is fundamentally unavoidable, we examine the loss for robust and non-robust models in combination with clean data or under adversarial attack.
Let denote the unperturbed (clean data) distribution within the set of potential adversarial attacks . For a given decision rule and distribution , recall that the cross-entropy loss is given by (8) as
The baseline loss of the ideal non-robust model for clean data is given by
Under adversarial attack, the ideal loss of the robust model is given by Theorem 1 as
The KL-divergence term must be finite, since we have
where the second inequality follows from being the minimax solution.
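The quantities compared in this subsection can be collected in one place. As a sketch, with $P_0$ assumed to denote the clean data distribution, $\Gamma$ the set of potential attacks containing $P_0$, and $q^*$ the minimax robust decision rule (these symbol names are assumptions made for illustration), the decomposition of the cross-entropy loss and Theorem 1 give:

```latex
\begin{align*}
\mathbb{E}_{P_0}\big[-\log q^*(Y \mid X)\big]
  &= H_{P_0}(Y \mid X)
   + \mathrm{KL}\big(P_{0, Y \mid X} \,\big\|\, q^* \,\big|\, P_{0, X}\big), \\
H_{P_0}(Y \mid X)
  \;\le\; \mathbb{E}_{P_0}\big[-\log q^*(Y \mid X)\big]
  &\le \max_{P \in \Gamma} \mathbb{E}_{P}\big[-\log q^*(Y \mid X)\big]
  = \max_{P \in \Gamma} H_{P}(Y \mid X).
\end{align*}
```

The first line shows that the clean data loss of the robust rule exceeds the non-robust baseline by exactly the KL term, and the second line bounds that loss by the maximum conditional entropy, which is finite for finite label alphabets.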
We numerically evaluate these tradeoffs by considering a family of Wasserstein-ball constraint sets , as given by (6), with varying radius around a distribution over finite alphabets . The ground metric is of the form given in (7), which effectively limits the perturbation to only changing within an expected squared-distance distortion constraint of , as equivalent to (4). The distribution was randomly chosen, and has entropies and (in nats).
across a range of distortion constraints . In combination with each decision rule, we consider the loss under attacks at varying distortion limits , as given by
Figure 2 plots the loss across the combination of and . On the left of Figure 2, each curve is a fixed attack distortion , over which the decision rule is varied, with the optimal loss obtained when . As increases, the loss for all curves converges to . On the right of Figure 2, the dotted black curve is the maximum conditional entropy over at each , which corresponds to the ideal robust loss when . Each of the other curves is a fixed decision rule , over which the attack distortion is varied, and exhibits suboptimal loss for mismatched . The beginning of each curve, at , is the clean data loss for each rule, and we can see that clean data loss degrades as robustness to higher distortions improves. In the extreme of a decision rule designed to be robust for very high , the loss is uniformly equal to across all , since this robust decision rule simply guesses the prior .
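The flat loss of a rule that guesses the prior can be verified directly: a rule that ignores the features and outputs the label marginal incurs a cross-entropy loss equal to the label entropy under any joint distribution with that same label marginal. The sketch below checks this for two hypothetical joint distributions, `p_clean` and `p_attacked`, that share a label marginal but differ in how the features depend on the label; both distributions are illustrative assumptions and are unrelated to the data behind Figure 2.

```python
import math

# Two hypothetical joint distributions P(x, y) that share the label marginal
# P(y) = (0.6, 0.4) but differ in how the features depend on the label.
p_clean = {(0, 0): 0.5, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.3}
p_attacked = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.2}


def label_marginal(p):
    """Marginal P(y) of a joint distribution P(x, y)."""
    m = {}
    for (_, y), mass in p.items():
        m[y] = m.get(y, 0.0) + mass
    return m


def loss_of_prior_rule(p, prior):
    """Cross-entropy loss of the rule q(y | x) = prior(y), which ignores x."""
    return sum(mass * -math.log(prior[y]) for (_, y), mass in p.items())


def entropy(dist):
    """Shannon entropy in nats of a distribution given as a dict of masses."""
    return sum(-mass * math.log(mass) for mass in dist.values() if mass > 0)


prior = label_marginal(p_clean)
label_entropy = entropy(prior)
clean_loss = loss_of_prior_rule(p_clean, prior)
attacked_loss = loss_of_prior_rule(p_attacked, prior)
```

Both losses equal the label entropy, which is consistent with a perturbation that may move the features but, under a ground metric of the form in (7), cannot change the labels.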
Since this is a theory paper on robust learning, which addresses the threat posed by adversarial example attacks, we do not expect short-term ethical or societal consequences. The potential long-term upside of our work is that better theoretical understanding of these issues may lead to the development and application of more resilient machine learning technology to better address safety, security, and reliability concerns. A corresponding risk is that progress toward expanding fundamental knowledge may also be leveraged to realize more sophisticated attacks that may undermine already widely deployed AI systems. However, the advancement of attacks is perhaps inevitable, and, hence, research into defenses must be conducted.
- Szegedy et al.  Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In International Conference on Learning Representations, 2014. URL http://arxiv.org/abs/1312.6199.
- Goodfellow et al.  Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations, 2015. URL http://arxiv.org/abs/1412.6572.
- Sharif et al.  Mahmood Sharif, Sruti Bhagavatula, Lujo Bauer, and Michael K Reiter. Accessorize to a crime: Real and stealthy attacks on state-of-the-art face recognition. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 1528–1540, 2016.
- Kurakin et al.  Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial examples in the physical world. ICLR Workshop, 2017. URL https://arxiv.org/abs/1607.02533.
- Moosavi-Dezfooli et al.  Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, Omar Fawzi, and Pascal Frossard. Universal adversarial perturbations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1765–1773, 2017.
- Eykholt et al.  Kevin Eykholt, Ivan Evtimov, Earlence Fernandes, Bo Li, Amir Rahmati, Chaowei Xiao, Atul Prakash, Tadayoshi Kohno, and Dawn Song. Robust physical-world attacks on deep learning visual classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1625–1634, 2018.
- Athalye et al. [2018a] Anish Athalye, Logan Engstrom, Andrew Ilyas, and Kevin Kwok. Synthesizing robust adversarial examples. In International Conference on Machine Learning, pages 284–293, 2018a.
- Van Ranst et al.  Wiebe Van Ranst, Simen Thys, and Toon Goedemé. Fooling automated surveillance cameras: adversarial patches to attack person detection. In CVPR Workshop on The Bright and Dark Sides of Computer Vision: Challenges and Opportunities for Privacy and Security, 2019.
- Li et al.  Juncheng B Li, Frank R Schmidt, and J Zico Kolter. Adversarial camera stickers: A physical camera attack on deep learning classifier. arXiv preprint arXiv:1904.00759, 2019.
- Carlini and Wagner [2017a] Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pages 39–57, 2017a.
- Carlini and Wagner [2017b] Nicholas Carlini and David Wagner. Adversarial examples are not easily detected: Bypassing ten detection methods. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pages 3–14, 2017b.
- Athalye et al. [2018b] Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In International Conference on Machine Learning, pages 274–283, 2018b.
- Madry et al.  Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations (ICLR), 2018. URL https://arxiv.org/abs/1706.06083.
- Huang et al. [2017a] Xiaowei Huang, Marta Kwiatkowska, Sen Wang, and Min Wu. Safety verification of deep neural networks. In International Conference on Computer Aided Verification, pages 3–29, 2017a.
- Katz et al.  Guy Katz, Clark Barrett, David L Dill, Kyle Julian, and Mykel J Kochenderfer. Reluplex: An efficient smt solver for verifying deep neural networks. In International Conference on Computer Aided Verification, pages 97–117, 2017.
- Ehlers  Ruediger Ehlers. Formal verification of piece-wise linear feed-forward neural networks. In International Symposium on Automated Technology for Verification and Analysis, pages 269–286, 2017.
- Cheng et al.  Chih-Hong Cheng, Georg Nührenberg, and Harald Ruess. Maximum resilience of artificial neural networks. In International Symposium on Automated Technology for Verification and Analysis, pages 251–268, 2017.
- Tjeng et al.  Vincent Tjeng, Kai Xiao, and Russ Tedrake. Evaluating robustness of neural networks with mixed integer programming. In International Conference on Learning Representations, 2019.
- Wong and Kolter  Eric Wong and Zico Kolter. Provable defenses against adversarial examples via the convex outer adversarial polytope. In International Conference on Machine Learning, pages 5283–5292, 2018.
- Wong et al.  Eric Wong, Frank Schmidt, Jan Hendrik Metzen, and J Zico Kolter. Scaling provable adversarial defenses. In Advances in Neural Information Processing Systems, pages 8400–8409, 2018.
- Raghunathan et al. [2018a] Aditi Raghunathan, Jacob Steinhardt, and Percy Liang. Certified defenses against adversarial examples. In International Conference on Machine Learning, 2018a.
- Raghunathan et al. [2018b] Aditi Raghunathan, Jacob Steinhardt, and Percy S Liang. Semidefinite relaxations for certifying robustness to adversarial examples. In Advances in Neural Information Processing Systems, pages 10877–10887, 2018b.
- Wong et al.  Eric Wong, Frank R Schmidt, and J Zico Kolter. Wasserstein adversarial examples via projected sinkhorn iterations. arXiv preprint arXiv:1902.07906, 2019.
- Lecuyer et al.  Mathias Lecuyer, Vaggelis Atlidakis, Roxana Geambasu, Daniel Hsu, and Suman Jana. Certified robustness to adversarial examples with differential privacy. arXiv preprint arXiv:1802.03471, 2018.
- Li et al.  Bai Li, Changyou Chen, Wenlin Wang, and Lawrence Carin. Second-order adversarial attack and certifiable robustness. arXiv preprint arXiv:1809.03113, 2018.
- Cohen et al.  Jeremy M Cohen, Elan Rosenfeld, and J Zico Kolter. Certified adversarial robustness via randomized smoothing. arXiv preprint arXiv:1902.02918, 2019.
- Rebollo-Monedero et al.  David Rebollo-Monedero, Jordi Forne, and Josep Domingo-Ferrer. From t-closeness-like privacy to postrandomization via information theory. IEEE Transactions on Knowledge and Data Engineering, 22(11):1623–1636, 2010.
- Calmon and Fawaz  Flávio du Pin Calmon and Nadia Fawaz. Privacy against statistical inference. In 2012 50th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 1401–1408, 2012.
- Makhdoumi et al.  Ali Makhdoumi, Salman Salamatian, Nadia Fawaz, and Muriel Médard. From the information bottleneck to the privacy funnel. In 2014 IEEE Information Theory Workshop (ITW 2014), pages 501–505, 2014.
- Salamatian et al.  Salman Salamatian, Amy Zhang, Flavio du Pin Calmon, Sandilya Bhamidipati, Nadia Fawaz, Branislav Kveton, Pedro Oliveira, and Nina Taft. Managing your private and public data: Bringing down inference attacks against your privacy. IEEE Journal of Selected Topics in Signal Processing, 9(7):1240–1255, 2015.
- Basciftci et al.  Yuksel Ozan Basciftci, Ye Wang, and Prakash Ishwar. On privacy-utility tradeoffs for constrained data release mechanisms. In 2016 Information Theory and Applications Workshop (ITA), pages 1–6, 2016.
- Tripathy et al.  Ardhendu Tripathy, Ye Wang, and Prakash Ishwar. Privacy-preserving adversarial networks. arXiv preprint arXiv:1712.07008, 2017.
- Huang et al. [2017b] Chong Huang, Peter Kairouz, Xiao Chen, Lalitha Sankar, and Ram Rajagopal. Context-aware generative adversarial privacy. Entropy, 19(12):656, 2017b.
- Farnia and Tse  Farzan Farnia and David Tse. A minimax approach to supervised learning. In Advances in Neural Information Processing Systems, pages 4240–4248, 2016.
- Hamm and Mehra  Jihun Hamm and Akshay Mehra. Machine vs machine: Minimax-optimal defense against adversarial examples. arXiv preprint arXiv:1711.04368, 2017.
- Tsipras et al.  Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, and Aleksander Madry. Robustness may be at odds with accuracy. In International Conference on Learning Representations, 2019.
- Zhang et al.  Hongyang Zhang, Yaodong Yu, Jiantao Jiao, Eric Xing, Laurent El Ghaoui, and Michael Jordan. Theoretically principled trade-off between robustness and accuracy. In International Conference on Machine Learning, pages 7472–7482, 2019.
- Santambrogio  Filippo Santambrogio. Optimal Transport for Applied Mathematicians: Calculus of Variations, PDEs and Modeling. Springer, 2015. URL https://www.math.u-psud.fr/~filippo/OTAM-cvgmt.pdf.
- Villani  Cédric Villani. Optimal Transport: Old and New. Springer, Berlin, Heidelberg, 2009.
- Peyré and Cuturi  Gabriel Peyré and Marco Cuturi. Computational optimal transport. Foundations and Trends in Machine Learning, 11 (5-6):355–602, 2019. URL https://arxiv.org/abs/1803.00567.
- Blanchet and Murthy  Jose Blanchet and Karthyek Murthy. Quantifying Distributional Model Risk Via Optimal Transport. SSRN Electronic Journal, 2016. doi: 10.2139/ssrn.2759640.
- Blanchet et al.  Jose Blanchet, Karthyek Murthy, and Fan Zhang. Optimal Transport Based Distributionally Robust Optimization: Structural Properties and Iterative Schemes. arXiv preprint arXiv:1810.02403, 2018.
- Blanchet et al.  Jose Blanchet, Karthyek Murthy, and Nian Si. Confidence Regions in Wasserstein Distributionally Robust Estimation. arXiv preprint arXiv:1906.01614, 2019.
- Gao and Kleywegt  Rui Gao and Anton J Kleywegt. Distributionally Robust Stochastic Optimization with Wasserstein Distance. arXiv preprint arXiv:1604.02199, 2016.
- Gao et al.  Rui Gao, Xi Chen, and Anton J. Kleywegt. Wasserstein distributional robustness and regularization in statistical learning. CoRR, abs/1712.06050, 2017. URL http://arxiv.org/abs/1712.06050.
- Sinha et al.  Aman Sinha, Hongseok Namkoong, Riccardo Volpi, and John Duchi. Certifying Some Distributional Robustness with Principled Adversarial Training. arXiv e-prints, art. arXiv:1710.10571, October 2017.
- Lee and Raginsky  Jaeho Lee and Maxim Raginsky. Minimax statistical learning with wasserstein distances. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 2687–2696. Curran Associates, Inc., 2018. URL http://papers.nips.cc/paper/7534-minimax-statistical-learning-with-wasserstein-distances.pdf.
- Cranko et al.  Zac Cranko, Zhan Shi, Xinhua Zhang, Richard Nock, and Simon Kornblith. Generalised Lipschitz Regularisation Equals Distributional Robustness. arXiv preprint arXiv:2002.04197, 2020.
- Issa et al.  Ibrahim Issa, Sudeep Kamath, and Aaron B Wagner. An operational measure of information leakage. In 2016 Annual Conference on Information Science and Systems (CISS), pages 234–239, 2016.
- Liao et al.  Jiachun Liao, Oliver Kosut, Lalitha Sankar, and Flavio P Calmon. A tunable measure for information leakage. In 2018 IEEE International Symposium on Information Theory (ISIT), pages 701–705, 2018.
- Rassouli and Gunduz  Borzoo Rassouli and Deniz Gunduz. Optimal utility-privacy trade-off with total variation distance as a privacy measure. IEEE Transactions on Information Forensics and Security, 2019.
- Yamamoto  Hirosuke Yamamoto. A source coding problem for sources with additional outputs to keep secret from the receiver or wiretappers. IEEE Transactions on Information Theory, 29(6):918–923, 1983.
- Sankar et al.  Lalitha Sankar, S Raj Rajagopalan, and H Vincent Poor. Utility-privacy tradeoffs in databases: An information-theoretic approach. IEEE Transactions on Information Forensics and Security, 8(6):838–852, 2013.
- Strasser  Helmut Strasser. Mathematical theory of statistics: statistical experiments and asymptotic decision theory, volume 7. Walter de Gruyter, 2011.
- LeCam  L. LeCam. An extension of Wald’s theory of statistical decision functions. Ann. Math. Statist., 26(1):69–81, 03 1955. doi: 10.1214/aoms/1177728594. URL https://doi.org/10.1214/aoms/1177728594.
- Cam  L.L. Cam. Asymptotic Methods in Statistical Decision Theory. Springer series in statistics. Springer My Copy UK, 1986. ISBN 9781461249474. URL https://books.google.com/books?id=BcDxoAEACAAJ.
- Vaart  Aad van der Vaart. The statistical work of lucien le cam. Ann. Statist., 30(3):631–682, 06 2002. doi: 10.1214/aos/1028674836. URL https://doi.org/10.1214/aos/1028674836.
- Pollard  David Pollard. Asymptopia. Unpublished manuscript, 2003. URL http://www.stat.yale.edu/~pollard/Courses/602.spring07/MmaxThm.pdf.
- Rudin  Walter Rudin. Principles of Mathematical Analysis. McGraw-Hill, 1964.
Supplementary Material for “Robust Machine Learning via Privacy/Rate-Distortion Theory”
6 Proof of Theorem 1
The relations in (15) and the existence of the maxima and minimum in (14) and (15) follow from a straightforward generalization of Lemma 1. The rest of the proof follows the same general steps as the proof of a generalized minimax theorem given by Pollard (2003), adapted for minima and maxima rather than infima and suprema.
For convenience, we define
Note that is linear in for fixed , and convex in for fixed . Further, for all , and is compact, convex, and nonempty.
We only need to show that (13) is less than or equal to (14), which would follow from , which is equivalent to (16). Since each is compact, it is sufficient to show that for every finite subset [Rudin, 1964, Thm. 2.36]. We will first show this for any two-point set , and later extend this to every finite set through an inductive argument.
Suppose , then a contradiction would occur if we can show that there exists such that for all ,
since then , where .
The supremum is , since and , from the assumption . For (24) to hold for all , we must also require
Since (27) is immediate if either or , we need only consider when both and . Define such that
and let . Since is convex in , , which implies that hence (since we assumed that they are disjoint), which further implies that
which implies (27) and the existence of , which contradicts the assumption that .
The pairwise result implies that for any finite set , for . Then, we can repeat the argument starting from (22) with further restricted to , i.e., replacing in subsequent steps with , which effectively redefines (23) with , and eventually leads to for . Thus, repeating this argument further yields that for any finite subset , which, as argued earlier, implies (16). ∎
7 Proof of Theorem 3
All the proof steps assume continuous and compact but it is easy to see that the steps hold true for discrete and finite and continuous . We begin with the following definition, which is taken from Chapter 7 of Santambrogio (2015).
Given a functional , if is a regular point (see Chapter 7 of Santambrogio (2015) for the definition of a regular point) of , and for any perturbation , one calls the first variation of if
It can be seen that the first variations are unique up to a constant. The proof then follows from the following two lemmas.
Santambrogio (2015) The first variation of the optimal transport cost with respect to is given by the Kantorovich potential, , provided it is unique. A sufficient condition for uniqueness of is that the cost is continuous with continuous first derivative and is supported on the whole of the domain.
The first variation of the conditional entropy function defined by
is given by , where is a uniform distribution over and is the marginal over under the joint .
Notation: In the following to be concise and avoid a cumbersome notation we will often not explicitly write but just use . On the other hand we will keep explicit the notation so as to not lose sight of it.
By definition consider a perturbation around and let us look at
where . Let us focus on the first term.