Designing classifiers that are robust to small perturbations of test instances has emerged as a challenging task in machine learning. The goal of robust learning is to design classifiers that still correctly predict the true label even if the input is perturbed minimally to a “close” instance. In fact, it was shown (Szegedy et al., 2014; Biggio et al., 2013; Goodfellow et al., 2015) that many learning algorithms, and in particular DNNs, are highly vulnerable to such small perturbations, and thus adversarial examples can be found successfully. Since then, the machine learning community has been actively working to address this problem with many new defenses (Papernot et al., 2016; Madry et al., 2018; Biggio & Roli, 2018) and novel and powerful attacks (Carlini & Wagner, 2017; Athalye et al., 2018).
Do adversarial examples always exist?
This state of affairs suggests that perhaps the existence of adversarial examples is due to fundamental reasons that might be inevitable. A sequence of works (Gilmer et al., 2018; Fawzi et al., 2018; Diochnos et al., 2018; Mahloujifar et al., 2019; Shafahi et al., 2018; Dohmatob, 2018) shows that for natural theoretical distributions (e.g., isotropic Gaussian of dimension n) and natural metrics over them, adversarial examples are inevitable. Namely, the concentration of measure phenomenon (Ledoux, 2001; Milman & Schechtman, 1986) in such metric probability spaces implies that small perturbations are enough to map almost all instances into a close instance that is misclassified. This line of work, however, does not yet say anything about “natural” distributions of interest such as images or voice, as the precise nature of such distributions is yet to be understood.
Can lessons from cryptography help?
Given the pessimistic state of affairs, researchers have asked if we could use lessons from cryptography to make progress on this problem (Madry, 2018; Goldwasser, 2018; Mahloujifar & Mahmoody, 2018). Indeed, numerous cryptographic tasks (e.g. encryption of long messages) can only be realized against attackers that are computationally bounded. In particular, we know that all encryption methods that use a short key to encrypt much longer messages are insecure against computationally unbounded adversaries. However, when restricted to computationally bounded adversaries this task becomes feasible and suffices for numerous settings. This insight has been extremely influential in cryptography. Nonetheless, despite attempts to build on this insight in the learning setting, we have virtually no evidence on whether this approach is promising. Thus, we ask:
Could we even hope to leverage computational hardness for adversarially robust learning?
Taking a step toward realizing this vision, we provide formal definitions for computational variants of robust learning. Following the cryptographic literature, we provide a game-based definition of computationally robust learning. Very roughly, a game-based definition consists of two entities, a challenger and an attacker, that interact with each other. In our case, as the first step the challenger generates independent samples from the distribution at hand, uses those samples to train a learning algorithm, and obtains a hypothesis. Additionally, the challenger samples a fresh challenge sample from the underlying distribution. Next, the challenger provides the attacker with the challenge sample and with oracle access to the hypothesis. At the end of this game, the attacker outputs a perturbed instance to the challenger. The execution is declared a “WIN” for the attacker if its output is obtained as a small perturbation of the challenge instance and leads to a misclassification. We say that the learning is computationally robust as long as no attacker from a class of adversaries can “WIN” the above game with a probability much better than some base value. (See Definition 2.1.)
This definition is very general, and it implies various notions of security by restricting to various classes of attackers. While we focus on polynomially bounded attackers in this paper, we remark that one may also consider other natural classes of attackers based on the setting of interest (e.g., an attacker that can only modify certain parts of the image).
What if adversarial examples are actually easy to find?
Mahloujifar & Mahmoody (2019) studied this question, and showed that as long as the input instances come from a product distribution, and if the distances are measured in Hamming distance, adversarial examples with sublinear perturbations can be found in polynomial time. This result, however, did not say anything about other distributions or other metrics. Thus, it was left open whether computational hardness could be leveraged in any learning problem to guarantee its robustness.
1.1 Our Results
From computational hardness to computational robustness.
In this work, we show that computational hardness can indeed be leveraged to help robustness. In particular, we present a learning problem that has a classifier that is only computationally robust. In fact, let be any learning problem that has a classifier with “small” risk , but that adversarial examples exist for classifier with higher probability under the norm (e.g., could be any of the well-studied problems in the literature with a vulnerable classifier under norm ). Then, we show that there is a “related” problem and a related classifier that has computational risk (i.e., risk in the presence of computationally bounded tampering adversaries) at most , but the risk of will go up all the way to if the tampering attackers are allowed to be computationally unbounded. Namely, computationally bounded adversaries have a much smaller chance of finding adversarial examples of small perturbations for than computationally unbounded attackers do. (See Theorem 3.4.)
The computational robustness of the above construction relies on allowing the hypothesis to sometimes “detect” tampering and output a special tamper-detection symbol. The goal of the attacker is to make the hypothesis output a wrong label and not get detected. Therefore, we have proved, along the way, that allowing tamper detection can also be useful for robustness. Allowing tamper detection, however, is not always an option. For example, a real-time decision-making classifier (e.g., one classifying a traffic sign) has to output a label even if it detects that something might be suspicious about the input image. We prove that even in this case, there is a learning problem with binary labels and a classifier for it whose computational risk is almost zero, while its information theoretic risk is roughly 1/2, which makes the classifier’s decisions under attack meaningless. (See Theorem 3.9.)
In summary, our work provides credence that perhaps restricting attacks to computationally bounded adversaries holds promise for achieving computationally robust machine learning that relies on computational hardness assumptions as is currently done in cryptography.
From computational robustness back to computational hardness.
Our first result shows that computational hardness can be leveraged in some cases to obtain nontrivial computational robustness that beats information theoretic robustness. But what about the reverse direction: are computational hardness assumptions necessary for this goal? We also prove such a reverse direction and show that nontrivial computational robustness implies computationally hard problems in NP. In particular, we show that a non-negligible gap between the success probability of computationally bounded vs. that of unbounded adversaries in attacking the robustness of classifiers implies strong average-case hard distributions for the class NP. Namely, we prove that if the distribution of the instances in learning task is efficiently samplable, and if a classifier for this problem has computational robustness , information theoretic robustness , and , then one can efficiently sample from a distribution that generates Boolean formulas that are satisfiable with overwhelming probability, yet no efficient algorithm can find satisfying assignments of these formulas with a non-negligible probability. (See Theorem 4.2 for the formal statement.)
What world do we live in?
As explained above, our main question is whether adversarial examples could be prevented by relying on computational limitations of the adversary. In fact, even if adversarial examples exist for a classifier, we might be living in either of two completely different worlds. One is a world in which computationally bounded adversaries can find adversarial examples (almost) whenever they exist, and thus they would be as powerful as information-theoretic adversaries. Another world is one in which machine learning could leverage computational hardness. Our work suggests that there are problems for which this is true, and thus we are living in the better world. Whether or not we can achieve computational robustness for practical problems (such as image classification) that beats their information-theoretic robustness remains an intriguing open question. A related line of work (Bubeck et al., 2018b, a; Degwekar & Vaikuntanathan, 2019) studied other “worlds” that we might be living in, and studied whether adversarial examples are due to the computational hardness of learning robust classifiers. They designed learning problems demonstrating that in some worlds, robust classifiers might exist while being hard to obtain efficiently. We note, however, that the goals of those works and ours are quite different. They deal with how computational constraints might be an issue and prevent the learner from reaching its goal, while our focus is on how such constraints on adversaries can help us achieve robustness guarantees that are not achievable information theoretically.
Other related work.
In another line of work (Raghunathan et al., 2018; Wong & Kolter, 2018; Sinha et al., 2018; Wong et al., 2018) the notion of certifiable robustness was developed to prove robustness for individual test instances. More formally, they aim at providing robustness certificates with bounds along with a decision made on a test instance , with the guarantee that any at distance at most from is correctly classified. However, these guarantees, so far, are not strong enough to rule out attacks completely, as larger magnitudes of perturbation (than the levels certified) still can fool the classifiers while the instances look the same to the human.
Our main result of separating computational and information theoretic robustness (Theorem 3.4) is proved using Hamming distance over Boolean strings (of length n), which is equivalent to using the $\ell_0$ norm over the “noise” added to the input. For any Boolean “noise” vector $e$, we have $\|e\|_0 = \|e\|_p^p$ for any $p \geq 1$, where $\|e\|_p$ is $e$’s $\ell_p$ norm. So, we immediately get results for other norms as well. Our result of obtaining hardness from computational robustness (Theorem 4.2) is general and applies to any polynomial time computable metric or norm.
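As a worked check of why Hamming-distance results transfer to other norms on Boolean noise, the following short derivation may help. The symbols $e$ (the noise vector), $n$ (its length), and $k$ (its number of nonzero coordinates) are notation assumed here for illustration.

```latex
\[
e \in \{0,1\}^n,\quad k = \bigl|\{\, i : e_i = 1 \,\}\bigr|
\;\Longrightarrow\;
\|e\|_0 = k,
\qquad
\|e\|_p = \Bigl(\sum_{i=1}^{n} |e_i|^p\Bigr)^{1/p} = k^{1/p},
\qquad
\|e\|_p^p = k = \|e\|_0 .
\]
```

In particular, under these assumptions a Boolean perturbation of Hamming weight at most $b$ has $\ell_p$ norm at most $b^{1/p}$, and vice versa.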
We prove our main result about the possibility of computationally robust classifiers (Theorem 3.4) by “wrapping” an arbitrary learning problem with a vulnerable classifier by adding computational certification based on cryptographic digital signatures to test instances. A digital signature scheme (see Definition 3.1) operates based on a pair of generated keys: a signing key, which is private and is used for signing messages, and a verification key, which is public and is used for verifying signatures. Such schemes come with the guarantee that a computationally bounded adversary with the knowledge of the verification key cannot sign new messages on its own, even if it is given signatures on some previous messages. Digital signature schemes can be constructed based on the assumption that one-way functions exist. (Footnote 1: Here, we need signature schemes with “short” signatures of poly-logarithmic length over the security parameter. They could be constructed based on exponentially hard one-way functions (Rompel, 1990) by picking the security parameter sub-exponentially smaller than usual and using universal one-way hash functions to hash the message to poly-logarithmic length.) Below we describe the ideas behind this result in two steps.
(Initial Attempt). Suppose is the distribution over of a learning problem with input space and label space . Suppose had a hypothesis that can predict correct labels reasonably well, . Suppose, at the same time, that a (perhaps computationally unbounded) adversary can perturb test instances like into a close adversarial example that is now likely to be misclassified by ,
Now we describe a related problem , its distribution of examples , and a classifier for . To sample an example from , we first sample and then modify to by attaching a short signature to . The label of remains the same as that of . Note that will be kept secret to the sampling algorithm of . The new classifier will rely on the public parameter that is available to it. Given an input , first checks its integrity by verifying that the given signature is valid for . If the signature verification does not pass, rejects the input as adversarial without outputting a label, but if this test passes, outputs .
To successfully find an adversarial example for the new classifier through a small perturbation of a sampled (signed) instance, an adversary can pursue either of the following strategies. (I) One strategy is to find a new signature for the same original instance, which would constitute a sufficiently small perturbation as the signature is short. Doing so, however, is not considered a successful attack, as the label of the perturbed point remains the same as the true label of the untampered point. (II) Another strategy is to perturb the original-instance part into a close adversarial instance, then try to find a correct signature for it, and output the resulting pair. Doing so would be a successful attack, because the signature is short, and thus the output would indeed be a close instance to the sampled one. However, doing this is computationally infeasible, due to the very security definition of the signature scheme: the output pair is a forgery, which a computationally bounded adversary cannot construct. This means that the computational risk of the new classifier would remain at most roughly the risk of the original classifier.
We now observe that information theoretic (i.e., computationally unbounded) attackers can succeed in finding adversarial examples for with probability at least . In particular, such attacks can first find an adversarial example for (which is possible with probability over the sampled ), construct a signature for , and then output . Recall that an unbounded adversary can construct a signature for using exhaustive search.
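The exhaustive-search step available to an unbounded adversary can be sketched as follows. This is a minimal illustration under stated assumptions, not the construction of this paper: `toy_verify` is a hypothetical stand-in verifier that accepts exactly one bit string per message (it is not a secure signature scheme, since anyone can recompute the accepted string), and `brute_force_forge` is an assumed helper name. The only point is that a signature of `sig_len` bits can always be found with at most `2**sig_len` verification calls.

```python
import hashlib
from itertools import product
from typing import Callable, Optional, Tuple

Bits = Tuple[int, ...]


def toy_verify(vk: bytes, message: bytes, sig: Bits) -> bool:
    """Hypothetical verifier: accepts the first len(sig) bits of SHA-256(vk || message)."""
    digest = hashlib.sha256(vk + message).digest()
    expected = tuple((digest[i // 8] >> (7 - i % 8)) & 1 for i in range(len(sig)))
    return tuple(sig) == expected


def brute_force_forge(
    verify: Callable[[bytes, bytes, Bits], bool],
    vk: bytes,
    message: bytes,
    sig_len: int,
) -> Optional[Bits]:
    """Try all 2**sig_len candidate signatures and return the first that verifies."""
    for candidate in product((0, 1), repeat=sig_len):
        if verify(vk, message, candidate):
            return candidate
    return None


# A 12-bit "signature" is found after at most 4096 verification calls.
forged = brute_force_forge(toy_verify, b"verification-key", b"perturbed instance", sig_len=12)
```

The search cost doubles with every extra signature bit, which is the gap between bounded and unbounded attackers that the construction exploits.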
(Actual construction). One main issue with the above construction is that it needs to make publicly available, as a public parameter to the hypothesis (after it is sampled as part of the description of the distribution ). The other issue is that the distribution is not publicly samplable in polynomial time, because to get a sample from one needs to use the signing key , but that key is kept secret. We resolve these two issues with two more ideas. The first idea is that, instead of generating one pair of keys for and keeping secret, we can generate a fresh pair of keys every time that we sample and attach also to the actual instance . The modified hypothesis also uses this key and verifies using . This way, the distribution is publicly samplable, and moreover, there is no need for making available as a public parameter. However, this change of the distribution introduces a new possible way to attack the scheme and to find adversarial examples. In particular, now the adversary can try to perturb into a close string for which it knows a corresponding signing key , and then use to sign an adversarial example for and output . However, to make this attack impossible for the attacker under small perturbations of instances, we use error correction codes and employ an encoding of the verification key (instead of ) that needs too much change before one can fool a decoder to decode to any other . But as long as the adversary cannot change , the adversary cannot attack the robustness computationally. (See Construction 3.3.)
To analyze the construction above (see Theorem 3.4), we note that the computational robustness of can be as high as , because the encoded would need to change the encoded , and if remains the same it is hard computationally to do any attack beyond the original risk of the problem (that needs no adversarial perturbations). On the other hand, a computationally unbounded adversary can focus on perturbing into and then forge a short signature for it, which means bits of tampering, and this could be as small as .
Our construction above has the benefit that it could be defined as a wrapper around any natural vulnerable classifier. However, the computational robustness of the constructed classifier relies on sometimes detecting tampering attacks and not outputting a label. We give an alternative construction for a setting in which the classifier always has to output a label. We again use digital signatures as the main ingredient of our construction, though our construction is no longer wrapped around an arbitrary natural task. See Construction 3.8 for more details.
2 Defining Computational Risk and Computationally Robust Learning
We use calligraphic letters (e.g., ) for sets and capital non-calligraphic letters (e.g., ) for distributions. By we denote sampling from . For a randomized algorithm , we let denote the randomized execution of on input outputting . A classification problem is specified by the following components: set is the set of possible instances, is the set of possible labels, is a joint distribution over , and is the space of hypotheses. For simplicity we work with problems that have a single distribution (e.g., is the distribution of labeled images from a data set like MNIST or CIFAR-10). We do not state the loss function explicitly, as we work with classification problems and use the zero-one loss by default. For the fixed distribution, the risk or error of a hypothesis is . We are usually interested in learning problems with a specific metric defined over for the purpose of defining adversarial perturbations of bounded magnitude controlled by . In that case, we might simply write , but is implicitly defined over . Finally, for a metric over , we let be the ball of radius centered at under the metric . By default, we work with Hamming distance , but our definitions can be adapted to any other metric. We usually work with families of problems where determines the length of (and thus input lengths of ).
Allowing tamper detection.
In this work we expand the standard notion of hypotheses and allow to output a special symbol as well (without adding to ), namely we have . This symbol can be used to denote “out of distribution” points, or any form of tampering. In natural scenarios, when is not an adversarially tampered instance. However, we allow this symbol to be output by even in no-attack settings as long as its probability is small enough.
We follow the tradition of game-based security definitions in cryptography (Naor, 2003; Shoup, 2004; Goldwasser & Kalai, 2016; Rogaway & Zhang, 2018). Games are the most common way that security is defined in cryptography. These games are defined between a challenger and an adversary . Consider the case of a signature scheme. In this case the challenger runs the signature scheme and the adversary is given oracle access to the signing functionality (i.e., the adversary can give a message to the oracle and obtain the corresponding signature). The adversary wins the game if it can provide a valid signature on a message that was not queried to the oracle. The security of the signature scheme is then defined informally as follows: any probabilistic polynomial time/size adversary can win the game only with probability that is bounded by a negligible function of the security parameter. We describe a security game for tampering adversaries with bounded tampering budget in , but the definition is more general and can be used for other adversary classes.
Definition 2.1 (Security game of adversarially robust learning).
Let be a classification problem where the components are parameterized by . Let be a learning algorithm with sample complexity for . Consider the following game between a challenger , and an adversary with tampering budget .
samples i.i.d. examples and gets hypothesis where .
then samples a test example and sends to the adversary .
Having oracle (gates, in case of circuits) access to hypothesis and a sampler for , the adversary obtains the adversarial instance and outputs .
Winning conditions: In case , the adversary wins if , and in case , the adversary wins if all the following hold: (1) , (2) , and (3) . (Footnote 2: Therefore, if , without loss of generality, the adversary can output .)
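The game of Definition 2.1 can be sketched in code as follows. This is a minimal sketch under stated assumptions: instances are bit tuples compared in Hamming distance, `None` stands in for the special tamper-detection output, and all names (`play_game`, `toy_sample`, `toy_learner`, `flip_first_bit`) are hypothetical. The winning conditions follow the surrounding discussion: if the adversary returns the instance unchanged it wins whenever the hypothesis errs (including by reporting tampering); otherwise it must stay within budget, avoid detection, and cause a wrong label.

```python
import random
from typing import Callable, Optional, Tuple

Instance = Tuple[int, ...]
Label = Optional[int]  # None stands in for the tamper-detection output
Hypothesis = Callable[[Instance], Label]


def hamming(a: Instance, b: Instance) -> float:
    """Hamming distance; instances of different length count as infinitely far."""
    if len(a) != len(b):
        return float("inf")
    return sum(u != v for u, v in zip(a, b))


def play_game(sample, learner, adversary, m: int, budget: int, rng: random.Random) -> bool:
    """Run one execution of the game; return True iff the adversary wins."""
    train = [sample(rng) for _ in range(m)]      # challenger draws m i.i.d. examples
    h: Hypothesis = learner(train)               # and trains a hypothesis
    x, y = sample(rng)                           # fresh test example sent to the adversary
    x_adv = adversary(h, lambda: sample(rng), x, y)
    if x_adv == x:
        return h(x) != y                         # plain error, detection output included
    prediction = h(x_adv)
    return hamming(x, x_adv) <= budget and prediction is not None and prediction != y


# Toy problem (assumption): the label is the first bit of an 8-bit instance.
def toy_sample(rng: random.Random):
    x = tuple(rng.randint(0, 1) for _ in range(8))
    return x, x[0]


def toy_learner(train) -> Hypothesis:
    return lambda x: x[0]


def flip_first_bit(h, sampler, x, y):
    return (1 - x[0],) + x[1:]
```

With a budget of one bit the toy adversary always wins, and with a budget of zero it never does; estimating the win probability over many executions gives an empirical analogue of adversarial risk for a fixed attacker.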
Why separate the two winning cases?
One might wonder why we separate the winning condition into the case where the adversary leaves the instance unchanged and the case where it perturbs it. The reason is that the special output symbol is supposed to capture tamper detection. So, if the adversary does not change the instance and the hypothesis outputs the special symbol, this is an error, and thus should contribute to the risk. More formally, when we evaluate risk, we count every output that differs from the true label, which implicitly means that outputting the special symbol contributes to the risk. However, if the adversary’s perturbation of the instance leads the hypothesis to output the special symbol, the adversary has not succeeded in its attack, because the tampering is detected. In fact, if we simply required the other three conditions for the adversary to win, the notion of “adversarial risk” (see Definition 2.2) might be even less than the normal risk, which is counterintuitive.
Alternative definitions of winning for the adversary.
The winning condition for the adversary could be defined in other ways as well. In our Definition 2.1, the adversary wins if and . This condition is inspired by the notion of corrupted input of Feige et al. (2015), is extended to metric spaces in Madry et al. (2018), and is used in many subsequent works. An alternative definition for the adversary’s goal, formalized in Diochnos et al. (2018) and used in Gilmer et al. (2018); Diochnos et al. (2018); Bubeck et al. (2018a); Degwekar & Vaikuntanathan (2019), requires to be different from the true label of (rather than ). This condition requires the misclassification of , and thus, would belong to the “error-region” of . Namely, if we let be the ground truth function, the error-region security game requires . Another, stronger definition of adversarial risk is given by Suggala et al. (2018), in which the winning condition requires both of the following: (1) the ground truth should not change , and (2) is misclassified. For natural distributions like images or voice, where the ground truth is robust to small perturbations, all three of these definitions of the adversary’s winning are equivalent.
Stronger attack models.
In the attack model of Definition 2.1, we only provide the label of to the adversary and also give it a sample oracle for . A stronger attacker can have access to the “concept” function which is sampled from the distribution of given (according to ). This concept oracle might not be efficiently computable, even in scenarios where is efficiently samplable. In fact, even if is not efficiently samplable, just having access to a large enough pool of i.i.d. sampled data from is enough to run the experiment of Definition 2.1. In alternative winning conditions (e.g., the error-region definition) for Definition 2.1 discussed above, it makes more sense to also give the ground truth concept oracle to the adversary, as the adversary needs to achieve . Another way to strengthen the power of the adversary is to give it non-black-box access to the components of the game (see Papernot et al. (2017)). In Definition 2.1, by default, we model adversaries who have black-box access to , but one can define non-black-box (a.k.a. white-box) access to each of , if they are polynomial size objects.
Diochnos et al. (2018) focused on bounded perturbation adversaries that are unbounded in their running time and formalized notions of adversarial risk for a given hypothesis with respect to the -perturbing adversaries. Using Definition 2.1, in Definition 2.2, we retrieve the notions of standard risk, adversarial risk, and its (new) computational variant.
Definition 2.2 (Adversarial risk of hypotheses and learners).
Suppose is a learner for a problem . For a class of attackers we define
where the winning is in the experiment of Definition 2.1. When the attacker is fixed, we simply write . For a trivial attacker who outputs , it holds that . When includes attackers that are only bounded by perturbations, we use notation , and when the adversary is further restricted to all -size (oracle-aided) circuits, we use notation . When is a learner that outputs a fixed hypothesis , by substituting with , we obtain the following similar notions for : , , , and .
Definition 2.3 (Computationally robust learners and hypotheses).
Let be a family of classification problems parameterized by . We say that a learning algorithm is a computationally robust learner with risk at most against -perturbing adversaries, if for any polynomial , there is a negligible function such that
Again, when is a learner that outputs a fixed hypothesis for each , we say that is a computationally robust hypothesis with risk at most against -perturbing adversaries, if is so. In both cases, we might simply say that (or ) has computational risk at most .
Discussion (falsifiability of computational robustness).
If the learner is polynomial time, and the distribution is samplable in polynomial time (e.g., by sampling first and then using a generative model to generate for ), then the computational robustness of learners as defined based on Definitions 2.3 and 2.1 is a “falsifiable” notion of security as defined by Naor (2003). Namely, if an adversary claims that it can break the computational robustness of the learner , it can prove so in polynomial time by participating in a challenge-response game and winning in this game with a noticeable probability more than . This feature is due to the crucial property that the challenger in Definition 2.1 is a polynomial time algorithm itself, and thus can be run efficiently. Not all security games have efficient challengers (e.g., see Pandey et al. (2008)).
3 From Computational Hardness to Computational Robustness
In this section, we will first prove our main result that shows the existence of a learning problem with classifiers that are only computationally robust. We first prove our result by starting from any hypothesis that is vulnerable to adversarial examples; e.g., this could be any of the numerous algorithms shown to be susceptible to adversarial perturbations.
Before going over the constructions, we recall some useful tools.
3.1 Useful Tools
Definition 3.1 (One-time signature schemes).
A one-time signature scheme consists of three probabilistic polynomial time (PPT) algorithms
which satisfy the following properties:
Completeness: For every
Unforgeability: For every positive polynomial , for every and every pair of circuits with size the following probability is negligible in :
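For intuition about the interface in Definition 3.1, here is a minimal sketch of the classical Lamport one-time signature built from a hash function. It is an illustration only: the function names and argument order (`keygen`, `sign`, `verify`) are assumptions, SHA-256 is used as a stand-in one-way function, and Lamport signatures are long, so this scheme does not provide the short poly-logarithmic signatures that the construction in this paper requires.

```python
import hashlib
import secrets
from typing import List, Tuple

N_BITS = 256  # number of message-digest bits that get signed

SigningKey = List[Tuple[bytes, bytes]]
VerificationKey = List[Tuple[bytes, bytes]]


def _digest_bits(message: bytes) -> List[int]:
    """Hash the message and return its 256 digest bits."""
    d = hashlib.sha256(message).digest()
    return [(d[i // 8] >> (7 - i % 8)) & 1 for i in range(N_BITS)]


def keygen() -> Tuple[SigningKey, VerificationKey]:
    """Signing key: two random secrets per digest bit. Verification key: their hashes."""
    sk = [(secrets.token_bytes(32), secrets.token_bytes(32)) for _ in range(N_BITS)]
    vk = [(hashlib.sha256(a).digest(), hashlib.sha256(b).digest()) for a, b in sk]
    return sk, vk


def sign(sk: SigningKey, message: bytes) -> List[bytes]:
    """Reveal, for each digest bit, the secret selected by that bit (use a key only once)."""
    return [sk[i][bit] for i, bit in enumerate(_digest_bits(message))]


def verify(vk: VerificationKey, message: bytes, signature: List[bytes]) -> bool:
    """Check that every revealed secret hashes to the matching verification-key entry."""
    if len(signature) != N_BITS:
        return False
    bits = _digest_bits(message)
    return all(
        hashlib.sha256(s).digest() == vk[i][bit]
        for i, (s, bit) in enumerate(zip(signature, bits))
    )
```

Completeness corresponds to `verify(vk, m, sign(sk, m))` returning `True` for every message; unforgeability rests on the difficulty of inverting the hash on the unrevealed half of the signing key.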
Definition 3.2 (Error correction codes).
An error correction code with code rate and error rate consists of two algorithms and as follows.
The encode algorithm takes a Boolean string and outputs a Boolean string such that .
The decode algorithm takes a Boolean string and outputs either or a Boolean string . It holds that for all , and where , it holds that
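The encode/decode interface of Definition 3.2 can be illustrated with a repetition code. This is a sketch under stated assumptions: the names `encode` and `decode` and the repetition parameter `r` are hypothetical, `None` stands in for the decoder's failure output, and a repetition code is chosen only for simplicity. It tolerates up to `(r - 1) // 2` flipped bits in total for odd `r`, which is far weaker than the constant code rate and constant error rate that the construction in this paper assumes.

```python
from typing import List, Optional


def encode(bits: List[int], r: int) -> List[int]:
    """Repeat every message bit r times."""
    return [b for b in bits for _ in range(r)]


def decode(codeword: List[int], r: int) -> Optional[List[int]]:
    """Majority-decode each block of r bits; return None on a malformed codeword."""
    if r <= 0 or len(codeword) % r != 0:
        return None
    message = []
    for start in range(0, len(codeword), r):
        block = codeword[start:start + r]
        message.append(1 if 2 * sum(block) > r else 0)
    return message


msg = [1, 0, 1, 1]
codeword = encode(msg, 5)

lightly_corrupted = list(codeword)
lightly_corrupted[0] ^= 1
lightly_corrupted[1] ^= 1          # two flips: still decodes to msg

heavily_corrupted = list(codeword)
for i in range(3):
    heavily_corrupted[i] ^= 1      # three flips in one block: decoding changes
```

The design point mirrored here is that an adversary with a small tampering budget cannot make the decoder output a different string, which is what protects the encoded verification key in the construction.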
3.2 Computational Robustness with Tamper Detection
Our first construction uses a hypothesis with tamper detection (i.e., the capability to output the special symbol).
Construction 3.3 (Computational robustness using tamper detection).
Let be a learning problem and a classifier for such that . We construct a family of learning problems (based on the fixed problem ) with a family of classifiers . In our construction we use signature scheme for which the bit-length of is and the bit-length of signature is . (Footnote 3: Such signatures exist assuming exponentially hard one-way functions (Rompel, 1990).) We also use an error correction code with code rate and error rate .
The space of instances for is .
The set of labels is .
The distribution is defined by the following process: first sample , , , then encode and output .
The classifier is defined as
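The shape of the wrapped distribution and classifier, as described informally in the introduction (sample an example, generate fresh keys, sign the instance, attach the encoded verification key, and have the classifier verify before predicting), can be sketched as follows. This is a hedged sketch, not the formal construction: `sample_wrapped` and `wrap_classifier` are hypothetical names, `None` stands in for the tamper-detection output, and the `toy_*` primitives are insecure placeholders (the signing and verification keys coincide, and the code is the identity map) used only to make the example runnable.

```python
import hashlib
import secrets
from typing import Callable, Optional, Tuple

Instance = Tuple[int, ...]


def sample_wrapped(sample_xy, keygen, sign, encode):
    """Sample (x, y), sign x under a fresh key pair, and attach the encoded verification key."""
    x, y = sample_xy()
    sk, vk = keygen()
    return (x, sign(sk, x), encode(vk)), y


def wrap_classifier(h: Callable[[Instance], int], verify, decode):
    """Return a classifier that outputs h(x) only if the attached signature verifies."""
    def h_wrapped(instance) -> Optional[int]:
        x, sigma, encoded_vk = instance
        vk = decode(encoded_vk)
        if vk is None or not verify(vk, x, sigma):
            return None  # tampering detected
        return h(x)
    return h_wrapped


# Insecure placeholder primitives (assumptions for illustration only).
def toy_keygen():
    key = secrets.token_bytes(16)
    return key, key


def toy_sign(sk: bytes, x: Instance) -> bytes:
    return hashlib.sha256(sk + bytes(x)).digest()[:8]


def toy_verify(vk: bytes, x: Instance, sigma: bytes) -> bool:
    return sigma == hashlib.sha256(vk + bytes(x)).digest()[:8]


def toy_encode(vk: bytes) -> bytes:
    return vk


def toy_decode(encoded: bytes) -> Optional[bytes]:
    return encoded


classifier = wrap_classifier(lambda x: x[0], toy_verify, toy_decode)
instance, label = sample_wrapped(lambda: ((1, 0, 1, 1), 1), toy_keygen, toy_sign, toy_encode)
tampered = ((0, 0, 1, 1), instance[1], instance[2])  # perturb x, keep signature and key
```

Passing the primitives in as callables keeps the wrapper independent of any particular signature scheme or code, so real implementations could be substituted for the placeholders.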
Theorem 3.4. For family of Construction 3.3, the family of classifiers is computationally robust with risk at most against adversaries with budget . On the other hand, is not robust against information theoretic adversaries of budget , if is not robust to perturbations:
Theorem 3.4 means that the computational robustness of could be as large as (by choosing a code with constant error correction rate) while its information theoretic adversarial robustness could be as small as (note that is a constant here) by choosing a signature scheme with short signatures of poly-logarithmic length.
Proof of Theorem 3.4.
We first prove the following claim about the risk of .
Claim 3.5. For problem we have
The proof follows from the completeness of the signature scheme. We have,
Now we prove the computational robustness of .
Claim 3.6. For family , and for any polynomial there is a negligible function such that for all
Let be the family of circuits maximizing the adversarial risk for for all . We build a sequence of circuits , such that and are of size at most . just samples a random and outputs . gets and , calls to get and outputs . Note that can provide all the oracles needed to run if the sampler from , and are all computable by a circuit of polynomial size. Otherwise, we need to assume that our signature scheme is secure with respect to those oracles and the proof will follow. We have,
Note that implies that based on the error rate of the error correcting code. Also implies that is a valid signature for under verification key . Therefore, we have,
Thus, by the unforgeability of the one-time signature scheme we have
which by Claim 3.5 implies
Now we show that is not robust against computationally unbounded attacks.
Claim 3.7. For family and any we have
For any define where is the closest point to where and is a valid signature such that . Based on the fact that the size of signature is , we have Also, it is clear that because is a valid signature. Also, . Therefore we have
This concludes the proof of Theorem 3.4. ∎
3.3 Computational Robustness without Tamper Detection
The following theorem shows an alternative construction that is incomparable to Construction 3.3, as it does not use any tamper detection. On the downside, the construction is not defined with respect to an arbitrary (vulnerable) classifier of a natural problem.
Construction 3.8 (Computational robustness without tamper detection).
Let be a distribution over with a balanced “label” bit: . We construct a family of learning problems with a family of classifiers . In our construction we use a signature scheme for which the bit-length of is and the bit-length of signature is and an error correction code with code rate and error rate .
The space of instances for is .
The set of labels is .
The distribution is defined as follows: first sample , then sample and compute . Then compute . If sample a random that is not a valid signature of w.r.t . Then output . Otherwise compute and output .
The classifier is defined as
Theorem 3.9. For family of Construction 3.8, the family of classifiers has risk and is computationally robust with risk at most against adversaries of budget . On the other hand, is not robust against information theoretic adversaries of budget :
Note that reaching adversarial risk makes the classifier’s decisions meaningless as a random coin toss achieves this level of accuracy.
Proof of Theorem 3.9.
First it is clear that for problem we have . Now we prove the computational robustness of .
For family , and for any polynomial there is a negligible function such that for all
Similar to the proof of Claim 3.6, we prove this based on the security of the signature scheme. Let be the family of circuits maximizing the adversarial risk for for all . We build a sequence of circuits and such that and are of size at most . just asks the signature for . gets and does the following: It first samples , computes encodings and and if , it samples a random then calls on input to get . Then it checks all ’s and if there is any of them that it outputs , otherwise it aborts and outputs . If it aborts and outputs . Note that can provide all the oracles needed to run if the sampler from , and are all computable by a circuit of polynomial size. Otherwise, we need to assume that our signature scheme is secure with respect to those oracles and the proof will follow. We have,
Note that implies that and based on the error rate of the error correcting code. Also implies that . This is because if , the adversary has to make all the signatures invalid which is impossible with tampering budget . Therefore must be and one of the signatures in must pass the verification because the prediction of should be . Therefore we have
Thus, by the unforgeability of the one-time signature scheme we have
Now we show that is not robust against computationally unbounded attacks.
For family and any we have
For any define as follows: If , does nothing and outputs . If , search all possible signatures to find a signature such that . It then outputs . Based on the fact that the size of signature is , we have Also, it is clear that because the first signature is always a valid signature. Therefore we have