We consider the task of certifying the correctness of an image classifier, i.e. a system taking as input an image and categorising it. As a main example we will consider the MNIST classification task, which consists in categorising hand-written digits. Our experimental results are later reproduced for the drop-in dataset Fashion MNIST (XRV17).
The usual evaluation procedure consists in setting aside a validation set from the dataset and reporting the success percentage of the image classifier on that validation set. With this procedure, it is commonly accepted that the MNIST classification task is solved, with some convolutional networks achieving above 99.7% accuracy (see e.g. CMS12; pmlr-v28-wan13). However, further results suggest that even the best convolutional networks cannot be considered robust, given the persistence of adversarial examples: a small perturbation of an image from the dataset, invisible to the human eye, is enough to induce misclassification (SZSBEGF13).
This is a key motivation for the verification of neural networks: can we assert the robustness of a neural network, i.e. the absence of adversarial examples? This question has generated growing interest in the past years at the intersection of different research communities (see e.g. HKWW17; KBDJK17; WZCSHDBD18; GMDTCV18; MGV18; GKPB18; KHIJLLSTWZDK19), with a range of prototype tools achieving impressive results. The robustness question is formulated as follows: given an image $y$ and $\varepsilon > 0$, are all $\varepsilon$-perturbations of $y$ correctly classified?
We point to a weakness of the formalisation: it is local, meaning it is asserted for a given image (and then typically checked against a finite set of images). In this paper, we investigate a global approach for specifying the robustness of an image classifier. Let us start from the ultimate robustness objective, which reads:
For every category, for every real-life image of this category and for every perturbation of this image, the perturbed image is correctly classified.
Formalising this raises three questions:
How do we quantify over all real-life images?
What are perturbed images?
How do we effectively check robustness?
In this work we propose a formalisation based on generative models. A generative model is a system taking as input a random noise and generating images; in other words, it represents a probability distribution over images.
Our specification depends on two parameters $(\delta, \varepsilon)$. Informally, it reads:
An image classifier is $(\delta, \varepsilon)$-robust with respect to a generative model if the probability that, for a noise $x$, all $\varepsilon$-perturbations of $x$ generate correctly classified images is at least $1 - \delta$.
The remainder of the paper presents experiments supporting the claims that the global robustness specification has the following important properties.
Global. The first question stated above is about quantifying over all images. The global robustness we propose addresses this point by (implicitly) quantifying over a very large and representative set of images.
Robust. The second question is about the notion of perturbed images. The essence of generative models is to produce images reminiscent of real images (from the dataset); hence testing against images given by a generative model includes the very important perturbation aspect present in the intuitive definition of correctness.
Effective. The third question is about effectivity. We will explain that global robustness can be effectively evaluated for image classifiers built using neural networks.
XLZHLS18 train generative models for finding adversarial examples, and more specifically introduce a different training procedure (based on a new objective function) whose goal is to produce adversarial examples. Our approach is different in that we use generative models with the usual training procedure and objective, which is to produce a wide range of realistic images.
2 Global Correctness
This section serves as a technical warm-up for the next one: we introduce the notion of global correctness, a step towards our main definition of global robustness.
We use $\mathbb{R}^d$ for representing images, with $\|\cdot\|$ the infinity norm over $\mathbb{R}^d$, and let $\mathcal{C}$ be the set of categories, so an image classifier represents a function $C : \mathbb{R}^d \to \mathcal{C}$.
A generative model represents a distribution over images, and in effect is a neural network which takes as input a random noise in the form of an $m$-dimensional vector $x$ and produces an image $G(x)$. Hence it represents a function $G : \mathbb{R}^m \to \mathbb{R}^d$. We typically use a Gaussian distribution for the random noise, written $x \sim \mathcal{N}$.
Our first definition is that of global correctness. It relies on a first key but simple idea, which is to compose a generative model $G$ with an image classifier $C$: we construct a new neural network $C \circ G$ by simply rewiring the output of $G$ to the input of $C$, so $C \circ G$ represents a distribution over categories. Indeed, it takes as input a random noise and outputs a category.
Definition 1 (Global Correctness).
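Given for each $c \in \mathcal{C}$ a generative model $G_c$ for images of category $c$, we say that the image classifier $C$ is $\delta$-correct with respect to the generative models $(G_c)_{c \in \mathcal{C}}$ if for each $c \in \mathcal{C}$,
$$\mathbb{P}_{x \sim \mathcal{N}}\big(C(G_c(x)) = c\big) \ge 1 - \delta.$$
In words, the probability that for a noise $x$ the image generated (using $G_c$) is correctly classified (by $C$) is at least $1 - \delta$.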
Our definition of global correctness hinges on two properties of generative models:
generative models produce a wide variety of images,
generative models produce (almost only) realistic images.
The first assumption is the reason for the success of generative adversarial networks (GAN) (GPMXWOCB14). We refer for instance to KLA18 and to the attached website thispersondoesnotexist.com for a demo.
In our experiments the generative models we used are off-the-shelf generative adversarial networks (GAN) (GPMXWOCB14), with hidden layers of respectively , , and nodes, producing images of single digits.
To test the second assumption we performed a first experiment called the manual score experiment. We picked digit images using a generative model and asked individuals to tell for each of them whether they are “near-perfect”, “perturbed but clearly identifiable”, “hard to identify”, or “rubbish”, and which digit they represent. The results are that images were correctly identified; among them images were declared “near-perfect” by all individuals, with another including “perturbed but clearly identifiable”, and were considered “hard to identify” by at least one individual yet correctly identified. The remaining were “rubbish” or incorrectly identified. It follows that against this generative model, we should require an image classifier to be at least -correct, and even -correct to match human perception.
To check whether a classifier is $\delta$-correct, Monte Carlo integration is a natural approach: we sample $n$ random noises $x_1, \dots, x_n$, and count for how many $x_i$'s we have $C(G_c(x_i)) = c$. By the law of large numbers, the ratio of positives over $n$ converges to $\mathbb{P}_{x \sim \mathcal{N}}\big(C(G_c(x)) = c\big)$ as $n \to \infty$, and the central limit theorem shows that $n$ samples give a precision of order $1/\sqrt{n}$ on this number.
In practice, rather than sampling the random noises independently, we form (large) batches and leverage the tensor-based computation, enabling efficient GPU computation.
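The batched Monte Carlo estimate described above can be sketched as follows. This is only an illustration under stated assumptions: `estimate_correctness`, `toy_generator` and `toy_classifier` are hypothetical names, and the toy models are simple NumPy stand-ins for the trained networks, not the models used in the experiments.

```python
import numpy as np

def estimate_correctness(generator, classifier, category, noise_dim,
                         n_samples=10_000, batch_size=1_000, seed=0):
    """Monte Carlo estimate of P_{x ~ N}( C(G_c(x)) = c ), computed in batches."""
    rng = np.random.default_rng(seed)
    hits, done = 0, 0
    while done < n_samples:
        size = min(batch_size, n_samples - done)
        noise = rng.standard_normal((size, noise_dim))  # batch of Gaussian noises
        images = generator(noise)                       # one generated image per row
        labels = classifier(images)                     # one predicted category per row
        hits += int(np.sum(labels == category))
        done += size
    return hits / n_samples

# Hypothetical stand-ins for a trained generator G_c and classifier C.
def toy_generator(noise):
    # "Images" of category 1: pixel values centred on +1.
    return 1.0 + 0.5 * noise

def toy_classifier(images):
    # Predicts category 1 when the mean pixel value is positive, else 0.
    return (images.mean(axis=1) > 0).astype(int)

estimate = estimate_correctness(toy_generator, toy_classifier,
                                category=1, noise_dim=4)
print(f"estimated correctness: {estimate:.4f}")
```

The classifier is then reported as $\delta$-correct for any $\delta$ such that the estimate is at least $1 - \delta$, up to the sampling error.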
3 Global Robustness
We introduce the notion of global robustness, which gives stronger guarantees than global correctness. Indeed, it includes the notion of perturbations for images.
The usual notion of robustness, which we call here local robustness, can be defined as follows.
Definition 2 (Local Robustness).
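We say that the image classifier $C$ is $\varepsilon$-robust around the image $y \in \mathbb{R}^d$ of category $c$ if
$$\forall y' \in \mathbb{R}^d, \quad \|y - y'\| \le \varepsilon \implies C(y') = c.$$
In words, all $\varepsilon$-perturbations of $y$ are correctly classified (by $C$).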
One important aspect in this definition is the choice of the norm for the perturbations (here we use the infinity norm). We do not discuss this choice further, as it will not play a role in our definition of robustness. A wealth of techniques has been developed for checking local robustness of neural networks, with state-of-the-art tools being able to handle nets with thousands of neurons.
Our definition of global robustness is supported by the two properties of generative models discussed above in the context of global correctness, plus a third one:
generative models produce perturbations of realistic images.
To illustrate this we designed a second experiment called the random walk experiment: we perform a random walk on the space of random noises while observing the resulting sequence of images produced by the generative model. More specifically, we pick a random noise $x_0$, and define a sequence $(x_i)_{i \ge 0}$ of random noises with $x_{i+1}$ obtained from $x_i$ by adding a small random noise to $x_i$; this induces the sequence of images $(G(x_i))_{i \ge 0}$. The result is best visualised in an animated GIF (see the Github repository), see also the first images in Figure 2. This supports the claim that images produced with similar random noises are (often) close to each other; in other words the generative model is (almost everywhere) continuous.
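A minimal sketch of such a random walk is given below. The function name `noise_random_walk`, the step size and the toy generator (a fixed random linear map followed by `tanh`) are assumptions made for illustration; a trained GAN would take the place of `toy_generator`.

```python
import numpy as np

def noise_random_walk(generator, noise_dim, n_steps=50, step_size=0.05, seed=0):
    """Random walk x_0, x_1, ... in noise space and the induced images G(x_i)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(noise_dim)          # initial random noise x_0
    noises, images = [x], [generator(x)]
    for _ in range(n_steps):
        x = x + step_size * rng.standard_normal(noise_dim)  # small random step
        noises.append(x)
        images.append(generator(x))
    return np.stack(noises), np.stack(images)

# Hypothetical stand-in for a trained generator: a fixed map from R^4 to R^16.
A = np.random.default_rng(1).standard_normal((16, 4))

def toy_generator(x):
    return np.tanh(A @ x)

noises, images = noise_random_walk(toy_generator, noise_dim=4)

# Largest per-pixel change between consecutive images, and between consecutive noises.
image_steps = np.abs(np.diff(images, axis=0)).max(axis=1)
noise_steps = np.abs(np.diff(noises, axis=0)).max(axis=1)
print(image_steps.max(), noise_steps.max())
```

For this toy generator, small steps in noise space provably yield small steps in image space; for a trained GAN this is the empirical observation that the experiment is meant to support.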
Our definition of global robustness is reminiscent of the probably approximately correct (PAC) learning framework developed by Valiant84. It features two parameters. The first parameter, $\delta$, quantifies the probability that a generative model produces a realistic image. The second parameter, $\varepsilon$, measures the perturbations on the noise, which by the continuity property discussed above transfer to perturbations of the produced images.
Definition 3 (Global Robustness).
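Given for each $c \in \mathcal{C}$ a generative model $G_c$ for images of category $c$, we say that the image classifier $C$ is $(\delta, \varepsilon)$-robust with respect to the generative models $(G_c)_{c \in \mathcal{C}}$ if for each $c \in \mathcal{C}$,
$$\mathbb{P}_{x \sim \mathcal{N}}\big(\forall x', \ \|x - x'\| \le \varepsilon \implies C(G_c(x')) = c\big) \ge 1 - \delta.$$
In words, the probability that for a noise $x$, all $\varepsilon$-perturbations of $x$ generate (using $G_c$) images correctly classified (by $C$) is at least $1 - \delta$.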
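To check whether a classifier is $(\delta, \varepsilon)$-robust, we extend the previous ideas using Monte Carlo integration: we sample $n$ random noises $x_1, \dots, x_n$, and count for how many $x_i$'s the following property holds:
$$\forall x', \quad \|x_i - x'\| \le \varepsilon \implies C(G_c(x')) = c.$$
By the law of large numbers, the ratio of positives over $n$ converges to
$$\mathbb{P}_{x \sim \mathcal{N}}\big(\forall x', \ \|x - x'\| \le \varepsilon \implies C(G_c(x')) = c\big)$$
as $n \to \infty$. As before, the central limit theorem shows that $n$ samples give a precision of order $1/\sqrt{n}$ on this number.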
In other words, checking global robustness reduces to combining Monte Carlo integration with checking local robustness.
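This reduction can be sketched as follows, assuming a local robustness checker is available as a function. Here `estimate_global_robustness` and `toy_local_check` are hypothetical names, and the toy setting (a linear generator composed with a linear binary classifier) is chosen only because local robustness over an infinity-norm ball can then be decided exactly in closed form; for real networks a dedicated verification tool would play the role of `toy_local_check`.

```python
import numpy as np

def estimate_global_robustness(is_locally_robust, noise_dim, eps,
                               n_samples=10_000, seed=0):
    """Fraction of sampled noises x around which the local check succeeds."""
    rng = np.random.default_rng(seed)
    noises = rng.standard_normal((n_samples, noise_dim))
    positives = sum(bool(is_locally_robust(x, eps)) for x in noises)
    return positives / n_samples

# Hypothetical toy models: G(x) = mu + A x, and C predicts the correct
# category exactly when w . y + b > 0.
A = 0.1 * np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
mu = np.ones(3)
w = np.ones(3)
b = 0.0

def toy_local_check(x, eps):
    # The composed score is v . x' + k with v = A^T w and k = w . mu + b.
    # Its minimum over the ball ||x - x'||_inf <= eps is v . x + k - eps * ||v||_1.
    v = A.T @ w
    worst_case = v @ x + (w @ mu + b) - eps * np.abs(v).sum()
    return worst_case > 0

r_small = estimate_global_robustness(toy_local_check, noise_dim=2, eps=0.5)
r_large = estimate_global_robustness(toy_local_check, noise_dim=2, eps=7.0)
print(r_small, r_large)
```

As expected, the estimate can only decrease when $\varepsilon$ grows, since the local property becomes harder to satisfy around each sampled noise.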
The code for all experiments can be found on the Github repository
All experiments are presented in Jupyter notebook format with pre-trained models to be easily reproduced. Our experiments are all reproduced on the drop-in Fashion-MNIST dataset (XRV17), obtaining similar results.
We report on experiments designed to assess the benefit of these two notions, whose common denominator is to go from a local property to a global one by composing with a generative model.
We first evaluate the global correctness of several image classifiers, showing that it provides a finer way of evaluating them than the usual test set. We then turn to global robustness and show how the negation of robustness can be witnessed by realistic adversarial examples.
The second set of experiments addresses the fact that both global correctness and robustness notions depend on the choice of a generative model. We show that this dependence can be made small, but that it can also be used for refining the correctness and robustness notions.
Choice of networks
In all the experiments, our base image classifiers have hidden layers of increasing capacities: the first one, referred to as “small”, has layers with (number of nodes), “medium” corresponds to , and “large” to . The generative models are as described above, with hidden layers of respectively , , and nodes.
For each of these three architectures we either use the standard MNIST training set (6,000 images of each digit), or an augmented training set (24,000 images), obtained by rotations, shear, and shifts. The same distinction applies to GANs: the “simple GAN” uses the standard training set, and the “augmented GAN” the augmented training set.
Finally, we work with two networks obtained through robust training procedures. The first one was proposed by MadryMSTV18 for the MNIST Adversarial Example Challenge (the goal of the challenge was to find adversarial examples, see below), and the second one was defined by PapernotMWJS15 through the process of defensive distillation.
Evaluating Global Correctness
We evaluated the global correctness of all the image classifiers mentioned above against simple and augmented GANs, and reported the results in the table below. The last column is the usual validation procedure, meaning the number of correct classifications on the MNIST test set of 10,000 images. All classifiers perform very well, close to perfectly (above ), against this metric, and hence cannot be distinguished. Yet the composition with a generative model reveals that their performance outside of the test set actually differs. It is instructive to study the outliers for each image classifier, i.e. the generated images which are incorrectly classified. We refer to the Github repository for more experimental results along these lines.
|Classifier||simple GAN||augmented GAN||test set|
|Standard training set|
|Augmented training set|
|Robust training procedures|
Finding Realistic Adversarial Examples
Checking the global robustness of an image classifier is out of reach for state-of-the-art verification tools. Indeed, a single robustness check on a medium-size net takes between tens of seconds and a few minutes, and to get a decent approximation we need to perform tens of thousands of local robustness checks. Hence with considerable computational effort we could analyse one image classifier, but could not perform a wider comparison of different training procedures and of their influence on different aspects. Our experiments therefore focus on the negation of robustness, namely finding realistic adversarial examples, which we define now.
Definition 4 (Realistic Adversarial Example).
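An $\varepsilon$-realistic adversarial example for an image classifier $C$ with respect to a generative model $G$ is an image $G(x)$ such that there exists another image $G(x')$ with
$$\|x - x'\| \le \varepsilon \quad \text{and} \quad C(G(x)) \ne C(G(x')).$$
In words, $x$ and $x'$ are two $\varepsilon$-close random noises which generate images $G(x)$ and $G(x')$ that are classified differently by $C$.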
Note that a realistic adversarial example is not necessarily an adversarial example: the images $G(x)$ and $G(x')$ may differ by more than $\varepsilon$. However, by the third assumption discussed when defining global robustness, if $x$ and $x'$ are close, then typically $G(x)$ and $G(x')$ are two very similar images, so the two notions are indeed close.
We introduce two algorithms for finding realistic adversarial examples, which are directly inspired by algorithms developed for finding adversarial examples. The key difference is that realistic adversarial examples are searched for by analysing the composed network $C \circ G$.
Let us consider two distinct digits, say $c$ and $c'$, for the sake of explanation. We have a generative model $G_c$ generating images of $c$ and an image classifier $C$.
The first algorithm is a black-box attack, meaning that it does not have access to the inner structure of the networks and can only simulate them. It consists in sampling random noises and performing a local search for a few steps. From a random noise $x$, we inspect the random noise $x + \eta$ for a few small random noises $\eta$, and choose the one maximising the score that the net $C \circ G_c$ assigns to $c'$; the pseudocode is given in Algorithm 1. The algorithm is run repeatedly until a realistic adversarial example is found.
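The black-box local search can be sketched as below. This is not the paper's Algorithm 1 but an illustration of the same idea under stated assumptions: `black_box_attack`, `toy_score` and `toy_predict` are hypothetical names, the candidate perturbations are drawn uniformly in an infinity-norm ball of radius `eps`, and the composed network is replaced by a toy linear model with two categories.

```python
import numpy as np

def black_box_attack(score, predict, source, target, noise_dim, eps=0.1,
                     n_steps=20, n_candidates=16, max_restarts=1000, seed=0):
    """Random local search using only evaluations of the composed network.

    Returns a pair (x, x_adv) of eps-close noises classified differently,
    or None if no such pair was found.
    """
    rng = np.random.default_rng(seed)
    for _ in range(max_restarts):
        x = rng.standard_normal(noise_dim)
        if predict(x) != source:
            continue  # start from a noise whose image is classified as `source`
        for _ in range(n_steps):
            # A few small perturbations of x, all within eps in infinity norm.
            candidates = x + rng.uniform(-eps, eps, size=(n_candidates, noise_dim))
            best = max(candidates, key=lambda z: score(z, target))
            if predict(best) != source:
                return x, best  # eps-close noises with different categories
            x = best            # otherwise keep walking towards the target
    return None

# Hypothetical composed network with two categories and linear scores.
M = np.array([[1.0, 0.0], [-1.0, 0.0]])
c0 = np.array([0.5, -0.5])

def toy_score(x, category):
    return (M @ x + c0)[category]

def toy_predict(x):
    return int(np.argmax(M @ x + c0))

result = black_box_attack(toy_score, toy_predict, source=0, target=1, noise_dim=2)
print(result)
```

The search only calls `score` and `predict`, so it applies unchanged to any composed network that can be evaluated, whatever its internal structure.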
The second algorithm is a white-box attack, meaning that it uses the inner structure of the networks. It is similar to the previous one, except that the local search is replaced by a gradient ascent to maximise the score that the net $C \circ G_c$ assigns to the target digit $c'$. In other words, instead of choosing a direction at random, it follows the gradient to maximise the score. It is reminiscent of the projected gradient descent (PGD) attack, but performed on the composed network. The pseudocode is given in Algorithm 2.
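A PGD-style variant of this gradient ascent can be sketched as follows. It is not the paper's Algorithm 2: the projection onto an infinity-norm ball around the starting noise, the signed gradient step, and all names (`white_box_attack`, `toy_logits`, `toy_grad`) are assumptions made for illustration. The toy composed network is linear with a softmax output, so its gradient is written by hand; with real networks an automatic differentiation framework would supply it.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def white_box_attack(logits, grad_log_score, x, target,
                     eps=0.1, step=0.02, n_steps=50):
    """Signed gradient ascent on the target score, projected on the eps-ball around x."""
    source = int(np.argmax(logits(x)))
    x_adv = x.copy()
    for _ in range(n_steps):
        x_adv = x_adv + step * np.sign(grad_log_score(x_adv, target))
        x_adv = np.clip(x_adv, x - eps, x + eps)  # stay eps-close to x (infinity norm)
        if int(np.argmax(logits(x_adv))) != source:
            return x_adv
    return None

# Hypothetical composed network: logits(x) = W x + b, two categories.
W = np.array([[1.0, 0.5], [-1.0, 0.5]])
b = np.zeros(2)

def toy_logits(x):
    return W @ x + b

def toy_grad(x, target):
    # Gradient of log softmax(W x + b)[target] with respect to x.
    p = softmax(toy_logits(x))
    one_hot = np.zeros_like(p)
    one_hot[target] = 1.0
    return W.T @ (one_hot - p)

x0 = np.array([0.05, 0.3])   # classified as category 0 by the toy network
x_adv = white_box_attack(toy_logits, toy_grad, x0, target=1)
print(x_adv)
```

Compared with the black-box search, each step uses the gradient direction instead of the best of a few random directions, which is the only difference between the two sketches.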
Both attacks successfully find realistic adversarial examples in less than a minute. The adjective “realistic”, which is subjective, is justified as follows: most attacks constructing adversarial examples create unrealistic images by adding noise or modifying pixels, while with our definition the realistic adversarial examples are images produced by the generative model, hence potentially more realistic. See Figure 3 for some examples.
On the Dependence on the Generative Model
Both global correctness and robustness notions are defined with respect to a generative model. This raises a question: how much do they depend on the choice of the generative model?
To answer this question we trained two GANs using the exact same training procedure but with two disjoint training sets, and used the two GANs to evaluate several image classifiers. The outcome is that the two GANs yield essentially the same results against all image classifiers. This suggests that global correctness does not depend dramatically on the choice of the generative model, provided that it is reasonably good and well-trained. We refer to the Github repository for a complete exposition of the results.
Since the training set of the MNIST dataset contains 6,000 images of each digit, splitting it in two would not yield two large enough training sets. Hence we used the extended MNIST (EMNIST) dataset (CohenATS17), which provided us with (roughly) 34,000 images of each digit, hence two disjoint datasets of about 17,000 images per digit.
On the Influence of Data Augmentation
Data augmentation is a classical technique for increasing the size of a training set: it consists in creating new training data by applying a set of mild transformations to the existing training set. In the case of digit images, common transformations include rotations, shear, and shifts.
Unsurprisingly, crossing the two training sets, e.g. using the standard training set for the image classifier and an augmented one for the generative model, yields worse results than using the same training set for both. More interestingly, the robust networks (MadryMSTV18; PapernotMWJS15), which are trained using an improved procedure but based on the standard training set, perform well against generative models trained on the augmented training set. In other words, one outcome of the improved training procedure is to better capture the natural image transformations, even if they were never used in training.
We defined two notions: global correctness and global robustness, based on generative models, aiming at quantifying the usability of an image classifier. We performed some experiments on the MNIST dataset to understand the merits and limits of our definitions. An important challenge lies ahead: to make the verification of global robustness doable in a reasonable amount of time and computational effort.