[Figure: two example attacks, each showing (left to right) the original image, the starting point, and the final adversarial example. Left: original "Ford Model T", starting point and adversarial example "Tibetan mastiff". Right: original "Wombat", starting point and adversarial example "Garter snake".]
Black-box adversarial attacks describe a scenario in which an attacker has access only to the inputs and outputs of a machine learning model, but has no knowledge of its parameters or architecture [Papernot:2017:PBA:3052973.3053009]. Intuitively, it may seem improbable for an attacker to simply guess the vulnerabilities of a model, but recent work has demonstrated that it is indeed possible to craft custom-tailored adversarial examples for any black box, provided the attacker is allowed to query the model a large number of times [brendel17, ilyas18a, Ilyas2018PriorCB, autozoom, zoo].
Naturally, the practicality of such attacks depends on the number of queries required, and much work has gone into designing efficient optimization strategies. Gradient estimation techniques have been popular [ilyas18a, zoo, autozoom] for models that provide real-valued output (e.g. softmax activations), and sophisticated sampling strategies have been proposed that drastically reduce the number of iterations [Ilyas2018PriorCB]. The same has happened in the much harder label-only setting, where models output only a single discrete value (e.g. the top-1 class label). Early success in this setting was sparked by the Boundary Attack [brendel17], which essentially performs a random walk along the decision boundary, and several variants have been proposed that improve the efficiency of this search procedure [guo2018low, brunner18guessing]. Another family of successful black-box attacks exploits the fact that machine learning models often share vulnerabilities: they train surrogate models, perform white-box attacks on them, and find adversarial examples that transfer to the model under attack [madry_towards_2017, tramer17ens, Papernot:2017:PBA:3052973.3053009].
It is evident that considerable effort has gone into the design of sophisticated optimization procedures. Surprisingly, however, the question of how to initialize them has received little attention. We consider this an important gap in the current literature, since costly optimization procedures can often be sped up considerably by a smart choice of starting points. Our contributions are as follows:
We discuss beneficial properties of starting points and how they can improve the efficiency of iterative black-box attacks.
As a proof of concept, we propose a simple copy-and-paste scheme, in which patches from images of the adversarial class are added to the image under attack.
We use this strategy to initialize a state-of-the-art attack and evaluate it against an ImageNet classifier. Our initialization reduces the number of queries by 81% when compared with previous results, and thus forms a new state of the art in query-efficient black-box attacks. The source code for repeating our experiment is publicly available at https://github.com/ttbrunner/blackbox_starting_points.
In this work, we exclusively focus on the targeted label-only black-box setting, where an attacker must change the classification to a specific label and does not have access to gradients or confidence scores. This is one of the hardest settings currently considered [brendel17, brunner18guessing, chengHardlabel, ilyas18a] and therefore our results should be valid for easier settings as well.
2 Initialization strategies
Currently, black-box adversarial attacks start with either (a) the original image or (b) an example of the target class. In the case of (a), the attack tries to take steps into directions that lead to an adversarial region. This is considered very hard and is typically approached by estimating gradients [ilyas18a, zoo] or transferring them from a surrogate [madry_towards_2017]. This approach can be unreliable and many queries must be spent before the adversarial region is found [brunner18guessing]. In order to improve reliability, other attacks [brendel17, ilyas18a, brunner18guessing] employ (b), where the starting point already has the desired class label but the distance to the original image is high. As a result, the attack needs to travel a great distance through the input space, requiring many steps until it arrives at an example that is reasonably close to the original.
Both strategies leave room for improvement. For (a), Tramèr et al. [tramer17ens] propose adding small random perturbations to the input, which they find to increase the overall success rate. In this work, we focus on (b): recent black-box attacks have achieved impressive results with this method [brunner18guessing, ilyas18a], and at the same time it seems easy to improve. Some images of the target class are surely better suited than others, and the number of required queries could be reduced by choosing starting points in a systematic manner.
2.1 Criteria for suitable starting points
In image classification, an adversarial example is considered successful if it has a low distance to the original image (e.g. measured by some $L_p$ norm of the perturbation) and is at the same time classified as the adversarial target label. We assume two properties to be beneficial for attack efficiency:
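As a minimal sketch, this success criterion can be written as a simple check. The function name, the fixed $L_2$ budget `eps`, and the assumption that the black box has already returned `model_label` for the candidate are our own illustration, not part of any particular attack implementation:

```python
import numpy as np

def is_successful(x_orig, x_adv, model_label, target_label, eps):
    """Targeted success: the perturbation is small under the chosen norm
    and the black box assigns the adversarial target label.
    `model_label` is the label the model returned for x_adv."""
    dist = np.linalg.norm((x_adv - x_orig).ravel())  # L2 norm of perturbation
    return dist <= eps and model_label == target_label
```

Attacks in the label-only setting typically fix the target label and then minimize the distance term, which is why close starting points matter.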
Starting points should be close. Intuitively, it makes sense to pick points that are already close to this goal. The optimization procedure then merely refines them, requiring fewer iterations to arrive at an adversarial example, or producing a better one within the same number of steps. The most straightforward approach is to search a large data set (e.g. ImageNet) for images of the target class and pick the one with the lowest distance to the original.
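This selection step can be sketched in a few lines. The helper below is hypothetical and assumes the attacker has already collected candidate images of the target class as arrays of the same shape as the original:

```python
import numpy as np

def closest_starting_point(x_orig, candidates):
    """From images already classified as the target class, pick the one
    with the smallest L2 distance to the image under attack."""
    dists = [np.linalg.norm((c - x_orig).ravel()) for c in candidates]
    return candidates[int(np.argmin(dists))]
```
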
Starting points should reduce dimensionality. Optimization often suffers from very large search spaces: an ImageNet example at a resolution of 299 x 299 pixels with three color channels has 268,203 dimensions. It is therefore desirable to shrink this search space and concentrate only on specific dimensions. Notably, Brunner et al. [brunner18guessing] demonstrate that attacks gain efficiency by considering only the pixels that differ between the current image and the original, ignoring the rest. A suitable starting point should facilitate this, ideally by replacing not the entire original but only small regions of it. The attack can then be limited to these regions.
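Restricting the search to the differing pixels might look like the following simplified sketch; the actual attack of Brunner et al. uses a more involved sampling procedure, so this is only an illustration of the masking idea:

```python
import numpy as np

def masked_perturbation(x_current, x_orig, step):
    """Sample a candidate step only in the dimensions where the current
    image still differs from the original; all other pixels stay frozen,
    which shrinks the effective search space."""
    mask = (x_current != x_orig)
    noise = np.random.randn(*x_current.shape) * step
    return x_current + noise * mask
```
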
3 Copy-pasting adversarial features
In order to test our assumptions, we propose a simple strategy that segments images of the target class into small patches and then inserts them into the image under attack. It is geared towards attacks that minimize the $L_2$ norm of the perturbation, but similar strategies could be applied to improve $L_\infty$ attacks. This work should be understood as a proof of concept that offers many opportunities for refinement. Nevertheless, our evaluation in Section LABEL:sec_experiments shows that this simple approach already delivers a large boost in efficiency.
3.1 Segmentation by saliency
The pixels most important for classification are those that contain salient features. These typically concentrate in small regions (e.g. the nose and eyes of an animal, see Figure LABEL:fig:blending2), whereas the rest of the image matters little for the predicted class label. The unimportant regions can safely be removed: compare the starting points in Figure LABEL:fig:comp (left), where removing the background significantly lowers the $L_2$-distance to the original but retains the adversarial label. To do this, we can apply any segmentation method of our choice. In our implementation, we use a surrogate model to construct saliency maps for images of the target class and then blend only the salient pixels into the original image.
Saliency maps are model-specific, and therefore our initialization could be interpreted as a transfer attack that is not guaranteed to generalize across models. To address this concern, we apply heavy smoothing and amplification to the map. This results in contiguous patches that cover the salient regions and are therefore likely to contain the core motif of an image (see Figure LABEL:fig:blending2). We expect this method to generalize well across models, but in practice it can also be replaced by any other segmentation technique available to the attacker.
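The blending step can be sketched as follows, assuming a per-pixel saliency map in [0, 1] has already been obtained from a surrogate model (computing it is out of scope here). The smoothing parameter `sigma` and the amplification factor are illustrative values, not the ones used in our experiments:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def paste_salient_patches(x_orig, x_target, saliency, sigma=5.0, amplify=4.0):
    """Blend the salient pixels of a target-class image into the original.
    Heavy smoothing and amplification turn scattered salient pixels into
    contiguous patches that likely cover the core motif of the image."""
    mask = gaussian_filter(saliency, sigma=sigma) * amplify
    mask = np.clip(mask, 0.0, 1.0)[..., None]  # broadcast over color channels
    return mask * x_target + (1.0 - mask) * x_orig
```

Because the mask is binary only in the limit, intermediate values produce soft edges around each patch, which in our experience keeps the pasted regions inconspicuous.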