In the last few years, motivated by the large popularity of deep learning-based models, a growing number of studies on the properties of adversarial attacks and of the corresponding defenses have been produced by the scientific community Papernot et al. (2016a); Biggio and Roli (2018); Goodfellow et al. (2018); Carlini et al. (2019b); Shafahi et al. (2019); Sotgiu et al. (2020) (see Miller et al. (2020) for a recent review on this topic). Most of the existing works either propose methods for improving classifier robustness by modifying the learning algorithm to explicitly account for the presence of adversarial data perturbations Goodfellow et al. (2014); Papernot et al. (2016b); Sinha et al. (2018), or develop specific detection mechanisms for adversarial examples Carlini and Wagner (2017); Ma et al. (2018); Samangouei et al. (2018); Pang et al. (2018); Lee et al. (2018); Miller et al. (2020). Only a few approaches focus on the semi-supervised learning setting Miyato et al. (2015); Park et al. (2018); Akcay et al. (2018); Carmon et al. (2019); Miyato et al. (2018); Zhai et al. (2019); Najafi et al. (2019); Alayrac et al. (2019), although it provides a natural setting for many real-world applications in which labeling data is costly while unlabeled samples are readily available. More importantly, to the best of our knowledge, the problem of multi-label classification, in which each example can belong to more than one class, is only preliminarily discussed in the context of adversarial learning in Song et al. (2018), while using adversarial examples to improve some types of multi-label classifiers is evaluated in Wu et al. (2017); Babbar and Schölkopf (2018).
In this paper, we focus on multi-label classification and, in particular, on the case in which domain knowledge on the relationships among the considered classes is available. Such knowledge can be naturally expressed by First-Order Logic (FOL) clauses and, following the learning framework of Gnecco et al. (2015); Diligenti et al. (2017), it can be used to improve the classifier by enforcing FOL-based constraints on the unsupervised or partially labeled portions of the training set. A well-known intuition in adversarial machine learning suggests that a reliable model of the data distribution could be used to spot adversarial examples, since they are not sampled from that distribution, although exploiting this intuition is not straightforward Grosse et al. (2017). We borrow this intuition and intersect it with the idea that semi-supervised examples can help learn decision boundaries that better follow the marginal data distribution, coherently with the available knowledge Melacci and Belkin (2011); Diligenti et al. (2017). For these reasons, we study the constraints that implement domain knowledge not only as a means to better shape the decision boundaries, but also as a measure to spot examples that are likely to be generated in an adversarial setting.
What we propose differs from the existing literature on semi-supervised learning in adversarial settings. Miyato et al. Miyato et al. (2015, 2018) and Park et al. Park et al. (2018) exploit adversarial training (virtual adversarial training and adversarial dropout, respectively) to favor regularity around the supervised and unsupervised training data, with the aim of improving classifier performance. The work in Akcay et al. (2018) develops an anomaly detector using adversarial training in the semi-supervised setting. Self-supervised learning is exploited in Carmon et al. (2019); Najafi et al. (2019) to improve adversarial robustness, stability criteria are enforced on unlabeled training data in Zhai et al. (2019), while the work in Alayrac et al. (2019) specifically focuses on an unsupervised adversarial training procedure, exploited in semi-supervised classification. Our model exploits neither adversarial training nor any adversary-aware training criteria aimed at gaining intrinsic regularity. We focus on the role of domain knowledge as an indirect means to increase adversarial robustness and, afterwards, to detect adversarial examples. Indeed, all the described methods could also be applied jointly with what we propose. Moreover, our approach also differs from other adversarial-example detectors Carlini and Wagner (2017); Ma et al. (2018); Samangouei et al. (2018); Miller et al. (2020) as it has no additional training cost and negligible runtime cost.
This paper contributes in showing that domain knowledge is a powerful feature (1) to improve the robustness of multi-label classifiers and (2) to detect adversarial examples. To properly evaluate the robustness of our approach, which remains one of the most challenging problems in adversarial machine learning Carlini et al. (2019a); Athalye et al. (2018); Biggio and Roli (2018), we propose (3) a novel multi-label knowledge-driven attack that can implement both black-box and white-box adaptive attacks. While we show that an adaptive attack having access to the domain knowledge exploited by our classifier can bypass it, even though at the cost of an increased perturbation size, it remains an open issue to understand how hard it would be for an attacker to infer such knowledge in practical cases. For this reason, we believe that our work can provide a significant contribution towards both evaluating and designing robust multi-label classifiers.
2 Learning with Domain Knowledge
We consider a vector function f = [f_1, …, f_c], f: X → ℝ^c, where X ⊆ ℝ^d is the considered input domain. Each function f_j is responsible for implementing a specific task on X (this notion can be trivially extended to the case in which the task functions operate on different domains). In the context of this paper, without loss of generality, we consider multi-label classification problems with c classes, in which each input x ∈ X is associated with one or more classes. Function f_j predicts the membership degree of x to the j-th class. Moreover, when we restrict the output of f_j to [0, 1], we can think of f_j as the fuzzy logic predicate that models the truth degree of x belonging to class j. In order to simplify the notation, we will frequently make no explicit distinction between function names, predicate names, class names, or between input samples and predicate variables.
Whenever we focus on the predicate-oriented interpretation of each f_j, First-Order Logic (FOL) becomes the natural way of describing relationships among the considered classes, i.e., the most effective type of domain knowledge that may be available in a multi-label problem; e.g., ∀x a(x) ∧ b(x) ⇒ c(x), for some classes a, b, c, meaning that the intersection between the a-th class and the b-th class is always included in the c-th one. The framework of Learning from Constraints Gnecco et al. (2015); Gori and Melacci (2013); Diligenti et al. (2017) follows the idea of converting domain knowledge into constraints on the learning problem, and it studies, amongst a variety of other knowledge-oriented constraints (see, e.g., Table 2 in Gnecco et al. (2015)), the process of handling FOL formulas so that they can be either injected into the learning problem or used as a knowledge-verification measure Gori and Melacci (2013); Diligenti et al. (2017). Such knowledge is enforced on those training examples for which either no information or only partial/incomplete labeling is available, thus casting the learning problem in the semi-supervised setting. As a result, the multi-label classifier can improve its performance and make predictions on out-of-sample data that are more coherent with the domain knowledge (see, e.g., Table 4 in Gnecco et al. (2015)). In particular, FOL formulas that represent the domain knowledge of the considered problem are converted into numerical constraints using Triangular Norms (T-Norms, Klement et al. (2013)), binary functions that generalize the conjunction operator ∧. Following the previous example, the formula is converted into a bilateral constraint that, in the case of the product T-Norm, is 1 − f_a(x) f_b(x)(1 − f_c(x)) = 1. The 1 on the right-hand side of the constraint is due to the fact that the numerical formula must hold true (i.e., evaluate to 1), while the left-hand side is in [0, 1].
We indicate with 1 − φ(f(x)) the loss function associated to a formula φ, where φ(f(x)) ∈ [0, 1] is its truth degree. In the simplest case (as followed in this paper) such loss is exactly 1 − φ(f(x)), so that its minimum value is zero. The quantifier ∀ is translated by enforcing the constraints on a discrete data sample S. The loss function associated to the whole knowledge is obtained by taking the sum (average) over the data in S and, since we usually have m formulas whose relative importance could be uneven, we get

L_K(f, S, λ) = Σ_{k=1}^{m} λ_k (1/|S|) Σ_{x∈S} (1 − φ_k(f(x))),    (1)

where λ = [λ_1, …, λ_m] is the vector that collects the scalar weights of the FOL formulas, and λ_k ≥ 0.
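To make the translation concrete, here is a minimal sketch of the product T-Norm conversion (the function names and data layout are our illustrative choices, not notation from the paper; the implication a ∧ b ⇒ c is modeled, as above, with truth degree 1 − t(a)t(b)(1 − t(c))):

```python
import numpy as np

# Illustrative sketch (our naming): the truth degree of "a AND b IMPLIES c"
# under the product T-Norm, and the weighted constraint loss obtained by
# averaging 1 - truth over a discrete sample, which is how the universal
# quantifier is translated.
def implication_truth(fa, fb, fc):
    # t(a ∧ b ⇒ c) = 1 - t(a) * t(b) * (1 - t(c)) with the product T-Norm
    return 1.0 - fa * fb * (1.0 - fc)

def constraint_loss(preds, formulas, weights):
    """preds: (n, c) array of fuzzy predicate outputs in [0, 1].
    formulas: callables mapping one row of predictions to a truth degree.
    weights: per-formula importance (the lambda vector in the text)."""
    per_formula = []
    for phi in formulas:
        truths = np.array([phi(p) for p in preds])  # truth on each sample
        per_formula.append(np.mean(1.0 - truths))   # loss = 1 - truth, averaged
    return float(np.dot(weights, per_formula))
```

Predictions that perfectly satisfy the formula yield a zero loss, while a confident violation yields the maximum per-formula loss of 1 (times the formula weight).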
In this paper, f is implemented with a neural architecture with c output units and weights collected in W. We distinguish between the use of Eq. (1) as a loss function in the training stage and its use as a measure to evaluate the constraint fulfillment on out-of-sample data. In detail, the classifier is trained on the training set L by minimizing

suploss(f, L, Y) + L_K(f, L, λᵗ),    (2)

where λᵗ collects the importance of the FOL formulas at training time and modulates the weight of the constraint loss with respect to the supervision loss suploss, Y being the supervision information attached to some of the data in L. The optimal λᵗ is chosen by cross-validation, maximizing the classifier performance. When the classifier is evaluated on a test sample x, the measure

L_K(f, {x}, λᵛ),    (3)

with weights λᵛ ≥ 0, returns a score that indicates the fulfillment of the domain knowledge on x (the lower the better). Note that λᵗ and λᵛ might not necessarily be equivalent, even if certainly related. In particular, one may differently weigh the importance of some formulas during training to better accommodate the gradient-descent procedure and avoid bad local minima.
It is important to notice that Eq. (2) enforces domain knowledge only on the training data L. There are no guarantees that such knowledge will be fulfilled in the whole input space X. This suggests that optimizing Eq. (2) yields a stronger fulfillment of knowledge over the space regions where the training points are distributed (low values of the constraint loss), while the loss could return larger values when departing from the distribution of the training data. The constraint enforcement is soft, so that the second term in Eq. (2) is not necessarily zero at the end of the optimization.
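To fix ideas, the soft-constrained semi-supervised objective of Eq. (2) can be sketched as follows (the binary cross-entropy supervision term, the label mask encoding missing/partial labels, and all names are our illustrative assumptions):

```python
import numpy as np

# Sketch of an Eq.-(2)-style objective (our naming): binary cross-entropy on
# the (possibly partially) labeled attributes, plus the weighted constraint
# loss, which is evaluated on ALL training points, labeled or not.
def bce(pred, target, eps=1e-7):
    pred = np.clip(pred, eps, 1 - eps)  # avoid log(0)
    return -(target * np.log(pred) + (1 - target) * np.log(1 - pred))

def semi_supervised_loss(preds, targets, mask, knowledge_loss, lam):
    """preds/targets: (n, c) arrays in [0, 1]; mask: (n, c) with 1 where the
    attribute label is known (0 encodes missing/partial supervision);
    knowledge_loss: scalar constraint loss already computed on preds;
    lam: scalar weight of the knowledge term."""
    sup = np.sum(bce(preds, targets) * mask) / max(np.sum(mask), 1)
    return sup + lam * knowledge_loss
```

The mask makes the supervision term ignore unknown attributes, while the knowledge term still constrains every prediction, which is what casts the problem in the semi-supervised setting.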
3 Exploiting Domain Knowledge against Adversarial Attacks
The basic idea behind this paper is that the constraint loss of Eq. (1) is not only useful to enforce domain knowledge into the learning problem, but also (i) to gain some robustness with respect to adversarial attacks and (ii) as a tool to detect adversarial examples at no additional training cost. The example in Fig. 1 illustrates the main principles followed in this work, in a multi-label classification problem with four classes (cat, animal, motorbike, vehicle) for which the following domain knowledge is available, together with labeled and unlabeled training data:
Such knowledge is converted into numerical constraints, as described in Sect. 2, and the loss function is devised and enforced on the training data predictions during classifier training (Eq. 2). Fig. 1 shows two examples of the learned classifier.
Considering point (i), in both cases the decision boundaries are altered on the unlabeled data, enforcing the classifier to take a knowledge-coherent decision over the unlabeled training points and to better cover the marginal distribution of the data. This knowledge-driven regularity improves classifier robustness to adversarial attacks, as we will discuss in Sect. 4. Going into further details to illustrate claim (ii), in (a) we have the most likely case, in which decision boundaries are not always perfectly tight to the data distribution, and they might not be closed (ReLU networks typically return high-confidence predictions far from the training data Hein et al. (2019)). Three different attacks are shown (purple). In the first attack, an example of motorbike is perturbed to become an element of the cat class, but Eq. (4) is not fulfilled anymore. In the second attack, an example of animal is attacked to avoid being predicted as animal. However, it falls in a region where no predictions are yielded, violating Eq. (7). The third attack consists of an adversarial attack to create a fake cat that, however, is also predicted as vehicle, thus violating Eq. (4) and Eq. (6). In (b) we have an ideal and extreme case, with very tight and closed decision boundaries. Since the classes are well separated, it is harder to generate adversarial examples by slightly perturbing the available data, while it is easy to fall in regions for which Eq. (7) is not fulfilled. The pictures in (c) show the unfeasible regions in which the constraint loss is significantly larger, thus offering a natural criterion to spot adversarial examples that fall outside of the training data distribution.
The rejection threshold is estimated by cross-validation in order to avoid rejecting (or to reject only a small number of, 10% in our experiments) the examples in the validation set V. Eq. (8) computes the constraint loss on the validation data V, using the importance weights λᵛ (that we will discuss in what follows), as in Eq. (3). The rationale behind this idea is that those samples for which the constraint loss is larger than what it is on the distribution of the considered data should be rejected. The training samples are the ones over which domain knowledge was enforced during the training stage, while the validation set represents data on which knowledge was not enforced, but that are sampled from the same distribution as the training set, making them good candidates for estimating the threshold.
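The rejection rule sketched above can be written in a few lines (names and the quantile-based implementation are our assumptions; the text only states that the threshold is estimated on validation data so as to reject at most a small fraction of it, 10% in the experiments):

```python
import numpy as np

# Sketch of the knowledge-based detector (our naming): the rejection
# threshold tau is the validation quantile that keeps ~90% of validation
# points (i.e., rejects at most 10%); a test sample is then rejected when
# its constraint loss exceeds tau.
def fit_threshold(val_losses, keep=0.90):
    """val_losses: constraint-loss values computed on the validation set."""
    return float(np.quantile(val_losses, keep))

def reject(test_loss, tau):
    """True when the constraint loss on a test sample exceeds the threshold."""
    return test_loss > tau
```

Since the constraint-loss values on validation data are already computed when tuning the model, this detector adds no training cost.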
Differently from ad-hoc detectors, which usually require training generative models, this rejection procedure comes at no additional training cost (generative models on the fulfillment of the single constraints could be considered too). The procedure is effective whenever the functions in f are not too strongly paired with respect to the constraint loss, and we formalize the notion of “pairing” as follows.
This notion indicates that if the constraint loss is fulfilled in similar ways over the training data distribution and the space areas close to it, then there is no room for detecting those examples that should be rejected. While it is not straightforward to draw conclusions about the pairing of f before training the classifier, the soft constraining scheme of Eq. (2) allows the classification functions to be less strongly paired than they would be when using hard constraints (see Teso (2019) for a discussion on hard constraints and graphical models in an adversarial context).
Note that a multi-label system is usually equipped with activation functions that do not structurally enforce any dependencies among the classes (e.g., differently from what happens with softmax), so it is naturally able to respond without assigning any class to the input (white areas in Fig. 1). This property has been recently discussed as a means for gaining robustness to adversarial examples Shafahi et al. (2019); Bendale and Boult (2016). The formula in Eq. (7) is what allows our model to spot examples that might fall in this “I don’t know” area. Dependencies among classes are only introduced by the constraint loss in Eq. (2) on the training data.
The choice of λᵛ is crucial in the definition of the reject function. On the one hand, in some problems we might have access to the certainty degree of each FOL formula, which could be used to set λᵛ; otherwise it seems natural to select an unbiased (uniform) set of weights. On the other hand, several FOL formulas involve the implication operator ⇒, which naturally implements if-then rules (if class a then class b) or, equivalently, rules that are about hierarchies, since a ⇒ b models an inclusion (class a included in class b). However, whenever the premises are false, the whole formula holds true. It might thus be easy to trivially fulfill the associated constraints by zeroing all the predicates in the premises, eventually avoiding rejection. As a rule of thumb, it is better to select weights that are larger for those constraints that favor the activation of the involved predicates.
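A small numeric illustration of the false-premise pitfall, under the product T-Norm semantics used above (the function name is ours):

```python
# Numeric illustration (function name ours): under the product T-Norm the
# truth of "a implies b" is 1 - t(a) * (1 - t(b)). Driving the premise to
# zero satisfies the formula regardless of the conclusion, which is why
# uniform weights can under-penalize trivially fulfilled implications.
def implies(ta, tb):
    return 1.0 - ta * (1.0 - tb)

assert implies(0.0, 0.0) == 1.0  # false premise: formula trivially true
assert implies(0.0, 1.0) == 1.0  # still true, whatever the conclusion
assert implies(1.0, 0.0) == 0.0  # true premise, false conclusion: violated
```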
Attacking Multi-label Classifiers.
Robustness against adversarial examples is typically evaluated against black-box and white-box attacks Biggio and Roli (2018); Miller et al. (2020). In the black-box setting, the attacker is assumed to have only black-box query access to the target model, ignoring the presence of any defense mechanism such as the use of domain-knowledge constraints. However, a surrogate model can be trained on data ideally sampled from the same distribution as that used to train the target model. Within these assumptions, gradient-based attacks can be optimized against the surrogate model, and then transferred/evaluated against the target one Papernot et al. (2016a); Demontis et al. (2019). In the white-box setting, instead, the attacker is assumed to know everything about the target model, including the defense mechanism. White-box attacks are thus expected to exploit knowledge of the defense mechanism to bypass it.
We propose here a novel multi-label knowledge-driven adversarial attack (MKA) that naturally extends the formulation of single-label attacks and allows staging both black-box and white-box (adaptive) attacks against our approach. Given a sample x from the test set T, we consider the sets of its ground-truth positive and negative classes, Y⁺ and Y⁻, respectively. Let us define p = argmin_{j∈Y⁺} f_j(x) and n = argmax_{j∈Y⁻} f_j(x), i.e., p (n) is the index of the positive (negative) class with the smallest (largest) output score. These are essentially the indices of the classes for which the prediction on x is closest to the decision boundary. Our attack optimizes the following objective,
where the terms in Eq. (9) involve the logits of the attacked classes, an ℓ-norm penalty on the perturbation and, in the case of image data with pixel intensities in [0, 1], a box constraint on the perturbed sample. The scalar κ is used to threshold the values of the logits, to avoid increasing/decreasing them in an unbounded way; optimizing the logit values is preferable to avoid sigmoid saturation. While the definition of Eq. (9) is limited to a pair of classes, we dynamically update the pair of attacked classes whenever the corresponding logit goes below (above) the threshold, so that multiple classes are considered by the attack, compatibly with the maximum number of iterations of the optimizer. This strategy proved more effective than jointly optimizing over all the classes in Y⁺ and Y⁻. Moreover, the classes involved in the attack can be a subset of the whole set. For white-box attacks, the objective also includes the constraint loss, to enforce domain knowledge and avoid rejection; for black-box attacks, instead, its weight is set to zero. Eq. (9) is minimized via projected gradient descent.
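An illustrative single step of such a pairwise logit attack, using a linear surrogate model for simplicity (the function, the linear logits, and all parameter names are our assumptions; the actual MKA objective of Eq. (9) is optimized with projected gradient descent on the target network):

```python
import numpy as np

# Minimal sketch in the spirit of Eq. (9) (all names and the linear "model"
# are ours): push the weakest ground-truth-positive logit down and the
# strongest ground-truth-negative logit up, stopping once each logit passes
# the confidence threshold kappa, then project onto the eps-ball and [0, 1].
def mka_step(x, x0, W, pos, neg, kappa, eps, lr):
    z = W @ x                               # logits of the linear surrogate
    p = pos[int(np.argmin(z[pos]))]         # weakest ground-truth positive
    n = neg[int(np.argmax(z[neg]))]         # strongest ground-truth negative
    g = np.zeros_like(x)
    if z[p] > -kappa:                       # not yet confidently "off"
        g += W[p]                           # descending pushes z[p] down
    if z[n] < kappa:                        # not yet confidently "on"
        g -= W[n]                           # descending pushes z[n] up
    x = x - lr * g                          # gradient step on the objective
    x = x0 + np.clip(x - x0, -eps, eps)     # project onto the eps-ball
    return np.clip(x, 0.0, 1.0)             # keep a valid pixel range
```

The thresholding via kappa mirrors the clamping of the logits described above: once a class is confidently mis-predicted, the attack moves on to the next closest pair.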
We considered three image classification datasets, referred to as ANIMALS, CIFAR-100 and PASCAL-Part, respectively. The first one is a collection of real-world images of animals (multiple resolutions), taken from the ImageNet database (ANIMALS: http://www.image-net.org/, CIFAR-100: https://www.cs.toronto.edu/~kriz/cifar.html), the second one is a popular benchmark composed of 32×32 RGB images belonging to different types of classes (vehicles, flowers, people, etc.), while the last dataset is composed of images in which both objects (Man, Dog, Car, Train, etc.) and object-parts (Head, Paw, Beak, etc.) are labeled (PASCAL-Part: https://www.cs.stanford.edu/~roozbeh/pascal-parts/pascal-parts.html). All datasets are used in a multi-label classification setting, in which each image is paired with a fixed number of binary attributes. In the case of ANIMALS, the first seven attributes, also referred to as “main” attributes/classes, are about the specific animal classes (albatross, cheetah, tiger, giraffe, zebra, ostrich, penguin), while the other attributes are about features of the animal classes (mammal, bird, carnivore, fly, etc.). The CIFAR-100 dataset is composed of 120 attributes, out of which 100 are fine-grained class labels (“main” attributes) and 20 are superclasses. In the PASCAL-Part dataset, after merging classes as in Donadello et al. (2017), part of the classes are objects (“main” attributes) and the remaining ones are object-parts. We make use of domain knowledge that holds for all the available examples. In the case of ANIMALS, it is a collection of FOL formulas that were defined in the benchmark of P.H. Winston Winston and Horn (1986), and they involve relationships between animal classes and animal properties, such as FLY ∧ LAYEGGS ⇒ BIRD.
In CIFAR-100, FOL formulas are about the father-son relationships between classes, while in PASCAL-Part they either list all the parts belonging to a certain object, e.g., MOTORBIKE ⇒ WHEEL ∨ HEADLIGHT ∨ HANDLEBAR ∨ SADDLE, or they list all the objects in which a part can be found, e.g., HANDLEBAR ⇒ BICYCLE ∨ MOTORBIKE. In all cases we also introduced a disjunction or a mutual-exclusivity constraint among the main attributes, and another disjunction among the other attributes. See Table 1 and the supplementary material for more details. Each dataset was divided into training and test sets (the latter indicated with T). The training set was further divided into a learning set (L), used to train the classifiers, and a validation set (V), used to tune the model parameters. We defined a semi-supervised learning scenario in which only a portion of the training set is labeled, sometimes partially (i.e., only a fraction of the attributes of a labeled example is known), as detailed in Table 1. We indicated with Labeled the percentage of labeled training data, and with Partial the percentage of attributes that are unknown for each labeled example. When splitting the training data into L and V, we kept the same percentages of unknown attributes in both the splits (of course, all the examples in V are labeled). Moreover, when generating partial labels, we ensured that the percentages of discarded positive and negative attributes were the same.
We compared two neural architectures, based on the popular ResNet50 backbone pretrained on ImageNet. In the first network, referred to as TL, we transferred the ResNet50 model and trained the last layer from scratch in order to predict the dataset-specific attributes (sigmoid activation). The second network, indicated with FT, has the same structure as TL, but we also fine-tuned the last convolutional layer. Each model is based on the product T-Norm, and it was trained for a dataset-dependent number of epochs (different for TL and FT), using minibatches. We used the Adam optimizer; for FT in CIFAR-100 we used a different initial step size to speed up convergence. We selected the model at the epoch that led to the largest F1 on V.
To evaluate performance, we considered the (macro) F1 score and a metric restricted to the main classes (output values are compared against a fixed threshold to obtain binary labels). For ANIMALS and CIFAR-100, the main classes are mutually exclusive, so we measured the accuracy in predicting the winning main class (AccMain), while in PASCAL-Part we kept the F1 score (F1Main), as multiple main classes can be predicted on the same input. The results obtained after tuning are reported in Table 2, averaged over 3 runs. The selected parameters are reported in the supplementary material. We considered unconstrained and constrained (+C) models and, for TL, we also considered a strongly-constrained (+CC) model with inferior performance but higher coherence among the predicted attributes (a stronger constraint enforcement might lead to a worse fitting of the supervisions); in FT, due to the larger number of parameters, the constraint loss was already small in the +C case. The introduction of domain knowledge allows the constrained classifiers to outperform the unconstrained ones.
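A minimal sketch of the macro-F1 computation used in the evaluation (the 0.5 binarization threshold is an assumption of ours; the text only states that outputs are compared against a fixed value):

```python
import numpy as np

# Macro-F1 over classes (our implementation): binarize sigmoid outputs at a
# fixed threshold, compute per-class F1, then average across classes.
def macro_f1(preds, targets, thr=0.5):
    """preds: (n, c) in [0, 1]; targets: (n, c) binary labels."""
    binarized = (preds >= thr).astype(int)
    f1s = []
    for j in range(targets.shape[1]):
        tp = np.sum((binarized[:, j] == 1) & (targets[:, j] == 1))
        fp = np.sum((binarized[:, j] == 1) & (targets[:, j] == 0))
        fn = np.sum((binarized[:, j] == 0) & (targets[:, j] == 1))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom > 0 else 0.0)  # per-class F1
    return float(np.mean(f1s))
```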
To evaluate adversarial robustness, we used the MKA attack procedure described in Sect. 3. We restricted the attack to work on the main classes associated with the most important attributes of each problem, assuming that the decisions of the classifier on the other classes are not exposed, but only internally used to evaluate knowledge-related constraints and eventually reject samples that violate them. In ANIMALS and CIFAR-100 we assumed the attacker to have access to the information on the mutual exclusivity of the main classes, so that the pair of attacked classes in Eq. (9) is not required to change during the attack optimization. We also set the attack to maximize the confidence of misclassifications at each given perturbation bound ε. All the following results are averaged over two repetitions of the attack against each of the models obtained from the 3 training runs.
In the black-box setting, we assumed the attacker also to be aware of the network architecture of the target classifier, and attacks were generated from a surrogate model trained on a different realization of the training set. Fig. 2 shows the classification quality as a function of the perturbation bound ε, comparing models trained with and without constraints against those implementing the detection/rejection mechanism described in Eq. (3). When using such mechanism, the rejected examples are marked as correctly classified if they are adversarial (ε > 0); otherwise (ε = 0) they are marked as points belonging to an unknown class, slightly worsening the performance. The +C/+CC models show larger accuracy/F1 than the unconstrained ones. Despite the lower results at ε = 0, models that are more strongly constrained (+CC) proved harder to attack for increasing values of ε. When the knowledge-based detector is activated, the improvements with respect to models without rejection are clearly evident. No model is specifically designed to face adversarial attacks and, of course, there is no attempt to reach state-of-the-art results. However, the positive impact of exploiting domain knowledge can be observed in all the considered models and datasets, and for almost all the values of ε, confirming that such knowledge is not only useful to improve classifier robustness, but also as a means to detect adversarial examples at no additional training cost. In general, FT models yield better results, due to the larger number of optimized parameters. In ANIMALS the rejection dynamics provide large improvements in both TL and FT, while the impact of domain knowledge is mostly evident on the robustness of FT. In CIFAR-100, domain knowledge only consists of basic hierarchical relations, with no intersections among child classes or among father classes.
By inspecting the classifier, we found that it is quite frequent for the fooling examples to be predicted with a strongly-activated father class and a (coherent) child class, i.e., we have strongly-paired classes, according to Def. 1. Differently, the domain knowledge in the other datasets is more structured, yielding better detection quality on average and highlighting the importance of the level of detail of such knowledge to counter adversarial examples. In the case of PASCAL-Part, the detection mechanism turned out to behave better with unconstrained classifiers, even if it also has a positive impact on the constrained ones. This is due to the intrinsic difficulty of making predictions on this dataset, especially when considering small object-parts: the resulting false positives have a negative effect in the training stage of the knowledge-constrained classifiers.
To provide a comprehensive, worst-case evaluation of the adversarial robustness of our approach, we also considered a white-box adaptive attacker that knows everything about the target model and exploits knowledge of the defense mechanism to bypass it. Of course, this attack always evades detection if the perturbation size is sufficiently large. We evaluated multiple configurations of Eq. (9), selecting the one that yielded the lowest values of its objective function. In Fig. 3 we report the outcome of two selected cases, showing that, even if the accuracy drop is obviously evident for both datasets, in ANIMALS the constrained classifiers require larger perturbations than the unconstrained ones to reduce the performance by the same amount. Thus, fooling the detection mechanism is not always as trivial as one might expect, even in this worst-case setting. We refer the reader to the supplementary material for more details about these attacks and their optimization. Finally, let us point out that the performance drop caused by the white-box attack is much larger than that observed in the black-box case. However, since domain knowledge is not likely to be available to the attacker in many practical settings, it remains an open challenge to develop stronger, practical black-box attacks that are able to infer and exploit such knowledge to bypass our defense mechanism.
In this paper we investigated the role of domain knowledge in adversarial settings. Focusing on multi-label classification, we injected knowledge expressed by First-Order Logic into the training stage of the classifier, not only with the aim of improving its quality, but also as a means to build a detector of adversarial examples at no additional cost. We proposed a multi-label attack procedure and showed that knowledge-constrained classifiers can improve their robustness against both black-box and white-box attacks, depending on the nature of the available domain knowledge. We believe that these findings will open up the investigation of domain knowledge as a feature to further improve the robustness of multi-label classifiers against adversarial attacks.
The outcomes of this work might help foster advancements in studies on adversarial machine learning. In a long-term perspective, this could lead to the development of more robust machine learning-based multi-label classifiers. We believe that there are neither ethical aspects nor evident future societal consequences that should be discussed in the context of this work.
- Akcay et al. (2018) Samet Akcay, Amir Atapour-Abarghouei, and Toby P Breckon. Ganomaly: Semi-supervised anomaly detection via adversarial training. In Asian Conference on Computer Vision, pages 622–637. Springer, 2018.
- Alayrac et al. (2019) Jean-Baptiste Alayrac, Jonathan Uesato, Po-Sen Huang, Alhussein Fawzi, Robert Stanforth, and Pushmeet Kohli. Are labels required for improving adversarial robustness? In Advances in Neural Information Processing Systems, pages 12214–12223, 2019.
- Athalye et al. (2018) Anish Athalye, Nicholas Carlini, and David A. Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In ICML, volume 80 of JMLR Workshop and Conference Proceedings, pages 274–283. JMLR.org, 2018.
- Babbar and Schölkopf (2018) Rohit Babbar and Bernhard Schölkopf. Adversarial extreme multi-label classification. arXiv preprint arXiv:1803.01570, 2018.
- Bendale and Boult (2016) Abhijit Bendale and Terrance E Boult. Towards open set deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1563–1572, 2016.
- Biggio and Roli (2018) Battista Biggio and Fabio Roli. Wild patterns: Ten years after the rise of adversarial machine learning. Pattern Recognition, 84:317–331, 2018.
- Carlini and Wagner (2017) Nicholas Carlini and David Wagner. Adversarial examples are not easily detected: Bypassing ten detection methods. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pages 3–14, 2017.
- Carlini et al. (2019a) Nicholas Carlini, Anish Athalye, Nicolas Papernot, Wieland Brendel, Jonas Rauber, Dimitris Tsipras, Ian Goodfellow, Aleksander Madry, and Alexey Kurakin. On evaluating adversarial robustness. arXiv preprint arXiv:1902.06705, 2019a.
- Carlini et al. (2019b) Nicholas Carlini, Anish Athalye, Nicolas Papernot, Wieland Brendel, Jonas Rauber, Dimitris Tsipras, Ian Goodfellow, Aleksander Madry, and Alexey Kurakin. On evaluating adversarial robustness. arXiv preprint arXiv:1902.06705, 2019b.
- Carmon et al. (2019) Yair Carmon, Aditi Raghunathan, Ludwig Schmidt, John C Duchi, and Percy S Liang. Unlabeled data improves adversarial robustness. In Advances in Neural Information Processing Systems, pages 11190–11201, 2019.
- Demontis et al. (2019) Ambra Demontis, Marco Melis, Maura Pintor, Matthew Jagielski, Battista Biggio, Alina Oprea, Cristina Nita-Rotaru, and Fabio Roli. Why do adversarial attacks transfer? Explaining transferability of evasion and poisoning attacks. In 28th USENIX Security Symposium (USENIX Security 19). USENIX Association, 2019.
- Diligenti et al. (2017) Michelangelo Diligenti, Marco Gori, and Claudio Sacca. Semantic-based regularization for learning and inference. Artificial Intelligence, 244:143–165, 2017.
- Donadello et al. (2017) Ivan Donadello, Luciano Serafini, and Artur d’Avila Garcez. Logic tensor networks for semantic image interpretation. arXiv preprint arXiv:1705.08968, 2017.
- Gnecco et al. (2015) Giorgio Gnecco, Marco Gori, Stefano Melacci, and Marcello Sanguineti. Foundations of support constraint machines. Neural computation, 27(2):388–480, 2015.
- Goodfellow et al. (2018) Ian Goodfellow, Patrick McDaniel, and Nicolas Papernot. Making machine learning robust against adversarial inputs. Communications of the ACM, 61(7):56–66, 2018.
- Goodfellow et al. (2014) Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
- Gori and Melacci (2013) Marco Gori and Stefano Melacci. Constraint verification with kernel machines. IEEE Transactions on Neural Networks and Learning Systems, 24(5):825–831, 2013.
- Grosse et al. (2017) Kathrin Grosse, Praveen Manoharan, Nicolas Papernot, Michael Backes, and Patrick McDaniel. On the (statistical) detection of adversarial examples. arXiv preprint arXiv:1702.06280, 2017.
- Hein et al. (2019) Matthias Hein, Maksym Andriushchenko, and Julian Bitterwolf. Why relu networks yield high-confidence predictions far away from the training data and how to mitigate the problem. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 41–50, 2019.
- Hendrycks and Gimpel (2016) Dan Hendrycks and Kevin Gimpel. Early methods for detecting adversarial images. arXiv preprint arXiv:1608.00530, 2016.
- Hendrycks and Gimpel (2017) Dan Hendrycks and Kevin Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. ICLR, arXiv preprint arXiv:1610.02136, 2017.
- Klement et al. (2013) Erich Peter Klement, Radko Mesiar, and Endre Pap. Triangular norms, volume 8. Springer Science & Business Media, 2013.
- Lee et al. (2018) Kimin Lee, Kibok Lee, Honglak Lee, and Jinwoo Shin. A simple unified framework for detecting out-of-distribution samples and adversarial attacks. In Advances in Neural Information Processing Systems, pages 7167–7177, 2018.
- Ma et al. (2018) Xingjun Ma, Bo Li, Yisen Wang, Sarah M Erfani, Sudanthi Wijewickrema, Grant Schoenebeck, Dawn Song, Michael E Houle, and James Bailey. Characterizing adversarial subspaces using local intrinsic dimensionality. arXiv preprint arXiv:1801.02613, 2018.
- Melacci and Belkin (2011) Stefano Melacci and Mikhail Belkin. Laplacian support vector machines trained in the primal. Journal of Machine Learning Research, 12:1149–1184, 2011.
- Miller et al. (2020) David J Miller, Zhen Xiang, and George Kesidis. Adversarial learning targeting deep neural network classification: A comprehensive review of defenses against attacks. Proceedings of the IEEE, 108(3):402–433, 2020.
- Miyato et al. (2015) Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, Ken Nakae, and Shin Ishii. Distributional smoothing with virtual adversarial training. ICLR, arXiv preprint arXiv:1507.00677, 2015.
- Miyato et al. (2018) Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, and Shin Ishii. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE transactions on pattern analysis and machine intelligence, 41(8):1979–1993, 2018.
- Najafi et al. (2019) Amir Najafi, Shin-ichi Maeda, Masanori Koyama, and Takeru Miyato. Robustness to adversarial perturbations in learning from incomplete data. In Advances in Neural Information Processing Systems, pages 5542–5552, 2019.
- Pang et al. (2018) Tianyu Pang, Chao Du, Yinpeng Dong, and Jun Zhu. Towards robust detection of adversarial examples. In Advances in Neural Information Processing Systems, pages 4579–4589, 2018.
- Papernot et al. (2016a) Nicolas Papernot, Patrick McDaniel, and Ian Goodfellow. Transferability in machine learning: from phenomena to black-box attacks using adversarial samples. arXiv preprint arXiv:1605.07277, 2016a.
- Papernot et al. (2016b) Nicolas Papernot, Patrick McDaniel, Xi Wu, Somesh Jha, and Ananthram Swami. Distillation as a defense to adversarial perturbations against deep neural networks. In 2016 IEEE Symposium on Security and Privacy (SP), pages 582–597. IEEE, 2016b.
- Park et al. (2018) Sungrae Park, JunKeon Park, Su-Jin Shin, and Il-Chul Moon. Adversarial dropout for supervised and semi-supervised learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
- Samangouei et al. (2018) Pouya Samangouei, Maya Kabkab, and Rama Chellappa. Defense-gan: Protecting classifiers against adversarial attacks using generative models. ICLR, arXiv preprint arXiv:1805.06605, 2018.
- Shafahi et al. (2019) Ali Shafahi, W. Ronny Huang, Christoph Studer, Soheil Feizi, and Tom Goldstein. Are adversarial examples inevitable? In International Conference on Learning Representations, 2019.
- Sinha et al. (2018) Aman Sinha, Hongseok Namkoong, and John Duchi. Certifying some distributional robustness with principled adversarial training. ICLR, arXiv preprint arXiv:1710.10571, 2018.
- Song et al. (2018) Q. Song, H. Jin, X. Huang, and X. Hu. Multi-label adversarial perturbations. In 2018 IEEE International Conference on Data Mining (ICDM), pages 1242–1247, 2018.
- Sotgiu et al. (2020) Angelo Sotgiu, Ambra Demontis, Marco Melis, Battista Biggio, Giorgio Fumera, Xiaoyi Feng, and Fabio Roli. Deep neural rejection against adversarial examples. EURASIP Journal on Information Security, 2020:1–10, 2020.
- Teso (2019) Stefano Teso. Does symbolic knowledge prevent adversarial fooling? arXiv preprint arXiv:1912.10834, 2019.
- Winston and Horn (1986) Patrick Henry Winston and Berthold K Horn. Lisp. Addison-Wesley, 1986.
- Wu et al. (2017) Yi Wu, David Bamman, and Stuart Russell. Adversarial training for relation extraction. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1778–1783, 2017.
- Zhai et al. (2019) Runtian Zhai, Tianle Cai, Di He, Chen Dan, Kun He, John Hopcroft, and Liwei Wang. Adversarially robust generalization just requires more unlabeled data. arXiv preprint arXiv:1906.00555, 2019.
Appendix A Attack Optimization
Our attack optimizes Eq. (9) via projected gradient descent. Black-box attacks are non-adaptive, and thus ignore the defense mechanism. For this reason, the constraint loss term in our attack is ignored by setting its multiplier and . For white-box attacks on ANIMALS and PASCAL-PART, we set and , respectively, while setting . These values are chosen to appropriately scale the values of the constraint loss term w.r.t. the logit difference (i.e., the first term in Eq. 9, lower bounded by ). This is required to have the sample misclassified while also fulfilling the domain-knowledge constraints. The process is better illustrated in Figs. 4 and 5, in which we respectively report the behavior of the black-box and white-box attack optimization on a single image from the ANIMALS dataset, with . In particular, in each Figure we report the source image, the (magnified) adversarial perturbation, and the resulting adversarial examples, along with some plots describing the optimization process, i.e., how the attack loss of Eq. (9) is minimized across iterations, and how the softmax-scaled outputs on the main classes and the logarithm of the constraint loss change accordingly.
In both the black-box and white-box cases, the attack loss is progressively reduced across the iterations of the optimization procedure. In the black-box case, while the albatross prediction is progressively turned into ostrich, the constraint loss increases across iterations, eventually exceeding the rejection threshold: the adversarial example is thus correctly detected. The white-box attack is similarly able to initially flip the prediction from albatross to ostrich, letting the constraint loss increase. After this initial bump, however, the attack brings the constraint loss back below the rejection threshold, and the system thus fails to detect the corresponding adversarial example. Finally, it is worth remarking that, in both cases, the final perturbations do not substantially compromise the content of the source image, remaining essentially imperceptible to the human eye.
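The optimization loop described above can be summarized in a minimal sketch. Here `grad_fn` and `loss_fn` are hypothetical callables standing in for the attack objective of Eq. (9) (logit difference plus weighted constraint loss) and its gradient with respect to the input; the projection and step sizes are illustrative, not the exact values used in our experiments.

```python
import numpy as np

def pgd_attack(x, grad_fn, loss_fn, eps=0.1, step=0.05, iters=50):
    """Projected gradient descent on the attack loss.

    `loss_fn` and `grad_fn` are hypothetical placeholders for the
    objective of Eq. (9) and its gradient w.r.t. the input image.
    """
    x_adv = x.copy()
    for _ in range(iters):
        x_adv = x_adv - step * grad_fn(x_adv)      # descend the attack loss
        x_adv = np.clip(x_adv, x - eps, x + eps)   # project onto the eps-ball
        x_adv = np.clip(x_adv, 0.0, 1.0)           # keep a valid image range
    return x_adv, loss_fn(x_adv)
```

Detection then amounts to comparing the constraint loss of the returned sample against the rejection threshold.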
Appendix B Parameter Settings
In Table 3, for each model, we report the optimal value of used in our experiments, selected via a 3-fold cross-validation procedure. For completeness, in Table 4, we also report the value of the constraint loss measured on the test set . We used , setting each component to , with the exception of the weight of the mutual exclusivity constraint (or of the disjunction of the main classes), which was set to so as to force the classifier to take decisions on the unsupervised portion of the training data.
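The k-fold selection of the constraint weight can be sketched as follows. The function name, the candidate grid, and `train_eval_fn` (assumed to train on one fold and return a validation score) are hypothetical placeholders, not our exact procedure.

```python
import numpy as np

def select_lambda(X, y, train_eval_fn, candidates, k=3):
    """Pick the constraint weight by k-fold cross-validation.

    `train_eval_fn(X_tr, y_tr, X_va, y_va, lam)` is a hypothetical
    callable returning the validation score for a single fold.
    """
    folds = np.array_split(np.arange(len(X)), k)
    best_lam, best_score = None, -np.inf
    for lam in candidates:
        scores = []
        for i in range(k):
            va = folds[i]
            tr = np.concatenate([folds[j] for j in range(k) if j != i])
            scores.append(train_eval_fn(X[tr], y[tr], X[va], y[va], lam))
        if np.mean(scores) > best_score:
            best_lam, best_score = lam, np.mean(scores)
    return best_lam
```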
Appendix C Domain Knowledge
Each dataset is composed of a set of attributes (classes) that we formalize with logic predicates. Such predicates participate in First-Order Logic (FOL) formulas that model the available domain knowledge. The FOL formulas that define the domain knowledge of the ANIMALS, CIFAR-100 and PASCAL-Part data are reported in Table 5, Table C, and Table 8, respectively, where each predicate is indicated with capital letters. In the bottom part of each table, we also report the rules stating that at least one of the attributes of each level of the hierarchy must be active. Following the nomenclature used in the paper, the main attributes of the ANIMALS dataset are ALBATROSS, GIRAFFE, CHEETAH, OSTRICH, PENGUIN, TIGER, ZEBRA, while the other attributes are MAMMAL, HAIR, MILK, FEATHERS, BIRD, FLY, LAYEGGS, MEAT, CARNIVORE, POINTEDTEETH, CLAWS, FORWARDEYES, HOOFS, UNGULATE, CUD, EVENTOED, TAWNY, BLACKSTRIPES, LONGLEGS, LONGNECK, DARKSPOTS, WHITE, BLACK, SWIM, BLACKWHITE, GOODFLIER. In the case of the CIFAR-100 dataset, the main attributes are the ones associated with the predicates of Table C that belong to the premises of the shortest FOL formulas (i.e., the formulas of the form A ⇒ B, where the main attribute is A). Formulas in PASCAL-Part are relationships between objects and object parts. The same part can belong to multiple objects, and in each object several parts might be visible. See Table 8 for the list of classes (main classes are in the premises of the second block of formulas).
In ANIMALS and CIFAR-100, a mutual exclusion predicate is imposed on the main classes since, in these two datasets, each image is only about a single main class. The predicate, defined below, can be devised in different ways. The first, straightforward approach consists in taking the disjunction of the true rows of the truth table of the predicate:
where is the set of the main classes, with cardinality and is the logic predicate corresponding to the -th output of the network . This formulation of the predicate is what we used in the ANIMALS dataset. When there are several classes, as in CIFAR-100, this formulation leads to optimization issues, since it turned out to be complicated to find a good balance between the effect of this constraint and the supervision-fitting term. For this reason, the mutual exclusivity in CIFAR-100 was defined as a disjunction of the main classes followed by a set of implications that implement the mutual exclusion of the predicates,
which turned out to be easier to tune, since we obtain multiple soft constraints that can eventually be violated to accommodate the optimization procedure.
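Both encodings can be turned into differentiable truth degrees via a t-norm. Below is a minimal sketch assuming the product t-norm and its dual t-conorm, one of the standard choices in the framework of Diligenti et al. (2017); the exact t-norm used in our experiments may differ, and the function names are illustrative.

```python
def t_and(a, b): return a * b           # product t-norm (conjunction)
def t_or(a, b):  return a + b - a * b   # dual t-conorm (disjunction)
def t_not(a):    return 1.0 - a         # strong negation

def exactly_one_truthtable(p):
    """Truth degree of OR_i (p_i AND AND_{j != i} NOT p_j):
    the disjunction of the true rows of the truth table."""
    deg = 0.0
    for i in range(len(p)):
        term = p[i]
        for j in range(len(p)):
            if j != i:
                term = t_and(term, t_not(p[j]))
        deg = t_or(deg, term)
    return deg

def exactly_one_implications(p):
    """Softer encoding: one disjunction of all classes plus pairwise
    implications p_i => NOT p_j (here as material implications)."""
    disj = 0.0
    for pi in p:
        disj = t_or(disj, pi)
    impls = [t_or(t_not(p[i]), t_not(p[j]))
             for i in range(len(p)) for j in range(i + 1, len(p))]
    return disj, impls
```

The corresponding constraint losses are one minus these truth degrees; the second encoding yields several small terms that can be weighted and violated independently during optimization.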
FLY ∧ LAYEGGS ⇒ BIRD
MAMMAL ∧ MEAT ⇒ CARNIVORE
MAMMAL ∧ POINTEDTEETH ∧ CLAWS ∧ FORWARDEYES ⇒ CARNIVORE
MAMMAL ∧ HOOFS ⇒ UNGULATE
MAMMAL ∧ CUD ⇒ UNGULATE
MAMMAL ∧ CUD ⇒ EVENTOED
CARNIVORE ∧ TAWNY ∧ DARKSPOTS ⇒ CHEETAH
CARNIVORE ∧ TAWNY ∧ BLACKSTRIPES ⇒ TIGER
UNGULATE ∧ LONGLEGS ∧ LONGNECK ∧ TAWNY ∧ DARKSPOTS ⇒ GIRAFFE
BLACKSTRIPES ∧ UNGULATE ∧ WHITE ⇒ ZEBRA
BIRD ∧ ¬FLY ∧ LONGLEGS ∧ LONGNECK ∧ BLACK ⇒ OSTRICH
BIRD ∧ ¬FLY ∧ SWIM ∧ BLACKWHITE ⇒ PENGUIN
BIRD ∧ GOODFLIER ⇒ ALBATROSS
MAMMAL ∨ HAIR ∨ MILK ∨ FEATHERS ∨ BIRD ∨ FLY ∨ LAYEGGS ∨ MEAT ∨ CARNIVORE ∨ POINTEDTEETH ∨ CLAWS ∨ FORWARDEYES ∨ HOOFS ∨ UNGULATE ∨ CUD ∨ EVENTOED ∨ TAWNY ∨ BLACKSTRIPES ∨ LONGLEGS ∨ LONGNECK ∨ DARKSPOTS ∨ WHITE ∨ BLACK ∨ SWIM ∨ BLACKWHITE ∨ GOODFLIER
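Each implication rule of this kind can be converted into a single constraint-loss term. A minimal sketch, assuming the product t-norm with the implication rewritten as material implication (a simplifying assumption for illustration, not necessarily the exact translation used in our system):

```python
def conj(values):
    """Product t-norm conjunction of a list of truth degrees."""
    out = 1.0
    for v in values:
        out *= v
    return out

def implication_loss(premises, conclusion):
    """Loss (1 - truth degree) of (AND premises) => conclusion, with the
    implication read as NOT(premises) OR conclusion under the product
    t-norm; algebraically this simplifies to a * (1 - b)."""
    a = conj(premises)
    degree = (1.0 - a) + conclusion - (1.0 - a) * conclusion
    return 1.0 - degree
```

For instance, strongly activated premises with a deactivated conclusion (e.g., high CARNIVORE, TAWNY and DARKSPOTS but low CHEETAH) yield a loss close to one, while a satisfied rule yields a loss close to zero.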
AQUATIC MAMMALS ⇒ (BEAVER ∨ DOLPHIN ∨ OTTER ∨ SEAL ∨ WHALE)
BEAVER ⇒ AQUATIC MAMMALS
DOLPHIN ⇒ AQUATIC MAMMALS
OTTER ⇒ AQUATIC MAMMALS
SEAL ⇒ AQUATIC MAMMALS
WHALE ⇒ AQUATIC MAMMALS
FISH ⇒ (AQUARIUM FISH ∨ FLATFISH ∨ RAY ∨ SHARK ∨ TROUT)
FLOWERS ⇒ (ORCHID ∨ POPPY ∨ ROSE ∨ SUNFLOWER ∨ TULIP)
FOOD_CONTAINERS ⇒ (BOTTLE ∨ BOWL ∨ CAN ∨ CUP ∨ PLATE)
FRUIT_AND_VEGETABLES ⇒ (APPLE ∨ MUSHROOM ∨ ORANGE ∨ PEAR ∨ SWEET_PEPPER)
HOUSEHOLD_ELECTRICAL_DEVICES ⇒ (CLOCK ∨ KEYBOARD ∨ LAMP ∨ TELEPHONE ∨ TELEVISION)
HOUSEHOLD_FURNITURE ⇒ (BED ∨ CHAIR ∨ COUCH ∨ TABLE ∨ WARDROBE)
INSECTS ⇒ (BEE ∨ BEETLE ∨ BUTTERFLY ∨ CATERPILLAR ∨ COCKROACH)
LARGE_CARNIVORES ⇒ (BEAR ∨ LEOPARD ∨ LION ∨ TIGER ∨ WOLF)
LARGE_MAN-MADE_OUTDOOR_THINGS ⇒ (BRIDGE ∨ CASTLE ∨ HOUSE ∨ ROAD ∨ SKYSCRAPER)
LARGE_NATURAL_OUTDOOR_SCENES ⇒ (CLOUD ∨ FOREST ∨ MOUNTAIN ∨ PLAIN ∨ SEA)
LARGE_OMNIVORES_AND_HERBIVORES ⇒ (CAMEL ∨ CATTLE ∨ CHIMPANZEE ∨ ELEPHANT ∨ KANGAROO)
MEDIUM_MAMMALS ⇒ (FOX ∨ PORCUPINE ∨ POSSUM ∨ RACCOON ∨ SKUNK)
NON-INSECT_INVERTEBRATES ⇒ (CRAB ∨ LOBSTER ∨ SNAIL ∨ SPIDER ∨ WORM)
PEOPLE ⇒ (BABY ∨ MAN ∨ WOMAN ∨ BOY ∨ GIRL)
REPTILES ⇒ (CROCODILE ∨ DINOSAUR ∨ LIZARD ∨ SNAKE ∨ TURTLE)
SMALL_MAMMALS ⇒ (HAMSTER ∨ MOUSE ∨ RABBIT ∨ SHREW ∨ SQUIRREL)
TREES ⇒ (MAPLE_TREE ∨ OAK_TREE ∨ PALM_TREE ∨ PINE_TREE ∨ WILLOW_TREE)
VEHICLES1 ⇒ (BIKE ∨ BUS ∨ MOTORBIKE ∨ PICKUP_TRUCK ∨ TRAIN)
VEHICLES2 ⇒ (LAWN MOWER ∨ ROCKET ∨ STREETCAR ∨ TANK ∨ TRACTOR)
LAWN MOWER ⇒ VEHICLES2
( APPLE ∨ AQUARIUM_FISH ∨ BABY ∨ BEAR ∨ BEAVER ∨ BED ∨ BEE ∨ BEETLE ∨ BICYCLE ∨ BOTTLE ∨ BOWL ∨ BOY ∨ BRIDGE ∨ BUS ∨ BUTTERFLY ∨ CAMEL ∨ CAN ∨ CASTLE ∨ CATERPILLAR ∨ CATTLE ∨ CHAIR ∨ CHIMPANZEE ∨ CLOCK ∨ CLOUD ∨ COCKROACH ∨ COUCH ∨ CRAB ∨ CROCODILE ∨ CUP ∨ DINOSAUR ∨ DOLPHIN ∨ ELEPHANT ∨ FLATFISH ∨ FOREST ∨ FOX ∨ GIRL ∨ HAMSTER ∨ HOUSE ∨ KANGAROO ∨ KEYBOARD ∨ LAMP ∨ LAWN_MOWER ∨ LEOPARD ∨ LION ∨ LIZARD ∨ LOBSTER ∨ MAN ∨ MAPLE_TREE ∨ MOTORCYCLE ∨ MOUNTAIN ∨ MOUSE ∨ MUSHROOM ∨ OAK_TREE ∨ ORANGE ∨ ORCHID ∨ OTTER ∨ PALM_TREE ∨ PEAR ∨ PICKUP_TRUCK ∨ PINE_TREE ∨ PLAIN ∨ PLATE ∨ POPPY ∨ PORCUPINE ∨ POSSUM ∨ RABBIT ∨ RACCOON ∨ RAY ∨ ROAD ∨ ROCKET ∨ ROSE ∨ SEA ∨ SEAL ∨ SHARK ∨ SHREW ∨ SKUNK ∨ SKYSCRAPER ∨ SNAIL ∨ SNAKE ∨ SPIDER ∨ SQUIRREL ∨ STREETCAR ∨ SUNFLOWER ∨ SWEET_PEPPER ∨ TABLE ∨ TANK ∨ TELEPHONE ∨ TELEVISION ∨ TIGER ∨ TRACTOR ∨ TRAIN ∨ TROUT ∨ TULIP ∨ TURTLE ∨ WARDROBE ∨ WHALE ∨ WILLOW_TREE ∨ WOLF ∨ WOMAN ∨ WORM )
( AQUATIC MAMMALS ∨ FISH ∨ FLOWERS ∨ FOOD CONTAINERS ∨ FRUIT AND VEGETABLES ∨ HOUSEHOLD ELECTRICAL ∨ HOUSEHOLD FURNITURE ∨ INSECTS ∨ LARGE CARNIVORES ∨ MAN-MADE OUTDOOR ∨ NATURAL OUTDOOR SCENES ∨ OMNIVORES AND HERBIVORES ∨ MEDIUM MAMMALS ∨ INVERTEBRATES ∨ PEOPLE ∨ REPTILES ∨ SMALL MAMMALS ∨ TREES ∨ VEHICLES1 ∨ VEHICLES2 )
TORSO ⇒ (PERSON ∨ HORSE ∨ COW ∨ DOG ∨ BIRD ∨ CAT ∨ SHEEP)
LEG ⇒ (PERSON ∨ HORSE ∨ COW ∨ DOG ∨ BIRD ∨ CAT ∨ SHEEP)
HEAD ⇒ (PERSON ∨ HORSE ∨ COW ∨ DOG ∨ BIRD ∨ CAT ∨ SHEEP)
EAR ⇒ (PERSON ∨ HORSE ∨ COW ∨ DOG ∨ CAT ∨ SHEEP)
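Hierarchy implications of this kind (superclass ⇒ disjunction of members, member ⇒ superclass, and analogously the object/part relations of PASCAL-Part) can be checked directly on a set of predicted labels. A small illustrative sketch, using a hypothetical two-superclass excerpt of the tables above:

```python
# A hypothetical excerpt of the superclass -> members tables above.
HIERARCHY = {
    "AQUATIC_MAMMALS": ["BEAVER", "DOLPHIN", "OTTER", "SEAL", "WHALE"],
    "TREES": ["MAPLE_TREE", "OAK_TREE", "PALM_TREE", "PINE_TREE", "WILLOW_TREE"],
}

def violated_rules(active):
    """Return the implications violated by a set of predicted labels:
    SUPER => OR(members) and member => SUPER, read off HIERARCHY."""
    bad = []
    for sup, members in HIERARCHY.items():
        # superclass active but none of its members predicted
        if sup in active and not any(m in active for m in members):
            bad.append(f"{sup} => {' OR '.join(members)}")
        # member active but its superclass not predicted
        for m in members:
            if m in active and sup not in active:
                bad.append(f"{m} => {sup}")
    return bad
```

Such a check is the crisp counterpart of the soft constraint losses used at training and detection time: a non-empty result corresponds to a non-zero constraint loss.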