Can Domain Knowledge Alleviate Adversarial Attacks in Multi-Label Classifiers?

06/06/2020, by Stefano Melacci et al., Università di Siena, UNIFI, Università di Cagliari

Adversarial attacks on machine learning-based classifiers, along with defense mechanisms, have been widely studied in the context of single-label classification problems. In this paper, we shift the attention to multi-label classification, where the availability of domain knowledge on the relationships among the considered classes may offer a natural way to spot incoherent predictions, i.e., predictions associated with adversarial examples lying outside of the training data distribution. We explore this intuition in a framework in which first-order logic knowledge is converted into constraints and injected into a semi-supervised learning problem. Within this setting, the constrained classifier learns to fulfill the domain knowledge over the marginal distribution, and can naturally reject samples with incoherent predictions. Even though our method does not exploit any knowledge of attacks during training, our experimental analysis surprisingly unveils that domain-knowledge constraints can help detect adversarial examples effectively, especially if such constraints are not known to the attacker. While we also show that an adaptive attack exploiting knowledge of the constraints may still deceive our classifier, it remains an open issue to understand how hard it would be for an attacker to infer such constraints in practical cases. For this reason, we believe that our approach may provide a significant step towards designing robust multi-label classifiers.


1 Introduction

In the last few years, motivated by the large popularity of deep learning-based models, a growing number of studies on the properties of adversarial attacks and of the corresponding defenses have been produced by the scientific community Papernot et al. (2016a); Biggio and Roli (2018); Goodfellow et al. (2018); Carlini et al. (2019b); Shafahi et al. (2019); Sotgiu et al. (2020) (see Miller et al. (2020) for a recent review on this topic). Most of the existing works either propose methods for improving classifier robustness by modifying the learning algorithm to explicitly account for the presence of adversarial data perturbations Goodfellow et al. (2014); Papernot et al. (2016b); Sinha et al. (2018), or develop specific detection mechanisms for adversarial examples Carlini and Wagner (2017); Ma et al. (2018); Samangouei et al. (2018); Pang et al. (2018); Lee et al. (2018); Miller et al. (2020). Only a few approaches focus on the semi-supervised learning setting Miyato et al. (2015); Park et al. (2018); Akcay et al. (2018); Carmon et al. (2019); Miyato et al. (2018); Zhai et al. (2019); Najafi et al. (2019); Alayrac et al. (2019), although it provides a natural setting for many real-world applications in which labeling data is costly while unlabeled samples are readily available. More importantly, to the best of our knowledge, the problem of multi-label classification, in which each example can belong to more than one class, is only preliminarily discussed in the context of adversarial learning in Song et al. (2018), while the use of adversarial examples to improve some types of multi-label classifiers is evaluated in Wu et al. (2017); Babbar and Schölkopf (2018).

In this paper, we focus on multi-label classification and, in particular, on the case in which domain knowledge on the relationships among the considered classes is available. Such knowledge can be naturally expressed by First-Order Logic (FOL) clauses, and, following the learning framework of Gnecco et al. (2015); Diligenti et al. (2017), it can be used to improve the classifier by enforcing FOL-based constraints on the unsupervised or partially labeled portions of the training set. A well-known intuition in adversarial machine learning suggests that a reliable model of the data distribution could be used to spot adversarial examples, since they are not sampled from that distribution, although exploiting it is not a straightforward procedure Grosse et al. (2017). We borrow this intuition and combine it with the idea that semi-supervised examples can help learn decision boundaries that better follow the marginal data distribution, coherently with the available knowledge Melacci and Belkin (2011); Diligenti et al. (2017). For these reasons, we study the constraints that implement domain knowledge not only as a means to better shape the decision boundaries, but also as a measure to spot examples that are likely to have been generated in an adversarial setting.

What we propose differs from existing literature on semi-supervised learning in adversarial settings. Miyato et al. Miyato et al. (2015, 2018) and Park et al. Park et al. (2018) exploit adversarial training (virtual adversarial training and adversarial dropout, respectively) to favor regularity around the supervised and unsupervised training data, with the aim of improving the classifier performance. The work in Akcay et al. (2018) develops an anomaly detector using adversarial training in the semi-supervised setting. Self-supervised learning is exploited in Carmon et al. (2019); Najafi et al. (2019) to improve adversarial robustness, stability criteria are enforced on unlabeled training data in Zhai et al. (2019), while the work in Alayrac et al. (2019) specifically focuses on an unsupervised adversarial training procedure, exploited in semi-supervised classification. Our model exploits neither adversarial training nor any adversary-aware training criteria aimed at gaining intrinsic regularity. We focus on the role of domain knowledge as an indirect means to increase adversarial robustness and, afterwards, to detect adversarial examples. Indeed, all the described methods could also be applied jointly with what we propose. Moreover, our approach also differs from other adversarial-example detectors Carlini and Wagner (2017); Ma et al. (2018); Samangouei et al. (2018); Miller et al. (2020) as it has no additional training cost and negligible runtime cost.

This paper contributes by showing that domain knowledge is a powerful feature (1) to improve the robustness of multi-label classifiers and (2) to detect adversarial examples. To properly evaluate the robustness of our approach, which remains one of the most challenging problems in adversarial machine learning Carlini et al. (2019a); Athalye et al. (2018); Biggio and Roli (2018), we propose (3) a novel multi-label knowledge-driven attack that can implement both black-box and white-box adaptive attacks. While we show that an adaptive attack having access to the domain knowledge exploited by our classifier can bypass it, even though at the cost of an increased perturbation size, it remains an open issue to understand how hard it would be for an attacker to infer such knowledge in practical cases. For this reason, we believe that our work can provide a significant contribution towards both evaluating and designing robust multi-label classifiers.

2 Learning with Domain Knowledge

We consider a vector function $f = [f_1, \ldots, f_n]$, where $f_j \colon \mathcal{X} \to \mathbb{R}$ and $\mathcal{X}$ is the considered input domain. Each function $f_j$ is responsible for implementing a specific task on $\mathcal{X}$ (this notion can be trivially extended to the case in which the task functions operate on different domains). In the context of this paper, without loss of generality, we consider multi-label classification problems with $n$ classes, in which each input $x \in \mathcal{X}$ is associated with one or more classes. Function $f_j$ predicts the membership degree of $x$ to the $j$-th class. Moreover, when we restrict the output of $f_j$ to $[0,1]$, we can think of $f_j$ as the fuzzy logic predicate that models the truth degree of $x$ belonging to class $j$. In order to simplify the notation, we will frequently make no explicit distinction between function names, predicate names, class names, or between input samples and predicate variables.
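As a concrete reference for this notation, the following minimal sketch (a hypothetical PyTorch implementation, not the one used in the paper; class and argument names are illustrative) realizes $f$ as a backbone followed by $n$ independent sigmoid outputs, so that each $f_j(x) \in [0,1]$ can be read as the truth degree of the $j$-th fuzzy predicate.

import torch
import torch.nn as nn

# Minimal sketch: a feature extractor followed by n independent sigmoid outputs.
class MultiLabelClassifier(nn.Module):
    def __init__(self, backbone: nn.Module, feat_dim: int, n_classes: int):
        super().__init__()
        self.backbone = backbone              # e.g., a pretrained feature extractor
        self.head = nn.Linear(feat_dim, n_classes)

    def logits(self, x):
        return self.head(self.backbone(x))

    def forward(self, x):
        # independent sigmoids: no structural dependency is imposed among classes
        return torch.sigmoid(self.logits(x))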

Whenever we focus on the predicate-oriented interpretation of each $f_j$, First-Order Logic (FOL) becomes the natural way of describing relationships among the considered classes, i.e., the most effective type of domain knowledge that could eventually be available in a multi-label problem; e.g., $\forall x\ f_a(x) \land f_b(x) \Rightarrow f_c(x)$, for some classes $a$, $b$, $c$, meaning that the intersection of the $a$-th and the $b$-th classes is always included in the $c$-th one. The framework of Learning from Constraints Gnecco et al. (2015); Gori and Melacci (2013); Diligenti et al. (2017) follows the idea of converting domain knowledge into constraints on the learning problem, and it studies, amongst a variety of other knowledge-oriented constraints (see, e.g., Table 2 in Gnecco et al. (2015)), the process of handling FOL formulas so that they can be either injected into the learning problem or used as a knowledge verification measure Gori and Melacci (2013); Diligenti et al. (2017). Such knowledge is enforced on those training examples for which either no information or only partial/incomplete labeling is available, thus casting the learning problem in the semi-supervised setting. As a result, the multi-label classifier can improve its performance and make predictions on out-of-sample data that are more coherent with the domain knowledge (see, e.g., Table 4 in Gnecco et al. (2015)). In particular, FOL formulas that represent the domain knowledge of the considered problem are converted into numerical constraints using Triangular Norms (T-Norms, Klement et al. (2013)), binary functions that generalize the conjunction operator $\land$. Following the previous example, the formula is converted into a bilateral constraint $t(f(x)) = 1$, where $t(\cdot) \in [0,1]$ is the T-Norm-based translation of the formula body (with the product T-Norm, conjunctions become products of the involved predicates). The 1 on the right-hand side of the constraint is due to the fact that the numerical formula must hold true (i.e., evaluate to 1), while the left-hand side is in $[0,1]$. We indicate with $\phi_k(f(x))$ the loss function associated with the $k$-th formula. In the simplest case (as followed in this paper) such loss is $\phi_k(f(x)) = 1 - t_k(f(x))$, whose minimum value is zero. The quantifier $\forall$ is translated by enforcing the constraints on a discrete data sample $\mathcal{S}$. The loss function associated with the whole knowledge is obtained by taking the sum (average) over the data in $\mathcal{S}$ and, since we usually have $q$ formulas whose relative importance could be uneven, we get

$\mathcal{L}(\lambda, f, \mathcal{S}) = \sum_{k=1}^{q} \lambda_k \, \frac{1}{|\mathcal{S}|} \sum_{x \in \mathcal{S}} \phi_k(f(x)),$    (1)

where $\lambda = [\lambda_1, \ldots, \lambda_q]$ is the vector that collects the scalar weights of the FOL formulas, and $\lambda_k \geq 0$.
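To make the conversion concrete, the sketch below shows one possible product T-Norm translation and an Eq. (1)-style constraint loss. The rendering of the implication as the strong disjunction of the negated premise and the conclusion is an assumption (the paper does not spell out this detail), and the helper names are illustrative.

import torch

# Product T-Norm connectives on truth degrees in [0,1]:
# AND -> product, NOT -> 1 - a, OR via probabilistic sum, implication as NOT(a) OR b.
def t_and(a, b): return a * b
def t_not(a):    return 1.0 - a
def t_or(a, b):  return a + b - a * b
def t_implies(a, b): return t_or(t_not(a), b)

def constraint_loss(outputs, formulas, weights):
    """Eq. (1)-style loss: weighted sum, over the formulas, of the batch average
    of 1 - t_k(f(x)). `outputs` is a (batch, n_classes) tensor of sigmoid
    activations; each formula maps outputs to truth degrees in [0,1]."""
    total = 0.0
    for lam_k, formula_k in zip(weights, formulas):
        t_k = formula_k(outputs)              # (batch,) truth degree of formula k
        total = total + lam_k * (1.0 - t_k).mean()
    return total

# Example: "forall x, f_a(x) AND f_b(x) => f_c(x)" for (illustrative) indices a, b, c.
a, b, c = 0, 1, 2
example_formula = lambda o: t_implies(t_and(o[:, a], o[:, b]), o[:, c])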

In this paper, $f$ is implemented with a neural architecture with $n$ output units and weights collected in $\mathcal{W}$. We distinguish between the use of Eq. (1) as a loss function in the training stage and its use as a measure to evaluate the constraint fulfillment on out-of-sample data. In detail, the classifier is trained on the training set $\mathcal{X}$ by minimizing

$\mathrm{suploss}(f, \mathcal{X}, \mathcal{Y}) + \lambda \, \mathcal{L}(\lambda_t, f, \mathcal{X}),$    (2)

where $\lambda_t$ collects the importance of the FOL formulas at training time, and $\lambda \geq 0$ modulates the weight of the constraint loss with respect to the supervision loss suploss, $\mathcal{Y}$ being the supervision information attached to some of the data in $\mathcal{X}$. The optimal $\lambda$ is chosen by cross-validation, maximizing the classifier performance. When the classifier is evaluated on a test sample $x$, the measure

$\mathcal{L}(\lambda_c, f, \{x\}),$    (3)

computed with weights $\lambda_c$ (whose components $\lambda_{c,k} \geq 0$), returns a score that indicates the fulfillment of the domain knowledge on $x$ (the lower the better). Note that $\lambda_c$ and $\lambda_t$ might not necessarily be equivalent, even if they are certainly related. In particular, one may differently weigh the importance of some formulas during training to better accommodate the gradient-descent procedure and avoid bad local minima.

It is important to notice that Eq. (2) enforces domain knowledge only on the training data $\mathcal{X}$. There are no guarantees that such knowledge will be fulfilled on the whole input space. This suggests that optimizing Eq. (2) yields a stronger fulfillment of the knowledge over the space regions where the training points are distributed (low values of $\mathcal{L}$), while $\mathcal{L}$ could return larger values when departing from the distribution of the training data. The constraint enforcement is soft, so the second term in Eq. (2) is not necessarily zero at the end of the optimization.
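A possible implementation of the training objective of Eq. (2) is sketched below, assuming a binary cross-entropy supervision loss restricted to the known attribute labels (the paper does not specify suploss, so this choice is an assumption); it reuses constraint_loss from the previous sketch.

import torch
import torch.nn.functional as F

def training_loss(model, x_batch, y_batch, y_mask, formulas, lambda_t, lam):
    """Sketch of Eq. (2): supervision loss on the known (possibly partial) labels
    plus lam * constraint loss on all training points, labeled or not.
    `y_batch` holds float targets and `y_mask` marks the known entries."""
    out = model(x_batch)                               # (batch, n) in [0,1]
    if y_mask.any():
        sup = F.binary_cross_entropy(out[y_mask], y_batch[y_mask])
    else:
        sup = out.new_tensor(0.0)                      # fully unlabeled batch
    cons = constraint_loss(out, formulas, lambda_t)    # Eq. (1) on the whole batch
    return sup + lam * cons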

3 Exploiting Domain Knowledge against Adversarial Attacks

The basic idea behind this paper is that the constraint loss of Eq. (1) is not only useful to enforce domain knowledge into the learning problem, but also (i) to gain some robustness with respect to adversarial attacks and (ii) to serve as a tool to detect adversarial examples at no additional training cost. The example in Fig. 1 illustrates the main principles followed in this work, in a multi-label classification problem with four classes (cat, animal, motorbike, vehicle) for which the following domain knowledge is available, together with labeled and unlabeled training data:

∀x cat(x) ⇒ animal(x)    (4)
∀x motorbike(x) ⇒ vehicle(x)    (5)
∀x ¬(animal(x) ∧ vehicle(x))    (6)
∀x cat(x) ∨ animal(x) ∨ motorbike(x) ∨ vehicle(x)    (7)

Such knowledge is converted into numerical constraints, as described in Sect. 2, and the resulting constraint loss is enforced on the training data predictions during classifier training (Eq. (2)). Fig. 1 shows two instances of the learned classifier.
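For illustration, the toy knowledge above can be encoded with the product T-Norm helpers sketched in Sect. 2 as follows; the class indices, the weights, and the encoding itself are illustrative and follow the formulas as listed in Eqs. (4)-(7).

# Toy knowledge of Eqs. (4)-(7), encoded with the product T-Norm helpers above
# (illustrative encoding; class order: cat, animal, motorbike, vehicle).
CAT, ANIMAL, MOTORBIKE, VEHICLE = 0, 1, 2, 3
toy_formulas = [
    lambda o: t_implies(o[:, CAT], o[:, ANIMAL]),            # Eq. (4)
    lambda o: t_implies(o[:, MOTORBIKE], o[:, VEHICLE]),     # Eq. (5)
    lambda o: t_not(t_and(o[:, ANIMAL], o[:, VEHICLE])),     # Eq. (6)
    lambda o: t_or(t_or(o[:, CAT], o[:, ANIMAL]),            # Eq. (7)
                   t_or(o[:, MOTORBIKE], o[:, VEHICLE])),
]
toy_weights = [1.0, 1.0, 1.0, 1.0]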



Figure 1: Toy example using the domain knowledge of Eqs. (4-7) on 4 classes: cat (yellow), animal (blue), motorbike (green), vehicle (red). Labeled/unlabeled training data are depicted with rounded dots/gray triangles. (a,b) The decision regions for each class are shown: (a) open/loose decision boundaries; (b) tight/closed decision boundaries. The white area is associated with no predictions. Some adversarial examples (purple arrows/dots) are detected as they end up in regions that violate the constraints. (c) The feasible/unfeasible regions that fulfill/violate the constraints for (a,b) are shown.

Considering point (i), in both cases the decision boundaries are altered on the unlabeled data, enforcing the classifier to take a knowledge-coherent decision over the unlabeled training points and to better cover the marginal distribution of the data. This knowledge-driven regularity improves classifier robustness to adversarial attacks, as we will discuss in Sect. 4. Going into further details to illustrate claim (ii), in (a) we have the most likely case, in which the decision boundaries are not always perfectly tight to the data distribution and might not be closed (ReLU networks typically return high-confidence predictions far from the training data Hein et al. (2019)). Three different attacks are shown (purple). In attack 1, an example of motorbike is perturbed to become an element of the cat class, but Eq. (4) is not fulfilled anymore. In attack 2, an example of animal is attacked to avoid being predicted as animal. However, it falls in a region where no predictions are yielded, violating Eq. (7). Attack 3 consists of an adversarial attack that creates a fake cat which, however, is also predicted as vehicle, thus violating Eq. (4) and Eq. (6). In (b) we have an ideal and extreme case, with very tight and closed decision boundaries. Since some classes are well separated, it is harder to generate adversarial examples by slightly perturbing the available data, while it is easy to fall in regions for which Eq. (7) is not fulfilled. The pictures in (c) show the unfeasible regions, in which the constraint loss is significantly larger, thus offering a natural criterion to spot adversarial examples that fall outside of the training data distribution.

Following these intuitions, and motivated by the approach of Hendrycks and Gimpel (2017, 2016), we define a rejection criterion as the Boolean expression

$\mathrm{reject}(x) = \big( \mathcal{L}(\lambda_c, f, \{x\}) > \tau \big),$    (8)

where the threshold $\tau$ is estimated by cross-validation in order to avoid rejecting (or to reject only a small number of, 10% in our experiments) the examples in the validation set $\mathcal{X}_v$. Eq. (8) computes the constraint loss as in Eq. (3), using the importance weights $\lambda_c$ (that we will discuss in what follows), and compares it with the threshold calibrated on the validation data $\mathcal{X}_v$. The rationale behind this idea is that those samples for which the constraint loss is larger than what it is on the distribution of the considered data should be rejected. The training samples are the ones over which domain knowledge was enforced during the training stage, while the validation set represents data on which knowledge was not enforced, but that are sampled from the same distribution from which the training set is sampled, making them good candidates for estimating $\tau$.
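A possible implementation of this rejection mechanism is sketched below. The percentile-based calibration of the threshold is an assumption that mirrors the "reject at most 10% of the validation samples" setting, and per_sample_constraint_loss is simply the per-sample variant of Eq. (1).

import torch

def per_sample_constraint_loss(outputs, formulas, weights):
    # per-sample variant of Eq. (1): no averaging over the batch dimension
    return sum(l * (1.0 - phi(outputs)) for l, phi in zip(weights, formulas))

def calibrate_threshold(model, x_val, formulas, lambda_c, keep=0.90):
    """Pick tau so that (approximately) a `keep` fraction of validation samples
    is NOT rejected; the exact calibration rule used in the paper may differ."""
    with torch.no_grad():
        scores = per_sample_constraint_loss(model(x_val), formulas, lambda_c)
    return torch.quantile(scores, keep).item()

def reject(model, x, formulas, lambda_c, tau):
    """Eq. (8): reject x whenever its constraint loss exceeds tau."""
    with torch.no_grad():
        score = per_sample_constraint_loss(model(x.unsqueeze(0)), formulas, lambda_c)[0]
    return bool(score > tau)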

Differently from ad-hoc detectors, which usually require training generative models, this rejection procedure comes at no additional training cost (generative models of the fulfillment of the single constraints could be considered too). The procedure is effective whenever the functions in $f$ are not too strongly paired with respect to $\mathcal{L}$, and we formalize the notion of “pairing” as follows.

Definition 1

We consider a classification problem whose training data are distributed according to the probability density $p$. Given $f$, $\lambda_c$ and a tolerance $\epsilon > 0$, the functions in $f$ are strongly paired whenever $\big| \mathcal{L}(\lambda_c, f, \mathcal{X}_p) - \mathcal{L}(\lambda_c, f, \mathcal{X}_u) \big| \leq \epsilon$, being $\mathcal{X}_p$ a discrete set of samples drawn from $p$ and $\mathcal{X}_u$ a discrete set of samples uniformly distributed around the support of $p$.

This notion indicates that if the constraint loss is fulfilled in similar ways over the training data distribution and the space regions close to it, then there is no room for detecting the examples that should be rejected. While it is not straightforward to draw conclusions about the pairing of $f$ before training the classifier, the soft constraining scheme of Eq. (2) allows the classification functions to be paired in a less strong manner than what they would be when using hard constraints (see Teso (2019) for a discussion on hard constraints and graphical models in an adversarial context).
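An empirical proxy for Definition 1 could be computed as in the following sketch; the sampling scheme and the perturbation radius are assumptions used only for illustration, and the function reuses per_sample_constraint_loss from the rejection sketch.

import torch

def pairing_gap(model, x_data, formulas, lambda_c, radius=0.1):
    """Rough empirical check of Definition 1 (an assumption, not the authors'
    procedure): compare the average constraint loss on data drawn from the
    training distribution with the one on points uniformly perturbed around
    them. A small gap suggests strongly paired functions."""
    with torch.no_grad():
        on_data = per_sample_constraint_loss(model(x_data), formulas, lambda_c).mean()
        noise = (torch.rand_like(x_data) * 2 - 1) * radius         # uniform in [-radius, radius]
        near = torch.clamp(x_data + noise, 0.0, 1.0)               # assumes inputs in [0,1]
        off_data = per_sample_constraint_loss(model(near), formulas, lambda_c).mean()
    return (off_data - on_data).item()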

Note that a multi-label system is usually equipped with activation functions that do not structurally enforce any dependencies among the classes (e.g., differently from what happens with softmax), so it is naturally able to respond without assigning any class to the input (white areas in Fig. 1). This property has recently been discussed as a means for gaining robustness to adversarial examples Shafahi et al. (2019); Bendale and Boult (2016). The formula in Eq. (7) is what allows our model to spot examples that might fall in this “I don’t know” area. Dependencies among classes are only introduced by the constraint loss in Eq. (2) on the training data.

The choice of $\lambda_c$ is crucial in the definition of the reject function of Eq. (8). On the one hand, in some problems we might have access to the certainty degree of each FOL formula, which could be used to set $\lambda_c$; otherwise, it seems natural to select an unbiased set of weights, i.e., $\lambda_{c,k} = 1$ for all $k$. On the other hand, several FOL formulas involve the implication operator $\Rightarrow$, which naturally implements if-then rules (if class $a$ then class $b$) or, equivalently, rules about hierarchies, since $a \Rightarrow b$ models an inclusion (class $a$ included in class $b$). However, whenever the premises are false, the whole formula holds true. It might be easy to trivially fulfill the associated constraints by zeroing all the predicates in the premises, eventually avoiding rejection, as exemplified below. As a rule of thumb, it is better to select larger $\lambda_c$ components for those constraints that favor the activation of the involved predicates.
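The following toy computation, using the product T-Norm translation sketched in Sect. 2, shows how an implication with a zeroed premise is trivially satisfied (the numbers are illustrative).

# A premise forced to 0 makes the implication true regardless of the conclusion:
# t(premise => conclusion) = 1 - premise + premise * conclusion.
premise, conclusion = 0.0, 0.0
truth = 1.0 - premise + premise * conclusion   # = 1.0: formula holds true
loss = 1.0 - truth                             # = 0.0: fulfilled, yet no predicate is active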

Attacking Multi-label Classifiers.

Robustness against adversarial examples is typically evaluated against black-box and white-box attacks Biggio and Roli (2018); Miller et al. (2020). In the black-box setting, the attacker is assumed to have only black-box query access to the target model, ignoring the presence of any defense mechanism, such as the use of domain-knowledge constraints. However, a surrogate model can be trained on data ideally sampled from the same distribution as that used to train the target model. Under these assumptions, gradient-based attacks can be optimized against the surrogate model and then transferred/evaluated against the target one Papernot et al. (2016a); Demontis et al. (2019). In the white-box setting, instead, the attacker is assumed to know everything about the target model, including the defense mechanism. White-box attacks are thus expected to exploit knowledge of the defense mechanism to bypass it.

We propose here a novel multi-label knowledge-driven adversarial attack (MKA) that naturally extends the formulation of single-label attacks and allows staging both black-box and white-box (adaptive) attacks against our approach. Given $x \in \mathcal{X}_t$, being $\mathcal{X}_t$ the test set, we consider the sets of ground-truth positive and negative classes, $P(x)$ and $N(x)$, respectively. Let us define $p = \arg\min_{j \in P(x)} f_j(x)$ and $n = \arg\max_{j \in N(x)} f_j(x)$, i.e., $p$ ($n$) is the index of the positive (negative) class with the smallest (largest) output score. These are essentially the indices of the classes for which $f$ is closest to the decision boundaries. Our attack optimizes the following objective,

$\min_{x'}\ \max\big(o_p(x'), -\theta\big) - \min\big(o_n(x'), \theta\big) + \gamma \, \mathcal{L}(\lambda_a, f, \{x'\}), \quad \text{s.t. } \|x' - x\| \leq \epsilon,$    (9)

where $o_j(x')$ is the value of the logit of $f_j$, $\|\cdot\|$ is an $\ell$-norm (fixed in our experiments), and in the case of image data with pixel intensities in $[0,1]$ we also have $x' \in [0,1]^d$. The scalar $\theta > 0$ is used to threshold the values of the logits, to avoid increasing/decreasing them in an unbounded way. Optimizing the logit values is preferable to avoid sigmoid saturation (in our experiments, $\theta$ is set to a fixed value). While the definition of Eq. (9) is limited to a pair of classes, we dynamically update $p$ and $n$ whenever logit $o_p$ ($o_n$) goes below (above) the threshold $-\theta$ ($\theta$), so that multiple classes are considered by the attack, compatibly with the maximum number of iterations of the optimizer. This strategy turned out to be more effective than jointly optimizing all the classes in $P(x)$ and $N(x)$. Moreover, the classes involved in the attack can be a subset of the whole set. For white-box attacks, we use $\gamma > 0$ to enforce domain knowledge and avoid rejection. For black-box attacks, instead, we set $\gamma = 0$. Eq. (9) is minimized via projected gradient descent (with a fixed number of starting samples and iterations in our experiments).
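A simplified sketch of the MKA optimization is reported below. It follows the reconstruction of Eq. (9) given above, omits the dynamic update of $p$ and $n$, and uses an $\ell_\infty$ projection with placeholder hyperparameters; all of these choices are assumptions for illustration, not the settings used in the experiments.

import torch

def mka_attack(model, x, pos_idx, neg_idx, formulas, lambda_a,
               eps=0.05, steps=100, step_size=0.01, theta=5.0, gamma=0.0):
    """Simplified MKA sketch: decrease the logit of the weakest positive class,
    increase the logit of the strongest negative class (both clipped at +/- theta)
    and, when gamma > 0 (white-box), also keep the constraint loss low to evade
    rejection. The model is assumed to be in eval mode; inputs lie in [0,1]."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        logits = model.logits(x_adv)
        obj = (torch.clamp(logits[:, pos_idx], min=-theta)
               - torch.clamp(logits[:, neg_idx], max=theta)).mean()
        if gamma > 0.0:
            obj = obj + gamma * constraint_loss(torch.sigmoid(logits),
                                                formulas, lambda_a)
        grad = torch.autograd.grad(obj, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv - step_size * grad.sign()        # minimize the objective
            x_adv = x + torch.clamp(x_adv - x, -eps, eps)  # l_inf projection
            x_adv = torch.clamp(x_adv, 0.0, 1.0)           # valid pixel range
    return x_adv.detach()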

4 Experiments

We considered three image classification datasets, referred to as ANIMALS, CIFAR-100 and PASCAL-Part, respectively. The first one is a collection of real-world images of animals (multiple resolutions), taken from the ImageNet database (ANIMALS: http://www.image-net.org/, CIFAR-100: https://www.cs.toronto.edu/~kriz/cifar.html); the second one is a popular benchmark composed of 32×32 RGB images belonging to different types of classes (vehicles, flowers, people, etc.), while the last dataset is composed of images in which both objects (Man, Dog, Car, Train, etc.) and object-parts (Head, Paw, Beak, etc.) are labeled (PASCAL-Part: https://www.cs.stanford.edu/~roozbeh/pascal-parts/pascal-parts.html). All datasets are used in a multi-label classification setting, in which each image is paired with a fixed number of binary attributes. In the case of ANIMALS there are 33 attributes, where the first 7, also referred to as “main” attributes/classes, are about the specific animal classes (albatross, cheetah, tiger, giraffe, zebra, ostrich, penguin), while the other 26 attributes are about features of the animal classes (mammal, bird, carnivore, fly, etc.). The CIFAR-100 dataset is composed of 120 attributes, out of which 100 are fine-grained class labels (“main” attributes) and 20 are superclasses. In the PASCAL-Part dataset, after merging classes as in Donadello et al. (2017), we obtain a set of classes out of which the 20 PASCAL objects are the “main” attributes and the remaining ones are object-parts. We make use of domain knowledge that holds for all the available examples. In the case of ANIMALS, it is a collection of FOL formulas that were defined in the benchmark of P.H. Winston Winston and Horn (1986), and they involve relationships between animal classes and animal properties, such as FLY ∧ LAYEGGS ⇒ BIRD. In CIFAR-100, FOL formulas are about the father-son relationships between classes, while in PASCAL-Part they either list all the parts belonging to a certain object, e.g., MOTORBIKE ⇒ WHEEL ∨ HEADLIGHT ∨ HANDLEBAR ∨ SADDLE ∨ …, or they list all the objects in which a part can be found, e.g., HANDLEBAR ⇒ BICYCLE ∨ MOTORBIKE. In all cases we also introduced a disjunction or a mutual-exclusivity constraint among the main attributes, and another disjunction among the other attributes. See Table 1 and the supplementary material for more details. Each dataset was divided into a training set and a test set (the latter indicated with $\mathcal{X}_t$). The training set was further divided into a learning set, used to train the classifiers, and a validation set ($\mathcal{X}_v$), used to tune the model parameters. We defined a semi-supervised learning scenario in which only a portion of the training set is labeled, sometimes partially (i.e., only a fraction of the attributes of a labeled example is known), as detailed in Table 1. We indicate with Labeled the percentage of labeled training data, and with Partial the percentage of attributes that are unknown for each labeled example (when splitting the training data into learning and validation sets, we kept the same percentages of unknown attributes in both splits, with all the examples in $\mathcal{X}_v$ being labeled; moreover, when generating partial labels, we ensured that the percentages of discarded positive and negative attributes were the same).

Dataset Attr Main #Examples %Labeled %Partial
ANIMALS
CIFAR-100
PASCAL-Part
Table 1: Datasets and details on the experimental setting. “Attr” stands for attributes; “Main” are main attributes. See the main text for a description of the other column labels.

We compared two neural architectures, based on the popular ResNet50 backbone pretrained on ImageNet. In the first network, referred to as TL, we transferred the ResNet50 model and trained the last layer from scratch in order to predict the dataset-specific attributes (sigmoid activation). The second network, indicated with FT, has the same structure as TL, but we also fine-tuned the last convolutional layer. Each model is based on the product T-Norm, and it was trained for a fixed number of epochs per dataset (different for TL and FT in CIFAR-100 and PASCAL-Part), using minibatches of fixed size. We used the Adam optimizer with a fixed initial step size, except for FT in CIFAR-100, for which a different step size was used to speed up convergence. We selected the model at the epoch that led to the largest F1 on $\mathcal{X}_v$.

Standard Evaluation.

To evaluate performance, we considered the (macro) F1 score and a metric restricted to the main classes (the output values were compared against a fixed threshold to obtain binary labels). For ANIMALS and CIFAR-100, the main classes are mutually exclusive, so we measured the accuracy in predicting the winning main class (AccMain), while in PASCAL-Part we kept the F1 score (F1Main), as multiple main classes can be predicted on the same input. The results obtained after tuning are reported in Table 2, averaged over 3 runs. The selected parameters are reported in the supplementary material. We considered unconstrained (λ = 0) and constrained (+C) models and, for TL, we also considered a strongly-constrained (+CC) model with inferior performance but higher coherence among the predicted attributes, obtained with a larger λ (which might lead to a worse fitting of the supervisions); in FT, due to the larger number of parameters, the constraint loss was already small in the +C case. The introduction of domain knowledge allows the constrained classifiers to outperform the unconstrained ones.

Metric Dataset TL TL+C TL+CC FT FT+C
F1 ANIMALS 0.9863 0.9917
CIFAR 0.5509 0.6399
PASCAL 0.7000 0.7097
AccMain F1Main ANIMALS 0.9925 0.9909
CIFAR 0.5563 0.6161
PASCAL 0.7595 0.7505
Table 2: Multi-label classification results (mean ± std) on the test set $\mathcal{X}_t$, for different models. The second row-block is restricted to the main classes (Accuracy or F1). See the main text for further details.

Figure 2: Black-box attacks on ANIMALS (left), CIFAR-100 (middle), and PASCAL-Part (right). Classification quality of vanilla and knowledge-constrained models as a function of ε. Dotted plots include rejection (Rej) of inputs that are detected to be adversarial.

Adversarial Evaluation.

To evaluate adversarial robustness, we used the MKA attack procedure described in Sect. 3. We restricted the attack to work on the main classes, associated with the most important attributes of each problem, assuming that the decisions of the classifier on the other classes are not exposed, but only internally used to evaluate the knowledge-related constraints and eventually reject samples that violate them. In ANIMALS and CIFAR-100 we assumed the attacker to have access to the information on the mutual exclusivity of the main classes, so that $p$ and $n$ in Eq. (9) are not required to change during the attack optimization. We also set $\theta$ so as to maximize the confidence of misclassifications at each given perturbation bound ε. All the following results are averaged after having attacked twice the model obtained from each of the 3 training runs.

In the black-box setting, we assumed the attacker to also be aware of the network architecture of the target classifier, and attacks were generated from a surrogate model trained on a different realization of the training set. Fig. 2 shows the classification quality as a function of the perturbation bound ε, comparing models trained with and without constraints against those implementing the detection/rejection mechanism of Eq. (8). When using such a mechanism, the rejected examples are marked as correctly classified if they are adversarial (ε > 0), otherwise (ε = 0) they are marked as points belonging to an unknown class, slightly worsening the performance. The +C/+CC models show larger accuracy/F1 than the unconstrained ones. Despite the lower results at ε = 0, models that are more strongly constrained (+CC) turned out to be harder to attack for increasing values of ε. When the knowledge-based detector is activated, the improvements with respect to models without rejection are clearly evident. No model is specifically designed to face adversarial attacks and, of course, there is no attempt to reach state-of-the-art results. However, the positive impact of exploiting domain knowledge can be observed in all the considered models and datasets, and for almost all the values of ε, confirming that such knowledge is not only useful to improve classifier robustness, but also as a means to detect adversarial examples at no additional training cost. In general, FT models yield better results, due to the larger number of optimized parameters. In ANIMALS, the rejection dynamics provide large improvements in both TL and FT, while the impact of domain knowledge is mostly evident on the robustness of FT. In CIFAR-100, domain knowledge only consists of basic hierarchical relations, with no intersections among child classes or among father classes. By inspecting the classifier, we found that it is pretty frequent for the fooling examples to be predicted with a strongly-activated father class and a (coherent) child class, i.e., we have strongly-paired classes, according to Def. 1. Differently, the domain knowledge in the other datasets is more structured, yielding better detection quality on average and highlighting the importance of the level of detail of such knowledge to counter adversarial examples. In the case of PASCAL-Part, the detection mechanism turned out to behave better with unconstrained classifiers, even if it also has a positive impact on the constrained ones. This is due to the intrinsic difficulty of making predictions on this dataset, especially when considering small object-parts: the false positives have a negative effect in the training stage of the knowledge-constrained classifiers.

To provide a comprehensive, worst-case evaluation of the adversarial robustness of our approach, we also considered a white-box adaptive attacker that knows everything about the target model and exploits knowledge of the defense mechanism to bypass it. Of course, this attack always evades detection if the perturbation size is sufficiently large. We evaluated multiple values of the multiplier γ of Eq. (9), selecting the one that yielded the lowest values of the objective function. In Fig. 3 we report the outcome of two selected cases, showing that, even if the accuracy drop is evident for both datasets, in ANIMALS the constrained classifiers require larger perturbations than the unconstrained ones to reduce the performance by the same amount. Thus, fooling the detection mechanism is not always as trivial as one might expect, even in this worst-case setting. We refer the reader to the supplementary material for more details about these attacks and their optimization. Finally, let us point out that the performance drop caused by the white-box attack is much larger than that observed in the black-box case. However, since domain knowledge is not likely to be available to the attacker in many practical settings, it remains an open challenge to develop stronger, practical black-box attacks that are able to infer and exploit such knowledge to bypass our defense mechanism.

Figure 3: White-box attacks in two selected cases (ANIMALS, left; PASCAL-Part, right).

5 Conclusions

In this paper we investigated the role of domain knowledge in adversarial settings. Focusing on multi-label classification, we injected knowledge expressed by First-Order Logic into the training stage of the classifier, not only with the aim of improving its quality, but also as a means to build a detector of adversarial examples at no additional cost. We proposed a multi-label attack procedure and showed that knowledge-constrained classifiers can improve their robustness against both black-box and white-box attacks, depending on the nature of the available domain knowledge. We believe that these findings will foster further investigation of domain knowledge as a feature to improve the robustness of multi-label classifiers against adversarial attacks.

Broader Impact

The outcomes of this work might help in fostering advancements in those studies that are about adversarial machine learning. In a long-term perspective, this could lead to the development of more robust machine learning-based multi-label classifiers. We believe that there are neither ethical aspects nor evident future societal consequences that should be discussed in the context of this work.

References

Appendix A Attack Optimization

Our attack optimizes Eq. (9) via projected gradient descent. Black-box attacks are non-adaptive, and thus ignore the defense mechanism. For this reason, the constraint loss term in our attack is ignored by setting its multiplier γ to zero. For white-box attacks on ANIMALS and PASCAL-Part, we set γ to two dataset-specific positive values, while setting the formula weights $\lambda_a$ accordingly. These values are chosen to appropriately scale the constraint loss term w.r.t. the logit difference (i.e., the first term in Eq. (9), lower bounded by $-2\theta$). This is required to have the sample misclassified while also fulfilling the domain-knowledge constraints. The process is better illustrated in Figs. 4 and 5, in which we respectively report the behavior of the black-box and white-box attack optimization on a single image from the ANIMALS dataset, for a fixed perturbation bound ε. In particular, in each figure we report the source image, the (magnified) adversarial perturbation, and the resulting adversarial examples, along with some plots describing the optimization process, i.e., how the attack loss of Eq. (9) is minimized across iterations, and how the softmax-scaled outputs on the main classes and the logarithm of the constraint loss change accordingly.

In both the black-box and white-box cases, the attack loss is progressively reduced during the iterations of the optimization procedure. In the black-box case, while the albatross prediction is progressively transformed into ostrich, the constraint loss increases across iterations, exceeding the rejection threshold; thus, the adversarial example is correctly detected. The white-box attack is similarly able to initially flip the prediction from albatross to ostrich, allowing the constraint loss to increase. However, after this initial phase, the attack reduces the constraint loss after its initial bump, bringing its value below the rejection threshold. The system thus fails to detect the corresponding adversarial example. Finally, it is also worth remarking that, in both cases, the final perturbations do not substantially compromise the content of the source image, remaining essentially imperceptible to the human eye.

Figure 4: Black-box attack on the ANIMALS dataset. While the attack is able to flip the initial prediction from albatross to ostrich, the attack is eventually detected as the constraint loss remains above the rejection threshold (dashed black line).
Figure 5: White-box attack on the ANIMALS dataset. The attack is able to flip the initial prediction from albatross to ostrich, and then starts reducing the constraint loss which eventually falls below the rejection threshold (dashed black line). The attack sample remains thus undetected.

Appendix B Parameter Settings

In Table 3, for each model, we report the optimal value of λ used in our experiments, selected via a 3-fold cross-validation procedure. For completeness, in Table 4, we also report the value of the constraint loss measured on the test set $\mathcal{X}_t$. As for the formula weights $\lambda_t$, we set each component to the same value, with the exception of the weight of the mutual-exclusivity constraint or of the disjunction of the main classes, which was set to a larger value to enforce the classifier to take decisions on the unsupervised portion of the training data.

Model ANIMALS CIFAR-100 PASCAL-Part
FT 0 0 0
FT+C
FT+CC
TL 0 0 0
TL+C
Table 3: Parameter λ used in the experiments.
Model ANIMALS CIFAR-100 PASCAL-Part
FT
FT+C
FT+CC
TL
TL+C
Table 4: Values of the constraint loss measured on the test data $\mathcal{X}_t$.

Appendix C Domain Knowledge

Each dataset is composed of a set of attributes (classes) that we formalize with logic predicates. Such predicates participate in First-Order Logic (FOL) formulas that model the available domain knowledge. The FOL formulas that define the domain knowledge of the ANIMALS, CIFAR-100 and PASCAL-Part data are reported in Table 5, Table 6, and Table 8, respectively, where each predicate is indicated with capital letters. In each table (bottom part) we also report those rules that are about activating at least one of the attributes of each level of the hierarchy. Following the nomenclature used in the paper, the main attributes of the ANIMALS dataset are ALBATROSS, GIRAFFE, CHEETAH, OSTRICH, PENGUIN, TIGER, ZEBRA, while the other attributes are MAMMAL, HAIR, MILK, FEATHERS, BIRD, FLY, LAYEGGS, MEAT, CARNIVORE, POINTEDTEETH, CLAWS, FORWARDEYS, HOOFS, UNGULATE, CUD, EVENTOED, TAWNY, BLACKSTRIPES, LONGLEGS, LONGNECK, DARKSPOTS, WHITE, BLACK, SWIM, BLACKWHITE, GOODFLIER. In the case of the CIFAR-100 dataset, the main attributes are the ones associated with the predicates of Table 6 that belong to the premises of the shortest FOL formulas (i.e., the formulas of the form A ⇒ B, where the main attribute is A). Formulas in PASCAL-Part are relationships between objects and object-parts. The same part can belong to multiple objects, and in each object several parts might be visible. See Table 8 for the list of classes (main classes are in the premises of the second block of formulas).

In ANIMALS and CIFAR-100, a mutual-exclusivity predicate is imposed on the main classes. As a matter of fact, in these two datasets, each image is only about a single main class. The mutual_excl predicate, defined below, can be devised in different ways. The first, straightforward approach consists in considering the disjunction of the true cases in the truth table of the predicate:

mutual_excl(A_1, …, A_m) = ⋁_{j ∈ M} ( A_j ∧ ⋀_{i ∈ M, i ≠ j} ¬A_i ),    (10)

where M is the set of the main classes, with cardinality m, and A_j is the logic predicate corresponding to the j-th output of the network f. This formulation of the predicate is what we used in the ANIMALS dataset. When there are several classes, as in CIFAR-100, this formulation leads to optimization issues, since it turned out to be complicated to find a good balance between the effect of this constraint and the supervision-fitting term. For this reason, the mutual exclusivity in CIFAR-100 was defined as a disjunction of the main classes followed by a set of implications that implement the mutual exclusion of the predicates,

( ⋁_{j ∈ M} A_j ) ∧ ⋀_{j ∈ M} ⋀_{i ∈ M, i ≠ j} ( A_j ⇒ ¬A_i ),    (11)

which turned out to be easier to tune, since we have multiple soft constraints that can be individually violated to accommodate the optimization procedure.
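For reference, the truth degree of the mutual_excl predicate of Eq. (10) can be computed under the product T-Norm as in the following sketch (an illustrative implementation, with probabilistic sum used for the outer disjunction).

import torch

def mutual_excl(outputs, main_idx):
    """Truth degree of Eq. (10) under the product T-Norm: disjunction (via
    repeated probabilistic sum) of the "exactly class j is active" terms."""
    terms = []
    for j in main_idx:
        t = outputs[:, j]
        for i in main_idx:
            if i != j:
                t = t * (1.0 - outputs[:, i])   # AND with NOT(other main class)
        terms.append(t)
    truth = torch.zeros_like(terms[0])
    for t in terms:
        truth = truth + t - truth * t           # OR as probabilistic sum
    return truth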

HAIR ⇒ MAMMAL
MILK ⇒ MAMMAL
FEATHER ⇒ BIRD
FLY ∧ LAYEGGS ⇒ BIRD
MAMMAL ∧ MEAT ⇒ CARNIVORE
MAMMAL ∧ POINTEDTEETH ∧ CLAWS ∧ FORWARDEYES ⇒ CARNIVORE
MAMMAL ∧ HOOFS ⇒ UNGULATE
MAMMAL ∧ CUD ⇒ UNGULATE
MAMMAL ∧ CUD ⇒ EVENTOED
CARNIVORE ∧ TAWNY ∧ DARKSPOTS ⇒ CHEETAH
CARNIVORE ∧ TAWNY ∧ BLACKSTRIPES ⇒ TIGER
UNGULATE ∧ LONGLEGS ∧ LONGNECK ∧ TAWNY ∧ DARKSPOTS ⇒ GIRAFFE
BLACKSTRIPES ∧ UNGULATE ∧ WHITE ⇒ ZEBRA
BIRD ∧ ¬FLY ∧ LONGLEGS ∧ LONGNECK ∧ BLACK ⇒ OSTRICH
BIRD ∧ ¬FLY ∧ SWIM ∧ BLACKWHITE ⇒ PENGUIN
BIRD ∧ GOODFLIER ⇒ ALBATROSS
mutual_excl(ALBATROSS, GIRAFFE, CHEETAH, OSTRICH, PENGUIN, TIGER, ZEBRA)
MAMMAL ∨ HAIR ∨ MILK ∨ FEATHERS ∨ BIRD ∨ FLY ∨ LAYEGGS ∨ MEAT ∨
CARNIVORE ∨ POINTEDTEETH ∨ CLAWS ∨ FORWARDEYS ∨ HOOFS ∨ UNGULATE ∨
CUD ∨ EVENTOED ∨ TAWNY ∨ BLACKSTRIPES ∨ LONGLEGS ∨ LONGNECK ∨
DARKSPOTS ∨ WHITE ∨ BLACK ∨ SWIM ∨ BLACKWHITE ∨ GOODFLIER
Table 5: Domain knowledge, ANIMALS dataset.
Table 6: Domain knowledge, CIFAR-100 dataset.
AQUATIC MAMMALS ⇒ (BEAVER ∨ DOLPHIN ∨ OTTER ∨ SEAL ∨ WHALE)
BEAVER ⇒ AQUATIC MAMMALS
DOLPHIN ⇒ AQUATIC MAMMALS
OTTER ⇒ AQUATIC MAMMALS
SEAL ⇒ AQUATIC MAMMALS
WHALE ⇒ AQUATIC MAMMALS
FISH ⇒ (AQUARIUM FISH ∨ FLATFISH ∨ RAY ∨ SHARK ∨ TROUT)
AQUARIUM_FISH ⇒ FISH
FLATFISH ⇒ FISH
RAY ⇒ FISH
SHARK ⇒ FISH
TROUT ⇒ FISH
FLOWERS ⇒ (ORCHID ∨ POPPY ∨ ROSE ∨ SUNFLOWER ∨ TULIP)
ORCHID ⇒ FLOWERS
POPPY ⇒ FLOWERS
ROSE ⇒ FLOWERS
SUNFLOWER ⇒ FLOWERS
TULIP ⇒ FLOWERS
FOOD_CONTAINERS ⇒ (BOTTLE ∨ BOWL ∨ CAN ∨ CUP ∨ PLATE)
BOTTLE ⇒ FOOD_CONTAINERS
BOWL ⇒ FOOD_CONTAINERS
CAN ⇒ FOOD_CONTAINERS
CUP ⇒ FOOD_CONTAINERS
PLATE ⇒ FOOD_CONTAINERS
FRUIT_AND_VEGETABLES ⇒ (APPLE ∨ MUSHROOM ∨ ORANGE ∨ PEAR ∨ SWEET_PEPPER)
APPLE ⇒ FRUIT_AND_VEGETABLES
MUSHROOM ⇒ FRUIT_AND_VEGETABLES
ORANGE ⇒ FRUIT_AND_VEGETABLES
PEAR ⇒ FRUIT_AND_VEGETABLES
SWEET_PEPPER ⇒ FRUIT_AND_VEGETABLES
HOUSEHOLD_ELECTRICAL_DEVICES ⇒ (CLOCK ∨ KEYBOARD ∨ LAMP ∨ TELEPHONE ∨ TELEVISION)
CLOCK ⇒ HOUSEHOLD_ELECTRICAL_DEVICES
KEYBOARD ⇒ HOUSEHOLD_ELECTRICAL_DEVICES
LAMP ⇒ HOUSEHOLD_ELECTRICAL_DEVICES
TELEPHONE ⇒ HOUSEHOLD_ELECTRICAL_DEVICES
TELEVISION ⇒ HOUSEHOLD_ELECTRICAL_DEVICES
HOUSEHOLD_FURNITURE ⇒ (BED ∨ CHAIR ∨ COUCH ∨ TABLE ∨ WARDROBE)
BED ⇒ HOUSEHOLD_FURNITURE
CHAIR ⇒ HOUSEHOLD_FURNITURE
COUCH ⇒ HOUSEHOLD_FURNITURE
TABLE ⇒ HOUSEHOLD_FURNITURE
WARDROBE ⇒ HOUSEHOLD_FURNITURE
INSECTS ⇒ (BEE ∨ BEETLE ∨ BUTTERFLY ∨ CATERPILLAR ∨ COCKROACH)
BEE ⇒ INSECTS
BEETLE ⇒ INSECTS
BUTTERFLY ⇒ INSECTS
CATERPILLAR ⇒ INSECTS
COCKROACH ⇒ INSECTS
LARGE_CARNIVORES ⇒ (BEAR ∨ LEOPARD ∨ LION ∨ TIGER ∨ WOLF)
BEAR ⇒ LARGE_CARNIVORES
LEOPARD ⇒ LARGE_CARNIVORES
LION ⇒ LARGE_CARNIVORES
TIGER ⇒ LARGE_CARNIVORES
WOLF ⇒ LARGE_CARNIVORES
LARGE_MAN-MADE_OUTDOOR_THINGS ⇒ (BRIDGE ∨ CASTLE ∨ HOUSE ∨ ROAD ∨ SKYSCRAPER)
BRIDGE ⇒ LARGE_MAN-MADE_OUTDOOR_THINGS
CASTLE ⇒ LARGE_MAN-MADE_OUTDOOR_THINGS
HOUSE ⇒ LARGE_MAN-MADE_OUTDOOR_THINGS
ROAD ⇒ LARGE_MAN-MADE_OUTDOOR_THINGS
SKYSCRAPER ⇒ LARGE_MAN-MADE_OUTDOOR_THINGS
LARGE_NATURAL_OUTDOOR_SCENES ⇒ (CLOUD ∨ FOREST ∨ MOUNTAIN ∨ PLAIN ∨ SEA)
CLOUD ⇒ LARGE_NATURAL_OUTDOOR_SCENES
FOREST ⇒ LARGE_NATURAL_OUTDOOR_SCENES
MOUNTAIN ⇒ LARGE_NATURAL_OUTDOOR_SCENES
PLAIN ⇒ LARGE_NATURAL_OUTDOOR_SCENES
SEA ⇒ LARGE_NATURAL_OUTDOOR_SCENES
LARGE_OMNIVORES_AND_HERBIVORES ⇒ (CAMEL ∨ CATTLE ∨ CHIMPANZEE ∨ ELEPHANT ∨ KANGAROO)
CAMEL ⇒ LARGE_OMNIVORES_AND_HERBIVORES
CATTLE ⇒ LARGE_OMNIVORES_AND_HERBIVORES
CHIMPANZEE ⇒ LARGE_OMNIVORES_AND_HERBIVORES
ELEPHANT ⇒ LARGE_OMNIVORES_AND_HERBIVORES
KANGAROO ⇒ LARGE_OMNIVORES_AND_HERBIVORES
MEDIUM_MAMMALS ⇒ (FOX ∨ PORCUPINE ∨ POSSUM ∨ RACCOON ∨ SKUNK)
FOX ⇒ MEDIUM_MAMMALS
PORCUPINE ⇒ MEDIUM_MAMMALS
POSSUM ⇒ MEDIUM_MAMMALS
RACCOON ⇒ MEDIUM_MAMMALS
SKUNK ⇒ MEDIUM_MAMMALS
NON-INSECT_INVERTEBRATES ⇒ (CRAB ∨ LOBSTER ∨ SNAIL ∨ SPIDER ∨ WORM)
CRAB ⇒ NON-INSECT_INVERTEBRATES
LOBSTER ⇒ NON-INSECT_INVERTEBRATES
SNAIL ⇒ NON-INSECT_INVERTEBRATES
SPIDER ⇒ NON-INSECT_INVERTEBRATES
WORM ⇒ NON-INSECT_INVERTEBRATES
PEOPLE ⇒ (BABY ∨ MAN ∨ WOMAN ∨ BOY ∨ GIRL)
BABY ⇒ PEOPLE
BOY ⇒ PEOPLE
GIRL ⇒ PEOPLE
MAN ⇒ PEOPLE
WOMAN ⇒ PEOPLE
REPTILES ⇒ (CROCODILE ∨ DINOSAUR ∨ LIZARD ∨ SNAKE ∨ TURTLE)
CROCODILE ⇒ REPTILES
DINOSAUR ⇒ REPTILES
LIZARD ⇒ REPTILES
SNAKE ⇒ REPTILES
TURTLE ⇒ REPTILES
SMALL_MAMMALS ⇒ (HAMSTER ∨ MOUSE ∨ RABBIT ∨ SHREW ∨ SQUIRREL)
HAMSTER ⇒ SMALL_MAMMALS
MOUSE ⇒ SMALL_MAMMALS
RABBIT ⇒ SMALL_MAMMALS
SHREW ⇒ SMALL_MAMMALS
SQUIRREL ⇒ SMALL_MAMMALS
TREES ⇒ (MAPLE_TREE ∨ OAK_TREE ∨ PALM_TREE ∨ PINE_TREE ∨ WILLOW_TREE)
MAPLE_TREE ⇒ TREES
OAK_TREE ⇒ TREES
PALM_TREE ⇒ TREES
PINE_TREE ⇒ TREES
WILLOW_TREE ⇒ TREES
VEHICLES1 ⇒ (BIKE ∨ BUS ∨ MOTORBIKE ∨ PICKUP_TRUCK ∨ TRAIN)
BIKE ⇒ VEHICLES1
BUS ⇒ VEHICLES1
MOTORBIKE ⇒ VEHICLES1
PICKUP_TRUCK ⇒ VEHICLES1
TRAIN ⇒ VEHICLES1
VEHICLES2 ⇒ (LAWN MOWER ∨ ROCKET ∨ STREETCAR ∨ TANK ∨ TRACTOR)
LAWN MOWER ⇒ VEHICLES2
ROCKET ⇒ VEHICLES2
STREETCAR ⇒ VEHICLES2
TANK ⇒ VEHICLES2
TRACTOR ⇒ VEHICLES2
mutual_excl( APPLE, AQUARIUM FISH, BABY, BEAR, BEAVER, BED, BEE,
BEETLE, BICYCLE, BOTTLE, BOWL, BOY, BRIDGE, BUS,
BUTTERFLY, CAMEL, CAN, CASTLE, CATERPILLAR, CATTLE, CHAIR,
CHIMPANZEE, CLOCK, CLOUD, COCKROACH, COUCH, CRAB,
CROCODILE, CUP, DINOSAUR, DOLPHIN, ELEPHANT, FLATFISH,
FOREST, FOX, GIRL, HAMSTER, HOUSE, KANGAROO, KEYBOARD,
LAMP, LAWN_MOWER, LEOPARD, LION, LIZARD, LOBSTER, MAN,
MAPLE_TREE, MOTORCYCLE, MOUNTAIN, MOUSE, MUSHROOM,
OAK_TREE, ORANGE, ORCHID, OTTER, PALM_TREE, PEAR,
PICKUP_TRUCK, PINE_TREE, PLAIN, PLATE, POPPY, PORCUPINE,
POSSUM, RABBIT, RACCOON, RAY, ROAD, ROCKET, ROSE, SEA,
SEAL, SHARK, SHREW, SKUNK, SKYSCRAPER, SNAIL, SNAKE,
SPIDER, SQUIRREL, STREETCAR, SUNFLOWER, SWEET_PEPPER, TABLE,
TANK, TELEPHONE, TELEVISION, TIGER, TRACTOR, TRAIN, TROUT,
TULIP, TURTLE, WARDROBE, WHALE, WILLOW_TREE, WOLF,
WOMAN, WORM )
mutual_excl( AQUATIC MAMMALS, FISH, FLOWERS, FOOD CONTAINERS,
FRUIT AND VEGETABLES, HOUSEHOLD ELECTRICAL, HOUSEHOLD FURNITURE,
INSECTS, LARGE CARNIVORES, MAN-MADE OUTDOOR,
NATURAL OUTDOOR SCENES, OMNIVORES AND HERBIVORES, MEDIUM MAMMALS,
INVERTEBRATES, PEOPLE, REPTILES, SMALL MAMMALS, TREES,
VEHICLES1, VEHICLES2 )
SCREEN ⇒ TVMONITOR
COACH ⇒ TRAIN
TORSO ⇒ (PERSON ∨ HORSE ∨ COW ∨ DOG ∨ BIRD ∨ CAT ∨ SHEEP)
LEG ⇒ (PERSON ∨ HORSE ∨ COW ∨ DOG ∨ BIRD ∨ CAT ∨ SHEEP)
HEAD ⇒ (PERSON ∨ HORSE ∨ COW ∨ DOG ∨ BIRD ∨ CAT ∨ SHEEP)
EAR ⇒ (PERSON ∨ HORSE ∨ COW ∨ DOG ∨ CAT ∨ SHEEP)
EYE (PERSON