ARIA: Adversarially Robust Image Attribution for Content Provenance

by Maksym Andriushchenko, et al.
University of Surrey

Image attribution, matching an image back to a trusted source, is an emerging tool in the fight against online misinformation. Deep visual fingerprinting models have recently been explored for this purpose. However, they are not robust to tiny input perturbations known as adversarial examples. First we illustrate how to generate valid adversarial images that can easily cause incorrect image attribution. Then we describe an approach to prevent imperceptible adversarial attacks on deep visual fingerprinting models, via robust contrastive learning. The proposed training procedure leverages training on $\ell_\infty$-bounded adversarial examples; it is conceptually simple and incurs only a small computational overhead. The resulting models are substantially more robust, are accurate even on unperturbed images, and perform well even over a database with millions of images. In particular, we achieve 91.6% standard and 85.1% adversarial recall under $\ell_\infty$-bounded perturbations on manipulated images, compared to 80.1% and 0.0% for prior work. We also show that robustness generalizes to other types of imperceptible perturbations unseen during training. Finally, we show how to train an adversarially robust image comparator model for detecting editorial changes in matched images.




1 Introduction

Fake news and misinformation are major societal threats being addressed by new computer vision methods to determine content authenticity. Such methods fall into two camps: detection and attribution. Detection methods automatically identify manipulated or synthetic images through visual artifacts or statistics [66, 60, 61]. Attribution methods match an image to a trusted database of originals [46, 6, 5]. Once matched, any differences may be visualized, and any associated provenance data displayed. Rather than making automated judgments, the goal of image attribution is to enable users to make more informed trust decisions [25].

This paper considers specifically the image attribution problem where the goal is to differentiate between ‘non-editorial’ transformation of content (e.g. due to resolution, format or quality change) and editorial change where content is digitally altered to change its meaning. Nguyen et al. [46] use contrastive training to learn a visual hashing function that is invariant to non-editorial, but sensitive to editorial changes. In such ‘tamper-sensitive’ matching, a manipulated image would not be falsely corroborated by provenance data associated with the original. By contrast, Black et al. [6] learn a ‘tamper-invariant’ image fingerprint which is insensitive to both non-editorial and editorial change, and visually highlight manipulated changes using a separate model.

This paper reports on the novel problem of adversarial attack and defense for these image attribution approaches. They rely on deep neural networks to learn visual fingerprints for near-duplicate image matching. However, deep networks are known to be vulnerable to adversarial attacks [53] that use subtle image perturbations to cause dramatic changes in the output of the models. Visual search models based on deep networks are no exception and have recently been shown to be vulnerable to adversarial attacks as well [16].

We make the following technical contributions:

  1. Adversarial attack of image attribution models. We present a white-box gradient-based method for crafting adversarial examples to attack both tamper-sensitive and tamper-invariant image attribution models. We show it is possible to closely match the perceptual fingerprints of unrelated images by using small $\ell_\infty$-adversarial perturbations. For tamper-sensitive models, we show that the original image may be incorrectly matched given a manipulated query. For tamper-invariant models, we additionally show that the image comparison post-process used to visualize areas of image manipulation may also be fooled to show either no manipulation or false areas of manipulation. Thus, we show that trust in both state-of-the-art image attribution approaches may be defeated by adversarial examples.

  2. Robust contrastive learning for image attribution. We describe a novel robust contrastive learning algorithm to train image fingerprinting models for attribution, to ensure robustness both to non-editorial transformations and imperceptible adversarial perturbations, preventing the attacks illustrated in Fig. 1. We show that this algorithm improves adversarial robustness of both tamper-sensitive [46] and tamper-invariant [6] image fingerprinting models, and we also discuss how to make the image comparator model [6] robust. The approach is conceptually simple and leads to a relatively small computational slowdown compared to standard contrastive learning. We also show that our adversarially robust image hashing models have benefits in terms of interpretability: they output perceptually similar images under hash inversion attacks [52].

Our work comprehensively studies adversarial robustness for the growing body of image attribution work, and is timely given the emergence of cross-industry standards reliant, in part, on such attribution. For example, the Coalition for Content Provenance and Authenticity (C2PA) [1] proposes to fight misinformation by embedding secure audit information on content manipulation within image metadata. Many online media distribution channels (such as social media sites) strip away this metadata. C2PA proposes to counter this by the use of perceptual hashes for image attribution, and advocates computing the hashes on the client side due to privacy concerns. This implies, as with our work, that the attacker has white-box access to the target model (i.e. the attacker is assumed to know all details of the system being attacked). Our work is the first to demonstrate the significance of these attacks on attribution models. Moreover, by offering the first defense against adversarial attacks for such models, we contribute to the protection of provenance systems implementing such standards.

2 Related Work

Image fingerprinting for provenance.

Image fingerprinting models robust to non-editorial transformations were proposed in Black et al. [6] and Nguyen et al. [46]. These represent two complementary approaches to applying image retrieval to the attribution problem. Both approaches match query images to a trusted database of originals, invariant to non-editorial changes such as resolution, format or quality change. However, the approaches differ in their consideration of manipulated images. Black et al. [6] train the image retrieval model to match manipulated images to originals successfully, i.e., to bring such image pairs close together in the search embedding. By contrast, Nguyen et al. [46] train the model to separate such image pairs, encouraging matching to fail in the presence of content manipulation. The advantage of [46] is a simpler pipeline, as it has only an image retrieval model (with an optional geometric verification step), while [6] relies on a separate image comparator network that analyzes whether a pair of images are identical, different or have been manipulated, visualizing the difference. Whilst robust to non-editorial distortions, neither approach is robust to adversarial attacks, motivating our work.

Adversarial attacks on deep neural networks for image classification were pioneered by Szegedy et al. [53], who demonstrated that minor perturbations of pixel values are sufficient to induce significant classification mistakes despite little perceptible difference (‘covert’ approaches). Goodfellow et al. [21] demonstrated linearity of this effect in input space, introducing the fast gradient sign method (FGSM) to quickly compute adversarial perturbations via backpropagation without the need for solving costly optimizations. An iterative form of this method for more robust attacks was later presented. Such attacks have received significant attention in recent years, with many variants proposed to covertly attack image classifiers [45, 22, 14]. Adversarial patches take a complementary, ‘overt’, approach via synthesis of vivid ‘stickers’ [7] that occupy only a small region yet induce misclassification [7, 19] or misdetection [11, 56].

Recently, adversarial attacks on image retrieval models have been demonstrated via similar means. Tolias et al. [57] show that image retrieval models are non-robust. They perform targeted attacks in the white-box and semi black-box setting (unknown pooling). Bai et al. [4] and Dolhansky and Ferrer [16] show that image hashing models can be fooled as well, including attacks which exactly produce target hashes.

Contrastive learning.

Several popular self-supervised learning approaches are based on contrastive learning: SimCLR [13], MoCo [28], and BYOL [26]. Most robust self-supervised approaches focus on robust transfer learning [31, 12, 38, 36, 64, 34, 24] or multi-objective optimization [32, 44, 8] to improve adversarial robustness. The focus of these works differs from our focus on image retrieval. In particular, they do not benchmark image retrieval performance, and for training they rely on augmentations that are optimized for transfer learning rather than image attribution (e.g., the extreme crops used in SimCLR, covering between 8% and 100% of the image area as in Chen et al. [13]). For example, Kim et al. [38] propose a two-stage robust training approach: first generating instance-wise adversarial examples for the SimCLR loss and then combining the standard SimCLR loss with the robust loss. Tamkin et al. [54] propose to use an image-to-image network that generates adversarial examples which are then projected onto a (relatively large) norm ball, but they do not cover adversarial robustness. However, neither of these approaches is considered in the image retrieval setup. [50] propose adversarial training in the latent space with the goal of improving standard generalization. Closer to image attribution, Panum et al. [47] combine deep metric learning algorithms with adversarial training, but they perform only small-scale experiments and evaluate their models via nearest-neighbour classification, which differs from the retrieval under editorial and non-editorial distortions that we consider in the context of image attribution.

Figure 2: Attacks on image fingerprinting (IF) models. Visualization of untargeted adversarial examples on two IF approaches with complementary goals. Upper: a model seeking to match an original image, invariant to any editorial change in the query [6]. Lower: a model seeking to avoid matching an edited query to the original [46]. In both cases it is possible to attack the model to defeat the goal, and in both cases ARIA defends it successfully.

Figure 3: Attack on the image comparator (IC) model of Black et al. [6]. Visualization of the heatmaps generated by undefended and ARIA-defended models queried using an adversarial example. The attack targets a heatmap prediction within the bottom-right quadrant of the image. Shown for two different images drawn from the PSBattles dataset. In all cases the ARIA-defended model predicts a heatmap near-identical to the original heatmap inferred by the undefended model in the absence of the attack.

3 Vulnerability of Attribution Models

We start by studying adversarial vulnerability within the context of the image attribution approaches of Nguyen et al. [46] and Black et al. [6].

1. Image fingerprinting (IF). We consider an IF model $f_\theta$ that maps an image $x$ to its $d$-dimensional feature vector (fingerprint) $f_\theta(x)$, which is used to output the most similar image from a database $\mathcal{D}$. We denote by $x$ an original image contained in $\mathcal{D}$, and by $\tilde{x}$ the image modified by some transformation. For all IF methods, we wish the match to be invariant to a set of non-editorial transformations of the image, and in some cases (e.g. Black et al. [6]) also to editorial manipulations of the image.

In all cases we consider adversarial perturbations $\delta$ of $\tilde{x}$, specific to the model $f_\theta$ and generated under some perturbation budget $\varepsilon$. We call the perturbations targeted or untargeted depending on whether the attack targets retrieval of a specific incorrect image, or its objective is simply to prevent retrieval of the correct image. In the case of methods seeking to retrieve $x$ only for non-editorially transformed queries [46], a specific form of targeted attack attempts to fool $f_\theta$ into retrieving the original $x$ for an editorially manipulated query, as illustrated in Fig. 2.

2. Image comparison (IC). IC methods visualize pixel regions containing editorial changes, performing an ‘intelligent differencing’ operation between the query image and the top retrieval from the IF model, ignoring any visual change due to non-editorial transformations. We consider the approach of [6], where the model outputs such a visualization (a heatmap) and additionally assigns the image pair to one of three categories: same images with non-editorial changes, same images with editorial changes, or different images. The first goal of the attacker is to make the comparator classify an image with editorial changes into the first category. The second goal is to make the comparator output a misleading heatmap describing editorial changes (see Fig. 3).

3.1 Adversarial Attack Scope

We consider attacks within the following scope. First, adversarial perturbations should be imperceptible, such as perturbations bounded within a small $\ell_\infty$- or $\ell_2$-norm. Imperceptibility is a crucial property as, from an attack perspective, a user should not realize that an image has been manipulated in any way. This requirement makes, e.g., patch-based [7], $\ell_0$- or $\ell_1$-bounded perturbations irrelevant in our case since they render tampering visually obvious. Second, we consider a white-box knowledge model: all details of the model are assumed known. This rules out so-called security-by-obscurity, where the security or robustness of a system relies on such detail (e.g., model architecture or parameters) being secret. In practice, such detail can be leaked or reverse engineered, particularly for models deployed on edge devices or client-side (as is advocated in emerging standards [1]). Third, we focus on fully realizable attacks, i.e., after generating adversarial examples, we save them as valid JPEG files and only then evaluate the model on them. This differs from most prior works, where imperceptible adversarial images are generated such that they have arbitrary real values within the valid pixel range [53, 42]. Instead, after saving them as JPEG files, the pixel values are quantized to 8-bit and the image is compressed, introducing further non-editorial changes. We show in Sec. 5 that even the standard attacks described below (as opposed to robust attacks [3]) are sufficient for the considered attack scenario, as we can successfully reduce many performance metrics to zero.
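To make the realizability requirement concrete, the following minimal sketch quantizes a real-valued adversarial image to 8-bit values and measures how far the stored image can drift from the clean one. It is an illustration only: the budget `eps` and the random images are hypothetical, and the JPEG compression step is omitted because it would need an image library.

```python
import numpy as np

def quantize_to_uint8(image):
    """Round a float image in [0, 1] to the 8-bit values a saved file stores."""
    return np.clip(np.rint(image * 255.0), 0, 255).astype(np.uint8)

rng = np.random.default_rng(0)
eps = 8 / 255  # hypothetical l_inf budget, not a value taken from the paper

# A clean 8-bit image and a real-valued perturbation inside the l_inf ball.
clean = rng.integers(0, 256, size=(4, 4, 3)).astype(np.float64) / 255.0
delta = rng.uniform(-eps, eps, size=clean.shape)
adversarial = np.clip(clean + delta, 0.0, 1.0)

# What an evaluator actually sees after the image is stored with 8-bit pixels.
stored = quantize_to_uint8(adversarial).astype(np.float64) / 255.0

rounding_error = np.abs(stored - adversarial).max()   # at most half a grey level
effective_budget = np.abs(stored - clean).max()       # at most eps + half a grey level
```

Quantization moves each pixel by at most half a grey level, so an attack that only works for exact real-valued pixels can fail once the image is saved, which is why evaluating on saved files is the stricter test.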

3.2 Implementation of adversarial attacks

We generate adversarial attacks using projected gradient descent (PGD) [42] under $\ell_\infty$-norm constraints, since the gradients are available in the white-box setting (we use differentiable image resizing to make sure that the whole image preprocessing pipeline is differentiable) and the $\ell_\infty$-norm is a useful proxy for the imperceptibility requirement.

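Image fingerprinting attack. The goal of an untargeted attack is to make the resulting adversarial example $\tilde{x} + \delta$ have an IF sufficiently different from the original one so that it gets matched to an incorrect image. For this, we can maximize the $\ell_2$-distance between the IF of $\tilde{x} + \delta$ and the IF of the original $x$, computed by the IF model $f_\theta$ under the perturbation budget $\varepsilon$:

$$\max_{\|\delta\|_\infty \le \varepsilon} \; \big\| f_\theta(\tilde{x} + \delta) - f_\theta(x) \big\|_2 \qquad (1)$$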

In Fig. 2 (upper), we show results of such an untargeted attack on Black et al. [6]. The method attempts to match query images exhibiting both editorial and non-editorial change, back to an original. The attack successfully defeats this goal.

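By contrast, in Fig. 2 (lower) the model of Nguyen et al. [46] aims to avoid matching an edited image to an original, in order to avoid corroborating a manipulated image with the provenance of its original. We perform a targeted attack which attempts to match an edited query image $\tilde{x}$ to the original image $x$ by minimizing the distance between their fingerprints under the IF model $f_\theta$:

$$\min_{\|\delta\|_\infty \le \varepsilon} \; \big\| f_\theta(\tilde{x} + \delta) - f_\theta(x) \big\|_2 \qquad (2)$$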

We note here that when producing adversarial examples for the OSCAR-Net of [46], we assume the object detector’s output to be fixed (see the details in the supplementary material).
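The untargeted and targeted objectives of Eqs. (1) and (2) can be illustrated with a small PGD loop. The sketch below is not the authors' implementation: it assumes a toy linear "fingerprint" model `W @ x` so that the gradient can be written analytically, it optimizes the squared distance (which has the same optimizer as the distance itself), and the budget, step size and synthetic images are all hypothetical.

```python
import numpy as np

def pgd_linf(grad_fn, x_query, eps, step, n_iter, maximize, rng):
    """Projected gradient ascent/descent inside an l_inf ball of radius eps."""
    delta = rng.uniform(-eps, eps, size=x_query.shape)          # random start
    direction = 1.0 if maximize else -1.0
    for _ in range(n_iter):
        grad = grad_fn(x_query + delta)
        delta = delta + direction * step * np.sign(grad)        # signed gradient step
        delta = np.clip(delta, -eps, eps)                       # project onto the l_inf ball
        delta = np.clip(x_query + delta, 0.0, 1.0) - x_query    # keep pixels in [0, 1]
    return delta

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 64))   # toy linear fingerprint model (assumption)

def fingerprint(x):
    return W @ x

def sq_dist_grad(reference):
    """Gradient of ||f(x) - f(reference)||^2 with respect to x."""
    return lambda x: 2.0 * W.T @ (fingerprint(x) - fingerprint(reference))

def dist(a, b):
    return float(np.linalg.norm(fingerprint(a) - fingerprint(b)))

eps, step, n_iter = 8 / 255, 2 / 255, 20   # hypothetical attack settings
original = rng.uniform(0.2, 0.8, size=64)
query = np.clip(original + rng.normal(scale=0.01, size=64), 0, 1)   # mild non-editorial change
edited = np.clip(original + rng.normal(scale=0.1, size=64), 0, 1)   # stand-in for an editorial change

# Untargeted, as in Eq. (1): push the query fingerprint away from its original.
d_untargeted = pgd_linf(sq_dist_grad(original), query, eps, step, n_iter, True, rng)
# Targeted, as in Eq. (2): pull the edited query's fingerprint towards the original.
d_targeted = pgd_linf(sq_dist_grad(original), edited, eps, step, n_iter, False, rng)
```

With a deep network, `grad_fn` would be obtained by backpropagation through the full preprocessing pipeline rather than from a closed-form expression.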

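Image comparator attack. To attack the prediction module of an IC model [6] (Fig. 3), we minimize the probability $p_y$ of the ground-truth class $y$ (e.g., ‘same images with editorial changes’) over the perturbation $\delta$ added only to the query image $\tilde{x}$:

$$\min_{\|\delta\|_\infty \le \varepsilon} \; p_y(x, \tilde{x} + \delta) \qquad (3)$$

For the heatmap attack, we minimize the cosine similarity between the predicted heatmap $\hat{h}(x, \tilde{x} + \delta)$ and the ground-truth heatmap $h$, since this is the loss used for training in [6]:

$$\min_{\|\delta\|_\infty \le \varepsilon} \; \cos\big(\hat{h}(x, \tilde{x} + \delta),\, h\big) \qquad (4)$$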

Moreover, targeted attacks on heatmaps may trick a user into distrusting an original image. To generate targeted adversarial examples, we can just flip the minimization to maximization and use some target label or heatmap.

4 Adversarially Robust Image Attribution

In this section, we propose a robust contrastive learning algorithm applicable to both fingerprinting approaches described in Nguyen et al. [46] and Black et al. [6], including the image comparator of the latter.

4.1 Robust contrastive learning

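We propose a method that adapts contrastive learning with the SimCLR loss [13] to be robust to imperceptible adversarial examples. Denote by $\mathcal{L}_{\mathrm{SimCLR}}$ the SimCLR loss defined on a batch of $2N$ examples forming $N$ positive pairs, where the $(2k-1)$-th and $2k$-th examples correspond to the same image under different transformations:

$$\mathcal{L}_{\mathrm{SimCLR}}(x_1, \dots, x_{2N}; \theta) = \frac{1}{2N} \sum_{k=1}^{N} \big[ \ell(2k-1, 2k) + \ell(2k, 2k-1) \big], \qquad \ell(i, j) = -\log \frac{\exp\big(\mathrm{sim}(f_\theta(x_i), f_\theta(x_j)) / \tau\big)}{\sum_{m \ne i} \exp\big(\mathrm{sim}(f_\theta(x_i), f_\theta(x_m)) / \tau\big)} \qquad (5)$$

Here $f_\theta$ is the IF model, $\mathrm{sim}$ is the cosine similarity and $\tau$ is a temperature, as in Chen et al. [13].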


These transformations are of two kinds: non-editorial (e.g., affine transformations and ImageNet-C [30]) and editorial (e.g., available as paired images in the PS-Battles dataset [29]). Black et al. [6] use both of them as positive examples, while Nguyen et al. [46] treat only images with non-editorial changes as positive.
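As a concrete reference for the SimCLR loss [13] on paired positives, here is a minimal NumPy sketch of the normalized temperature-scaled cross-entropy. It is illustrative only: the pairing convention (0-based rows 2k and 2k+1 are positives), the temperature value and the random embeddings are assumptions, and no projection head is applied.

```python
import numpy as np

def simclr_loss(z, temperature=0.5):
    """NT-Xent loss for embeddings z of shape (2N, d).

    Rows 2k and 2k+1 (0-based) are two transformed views of the same image.
    """
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit norm -> dot product is cosine similarity
    sim = z @ z.T / temperature
    np.fill_diagonal(sim, -np.inf)                     # an example is never its own negative
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    idx = np.arange(z.shape[0])
    partner = idx ^ 1                                  # 0<->1, 2<->3, ...
    return float(-log_prob[idx, partner].mean())

rng = np.random.default_rng(0)
base = rng.normal(size=(4, 8))
# Positive pairs: each image appears twice with a small perturbation.
aligned = np.repeat(base, 2, axis=0) + 0.01 * rng.normal(size=(8, 8))
# Unrelated embeddings: the "pairs" share nothing.
unrelated = rng.normal(size=(8, 8))

loss_aligned = simclr_loss(aligned)
loss_unrelated = simclr_loss(unrelated)
```

The loss is small when paired views map to nearby fingerprints and other images map far away, which is the invariance that the choice of positive pairs (non-editorial only, or non-editorial plus editorial) controls.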

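Then, to train adversarially robust IF models, we change the objective following the robust optimization framework [42] by adding an inner loop that maximizes the SimCLR loss $\mathcal{L}_{\mathrm{SimCLR}}$ over perturbations $\delta_i$ of the images $x_i$, each bounded by the budget $\varepsilon$:

$$\min_{\theta} \; \mathbb{E}_{(x_1, \dots, x_{2N}) \sim \mathcal{P}} \Big[ \max_{\|\delta_1\|_\infty \le \varepsilon, \dots, \|\delta_{2N}\|_\infty \le \varepsilon} \mathcal{L}_{\mathrm{SimCLR}}(x_1 + \delta_1, \dots, x_{2N} + \delta_{2N}; \theta) \Big] \qquad (6)$$

where $\theta$ denotes the model parameters and $\mathcal{P}$ the data distribution. We note that although adversarial training [42] is an established technique for image classification, IF models are trained differently. Thus, it is not clear in advance whether the findings and pitfalls of robust training of image classification models (e.g., catastrophic or robust overfitting [62, 49]) transfer to the IF setting.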

To solve the inner maximization problem, we use a few iterations of projected gradient ascent (in practice, up to 3), where each iteration requires an evaluation of the input gradient via backpropagation. Using a few iterations of the attack turns out to be sufficient to prevent the catastrophic overfitting problem [62, 2], which also manifests itself in training IF models, as we observe in the experimental part.

The final objective that we use, Eq. (7), combines the robust version of the SimCLR loss [13] with the hashing term from [46] for large-scale search.


For [6], which proposes no end-to-end hashing, we mostly report the models trained without the hashing term. In practice, we approximate the expectations using mini-batches, and we apply the hashing term on the same examples as the main loss. We do not use projection layers on top of the target embeddings as in Chen et al. [13] since we found this leads to worse performance.

4.2 Robust image comparator network

Next we discuss how to make the image comparator model from Black et al. [6] robust to adversarial attacks. First, we note that the image comparator performs a classification task (both for the prediction and heatmap modules), for which there are well-described solutions in the literature on how to improve robustness [42, 65, 23]. Thus, we use the multi-task objective of Black et al. [6] with the classification loss and the heatmap loss (we use the same losses and weights as in Black et al. [6], i.e. the cross-entropy $\mathcal{L}_{\mathrm{CE}}$ and the cosine similarity loss $\mathcal{L}_{\cos}$ with weights $w_c$ and $w_h$), but we add an inner maximization operator to it:

$$\min_{\theta} \; \mathbb{E}_{(x, \tilde{x}, y, h) \sim \mathcal{P}} \Big[ \max_{\|\delta\|_\infty \le \varepsilon} \; w_c \, \mathcal{L}_{\mathrm{CE}}(x, \tilde{x} + \delta, y; \theta) + w_h \, \mathcal{L}_{\cos}(x, \tilde{x} + \delta, h; \theta) \Big] \qquad (8)$$

where $y$ is the class label and $h$ is the ground-truth heatmap. Note that we add adversarial perturbations only to the corrupted query image $\tilde{x}$, as we assume that the image comparator operates on images from the database, which contains non-adversarial original images. As corrupted images, we use either original images under a non-editorial distortion (see the experimental setup below), manipulated images from PSBattles, or simply a different image under a non-editorial distortion. We use SGD for the outer loop and a few steps of PGD to approximate the maximization. We note that the image comparator model is fully differentiable, including the geometric alignment RAFT module [55], which is a part of their model. Thus, we both train and generate adversarial examples completely end-to-end for this network.

5 Experimental evaluation

In this section, we provide a detailed description of the experimental setup, describe the main results and show ablation studies that give more insight into the proposed method.

5.1 Experimental setup

Performance metrics.

For the image retrieval model of Black et al. [6], we consider recall-based metrics, assuming that for all queried images the originals have been indexed: (1) standard recall, which is the probability of retrieving the correct image under some image corruption (non-editorial transformation, manipulation or both), and (2) adversarial recall, which is the probability of retrieving the correct image under non-editorial transformations and an adversarial perturbation. For the image comparator model [6], we use the same metrics as in their paper: average precision (AP) for the classification module and the intersection over union (IoU) for the heatmap module. We use all images for the former and only images with editorial manipulations for the latter.
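The recall metrics above can be sketched as follows. This is a minimal illustration that uses exhaustive nearest-neighbour search under the Euclidean distance on synthetic fingerprints; the function name, the distance choice and the data are assumptions, and a large-scale evaluation would use an approximate index instead.

```python
import numpy as np

def recall_at_k(query_fp, db_fp, true_index, k):
    """Fraction of queries whose original is among the k nearest database fingerprints."""
    # Squared Euclidean distance between every query and every database entry.
    d2 = ((query_fp[:, None, :] - db_fp[None, :, :]) ** 2).sum(axis=2)
    topk = np.argsort(d2, axis=1)[:, :k]
    hits = (topk == true_index[:, None]).any(axis=1)
    return float(hits.mean())

rng = np.random.default_rng(0)
database = rng.normal(size=(100, 32))          # fingerprints of indexed originals and distractors
true_index = np.arange(10)                     # query i should retrieve database entry i

# Benign queries: fingerprints barely moved by a non-editorial transformation.
benign_queries = database[:10] + 0.01 * rng.normal(size=(10, 32))
# Stand-in for successful attacks: fingerprints pushed onto unrelated entries.
attacked_queries = database[10:20]

standard_recall = recall_at_k(benign_queries, database, true_index, k=1)
adversarial_recall = recall_at_k(attacked_queries, database, true_index, k=1)
```

Standard recall uses queries that were only corrupted or manipulated, while adversarial recall applies the same computation to queries that additionally carry an adversarial perturbation.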

We note that unlike [6], OSCAR-Net [46] is trained to distinguish non-editorial transformations from editorial manipulations. Thus, we follow their metrics, using standard mAP and top-1 recall (R@1) for the non-editorially transformed query set, and inverse mAP (imAP) and inverse recall (iR@1) for the tamper query set. For all metrics, higher is better.

Training details.

For training the robust image retrieval models we use the ResNet-50 architecture and Behance1M, a subset of 1M images sourced from a public digital art portfolio website; we will publish the URL list of these images upon acceptance. We use Behance1M for training since it is significantly larger than PSBattles [29] and more diverse, containing not only photos but also graphics. As a starting set of parameters, we use the model from Black et al. [6] pre-trained for 2 epochs with standard contrastive training. We train the model to be robust not only to adversarial examples (after two epochs of standard training) but also to non-editorial transforms via data augmentation, for which we use the beacon_aug library [41] to apply ImageNet-C corruptions, then resizing, rotation, padding, cropping, horizontal flips, and JPEG compression.

Evaluation details.

We evaluate robustness on Behance and on PSBattles [29]; we use the ‘hard’ subset of the latter defined in Black et al. [6]. As in past work [6, 46], we add distractor images from stock photography (Adobe Stock thumbnails). We use the FAISS library [37] for efficient image retrieval. We generate adversarial attacks using PGD with 50 iterations, with the step size decayed by a factor of 2 at 25%, 50%, and 75% of iterations. Unless specified otherwise, we evaluate the image retrieval models on full PSBattles with 2M and 100K Adobe Stock distractors, following the settings in [6] and [46], respectively. We apply adversarial attacks in the original pixel space, followed by differentiable resizing and then JPEG-90 compression (thus, it is a fully realizable attack).
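The step-size schedule of the evaluation attack can be written down explicitly. In this sketch the initial step size is a placeholder (the text does not preserve it), and the handling of milestones that fall between two iterations (12.5 and 37.5 for 50 iterations) is an assumption.

```python
def pgd_step_sizes(initial_step, n_iter=50, decay=2.0, milestones=(0.25, 0.50, 0.75)):
    """Step size for each PGD iteration, divided by `decay` once a milestone fraction is passed."""
    sizes = []
    for i in range(n_iter):
        # Count how many milestones iteration i has reached so far.
        n_decays = sum(i >= m * n_iter for m in milestones)
        sizes.append(initial_step / decay ** n_decays)
    return sizes

# Placeholder initial step of 1.0; scale by the actual initial step size in practice.
steps = pgd_step_sizes(initial_step=1.0)
```

Starting with larger steps and halving them lets the attack explore the perturbation ball early and refine the perturbation towards the end of the 50 iterations.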

Top-1 and top-100 recall for different query sets. Each cell lists R@1 / R@100.

| Model | Non-editorial distortions, no attack | Non-editorial distortions, adversarial | Editorial manipulations, no attack | Editorial manipulations, adversarial | Editorial + non-editorial, no attack | Editorial + non-editorial, adversarial |
|---|---|---|---|---|---|---|
| Existing models | | | | | | |
| Standard supervised, ImageNet [48] | 20.8 / 39.6 | 0.0 / 0.1 | 87.2 / 95.4 | 0.0 / 0.2 | 15.1 / 33.2 | 0.0 / 0.1 |
| DeepAugment + AugMix supervised, ImageNet [33] | 47.9 / 66.9 | 0.1 / 0.3 | 86.7 / 95.3 | 0.0 / 0.1 | 36.5 / 58.0 | 0.0 / 0.3 |
| Robust supervised, , ImageNet [51] | 39.4 / 53.0 | 10.2 / 20.8 | 86.3 / 95.0 | 31.2 / 56.1 | 30.8 / 46.5 | 5.5 / 15.2 |
| Undefended contrastive, PSBattles [6] | 74.1 / 93.7 | 0.0 / 0.0 | 80.1 / 92.9 | 0.0 / 0.0 | 55.3 / 83.5 | 0.0 / 0.0 |
| Our new models | | | | | | |
| Undefended contrastive, Behance1M (ours) | 97.8 / 98.8 | 1.3 / 12.4 | 88.6 / 92.4 | 0.4 / 4.5 | 84.7 / 89.7 | 1.0 / 9.5 |
| ARIA contrastive + hashing, , Behance1M | 93.5 / 96.2 | 74.7 / 81.9 | 87.5 / 92.8 | 76.4 / 86.5 | 79.2 / 87.6 | 56.3 / 70.7 |
| ARIA contrastive + hashing, , Behance1M | 89.0 / 93.5 | 80.7 / 83.4 | 87.0 / 92.5 | 80.6 / 87.9 | 74.8 / 84.7 | 57.9 / 72.5 |
| ARIA contrastive, , Behance1M | 97.3 / 98.2 | 79.0 / 82.1 | 90.8 / 93.6 | 81.0 / 86.4 | 87.3 / 91.0 | 63.0 / 71.0 |
| ARIA contrastive, , Behance1M | 96.4 / 97.3 | 83.0 / 85.7 | 91.6 / 93.9 | 85.1 / 89.9 | 86.7 / 90.3 | 69.7 / 77.0 |
| ARIA contrastive, , Behance1M | 94.2 / 96.0 | 83.7 / 87.4 | 90.4 / 93.3 | 85.5 / 89.9 | 83.1 / 88.0 | 69.3 / 77.0 |

Table 1: Standard and adversarial top-1 and top-100 recall (%) for different ResNet-50 models evaluated on PSBattles [29]. The database contains original images from PSBattles and 2M distractor images from Stock, indexed using the IVF1024, PQ16 index from the FAISS library following Black et al. [6]. We use three query sets based on PSBattles: (1) non-editorial distortions (ImageNet-C and affine) on original images, (2) editorial manipulations but no distortions, (3) editorial manipulations with non-editorial distortions.

5.2 Robust retrieval: Black et al. [6] approach

Large-scale robustness evaluation.

The main evaluation results for the robust image retrieval models trained on Behance1M are presented in Table 1, where we measure recall with 2M distractor images using the IVF1024, PQ16 index from the FAISS library following Black et al. [6] (non-FAISS results, which are too slow to be practical, are reported in the supplementary material). We report models trained with different perturbation radii $\varepsilon$ since we want to show models with different robustness-accuracy tradeoffs [58]. We measure adversarial recall under a single perturbation size, chosen because it is the most commonly used perturbation size in the literature [42, 15]. Compared to the existing IF models trained contrastively on PSBattles [6] and in a supervised way on ImageNet [48, 33, 51], our proposed method allows us to train IF models which are robust to imperceptible adversarial perturbations in addition to being highly accurate. For example, one of our robust models achieves 91.6% standard and 85.1% adversarial top-1 recall on manipulated images, compared to 80.1% and 0.0% achieved by the main baseline from [6]. In addition, Table 1 shows that robust contrastive learning can lead to a higher recall for manipulated images than the standard model (91.6% vs. 88.6% top-1 recall) despite a worse recall on non-editorial distortions (96.4% vs. 97.8%). We note that this is another interesting case where adversarial training improves performance on real-world distribution shifts, in addition to the previously known benefits of adversarial training on, e.g., common corruptions [20, 63, 39] or transfer learning [51, 59]. Finally, the models trained with the hashing term (see Eq. (7)) perform slightly worse than the models producing real-valued IFs, but there can still be interesting use-cases for them such as faster search, lower memory requirements, and potential storage in key-value data structures.

Adversarial recall
Models Recall
Undefended, PSBattles [6] 92.2% 0.0% 0.0% 0.0%
ARIA, , Behance1M 99.6% 84.2% 38.8% 2.8%
ARIA, , Behance1M 98.4% 84.0% 50.6% 9.8%
ARIA, , Behance1M 97.8% 83.4% 60.0% 15.4%
Table 2: Standard and adversarial top-1 recall for attacks unseen during training. We use evaluation on a query set of non-editorial distortions of Behance1M images (500 distractors).

Robustness to unseen adversarial perturbation.

In Table 2, we show the robustness results for perturbations which were unseen during training, such as perturbations bounded in a different norm and $\ell_\infty$-perturbations of a larger radius than those used for training. We can see that robustness indeed generalizes to these other types of perturbations: every unseen attack in Table 2 reduces the adversarial recall of the undefended model of Black et al. [6] to 0.0%, while our $\ell_\infty$-trained robust models retain up to 84.2% adversarial recall.

Figure 4: Visualization of the hash inversions (see Eq. (9)) for two original images: (a) original image, (b) inversion for a standard model, (c) inversion for an ARIA robust model. Both models are trained on Behance1M.

Plausible hash inversions.

Here we show that adversarially robust image hashing models output plausible images under hash inversion attacks. This type of attack has recently received a lot of attention as a significant weakness of neural hashing models [52]. The goal of this attack is to compromise the validity of the hashing model by finding some irrelevant image that leads to the same binary hash as some target image. The formulation is similar to that of targeted adversarial attacks, but without the $\ell_\infty$-norm constraint on the perturbation magnitude and starting from some arbitrary image (we use a constant gray image). We take the robust model from Table 1 trained to output binary hashes via the sign function, for which we use a differentiable approximation (a tanh function with a scaling parameter).


We solve this formulation (Eq. (9)) using 1,000 iterations of PGD, ensuring an exact hash match in each case. We note that similar formulations in the context of image classification have been studied in [43, 17, 18] from the interpretability perspective, where they focused on inverting real-valued embeddings of a deep network instead of hashes.
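The differentiable approximation of the sign function used for binary hashes can be illustrated in a few lines. This is a sketch under assumptions: the scaling parameter is called `alpha` here only for illustration, and the embedding values are made up.

```python
import numpy as np

def binary_hash(embedding):
    """Binary hash in {-1, +1} obtained with the (non-differentiable) sign function."""
    return np.where(embedding >= 0, 1, -1)

def soft_hash(embedding, alpha):
    """Differentiable surrogate: tanh with a scaling parameter alpha (hypothetical name)."""
    return np.tanh(alpha * embedding)

embedding = np.array([-1.5, -0.2, 0.05, 0.7, 2.0])   # made-up real-valued fingerprint
target = binary_hash(embedding)

# The surrogate approaches the exact binary hash as alpha grows.
gap_small_alpha = np.abs(soft_hash(embedding, 1.0) - target).max()
gap_large_alpha = np.abs(soft_hash(embedding, 100.0) - target).max()
```

A larger scaling parameter makes the surrogate closer to the true hash but flattens its gradient away from zero, so an inversion procedure has to check the exact hash match with the real sign function at the end.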

As we can see from Fig. 4, hash inversion attacks on standardly trained hashing models tend to produce obscure high-frequency patterns. At the same time, our robust image hashing models tend to focus more on shapes of objects which are approximately recovered under hash inversions. This behaviour is closely related to the adversarial vulnerability problem: the attacker can use non-robust features [35] to arbitrarily manipulate the model’s hash. However, once we fix this problem via robust training, hash inversions start to be more related to the original images.

Figure 5: Standard and adversarial top-1 recall at different epochs for models trained with (1) different numbers of PGD iterations and (2) different perturbation radii (using 3 iterations of PGD). We evaluate a query set of non-editorial distortions of Behance1M images (500 distractors).

Hyperparameter importance.

Here we analyze the hyperparameters of ARIA training: the number of PGD iterations and the perturbation radius. We train multiple models on Behance1M and report their baseline (i.e. no attack) and adversarial recall (i.e. under an attack with a fixed perturbation budget). Fig. 5 (top) suggests that, similar to image classification, catastrophic overfitting [62, 2] leads to unstable performance for robust contrastive learning, so training with a single iteration of PGD should be avoided. We observe that 2-3 iterations of PGD are sufficient; the training-time overhead for 3 iterations is reported in Appendix A. Fig. 5 (bottom) shows that a model trained with a large perturbation radius starts to overfit in terms of adversarial recall after 7 epochs. This suggests that training with too large a radius may be problematic.
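The two hyperparameters discussed here, the number of PGD iterations and the perturbation radius, appear directly in a generic PGD loop. The following is a minimal sketch under stated assumptions: the function names are hypothetical, pixels are assumed to lie in [0, 1], and a toy linear loss stands in for the contrastive training loss, which is not reproduced here.

```python
def sign(v):
    """Return -1, 0 or +1 depending on the sign of v."""
    return (v > 0) - (v < 0)

def pgd_linf(x0, grad_fn, eps, step, n_iter):
    """Signed-gradient ascent on a loss, projected onto the l_inf ball of radius eps around x0."""
    x = list(x0)
    for _ in range(n_iter):
        g = grad_fn(x)
        # Ascent step along the sign of the gradient.
        x = [xi + step * sign(gi) for xi, gi in zip(x, g)]
        # Projection onto the eps-ball around x0 and onto the valid pixel range [0, 1].
        x = [min(max(xi, x0i - eps, 0.0), x0i + eps, 1.0) for xi, x0i in zip(x, x0)]
    return x

# Toy stand-in loss: loss(x) = w . x, whose gradient is simply w.
w = [1.0, -2.0, 0.0]
loss = lambda x: sum(wi * xi for wi, xi in zip(w, x))
x_clean = [0.5, 0.5, 0.5]
x_adv = pgd_linf(x_clean, lambda x: w, eps=0.1, step=0.05, n_iter=3)
```

In this toy case the perturbation saturates at the boundary of the ball after two steps, so the third iteration changes nothing; with a real network the gradient changes at every step and additional iterations can still matter.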

                    No attack                                           Adversarial
Model               mAP     R@1     imAP    iR@1    F (mAP)  F (R@1)    imAP    iR@1    F (mAP)  F (R@1)
Undefended [46]     78.66   66.35   72.83   81.05   37.82    36.48      15.55   22.87   12.98    17.01
ARIA, (ours)        54.52   39.63   78.64   85.00   32.20    27.03      34.05   42.84   20.96    20.59
ARIA, (ours)        55.86   43.38   79.62   85.81   32.83    28.81      35.79   45.17   21.81    22.13
ARIA, (ours)        52.18   38.11   78.89   85.88   31.41    26.40      55.92   66.28   26.99    24.20
Table 3: Metrics for the no-attack and adversarial settings for OSCAR-Net models, using queries from PSBattles. For mAP and R@1, queries have only non-editorial transforms applied. For imAP and iR@1, digitally manipulated images with no distortions are used, both with and without adversarial perturbations. F-scores are calculated from the appropriate mAP/R@1 following Nguyen et al. [46].

5.3 Robust retrieval: Nguyen et al. [46] approach

We present our attack and defence results on OSCAR-Net [46] in Table 3. Following the evaluation protocol of the attribution benchmark in [46], we report mAP/R@1 scores for the non-editorially transformed query set and imAP/iR@1 scores for the editorially transformed query set, together with the harmonic F score balancing the two terms (one F score combining mAP with imAP, and one combining R@1 with iR@1). Additionally, we perform adversarial attacks on the editorial query set, creating a new query set called adversarial. We trained OSCAR-Net models with different perturbation radii to defend against such attacks, using Eq. (2), and report their performance in both the no-attack and adversarial scenarios.
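As a sanity check on how the F columns of Table 3 relate to the other columns, the sketch below recomputes them for the undefended model. The function name is an assumption. One observation from the numbers alone: the tabulated values are consistent with a·b/(a+b), which is half of the conventional harmonic mean 2ab/(a+b); this is inferred from the table and is not a statement of the exact definition used in [46].

```python
def f_score(a, b):
    """Combine two scores as a*b/(a+b); this matches the F columns of Table 3 numerically."""
    return a * b / (a + b)

# Undefended OSCAR-Net row of Table 3.
m_ap, r_at_1 = 78.66, 66.35              # non-editorial query set
im_ap, i_r_at_1 = 72.83, 81.05           # editorial query set, no attack
im_ap_adv, i_r_at_1_adv = 15.55, 22.87   # editorial query set, adversarial

f_map_clean = f_score(m_ap, im_ap)
f_r1_clean = f_score(r_at_1, i_r_at_1)
f_map_adv = f_score(m_ap, im_ap_adv)
f_r1_adv = f_score(r_at_1, i_r_at_1_adv)
```

Note that the adversarial F scores reuse the clean mAP and R@1, since only the editorial query set is attacked.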

Whilst the baseline OSCAR-Net works well without attacks, it performs poorly on the adversarial query set, with a drop of more than 57 percentage points in both imAP and iR@1; the baseline model is easily fooled by the attack. In contrast, all 3 defence models outperform OSCAR-Net by significant margins (an improvement of more than 40 percentage points in imAP and iR@1 for the best model). These models also achieve better imAP and iR@1 scores on the standard benchmark, at the cost of reduced performance on the non-editorial set. We note that this trade-off is already observed in the original OSCAR-Net [46] and is associated with feature generalization versus discrimination. The trade-off is steered towards boosting the discrimination of editorial changes because the models are trained to defend against adversarial attacks on such changes. The perturbation radius used for training determines the defence strength, with the last ARIA model in Table 3 yielding the best adversarial performance, closest to the performance on the standard benchmark. This is consistent with the findings of Sec. 5.2 for defending the model of Black et al. [6].

5.4 Robust image comparator

                      No attack          Adversarial
Models                AP       IoU       AP       IoU
Undefended ICN [6]    96.4%    58.1%     0.6%     5.1%
ARIA ICN,             96.4%    61.5%     65.0%    37.9%
ARIA ICN,             95.9%    59.3%     83.1%    43.7%
ARIA ICN,             95.5%    55.9%     90.7%    44.9%
Table 4: The average precision (AP) and intersection over union (IoU) between the predicted and ground truth editorial heatmaps for the image comparator network (ICN) with/without adversarial perturbations.

We fine-tune the model from Black et al. [6] for 40 epochs, using 3 iterations of PGD for training with different perturbation radii, and show the results in Table 4. We benchmark robust image comparator models separately since they are trained independently of image fingerprinting models and provide complementary information about the presence of an editorial change and its location.

Classification module.

First, we observe that our robust training method substantially improves the classification precision under untargeted adversarial examples. E.g., the first ARIA model in Table 4 preserves the same precision as the model from [6] (96.4%) but substantially improves the adversarial precision (from 0.6% to 65.0%). The most robust model is the last ARIA model in Table 4, which achieves 90.7% adversarial precision but sacrifices 0.9% of standard precision. We benchmark targeted attacks and show the confusion matrix over classes in the sup. mat.

Heatmap module.

Quantitatively, we improve the heatmap performance in terms of the adversarial IoU significantly: from 5.1% to 44.9% for the most robust model, which has a comparable baseline IoU (55.9% vs. 58.1%). Moreover, the other two robust models in Table 4 have both a better baseline IoU (61.5% and 59.3% vs. 58.1%) and a better adversarial IoU (37.9% and 43.7% vs. 5.1%) than the undefended model. Qualitative results for a robust heatmap module are visualized in Fig. 3: a robust comparator correctly highlights the same edited area regardless of whether adversarial perturbations are added. At the same time, the comparator from [6] can also be fooled in a targeted way to highlight any area (e.g., the bottom-right corner) as manipulated.
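For readers unfamiliar with the IoU metric used for the heatmaps, here is a minimal sketch. The 0.5 binarization threshold and the function names are assumptions for illustration; the exact thresholding used in the evaluation is not specified in this section.

```python
def binarize(heatmap, threshold=0.5):
    """Turn a real-valued heatmap (list of rows) into a 0/1 mask; the threshold is an assumed value."""
    return [[1 if v >= threshold else 0 for v in row] for row in heatmap]

def iou(pred_mask, true_mask):
    """Intersection over union of two binary masks of equal shape."""
    inter = union = 0
    for pred_row, true_row in zip(pred_mask, true_mask):
        for p, t in zip(pred_row, true_row):
            inter += p & t
            union += p | t
    # Two empty masks agree perfectly, so return 1.0 in that degenerate case.
    return inter / union if union else 1.0

predicted = binarize([[0.9, 0.8], [0.1, 0.2]])   # -> [[1, 1], [0, 0]]
ground_truth = [[1, 0], [1, 0]]
score = iou(predicted, ground_truth)             # one shared cell out of three active cells
```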

6 Conclusions

We began by showing that the current state-of-the-art image attribution models (both image fingerprinting and comparison) are not robust to imperceptible adversarial attacks. This is concerning since these attacks are fully realizable and can be applied directly in a digital format where the attacker has white-box access to the model. To address this vulnerability, we proposed a simple and effective training technique for image attribution that significantly improves robustness to various adversarial perturbations, including ones unseen during training. We applied this to two fingerprinting approaches [46, 6]: those seeking to match manipulated content and those seeking to avoid matching such content. Finally, we also showed how a recent manipulation localization approach [6] can be trained robustly, including both its classification and heatmap modules.

Overall, we think that the adversarial vulnerability of image attribution models poses a significant societal risk, and that our proposed method provides a well-motivated and practical way to address it. Our solution, ARIA, is particularly timely given the emergence of content authenticity standards that advocate for the use of image attribution [1]. ARIA has two limitations: the increased training time (however, only about a factor of two; see Appendix A) and the trade-off for some models between a significant improvement in robustness and a minor loss of accuracy (see the OSCAR-Net and image comparator experiments). Future work can further address the latter issue and additionally explore adversarial defenses for ‘blind’ manipulation detection models [9], complementary to image attribution approaches.


Organization of the appendix

In Sec. A we present additional training and evaluation details. In Sec. B, we provide further implementation details for the attacks and defences both for OSCAR-Net and Black et al. [6] models. In Sec. C, we show multiple additional experiments such as accuracy of the retrieval with exact nearest neighbour search, additional hash inversion visualizations, robustness of OSCAR-Net to unseen adversarial perturbations, accuracy over classes and targeted attacks on the heatmaps for ICN models.

Appendix A Training and evaluation details

Training details.

For the models trained according to the approach of Black et al. [6], we use the learning rate , SimCLR temperature , 3 steps of PGD for training using step sizes for , respectively.

For the OSCAR-Net [46] models, we use the default hyperparameters except the learning rate which is set to and SimCLR temperature of . For ARIA training, we use 3 steps of PGD with the step size .

For the image comparator models, we use the default training hyperparameters with 3 steps of PGD for training using step sizes for , respectively.

Evaluation details.

For the attacks unseen during training, we use iterations of PGD (we increase it from iterations used throughout the paper to account for larger perturbation radii) using the step size of for -perturbations and for -perturbations.

For hash inversions, we use 1000 iterations of PGD with the step size , and the approximation parameter .

Training time.

Standard training of the Black et al. [6] model on Behance1M takes 34.3 hours while ARIA training takes 72.8 hours (i.e., a slowdown factor of about 2.1) on two NVIDIA V100 GPUs for 20 epochs.

Standard OSCAR-Net training on PSBattles takes 31.6 hours while ARIA training takes 65.1 hours (i.e., a slowdown factor of about 2.1) on a single NVIDIA GeForce RTX 3090 GPU for 10 epochs.

We note that for both models ARIA uses 3 steps of PGD for training, yet the slowdown factor is smaller than the additional forward-backward passes of PGD would suggest, which is due to more effective GPU utilization for robust training.
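The slowdown factors follow directly from the reported wall-clock times; the short sketch below recomputes them (the function name is illustrative).

```python
def slowdown(standard_hours, robust_hours):
    """Ratio of robust (ARIA) training time to standard training time."""
    return robust_hours / standard_hours

# Reported training times in hours.
black_et_al = slowdown(34.3, 72.8)   # Behance1M, two V100 GPUs, 20 epochs
oscar_net = slowdown(31.6, 65.1)     # PSBattles, one RTX 3090 GPU, 10 epochs
```

Both ratios are close to 2.1 and 2.06 respectively, i.e. roughly a doubling of the training time.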

Examples of non-editorial transformations.

In Fig. 6 and Fig. 7, we show images with non-editorial changes from PSBattles which we used for the “Editorial + non-editorial” query sets for evaluation of the OSCAR-Net models and models of Black et al. [6].

Figure 6: Examples of non-editorial changes applied to the same image from PSBattles according to the query sets used to evaluate the OSCAR-Net [46] and Black et al. [6] approaches.
Figure 7: Additional examples of non-editorial changes applied to the images from PSBattles.
Top-1 and top-100 recall for different query sets
                                                   Non-editorial distortions     Editorial manipulations   Editorial + non-editorial
                                                     No attack   Adversarial     No attack   Adversarial     No attack   Adversarial
Model                                               R@1  R@100    R@1  R@100    R@1  R@100    R@1  R@100    R@1  R@100    R@1  R@100
Existing models
Standard supervised, ImageNet [48]                 45.1   59.3    0.0    0.2   98.3   99.6    0.1    0.3   37.3   52.9    0.0    0.3
DeepAugment + AugMix supervised, ImageNet [33]     75.2   84.5    0.2    2.0   98.5   99.6    0.0    0.6   67.8   80.7    0.0    0.3
Robust supervised, , ImageNet [51]                 57.3   66.1   30.3   44.0   97.4   99.2   79.7   92.4   51.2   62.0   22.4   38.0
Undefended contrastive, PSBattles [6]              86.2   96.7    0.0    0.0   87.7   95.5    0.0    0.0   70.0   89.5    0.0    0.0
Our new models
Undefended contrastive, Behance                    99.2   99.9    4.8   25.3   94.4   97.6    0.9    9.8   91.9   96.8    2.6   16.1
ARIA contrastive + hashing, , Behance              96.8   98.7   83.8   89.3   92.1   96.7   85.2   93.8   87.1   94.5   69.2   82.7
ARIA contrastive + hashing, , Behance              93.5   96.5   84.1   90.8   91.4   96.0   87.0   93.9   82.8   91.1   69.7   82.4
ARIA contrastive, , Behance                        99.5  100.0   87.7   90.7   96.1   98.6   91.6   96.9   94.8   98.1   78.6   87.3
ARIA contrastive, , Behance                        99.4   99.9   90.5   92.7   96.1   98.4   93.4   97.3   94.7   97.9   83.3   90.4
ARIA contrastive, , Behance                        98.6   99.7   94.5   95.4   95.5   98.3   93.2   97.1   92.8   97.2   82.9   90.9
Table 5: Standard and adversarial top-1 and top-100 recall for different ResNet-50 models evaluated on PSBattles [29]. The database contains original images from PSBattles and 2M distractor images from Stock, indexed using exact nearest neighbour search (unlike Table 1 in the main part, which used the approximate IVF1024, PQ16 index). We use three query sets based on PSBattles: (1) non-editorial distortions (ImageNet-C and affine) on original images, (2) editorial manipulations but no distortions, (3) editorial manipulations with non-editorial distortions.

Appendix B Further details on the attack and defence scope on OSCAR-Net and Black et al. models

A model needs to be differentiable with respect to the input image in order to perform an effective adversarial attack (and defence) on it. In other words, our main prerequisite is that we should be able to back-propagate the gradient of the loss to the original input. Although OSCAR-Net [46] and the approach of Black et al. [6] are complex attribution models, we show that both meet this requirement.

OSCAR-Net consists of an object detection module (Mask-RCNN [27]) to decompose an image into a set of objects, followed by 3 sub-networks that learn the global image features, the object-level features (including object CNN visual, shape and geometry features) and the relation features between objects. These features are pooled via a fully-connected graph transformer network to produce a compact binary embedding. Note that OSCAR-Net does not aim to learn object detection (the Mask-RCNN module weights are not updated during training), and we do the same. Here we focus on attacking and defending the multi-branch feature extraction and aggregation, which are learnable in OSCAR-Net. Thus, we apply our perturbations to the full image after the object detection step, i.e. we treat the output of the object detector as constant. We note that there exist adversarial attack and defence approaches for object detection [10], and integrating those into OSCAR-Net could be a topic of future work.

The approach of Black et al. [6] consists of two distinct models that are trained separately: an image retrieval model insensitive to both editorial and non-editorial changes, followed by an image comparator (IC) model distinguishing editorial from non-editorial transformations. Given a query, the image retrieval model returns the top-k candidate images, which are passed to the IC model to determine whether there is a ‘matched’ image among the candidates and whether the query has editorial or non-editorial changes. The IC model also outputs an editorial heatmap if an editorial change is predicted for a query-candidate pair. The retrieval model has a simple ResNet-50 architecture and is trained with the SimCLR loss [6], hence it is fully differentiable. The IC model is more complex, with a dewarping unit to align the query with the candidate image, followed by a CNN-based feature extraction module that outputs the editorial prediction and heatmap. Both sub-modules are differentiable with respect to the input image pair. In the main paper we have demonstrated that adversarial attacks can be performed on both the prediction and the heatmap, and we have presented an adversarially robust training method to defend against such attacks.

We refer to [6, 46] for more details on the architecture and training strategies of the two above approaches.

Appendix C Additional experiments

                   adversarial,                          adversarial,                          adversarial,
Model              imAP     iR@1     F (mAP)  F (R@1)    imAP     iR@1     F (mAP)  F (R@1)    imAP     iR@1     F (mAP)  F (R@1)
Undefended [46]    7.69     11.08    7.01     9.50       5.84     8.18     5.44     7.28       38.04    45.37    25.64    26.95
ARIA, (ours)       22.64    29.93    16.00    17.05      17.09    23.07    13.01    14.58      54.30    61.55    27.21    24.11
ARIA, (ours)       21.04    27.97    15.29    17.01      16.76    22.36    12.89    14.76      47.46    55.44    25.66    24.34
ARIA, (ours)       41.85    49.56    23.22    21.54      40.43    47.09    22.78    21.06      42.14    51.52    23.31    21.91
Table 6: Performance metrics for attacks unseen during training for OSCAR-Net models, using queries from PSBattles. Evaluation is on a query set of digitally manipulated images with no distortions.
                      Average precision, no attack                               Average precision, adversarial attack
Models                All       Non-editorial  Edit. + non-    Different        All       Non-editorial  Edit. + non-    Different
                      classes   changes        edit. changes   images           classes   changes        edit. changes   images
Undefended ICN [6]    96.4%     98.2%          91.4%           99.6%            0.6%      0.0%           0.1%            1.6%
ARIA ICN,             96.4%     91.8%          97.7%           99.7%            65.0%     21.6%          84.9%           85.6%
ARIA ICN,             95.9%     91.6%          97.0%           99.3%            83.1%     67.6%          87.1%           93.9%
ARIA ICN,             95.5%     92.2%          95.5%           98.5%            90.7%     86.6%          88.7%           96.2%
Table 7: The average precision for the image comparator network (ICN) with/without adversarial perturbations, over three different classes (depending on the query image, which can be either the same image with non-editorial changes, the same image with editorial and non-editorial changes, or a different image).

Retrieval with exact nearest neighbour search for Black et al. [6] models.

First of all, we note that the exact nearest neighbour search reported in Table 5 is not practical for databases that contain millions of images; we report it in order to analyze the performance drop that occurs due to approximate image retrieval. Table 5 suggests that overall the trends and rankings between different methods are the same as in Table 1 from the main part of the paper. At the same time, as expected, the absolute numbers are higher: e.g., the standard top-1 recall of an ARIA contrastive model is 99.5% compared to 97.3% with the approximate indexing reported in the main part. This performance drop is uniform over different methods. We can also see that ImageNet-trained models perform well on images with editorial changes. However, we note that the ImageNet models use an embedding dimension that is much larger than the one used by our contrastively trained models, which leads to an even slower search time.
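To make the evaluation protocol concrete, the following is a minimal sketch of exact (brute-force) nearest neighbour retrieval and top-k recall. It assumes cosine similarity between embeddings and uses hypothetical function names and toy two-dimensional embeddings; it is not the authors' retrieval code.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, database, k):
    """Exact search: score every database entry and return the ids of the k most similar ones."""
    ranked = sorted(database.items(), key=lambda item: cosine(query, item[1]), reverse=True)
    return [image_id for image_id, _ in ranked[:k]]

def recall_at_k(queries, database, k):
    """Fraction of queries whose true source image appears among the top-k results."""
    hits = sum(true_id in top_k(embedding, database, k) for embedding, true_id in queries)
    return hits / len(queries)

# Toy database of image embeddings and queries paired with their true source ids.
database = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
queries = [([0.9, 0.1], "a"), ([0.1, 0.9], "b"), ([1.0, 0.2], "c")]
r1 = recall_at_k(queries, database, 1)
r2 = recall_at_k(queries, database, 2)
```

The cost of this search grows linearly with the database size, which is why an approximate index is used for the main experiments.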

Robustness of OSCAR-Net models to unseen adversarial perturbations.

Table 6 shows the robustness results of OSCAR-Net for perturbations which were unseen during training. These are perturbations bounded in a norm other than the one used for training, and perturbations of a larger radius than those used for training.

The robustness generalises very well to the larger-radius perturbations: e.g., for one such perturbation size the F score (computed from mAP) of the undefended model of Nguyen et al. [46] is reduced to 5.44%, but for all our defended models it is at least 12.89%, and for our best defended model it is 22.78%. The remaining unseen perturbations are not very successful at attacking the OSCAR-Net model, so it is not possible to draw conclusions about robustness in this case. We think that for these perturbations treating the object detector’s output as constant can be suboptimal, but we leave better attacks tailored to the OSCAR-Net architecture to future work.

Image comparator models: accuracy over classes.

We show the results in Table 7, where we report the average precision over three classes depending on the query image, which can be either the same image with non-editorial changes, the same image with editorial and non-editorial changes, or a different image. We can see that the standard precision is approximately uniform over different classes but the adversarial precision can be non-uniform. For example, the first ARIA ICN model in Table 7 has only 21.6% adversarial precision on the same images with non-editorial changes. However, using a higher perturbation radius for ARIA training fixes this problem: e.g., the most robust model reaches 86.6% adversarial precision on this class.

Image comparator models: targeted attacks on heatmaps.

We show the results of targeted attacks on the image comparator models in Table 8. For the attack, we target a random cell of a heatmap by maximizing the cosine loss. We note that, unlike other metrics, a lower targeted intersection over union (IoU) is better as it implies a smaller overlap of the predicted heatmap with the wrong target heatmap. We can observe that ARIA training successfully reduces the success rate of the attack in terms of IoU from 48.3% (undefended ICN) down to 3.9% (the most robust ARIA model).
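One way to read the targeted objective is sketched below, under the assumption that the attacker maximizes the cosine similarity between the predicted heatmap and a one-hot target heatmap with a single active cell. The function names and the flattened heatmap layout are hypothetical, and the exact loss used in the experiments may differ.

```python
import math

def one_hot_target(rows, cols, cell):
    """Flattened target heatmap with a single active cell given as (row, col)."""
    r, c = cell
    return [1.0 if (i == r and j == c) else 0.0 for i in range(rows) for j in range(cols)]

def cosine_similarity(a, b):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

target = one_hot_target(2, 2, (1, 1))                            # bottom-right cell of a 2x2 heatmap
fooled = cosine_similarity([0.0, 0.0, 0.0, 2.0], target)         # heatmap fully on the target cell
unaffected = cosine_similarity([1.0, 0.0, 0.0, 0.0], target)     # heatmap on a different cell
diffuse = cosine_similarity([1.0, 1.0, 1.0, 1.0], target)        # uniform heatmap
```

A successful targeted attack would push this similarity towards 1, which is what a high targeted IoU in Table 8 reflects.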

                      No attack    Adversarial
Models                IoU          Targeted IoU
Undefended ICN [6]    58.1%        48.3%
ARIA ICN,             61.5%        10.0%
ARIA ICN,             59.3%        5.4%
ARIA ICN,             55.9%        3.9%
Table 8: The average intersection over union (IoU) between the predicted and ground truth editorial heatmaps for the image comparator network (ICN) with/without targeted adversarial perturbations. Note that unlike other metrics, a lower targeted IoU is better as it implies a smaller overlap of the predicted heatmap with the wrong target heatmap.

Hash inversion visualizations.

Additional hash inversions for randomly chosen images from PSBattles can be found in Fig. 8. We can observe that in many cases hash inversions for the robust model recover the shapes of the original images. This is in contrast with the high-frequency noise observed for the standard model.

Figure 8: Additional visualizations of the hash inversions for twelve original images (panels a, d), with the corresponding inversions for a standard model (panels b, e) and an ARIA model (panels c, f), both trained on Behance1M.