Image inpainting aims to fill the missing region of an image with plausible contents. It has a wide range of applications in image processing and computer vision, e.g., repairing damaged photos and removing unwanted objects. Nevertheless, image inpainting techniques might also be exploited maliciously to alter and delete contents, making them powerful tools for creating forged images. The trust issues and security concerns regarding the malicious use of image inpainting have been attracting increasing attention in recent years; examples include using inpainted images as evidence in court, removing key objects to report fake news, and erasing visible copyright watermarks, to name just a few. The situation becomes even worse now that deep learning (DL)-based inpainting methods have become prevalent. As shown in Fig. 1 (a)-(b), a malicious attacker can very easily change the facial content or erase key objects/watermarks by using the latest DL-based inpainting methods through online websites (e.g., https://www.nvidia.com/research/inpainting/) or open resources. Therefore, it is imperative to study how to accurately detect and locate inpainted regions in order to fight against inpainting forgeries.
The detection and localization of processed regions have always been a hot research topic in the field of information forensics. Many methods have been proposed to detect forged regions through their specific artifacts, e.g., compression artifacts, noise patterns, color consistencies, EXIF consistencies, and copy-move traces. However, little research has been done on the detection of inpainting manipulations, especially the latest DL-based ones. As noted in the literature, inpainting manipulations are more sophisticated and complex than, e.g., copy-move forgery, because the source of the copied information could be non-continuous. Furthermore, DL-based inpainting schemes are capable of creating completely novel semantic contents, which imposes great challenges on the detection task. The pioneering study on deep inpainting detection was conducted by Li and Huang, who showed that it is feasible to train a deep model for detecting specific deep inpainting artifacts if the inpainting scheme is known. However, with the rapid progress of DL-based inpainting, it is very challenging to know the inpainting scheme employed for a given image; sometimes more than one inpainting scheme could be adopted to edit a single image. It is therefore very desirable to find a generalizable forensic approach for detecting various inpainting manipulations, covering not only traditional inpainting schemes but also DL-based ones. This problem, though challenging, seems to be tractable because convolutional neural networks usually produce common artifacts when generating images, as reported in prior work.
In this work, we tackle the challenge of providing a forensic solution that generalizes well and accurately detects various unseen inpainting manipulations. More specifically, we propose a novel end-to-end Generalizable Image Inpainting Detection Network (GIID-Net) to detect the inpainted regions with pixel-level accuracy. The proposed GIID-Net consists of three sub-blocks: the enhancement block, the extraction block and the decision block. The enhancement block aims to enhance the inpainting traces by using hierarchically combined special layers. The extraction block, automatically designed by a Neural Architecture Search (NAS) algorithm, is targeted at extracting features for the actual inpainting detection task. In order to further optimize the extracted latent features, we integrate global and local attention modules in the decision block, where the global attention reduces the intra-class differences by measuring the similarity of global features, while the local attention strengthens the consistency of local features. Furthermore, we thoroughly study the generalizability of our GIID-Net, and find that different training data could result in vastly different generalization capability. By carefully examining 10 popular inpainting methods, we identify that the GIID-Net trained on one specific deep inpainting method exhibits desirable generalizability; namely, the obtained GIID-Net can also accurately detect and localize inpainting manipulations produced by unseen (not only DL-based but also traditional) inpainting methods. Extensive experimental results are presented to validate the superiority of the proposed GIID-Net over the state-of-the-art competitors. Our results suggest that common artifacts are shared across diverse image inpainting methods. Finally, we build a public inpainting dataset of 10K image pairs for future research in this area. An example of the detection result of GIID-Net is shown in Fig. 1 (c), which is the direct output of our model, without any post-processing, when using Fig. 1 (b) as input. Here we would like to emphasize that neither the original images in Fig. 1 (a) nor the corresponding inpainting methods [33, 38, 46] were involved in the training of GIID-Net.
Our major contributions can be summarized as follows:
We propose the GIID-Net, a novel end-to-end network for the generalizable inpainting detection, where the NAS algorithm is used for designing appropriate network architecture and newly proposed attention modules are incorporated to further optimize latent features.
We construct a diverse-inpainting test dataset with 10K images, based on 10 different inpainting methods, each contributing 1K images. Among them, six (GC, CA, SH, EC, LB and RN) are DL-based, and the remaining four (TE, NS, PM and SG) are traditional ones. This could serve as a publicly accessible dataset for standardized comparisons of inpainting detection approaches.
We show that the forensic model trained on a specific deep inpainting method exhibits excellent generalizable detection capability on other inpainting methods, whether DL-based or traditional. This supports the view that common detectable traces are left by various inpainting manipulations.
II Related Works
II-A Inpainting Methods
Image inpainting provides a means for the reconstruction of missing regions, and has been studied for decades (see [8, 4, 14, 7, 9, 11, 42, 22, 21, 13, 35] and references therein). Bertalmio et al. introduced an approach that uses ideas from classical fluid dynamics to propagate isophote lines continuously from the exterior into the missing regions. Based on the fast marching method for level set applications, Telea proposed a simple and fast inpainting algorithm by propagating an image smoothness estimator along the image gradient. More recently, Huang et al. showed that image inpainting can be substantially improved by automatically guiding the low-level synthesis algorithm using mid-level structural analysis of the known region. Herling and Broll presented a combined pixel-based approach that not only allows for even faster inpainting, but also improves the overall image quality significantly. However, as these texture-synthesis-based inpainting methods essentially assume that the missing region shares the same structural features with the known one, they cannot create novel contents for the challenging cases where the missing region involves complex structures (e.g., faces) and high-level semantics [54, 28].
To address these limitations, many DL models have been proposed for image inpainting in recent years. By utilizing large-scale datasets to learn semantic representations of images, DL-based inpainting methods are able to generate completely novel contents and achieve state-of-the-art inpainting performance. Pathak et al. pioneered the research in this direction by training deep generative adversarial networks for inpainting large holes in images. However, the proposed networks cannot satisfactorily maintain global consistency and tend to produce severe visual artifacts. Iizuka et al. designed a generative network with two context discriminators to encourage global and local consistency. Instead of merely using the features of latent layers, some works [52, 54] introduced an attention mechanism, which jointly uses the existing features to estimate the missing features. To further improve the attention mechanism, Wang et al. suggested a multi-stage image contextual attention learning strategy to deal flexibly with the rich background information while avoiding its abuse. Meanwhile, several works [33, 53] adopted partial or gated convolutions to reduce the color discrepancy and blurriness, where the convolutions are masked, renormalized, and operated only on the known region. Following the recent trend of using two-stage networks, Nazeri et al. and Wu et al. respectively proposed to use an edge/LBP generator at the first stage, followed by a second image completion network to further improve the inpainting performance.
II-B Inpainting Forensics
On the other side of the coin, many inpainting forensic methods [48, 10, 2, 43, 31, 60] have been proposed to fight against the malicious usage of inpainting manipulations. One common principle of these methods is to search for similar blocks within a given image, where the blocks with high matching degrees are suspected to be forged. Specifically, Wu et al. proposed a blind detection method based on zero-connectivity features and fuzzy membership. Chang et al. designed a two-process algorithm that first finds the suspicious regions, using a similarity vector field to remove the false positives caused by uniform areas, and then applies multi-region relations to identify the forged regions among the suspicious ones. Further, Liang et al. presented an efficient forgery detection algorithm which integrates central pixel mapping, greatest zero-connectivity component labeling and fragment splicing detection. More recently, Zhu et al. built an encoder-decoder network which is supervised by a label matrix and weighted cross-entropy to capture the manipulation traces. Unfortunately, these forensic approaches can only detect exemplar-based inpainting manipulations, but not diffusion-based ones, as the latter type does not generate similar blocks in the inpainted regions. To remedy this issue, Li et al. suggested detecting diffusion-based inpainting by analyzing the local variance of the image Laplacian along the isophote direction. In addition, to detect complicated combinations of forgeries (including inpainting), Wu et al. proposed MT-Net, a more general forgery localization network, which first extracts image manipulation trace features and then identifies anomalous regions by assessing how different a local feature is from its reference features. However, for some challenging cases, e.g., when forged features dominate the image, MT-Net could fail completely.
Since DL-based inpainting methods can use learned high-level semantic information to generate more complex structures and even novel objects, they may leave completely different artifacts in the inpainted regions, causing very poor detection performance of the aforementioned forensic approaches [28, 59]. To improve the detection accuracy, Li and Huang  designed the HP-FCN, a DL-based method to locate the image regions manipulated by deep inpainting. A high-pass pre-filtering module is employed to suppress image contents and enhance the differences between the inpainted and untouched regions. Their experimental results showed that HP-FCN can effectively locate the inpainting forgeries when the training set created from the same inpainting method is available. It should be noted that the generalizability to unseen inpainting methods, though important in practice, has not been investigated in .
II-C Neural Architecture Search (NAS)
The achievements of deep neural networks in various tasks depend on their exquisite architecture design, which requires a tremendous amount of domain knowledge and is usually time-consuming. Zoph and Le introduced NAS, the idea of using recurrent neural networks to search for an appropriate network architecture with the highest validation accuracy. Along this line, many works proposed to adopt advanced techniques to aid the search process, e.g., reinforcement learning [3, 19], evolution and surrogate models. However, searching and training thousands of models is almost infeasible for a single practitioner. To significantly reduce the computational cost of the architecture search, the weight sharing mechanism and one-shot NAS were utilized. In this work, we adopt the one-shot NAS strategy as the backend search algorithm, due to its speed and flexibility.
The core idea of the one-shot NAS is to use the same weights to evaluate different sampled architectures, thereby achieving the cost reduction of an order of magnitude. Specifically, instead of training several separate models, we can train a single model (the one-shot model) containing all the potential operations. Then at the evaluation stage, some operations’ outputs are selectively zeroed out, in order to determine which operation contributes most to the prediction accuracy. It should be noted that the one-shot models are only used to rank different architectures in the search space; retraining the candidate model with the highest evaluation accuracy is still needed. For more details regarding the one-shot NAS, please refer to .
In this section, we present the details of the GIID-Net for detecting inpainting manipulations, not only DL-based but also traditional ones. The schematic diagram of the GIID-Net is shown in Fig. 2. As can be seen, the GIID-Net consists of three main blocks, namely, the enhancement block, the extraction block and the decision block. The enhancement block involves hierarchically combined input layers for enhancing the inpainting traces (see Section III-A). The following extraction block, composed of a series of cell units searched by one-shot NAS, is designed to extract high-level features that are suitable for distinguishing multiple forgeries (see Section III-B). Eventually, the decision block outputs the final detection result, with the assistance of global and local attention modules (see Section III-C).
At the training stage, we first sample a pristine 3-channel (RGB) color image $\mathbf{I}$ and a corresponding binary mask $\mathbf{M}$ (1's are assigned to the inpainted regions and 0's elsewhere). Then an input image $\mathbf{X}$ can be synthesized as
$$\mathbf{X} = \mathbf{I} \odot (1 - \mathbf{M}) + \mathcal{F}\big(\mathbf{I} \odot (1 - \mathbf{M})\big) \odot \mathbf{M},$$
where the operator $\odot$ means element-wise multiplication and the function $\mathcal{F}(\cdot)$ denotes the employed inpainting algorithm. The GIID-Net takes $\mathbf{X}$ as input, and outputs the predicted mask $\hat{\mathbf{M}}$ (1's are assigned to the predicted inpainted regions and 0's elsewhere). During this process, the pair $(\mathbf{M}, \hat{\mathbf{M}})$ is used by a fused loss function to update the parameters of the network. At the inference stage, a similar procedure can be performed to obtain the predicted inpainting mask.
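The synthesis step described above can be illustrated with a minimal sketch. It assumes a single-channel image stored as nested lists, and `mean_fill_inpaint` is a hypothetical toy stand-in for a real inpainting algorithm (it is not one of the ten methods studied here); only the compositing logic, which keeps pristine pixels outside the mask and inpainted pixels inside it, reflects the text.

```python
from typing import Callable, List

Image = List[List[float]]  # single-channel image, for brevity
Mask = List[List[int]]     # 1 = inpainted region, 0 = pristine region


def mean_fill_inpaint(image: Image, mask: Mask) -> Image:
    """Hypothetical toy inpainter: predicts the mean of the known pixels everywhere."""
    known = [
        image[r][c]
        for r in range(len(image))
        for c in range(len(image[0]))
        if mask[r][c] == 0
    ]
    fill = sum(known) / len(known) if known else 0.0
    return [[fill for _ in row] for row in image]


def synthesize(image: Image, mask: Mask, inpaint: Callable[[Image, Mask], Image]) -> Image:
    """Keep pristine pixels where mask == 0 and inpainted pixels where mask == 1."""
    filled = inpaint(image, mask)
    return [
        [
            image[r][c] * (1 - mask[r][c]) + filled[r][c] * mask[r][c]
            for c in range(len(image[0]))
        ]
        for r in range(len(image))
    ]


pristine = [[1.0, 2.0], [3.0, 10.0]]
mask = [[0, 0], [0, 1]]
forged = synthesize(pristine, mask, mean_fill_inpaint)
```

The pair (`forged`, `mask`) then plays the role of one training sample: the network sees `forged` and is supervised with `mask`.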
Here we would like to emphasize that the inpainted regions indicated by the binary mask can be of any shape and appear anywhere, which better reflects the true situation of forgery operations. The masks are selected from a dataset generated by , where some examples are given in Fig. 3. Compared with the HP-FCN  which only uses rectangular masks within a fixed range, our training preparation allows the network to learn more about the diversity of the inpainting forgery, leading to better detection accuracy.
Now we are ready to explain the aforementioned three main blocks in the GIID-Net, and also the loss function for its optimization.
III-A Enhancement Block
As RGB channels are not sufficient to tackle all the different cases of manipulations , we propose to enhance the inpainting traces through adding several pre-designed input layers. The potential input layers that can be incorporated include Steganalysis Rich Model (SRM) layer , Pre-Filtering (PF) layer , Bayar layer , convolution (i.e., Conv) layer, and combinations of them.
More specifically, the SRM layer utilizes the local noise distributions of the image to provide additional evidence. For a 3-channel input $\mathbf{X}$, the SRM layer extracts the corresponding features $\mathbf{F}_{srm}$ by using a kernel $\mathbf{K}_{srm}$, namely,
$$\mathbf{F}_{srm} = \mathbf{K}_{srm} \ast \mathbf{X},$$
where $\ast$ represents the convolutional operation.
The PF layer is designed to obtain image residuals for enhancing inpainting traces, as the inpainted regions are more distinguishable from the pristine ones in the residual domain. The effectiveness of the PF layer may be due to the fact that inpainting methods focus on producing visually realistic image contents, while ignoring the inherently imperceptible high-frequency noise in natural images. Practically, the PF layer can be initialized with first-order derivative high-pass filters $\mathbf{K}_{pf}$, which are determined by analyzing the transition probability matrices of untouched/inpainted image patches. For an input image $\mathbf{X}$, the high-passed features $\mathbf{F}_{pf}$ can be obtained by
$$\mathbf{F}_{pf} = \mathbf{K}_{pf} \ast \mathbf{X},$$
where $\ast$ denotes convolution. It should be noted that the filter kernels of the PF layer are set as learnable so that they can be fine-tuned during the learning process.
Instead of relying on pre-determined kernels, we also incorporate the Bayar layer to adaptively learn low-level prediction residual features for detecting inpainting traces. It is implemented by adding specific constraints to the standard convolutional kernels. For simplicity, we use $\mathbf{W}_b^k$ to represent the $k$th channel of the weights $\mathbf{W}_b$ in the Bayar layer, and the central values of each channel are denoted by a spatial index $(0,0)$. Then the following constraints are enforced on each channel of $\mathbf{W}_b$ before each training iteration:
$$\mathbf{W}_b^k(0,0) = -1, \qquad \sum_{(m,n) \neq (0,0)} \mathbf{W}_b^k(m,n) = 1.$$
Finally, for an input image $\mathbf{X}$, the constrained features $\mathbf{F}_{b}$ can be obtained by
$$\mathbf{F}_{b} = \mathbf{W}_b \ast \mathbf{X},$$
where $\ast$ denotes convolution.
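As an illustration of the constraint step, the sketch below projects a single square kernel onto the constraint set usually associated with Bayar-style constrained convolutions: the central weight is fixed to -1 and the remaining weights are rescaled to sum to 1. This form of the constraint is an assumption taken from the common formulation of such layers, and `constrain_bayar_kernel` is a hypothetical helper name; in training, such a projection would be applied to every channel before each iteration.

```python
from typing import List

Kernel = List[List[float]]


def constrain_bayar_kernel(kernel: Kernel) -> Kernel:
    """Project one odd-sized square kernel so that center = -1 and the rest sums to 1."""
    size = len(kernel)
    center = size // 2
    # Sum of all weights except the central one.
    off_center_sum = sum(
        kernel[r][c]
        for r in range(size)
        for c in range(size)
        if (r, c) != (center, center)
    )
    if off_center_sum == 0:
        raise ValueError("off-center weights sum to zero; cannot normalize")
    constrained = [[kernel[r][c] / off_center_sum for c in range(size)] for r in range(size)]
    constrained[center][center] = -1.0
    return constrained


raw = [[1.0, 1.0, 1.0], [1.0, 5.0, 1.0], [1.0, 1.0, 1.0]]
projected = constrain_bayar_kernel(raw)
```

Because the weights of the projected kernel sum to zero, its response on a constant image patch is zero, which is why such a layer behaves like a learnable prediction-residual (high-pass) filter.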
In order to find an appropriate combination of these layers, we have conducted intensive experiments evaluating the inpainting detection performance of various combinations. We have found that the combination of Conv + Bayar + PF gives the best detection performance. We thus use this combination as the first layer in the enhancement block. Next, we concatenate the enhanced features along the channel dimension and use two standard convolutions to initially process them (i.e., decrease the resolution and increase the number of channels) in preparation for the subsequent high-level feature extraction. The experimental justification of this combination will be provided in Section IV-D.
III-B Extraction Block
After the enhancement block, we design the extraction block to extract the high-level features for the inpainting detection. Instead of adopting the commonly-used ResNet or DnCNN as the backbone, we propose to use an adjustable cell and fine-tune it with the one-shot NAS, so as to better fit the requirements of inpainting detection. In the following, we successively introduce the cell architecture, i.e., the search space, and the search algorithm.
III-B1 Search Space
One of the core components of NAS is the design of a reasonable search space for the adjustable cell, as different architectures could lead to diversified results. To describe the search space symbolically, we adopt the notational convention that each cell is represented as a directed acyclic graph with $N$ nodes. Each node $\mathbf{x}_i$ indicates the $i$th latent feature when using $\mathbf{X}$ as the input, and each edge $(i,j)$ means a transformed operation chosen from a pre-defined operation pool
$$\mathcal{O} = \{o_1, o_2, \ldots, o_K\},$$
which includes $K$ candidate operations. For conveniently representing the selective edges in the search space, control parameters are introduced:
$$\alpha_{i,j}^{k} \in \{0, 1\},$$
where 1 means the corresponding edge is activated and 0 otherwise. Each latent feature in the graph can be calculated by using its predecessors, i.e.,
$$\mathbf{x}_j = \sum_{i<j} \sum_{k=1}^{K} \alpha_{i,j}^{k} \, o_k(\mathbf{x}_i).$$
Compared with the traditional NAS, our search space is tailored to inpainting detection and mainly differs in two aspects: 1) the selective operations in the pool are pruned, leaving only three kinds of separable convolutions and the identity transformation, which reduces the computational cost while preserving the diversity of the sampled models [51, 18]; and 2) we limit the minimum number of transformations in the cell block, i.e., some edges are manually fixed. In particular, the cell is composed of two kinds of separable convolutions, where batch normalization and ReLU activation are embedded appropriately. A skip connection is introduced from beginning to end as a shortcut. As these operations have been proven to be very effective in helping the network learn feature representations, they are explicitly added to enhance the initial performance of the cell block. As for the remaining selective edges, we package them as choice blocks. The diagram of the cell architecture is shown in Fig. 4.
III-B2 Search Algorithm
Similar to [61, 34, 18], we search the network architectures based on the one-shot NAS algorithm. Specifically, we first train a supernet containing all possible network architectures by enabling all selective edges in each choice block. Once the supernet is well-trained, we can sample a candidate architecture by letting each choice block contain only one kind of selective edge and zeroing out the remaining ones. Recall that we have 3 choice blocks with 4 operations in the operation pool, and a total of 10 cells in the network, which results in a search space of size $4^{3 \times 10} = 4^{30} \approx 1.15 \times 10^{18}$. At the evaluation time, we randomly sample 1000 candidate architectures from the search space and evaluate their performance on a validation set mixed with multiple inpainting forgeries. The best-performing one among these 1000 candidates is chosen as the final architecture. Note that the candidate architectures (i.e., the one-shot models) generally perform unsatisfactorily, as they have only been trained for a few steps; further fine-tuning is hence needed.
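The sampling-and-ranking procedure can be sketched as follows. This is a schematic under stated assumptions, not the released implementation: the operation names are hypothetical placeholders, the cells are assumed to be searched independently, and `toy_score` stands in for the validation AUC that would be obtained by evaluating a candidate with the shared supernet weights.

```python
import random
from typing import Callable, List, Tuple

# Hypothetical names for the 4 pooled operations (3 separable convolutions + identity).
OPERATIONS = ["sep_conv_a", "sep_conv_b", "sep_conv_c", "identity"]
NUM_CHOICE_BLOCKS = 3  # choice blocks per cell
NUM_CELLS = 10         # cells in the network

Architecture = List[List[str]]  # one operation per choice block, per cell


def search_space_size() -> int:
    """Number of distinct architectures if every choice block is picked independently."""
    return len(OPERATIONS) ** (NUM_CHOICE_BLOCKS * NUM_CELLS)


def sample_architecture(rng: random.Random) -> Architecture:
    """Pick exactly one operation per choice block; the others are zeroed out."""
    return [[rng.choice(OPERATIONS) for _ in range(NUM_CHOICE_BLOCKS)] for _ in range(NUM_CELLS)]


def random_search(
    evaluate: Callable[[Architecture], float], num_candidates: int = 1000, seed: int = 0
) -> Tuple[Architecture, float]:
    """Rank randomly sampled candidates and keep the best-scoring one."""
    rng = random.Random(seed)
    best_arch, best_score = None, float("-inf")
    for _ in range(num_candidates):
        arch = sample_architecture(rng)
        score = evaluate(arch)
        if score > best_score:
            best_arch, best_score = arch, score
    return best_arch, best_score


def toy_score(arch: Architecture) -> float:
    """Placeholder for validation AUC: here, simply the fraction of non-identity operations."""
    ops = [op for cell in arch for op in cell]
    return sum(op != "identity" for op in ops) / len(ops)


best_arch, best_score = random_search(toy_score)
```

Only the ranking is taken from this stage; as noted above, the selected architecture still has to be trained further before it is used for detection.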
III-C Decision Block
The role of the decision block is to transform the learned high-level features into low-level discriminative information, i.e., the inpainting detection result. At the pixel level, the detection result can be divided into two classes: the positive class (inpainted pixels) and the negative class (pristine pixels). During this process, misclassified pixels may be generated and lead to inaccurate detection, due to the ineffectiveness of convolutional neural networks in modeling long-range feature correlations. To tackle this problem, many attention modules have been proposed and used recently in the decision phase of networks [52, 54, 46]. The main idea of an attention module is simple but very effective: optimizing specific features with the assistance of other features. Along this line, we propose to integrate novel attention modules (i.e., global attention and local attention) into the decision block to better generate the detection result. The global attention aims to reduce the number of misclassified pixels through a technique that is very effective in classification tasks: minimizing the intra-class variance. Motivated by the observation that surrounding pixels are generally of the same class as the center pixel, we use local attention to improve the consistency of features within a specific region, so as to generate a more accurate detection result.
III-C1 Global Attention
The global attention is motivated by an essential principle for improving classification performance: reducing intra-class distances. Practically, we re-generate each feature from its several most similar features, so as to reduce the differences within the same class. Let $\mathbf{F}$ be the feature map of the latent layer in the decision block when using $\mathbf{X}$ as the input. We extract all patches from $\mathbf{F}$ and group them into a set $\mathcal{P}$. For each patch $\mathbf{p}_i \in \mathcal{P}$, its intra-cosine similarities within $\mathcal{P}$ can be computed as
$$s_{i,j} = \left\langle \frac{\mathbf{p}_i}{\|\mathbf{p}_i\|}, \frac{\mathbf{p}_j}{\|\mathbf{p}_j\|} \right\rangle, \quad \mathbf{p}_j \in \mathcal{P}. \tag{11}$$
Upon computing all $s_{i,j}$'s, we can set a similarity threshold $\tau$ to select the top-$k$ most similar patches for $\mathbf{p}_i$ from $\mathcal{P}$. Let $\mathcal{T}_i$ record all the indexes of these top-$k$ most similar patches. Then we have
$$\mathcal{T}_i = \{\, j \mid s_{i,j} \geq \tau \,\}.$$
In practice, the process of similarity search can be conducted via a modified convolutional layer to reduce the computational burden caused by loop operations, as explained in [52, 46]. We then propose to update each $\mathbf{p}_i$ via the average of its corresponding top-$k$ most similar patches:
$$\mathbf{p}_i \leftarrow \frac{1}{|\mathcal{T}_i|} \sum_{j \in \mathcal{T}_i} \mathbf{p}_j.$$
Therefore, the updated $\mathbf{F}$ will increase the intra-class similarity along with the training process, benefiting the ultimate inpainting detection task. A toy example is given in Fig. 5, where the feature map is a simple matrix (two colors represent two different classes) and the parameter $k$ is fixed.
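The update rule can be sketched with explicit loops, which is convenient for checking the logic even though, as noted above, a practical implementation would use a convolution-based similarity search. The sketch assumes each patch is flattened into a plain vector and that the similar set is selected by thresholding the cosine similarity; the example vectors and the threshold value are illustrative assumptions only.

```python
import math
from typing import List

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two flattened patches (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def global_attention(patches: List[Vector], threshold: float) -> List[Vector]:
    """Replace every patch by the average of the patches whose similarity reaches the threshold."""
    updated = []
    for patch in patches:
        similar = [other for other in patches if cosine_similarity(patch, other) >= threshold]
        if not similar:  # e.g., an all-zero patch: keep it unchanged
            similar = [patch]
        updated.append([sum(values) / len(similar) for values in zip(*similar)])
    return updated


# Two loose clusters, standing in for "inpainted" and "pristine" features.
patches = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
updated = global_attention(patches, threshold=0.9)
```

After the update, the two patches of each cluster coincide, which is the intended reduction of intra-class differences; features from the other cluster are left out of the average because their similarity falls below the threshold.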
However, there are two potential problems when applying this global attention mechanism: 1) how to make sure that the top-$k$ most similar patches belong to the same class; and 2) how to set an appropriate value for the parameter $k$ (or equivalently the threshold $\tau$), since the proportion of the inpainted and untouched regions could vary. To answer these two questions, we take a data-driven approach by analyzing the statistics of 1000 training images. In Fig. 6 (a), we plot the relationship between the employed threshold $\tau$ and the probability that the corresponding selected top-$k$ patches belong to the same class. As can be seen, with increasing $\tau$, the probability of belonging to the same class tends to increase as well. Meanwhile, in Fig. 6 (b), we also show how the value of $k$ varies with respect to the threshold $\tau$. It can be observed that when $\tau$ is relatively large, increasing $\tau$ leads to a decrease of $k$. In other words, when $\tau$ is very large, the number of selected similar patches would be very small. Therefore, it is crucial to set the threshold $\tau$ appropriately by balancing two factors: 1) the probability of belonging to the same class should be high enough, and 2) the number of selected patches should be sufficiently large as well. According to the above experiments, we empirically set $\tau$ to the value at which the probability of belonging to the same class is around 0.9, with $k$ taking the corresponding value in Fig. 6 (b).
III-C2 Local Attention
Inspired by the observation that adjacent pixels (features) are often highly correlated, we now propose a local attention module to better maintain the local consistency. Similar to the process of the global attention, we update each feature with its surrounding features in a weighted manner. To reflect the local correlation, the surrounding features in a small local window are exploited. Specifically, we define a weight matrix $\mathbf{W}$ of size $n \times n$, where $n$ is the side length of the local window, and convolve it with the patch $\mathbf{p}_i$ to obtain the updated feature $\mathbf{p}'_i$. Namely,
$$\mathbf{p}'_i = \mathbf{W} \ast \mathbf{p}_i.$$
To determine an appropriate weight matrix $\mathbf{W}$ for the generalizable inpainting detection, we again adopt a data-driven approach by exploiting the 1000 training images used in the global attention. We calculate the average similarity matrix $\mathbf{S}$ of size $n \times n$ over these 1000 images, and have
$$\mathbf{S} = \big[\, \bar{s}_{u,v} \,\big]_{u,v=1}^{n},$$
where each element $\bar{s}_{u,v}$ is computed according to (11). Upon having the similarity matrix $\mathbf{S}$, the weight matrix $\mathbf{W}$ can be naturally determined by transforming $\mathbf{S}$ via a softmax activation:
$$\mathbf{W} = \mathrm{softmax}(\mathbf{S}).$$
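A small sketch of how a similarity matrix can be turned into local attention weights is given below. The 3 × 3 window, the similarity values, and the choice of applying the softmax jointly over all entries of the matrix are illustrative assumptions, not values reported in this work.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax_2d(matrix: Matrix) -> Matrix:
    """Softmax taken jointly over all entries, so the resulting weights sum to 1."""
    peak = max(value for row in matrix for value in row)  # subtract max for numerical stability
    exps = [[math.exp(value - peak) for value in row] for row in matrix]
    total = sum(sum(row) for row in exps)
    return [[value / total for value in row] for row in exps]


def apply_local_attention(window: Matrix, weights: Matrix) -> float:
    """Weighted sum of the features in one local window (one step of the convolution)."""
    return sum(w * f for w_row, f_row in zip(weights, window) for w, f in zip(w_row, f_row))


# Hypothetical average similarities between the center and its neighbours.
similarity = [
    [0.8, 0.9, 0.8],
    [0.9, 1.0, 0.9],
    [0.8, 0.9, 0.8],
]
weights = softmax_2d(similarity)

flat_window = [[2.0, 2.0, 2.0], [2.0, 2.0, 2.0], [2.0, 2.0, 2.0]]
smoothed = apply_local_attention(flat_window, weights)
```

Since the weights are positive and sum to one, a locally constant feature map is left unchanged, while an isolated outlier is pulled toward its neighbours, with the center contributing the largest single weight.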
III-D Fused Loss Function
We use the binary cross-entropy (BCE) loss to supervise the training of GIID-Net, as its objective is to detect the inpainted/pristine regions, which is essentially a binary classification task. More specifically, for a pair of ground-truth and predicted inpainting masks $(\mathbf{M}, \hat{\mathbf{M}})$, the BCE loss can be defined as
$$\mathcal{L}_{bce}(\mathbf{M}, \hat{\mathbf{M}}) = -\frac{1}{HW} \sum_{i=1}^{HW} \Big[ M_i \log \hat{M}_i + (1 - M_i) \log (1 - \hat{M}_i) \Big], \tag{17}$$
where $M_i$ (similarly for $\hat{M}_i$) denotes the $i$th element of $\mathbf{M}$ with a resolution $H \times W$.
However, in most of the inpainting-based forgeries, the inpainted regions are relatively smaller than the pristine ones, resulting in a class imbalance problem under the above loss function. Such imbalance would lead to a serious problem: the trained model tends to classify the samples as pristine. To address this issue, we propose to incorporate the focal loss into the BCE loss, forming a fused loss function. The idea of the focal loss is to add a modulating term to the standard cross-entropy loss, so as to focus learning on hard examples and down-weight the numerous easy negatives. Typically, an $\alpha$-balanced variant of the focal loss can be defined as:
$$\mathcal{L}_{focal}(\mathbf{M}, \hat{\mathbf{M}}) = -\frac{1}{HW} \sum_{i=1}^{HW} \Big[ \alpha M_i (1 - \hat{M}_i)^{\gamma} \log \hat{M}_i + (1 - \alpha)(1 - M_i) \hat{M}_i^{\gamma} \log (1 - \hat{M}_i) \Big],$$
where $\alpha$ and $\gamma$ are predefined parameters, set as 0.75 and 2, respectively.
Thus, the fused loss function can be written as
$$\mathcal{L}_{fused}(\mathbf{M}, \hat{\mathbf{M}}) = \mathcal{L}_{bce}(\mathbf{M}, \hat{\mathbf{M}}) + \mathcal{L}_{focal}(\mathbf{M}, \hat{\mathbf{M}}).$$
Instead of directly using the above fused loss function as the objective for optimizing GIID-Net, we propose to apply a median filter to $\hat{\mathbf{M}}$ before calculating the BCE loss (17) and the focal loss. The median filter is a non-linear statistical filter, often used to remove impulse noise. The intuition behind applying the median filter to $\hat{\mathbf{M}}$ is that, in reality, a forger hardly ever tampers with only one or two isolated pixels; namely, the inpainted area is usually contiguous. Hence, median filtering is a natural choice for "denoising" the isolated regions, which could boost the inpainting detection performance. Finally, the total loss function of GIID-Net can be expressed as:
$$\mathcal{L}(\mathbf{M}, \hat{\mathbf{M}}) = \mathcal{L}_{fused}\big(\mathbf{M}, \mathcal{K}_{mf}(\hat{\mathbf{M}})\big),$$
where $\mathcal{K}_{mf}(\cdot)$ denotes a standard median filter.
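To make the loss pipeline concrete, the following sketch computes a BCE term and an α-balanced focal term on a median-filtered prediction. Several details are assumptions made for illustration: the two terms are added without weights, both are averaged over pixels, the median filter uses a 3 × 3 window with edge replication, and masks are plain nested lists rather than tensors.

```python
import math
from statistics import median

EPS = 1e-7  # clamp probabilities away from 0 and 1 so that log() stays finite


def _pixels(target, pred):
    """Yield (label, clamped probability) for every pixel of two 2-D lists."""
    for t_row, p_row in zip(target, pred):
        for t, p in zip(t_row, p_row):
            yield t, min(max(p, EPS), 1.0 - EPS)


def bce_loss(target, pred):
    """Pixel-averaged binary cross-entropy."""
    terms = [-(t * math.log(p) + (1 - t) * math.log(1 - p)) for t, p in _pixels(target, pred)]
    return sum(terms) / len(terms)


def focal_loss(target, pred, alpha=0.75, gamma=2.0):
    """Pixel-averaged alpha-balanced focal loss: easy pixels are down-weighted."""
    terms = [
        -(alpha * t * (1 - p) ** gamma * math.log(p)
          + (1 - alpha) * (1 - t) * p ** gamma * math.log(1 - p))
        for t, p in _pixels(target, pred)
    ]
    return sum(terms) / len(terms)


def median_filter_3x3(mask):
    """3x3 median filter with edge replication (window size is an assumption)."""
    rows, cols = len(mask), len(mask[0])

    def clamp(index, size):
        return min(max(index, 0), size - 1)

    return [
        [
            median(
                mask[clamp(r + dr, rows)][clamp(c + dc, cols)]
                for dr in (-1, 0, 1)
                for dc in (-1, 0, 1)
            )
            for c in range(cols)
        ]
        for r in range(rows)
    ]


def total_loss(target, pred):
    """Fused loss evaluated on the median-filtered prediction."""
    smoothed = median_filter_3x3(pred)
    return bce_loss(target, smoothed) + focal_loss(target, smoothed)


# A pristine image whose prediction contains one isolated false alarm.
target = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
pred = [[0.0, 0.0, 0.0], [0.0, 0.9, 0.0], [0.0, 0.0, 0.0]]
filtered = median_filter_3x3(pred)
```

In this toy case the isolated response is removed by the filter, so the loss computed after filtering is much smaller than the loss on the raw prediction, which matches the "denoising" intuition given above.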
IV Training Data Selection and Experimental Results
The proposed GIID-Net is implemented using the PyTorch framework. The training is performed on a desktop equipped with an Intel(R) Xeon(R) Gold 6130 CPU and three GTX 2080 GPUs. Adam with default parameters is adopted as the optimizer. We set the batch size to 24 with 2000 batches per epoch, and use the pixel-level Area Under the receiver operating characteristic Curve (AUC) as the evaluation criterion (higher is better). We train the network in an end-to-end manner with an initial learning rate of 1e-4. The learning rate is halved whenever the loss fails to decrease for 10 epochs, until convergence. All the images used in the training phase are cropped to a fixed size, while there is no size limit for the inference phase. To embrace the concept of reproducible research, the code of our paper is available at https://github.com/HighwayWu/InpaintingForensics.
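The learning-rate rule can be sketched in a framework-independent way as follows. The class name `HalveOnPlateau` is hypothetical, and the exact counting of "epochs without decrease" is an assumption; in PyTorch, similar behaviour is commonly obtained with `torch.optim.lr_scheduler.ReduceLROnPlateau`, although the text does not state which scheduler was actually used.

```python
class HalveOnPlateau:
    """Halve the learning rate after `patience` consecutive epochs without a lower loss."""

    def __init__(self, initial_lr: float = 1e-4, patience: int = 10) -> None:
        self.lr = initial_lr
        self.patience = patience
        self.best_loss = float("inf")
        self.stale_epochs = 0

    def step(self, epoch_loss: float) -> float:
        """Record one epoch's loss and return the learning rate for the next epoch."""
        if epoch_loss < self.best_loss:
            self.best_loss = epoch_loss
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
            if self.stale_epochs >= self.patience:
                self.lr *= 0.5
                self.stale_epochs = 0
        return self.lr


scheduler = HalveOnPlateau()
lr_after_improvement = scheduler.step(1.0)                 # first epoch sets the baseline
lr_history = [scheduler.step(1.0) for _ in range(10)]      # ten epochs with no decrease
```

Here the rate stays at 1e-4 for the first nine stagnant epochs and drops to 5e-5 on the tenth.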
IV-A Training Data Selection and Generalizability Evaluation
The training data selection is crucial to the success of the GIID-Net, especially for the generalizability to unseen inpainting approaches. For the generation of training data, the Places (JPEG lossy compression) and Dresden (NEF lossless compression) datasets are used as base images, and the masks are randomly sampled from the aforementioned mask dataset. The training dataset contains a total of 48K images (around 3 gigabytes), half of which are randomly sampled from Places and the remaining half are randomly selected from Dresden. It should be noted that we keep the inpainting method unchanged when generating the training dataset, and regenerate the entire training dataset if the inpainting method is changed. In other words, we only use the training dataset generated by one inpainting method at a time in the actual training process. As for testing images, we further introduce additional datasets, CelebA and ImageNet, to increase the data diversity. Besides, we randomly generate a series of basic shapes, e.g., rectangles, circles, ellipses and polylines, as additional test masks, which can be located at any position. These additional masks occupy approximately the same proportion as the masks taken from that mask dataset. Regarding the inpainting methods, we here consider a total of 10 representative ones, among which 6 are DL-based ones proposed in recent years, namely, GC, CA, SH, EC, LB and RN. The remaining 4 methods are traditional (non-DL-based), which include TE, NS, PM and SG. A brief introduction of these inpainting methods has already been presented in Section II-A. Specifically, we build up a test dataset of 10K pairs of inpainted images and the corresponding ground-truth masks, where a variety of image categories and mask shapes are incorporated. In addition, each of the 10 aforementioned inpainting methods contributes 1K inpainted images. The whole test dataset is downloadable from https://github.com/HighwayWu/InpaintingForensics, serving as a useful resource for our research community in fighting against inpainting-based forgeries. Here we emphasize that the training dataset and the test dataset have no overlap.
In Fig. 7, we report the inpainting detection performance of our proposed GIID-Net, where the 10K-sized test dataset is used. Here, each row shows the test results (pixel-level AUC, in percent) of the GIID-Net trained on a specific inpainting method. For instance, the first row represents the different test datasets (1K images each) evaluated by the network trained on GC. The diagonal AUC values represent the detection results when the inpainting methods at the training and testing stages are the same, i.e., the scenario where the utilized inpainting method is known. In such a scenario, GIID-Net achieves very desirable AUC performance (above 95%) for all cases. Meanwhile, the off-diagonal elements in Fig. 7 demonstrate the generalizability of GIID-Net trained using different training data. It can be observed that the generalizability of GIID-Net is vastly different when different training data are utilized. The best generalizability is achieved when GC is adopted at the training phase, with an average AUC of 98.23%. In fact, the networks trained on DL-based inpainting methods usually have more favorable generalizability than the ones trained on classic inpainting methods. These results indicate that the DL-based and traditional inpainting algorithms leave somewhat common detectable traces that can be distinguished from untouched images. As the GIID-Net trained with GC achieves the best generalizability, the training data generated with GC will be adopted in the following evaluations.
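For reference, a pixel-level AUC can be computed from a ground-truth mask and a predicted score map with the rank-based (Mann-Whitney) formulation sketched below. This is a generic illustration of the metric, not the evaluation script used for the reported numbers, and the small masks are made-up examples.

```python
from typing import List


def pixel_auc(target_mask: List[List[int]], score_map: List[List[float]]) -> float:
    """Rank-based AUC over all pixels; tied scores receive their average rank."""
    labels = [label for row in target_mask for label in row]
    scores = [score for row in score_map for score in row]
    num_pos = sum(labels)
    num_neg = len(labels) - num_pos
    if num_pos == 0 or num_neg == 0:
        raise ValueError("AUC is undefined when only one class is present")

    ranked = sorted(zip(scores, labels))
    positive_rank_sum = 0.0
    start = 0
    while start < len(ranked):
        end = start
        while end < len(ranked) and ranked[end][0] == ranked[start][0]:
            end += 1
        # Ranks start+1 .. end share the same score; use their average rank.
        average_rank = (start + 1 + end) / 2.0
        positive_rank_sum += average_rank * sum(label for _, label in ranked[start:end])
        start = end

    return (positive_rank_sum - num_pos * (num_pos + 1) / 2.0) / (num_pos * num_neg)


truth = [[0, 0], [1, 1]]
scores = [[0.1, 0.4], [0.35, 0.8]]
auc = pixel_auc(truth, scores)
```

An AUC of 0.5 corresponds to random guessing and 1.0 to a perfect ranking of inpainted pixels above pristine ones, which is the scale on which the percentages above should be read.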
[Table residue. Column groups: DL-based Inpainting (GC, CA, SH, EC, LB, RN), Traditional Inpainting (TE, NS, PM, SG), Mean. Row labels: Conv + Bayar; Conv + Bayar + PF; Conv + Bayar + PF + SRM; Extraction Block w/o NAS; Decision Block w/o Att.; Global & Local Att.]
IV-B Quantitative Comparisons
For comparison, we adopt three state-of-the-art inpainting forensic approaches: LDI, MT-Net and HP-FCN. LDI is a traditional forensic approach that designs discriminative features for identifying the inpainted regions and uses post-processing to refine the detection results. MT-Net uses the powerful learning ability of neural networks to classify anomalous features of an input image, and attains good generalizability to various conventional manipulation types, including inpainting operations. HP-FCN is a high-pass fully convolutional network for locating the forged regions generated by deep inpainting. For fairness, we compare our proposed GIID-Net not only with the pre-trained models officially released by the competitors, but also with the models retrained on our training dataset. The quantitative comparisons in terms of pixel-level AUC are presented in Table I.
As can be observed, LDI detects the traditional inpainting methods somewhat better than the DL-based ones, but its average AUC is only 50.20%, which is close to random guessing. This is probably because manually designed features are not reliable, especially for unseen inpainting approaches. In contrast, the learning-based detection methods (MT-Net, HP-FCN, and GIID-Net) achieve much better AUC performance. More specifically, the original MT-Net obtains a mean AUC of 90.59%, meaning that the pre-trained MT-Net can already detect the inpainted regions created by various inpainting algorithms with reasonable accuracy. Surprisingly, the retrained MT-Net performs worse (81.06%) than the pre-trained model. This may be because the network architecture of MT-Net is specially designed for its original training set. Furthermore, the pre-trained HP-FCN performs unsatisfactorily (53.12%), mainly because this model is overfitted to a specific inpainting method and a fixed inpainting mask. The retrained HP-FCN, however, performs much better, with a mean AUC of 93.64%. Thanks to the NAS algorithm adopted for the architecture search and the global/local attention mechanisms, our proposed GIID-Net achieves very accurate and consistent inpainting detection, with a mean AUC of 98.23%.
IV-C Qualitative Comparisons
In addition to the quantitative comparisons, we also compare different models qualitatively, as shown in Fig. 8, which gives several representative examples of using inpainting as a powerful tool to remove objects or even change the semantic meaning of an image. Due to space limits, only the best-performing version of each competing model is shown (i.e., the pre-trained MT-Net and the retrained HP-FCN). LDI performs relatively well only in detecting the NS-based inpainting manipulations (the third-to-last row), and its detection performance degrades severely for the other deep and traditional inpainting algorithms. The pre-trained MT-Net locates the forged regions well on some test sets, but cannot achieve consistent performance across all of them (see the second, third and fourth rows). The retrained HP-FCN generally produces fairly good detection results, although inaccurate, broken or blurred results can still be observed (the fourth, seventh and tenth rows). Compared with these models, our proposed model learns more reasonable high-level semantics and generates more precise predicted masks, primarily thanks to the carefully designed architecture and the attention modules.
IV-D Ablation Studies
We now conduct the ablation studies of our proposed GIID-Net by analyzing how each component (e.g., Bayar/PF layers, the NAS and the attention mechanisms) in the blocks contributes to the final inpainting detection results. To this end, we first prohibit the use of additional components in each block, and then evaluate the performance of different retrained models with appropriate settings. The obtained results are shown in Table II.
For the enhancement block, the pre-designed input layers (e.g., the SRM, PF and Bayar layers) achieve better performance than the traditional convolution (Conv). This is mainly because these input layers can enhance the inpainting traces, providing additional evidence for the subsequent detection. Among these input layers, the SRM layer gives the worst performance, possibly because it uses fixed weights and cannot adapt well to generalizable inpainting traces. In addition, the combination Conv + Bayar + PF offers the best detection performance, which is why it is adopted as the input layer of GIID-Net.
To evaluate the performance without NAS, we initialize the extraction block with a random architecture sampled from the search space (see (9)). The network designed with the NAS strategy performs much better (over 4% AUC gain) than the one without NAS. This implies that NAS can design better network structures, which improves the capability of generalizable inpainting detection.
For further improving the performance, we embed the pre-designed attention mechanisms in the decision block. From Table II, we can also observe that both global and local attention modules indeed can bring positive improvements. This is because attention mechanisms can further optimize high-level features, boosting the eventual inpainting detection performance.
IV-E Robustness Evaluations
We also evaluate the robustness of GIID-Net in detecting inpainting manipulations. This is critical in real-world detection scenarios, because many post-processing operations, such as noise addition, blurring, and/or compression, could be applied to hide the inpainting traces. To this end, we apply these post-processing operations with different types and magnitudes to the test sets and report the statistical detection results (pixel-level AUC) in Fig. 9. It is observed that the overall performance is good when the perturbation intensity is relatively low, e.g., the performance is reduced by around 10% when blurring is applied. As the perturbation intensity increases, however, the performance drops almost linearly. This phenomenon agrees with the observations in [49, 41, 45]. The robustness evaluation results indicate that GIID-Net exhibits desirable robustness against perturbations of small or medium magnitude. Of course, when the perturbation intensity becomes even larger, the inpainting evidence will be destroyed, causing severe detection errors. At the same time, such strong perturbations also severely degrade the images, which defeats the purpose of performing inpainting.
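To make the robustness protocol concrete, the sketch below applies two of the perturbation types mentioned above, additive Gaussian noise and blurring, at a chosen magnitude. It is an illustration under our own assumptions: the function names, the use of a box (mean) filter as the blur, and the parameters `sigma` and `k` are hypothetical and are not the settings behind Fig. 9. JPEG compression is omitted because it requires an image codec.

```python
import numpy as np

def add_gaussian_noise(image, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation `sigma` to a uint8 image."""
    noisy = image.astype(np.float64) + rng.normal(0.0, sigma, size=image.shape)
    return np.clip(np.rint(noisy), 0, 255).astype(np.uint8)

def box_blur(image, k):
    """Blur a uint8 image (H x W or H x W x C) with a k x k mean filter, k odd.

    Borders are handled by replicating edge pixels.
    """
    if k < 1 or k % 2 == 0:
        raise ValueError("k must be a positive odd integer")
    pad = k // 2
    img = image.astype(np.float64)
    pad_width = [(pad, pad), (pad, pad)] + [(0, 0)] * (img.ndim - 2)
    padded = np.pad(img, pad_width, mode="edge")
    h, w = img.shape[:2]
    out = np.zeros_like(img)
    # Sum the k*k shifted copies of the image, then divide to get the mean.
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + h, dx:dx + w]
    return np.rint(out / (k * k)).astype(np.uint8)
```

A robustness curve would then be obtained by sweeping the magnitude (e.g., increasing `sigma` or `k`), running the detector on each perturbed test set, and recording the pixel-level AUC at every step.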
IV-F Challenging Cases
Before ending this section, we further evaluate the performance of our proposed GIID-Net and the competing schemes in several challenging cases. One particular challenge arises when multiple regions in a single image are manipulated differently, e.g., by different inpainting algorithms. As indicated in , MT-Net would fail in such cases. To illustrate this, we first inpaint an original image with a mask via RN, and call the result “Forgery 1”. We then perform inpainting again on top of “Forgery 1” with another mask via EC, generating “Forgery 2”. The two-round inpainting process is shown in Fig. 10 (a).
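The two-round procedure can be sketched as follows. This is only an illustration: `mean_fill` is a hypothetical stand-in for a real inpainting method (RN and EC play that role in the experiment), and treating the union of the two masks as the ground truth for “Forgery 2” is our assumption rather than something stated in the text.

```python
import numpy as np

def mean_fill(image, mask):
    """Stand-in 'inpainter': fill masked pixels with the mean of the unmasked ones."""
    out = image.astype(np.float64).copy()
    keep = mask == 0
    out[~keep] = out[keep].mean(axis=0)
    return out

def two_round_forgery(image, mask1, mask2, inpaint1, inpaint2):
    """Apply two inpainting rounds with different methods and masks.

    Returns (forgery1, forgery2, gt2), where gt2 is the union of both masks
    (assumed ground truth for the second forgery).
    """
    forgery1 = inpaint1(image, mask1)
    forgery2 = inpaint2(forgery1, mask2)
    gt2 = np.maximum(mask1, mask2)
    return forgery1, forgery2, gt2
```

In this setup, a detector is run once on `forgery1` (ground truth `mask1`) and once on `forgery2` (ground truth `gt2`), which matches the two inputs examined in Fig. 10 (b)-(c).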
We now examine the inpainting detection performance of GIID-Net and the competitors (MT-Net, LDI and HP-FCN) by using “Forgery 1” and “Forgery 2” as inputs, respectively. The detection results of these methods are shown in Fig. 10 (b)-(c). As can be observed, LDI fails completely on both “Forgery 1” and “Forgery 2”. MT-Net can only detect one of the inpainted regions with some accuracy, while missing the other one. This phenomenon is consistent with the observation in . One possibility is that the addition of the second type of inpainting changes the distribution of anomalous features, thereby affecting the discriminative capability of MT-Net. HP-FCN can detect both rounds of inpainting manipulations, but with severe detection errors. In contrast, our proposed GIID-Net gives much more accurate detection results in both the single-inpainting and the multiple-inpainting cases. We have also tested other examples with different inpainting methods and more original images; similar conclusions can be drawn.
Another challenging case is that the whole image is completely regenerated by a DL-based network, i.e., the whole image is inpainted. For instance, a recent work showed that an image can be reconstructed from Scale Invariant Feature Transform (SIFT) descriptors. Since the synthesized image is generated entirely by DL-based networks, it can also be regarded as a “global inpainting”, i.e., the ground truth should be fully positive (white). We show the inpainting detection results of different methods in Fig. 11. It can be noticed that MT-Net can hardly locate the inpainted regions, mainly because its working principle relies on finding the anomalous features relative to the dominating ones. Such relative dominance does not hold when the synthesized image is completely composed of “anomalous” features. This limitation does not exist in our model, and GIID-Net gives an almost perfect detection result even in this challenging case, as shown in Fig. 11 (f). The results of LDI and HP-FCN are also presented for comparison.
In this paper, we propose the GIID-Net, a novel DL-based forensic model for the detection of various image inpainting manipulations. The proposed model is designed with the assistance of the NAS algorithm and the embedded attention modules to optimize the latent high-level features. Experimental results are provided to not only demonstrate the superiority of our model against state-of-the-art competitors, but also verify that common artifacts are shared across diverse DL-based and traditional inpainting methods, allowing the forensic approaches to generalize well from one method to unseen ones without extensive retraining.
- (2017) Localization of JPEG double compression through multi-domain convolutional neural networks. In Proc. IEEE Conf. Comput. Vis. Pattern Recogn. Workshop.
- (2013) A jump patch-block match algorithm for multiple forgery detection. In Proc. Int. Multi-Conf. Autom., Comput., Commun., Control Compress. Sens., pp. 723–728.
- (2017) Designing neural network architectures using reinforcement learning. In Proc. Int. Conf. Learn. Representations.
- Filling-in by joint interpolation of vector fields and gray levels. IEEE Trans. Image Process. 10 (8), pp. 1200–1211.
- (2018) Constrained convolutional neural networks: a new approach towards general purpose image manipulation detection. IEEE Trans. Inf. Forensics and Security 13 (11), pp. 2691–2706.
- (2018) Understanding and simplifying one-shot architecture search. In Proc. Int. Conf. Mach. Learn., pp. 550–559.
- (2001) Navier-Stokes, fluid dynamics, and image and video inpainting. In Proc. IEEE Conf. Comput. Vis. Pattern Recogn., pp. 355–362.
- (2000) Image inpainting. In Proc. Conf. Comput. Graph. Inter. Tech., pp. 417–424.
- (2003) Simultaneous structure and texture image inpainting. IEEE Trans. Image Process. 12 (8), pp. 882–889.
- (2013) A forgery detection algorithm for exemplar-based inpainting images using multi-region relation. Image Vis. Comput. 31 (1), pp. 57–71.
- (2004) Region filling and object removal by exemplar-based image inpainting. IEEE Trans. Image Process. 13 (9), pp. 1200–1212.
- (2009) ImageNet: a large-scale hierarchical image database. In Proc. IEEE Conf. Comput. Vis. Pattern Recogn., pp. 248–255.
- Image inpainting using nonlocal texture matching and nonlinear filtering. IEEE Trans. Image Process. 28 (4), pp. 1705–1719.
- (2001) Image quilting for texture synthesis and transfer. In Proc. Conf. Comput. Graph. Inter. Tech., pp. 341–346.
- (2015) Image splicing detection with local illumination estimation. In Proc. IEEE Int. Conf. Image Proc., pp. 2940–2944.
- (2012) Rich models for steganalysis of digital images. IEEE Trans. Inf. Forensics and Security 7 (3), pp. 868–882.
- (2010) The Dresden image database for benchmarking digital image forensics. J. of Digit. Forensic Pract. 3 (2-4), pp. 150–159.
- (2020) When NAS meets robustness: in search of robust architectures against adversarial attacks. In Proc. IEEE Conf. Comput. Vis. Pattern Recogn., pp. 631–640.
- (2019) IRLAS: inverse reinforcement learning for architecture search. In Proc. IEEE Conf. Comput. Vis. Pattern Recogn., pp. 9021–9029.
- (2016) Deep residual learning for image recognition. In Proc. IEEE Conf. Comput. Vis. Pattern Recogn., pp. 770–778.
- (2014) High-quality real-time video inpainting with PixMix. IEEE Trans. Vis. Comput. Graph. 20 (6), pp. 866–879.
- (2014) Image completion using planar structure guidance. ACM Trans. Graph. 33 (4), pp. 1–10.
- (2018) Fighting fake news: image splice detection via learned self-consistency. In Proc. Eur. Conf. Comput. Vis., pp. 101–117.
- (2017) Globally and locally consistent image completion. ACM Trans. Graph. 36 (4), pp. 1–14.
- (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proc. Int. Conf. Mach. Learn., pp. 448–456.
- (2017) Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196.
- (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- (2019) Localization of deep inpainting using high-pass fully convolutional network. In Proc. IEEE Int. Conf. Comput. Vis., pp. 8301–8310.
- (2017) Localization of diffusion-based inpainting in digital images. IEEE Trans. Inf. Forensics and Security 12 (12), pp. 3050–3064.
- (2019) Fast and effective image copy-move forgery detection via hierarchical feature point matching. IEEE Trans. Inf. Forensics and Security 14 (5), pp. 1307–1322.
- (2015) An efficient forgery detection algorithm for object removal by exemplar-based image inpainting. J. Vis. Commun. Image R. 30, pp. 75–85.
- (2017) Focal loss for dense object detection. In Proc. IEEE Int. Conf. Comput. Vis., pp. 2980–2988.
- (2018) Image inpainting for irregular holes using partial convolutions. In Proc. Eur. Conf. Comput. Vis., pp. 85–100.
- (2019) DARTS: differentiable architecture search. In Proc. Int. Conf. Learn. Representations.
- (2018) Structure-guided image inpainting using homography transformation. IEEE Trans. Image Process. 20 (12), pp. 3252–3265.
- (2014) Exposing region splicing forgeries with blind local noise estimation. Int. J. Comput. Vis. 110 (2), pp. 202–221.
- (2010) Rectified linear units improve restricted Boltzmann machines. In Proc. Int. Conf. Mach. Learn., pp. 807–814.
- (2019) EdgeConnect: generative image inpainting with adversarial edge learning. In Proc. IEEE Int. Conf. Comput. Vis. Workshop.
- (2016) Context encoders: feature learning by inpainting. In Proc. IEEE Conf. Comput. Vis. Pattern Recogn., pp. 2536–2544.
- (2019) Regularized evolution for image classifier architecture search. In Proc. AAAI Conf. Arti. Intell., pp. 4780–4789.
- (2019) FaceForensics++: learning to detect manipulated facial images. In Proc. IEEE Int. Conf. Comput. Vis., pp. 1–11.
- (2004) An image inpainting technique based on the fast marching method. J. of Graph. Tools 9 (1), pp. 23–34.
- (2014) Blind inpainting forgery detection. In Proc. IEEE Global Conf. Signal Inf. Process, pp. 1019–1023.
- (2019) MUSICAL: multi-scale image contextual attention learning for inpainting. In Proc. Int. Jt. Conf. AI.
- (2020) CNN-generated images are surprisingly easy to spot… for now. In Proc. IEEE Conf. Comput. Vis. Pattern Recogn., pp. 8695–8704.
- (2020) Deep generative model for image inpainting with local binary pattern learning and spatial attention. arXiv preprint arXiv:2009.01031.
- (2020) Privacy leakage of SIFT features via deep generative model based image reconstruction. arXiv preprint arXiv:2009.01030.
- (2008) Detection of digital doctoring in exemplar-based inpainted images. In Proc. Int. Conf. Mach. Learn. Cybernetics, Vol. 3, pp. 1222–1226.
- (2019) ManTra-Net: manipulation tracing network for detection and localization of image forgeries with anomalous features. In Proc. IEEE Conf. Comput. Vis. Pattern Recogn., pp. 9543–9552.
- (2018) Progressive neural architecture search. In Proc. Eur. Conf. Comput. Vis., pp. 19–35.
- (2019) Exploring randomly wired neural networks for image recognition. In Proc. IEEE Int. Conf. Comput. Vis., pp. 1284–1293.
- Shift-Net: image inpainting via deep feature rearrangement. In Proc. Eur. Conf. Comput. Vis., pp. 1–17.
- (2019) Free-form image inpainting with gated convolution. In Proc. IEEE Int. Conf. Comput. Vis., pp. 4471–4480.
- (2018) Generative image inpainting with contextual attention. In Proc. IEEE Conf. Comput. Vis. Pattern Recogn., pp. 5505–5514.
- (2020) Region normalization for image inpainting. In Proc. AAAI Conf. Arti. Intell., pp. 12733–12740.
- (2017) Beyond a Gaussian denoiser: residual learning of deep CNN for image denoising. IEEE Trans. Image Process. 26 (7), pp. 3142–3155.
- Places: a 10 million image database for scene recognition. IEEE Trans. Pattern Anal. Mach. Intell. 40 (6), pp. 1452–1464.
- (2018) Learning rich features for image manipulation detection. In Proc. IEEE Conf. Comput. Vis. Pattern Recogn., pp. 1907–1915.
- (2019) Forensic detection based on color label and oriented texture feature. In Int. Conf. Brain Inspired Cog. Syst., pp. 383–395.
- (2018) A deep learning approach to patch-based image inpainting forensics. Signal Proc. Image Commun. 67, pp. 90–99.
- (2017) Neural architecture search with reinforcement learning. In Proc. Int. Conf. Learn. Representations.