Detecting Colorized Images via Convolutional Neural Networks: Toward High Accuracy and Good Generalization

02/17/2019
by   Weize Quan, et al.
Grenoble Institute of Technology

Image colorization achieves increasingly realistic results with the growing computational power of recent deep learning techniques, and it is becoming more difficult for human eyes to identify fake colorized images. In this work, we propose a novel forensic method to distinguish between natural images (NIs) and colorized images (CIs) based on a convolutional neural network (CNN). Our method is able to achieve high classification accuracy and cope with the challenging scenario of blind detection, i.e., when no training sample is available from the "unknown" colorization algorithm that we may encounter during the testing phase. This blind detection performance can be regarded as a generalization performance. First, we design and implement a base network, which attains better performance in terms of classification accuracy and generalization (in most cases) compared with state-of-the-art methods. Furthermore, we design a new branch, which analyzes smaller regions of the extracted features, and insert it into the above base network. Consequently, our network can not only improve the classification accuracy, but also enhance the generalization in the vast majority of cases. To further improve the performance of blind detection, we propose to automatically construct negative samples through linear interpolation of paired natural and colorized images. Then, we progressively insert these negative samples into the original training dataset and continue to train the network. Experimental results demonstrate that our method can achieve stable and high generalization performance when tested against different state-of-the-art colorization algorithms.



I Introduction

With the increasing popularity and sophistication of image editing technologies, it is now relatively easy to create edited images that are visually plausible. For example, current advanced colorization algorithms, which more or less leverage the powerful capacity of deep neural networks, can automatically colorize grayscale images to obtain high-quality color images. Fig. 1 shows a pair of images: the right one is the original color image used for comparison, and the left one is a colorized image produced by a fully automatic colorization algorithm [1] that takes the grayscale version of the right one as input. Obviously, it is difficult to tell with the naked eye which one is the colorized image. Although this technique brings convenience to people's lives, it may also be maliciously used and potentially lead to security issues, such as confounding object recognition or scene understanding [2]. Therefore, distinguishing between natural images (NIs) and colorized images (CIs) has become an important research problem in image forensics.

Fig. 1: Pair of images. The left one is a colorized image generated by the colorization method proposed in [1], and the right one is a natural image taken from ImageNet [3].

Very recently, Guo et al. [2] first proposed two approaches to solve this new forensic problem. On the basis of the statistical differences between NIs and CIs in the hue, saturation, dark, and bright channels, two methods, namely histogram-based and Fisher-encoding-based, were designed to capture these differences. After obtaining the discriminant feature vectors, they trained support vector machine (SVM) classifiers to identify fake colorized images. In fact, the classification performance of their methods still leaves room for improvement. Furthermore, in the challenging scenario of blind detection, i.e., when no training sample is available from an "unknown" colorization method that we may encounter during the testing phase of forensic detectors, the performance of their methods in general decreases. Hereafter, we refer to this blind detection performance as generalization performance. Meanwhile, although not being very rigorous, we choose to use the term "classification accuracy/performance" to indicate the detection performance on testing data in which CIs are generated by the same colorization method known during the training procedure.

Nowadays, convolutional neural networks (CNNs) have obtained obvious performance improvements compared with traditional handcrafted-feature-based methods, not only in computer vision and pattern recognition [4, 5, 6], but also in multimedia security [7, 8, 9, 10, 11]. A well-known reason is that a CNN can automatically extract useful information from (complex) data and thus has a powerful learning capacity. In addition, its unified optimization framework, i.e., training in an "end-to-end" manner, may be superior to the multi-step pipeline of conventional methods, which often have separate stages of extracting handcrafted features (somehow reflecting human prior knowledge on the problem) and training classifiers. In this work, we propose a CNN-based method to identify colorized images. Specifically, we propose two ways to improve the forensic performance, especially the generalization capability. We first design a base network and then improve its architecture, with the objective of obtaining better performance in terms of classification accuracy and generalization. Afterwards, in order to better cope with the challenging scenario of blind detection, we introduce a simple yet effective method, namely, inserting additional auto-constructed negative samples into the original training dataset and then carrying out enhanced training of the network for a better generalization performance.

Our main contributions are summarized below:

  • We design and implement a base “end-to-end” deep model based on CNN to identify NIs and CIs, which obtains better classification accuracy and generalization capability (in most cases) compared with state-of-the-art methods [2]. We also consider and compare three different design choices about the activation of the network’s first layer.

  • We improve the original base network by inserting a new branch, which analyzes smaller regions of the extracted features of the first layer, in order to enrich the learned features and enhance the discrimination capacity of the network. This enhanced network can not only increase the classification accuracy, but also improve the generalization performance in the vast majority of cases.

  • We introduce a simple yet effective method to further improve the generalization performance of the proposed network. In practice, we construct negative samples via linear interpolation of paired natural and colorized images available in the training dataset, and iteratively add them into the original training dataset for additional and enhanced CNN training. This procedure is fully automatic, and can allow us to obtain stable and high generalization performance when conducting tests against colorization algorithms that are “unknown” during the training stage.

The rest of this paper is organized as follows. Section II reviews relevant existing work. Section III discusses the motivation of every step of our work and presents the details of the proposed method. Section IV reports the performance evaluation of our method and comprehensive comparisons with state-of-the-art methods. Section V draws conclusions and proposes some future research directions.

II Related Work

II-A Colorized Image and Its Identification

Image colorization adds color to a monochrome image and obtains a realistic color image. Existing colorization algorithms mainly consist of three categories: scribble-based [12, 13, 14, 15, 16], reference-based [17, 18, 19], and fully automatic [20, 1, 21, 22] approaches.

Scribble-based methods require user-specified scribbles and propagate the color information to the whole grayscale image. This kind of method is usually accompanied by trial and error to obtain satisfactory results, and is thus rather time-consuming. Reference-based (or example-based) approaches mainly exploit the color information of a reference image that is (semantically) similar to the input grayscale image. The core idea is to model a matching relationship between these two types of images. However, the selection of a suitable reference image may be burdensome.

In contrast, researchers have recently developed fully automatic methods that do not need user interaction or example color images, and that usually work in a data-driven manner. Cheng et al. [20] proposed the first deep-neural-network-based image colorization method. Their method performed pixel-wise prediction; however, the input of the deep model consisted of pre-extracted handcrafted features. Iizuka et al. [1] proposed a novel, fully "end-to-end" network for the task of image colorization. The input was a grayscale image and the output was the chrominance, which was combined with the input image to produce the color image. Their network jointly learned global and local features from an image, and at the same time, they also exploited the classification labels of the grayscale images to improve the performance. Different from previous methods, Larsson et al. [21] proposed a deep model that predicts a color histogram, instead of a single color value, at every image pixel. Zhang et al. [22] took into account the inherent uncertainty of the colorization task and introduced a class-rebalancing method to increase the color diversity of the resulting image. These CNN-based methods lead to colorized images of very high visual quality, often plausible enough to deceive human perception.

As shown in Fig. 1, a visually realistic colorized image (the left one), generated by the state-of-the-art colorization algorithm [1], is difficult to distinguish from the corresponding natural image (the right one). Very recently, Guo et al. [2] first proposed handcrafted-feature-based methods to detect fake colorized images. On the basis of the observation that colorized images tend to possess less saturated colors, they analyzed the statistical difference between NIs and CIs in the hue and saturation channels. In addition, they also found that there are differences in certain image priors. In practice, they exploited the extreme channels prior (ECP) [23], i.e., the dark channel prior (DCP) [24] and the bright channel prior (BCP). They proposed two approaches, i.e., histogram-based and Fisher-encoding-based, to extract statistical features, and then trained SVMs for classification. We believe that this new and important forensic problem deserves further study because the results shown in the pioneering work [2] could be improved in terms of classification accuracy and generalization performance, both of which are important metrics for moving forensic algorithms towards practical applications. As in Guo et al.'s work [2], our study also considers high-quality colorized images generated by three state-of-the-art colorization algorithms, hereafter denoted respectively by Ma [21], Mb [22], and Mc [1].

II-B CNN for Multimedia Security

Inspired by the notable success of CNN, in the multimedia security community, a number of researchers have used CNN for image forensics [8, 25, 26, 27, 28, 29, 10, 30, 31, 11] and steganalysis [7, 32, 33, 34, 9].

Concerning CNN-based image forensics, different research problems have been considered. Chen et al. [8] first proposed to use CNN to detect median filtering, and obtained a significant performance improvement compared with traditional methods. Tuama et al. [25] and Bondi et al. [26] utilized CNN to accomplish the task of source camera identification. This powerful tool was also employed to distinguish between natural and computer graphics images [29, 10], and to detect image forgery [30, 31]. In addition, Bayar et al. [11] developed a so-called constrained convolutional neural network to solve the general-purpose image manipulation detection problem.

Most of the previous CNN-based methods mentioned above use conventional single-stream networks to complete their tasks [8, 7, 25, 32, 26, 31, 10, 11]. Different from this conventional design, other design choices have been considered, for example injecting additional knowledge into the CNN [34] and utilizing multi-stream inputs (i.e., multiple representations of the same input image in different domains) [27, 28]. Chen et al. [34] introduced JPEG-phase knowledge into the CNN architecture to detect modern JPEG steganography. Barni et al. [27] designed CNN-based models for aligned and non-aligned double JPEG compression detection. Their networks took three inputs: original images, noise residuals, and discrete cosine transform (DCT) histograms (with an additional sub-network to compute the DCT histograms), respectively. They fused the outputs of the DCT-based CNN and the noise-based CNN as a feature vector, and then trained a random forest to improve the accuracy in the mixed case of aligned and misaligned compression. Different from this "hard" fusing strategy, Amerini et al. [28] fused the deep features of two networks with different inputs, i.e., original images and DCT histograms, using a fully connected layer, so that the two-stream network can be trained in an "end-to-end" manner. In our work, we propose a two-branch network. Unlike the previous networks mentioned above, the input of our network is only the image under forensic examination, without any additional knowledge or a different representation of the image. In addition, our feature fusion is located in the middle of the network, and several convolutional layers after the fusion further learn a hierarchical and discriminative representation for detecting colorized images.

Finally, to the best of our knowledge, no existing work on CNN-based image forensics has considered the "generalization" capability yet. In fact, this is a highly challenging scenario because no training samples of the "unknown" colorization algorithms are available. In other words, we want the trained network to be able to successfully detect colorized images generated by new colorization methods that remain unknown during the training of the CNN. In this work, we address this challenging generalization problem through a simple yet effective approach, i.e., inserting additional negative samples that are automatically constructed from the available training samples, in order to carry out an enhanced training of the CNN.

III Proposed Framework

Fig. 2: Architecture of our networks, named respectively BaseNet (architecture excluding the part within the red dotted rectangle, Section III-B) and DecNet (whole architecture, Section III-C). The network input is an RGB image, and the output is the class scores. For each convolutional layer, k is the kernel size and n is the number of feature maps. The two-branch outputs of conv4 have the same size and are directly concatenated as the input of conv5. "W/O Act" means "with or without activation", and "FC(2)" stands for a fully-connected classifier layer with a 2-dimensional output of class scores.

III-A Motivation

Our study is inspired by Guo et al.'s work [2], where they first proposed histogram-based and Fisher-encoding-based fake colorized image detection methods and obtained decent performance. In fact, these two handcrafted features are to some extent based on prior knowledge observed from the data, and may thus be non-optimal discriminant features for this complex identification task. The classification accuracy shown in [2] supports this point as well. Furthermore, the generalization performance could be further improved, as discussed in [2]. More specifically, the forensic performance of their methods sometimes drops when the training images and the testing images are produced by different colorization algorithms. Therefore, an "end-to-end" framework based on a CNN could be a good solution to automatically learn informative and generic characteristics that separate natural and colorized images. In our approach, we consider two aspects: (1) designing a suitable CNN architecture to learn discriminative and enriched features for this forensic problem with good classification accuracy and generalization performance, and (2) constructing additional training data, i.e., the so-called negative samples, to obtain an appropriate decision boundary for this classification problem and thus further improve the generalization capability of our network.

III-B Our Network - Base Architecture

For this forensic problem, we first design and implement a base CNN. Except for the components enclosed by the red dotted rectangle in Fig. 2, the remaining part is the proposed base network (called "BaseNet") with a conventional single-branch structure. Our network consists of 8 convolutional layers and a fully-connected classifier layer (in total 9 layers deep). Inspired by recent network designs for computer vision tasks [35, 36, 6], our network ends with a 2-way fully-connected layer instead of traditional stacked multi-layer perceptrons and thus has fewer parameters. The input of BaseNet is an RGB image. After the first layer (conv1), it is expected that much useful information is extracted from the original input image. The next three layers (conv2-4) are designed to analyze the extracted features of the first layer. Then a somewhat high-level abstraction and reasoning is applied via the remaining layers (conv5-8). Finally, the 2-dimensional score vector of the class label is output by a fully-connected classifier [FC(2), with (2) standing for the output dimension]. All convolutional kernel sizes in BaseNet are $3 \times 3$. For conv1-7, each convolutional layer (Conv) uses zero-padding of 1, which ensures that the input and output of Conv have the same spatial size. The loss function of our network is the cross-entropy loss, which is most commonly used for classification tasks. Given a training dataset of $N$ images $\{x_i\}_{i=1}^{N}$, each associated with a label $y_i \in \{0, 1\}$ (0: CI and 1: NI), the loss function can be described as:

$$L = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp\big(s^{(i)}_{y_i}\big)}{\sum_{j=0}^{1} \exp\big(s^{(i)}_{j}\big)}, \qquad (1)$$

where $s^{(i)}_{j}$ denotes the $j$-th element of the class score vector $s^{(i)}$ produced by the network for image $x_i$.
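For concreteness, this loss corresponds to the standard softmax cross-entropy available in PyTorch; a minimal sketch of the FC(2) classifier head and loss computation is given below (the 512-dimensional input follows the stated conv8 output dimension, everything else is illustrative):

import torch
import torch.nn as nn

# FC(2): maps the 512-D deep feature (output of conv8) to a 2-D class score vector.
fc = nn.Linear(512, 2)
criterion = nn.CrossEntropyLoss()   # softmax followed by negative log-likelihood, i.e., Eq. (1)

features = torch.randn(20, 512)         # a minibatch of deep features (toy values)
labels = torch.randint(0, 2, (20,))     # ground-truth labels, 0: CI and 1: NI
scores = fc(features)                   # class score vectors s^(i)
loss = criterion(scores, labels)        # cross-entropy loss averaged over the minibatch
loss.backward()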

In our network, each Conv is equipped with a batch normalization (BN) layer. BN [37] explicitly forces the output of Conv to take on a unit Gaussian distribution. At the same time, a pair of shift and scale parameters is applied to guarantee that the transformation can represent the identity transform. This layer increases the stability of network training and reduces potential overfitting due to its slight regularization effect. All max-pooling layers (Max) [38] in BaseNet share the same kernel size and use a stride of 2. Max-pooling reports the maximum output within a local window of feature maps, and is essentially a down-sampling operation. This operation brings two benefits: it reduces the number of parameters within the model by decreasing the spatial size of the processed feature maps, and it makes the representation approximately invariant to small translations [39]. Many multimedia security researchers argue that the extracted low-level features of the first layer are crucial for the success of their tasks [8, 32, 11]. In this work, we pay attention to the activation function of the first layer, and we consider and compare three different design choices (reflected by the box of "W/O Act" in Fig. 2): without activation, with rectified linear unit (ReLU) activation [40], and with hyperbolic tangent (TanH) activation. The input-output relation of ReLU activation is $f(x) = \max(0, x)$, and that of TanH is $f(x) = (e^{x} - e^{-x}) / (e^{x} + e^{-x})$.
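To make the "W/O Act" choice concrete, the following fragment sketches the first layer with the three activation options (the 3x3 kernel and 32 feature maps follow the "conv1(k3n32)" annotation of Fig. 3; the rest is a simplified illustration, not the exact BaseNet definition):

import torch.nn as nn

def first_layer(activation="tanh"):
    """conv1 of the base network: 3x3 kernel, 32 feature maps, BN, and one of
    three activation choices ("tanh", "relu", or "none")."""
    layers = [nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.BatchNorm2d(32)]
    if activation == "tanh":
        layers.append(nn.Tanh())
    elif activation == "relu":
        layers.append(nn.ReLU(inplace=True))
    # activation == "none": keep the sign and magnitude of the extracted low-level features
    return nn.Sequential(*layers)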

Compared with state-of-the-art methods using handcrafted features [2], this BaseNet already obtains better classification accuracy and generalization performance (in most cases), and detailed results are given in Section IV-D.

Fig. 3: The local region sizes in the original image space "seen" by a neuron at the output of the two branches, respectively. Note that this neuron is located at the concatenation stage of the feature maps of conv4. Here, four columns (from "conv1" to "conv4") correspond to the first four layers of our network shown in Fig. 2, and we explicitly illustrate this correspondence, for example, "conv1(k3n32)". A blue square stands for a feature map, and the numbers close to it denote its size. A group of two yellow squares stands for a convolutional operation, and a group of two red squares stands for max-pooling.

III-C Our Network - Enhanced Architecture

In order to enhance the learning capacity of BaseNet, we improve its architecture, and the corresponding inspiration is borrowed from ensemble learning. Ensemble learning combines the predictions of a set of individually trained classifiers and then gives the final decision [41]. Empirically, more variety among the base classifiers makes the ensemble more powerful [42, 43]. In our work, loosely speaking, we try to apply this idea by slightly adjusting the base network's architecture. Practically, we design a new branch which is different from the base network, and insert it in the middle of BaseNet to jointly analyze the extracted features of the first layer (conv1) from a multi-scale perspective. This enhanced network is denoted by DecNet (Detection colorization Network). The new branch is highlighted by the red dotted rectangle in Fig. 2. The convolutional kernels in this new branch all have the same size, i.e., $1 \times 1$, which is different from the $3 \times 3$ kernels of the other branch. For the Convs at the same position in the two branches (conv2-4), the sizes of their outputs are the same because the former uses $1 \times 1$ kernels without padding and the latter uses $3 \times 3$ kernels with zero-padding of 1. In the meantime, the settings of Max in these two branches are also consistent. Hence, the analysis results of the two branches have the same size and can be directly concatenated as the input of the fifth layer (i.e., conv5) of DecNet.

Due to the different architectures of the two branches, i.e., different kernel sizes, a neuron in the output of these two branches (i.e., the output of conv4) corresponds to regions of different sizes in the input image space. This difference is analyzed and shown in Fig. 3. The region "seen" by a neuron at the output of conv4 of the new branch (the top row of Fig. 3) is roughly a quarter of the size of that of the base network (the bottom row of Fig. 3), because the new branch has smaller convolutional kernels. Therefore, the difference between the two branches in terms of the local region "seen" in the original image space can introduce some level of variety into the process of feature analysis (conv2-4). Then, we utilize several convolutional layers (conv5-8) to efficiently fuse the analysis results of the two branches, intending to make good use of this potential variety. Consequently, this enhanced network further increases the classification accuracy and improves the generalization performance in the vast majority of cases. Quantitative results are reported in Section IV-B.
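To illustrate the structure described above, the fragment below sketches the two parallel conv2-4 branches and their concatenation before conv5 (the channel numbers follow the stated 64, 128, and 256 feature maps for conv2-4; the placement of pooling and the intermediate activations are simplifying assumptions):

import torch
import torch.nn as nn

def block(c_in, c_out, k):
    """One Conv-BN-activation-pool unit; 3x3 convs use zero-padding of 1, 1x1 convs need none."""
    return nn.Sequential(nn.Conv2d(c_in, c_out, k, padding=1 if k == 3 else 0),
                         nn.BatchNorm2d(c_out), nn.ReLU(inplace=True), nn.MaxPool2d(2))

# conv2-4 of the base branch (3x3 kernels) and of the new branch (1x1 kernels)
base_branch = nn.Sequential(block(32, 64, 3), block(64, 128, 3), block(128, 256, 3))
new_branch = nn.Sequential(block(32, 64, 1), block(64, 128, 1), block(128, 256, 1))

x = torch.randn(1, 32, 128, 128)                           # output of conv1 (toy spatial size)
fused = torch.cat([base_branch(x), new_branch(x)], dim=1)  # same spatial size, concatenated
print(fused.shape)                                         # input of conv5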

III-D Negative Sample Insertion

Fig. 4: Deep feature visualization with t-SNE [44]. The model is trained on the original dataset where CIs are generated by Mb. "C" means colorized images and "N" means natural images. "C-X" means the colorized images produced by colorization method X; for example, "C-Ma" corresponds to CIs generated by the Ma colorization algorithm. "Y-pred" means that the label predicted by the CNN is Y. The network is DecNet with TanH in the first layer. We randomly select 900 natural images from the validation dataset, split them into three equal subsets of 300 images, and then construct the corresponding colorized images of each subset using Ma, Mb, and Mc, respectively. The deep feature is the output of conv8, and its dimension is 512. In addition, (d) is the combination of (a), (b), and (c).

According to our observations, there is a certain degree of performance deviation in the challenging blind detection scenario, not only for traditional handcrafted-feature-based methods [2], but also for our CNN-based approach, although the latter performs better. In detail, for a traditional or CNN-based model trained on a dataset constructed by one specific colorization algorithm, the test performance on datasets constructed by other colorization algorithms is sometimes rather limited for colorized images. A possible reason for this performance drop is that colorized images produced by a specific colorization algorithm tend to share a particular internal property, while CIs of different colorization algorithms are very likely to have different properties.

To clearly illustrate the encountered problem with an example, we train DecNet on the dataset constructed by colorization method Mb [22], and test on the datasets constructed by Ma [21] and Mc [1], respectively. It should be noted that Ma and Mc are the "unknown" colorization algorithms, and thus the corresponding samples of Ma and Mc are not used in the training process. We use t-distributed stochastic neighbor embedding (t-SNE) [44] to project the high-dimensional deep features (the output of conv8 of DecNet, whose dimension is 512) of testing data constructed by the above three colorization methods onto a two-dimensional map, and detailed visualization results are shown in Fig. 4. Comparing Fig. 4(a), (b) and (c), we find that the distributions of NIs (red squares) are relatively stable with a rather high "intra-class" variation, which is somehow expected; meanwhile, CIs (blue symbols) are more tightly clustered for each colorization algorithm, but their locations change a lot for different methods [please compare the CIs in (a), (b) and (c)]. This is reasonable because different colorization methods tend not to have exactly the same internal characteristics, and hence the corresponding CIs have different locations in the feature space. When the features of CIs produced by "unknown" colorization algorithms (here Ma and Mc, whose samples are not used for training) are near the decision boundary of the CNN (which is trained using NIs and CIs produced by a "known" colorization algorithm, here Mb), and at the same time the decision boundary is relatively close to the colorized images, there is a high probability of misclassifying the "unknown" CIs. For instance, many CIs in Fig. 4(b) (blue circles with red + in the figure) are wrongly predicted as NIs.
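The projection used in Fig. 4 can be reproduced in spirit with off-the-shelf t-SNE, for example with scikit-learn (a minimal sketch; the feature extraction hook, marker styles, and the random placeholder data are our own illustrative choices):

import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# deep_feats: (n_samples, 512) conv8 outputs; labels: 0 for CI, 1 for NI (placeholders here)
deep_feats = np.random.randn(900, 512)
labels = np.random.randint(0, 2, 900)

emb = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(deep_feats)
plt.scatter(emb[labels == 1, 0], emb[labels == 1, 1], marker="s", label="NI")
plt.scatter(emb[labels == 0, 0], emb[labels == 0, 1], marker="o", label="CI")
plt.legend()
plt.show()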

We would like to find a simple yet effective method to solve the encountered problem. The idea is to make use of the available training samples (and only these samples) to construct an appropriate decision boundary which can lead to better generalization performance. A feasible and intuitive solution is to add negative samples (with the same label as CIs) near the initial decision boundary of the CNN, so as to make the CNN more "strict" about the predictions of CIs and somehow push the classification boundary towards NIs. As such, it is expected that the "unknown" CIs located close to the initial decision boundary [e.g., those shown in Fig. 4(b)] have a better chance of being correctly classified with the new classification boundary, which would be closer to NIs. More precisely, we construct a negative sample through linear interpolation between a paired NI and CI which share the same grayscale version and only differ in the chrominance components. The corresponding formulation is shown below:

$$x_{neg} = \alpha \cdot x_{N} + (1 - \alpha) \cdot x_{C}, \qquad (2)$$

where $x_{neg}$ is the negative sample, $x_{N}$ is the natural image, $x_{C}$ is the corresponding colorized image, and $\alpha$ is the interpolation factor. This actually makes sense, as negative samples are in fact forensically negative (i.e., considered as CIs), especially for our chosen weight values $\alpha \in \{0.1, 0.2, 0.3, 0.4\}$ (i.e., negative samples are closer to CIs than to NIs). When $\alpha$ increases, the negative samples progressively get closer to the natural images, and it is expected that the decision boundary moves further towards NIs after enhanced training.
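Eq. (2) takes only a few lines to implement; the sketch below operates directly on RGB arrays (the file names and helper name are hypothetical, and the value of α follows the settings used in our enhanced training):

import numpy as np
from PIL import Image

def make_negative_sample(ni_path, ci_path, alpha):
    """Linear interpolation of a paired natural image and its colorized counterpart, Eq. (2)."""
    ni = np.asarray(Image.open(ni_path), dtype=np.float32)   # natural image x_N
    ci = np.asarray(Image.open(ci_path), dtype=np.float32)   # colorized image x_C
    neg = alpha * ni + (1.0 - alpha) * ci                    # alpha < 0.5 keeps the sample closer to the CI
    return Image.fromarray(neg.astype(np.uint8))

# e.g., make_negative_sample("ni_0001.png", "ci_0001.png", alpha=0.1), labeled as a CI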

As analyzed above, adding negative samples and conducting additional training will push the classification boundary towards NIs. Thus, the classification accuracy on NIs will gradually decrease as more and more negative samples are inserted. The classification accuracy of the network on the validation dataset also slightly decreases, because the CIs are almost all correctly classified and this accuracy therefore mainly depends on the classification accuracy on NIs. However, in the meanwhile the CIs constructed by "unknown" colorization algorithms are expected to be classified more correctly, implying a better generalization capability. Obviously, there is a trade-off between the classification accuracy (on data similar to the training samples) and the generalization performance (mainly on "unknown" CIs) for our network. Therefore, without being able to directly measure the generalization during the training of the network, we consider the classification accuracy on NIs (on the so-called natural validation dataset $\mathcal{V}_N$) as a measure to select the final model in the process of additional training with negative sample insertion. In our work, we design a threshold-based model selection criterion. This threshold ($T$) essentially determines the degree of final classification accuracy that can be accepted by the user or the current task. Generally speaking, a larger $T$ means that the selected model has a less high classification accuracy, but a better generalization performance. Basically, we set $T = (1 + \lambda)\, e_0$, where $\lambda$ is a user-defined parameter and $e_0$ is the classification error rate (in %, measured on the natural validation dataset $\mathcal{V}_N$) of the CNN model trained with the original training dataset before negative sample insertion. This criterion simply defines the maximum tolerable relative increase of the error rate on $\mathcal{V}_N$ induced by enhanced training, with $\lambda$ fixed in our experiments. One exception is that when $e_0$ is very small, we slightly relax the constraint on the classification error rate (i.e., allow a larger $T$) in order to obtain a relatively large improvement of generalization performance.

Algorithm 1 illustrates the training process with negative sample insertion. It is worth noting that we only use CIs of a "known" colorization method, but in a better way, to construct a more appropriate decision boundary. In our experiments, this insertion is an iterative process with four iterations, i.e., $\alpha$ is increased from 0.1 to 0.4 with a step of 0.1. Given a CNN model trained on the original dataset $\mathcal{D}$, and some basic settings for CNN training, such as the initial learning rate and the number of epochs $E$ for each insertion, we first compute $e_0$ on $\mathcal{V}_N$ and then the threshold $T$, which are used for final model selection. For each round of negative sample insertion, we construct negative samples and insert them into the dataset $\mathcal{D}$. Then, we update the parameters of the model using the new training dataset, and compute the error rate on $\mathcal{V}_N$ starting from the second half of the training process (i.e., from the $\lceil E/2 \rceil$-th epoch of each insertion, where $\lceil \cdot \rceil$ is the integer ceiling operator), because from that time the model becomes relatively stable. After each insertion, we test the negative samples produced by the previous iteration. If a negative sample is misclassified, i.e., the predicted label is NI and not consistent with its ground-truth label, then we stop using the corresponding pair to construct negative samples (i.e., we remove the corresponding pair from the pair set $\mathcal{P}$, as described in line 9 of Algorithm 1). In fact, this operation slightly reduces the amount of negative samples, and does not weaken the performance of our network. After four iterations of insertion, we select the final CNN model. It is worth mentioning that when $\alpha$ becomes large (approaching or exceeding 0.5), the negative samples will be close to NIs, and this is likely to have more impact on the classification of NIs. We take a conservative and experimentally effective approach, i.e., stopping the negative sample insertion process after four iterations.

Input: the original training dataset $\mathcal{D}$, the natural validation dataset $\mathcal{V}_N$, the CNN model trained on $\mathcal{D}$, the number of epochs $E$ for each insertion and the initial learning rate $\eta_0$, and the set $\mathcal{P}$ of corresponding natural and colorized image pairs constructed from $\mathcal{D}$.
Output: final model after enhanced training.
Initialization: current learning rate $\eta \leftarrow \eta_0$, negative samples $\mathcal{S} \leftarrow \emptyset$, set of error rates on $\mathcal{V}_N$ of candidate CNN models $\mathcal{R} \leftarrow \emptyset$.

1:  compute $e_0$ of the given model on $\mathcal{V}_N$.
2:  compute $T = (1 + \lambda)\, e_0$.
3:  for all $\alpha \in \{0.1, 0.2, 0.3, 0.4\}$ do
4:     construct negative samples from $\mathcal{P}$ using Eq. (2) and insert them into $\mathcal{S}$.
5:     update training dataset: $\mathcal{D} \leftarrow \mathcal{D} \cup \mathcal{S}$.
6:     update the parameters of the model for $E$ epochs. In the second half of the training process, compute the error rate on $\mathcal{V}_N$ for each candidate model, and insert this value at the end of $\mathcal{R}$.
7:     for all negative samples in $\mathcal{S}$ do
8:         if the negative sample is misclassified then
9:            remove the corresponding pair from $\mathcal{P}$.
10:         end if
11:     end for
12:     set $\mathcal{S} \leftarrow \emptyset$.
13:     update current learning rate: $\eta \leftarrow \eta / 10$.
14:  end for
15:  select the $k$-th candidate model, where $k$ is the largest index satisfying $\mathcal{R}_k \leq T$.
Algorithm 1 Enhanced training of CNN model with negative sample insertion
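In code, the enhanced training loop of Algorithm 1 could be organized as follows (a condensed, pseudocode-style sketch: error_rate, train_one_stage, predict, and make_negative_sample are assumed helpers, and the final model selection reflects our reading of the threshold criterion):

def enhanced_training(model, train_set, pairs, val_nat, lam, epochs=15):
    """Sketch of Algorithm 1: iterative negative sample insertion and enhanced training."""
    e0 = error_rate(model, val_nat)            # error rate e_0 on the natural validation set V_N
    threshold = (1.0 + lam) * e0               # maximum tolerable error rate T on V_N
    candidates = []                            # list of (error on V_N, model snapshot) tuples
    for alpha in (0.1, 0.2, 0.3, 0.4):
        negatives = [make_negative_sample(ni, ci, alpha) for ni, ci in pairs]
        train_set = train_set + [(img, 0) for img in negatives]       # negatives labeled as CIs
        candidates += train_one_stage(model, train_set, val_nat, epochs)
        # stop using pairs whose negative sample is already misclassified as NI
        pairs = [(ni, ci) for (ni, ci), neg in zip(pairs, negatives)
                 if predict(model, neg) == 0]
    accepted = [snap for err, snap in candidates if err <= threshold]
    return accepted[-1] if accepted else candidates[-1][1]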

The complete training process of the proposed method includes two stages: (1) using the original training dataset to train the deep model from scratch until convergence; (2) iteratively adding new negative samples into the original training dataset and continuing to train the model, as summarized in Algorithm 1. Fig. 5 shows the learning curves of a complete training process. In the first stage, the error rates on $\mathcal{V}_N$ and on CIs produced by Mb obviously decline in the first 20 epochs, and the network reaches stability after about 50 epochs, as shown in Fig. 5(a) and (b). With the negative sample insertion, the error rate on $\mathcal{V}_N$ slightly increases, which can be seen in the second part of Fig. 5(a). However, the generalization performance of the network shows a significant improvement on CIs produced by Ma [Fig. 5(c)] and a small improvement on Mc [Fig. 5(d)]. More numerical and visual results (including t-SNE visualization after enhanced training) are given in Section IV.

(a)
(b) Mb
(c) Ma
(d) Mc
Fig. 5: Learning curves of a complete training of DecNet (with TanH in the first layer). The network is trained on Mb [22], and tested on Ma [21] and Mc [1]. The error rates (in %) on CIs produced by these three methods are shown in (b), (c), and (d), respectively. The error rate on $\mathcal{V}_N$ is shown in (a). The black dotted line separates the two training stages, where the first part uses the original training dataset for 60 epochs and the second part is the enhanced training with negative sample insertion (15 epochs for each insertion and 60 epochs in total). The green circle in (a) stands for the final selected model.

IV Experimental Results

IV-A Implementation Details

Our networks are implemented with PyTorch 0.3.1 [45]. The GPU is a GeForce® GTX 1080Ti from NVIDIA® corporation. All images in our experiments are resized to a fixed size using bicubic interpolation, and the pixel values of each image are rescaled to a fixed range. Stochastic gradient descent (SGD) with a minibatch size of 20 is used to train the CNN models. Each minibatch contains 10 natural images and 10 colorized images. We randomly shuffle the order of the training dataset after each epoch. For the SGD optimizer, the momentum is 0.9 and the weight decay is 1e-4. The base learning rate is initialized to 1e-4. In our work, a complete training process includes two stages, respectively without and with negative sample insertion. In the first stage, we divide the learning rate by 10 every 20 epochs, and the training procedure stops after 60 epochs. In the second stage, the learning rate continues to be divided by 10 every 15 epochs (which is enough to guarantee convergence after each new negative sample insertion), and the training procedure stops after 60 epochs, i.e., 4 iterations of negative sample insertion. For BN, we keep a running estimate of the computed mean and variance during the training stage, and this running mean and variance is used for normalization in the testing stage [37].
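These optimization settings translate to roughly the following PyTorch setup for the first training stage (a sketch: the DecNet class and the train_one_epoch routine are assumed, and only stage 1 of the schedule is shown):

import torch

model = DecNet()                            # assumed implementation of the architecture in Fig. 2
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4,
                            momentum=0.9, weight_decay=1e-4)
# Stage 1: divide the learning rate by 10 every 20 epochs, 60 epochs in total
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)
for epoch in range(60):
    train_one_epoch(model, optimizer)       # minibatches of 10 NIs + 10 CIs, reshuffled each epoch
    scheduler.step()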

Following [2], we also employ the half total error rate (HTER) to evaluate the performance of the proposed method. The HTER is defined as the average of the misclassification rates (in %) of NIs and CIs. In this work, all reported results of our method are the average of 7 runs.
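The HTER of a test set therefore reduces to a one-line computation (a small helper that simply reflects the definition above; the variable names are ours):

def hter(ni_errors, ni_total, ci_errors, ci_total):
    """Half total error rate: average of the misclassification rates (in %) of NIs and CIs."""
    ni_rate = 100.0 * ni_errors / ni_total   # NIs wrongly predicted as CIs
    ci_rate = 100.0 * ci_errors / ci_total   # CIs wrongly predicted as NIs
    return 0.5 * (ni_rate + ci_rate)

# e.g., hter(ni_errors=12, ni_total=1000, ci_errors=30, ci_total=1000) returns 2.1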

Arc.      | No activation        | TanH                 | ReLU
          | Ma     Mb     Mc     | Ma     Mb     Mc     | Ma     Mb     Mc
BaseNet   | 0.66   0.32   0.87   | 0.56   0.19   0.72   | 0.63   0.26   0.77
BaseNet+  | 0.69   0.33   0.87   | 0.58   0.24   0.69   | 0.69   0.27   0.78
DecNet    | 0.60   0.29   0.69   | 0.55   0.16   0.55   | 0.62   0.20   0.61
TABLE I: The classification performance (HTER, in %, lower is better) of different network architectures on three datasets constructed by Ma [21], Mb [22], and Mc [1], respectively. "BaseNet+" is an augmented version of "BaseNet" with more feature maps from the second to fourth layers (conv2-4). "Arc." stands for "Architecture".
(a) Train on Ma, test on Mb and Mc
(b) Train on Mb, test on Ma and Mc
(c) Train on Mc, test on Ma and Mb
Fig. 6: The generalization performance (in HTER, lower is better) of different architectures on three different settings. From left to right, the CIs of training datasets are generated by Ma [21], Mb [22], and Mc [1], respectively. “Avg HTER” means average HTER of testing on datasets constructed by other two colorization methods. For example, (a) means training on dataset constructed by Ma, and testing on dataset constructed by Mb and Mc.

IV-B Validation of Network Architecture Design

Before evaluating the proposed method, we provide the details of the datasets used in our experiments. Following [2], three state-of-the-art colorization algorithms, Ma [21], Mb [22], and Mc [1], are adopted for producing CIs. NIs come from the ImageNet dataset [3]. We use 10,000 natural images from the ImageNet validation dataset to construct the training dataset and the validation dataset, with a ratio of 4:1. The exact indexes of these images are reported in [21], and they are used for parameter selection and validation in [2]. Then, we remove the 899 grayscale images and 1 CMYK (cyan, magenta, yellow, and black) image from the remaining 40,000 images of the ImageNet validation dataset (the total number of images in this dataset is 50,000), and obtain 39,100 natural images to construct the testing dataset. Note that the size of our testing dataset is far larger than that reported in [2]. We employ the three colorization methods mentioned above to produce the corresponding colorized images.
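The removal of grayscale and CMYK images can be done with a simple mode check, for instance (a sketch; the path list is a placeholder):

from PIL import Image

def is_rgb(path):
    """Keep only RGB images; grayscale ('L') and CMYK images are discarded."""
    with Image.open(path) as img:
        return img.mode == "RGB"

natural_test_images = [p for p in candidate_paths if is_rgb(p)]   # 39,100 images remain in our setting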

We first validate our network architecture design in terms of the classification performance and generalization capability of the networks. As illustrated in Fig. 2, we first design a base network (BaseNet), which already has good performance, and then improve our design by inserting a new branch into BaseNet, to obtain the final network (DecNet) that achieves further performance improvement. In order to verify that the performance improvement is not due to the increase of model parameters, we increase the number of feature maps from the second to fourth layers (conv2-4) of BaseNet (before: 64, 128, 256; after: 96, 192, 384) to obtain the augmented single-branch model BaseNet+. In total, the numbers of parameters of BaseNet, BaseNet+, and DecNet are 6.88M, 7.65M and 7.52M, respectively. Table I reports the classification performance (in HTER, a measure of misclassification rate, so lower is better) of the three network architectures on three different datasets (all with training and testing on the same colorization method), and each network considers three activation choices. We can find that all three networks have very low misclassification rates, and that DecNet outperforms BaseNet and BaseNet+ in all nine settings. In addition, all three networks with TanH in the first layer have the best classification performance, and those without activation have the highest HTER. Two possible reasons are: (1) the non-linearity of TanH and ReLU helps increase the approximation/learning capability of the networks; (2) different from ReLU, TanH keeps the sign of the features, which may provide useful information for the classification of NIs and CIs.

Arc.      | No activation                          | TanH                                   | ReLU
          | Ma            Mb             Mc         | Ma            Mb             Mc         | Ma            Mb             Mc
DecNet    | 0.60 / 8.15   0.29 / 11.44   0.69 / 5.55 | 0.55 / 7.36   0.16 / 14.83   0.55 / 7.61 | 0.62 / 9.49   0.20 / 19.36   0.61 / 9.44
DecNet-i  | 1.11 / 5.24   0.96 / 2.51    1.10 / 2.29 | 1.03 / 4.61   0.85 / 3.01    0.98 / 2.30 | 1.14 / 6.23   0.92 / 3.71    1.02 / 3.47
TABLE II: Performance of negative sample insertion for different network architectures on different datasets. Each network considers three activation types in the first layer. "DecNet" stands for the network trained on the original training dataset, and "DecNet-i" stands for the network after enhanced training of the previously trained model (DecNet) with negative sample insertion. Each cell contains two numbers: the first is the classification error rate (HTER, in %, tested on the same colorization method as given in the column header), and the second (after the slash) is the average generalization performance (Avg HTER, in %, tested on the other two colorization methods).

Fig. 6 shows the generalization performance of the three network architectures when trained and tested on datasets constructed by different colorization algorithms. For each of the three test settings shown in Fig. 6, when looking at all nine combinations of the three network architectures and three activation choices, it is always DecNet, combined with a certain activation choice, that has the lowest HTER; recall that, meanwhile, DecNet always has the best classification performance, regardless of the activation type, when tested on the same colorization method (as shown in Table I). If we check in Fig. 6 the performance separately for each activation choice, DecNet in general has the best generalization performance, except for two cases (out of nine) when combined with ReLU [(b) and (c)]; however, ReLU is apparently not a suitable activation function in terms of generalization performance, as shown in the figure. A possible explanation is that ReLU sets the output of some neurons of the first layer to zero and thus destroys to some extent the extracted useful information. TanH and "no activation" have better generalization performance (with the latter slightly outperforming the former), implying that preserving the extracted information at the first layer is helpful for better generalization.

To summarize, our network DecNet stably increases the classification accuracy regardless of the activation used in the first layer, and improves the generalization performance in most cases (especially for TanH and "no activation"). This improvement is attributed to the new branch, which can enhance the discrimination and variety of the learned features. DecNet achieves a better trade-off than BaseNet+; the latter weakly improves the generalization performance but slightly decreases the classification accuracy. This implies that the performance improvement of a CNN model depends more on a suitable network architecture design than on simply increasing the number of feature maps. Concerning the activation type, TanH is considered a good choice, achieving a very satisfying compromise between classification accuracy and generalization. In contrast, although the classification performance of networks with ReLU in the first layer is good enough (though slightly worse than networks with TanH in the first layer, see Table I), their generalization performance is the worst among the three activation choices. This is probably due to the fact that ReLU destroys part of the initial information directly extracted from the input image, reflecting the importance of preserving the richness of the extracted features of the first layer. Networks without activation in the first layer have the lowest classification accuracy, as shown in Table I, but they can well preserve the extracted features of the first layer, which then contributes to good generalization performance (see Fig. 6). Although the generalization capability of our network DecNet shows only small differences across activation choices, it can be stably improved by our negative sample insertion method, and the detailed results are given in the next subsection.

IV-C Effect of Negative Sample Insertion

In this paper, we propose negative sample insertion to further improve the generalization performance of our network. As described in Section III-D, this enhanced training uses the natural validation dataset $\mathcal{V}_N$ to select the final model, and we randomly select 20,000 NIs from the ImageNet test dataset [3] to construct $\mathcal{V}_N$. Table II reports the performance of our network before (the row of "DecNet") and after (the row of "DecNet-i") negative sample insertion (for the sake of clarity, the generalization performance corresponds to the second number of each cell in Table II and, in subsequent tables, to the entries where the training and testing colorization methods differ). From Table II, we can see that the effect of negative sample insertion, i.e., improving the generalization of the network, is consistently stable for different activation choices. The negative sample insertion leads to a slight decrease of the classification accuracy; however, the generalization performance of the network usually shows an apparent improvement. For example, the initial generalization error of DecNet with ReLU trained on Mb is 19.36%, which reduces to 3.71% after enhanced training using negative samples, with a slight increase of the classification error from 0.20% to 0.92%. When the initial generalization error is relatively small, like the 5.55% of DecNet without any activation trained on Mc, negative sample insertion still further decreases this value to 2.29%, while the classification error changes from 0.69% to 1.10%. This is also consistent with the previous analysis (Section III-D) that there is a trade-off between the classification and the generalization performance, and our negative sample insertion method achieves a satisfying trade-off.

In addition, we also visualize the deep features of DecNet-i with TanH using t-SNE [44], and the results are shown in Fig. 7. Here, the deep features are the output of conv8 of DecNet-i, and their dimension is 512. The corresponding visualizations of the model before negative sample insertion are shown in Fig. 4. The testing data is the same in Fig. 7 as in Fig. 4. By comparing the border of correctly classified CIs, i.e., blue symbols with a blue + inside, in Fig. 4(d) and Fig. 7(d), we can find that the latter has fewer misclassified CIs, and that the classification boundary is pushed towards NIs. The CIs generated by "unknown" colorization algorithms, especially Ma [21], are consequently misclassified less often, which can be clearly observed by comparing Fig. 4(b) with Fig. 7(b). This confirms that our negative sample insertion scheme can push the decision boundary towards NIs to some extent and thus improve the generalization performance.

Fig. 7: The deep feature visualization of DecNet-i with t-SNE [44]. The model is obtained through enhanced training of the previously trained model (used in Fig. 4) with negative sample insertion. The meaning of symbols is same as that of Fig. 4. It is worth noting that in t-SNE the transformation used for dimension reduction and the obtained visualization depend on the input data. Therefore, transformation and visualization in this figure are different from those of Fig. 4.

IV-D Comparison with State-of-the-Art

We experimentally compare the performance of our method (all networks with TanH in the first layer) with that of the state-of-the-art methods [2]. For the sake of brevity, we take the network with TanH as an example for the detailed comparison with [2]. Our method with other activation types has in general consistently good performance; in particular, the performance of the final network DecNet-i with different activations is very similar, as shown in the last row of Table II, and outperforms the methods in [2] as described below.

Method          | Train on Ma           | Train on Mb           | Train on Mc
                | Ma     Mb     Mc      | Ma     Mb     Mc      | Ma     Mb     Mc
FCID-HIST [2]   | 22.50  28.00  33.95   | 26.95  24.45  41.85   | 38.15  43.55  22.35
FCID-FE [2]     | 22.30  23.65  31.70   | 25.10  22.85  34.25   | 38.50  36.15  17.30
BaseNet         | 0.56   10.57  10.62   | 31.65  0.19   6.16    | 13.93  1.91   0.72
DecNet          | 0.55   7.62   7.09    | 26.12  0.16   3.53    | 13.09  2.12   0.55
DecNet-i        | 1.03   5.09   4.13    | 4.41   0.85   1.60    | 2.83   1.77   0.98
TABLE III: Comparison of the performance (HTER, in %, lower is better) of our method with that of the state-of-the-art methods [2] on the ImageNet validation dataset [3]. "FCID-HIST" and "FCID-FE" are proposed in [2]. The first header row gives the colorization method used to construct the training dataset, and the second gives the colorization method used for testing. Note that the results of [2] are obtained by testing on 2,000 images, while those of our method are obtained by testing on 78,200 images. Entries where the training and testing colorization methods differ correspond to generalization performance.

We first compare the classification accuracy and generalization performance on the ImageNet validation dataset [3], and all testing results are shown in Table III. We can find that the classification accuracy (the entries of Table III where the training and testing colorization methods coincide) of our base architecture BaseNet is much improved compared with the two methods in [2], respectively denoted by FCID-HIST and FCID-FE, and that the generalization performance (the remaining entries) is also better than that of the two existing methods except for one case (31.65%, i.e., training on Mb and testing on Ma). For DecNet (the second last row), the trend is almost the same. There is a significant improvement of generalization performance through the negative sample insertion compared with [2], and this can be observed by comparing the generalization entries in the same column of the rows FCID-HIST, FCID-FE, and DecNet-i of Table III. Furthermore, the results of [2] are obtained by testing on 2,000 images (1,000 ImageNet images and the corresponding 1,000 CIs) whose exact indexes remain unknown, while those of our method are obtained by testing on 78,200 images (39,100 ImageNet images and the corresponding 39,100 CIs). In order to confirm the reliability and rationality of the comparison, as an example, Table IV reports the statistical results of the models trained on Ma (these models are the same as those in the "Train on Ma" group of Table III). Practically, we run the tests 500 times, each time on 2,000 images (1,000 pairs of NIs and CIs) randomly selected from the 78,200 images. We can see that the performance of our method is stably superior to that of [2] (comparing the three groups of results in Table IV with the first group of the rows FCID-HIST and FCID-FE in Table III).

Arc.      | Maximum               | Mean                  | Minimum
          | Ma     Mb     Mc      | Ma     Mb     Mc      | Ma     Mb     Mc
BaseNet   | 0.98   12.19  12.21   | 0.55   10.56  10.61   | 0.15   9.02   8.50
DecNet    | 1.08   9.67   8.16    | 0.55   7.62   7.08    | 0.14   6.02   5.57
DecNet-i  | 1.64   6.81   5.16    | 1.03   5.06   4.14    | 0.54   3.75   2.83
TABLE IV: Multiple statistics of the HTER when testing on 2,000 images (1,000 pairs of NIs and CIs) randomly selected from the 78,200 images. The models are trained on Ma. We run the test 500 times and compute the maximum, mean, and minimum of the HTER. The Mb and Mc columns correspond to generalization performance.

We then compare the performance of our method and [2] in the cross-dataset test, and the corresponding results are reported in Table V. As in [2], we also consider two cases: training on the ImageNet validation dataset [3] and testing on the Oxford building dataset [46] (called Oxbuild, which consists of 5,063 images); and training on Oxbuild and testing on the ImageNet validation dataset. For each case, we consider three colorization methods: Ma [21], Mb [22], and Mc [1]. The results of [2] are obtained by testing on 2,000 images whose exact indexes remain unknown, while those of our method are obtained by testing on 78,200 images for the ImageNet validation dataset [3] and 10,126 images for Oxbuild [46]. It can be observed from Table V that the classification performance of the cross-dataset test of our method is much better than that of the two methods proposed in [2]. In addition, the statistical results of multiple tests of our method on 2,000 images (as in the previous experiment of Table IV) are consistently good as well (for the sake of brevity we do not present these results here). A possible reason for the good performance of our method on the cross-dataset test is that the CNN model can to some extent capture the essential difference between NIs and CIs from the data and decrease the potential interference of image content.

Dataset          | ImageNet → Oxbuild    | Oxbuild → ImageNet
                 | Ma     Mb     Mc      | Ma     Mb     Mc
FCID-HIST [2]    | 22.85  21.50  30.95   | 43.45  30.75  36.60
FCID-FE [2]      | 51.40  22.70  20.20   | 49.80  30.25  23.15
DecNet           | 1.15   0.11   1.73    | 2.04   1.88   2.85
TABLE V: Comparison of the HTER (in %, lower is better) of our method with that of the state-of-the-art methods [2] ("FCID-HIST" and "FCID-FE") in the cross-dataset test. Note that the results of [2] are obtained by testing on 2,000 images, while those of our method are obtained by testing on 78,200 images for the ImageNet validation dataset [3] and 10,126 images for Oxbuild [46]. "ImageNet → Oxbuild" means training on the ImageNet validation dataset and testing on Oxbuild, and vice versa.

IV-E Qualitative Analysis and Misclassified Cases

In the following, we first conduct qualitative analysis through the visualization of the convolutional kernels and feature maps of well-trained models. Fig. 8 visualizes the convolutional kernels of the first layer of our network. For conventional computer vision classification tasks, the appearance of the first-layer filters of CNN models well trained on natural images is often common, in the sense that some of these filters are similar to Gabor filters and others resemble color blobs [4, 47, 48]. By contrast, the first-layer kernels of our method are almost all color sensitive and have no apparent orientation, which is reasonable for a network designed for detecting colorized images. In other words, the filters of the first layer of CNNs for computer vision tasks tend to take the image content as clues for image classification while our network rather considers the local color patterns as useful information for the classification of NIs and CIs.
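Fig. 8-style kernel inspection can be reproduced with a few lines (a sketch assuming the trained model exposes its first convolution as model.conv1; here each 3x3x3 filter is displayed as a single RGB patch rather than split per channel as in Fig. 8):

import matplotlib.pyplot as plt

weights = model.conv1.weight.detach().cpu()      # shape (32, 3, 3, 3) for a k3n32 first layer
w = (weights - weights.min()) / (weights.max() - weights.min())   # normalize for display
fig, axes = plt.subplots(4, 8, figsize=(8, 4))
for ax, kernel in zip(axes.flat, w):
    ax.imshow(kernel.permute(1, 2, 0).numpy())   # one small RGB patch per filter
    ax.axis("off")
plt.show()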

The visualization of feature maps is often used to gain intuition about CNN models [47, 49]. Fig. 9 visualizes the first 64 feature maps of the two branches at conv4. The left groups of three columns [(a), (c), and (e)] correspond to NIs and the right groups [(b), (d), and (f)] correspond to CIs. We can see that there is an obvious difference between the feature maps of NIs and CIs for each branch, for example, in Fig. 9(a) and (b). In addition, differences also exist between the two branches for each test image. These differences imply that our network learns discriminative and enriched features through the two-branch architecture. We observe that some feature maps in the middle and right of Fig. 9(b) have a strong response corresponding to the right region of the CI in Fig. 9(b) (i.e., the border of the building, highlighted by a red rectangle), where the CI has slight color bleeding. A similar phenomenon can also be found in the boundary region between grass and dog in Fig. 9(f) (highlighted by two red rectangles), although here the color bleeding is to some extent masked by textures when compared with (b). For the middle and right of (e), i.e., the feature maps of the two branches, a high response in the grass is observed due to the richness and naturalness of the color, whereas a low response is observed in the counterparts of (f) because the color is relatively monotonous in the CI. Similarly, some feature maps of the new branch [the middle of Fig. 9(c)] have a high response for the gate and grass of the NI in (c), whereas a low response is found for (d). To summarize, these observations give the hint that our network can capture some useful clues, such as color bleeding and monotonous colors, for the classification of NIs and CIs.

Fig. 8: Visualization of the convolutional kernels of the first layer of our network. The filters are organized in groups of three (in columns) corresponding to the three color channels R, G and B. Brighter pixels stand for larger values.
Fig. 9: Visualization of feature maps of conv4. We visualize the first 64 feature maps for each branch. The left groups of three columns correspond to NIs, and the right groups correspond to CIs. Each group consists of the RGB image (left), the feature map of the new branch (middle), and that of the base network (right). Hotter color stands for stronger activation. Red rectangle highlights the color bleeding region (best viewed with zooming on a big screen).

Finally, several misclassified examples of our method are shown in Fig. 10. In (a), the color of the first image is rather unsaturated, and the second image is sea blue over most of its area and somewhat monotonous; consequently, our network misclassifies these NIs as CIs. Conversely, the first image in (b) has relatively saturated colors and clear boundaries, and the second image has plausible and rich colors, e.g., in the grass and on the tank. These cues may have misled the classification decision of our network.

V Concluding Remarks

In this paper, we proposed an “end-to-end” framework based on a convolutional neural network to distinguish between natural images and colorized images. We first designed a base CNN model, which outperformed state-of-the-art methods (in most cases) in terms of both classification and generalization performance. Afterwards, we designed and added a new branch to the base network, leading to a CNN with an enhanced architecture and enriched features. This well-designed network improves not only the classification accuracy but also the generalization performance. Furthermore, we considered the challenging blind detection scenario and proposed an effective method based on negative sample insertion to further improve the generalization capability of our CNN model. Consequently, the generalization performance of our network is clearly and stably improved, at the cost of only a very slight decrease in classification accuracy. We plan to share the source code of our method with the research community.

Fig. 10: Misclassified cases. (a): NIs misclassified as CIs, and (b): CIs misclassified as NIs.

In the future, we plan to apply our two-branch network architecture to other image forensic problems, where enriched features may help improve the forensic performance, and to optimize the architecture in a task-adaptive manner. For example, we could rigorously optimize some meta-parameters, such as the number of feature maps (width) and layers (depth) of each stage of the CNN, for each task. We also plan to employ the proposed negative-sample-based enhanced training to improve the generalization performance of other kinds of forensic methods whenever applicable. Concerning the classification of natural and colorized images and related problems, it would be interesting to further improve the CNN architecture by modeling high-level semantic information or imitating the human perception process of the given task. Furthermore, an attractive research direction is to explore other approaches to understanding and enhancing the generalization performance of neural networks.

Acknowledgment

We would like to thank Dr. G. Larsson, Dr. R. Zhang and Dr. S. Iizuka for kindly sharing the source code of their colorization algorithms respectively described in [21], [22] and [1], Dr. L. van der Maaten for making the t-SNE tool [44] publicly available, and Dr. Y. Guo for detailed and helpful discussions about their work [2].

References

  • [1] S. Iizuka, E. Simo-Serra, and H. Ishikawa, “Let there be color!: joint end-to-end learning of global and local image priors for automatic image colorization with simultaneous classification,” ACM Transactions on Graphics, vol. 35, no. 4, pp. 1–11, 2016.
  • [2] Y. Guo, X. Cao, W. Zhang, and R. Wang, “Fake colorized image detection,” IEEE Transactions on Information Forensics and Security, vol. 13, no. 8, pp. 1932–1944, 2018.
  • [3] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and F.-F. Li, “ImageNet: A large-scale hierarchical image database.” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248–255.
  • [4] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Proceedings of the Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
  • [5] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 580–587.
  • [6] G. Huang, Z. Liu, L. v. d. Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2261–2269.
  • [7] Y. Qian, J. Dong, W. Wang, and T. Tan, “Deep learning for steganalysis via convolutional neural networks,” in Proceedings of the IS&T/SPIE Electronic Imaging, vol. 9409, 2015, pp. 94 090J1–94 090J10.
  • [8] J. Chen, X. Kang, Y. Liu, and Z. J. Wang, “Median filtering forensics based on convolutional neural networks,” IEEE Signal Processing Letters, vol. 22, no. 11, pp. 1849–1853, 2015.
  • [9] J. Ye, J. Ni, and Y. Yi, “Deep learning hierarchical representations for image steganalysis,” IEEE Transactions on Information Forensics and Security, vol. 12, no. 11, pp. 2545–2557, 2017.
  • [10] W. Quan, K. Wang, D.-M. Yan, and X. Zhang, “Distinguishing between natural and computer-generated images using convolutional neural networks,” IEEE Transactions on Information Forensics and Security, vol. 13, no. 11, pp. 2772–2787, 2018.
  • [11] B. Bayar and M. C. Stamm, “Constrained convolutional neural networks: A new approach towards general purpose image manipulation detection,” IEEE Transactions on Information Forensics and Security, vol. 13, no. 11, pp. 2691–2706, 2018.
  • [12] A. Levin, D. Lischinski, and Y. Weiss, “Colorization using optimization,” ACM Transactions on Graphics, vol. 23, no. 3, pp. 689–694, 2004.
  • [13] Q. Luan, F. Wen, D. Cohen-Or, L. Liang, Y.-Q. Xu, and H.-Y. Shum, “Natural image colorization,” in Proceedings of the Eurographics Conference on Rendering Techniques, 2007, pp. 309–320.
  • [14] K. Xu, Y. Li, T. Ju, S.-M. Hu, and T.-Q. Liu, “Efficient affinity-based edit propagation using K-D tree,” ACM Transactions on Graphics, vol. 28, no. 5, pp. 118:1–118:6, 2009.
  • [15] X. Chen, D. Zou, Q. Zhao, and P. Tan, “Manifold preserving edit propagation,” ACM Transactions on Graphics, vol. 31, no. 6, pp. 132:1–132:7, 2012.
  • [16] J. Pang, O. C. Au, K. Tang, and Y. Guo, “Image colorization using sparse representation,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2013, pp. 1578–1582.
  • [17] T. Welsh, M. Ashikhmin, and K. Mueller, “Transferring color to greyscale images,” ACM Transactions on Graphics, vol. 21, no. 3, pp. 277–280, 2002.
  • [18] R. Irony, D. Cohen-Or, and D. Lischinski, “Colorization by example,” in Proceedings of the Eurographics Conference on Rendering Techniques, 2005, pp. 201–210.
  • [19] R. K. Gupta, A. Y.-S. Chia, D. Rajan, E. S. Ng, and Z. Huang, “Image colorization using similar images,” in Proceedings of the ACM International Conference on Multimedia, 2012, pp. 369–378.
  • [20] Z. Cheng, Q. Yang, and B. Sheng, “Deep colorization,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 415–423.
  • [21] G. Larsson, M. Maire, and G. Shakhnarovich, “Learning representations for automatic colorization,” in Proceedings of the European Conference on Computer Vision, 2016, pp. 577–593.
  • [22] R. Zhang, P. Isola, and A. A. Efros, “Colorful image colorization,” in Proceedings of the European Conference on Computer Vision, 2016, pp. 649–666.
  • [23] Y. Yan, W. Ren, Y. Guo, R. Wang, and X. Cao, “Image deblurring via extreme channels prior,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6978–6986.
  • [24] K. He, J. Sun, and X. Tang, “Single image haze removal using dark channel prior,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 12, pp. 2341–2353, 2011.
  • [25] A. Tuama, F. Comby, and M. Chaumont, “Camera model identification with the use of deep convolutional neural networks,” in Proceedings of the IEEE International Workshop on Information Forensics and Security, 2016, pp. 1–6.
  • [26] L. Bondi, L. Baroffio, D. Güera, P. Bestagini, E. J. Delp, and S. Tubaro, “First steps toward camera model identification with convolutional neural networks,” IEEE Signal Processing Letters, vol. 24, no. 3, pp. 259–263, 2017.
  • [27] M. Barni, L. Bondi, N. Bonettini, P. Bestagini, A. Costanzo, M. Maggini, B. Tondi, and S. Tubaro, “Aligned and non-aligned double JPEG detection using convolutional neural networks,” Journal of Visual Communication and Image Representation, vol. 49, pp. 153–163, 2017.
  • [28] I. Amerini, T. Uricchio, L. Ballan, and R. Caldelli, “Localization of JPEG double compression through multi-domain convolutional neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017, pp. 1865–1871.
  • [29] N. Rahmouni, V. Nozick, J. Yamagishi, and I. Echizen, “Distinguishing computer graphics from natural images using convolution neural networks,” in Proceedings of the IEEE International Workshop on Information Forensics and Security, 2017, pp. 1–6.
  • [30] J. Bunk, J. H. Bappy, T. M. Mohammed, L. Nataraj, A. Flenner, B. S. Manjunath, S. Chandrasekaran, A. K. Roy-Chowdhury, and L. Peterson, “Detection and localization of image forgeries using resampling features and deep learning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017, pp. 1881–1889.
  • [31] J. H. Bappy, A. K. Roy-Chowdhury, J. Bunk, L. Nataraj, and B. S. Manjunath, “Exploiting spatial structure for localizing manipulated image regions,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 4980–4989.
  • [32] G. Xu, H. Z. Wu, and Y.-Q. Shi, “Structural design of convolutional neural networks for steganalysis,” IEEE Signal Processing Letters, vol. 23, no. 5, pp. 708–712, 2016.
  • [33] L. Pibre, J. Pasquet, D. Ienco, and M. Chaumont, “Deep learning is a good steganalysis tool when embedding key is reused for different images, even if there is a cover source-mismatch,” in Proceedings of the IS&T/SPIE Electronic Imaging, 2016, pp. 1–10.
  • [34] M. Chen, V. Sedighi, M. Boroumand, and J. Fridrich, “JPEG-phase-aware convolutional neural network for steganalysis of JPEG images,” in Proceedings of the ACM Workshop on Information Hiding and Multimedia Security, 2017, pp. 75–84.
  • [35] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
  • [36] S. Xie, R. B. Girshick, P. Dollár, Z. Tu, and K. He, “Aggregated residual transformations for deep neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5987–5995.
  • [37] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proceedings of the International Conference on Machine Learning, 2015, pp. 448–456.
  • [38] Y. T. Zhou and R. Chellappa, “Computation of optical flow using a neural network,” in Proceedings of the IEEE International Conference on Neural Networks, 1988, pp. 71–78.
  • [39] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning.   MIT Press, 2016, http://www.deeplearningbook.org.
  • [40] V. Nair and G. E. Hinton, “Rectified linear units improve restricted Boltzmann machines,” in Proceedings of the International Conference on Machine Learning, 2010, pp. 807–814.
  • [41] D. Opitz and R. Maclin, “Popular ensemble methods: An empirical study,” Journal of Artificial Intelligence Research, vol. 11, no. 1, pp. 169–198, 1999.
  • [42] T. G. Dietterich, “Ensemble methods in machine learning,” in Proceedings of the International Workshop on Multiple Classifier Systems, 2000, pp. 1–15.
  • [43] L. Rokach, “Ensemble-based classifiers,” Artificial Intelligence Review, vol. 33, no. 1, pp. 1–39, 2010.
  • [44] L. van der Maaten and G. E. Hinton, “Visualizing high-dimensional data using t-SNE,” Journal of Machine Learning Research, vol. 9, pp. 2579–2605, 2008.
  • [45] PyTorch. (visited on 2019-01-15). [Online]. Available: https://pytorch.org/
  • [46] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman, “Object retrieval with large vocabularies and fast spatial matching,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2007, pp. 1–8.
  • [47] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in Proceedings of the European Conference on Computer Vision, 2014, pp. 818–833.
  • [48] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, “How transferable are features in deep neural networks?” in Proceedings of the Advances in Neural Information Processing Systems, 2014, pp. 3320–3328.
  • [49] J. Yosinski, J. Clune, A. Nguyen, T. Fuchs, and H. Lipson, “Understanding neural networks through deep visualization,” in Proceedings of the International Conference on Machine Learning Workshop on Deep Learning, 2015, pp. 1–12.