I Introduction
Digital watermarking was originally introduced in 1979 for anti-counterfeiting purposes [1], to distinguish between original and counterfeit documents. Since then it has been applied to the identification of image ownership and the protection of intellectual property by hiding data such as logos and proprietary information in images, video and audio [2]. Another application is patient identification and medical procedure matching by hiding patients' personal information in their medical images [3]. Other proposed applications of watermarking include broadcast monitoring [4], copy control [5], device control [6] and legacy enhancement [7]. The best-known challenge in watermarking is that a watermarked image, which contains hidden data, is vulnerable to image processing algorithms for enhancement, to transformations such as image compression and format conversion, and to undesired artifacts such as transmission noise. Furthermore, watermarked images are prone to intentional attacks that strive to change or corrupt the hidden watermark data.
Despite the extensive amount of research devoted to these problems, robustness and imperceptibility are still the two key challenges in watermarking algorithms. In other words, one major concern in modern watermarking is to keep the hidden data as safe as possible in the presence of attacks (robustness), while introducing only subtle and undetectable changes during the watermarking process, so that the watermarked image is indistinguishable from the original image (imperceptibility). Another problem that has attracted a great deal of research over the last two decades is the blindness of watermarking algorithms, which increases their complexity and may negatively affect their robustness and imperceptibility. However, blind watermarking methods [8, 9] are preferable in practice to informed (non-blind) methods [10, 11], since informed methods require side information about the watermark, the cover image or the embedding parameters for extraction.
Nowadays the application of machine learning tools in watermarking is growing rapidly, because they offer effective solutions for the embedding and extraction processes [12, 13, 14, 15, 16, 17]. Nevertheless, most of these works use machine learning tools such as Support Vector Machine (SVM) [18], Support Vector Regression (SVR) [19], Radial Basis Function Neural Network (RBFNN) [20], and K-Nearest Neighbor (KNN) [21] only for specific parts of the watermarking procedure, such as parameter optimization [12, 13], prediction of transform domain coefficients [14, 15, 16] and attack estimation [17]. Among all machine learning tools, deep networks and Convolutional Neural Networks (CNNs) have gained the most widespread attention in a large variety of computer vision applications such as pattern recognition [22], image classification [23] and object detection [24]. Very recently, a few works have emerged on the application of deep networks in watermarking [25, 26, 27]. Kandi et al. [25] proposed CNN-based autoencoder structures to hide watermark data in their feature maps. However, their watermarking method is non-blind, and a predefined embedding algorithm is applied for embedding in the autoencoder feature maps. In [26], an end-to-end watermarking network is introduced. However, the watermark data is embedded in single blocks of the image, which leads to a uniform local embedding similar to traditional methods.
Parallel to Zhu et al. [27], who have presented a unified system for watermarking and steganography based on CNNs and Generative Adversarial Networks [28], in this paper we introduce an end-to-end blind watermarking framework (ReDMark) using Fully Convolutional Neural Networks (FCNs), which surpasses the state of the art (including [27]) in terms of robustness and imperceptibility. The proposed system consists of two FCNs for embedding and extraction, along with a differentiable attack layer that simulates well-known attacks. Introducing a differentiable attack layer as part of the network makes an end-to-end training scheme feasible. It also leads to robust watermarking, because the network is trained in the presence of attacks. To this end, we suggest a differentiable approximate model for the JPEG attack with an adjustable quality factor.
Dominant watermarking approaches use fixed methods such as swapping coefficients in a transform domain. In contrast, ReDMark is capable of learning several embedding patterns/masks in different transform domains and in the presence of various attacks. Consequently, the network explores suitable solutions customized for the chosen transform domain and the required attacks. Moreover, only the constructed network can embed and extract the watermark data based on the discovered watermarking patterns. Hence, the proposed system introduces a secure method for hiding the watermark data, so that recognition or replacement of the secret data along the communication channel is not easy.
Another important characteristic of the suggested system, which leads to improved security and robustness, is its capability to diffuse watermark data over a relatively wide area of the image. In other words, the network explores diffusion watermarking masks to share watermarking data among several image blocks, rather than simply swapping coefficients in a single block. Thus, the watermarked image demonstrates strong robustness against several heavy attacks. Even if a meaningful part of the image is corrupted or removed, the extraction network is still able to extract the hidden watermark.
The last notable feature of ReDMark is a strength factor for controlling the strength of the watermark patterns within the image, made possible by the structure of the embedding network, which is inspired by ResNet [29]. This feature enables us to control the trade-off between robustness and imperceptibility depending on the situation and application requirements.
In summary, our major contributions in the introduced deep watermarking network include: 1. Proposing a residual watermarking framework with specialized robustness against several specific attacks. 2. Introducing a strength factor tuner for controlling the trade-off between robustness and imperceptibility. 3. Introducing a new differentiable approximation of the JPEG attack with any quality factor. 4. Proposing a novel diffusion watermarking framework built on circular convolutional layers, which leads to strong robustness against various attacks.
The remainder of the paper is arranged as follows: In section II we review related works in the literature. Technical details of the proposed framework are discussed in section III. The experimental results are presented in section IV. Finally, we conclude the paper with a short discussion in section V.
II Related Work
Since the advent of digital watermarking, an enormous amount of research has been invested in developing new watermarking schemes with improved capacity, robustness, imperceptibility, fidelity and security. Early methods embedded the watermark data in the spatial domain by directly manipulating image pixels to represent watermark data. Embedding watermark bits in the least significant bits (LSB) of image pixels is an example of such methods [30]. To improve robustness, the mainstream approach in the literature is to embed the watermark data in a transform domain by manipulating specific transform coefficients. Some popular transform domains suggested for watermarking are DCT [31], Wavelet [32], Hadamard [33], Contourlet [34] or a mixture of transforms [35]. Sadreazami et al. [36] proposed a multiplicative embedding method in the Contourlet domain, which requires statistical analysis for data extraction. They model the behavior of the watermarked image as a normalized inverse Gaussian distribution and propose a maximum likelihood detector for data extraction. Makbol et al. [37] utilize the integer Wavelet transform and singular value decomposition for watermarking. Another work [38] suggested a reversible transform based on an overcomplete dictionary for watermarking. The authors in [39] introduced the quaternion Hadamard transform domain for watermarking in color images by using Schur decomposition. Liu et al. [40] proposed a transform domain called the fractional Krawtchouk transform for watermarking. They use the dither modulation method [41] for embedding in this domain.
Following the increased application of machine learning tools to various tasks, some researchers started to apply these tools to different parts of embedding and extraction in the watermarking process. For example, Heidari et al. [17] presented a framework for blind image watermarking by the redundant embedding of watermark data in multiple zones of the DCT spectrum. For extracting watermark data from the attacked image, they apply an SVM to recognize the least distorted zone of the spectrum. In [12], a KNN regression method is utilized for estimating the optimum value of the embedding strength parameter in the DCT domain, to improve robustness and imperceptibility. ZhiMing et al. [13] suggest using an RBFNN to optimize the embedding strength in blocks of the cover image. The embedding strength for each block is determined separately based on the block features obtained by the RBFNN. The authors in [14] use the Wavelet domain for watermarking. They predict some coefficients by SVR and use an embedding rule based on the predicted values versus the real ones. Likewise, in [15], Lagrangian support vector regression is utilized for the prediction of coefficients and the embedding process in the lifting Wavelet transform space [42]. Furthermore, a scaling factor, estimated by a genetic algorithm, is assigned to the embedding strength of each block. A similar embedding method is utilized by [16], which proposes an Extreme Learning Machine (ELM) for prediction in the DWT domain. In spite of the many proposals for applying machine learning tools in watermarking, none of the above-mentioned methods proposes a unified watermarking framework based on machine learning approaches. In other words, all of them apply machine learning tools to specific parts of the watermarking process.
Among all machine learning tools, CNNs have gained extreme popularity in the last decade. However, their use as an aid in the watermarking process is more recent. For example, Kandi et al. [25] applied two autoencoder CNN structures for feature extraction, to be used separately for positive and negative embedding. The same autoencoder networks are used at the receiver side to obtain the feature maps and extract the watermark data. In [26], two CNNs are trained for embedding and extraction in the presence of attacks. However, the watermarking networks are designed for hiding a one-bit watermark in a single block. The research most relevant to ours is the end-to-end trainable framework HiDDeN [27], which is proposed for data hiding in color images based on CNNs and GANs and may be applied to watermarking and steganography. A noise layer is proposed in the network for simulating attacks during the end-to-end training phase.
III Proposed Watermarking Framework
In this paper, we propose an adaptive diffusion watermarking framework (ReDMark) composed of two Fully Convolutional Networks with residual connections. The proposed networks are trained end-to-end to perform blind, secure watermarking of grayscale images in the desired transform space. The framework is customizable for the level of robustness vs. imperceptibility. It is also adjustable for the trade-off between capacity and robustness. The adaptive and flexible nature of the framework makes it easy to choose any linear transform domain for embedding the secret watermark, or to train the watermarking network for higher resistance to specific attacks. Employing a differentiable attack module as part of the network facilitates end-to-end training and yields watermarking that is robust against various attacks. We elaborate on the technical details of the system modules and their functionalities in section III-A and discuss the end-to-end training strategies in section III-B.
III-A Network Structure
Fig. 1 illustrates a block diagram of the system composed of three main modules: CNN for embedding the watermark, differentiable attack layer for simulating popular attacks, and CNN for extraction of the hidden watermark.
III-A1 Embedding Module
As shown in Fig. 2, the embedding network structure is composed of two transform layers and five convolutional layers. For a grayscale cover image and binary watermark data, the embedding network embeds $w \times h$ bits of watermark in a bigger cover image with $W \times H$ pixels. The embedding layers compute the watermarking mask/pattern within the transform domain. Then the residual mask is calculated in the spatial domain by the inverse transform and added to the original image, weighted by a Strength Factor. Further technical details of the pipeline are discussed below.
Space to Depth (Reshape)
Let us assume the cover image can be properly divided into blocks of size $M \times N$ and that each block hosts at least one watermark bit. Without loss of generality, we assume that the cover image has exactly the $w \times h$ blocks required for the watermark bit length. Hence, we reshape the cover image into a tensor of size $w \times h \times MN$. Each column of the generated tensor is the vectorized form of one image block.
Transform Layer
This layer implements a reversible linear transform that changes the representation basis of the image from the spatial domain to a new space such as a frequency domain. Although watermarking in the spatial domain provides higher capacity with lower complexity, it has been shown that watermarking in a transform domain is more secure and more robust against intentional or random attacks and image processing techniques [43]. To perform the watermarking process in a new domain, we need a transform layer and an inverse transform layer as the input and output interfaces of the embedding network. The suggested transform layers can be fixed to any standard transform such as DCT, wavelet or Hadamard. However, it is also possible to preassign an arbitrary transformation to the layer and let the network fine-tune the transform layer throughout the training process within a specific training strategy. Given the rearrangement of the cover image described in the previous paragraph, the transformation layer simplifies to a convolution layer, because the transform is applied to each block independently. Depending on the dimension of the new space, we need $K$ convolution masks of size $1 \times 1$, one per basis vector of the transformation ($K = MN$ for an invertible transform such as the DCT). Hence, every block is reshaped to a $1 \times 1 \times MN$ tensor and convolved with the $K$ transformation masks. The output of each of the $K$ masks is calculated by:
$$y_i = \mathbf{t}_i^{\top}\mathbf{x}, \quad i = 1, \dots, K, \qquad \mathbf{y} = \mathbf{T}\mathbf{x} \tag{1}$$
where $\mathbf{x}$ is the block tensor reshaped as a column vector, $\mathbf{t}_i$ is the vectorized convolution mask representing the weights of one neuron, and $y_i$ is the output of the $i$-th filter mask. $\mathbf{T}$ represents the transform matrix, whose rows contain the corresponding neuron weights, and $\mathbf{y}$ denotes the transformed feature space, i.e., the outputs of all neurons. There are no bias values for the neurons.
There is no obligation to use a fixed known transform in the transformation layers, i.e., any linear transform may be applied. To elaborate, the transform layers of the proposed framework may be initialized to any desired transform, or even to random values, and released for training during end-to-end training of the whole network. We only need to constrain the input and output transform layers of the embedding and extraction networks to be mutually inverse linear transforms. In this way we expect the network to seek new transform domains for watermarking. However, in this work, we fix the transformation layers to the DCT as a proof of concept. In the following two lemmas, we demonstrate special cases of Equation (1) for the DCT and Hadamard transforms.
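Lemma 1
For the DCT, the transform matrix $\mathbf{T}$ is obtained by the following equation:
$$T_{uN+v,\; mN+n} = c_M(u)\, c_N(v) \cos\frac{(2m+1)u\pi}{2M} \cos\frac{(2n+1)v\pi}{2N}, \qquad c_P(0) = \sqrt{1/P}, \quad c_P(k) = \sqrt{2/P} \ \text{for } k > 0 \tag{2}$$
Lemma 2
For the Hadamard transform, the transform matrix $\mathbf{T}$ is obtained as below:
$$\mathbf{T} = \frac{1}{\sqrt{MN}}\, \mathbf{H}_M \otimes \mathbf{H}_N, \qquad \mathbf{H}_1 = [1], \quad \mathbf{H}_{2P} = \begin{bmatrix} \mathbf{H}_P & \mathbf{H}_P \\ \mathbf{H}_P & -\mathbf{H}_P \end{bmatrix} \tag{3}$$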
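To make the reshape and transform steps concrete, the following is a minimal, dependency-free Python sketch under stated assumptions. It builds an orthonormal 2D DCT-II matrix for $8 \times 8$ blocks as a Kronecker product, splits a toy $32 \times 32$ image into a $4 \times 4$ grid of vectorized blocks, and applies the transform as a plain matrix-vector product. The function names (`dct_1d`, `transform_matrix`, `space_to_depth`), the row-major vectorization order and the toy gradient image are illustrative choices, not taken from the ReDMark implementation.

```python
import math

def dct_1d(n):
    """Orthonormal 1D DCT-II matrix of size n x n (rows are basis vectors)."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows

def transform_matrix(m, n):
    """2D DCT matrix acting on row-major vectorized m x n blocks (Kronecker product)."""
    dm, dn = dct_1d(m), dct_1d(n)
    return [[dm[u][i] * dn[v][j] for i in range(m) for j in range(n)]
            for u in range(m) for v in range(n)]

def matvec(mat, vec):
    """Matrix-vector product; each row plays the role of one 1x1 convolution mask."""
    return [sum(a * b for a, b in zip(row, vec)) for row in mat]

def transpose(mat):
    return [list(col) for col in zip(*mat)]

def space_to_depth(image, m, n):
    """Split an image (list of rows) into a grid of vectorized m x n blocks."""
    height, width = len(image), len(image[0])
    return [[[image[r + i][c + j] for i in range(m) for j in range(n)]
             for c in range(0, width, n)]
            for r in range(0, height, m)]

# Assumed toy input: a 32x32 gradient image, i.e. a 4x4 grid of 8x8 blocks.
image = [[(r * 32 + c) % 256 for c in range(32)] for r in range(32)]
blocks = space_to_depth(image, 8, 8)
T = transform_matrix(8, 8)
coeffs = matvec(T, blocks[0][0])          # forward transform of one block
restored = matvec(transpose(T), coeffs)   # orthonormal, so the inverse is the transpose
```

Because the matrix is orthonormal, the inverse transform layer can reuse the same weights transposed, which is one simple way to keep the two interface layers mutually inverse.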
Embedding layers
The output of the transformation layer is concatenated with the watermark image, forming the input tensor of the embedding network. This network is composed of circular and normal convolutional layers, which are responsible for embedding the watermark patterns into the transformed image blocks. As shown in Fig. 2, some layers of the embedding network perform circular convolution, which expands the receptive field of neurons in the final layers. Hence, this structure enables diffusion watermarking, in which the watermark data is shared and distributed among adjacent blocks. Furthermore, the embedded watermark added to each block of the cover image is a superposition of symbols in the wide receptive field, i.e., the block itself and its neighbors. This property improves the security and robustness of the proposed framework against several attacks.
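The growth of the receptive field under circular convolution can be illustrated with a small Python sketch. It is only an illustration under assumptions: the $2 \times 2$ all-ones kernel, the $4 \times 4$ block grid and the name `circular_conv2d` are hypothetical and do not describe the trained ReDMark layers.

```python
def circular_conv2d(grid, kernel):
    """Same-size 2D filtering with wrap-around (circular) borders."""
    h, w = len(grid), len(grid[0])
    kh, kw = len(kernel), len(kernel[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    rr = (r + i - kh // 2) % h   # wrap rows around the border
                    cc = (c + j - kw // 2) % w   # wrap columns around the border
                    acc += kernel[i][j] * grid[rr][cc]
            out[r][c] = acc
    return out

# Assumed toy setup: a 4x4 grid of blocks with one active watermark bit in a corner.
grid = [[0.0] * 4 for _ in range(4)]
grid[3][3] = 1.0
ones = [[1.0, 1.0], [1.0, 1.0]]            # hypothetical 2x2 kernel

once = circular_conv2d(grid, ones)
twice = circular_conv2d(once, ones)
reach_once = sum(v != 0 for row in once for v in row)    # blocks touched after one layer
reach_twice = sum(v != 0 for row in twice for v in row)  # blocks touched after two layers
```

In this toy case the single bit in the corner reaches blocks on the opposite border after one layer, and stacking a second layer widens the set of affected blocks further, while the grid size stays unchanged.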
Skip Connection and Strength Factor
As displayed in Fig. 2, the watermarked image is produced by summing the output of the embedding network with the original cover image. This structure guides the network to produce only the residual watermark data. It helps the embedding network to learn the additive watermarking symbols more efficiently, i.e., the network weights converge faster during the training stage. In addition, the proposed network structure allows the framework to incorporate a Strength Factor that adjusts the strength of the generated symbols before summation with the cover image. Adding this tuning knob to a trained network enables the system to amplify or attenuate the generated symbol in the watermarked image and to control the level of robustness vs. imperceptibility (PSNR/SSIM) according to our requirements. During network training, the strength factor is fixed to one.
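The skip connection with a strength factor amounts to a weighted sum, as the short Python sketch below shows. The name `apply_strength`, the parameter name `alpha` and the clipping to the 8-bit range are assumptions made for illustration; the source only states that the residual is scaled and added to the cover image.

```python
def apply_strength(cover, residual, alpha=1.0):
    """Return cover + alpha * residual, clipped to [0, 255] (clipping is an assumption)."""
    return [[min(255.0, max(0.0, c + alpha * r)) for c, r in zip(crow, rrow)]
            for crow, rrow in zip(cover, residual)]

# Hypothetical 1x2 image and residual produced by an embedding network.
cover = [[100, 250]]
residual = [[2, 10]]

weak = apply_strength(cover, residual, alpha=0.5)    # more imperceptible
strong = apply_strength(cover, residual, alpha=1.0)  # setting used during training
```

Setting `alpha` to zero returns the cover image unchanged, and larger values scale the residual linearly, which is why the factor can be tuned after training without touching the network weights.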
III-A2 Attack Layer
In this part, we elaborate on the structure of the attack layer shown in Fig. 1. The proposed framework simulates various attacks as a differentiable network layer to facilitate end-to-end training. Furthermore, keeping the attacks in the training loop guides the network to learn more robust watermarking patterns that resist real-world attacks in communication channels. A network trained in the presence of an attack will produce watermarked images that are more robust to that specific attack.
Interestingly, learning robust watermarking for specific attacks may lead to robustness against some other attacks, due to similarities in their nature. In this work, we train separate networks with well-known attacks and analyze their robustness against many other attacks. We also train a network with a mixture of attacks, which is shown to resist a wider range of attacks simultaneously. The attacks simulated for network training in this work are briefly explained below, and details of the training are discussed in section IV.
Noise Attack
This is a random white noise added to the watermarked image in every iteration of training. Because the noise sample is fixed within an iteration, the layer is a plain addition and the backpropagated loss signal passes through it unchanged. We may apply various noise types such as uniform noise and Gaussian noise. Likewise, salt & pepper noise follows the same logic, where one of the values 0 or 255 is assigned to random image pixels with a specific probability.
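The two noise attacks can be sketched in a few lines of plain Python. This is a minimal illustration, not the training code: the function names, the explicit random generator argument and the flat test image are assumptions.

```python
import random

def gaussian_noise(image, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation sigma to every pixel."""
    return [[p + rng.gauss(0.0, sigma) for p in row] for row in image]

def salt_and_pepper(image, prob, rng):
    """Replace each pixel by 0 or 255 with probability prob."""
    return [[rng.choice((0, 255)) if rng.random() < prob else p for p in row]
            for row in image]

rng = random.Random(0)                     # fixed seed so the sketch is repeatable
flat = [[128] * 32 for _ in range(32)]     # assumed 32x32 mid-gray image
noisy = gaussian_noise(flat, 3.0, rng)
speckled = salt_and_pepper(flat, 0.04, rng)
```

Both functions draw a fresh noise sample on every call, which mirrors the idea of re-sampling the noise in each training iteration.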
Random cropping attack
This improves watermarking performance in the presence of cropping attack since the network learns to redundantly embed watermark data in different regions. Random cropping is implemented by suppressing or turning off neurons in random block regions. The process is similar to dropout layers introduced in [44].
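A dropout-like cropping layer can be sketched as follows. The block size, the drop probability and the name `random_block_crop` are illustrative assumptions; the source only states that neurons in random block regions are suppressed.

```python
import random

def random_block_crop(image, block, drop_prob, rng):
    """Zero out each block x block region independently with probability drop_prob."""
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]        # copy so the input stays untouched
    for r0 in range(0, height, block):
        for c0 in range(0, width, block):
            if rng.random() < drop_prob:
                for r in range(r0, min(r0 + block, height)):
                    for c in range(c0, min(c0 + block, width)):
                        out[r][c] = 0
    return out

rng = random.Random(0)
image = [[7] * 16 for _ in range(16)]      # assumed 16x16 constant image
cropped = random_block_crop(image, 8, 0.5, rng)
```

Since a suppressed region is multiplied by zero and a kept region by one, gradients flow only through the surviving regions, as in a dropout layer.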
Smoothing Attack
We use a normalized unit mask (all-ones matrix) as the smoothing window, as shown in Equation (4).
$$\mathbf{K} = \frac{1}{k^2}\, \mathbf{1}_{k \times k} \tag{4}$$
where $\mathbf{1}_{k \times k}$ denotes the $k \times k$ all-ones matrix. Thus, the attack layer, in this case, is a simple convolutional neuron with constant weights. We may use any convolution mask for various FIR (Finite Impulse Response) filters, to simulate any band filtering attack. For example, the layer mask can be set to a Gaussian filter or a sharpening filter.
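As a small illustration of a smoothing layer with constant weights, the Python sketch below filters an image with a normalized all-ones mask. The helper names and the replicated-border handling are assumptions; the source does not specify how image borders are treated.

```python
def box_kernel(k):
    """Normalized k x k all-ones mask."""
    return [[1.0 / (k * k)] * k for _ in range(k)]

def filter2d(image, kernel):
    """Filter with a square odd-sized kernel; borders are replicated (assumption)."""
    h, w, k = len(image), len(image[0]), len(kernel)
    half = k // 2
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            acc = 0.0
            for i in range(k):
                for j in range(k):
                    rr = min(h - 1, max(0, r + i - half))
                    cc = min(w - 1, max(0, c + j - half))
                    acc += kernel[i][j] * image[rr][cc]
            out[r][c] = acc
    return out

# Assumed toy input: a single bright pixel in the middle of a 5x5 image.
impulse = [[0.0] * 5 for _ in range(5)]
impulse[2][2] = 9.0
smoothed = filter2d(impulse, box_kernel(3))
```

Swapping `box_kernel(3)` for another fixed mask, such as a Gaussian or sharpening kernel, gives the other FIR filtering attacks without changing the layer structure.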
JPEG Attack
All the attacks discussed so far are inherently differentiable and can be implemented directly as a network layer. However, JPEG coding involves some non-differentiable operations. Thus, we need a differentiable approximation of the process to simulate the JPEG attack in a network layer. JPEG compression includes the following steps: transferring image blocks to the DCT domain, dividing by a quantization matrix determined by a quality factor, and rounding the results to integer values. Similarly, JPEG decoding involves the inverse operations: multiplying the coefficients by the same quantization matrix, then transforming back to the spatial domain. A complete simulation of the JPEG attack, as shown in Fig. 3, consists of all the stages mentioned above: DCT transform, division by the quantization matrix, rounding, multiplication by the quantization matrix, and inverse DCT transform (see Transform Layer in section III-A1). Among these steps, the rounding operation is non-differentiable and needs to be approximated with a differentiable operation, to allow backpropagation of the training gradients. We simulate the rounding operation by adding uniform noise in the range [-0.5, 0.5]. Mathematical rounding is a special case of the suggested approximation; in other words, we are simulating a larger family of distortions than normal JPEG. Equation (5) demonstrates the proposed rounding simulation method:
$$\hat{\mathbf{Y}} = \left(\frac{\mathbf{Y}}{\mathbf{Q}} + \mathbf{U}\right) \odot \mathbf{Q} = \mathbf{Y} + \mathbf{Q} \odot \mathbf{U} \tag{5}$$
where $\mathbf{Y}$ is the watermarked image in the DCT domain, $\hat{\mathbf{Y}}$ is its approximated quantized version, $\mathbf{Q}$ is the quantization matrix of a required quality factor, $\mathbf{U}$ is the uniform noise in [-0.5, 0.5], and the division and the product $\odot$ are element-wise. The equation implies that the rounding effect appears more strongly for larger elements of the quantization matrix, which lie towards higher frequencies. This guides the network to gradually reduce the embedding strength in the higher frequencies of the transform domain.
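The relation between true rounding and the uniform-noise surrogate can be checked numerically with a short Python sketch. The coefficient values and quantization steps below are made up for illustration, and the function names are hypothetical; only the idea of replacing rounding by noise in [-0.5, 0.5] comes from the text.

```python
import random

def quantize_exact(coeffs, steps):
    """Reference JPEG-style quantization: divide, round, multiply back."""
    return [round(c / q) * q for c, q in zip(coeffs, steps)]

def quantize_noisy(coeffs, steps, rng):
    """Differentiable surrogate: rounding replaced by uniform noise in [-0.5, 0.5]."""
    return [(c / q + rng.uniform(-0.5, 0.5)) * q for c, q in zip(coeffs, steps)]

rng = random.Random(0)
coeffs = [231.4, -57.9, 12.3, -4.8, 3.1, 0.7]   # hypothetical DCT coefficients
steps = [16.0, 11.0, 10.0, 16.0, 24.0, 40.0]    # illustrative quantization steps
exact = quantize_exact(coeffs, steps)
noisy = quantize_noisy(coeffs, steps, rng)

# Both distortions stay within half a quantization step of the input.
exact_err = [abs(e - c) for e, c in zip(exact, coeffs)]
noisy_err = [abs(n - c) for n, c in zip(noisy, coeffs)]
```

In both cases the error is bounded by half the quantization step, so coefficients with large steps (typically the high frequencies) are perturbed the most, which matches the behavior described above.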
Mixture Of Attacks
Multiple attacks may be combined in the attack layer to train a watermarking network that is robust to the mixture of chosen attacks. The training procedure, in this case, is slightly different, as in each training iteration the network randomly selects one of the attacks with a given probability. Hence, the backpropagated gradients are passed through the selected attack layer. The switching mechanism of the multi-attack layer is illustrated in Fig. 4. As shown in the figure, we model the random selection of attacks as a roulette wheel which assigns a probability to each attack. Thus, in each training iteration, only one type of attack is allowed to pass through the multiplexer and affect the training loss.
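The roulette-wheel switch can be sketched with the standard library as follows. The attack names and the equal probabilities are placeholders chosen for illustration; in a training loop, the selected name would index the corresponding attack layer.

```python
import random

def pick_attack(attacks, probs, rng):
    """Roulette-wheel selection: draw one attack according to the given probabilities."""
    return rng.choices(attacks, weights=probs, k=1)[0]

attacks = ["salt_pepper", "gaussian", "jpeg", "smoothing"]   # placeholder names
probs = [0.25, 0.25, 0.25, 0.25]                             # assumed equal weights
rng = random.Random(0)

# Simulate the per-iteration choice over many training iterations.
counts = {name: 0 for name in attacks}
for _ in range(4000):
    counts[pick_attack(attacks, probs, rng)] += 1
```

Over many iterations each attack is selected roughly in proportion to its weight, so every attack contributes to the gradients even though only one is active per iteration.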
III-A3 Extraction Module
As illustrated in Fig. 5, the extraction module is structurally simpler than the embedding network. This module is supposed to extract watermark data from the input image. Since the watermark data is embedded in the transform domain, the extraction module incorporates a copy of the transform layer utilized for the embedding module, to represent the watermarked image in the same basis. Other network layers learn to extract the watermark data in the transform domain.
III-B Network Training and Evaluation Metrics
As can be seen in Fig. 1, the trainable network modules (embedding and extraction units) are trained together within an end-to-end setup that includes both the trainable and the non-trainable network layers. The main training objectives are to establish embedding and extraction networks which can generate safe, high-quality watermarked images and robustly recover the hidden data from the watermarked images. In other words, each network has an independent objective function. The embedding network is supposed to generate a watermarked image with maximum quality and minimum distortion compared to the original image. The extraction network, in turn, is responsible for maximizing the extraction rate of the hidden watermark, or equivalently minimizing the Bit Error Rate (BER), as defined by Equation (6).
$$\mathrm{BER} = \frac{1}{L} \sum_{i=1}^{L} \lvert b_i - \hat{b}_i \rvert \tag{6}$$
where $b_i$ are the bits of the original binary watermark and $\hat{b}_i$ are the bits of the extracted watermark. $L$ represents the watermark string length. These $L$ watermark bits may be embedded redundantly in the cover image. Redundant embedding is discussed further in section IV-B.
We utilize two metrics for evaluating the image quality in the training and test stages. The Structural Similarity Index (SSIM) is a perceptual metric employed as a training loss function to quantify the degradation of image quality caused by the watermarking process and transmission. SSIM estimates the structural variation between two images through Equation (7):
$$\mathrm{SSIM}(I_c, I_w) = \frac{(2\mu_c\mu_w + c_1)(2\sigma_{cw} + c_2)}{(\mu_c^2 + \mu_w^2 + c_1)(\sigma_c^2 + \sigma_w^2 + c_2)} \tag{7}$$
where $I_c$ is the cover image, $I_w$ is the watermarked image, $\mu_c$ and $\mu_w$ are the mean values of $I_c$ and $I_w$ respectively, $\sigma_c^2$ and $\sigma_w^2$ represent their variances, and $\sigma_{cw}$ is the covariance of $I_c$ and $I_w$. In this equation, $c_1$ and $c_2$ are two constants of the metric, which are held fixed in our experiments.
For comparing the watermarking quality of ReDMark (the imperceptibility of the produced watermarked images) against state-of-the-art competitors, we use the well-known PSNR metric (Peak Signal-to-Noise Ratio), as shown by Equation (8):
$$\mathrm{PSNR} = 10 \log_{10} \frac{W H \cdot \mathrm{MAX}^2}{\sum_{i=1}^{W} \sum_{j=1}^{H} \left(I_c(i,j) - I_w(i,j)\right)^2} \tag{8}$$
where $W$ and $H$ are the image dimensions, and $\mathrm{MAX}$ stands for the maximum value of image pixels (for grayscale images, $\mathrm{MAX} = 255$).
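The three evaluation metrics can be sketched in plain Python as follows. This is a simplified illustration under assumptions: `ssim_global` computes the statistics over the whole image in a single window, whereas SSIM is usually averaged over local windows, and the default constants `c1` and `c2` are commonly used values for 8-bit images, not necessarily those used in the experiments.

```python
import math

def ber(original, extracted):
    """Fraction of watermark bits that differ."""
    return sum(a != b for a, b in zip(original, extracted)) / len(original)

def _flatten(image):
    return [p for row in image for p in row]

def psnr(cover, marked, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    a, b = _flatten(cover), _flatten(marked)
    mse = sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
    return math.inf if mse == 0 else 10.0 * math.log10(max_val ** 2 / mse)

def ssim_global(cover, marked, c1=(0.01 * 255) ** 2, c2=(0.03 * 255) ** 2):
    """Single-window SSIM (simplification; c1 and c2 are assumed defaults)."""
    a, b = _flatten(cover), _flatten(marked)
    n = len(a)
    mu_a, mu_b = sum(a) / n, sum(b) / n
    var_a = sum((x - mu_a) ** 2 for x in a) / n
    var_b = sum((y - mu_b) ** 2 for y in b) / n
    cov = sum((x - mu_a) * (y - mu_b) for x, y in zip(a, b)) / n
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2))

# Hypothetical 2x2 cover image and a watermarked version differing by one gray level.
cover = [[100, 110], [120, 130]]
marked = [[101, 109], [121, 129]]
quality_db = psnr(cover, marked)
similarity = ssim_global(cover, marked)
errors = ber([1, 0, 1, 1], [1, 1, 1, 0])
```

A one-gray-level difference at every pixel gives a mean squared error of one, so the PSNR in this toy case equals $10\log_{10}(255^2)$, roughly 48.13 dB.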
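Based on the above discussion, for the end-to-end training of the multi-objective network, we employ a weighted combination of the embedding and extraction loss functions, as shown in Equation (9).
$$\mathcal{L} = \lambda \mathcal{L}_1 + (1 - \lambda) \mathcal{L}_2 \tag{9}$$
where $\lambda$ is the ratio of losses and $\mathcal{L}_1$ and $\mathcal{L}_2$ are the loss functions of the embedding and extraction networks. $\mathcal{L}_1$ measures the imperceptibility or quality of the watermarked image, while $\mathcal{L}_2$ represents the watermark extraction rate and robustness. We use the SSIM metric for $\mathcal{L}_1$ and binary cross entropy for $\mathcal{L}_2$:
$$\mathcal{L}_1 = 1 - \mathrm{SSIM}(I_c, I_w), \qquad \mathcal{L}_2 = -\sum_{i=1}^{L} \left[ b_i \log p_i + (1 - b_i) \log(1 - p_i) \right] \tag{10}$$
where $I_c$ and $I_w$ are the cover and watermarked images, $b_i$ are the original watermark bits, and the values $p_i$ are the outputs of the extraction network, representing the probability of the watermark bits. Watermark data is generated by thresholding these values.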
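A weighted combination of an SSIM-based quality loss and a binary cross-entropy extraction loss can be sketched as below. The exact forms are assumptions made for illustration: the quality loss is taken as one minus SSIM, the cross entropy is summed over bits, the weight multiplies the quality term, and the 0.5 threshold is a conventional choice.

```python
import math

def bce(bits, probs, eps=1e-7):
    """Binary cross entropy summed over watermark bits (probabilities are clamped)."""
    total = 0.0
    for b, p in zip(bits, probs):
        p = min(1.0 - eps, max(eps, p))
        total -= b * math.log(p) + (1 - b) * math.log(1.0 - p)
    return total

def total_loss(ssim_value, bits, probs, weight=0.75):
    """weight * (1 - SSIM) + (1 - weight) * BCE; the weight placement is an assumption."""
    return weight * (1.0 - ssim_value) + (1.0 - weight) * bce(bits, probs)

def threshold(probs, t=0.5):
    """Turn extraction probabilities into hard watermark bits."""
    return [1 if p >= t else 0 for p in probs]

bits = [1, 0, 1, 1]                       # hypothetical watermark
probs = [0.9, 0.2, 0.6, 0.4]              # hypothetical extraction outputs
loss = total_loss(0.98, bits, probs)
decoded = threshold(probs)
```

In this toy example the last bit is decoded incorrectly because its probability falls below the threshold, and that bit also dominates the cross-entropy term of the loss.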
Since there is a trade-off between the two loss functions (imperceptibility vs. robustness), the process of network training is a multi-objective optimization. The flow of gradients from the last layers of the extraction network to the first layers of the embedding network implies that only the gradients of the extraction loss backpropagate through the extraction network. However, for the embedding layers at the head of the pipeline, the gradients of the combined loss backpropagate to train the embedding network weights.
IV Experimental Results
The proposed framework is implemented in TensorFlow [45] and executed on an NVIDIA GeForce® GTX 1080 Ti. We use the CIFAR-10 [46] and Pascal VOC [47] datasets for training the network. The dataset of 49 standard test images from the University of Granada [48] is used for most of the numerical analysis and comparative evaluations. We demonstrate the superiority of ReDMark by comparing it with two state-of-the-art watermarking systems [27], [38]. For a fair comparison against the concurrent work of Zhu et al. [27], we evaluate our system on the COCO dataset [49].
To demonstrate the capabilities of our framework, we train three different networks with various attacks: 1) the Gaussian-Trained Network (GTNet) is trained under the Gaussian noise attack ($\sigma = 3$), 2) the JPEG-Trained Network (JTNet) is trained under the JPEG attack (quality = 70), 3) the Multi-Attack-Trained Network (MTNet) is trained in the presence of multiple attacks with equal probabilities, including salt & pepper (4%), Gaussian noise ($\sigma = 3$), JPEG (quality = 70), and a mean smoothing filter ($3 \times 3$).
Implementation details and the network configuration are discussed in section IV-A. Training strategies and the use of the trained networks are discussed in section IV-B. We analyze the imperceptibility and robustness of the trained networks under several attacks in sections IV-C and IV-D, then compare the results to state-of-the-art watermarking algorithms in section IV-E. Finally, some watermarking patterns of the trained networks and their technical characteristics are discussed in section IV-F. The source code of the framework will be uploaded to GitHub shortly.
IV-A Network Configurations
In our experiments, we set the block size to $8 \times 8$ and use training image patches of size $32 \times 32$. So we have 16 blocks per input image, which are used for embedding a $4 \times 4$ watermark pattern. We use the DCT transform domain in our experiments; however, any other linear transform may be applied. The embedding network (as shown in Fig. 2) includes two interface transform layers, implemented by convolutional masks, that perform the change-of-basis DCT operation and the inverse DCT transform. The embedding network between the transform layers is composed of one convolution and four circular convolutional layers with Exponential Linear Unit (ELU) activation [50]. All the layers of the embedding network contain 64 convolution filters. The basic structure of the extraction network is very similar to the embedding module. As demonstrated in Fig. 2, the first layer is a transform layer composed of 64 DCT filter masks ($1 \times 1$ convolutions). Then we have another convolution layer and three convolution layers with 64 filters per layer and ELU activation functions. A final convolutional layer consisting of one neuron with sigmoid activation generates the watermark probability map. A threshold is applied to this output to produce the final watermark. Strides of all filters in the embedding and extraction modules are set to one. The width and height of images throughout the network are constant, as a result of using circular convolution.
IV-B Experimental Setup
For the training process, the CIFAR-10 [46] and Pascal VOC2012 [47] datasets are combined to form our training set of cover images. CIFAR-10 consists of 60,000 tiny RGB images ($32 \times 32$), divided into training and test sets of 50,000 and 10,000 images. We combine the 50,000 CIFAR-10 training images with the Pascal dataset and convert them to grayscale to be used as cover images for training. Since the Pascal dataset contains large, high-resolution images, we extract $32 \times 32$ patches from random positions of the dataset images. We want our dataset to contain a range of smooth to high-frequency patterns for better training. Hence, from the set of generated patches we select a subset in which all intensity variances are equally represented. The final training set is a combination of the two datasets containing around 334K grayscale image patches. In the training phase, we assign random watermarks to image patches in every iteration. In this way, we avoid biasing the network towards a specific watermark pattern. We utilize the stochastic gradient descent algorithm for optimization and training. Some training configurations and parameters are shown in Table I. The training time for single-attack networks is about 9 hours with 1,000,000 iterations. For multi-attack training, the number of iterations is doubled and, consequently, training takes twice as long as for a single-attack network.
Table I: Training configuration and parameters
Parameter | Value | Description
(M,N) | (8,8) | Size of network blocks
(W,H) | (32,32) | Size of training image patches
(w,h) | (4,4) | Size of watermark
Iteration No. | 1000000 | Training iteration number
LR | | Learning rate
mo | 0.98 | Momentum of optimization
 | 0.75 for GTNet and JTNet, 0.5 for MTNet | Relative loss function weights
For evaluating the trained networks, we embed 1024-bit ($32 \times 32$) watermarks in $512 \times 512$ grayscale images. The watermark is embedded with four-fold redundancy, as the cover image contains $64 \times 64$ image blocks. For this purpose, a $64 \times 64$-bit redundant plane is formed, so that each watermark bit is repeated four times in a regular pattern in this plane. We refer to it as the watermark plane. Then the cover image is partitioned into $32 \times 32$ subimages, and the watermark plane is partitioned into $4 \times 4$ subwatermarks to be fed to the embedding network. The embedding network produces watermarked subimages, which are tiled to form the watermarked image. The watermarked image is then passed through several attacks to simulate real-world conditions. For the extraction phase, we follow the same protocol. First, the attacked watermarked image is partitioned into subimages. These subimages are fed into the extraction network, which extracts patches of the watermark plane. Then these patches are tiled to form a redundant watermark plane. Finally, a voting procedure is applied to the corresponding bits to produce the 1024-bit watermark data.
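The redundant watermark plane and the final voting step can be sketched in plain Python. The $2 \times 2$ tiling used as the regular repetition pattern, the tie-breaking rule and the function names are assumptions made for illustration; the source only states that each bit is repeated four times and that corresponding bits are combined by voting.

```python
import random

def make_plane(bits2d, reps=2):
    """Tile the watermark reps x reps times (assumed repetition pattern)."""
    return [list(row) * reps for _ in range(reps) for row in bits2d]

def majority_vote(plane, h, w):
    """Vote over all copies of each bit; ties resolve to 1 (assumption)."""
    voted = []
    for r in range(h):
        row = []
        for c in range(w):
            copies = [plane[rr][cc]
                      for rr in range(r, len(plane), h)
                      for cc in range(c, len(plane[0]), w)]
            row.append(1 if 2 * sum(copies) >= len(copies) else 0)
        voted.append(row)
    return voted

rng = random.Random(0)
watermark = [[rng.randint(0, 1) for _ in range(32)] for _ in range(32)]  # 1024 bits
plane = make_plane(watermark)        # 64 x 64 redundant plane, four copies per bit

# Simulate extraction errors that hit a single copy of two different bits.
plane[0][0] ^= 1
plane[40][7] ^= 1
recovered = majority_vote(plane, 32, 32)
```

With four copies per bit, an error in a single copy is outvoted by the three remaining copies, which is the benefit of the redundancy described above.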
IV-C Quantitative Results
To analyze the trained networks (GTNet, JTNet, and MTNet), we test them on all 49 images in the Granada dataset [48]. The imperceptibility of each watermarked image is reported in terms of PSNR and SSIM. Tables II and III give the PSNR and SSIM of the watermarked images produced by all the trained networks with three different strength factors. The PSNR and SSIM of a single image are calculated by averaging over 20 watermarked images with different random watermarks. We follow the same process for all 49 images of the Granada dataset and report the average values as the network performance. To demonstrate the robustness of the proposed networks, the BER of the extracted watermarks under several attacks is calculated for three different strength factors. As with PSNR and SSIM, all the numeric results in both tables are averages over all images in the test dataset with 20 random watermarks per image.
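For reference, two of the metrics above can be computed as in the following sketch, which uses the standard definitions of PSNR (for 8-bit images, peak value 255) and bit error rate. The function names are illustrative. SSIM is omitted because it needs a longer implementation.

```python
import numpy as np

def psnr(cover, marked, peak=255.0):
    """Peak signal-to-noise ratio in dB between a cover image and its watermarked version."""
    diff = np.asarray(cover, dtype=np.float64) - np.asarray(marked, dtype=np.float64)
    mse = np.mean(diff ** 2)
    return float("inf") if mse == 0 else float(10.0 * np.log10(peak ** 2 / mse))

def ber(embedded_bits, extracted_bits):
    """Bit error rate: fraction of watermark bits that differ after extraction."""
    return float(np.mean(np.asarray(embedded_bits) != np.asarray(extracted_bits)))
```

Following the protocol in the text, each per-image value would be the mean over 20 random watermarks, and the reported figure the mean over the 49 test images.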
Robustness of the networks is tested for three levels of each attack. The symbol N in the tables stands for Nonrobust and is used when the BER value is around . Gaussian noise is applied with three different standard deviations. The parameter in the salt and pepper and cropping attacks is the percentage of changed pixels. To show the strength of the proposed networks in diffusion watermarking and data sharing, a new attack called Grid cropping is introduced, in which random blocks throughout the image are suppressed to zero. For the Gaussian blur and sharpening (/unmask) attacks, the attack parameter is the filter's radius. The Median filter parameter is the filter mask size. In the resizing attack, the image is resized by the shown scale and then resized back to its original size using bilinear interpolation. As with cropping and salt and pepper, we conduct the grid cropping attack at three different levels, representing the percentage of suppressed blocks. The BER results in Table III show that even if a substantial number of image blocks are cropped, we can still extract the majority of the watermark data. The next rarely used attack is the patterned-pixel-elimination attack, in which text is written on the watermarked image. The attack parameter is the number of text lines written in “Natural script” handwriting with a font size of 40. Like Grid cropping, this attack highlights the network's capability to diffuse the watermark data throughout the cover image.
Tables II and III demonstrate that MTNet exhibits overall better robustness than the other two networks, at the expense of lower PSNR and SSIM, as expected. This is similar in spirit to a general belief in Multi-Task Learning (MTL) [51], which holds that training one neural network for multiple similar tasks leads to better overall performance than training a separate network for every single task. As shown in Tables II and III, with some exceptions, MTNet achieves a better extraction rate (lower BER) than the other two networks. This holds even for the attacks that JTNet and GTNet are trained for (JPEG attack and Gaussian noise attack).
Another important characteristic of the system is that the robustness of the watermarking networks can be controlled in two ways. During training, when the network confronts more powerful attacks, the embedding module embeds stronger watermark symbols in the image. This stronger embedding yields more robustness at the expense of lower PSNR. In addition, in a trained network we can still control this trade-off by tuning the watermarking strength factor. As shown in Tables II and III, for a given attack and a fixed network, increasing the strength factor results in lower BER values.
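Two of the attacks described above, Grid cropping and salt and pepper noise, can be sketched as follows. The block size of the Grid cropping attack is not given in this text, so the default of 8×8 is an assumption, as are the function names and the choice of changing an exact number of pixels or blocks.

```python
import numpy as np

def grid_crop(image, fraction, block=8, rng=None):
    """Set a given fraction of randomly chosen block x block tiles to zero."""
    rng = np.random.default_rng() if rng is None else rng
    out = image.copy()
    by, bx = image.shape[0] // block, image.shape[1] // block
    n = int(round(fraction * by * bx))
    for idx in rng.choice(by * bx, size=n, replace=False):
        r, c = divmod(int(idx), bx)
        out[r * block:(r + 1) * block, c * block:(c + 1) * block] = 0
    return out

def salt_and_pepper(image, fraction, rng=None):
    """Overwrite a given fraction of pixels with 0 or 255, chosen at random."""
    rng = np.random.default_rng() if rng is None else rng
    out = image.copy()
    n = int(round(fraction * image.size))
    idx = rng.choice(image.size, size=n, replace=False)
    out.flat[idx] = rng.choice(np.array([0, 255]), size=n).astype(image.dtype)
    return out
```

Because Grid cropping removes whole blocks, a watermark bit survives only if its information is also present in neighboring blocks, which is why the text uses this attack to probe diffusion.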
Tables II and III only display the average outcomes of the proposed algorithm on the Granada dataset. For further details about the network behavior, the boxplots of Fig. 6 display the range and variation of PSNR and SSIM over all the watermarked images. In Fig. 6, each box corresponds to a strength factor and shows the distribution of SSIM and PSNR over all of the test images. Each row in Fig. 6 corresponds to one of the proposed networks. Increasing the strength factor lowers both the PSNR and SSIM values.
IV-D Qualitative Results



For visualization purposes, the watermarked images produced by the three networks with various strength factors are illustrated in Fig. 7. In Fig. 7, a random 128-bit watermark is embedded in the Barbara image from the Granada dataset [48] using GTNet, JTNet, and MTNet. The absolute difference between the cover image and the watermarked image is shown to reveal the watermark pattern. For better visualization, this difference is multiplied by 10. Furthermore, a small area of the image is magnified. The difference images show that the amplitudes of the produced artifacts vary across different areas of the cover image. These variations imply that the watermark symbols are embedded adaptively, based on the local features of the image.
Fig. 8 demonstrates some attacks on Barbara. For each attack, the attack level/parameter is displayed in parentheses. The extraction BERs for the attacked images are also reported using MTNet with .



IV-E Comparison With State-of-the-Art
In this section, we compare the performance of our networks against [27] and [38]. The authors of the HiDDeN framework [27] use COCO [49] for their experiments. For a fair comparison, we match our testing conditions to those of HiDDeN, i.e., 30-bit random watermarks are embedded in color images. Hence, we embed redundantly in all channels of the YUV space. In each channel, the strength factor is adjusted so that the PSNRs of the two systems are similar. Cropout and Dropout attacks are applied according to their definitions. According to Table IV, although ReDMark is not trained on cropout and dropout attacks, its BER values under these two attacks are comparable to those of HiDDeN.
The other competitor is Random Matching Pursuit [38]. Since its authors report BER results for the JPEG attack, we only present the relevant results for this attack. The comparisons are conducted with the two networks JTNet and MTNet against their best results on the Granada dataset [48], which they use to report their results. 1024-bit random watermarks are embedded into the dataset images, and BER values are reported. The strength factor of the networks is set so that the SSIM matches that of the competitor. The BER values of the extracted watermarks are thus compared at similar watermark quality in terms of the SSIM metric, while our PSNR is still better. Table V shows that ReDMark outperforms Random Matching Pursuit in terms of robustness (BER).
IV-F Diffusion pattern / data sharing
In this section, we demonstrate the ability of our networks to share and diffuse data among neighboring blocks. In our framework, we use circular convolution to avoid zero-padding of the feature maps and thereby enhance the watermark strength and robustness. Fig. 9 demonstrates how circular convolution leads to further diffusion and data sharing at the image borders. Whenever the circular convolution mask (white window) in Fig. 9 is on a border, it sweeps over the opposite block edge. Consequently, all the neurons share the watermark data equally, and the degrading effect of zero-padding on watermarking is avoided.
To investigate the watermark patterns generated in neighboring blocks, we design a simple experiment. A watermark mask is formed with only one nonzero bit. Then we examine the patterns produced by the embedding networks on a constant cover image in which all the pixels are set to 128. However, this pattern alone does not give enough insight without considering the effect of the embedded zeros, because the networks embed separate symbols for 0 and 1. Thus, in the second step, we repeat the experiment with a watermark mask of all zero bits. We call the difference between the watermarked images produced in these two experiments the diffusion pattern. The produced patterns are invariant to the location of the embedded nonzero bit, i.e., the same block patterns are shifted according to the location of the embedded 1. The diffusion patterns of the three trained networks are illustrated in Fig. 10.
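The experiment can be mimicked with a toy model. In the sketch below, `toy_embed` is a hypothetical linear embedder built from one circular convolution. It is not the trained network and only serves to show why circular convolution makes the diffusion pattern shift with the position of the embedded 1. The 32×32 cover, 4×4 watermark and 8×8 blocks follow Table I. Everything else is an assumption.

```python
import numpy as np

def circular_conv2d(x, kernel):
    """Circular 2-D convolution via the FFT: the kernel wraps around the borders."""
    padded = np.zeros(x.shape)
    padded[:kernel.shape[0], :kernel.shape[1]] = kernel
    return np.real(np.fft.ifft2(np.fft.fft2(x) * np.fft.fft2(padded)))

def toy_embed(cover, wm, kernel, block=8):
    """Hypothetical embedder: one +1/-1 symbol per block, spread by `kernel`."""
    symbols = np.zeros(cover.shape)
    symbols[::block, ::block] = 2.0 * wm - 1.0  # separate symbols for bits 0 and 1
    return cover + circular_conv2d(symbols, kernel)

def diffusion_pattern(embed, bit_pos, wm_shape=(4, 4), cover_shape=(32, 32)):
    """Difference between embedding a single 1 and embedding all zeros in a flat cover."""
    cover = np.full(cover_shape, 128.0)
    one_hot = np.zeros(wm_shape)
    one_hot[bit_pos] = 1.0
    return embed(cover, one_hot) - embed(cover, np.zeros(wm_shape))
```

In this toy model, moving the nonzero bit from position (0, 0) to (1, 2) circularly shifts the pattern by 8 rows and 16 columns, which is the shift behavior described in the text. With zero-padding instead, patterns near the border would be truncated.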
As illustrated in the diffusion patterns, employing circular convolution scatters the watermark data across the whole image, i.e., the watermark bits are diffused across the image block. The second row of Fig. 10 shows the frequency energy curves of the diffusion patterns, which are calculated by summing the absolute values of the DCT coefficients of all sixteen blocks of the diffusion pattern. The accumulated DCT coefficients are arranged on the horizontal axis in the zigzag order of the DCT block, i.e., starting from the DC coefficient and moving towards the highest frequency. A closer look at the diffusion patterns and their frequency energy curves in the DCT domain shows that GTNet embeds the watermark in high-frequency coefficients of the cover image, whereas JTNet embeds in low-frequency coefficients. MTNet, however, employs a more distributed embedding strategy with more concentration on the low- and middle-frequency bands. It is worth mentioning that the frequency curves of the diffusion patterns are invariant to the location of the embedded 1. As demonstrated in Fig. 10, all the networks have learned to avoid embedding in the DC coefficient, due to its destructive effects on image quality and PSNR.
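A frequency energy curve of this kind can be computed as in the sketch below. It assumes an orthonormal 8×8 DCT-II and the JPEG-style zigzag order. The normalization and the function names are assumptions, since the text does not specify them.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix C, so that C @ X @ C.T is the 2-D DCT of X."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)
    return c

def zigzag_order(n):
    """(row, col) pairs of an n x n block in JPEG-style zigzag order."""
    cells = [(r, c) for r in range(n) for c in range(n)]
    # Even anti-diagonals run towards the top right, odd ones towards the bottom left.
    return sorted(cells, key=lambda rc: (rc[0] + rc[1],
                                         rc[1] if (rc[0] + rc[1]) % 2 == 0 else rc[0]))

def frequency_energy_curve(pattern, block=8):
    """Sum |DCT coefficients| over all blocks and list the sums in zigzag order."""
    c = dct_matrix(block)
    energy = np.zeros((block, block))
    for r in range(0, pattern.shape[0], block):
        for col in range(0, pattern.shape[1], block):
            energy += np.abs(c @ pattern[r:r + block, col:col + block] @ c.T)
    return np.array([energy[i, j] for i, j in zigzag_order(block)])
```

For a 32×32 diffusion pattern the loop visits sixteen 8×8 blocks and returns 64 values. Index 0 is the DC coefficient, so a curve that starts near zero corresponds to a network that avoids DC embedding.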
V Conclusion
In this paper, we presented ReDMark, an adaptive diffusion watermarking framework composed of two Fully Convolutional Neural Networks with residual connections. The networks were trained end-to-end to perform blind, secure watermarking within any desired linear transform domain. The framework can be customized for the desired level of robustness vs. imperceptibility through parameters that are tunable during both training and testing of the networks. The proposed framework simulates various attacks as differentiable network layers to facilitate end-to-end training. For instance, a differentiable approximation of the JPEG attack was developed, which largely improves the watermarking robustness to this attack. In this work, we presented three network instances of ReDMark, each trained under different attacks. We demonstrated the different behavior of the trained networks, each of which embeds in particular spectral regions as a result of its training strategy. An important characteristic of the suggested system, which leads to improved security and robustness, is its capability to diffuse/share watermark data over a relatively wide area of the cover image. We visually illustrated this data sharing and diffusion watermarking using diffusion patterns. Furthermore, we proposed two attacks to demonstrate experimentally the effect of this characteristic on watermarking robustness. Comparative results against recent state-of-the-art works demonstrate the superiority of ReDMark in terms of imperceptibility and robustness.
References
 [1] W. Szepanski, “A signal theoretic method for creating forgeryproof documents for automatic verification,” vol. 101, no. 109, p. 368, 1979.
 [2] I. Cox, M. Miller, J. Bloom, J. Fridrich, and T. Kalker, Digital watermarking and steganography. Morgan kaufmann, 2007.
 [3] A. Shehab, M. Elhoseny, K. Muhammad, A. K. Sangaiah, P. Yang, H. Huang, and G. Hou, “Secure and robust fragile watermarking scheme for medical images,” IEEE Access, vol. 6, pp. 10 269–10 278, 2018.
 [4] Y. Cheng, “Music database retrieval based on spectral similarity,” 2nd Int. Symposium on Music Information Retrieval (ISMIR), Oct., 2001, 2001.
 [5] A. Patrizio, “Why the DVD hack was a cinch,” Wired News, http://www.wired.com/news/technology/0,1282,32263,00.html, 1999.
 [6] R. S. Broughton and W. C. Laumeister, “Interactive video method and apparatus,” 1989.
 [7] M. Hagmüller, H. Hering, A. Kröpfl, and G. Kubin, “Speech watermarking for air traffic control,” Watermark, vol. 8, no. 9, p. 10, 2004.
 [8] F. N. Thakkar and V. K. Srivastava, “A blind medical image watermarking: DWTSVD based robust and secure approach for telemedicine applications,” Multimedia Tools and Applications, vol. 76, no. 3, pp. 3669–3697, 2017.
 [9] O. Benrhouma, H. Hermassi, A. A. A. ElLatif, and S. Belghith, “Chaotic watermark for blind forgery detection in images,” Multimedia Tools and Applications, vol. 75, no. 14, pp. 8695–8718, 2016.
 [10] D. G. Savakar and A. Ghuli, “Nonblind digital watermarking with enhanced image embedding capacity using DMeyer wavelet decomposition, SVD, and DFT,” Pattern Recognition and Image Analysis, vol. 27, no. 3, pp. 511–517, 2017. [Online]. Available: http://link.springer.com/10.1134/S1054661817030257
 [11] G. Anbarjafari and C. Ozcinar, “Imperceptible nonblind watermarking and robustness against tone mapping operation attacks for high dynamic range images,” Multimedia Tools and Applications, vol. 77, no. 18, pp. 24 521–24 535, 2018.
 [12] A. M. Abdelhakim and M. Abdelhakim, “A timeefficient optimization for robust image watermarking using machine learning,” Expert Systems with Applications, vol. 100, pp. 197–210, 2018.
 [13] Z. ZhiMing, L. RongYan, and W. Lei, “Adaptive watermark scheme with RBF neural networks,” in Neural Networks and Signal Processing, 2003. Proceedings of the 2003 International Conference on, vol. 2. IEEE, 2003, pp. 1517–1520.
 [14] L. Sanping, Z. Yusen, and Z. Hui, “A waveletdomain watermarking technique based on support vector regression,” in Grey Systems and Intelligent Services, 2007. GSIS 2007. IEEE International Conference on. IEEE, 2007, pp. 1112–1116.
 [15] R. Mehta, N. Rajpal, and V. P. Vishwakarma, “Robust image watermarking scheme in lifting wavelet domain using GALSVR hybridization,” International Journal of Machine Learning and Cybernetics, pp. 1–17, 2015.
 [16] R. P. Singh, N. Dabas, V. Chaudhary, and Others, “Online sequential extreme learning machine for watermarking in DWT domain,” Neurocomputing, vol. 174, pp. 238–249, 2016.
 [17] M. Heidari, S. Samavi, S. M. R. Soroushmehr, S. Shirani, N. Karimi, and K. Najarian, “Framework for robust blind image watermarking based on classification of attacks,” Multimedia Tools and Applications, vol. 76, no. 22, pp. 23 459–23 479, 2017.

 [18] E. Pasolli, F. Melgani, D. Tuia, F. Pacifici, and W. J. Emery, “SVM active learning approach for image classification using spatial information,” IEEE Transactions on Geoscience and Remote Sensing, vol. 52, no. 4, pp. 2217–2233, 2014.
 [19] M. Narwaria and W. Lin, “Objective image quality assessment based on support vector regression.” IEEE transactions on neural networks / a publication of the IEEE Neural Networks Council, vol. 21, no. 3, pp. 515–9, 2010. [Online]. Available: http://www.ncbi.nlm.nih.gov/pubmed/20100674

 [20] W. W. Y. Ng, A. Dorado, D. S. Yeung, W. Pedrycz, and E. Izquierdo, “Image classification with the use of radial basis function neural networks and the minimization of the localized generalization error,” Pattern Recognition, vol. 40, no. 1, pp. 19–32, 2007.
 [21] L. Ma, M. M. Crawford, and J. Tian, “Local manifold learning-based k-nearest-neighbor for hyperspectral image classification,” IEEE Transactions on Geoscience and Remote Sensing, vol. 48, no. 11, pp. 4099–4109, 2010.
 [22] Z. Zheng, L. Zheng, and Y. Yang, “A discriminatively learned cnn embedding for person reidentification,” ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), vol. 14, no. 1, p. 13, 2017.
 [23] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely Connected Convolutional Networks.” vol. 1, no. 2, p. 3, 2017.
 [24] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Advances in neural information processing systems, 2015, pp. 91–99.
 [25] H. Kandi, D. Mishra, and S. R. S. Gorthi, “Exploring the learning capabilities of convolutional neural networks for robust image watermarking,” Computers & Security, vol. 65, pp. 247–268, 2017.
 [26] S.M. Mun, S.H. Nam, H.U. Jang, D. Kim, and H.K. Lee, “A Robust Blind Watermarking Using Convolutional Neural Network,” arXiv preprint arXiv:1704.03248, 2017.
 [27] J. Zhu, R. Kaplan, J. Johnson, and L. Fei-Fei, “HiDDeN: Hiding Data With Deep Networks,” arXiv preprint arXiv:1807.09937, 2018.
 [28] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in neural information processing systems, 2014, pp. 2672–2680.
 [29] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
 [30] R.Z. Wang, C.F. Lin, and J.C. Lin, “Image hiding by optimal LSB substitution and genetic algorithm,” Pattern recognition, vol. 34, no. 3, pp. 671–683, 2001.
 [31] S. A. Parah, J. A. Sheikh, N. A. Loan, and G. M. Bhat, “Robust and blind watermarking technique in DCT domain using interblock coefficient differencing,” Digital Signal Processing, vol. 53, pp. 11–24, 2016.
 [32] A. K. Singh, M. Dave, and A. Mohan, “Robust and secure multiple watermarking in wavelet domain,” Journal of medical imaging and health informatics, vol. 5, no. 2, pp. 406–414, 2015.
 [33] Z. Pakdaman, S. Saryazdi, and H. NezamabadiPour, “A prediction based reversible image watermarking in Hadamard domain,” Multimedia Tools and Applications, vol. 76, no. 6, pp. 8517–8545, 2017.
 [34] M. Rabizadeh, M. Amirmazlaghani, and M. AhmadianAttari, “A new detector for contourlet domain multiplicative image watermarking using Bessel K form distribution,” Journal of Visual Communication and Image Representation, vol. 40, pp. 324–334, 2016.
 [35] H. Fazlali, S. Samavi, N. Karimi, and S. Shirani, “Adaptive blind image watermarking using edge pixel concentration,” Multimedia Tools and Applications, vol. 76, no. 2, pp. 3105–3120, 2017.
 [36] H. Sadreazami, M. O. Ahmad, and M. N. S. Swamy, “Multiplicative watermark decoder in contourlet domain using the normal inverse Gaussian distribution,” IEEE Transactions on Multimedia, vol. 18, no. 2, pp. 196–207, 2016.
 [37] N. M. Makbol, B. E. Khoo, T. H. Rassem, and K. Loukhaoukha, “A new reliable optimized image watermarking scheme based on the integer wavelet transform and singular value decomposition for copyright protection,” Information Sciences, vol. 417, pp. 381–400, 2017.
 [38] G. Hua, L. Zhao, H. Zhang, G. Bi, and Y. Xiang, “Random Matching Pursuit for Image Watermarking,” IEEE Transactions on Circuits and Systems for Video Technology, 2018.
 [39] J. Li, C. Yu, B. B. Gupta, and X. Ren, “Color image watermarking scheme based on quaternion Hadamard transform and Schur decomposition,” Multimedia Tools and Applications, vol. 77, no. 4, pp. 4545–4561, 2018.
 [40] X. Liu, G. Han, J. Wu, Z. Shao, G. Coatrieux, and H. Shu, “Fractional Krawtchouk transform with an application to image watermarking,” IEEE Transactions on Signal Processing, vol. 65, no. 7, pp. 1894–1908, 2017.
 [41] B. Chen and G. W. Wornell, “Quantization index modulation: A class of provably good methods for digital watermarking and information embedding,” IEEE Transactions on Information Theory, vol. 47, no. 4, pp. 1423–1443, 2001.
 [42] I. Daubechies and W. Sweldens, “Factoring wavelet transforms into lifting steps,” Journal of Fourier analysis and applications, vol. 4, no. 3, pp. 247–269, 1998.
 [43] P. Dabas and K. Khanna, “A study on spatial and transform domain watermarking techniques,” International journal of computer applications, vol. 71, no. 14, 2013.
 [44] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.

 [45] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, and Others, “Tensorflow: a system for large-scale machine learning.” in OSDI, vol. 16, 2016, pp. 265–283.
 [46] A. Krizhevsky, V. Nair, and G. Hinton, “The CIFAR-10 dataset,” online: http://www.cs.toronto.edu/~kriz/cifar.html, 2014.
 [47] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results,” http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html.
 [48] “Dataset of standard 512×512 grayscale test images.” [Online]. Available: http://decsai.ugr.es/cvg/CG/base.htm
 [49] T.Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in European conference on computer vision. Springer, 2014, pp. 740–755.
 [50] D.A. Clevert, T. Unterthiner, and S. Hochreiter, “Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs),” arXiv preprint arXiv:1511.07289, 2015. [Online]. Available: http://arxiv.org/abs/1511.07289
 [51] Y. Xue, X. Liao, L. Carin, and B. Krishnapuram, “Multitask learning for classification with dirichlet process priors,” Journal of Machine Learning Research, vol. 8, no. Jan, pp. 35–63, 2007.
Appendix A
Proof:
If is the two dimensional DCT transform of , it can be written as:
(11) 
By reshaping the matrices and to vectors of length , the element of the vector can be written by change of variables of , , and .
(12) 
in which is the element of the vector and so:
(13) 
in which:
(14) 
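The equations of this proof are not legible in this version of the text, so the following is only a numerical sketch of what the derivation appears to establish: the 2-D DCT of a block, after reshaping the block to a vector, becomes a single matrix-vector product. The notation (`C` for the 1-D orthonormal DCT-II matrix, `A` for the combined matrix, row-major reshaping) is an assumption and may differ from the symbols used by the authors.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal 1-D DCT-II matrix of size n x n."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)
    return c

def dct2_as_matrix(n):
    """Matrix A with A @ X.reshape(-1) == (C @ X @ C.T).reshape(-1) for row-major reshaping."""
    c = dct_matrix(n)
    # A[u * n + v, i * n + j] = C[u, i] * C[v, j], i.e. the Kronecker product of C with itself.
    return np.kron(c, c)
```

Under these assumptions `A` is orthogonal, so the inverse transform is `A.T`, and both directions can be realized as fixed linear layers of a network.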
Proof:
Hadamard transform of an block is defined by , where is the Hadamard matrix. Elements of the transformed matrix are calculated by:
(15) 
where
(16) 
(17) 
Equation (A) is reshaped by change of variables , , and , as below:
(18) 
where:
(19) 
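As with the DCT case, the equations are not legible here, so the sketch below only checks numerically the claim that the derivation seems to make: a 2-D Hadamard transform of a block equals one matrix-vector product on the reshaped block. It assumes the unnormalized Sylvester construction of the Hadamard matrix and row-major reshaping. The normalization used by the authors may differ.

```python
import numpy as np

def hadamard(n):
    """Unnormalized Sylvester Hadamard matrix; n must be a power of two."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h

def hadamard2_as_matrix(n):
    """Matrix B with B @ X.reshape(-1) == (H @ X @ H).reshape(-1) for row-major reshaping."""
    h = hadamard(n)
    # H is symmetric, so H @ X @ H.T == H @ X @ H and the Kronecker identity applies.
    return np.kron(h, h)
```

Because `H @ H.T` equals `n` times the identity, `B @ B.T` equals `n**2` times the identity, so the inverse transform is `B.T / n**2` under this normalization.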