An Automated and Robust Image Watermarking Scheme Based on Deep Neural Networks

Digital image watermarking is the process of embedding and extracting a watermark covertly on a cover-image. To dynamically adapt image watermarking algorithms, deep learning-based image watermarking schemes have attracted increased attention during recent years. However, existing deep learning-based watermarking methods neither fully apply the fitting ability to learn and automate the embedding and extracting algorithms, nor achieve the properties of robustness and blindness simultaneously. In this paper, a robust and blind image watermarking scheme based on deep learning neural networks is proposed. To minimize the requirement of domain knowledge, the fitting ability of deep neural networks is exploited to learn and generalize an automated image watermarking algorithm. A deep learning architecture is specially designed for image watermarking tasks, which will be trained in an unsupervised manner to avoid human intervention and annotation. To facilitate flexible applications, the robustness of the proposed scheme is achieved without requiring any prior knowledge or adversarial examples of possible attacks. A challenging case of watermark extraction from phone camera-captured images demonstrates the robustness and practicality of the proposal. The experiments, evaluation, and application cases confirm the superiority of the proposed scheme.



I Introduction

Digital image watermarking refers to the process of embedding and extracting information covertly on a cover-image. The data (i.e., the watermark) is hidden into a cover-image to create a to-be-transmitted marked-image. The marked-image does not visually reveal the watermark, and only the authorized recipients can extract the watermark information correctly. The techniques of image watermarking can be applied for various applications. Based on different target scenarios, the watermark information can be presented in different forms; for example, the watermark can be some random bits or electronic signatures for image protection and authentication [berghel1996protecting], or some hidden messages for covert communication [cox1999watermarking]. In addition, the watermark can be encoded for different purposes, such as added security with encryption methods or restoring information integrity with error correction codes during a cyberattack [shih2017digital].

For copyright protection, classic image watermarking research [cox1997secure] focuses only on single-bit extraction, where the output indicates whether an image contains a watermark or not. To enable a wide range of applications, modern image watermarking research primarily focuses on multi-bit scenarios that extract the entire communicative watermark information [shih2017digital, cox2007digital]. Typically, many factors should be considered in an image watermarking scheme, such as the fidelity of the marked-image and the watermark's undetectability to computer analysis. The proposed image watermarking scheme not only satisfies these factors but also treats robustness as its priority: the watermark should survive even if the marked-image is degraded or distorted. Ideally, a robust image watermarking scheme keeps the watermark intact under a designated class of distortions without the assistance of any other technique. However, in practice, the watermark is extracted only approximately in many attack scenarios, and various encoding methods can be applied for its restoration [kang2003dwt]. Achieving robustness is a major challenge in a blind image watermarking scheme, where the extraction must be performed without any information about the original cover-image.

Due to the limited scope of manual design, the traditional image watermarking methods encounter difficulties; for example, extraction can only tolerate certain types of distortions in robust watermarking schemes, or the watermark itself can only resist a limited range of computer analysis in undetectable watermarking schemes [craver1998resolving]. To overcome these drawbacks, incorporating deep learning into image watermarking has attracted increased attention in recent years [kandi2017exploring]. Deep learning, as a representation learning method, has enabled significant improvements in computer vision through its ability to fit and generalize complex features. The major advantage of deep learning methodologies for image watermarking is that they perform image watermarking in a more adaptive manner, dynamically learning algorithms that extract both high- and low-level features from large collections of image instances.

Recent research on image watermarking tasks with deep neural networks has emerged [vukotic2018deep, li2019novel, fierro2019robust, papernot2016limitations], but there still exist challenging issues. For example, it is difficult to fully utilize the fitting ability of deep neural networks to automatically learn and generalize both the watermark embedding and extracting processes. Also, labeling the ground truth for an image watermarking task can be ill-defined or time-consuming. Finally, achieving robustness and blindness simultaneously without prior knowledge of adversarial examples [papernot2016limitations] remains unexplored.

To address the above challenges, we present an automated, blind, and robust image watermarking scheme based on deep learning neural networks. The contribution of this paper is threefold. First, the fitting ability of deep neural networks is exploited to automatically learn image watermarking algorithms, facilitating an automated system without requiring domain knowledge. Second, the proposed deep learning architecture can be trained in an unsupervised manner to reduce human intervention, which is suitable for image watermarking. Finally, experimental results demonstrate the robustness and accuracy of the proposed scheme without using any prior knowledge or adversarial examples of possible attacks.

The remainder of this paper is organized as follows. The related work is described in Section II. The proposed scheme is presented in Section III. Experiments and analysis are presented in Section IV. Applications of the proposed watermarking scheme are discussed in Section V. Finally, conclusions are drawn in Section VI.

II Related Work

This section provides a detailed analysis of recent reports. Table I shows the analytical comparison of our proposed scheme with the state-of-the-art deep learning–based image watermarking schemes.

In handcrafted watermarking algorithms, various optimization methods have been applied to adapt the embedding parameters, and this research direction has attracted attention in recent years [huang2019enhancing, su2018snr, chen2018novel]. Consequently, exploring the optimization ability of deep learning models for adaptive and automated image watermarking is of great interest. However, compared to the significant advances in image steganography with deep neural networks [wang2018sstegan, baluja2017hiding], deep learning–based image watermarking is still in its infancy.

Kandi et al. [kandi2017exploring] applied two convolutional autoencoders to reconstruct a cover-image. In a marked-image, the pixels produced by the first autoencoder indicate bits with the value of zero, and the pixels produced by the second autoencoder indicate bits with the value of one, hence developing a non-blind binary watermarking scheme. Vukotic et al. [vukotic2018deep] developed a deep learning–based, single-bit watermarking scheme by embedding through designed adversarial images and extracting via the first layer of a trained deep learning model. Li et al. [li2019novel] embedded a watermark into the discrete cosine domain by traditional algorithms [cox1997secure] and applied convolutional neural networks to facilitate the extraction.

Besides single-bit and multi-bit watermarking, attempts were also reported in special scenarios. For example, for zero-watermarking, where a master share is sent separately from the image, Fierro-Radilla et al. [fierro2019robust] applied convolutional neural networks to extract the required features from the cover-image and linked these features with the watermark to create a master share. For the scenario of template-based watermarking, Kim et al. [kim2018convolutional] embedded a handcrafted template by using a classic additive method and estimated possible distortion parameters by comparing the extracted template to the original one with the help of convolutional neural networks. Thus far, the existing deep learning–based image watermarking schemes do not fully apply the fitting ability of deep neural networks to learn and generalize the embedding and extracting algorithms.

Furthermore, due to the fragility of deep neural networks [papernot2016limitations], a modified image as an input to a trained deep learning model may cause failure. In other words, robustness is a major challenge in deep learning–based image watermarking because noise or modifications of the marked-image can result in extraction failures. Mun et al. [mun2019finding] proposed to solve this issue by proactively including noisy marked-images as adversarial examples in the training phase. However, enumerating all types of attacks and their combinations may not be practically feasible.

To the best of our knowledge, our proposed scheme is the first method that explores the ability of deep neural networks in automatically learning and generalizing both watermark embedding and extracting algorithms, while achieving robustness and blindness simultaneously.

Scheme | DNN-learned embedding and extraction | Blind | Robust | Watermark type
Kandi et al. [kandi2017exploring] | No | No | Robust to common image processing attacks | Multi-bit
Vukotic et al. [vukotic2018deep] | Learning extraction | Yes | Robust to rotation, JPEG, and cropping | Single-bit
Li et al. [li2019novel] | No | No | No | Multi-bit
Fierro-Radilla et al. [fierro2019robust] | No | No | Robust to common image processing attacks | Zero-watermarking
Kim et al. [kim2018convolutional] | Assisting extraction | No | Focus on geometric attacks | Template-based watermarking
Mun et al. [mun2019finding] | Learning extraction | Yes | Robust to all enumerated attacks during training | Multi-bit
Ours | Yes | Yes | Robust to common image processing attacks | Multi-bit
TABLE I: An analytical comparison between the proposed scheme and state-of-the-art image watermarking methods applying deep neural networks.

III Proposed Image Watermarking Scheme

We revisit the typical design of an image watermarking scheme and present the overview architecture of our scheme in Section III-A. Then, we present the loss function design and the scheme objective in Section III-B. Finally, the detailed structure of the proposed model is described in Section III-C.

III-A The Overview Architecture of the Proposed Scheme

The traditional design of an image watermarking scheme is shown in Fig. 1. A watermark (denoted as $W$) is embedded into a cover-image (denoted as $C$) to produce a marked-image (denoted as $M$) that looks similar to $C$ and is transported through a communication channel. Then, the receiver extracts the watermark data (denoted as $W^*$) from the received marked-image (denoted as $M^*$), which could be a modified version of $M$ if distortions or attacks occurred during the transmission.

Fig. 1: The traditional design of an image watermarking scheme.

To embed the watermark $W$ into the cover-image $C$, typically, the first step is to project $C$ into one of its feature spaces in the spatial, frequency, or other domains. Next, $W$ is encoded and embedded into the feature space of $C$. The embedded feature space is projected back into the cover-image space to create a marked-image $M$. Inversely, watermark extraction projects the received marked-image $M^*$ to the same feature space and then extracts and decodes the watermark information. Based on different target applications, traditional image watermarking methods manually design the projection, embedding, extraction, encoding, and decoding functions. As design criteria, an image watermarking scheme often highlights its fidelity (i.e., high similarity between $C$ and $M$) and robustness (i.e., keeping the integrity of the extracted watermark $W^*$ when $M$ is distorted).
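As a concrete and deliberately simple instance of this embed/extract pipeline, the sketch below uses least-significant-bit (LSB) substitution in the spatial domain, so the feature-space projection is the identity. LSB substitution is a classic handcrafted technique chosen here only for illustration; it is not the scheme proposed in this paper, and the pixel values and function names are hypothetical.

```python
# Illustrative only: LSB substitution, a handcrafted spatial-domain method.
# It is NOT the proposed scheme; pixel values below are made up.

def embed_lsb(cover, bits):
    """Write one watermark bit into the lowest bit of each leading pixel."""
    if len(bits) > len(cover):
        raise ValueError("watermark is longer than the cover")
    marked = list(cover)
    for i, bit in enumerate(bits):
        marked[i] = (marked[i] & ~1) | bit  # clear the LSB, then set it to the bit
    return marked


def extract_lsb(marked, n_bits):
    """Blind extraction: only the marked pixels are needed, not the cover."""
    return [pixel & 1 for pixel in marked[:n_bits]]


cover = [52, 199, 13, 240, 87, 128, 6, 255]  # hypothetical 8-bit grayscale pixels
watermark = [1, 0, 1, 1, 0, 0, 1, 0]

marked = embed_lsb(cover, watermark)
recovered = extract_lsb(marked, len(watermark))
max_change = max(abs(m - c) for m, c in zip(marked, cover))

print(recovered == watermark, max_change)
```

Each pixel changes by at most one gray level, so fidelity is high and extraction is blind, but any distortion that perturbs the low-order bits destroys the watermark. This kind of handcrafted method therefore illustrates fidelity without robustness.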

Traditional image watermarking methods perform competently through hand-designed algorithms; however, it remains challenging to automatically learn these algorithms without complete dependence upon careful design. To tackle this difficulty, we propose a novel scheme which develops a deep learning model to automatically learn and generalize the embedding, the extraction, and the invariance functions in image watermarking. Fig. 2 illustrates the overall architecture of the proposed scheme with some example images. Given two input spaces that contain all the possible watermark images and cover-images ($W$ and $C$, respectively), the encoder neural network $\mu$ parameterized by $\theta_1$ is applied to learn a function that encodes $W$. $W_f$, the encoded space of $W$, not only enlarges $W$ to prepare for the next-step concatenation, but also brings some redundancy, decomposition, and perceivable randomness to help information protection and robustness. Like the embedding process in traditional watermarking, where an encoded $W$ is inserted into a feature space of $C$, in the proposed scheme an embedder that takes $W_f$ and $C$ as inputs and produces the marked-image is fit by the neural network $\sigma$ parameterized by $\theta_2$. The marked-image space is named $M$. To handle possible distortions, a neural network $\tau$ parameterized by $\theta_3$ is introduced to learn to convert $M$ to its enlarged and redundant transformed-space $T$. After the transformation, $T$ preserves information about $W_f$ and rejects other irrelevant information, such as noise on $M$, therefore providing robustness. Finally, the inverse watermark reconstruction functions are fitted by two neural network components, $\varphi$ and $\gamma$ with trainable parameters $\theta_4$ and $\theta_5$, that extract the code $W_f'$ from $T$ and decode $W^*$ from $W_f'$, respectively.

Fig. 2: The overall architecture of the proposed image watermarking scheme.

Comparing the proposed scheme in Fig. 2 with the traditional design in Fig. 1, one can observe that the neural network components fit and optimize the image watermarking process dynamically. Encoder and decoder networks (denoted as $\mu$ and $\gamma$, respectively) fit the watermark encoding and decoding functions. An embedder (denoted as $\sigma$) projects the cover-image $C$ into a feature space, embeds the watermark code $W_f$ into the space, and projects to the marked-image space $M$. An extractor (denoted as $\varphi$) inverts all the processes in $\sigma$, and the invariance layer $\tau$ handles the distortion during the transmission through a communication channel. More details of the architectures are described in Section III-C.

Compared to an autoencoder [hinton2006reducing], where an input space is transformed to a representative and intermediate feature space and the original input is recovered from this feature space, the proposed scheme takes two spaces, the watermarks $W$ and the cover-images $C$, as inputs, produces an intermediate marked-image space $M$, and recovers $W$ from $M$. The recovery ability of autoencoders, i.e., an exact reconstruction of the input with appropriate features extracted by the deep neural networks, secures the feasibility of watermark extraction in the proposed scheme. The reconstruction requires only $M$ without the need of $W$ and $C$, enabling the blindness of the proposed scheme. A feature space in an autoencoder is often learned through a bottleneck for dimensionality compression, but the proposed scheme learns equal-sized or over-complete representations to achieve high robustness and accuracy of watermark extraction.

III-B The Loss Function and Scheme Objective

The entire architecture is trained as a single deep neural network with several loss terms designed for image watermarking. Given the data samples $w_i \in W$ and $c_i \in C$, the proposed scheme can be trained in an unsupervised manner. There are two inputs, $w_i$ and $c_i$, and two outputs, the extracted watermark $w_i^*$ and the marked-image $m_i$, in the proposed deep neural network. For the output $w_i^*$, an extraction loss that minimizes the difference between $w_i^*$ and $w_i$ is computed to ensure full extraction of the watermark. For the output $m_i$, a fidelity loss that minimizes the difference between $m_i$ and $c_i$ is computed to enable watermarking invisibility. For the output $m_i$, we also compute an information loss that forces $m_i$ to contain the information of $w_i$. To achieve this, we maximize the correlation between feature maps of the watermark code $w_{f_i}$ and feature maps of $m_i$, where the feature maps are the outputs of convolutional layers in the proposed architecture, and the feature maps of $w_{f_i}$ and $m_i$, i.e., $B_1$ and $B_2$, are illustrated in Figs. 3 and 4. Denoting the parameters to be learned as $\vartheta = [\theta_1, \theta_2, \theta_3, \theta_4, \theta_5]$, the loss function of the proposed scheme can be expressed as:

$$L(\vartheta) = \lambda_1 \left\| w_i^* - w_i \right\|_1 + \lambda_2 \left\| m_i - c_i \right\|_1 + \lambda_3 \, \psi(m_i, w_{f_i}) \tag{1}$$

where $\lambda_k$, $k = 1, 2, 3$, is the weight factor and $\psi$ is a function computing the correlation given as:

$$\psi(m_i, w_{f_i}) = \sum_{k=1}^{2} \left\| g\big(B_k(w_{f_i})\big) - g\big(B_k(m_i)\big) \right\|_1 \tag{2}$$

where $g$ denotes the Gram matrix that contains all possible inner products. By minimizing the distance between the Gram matrices of the feature maps of $w_{f_i}$ and $m_i$ produced by the intermediate layer outputs $B_1$ and $B_2$, we maximize their correlation. As the feature producers, the annotation of $B_1$ and $B_2$ is presented in Fig. 4, and the convolution block that contains $B_1$ and $B_2$ is annotated in Fig. 3. More discussion is presented in Section III-C.
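The three loss terms described above can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions: the mean absolute error is used as the distance for every term, feature maps are passed in as flat lists per channel, and the names `gram`, `gram_distance`, and `total_loss` are hypothetical rather than taken from the authors' implementation.

```python
def gram(feature_maps):
    """Gram matrix: inner products between all pairs of flattened feature maps."""
    return [[sum(a * b for a, b in zip(fi, fj)) for fj in feature_maps]
            for fi in feature_maps]


def mean_abs_diff(xs, ys):
    """Mean absolute error between two equally sized flat lists."""
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)


def gram_distance(feats_a, feats_b):
    """Distance between the Gram matrices of two sets of feature maps."""
    flat_a = [v for row in gram(feats_a) for v in row]
    flat_b = [v for row in gram(feats_b) for v in row]
    return mean_abs_diff(flat_a, flat_b)


def total_loss(w, w_star, c, m, feats_wf, feats_m, lambdas=(1.0, 1.0, 1.0)):
    """Weighted sum of the extraction, fidelity, and information loss terms."""
    l1, l2, l3 = lambdas
    extraction = mean_abs_diff(w_star, w)      # extracted vs. original watermark
    fidelity = mean_abs_diff(m, c)             # marked-image vs. cover-image
    information = sum(gram_distance(a, b)      # one term per intermediate layer
                      for a, b in zip(feats_wf, feats_m))
    return l1 * extraction + l2 * fidelity + l3 * information


# Toy values: one wrong watermark bit, a slightly altered cover, matching features.
w, w_star = [1, 0, 1, 1], [1, 0, 0, 1]
c, m = [0.5, 0.5, 0.5, 0.5], [0.5, 0.6, 0.5, 0.4]
layers = [[[1.0, 2.0], [3.0, 4.0]], [[0.5, 0.5], [1.0, 0.0]]]  # two layers, two channels each
loss = total_loss(w, w_star, c, m, layers, layers)
print(round(loss, 6))
```

With identical feature maps the information term vanishes, so the toy loss is just the sum of the extraction and fidelity errors. Driving the fidelity term to zero would force the marked-image to equal the cover, which is why the terms have to be balanced by the weights.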

In Eq. 1, any two of the fidelity loss, information loss, and extraction loss terms can trade off against each other in image watermarking. For example, minimizing the fidelity loss term to zero means that the marked-image is identical to the cover-image. However, there is no embedded information in the marked-image in this case, so the extraction of the watermark will fail. To allow some imperfection in the loss terms, the mean absolute error (i.e., the L1 norm) is selected to highlight the overall performance rather than a few outliers.

With regularization, the proposed scheme objective is represented as $L_R = L(\vartheta) + \lambda P$, where $P$ is the penalty term to achieve robustness as in Eq. 6, and $\lambda$ is the weight controlling the strength of the regularization term. The deep neural network needs to learn the parameter $\vartheta^*$ that minimizes $L_R$:

$$\vartheta^* = \arg\min_{\vartheta} \left[ L(\vartheta) + \lambda P \right] \tag{3}$$

In the backpropagation during training, the extraction loss term is applied by all the components of the proposed architecture in their weight updates, while only the encoder and the embedder apply the fidelity and information loss terms to their weight updates. This enables the encoder and the embedder to encode and embed the information in a way that the extractor and the decoder are able to extract and decode the watermark.

III-C Detailed Structure of the Proposed Neural Networks Model

Fig. 3: The detailed components of the proposed watermarking scheme: the Encoder $\mu$, the Embedder $\sigma$, the Invariance Layer $\tau$, the Extractor $\varphi$, and the Decoder $\gamma$. Every convolution block has the same structure, but only the block marked with “*” needs intermediate results to compute the loss.

This subsection describes the design of the major neural network components (the encoder $\mu$, the embedder $\sigma$, the invariance layer $\tau$, the extractor $\varphi$, and the decoder $\gamma$) in more detail. The overall design is modularized and illustrated in Fig. 3. If we single out the two pairs ($\mu$, $\gamma$) and ($\sigma$, $\varphi$), we can find that each pair is conceptually symmetrical. The watermark is considered as a binary image, and the cover-image is a color image in this description. One might adapt and customize the image sizes based on different target applications.

III-C1 The Encoder and the Decoder

Given the samples $w_i$ from the input space $W$, the encoder $\mu$ learns a function that encodes $w_i$ to its code $w_{f_i}$. Inversely, the decoder $\gamma$ learns the decoding function from the extracted code $w_{f_i}'$ back to the watermark $w_i^*$. The encoder successively increases the channel dimension of a binary watermark image, and the decoder successively decreases the feature space back to a binary watermark image. The reason to train this channel-wise increment is two-fold. First, it produces a code $w_{f_i}$ that has the same width and height as the cover-image, so that we can concatenate a feature map of $w_{f_i}$ and the cover-image $c_i$ along their channel dimension. Each of the two will contribute equally to the concatenated matrix used in the embedder $\sigma$. Thus, we weigh the watermark and the cover-image evenly. Second, this increment introduces some redundancy, decomposition, and perceivable randomness to $w_{f_i}$, which helps information protection and robustness.

III-C2 The Embedder and the Extractor

The embedder $\sigma$ applies the convolution block to extract a to-be-embedded feature map of the watermark code $W_f$ that is concatenated along the channel dimension with the cover-image. Directly applying the cover-image $C$, while only applying a feature map of $W_f$, helps $C$ to dominate the appearance of the marked-image. The concatenation is fed into another convolution block to produce the marked-image $M$. The extractor $\varphi$ inverts the process with two successive convolution blocks.

To capture various scales of features for image watermarking, the inception residual block [szegedy2017inception] is applied. It consists of a 1 × 1, a 3 × 3, and a 5 × 5 convolution, as well as a residual connection that sums up the features and the input itself. In the proposed structure, each convolution has 32 filters, and the 5 × 5 convolution is replaced by two 3 × 3 convolutions for efficiency. The 32-channel feature maps produced by different convolution paths are concatenated along the channel dimension to form a 96-channel feature, and a 1 × 1 convolution is applied to convert the 96-channel feature back to the original input channel size for the summation in the residual connection. The architecture of a convolution block is shown in Fig. 4, where $n_h$, $n_w$, and $n_c$, respectively, denote the height, width, and channel sizes.
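The channel bookkeeping of this block, and the reason two stacked 3 × 3 convolutions can stand in for one 5 × 5 convolution, can be checked with a little arithmetic. The sketch below is illustrative only: the helper names are hypothetical, biases are ignored, and the weight comparison assumes 32 input and 32 output channels for each convolution.

```python
def receptive_field(kernel_sizes):
    """Receptive field of stacked stride-1 convolutions."""
    return 1 + sum(k - 1 for k in kernel_sizes)


def conv_weights(kernel, in_channels, out_channels):
    """Weight count of a kernel x kernel convolution, ignoring biases."""
    return kernel * kernel * in_channels * out_channels


def block_channels(in_channels, filters_per_path=32, n_paths=3):
    """Channel counts through one inception residual block."""
    concatenated = filters_per_path * n_paths  # three paths of 32 filters each
    projected = in_channels                    # 1x1 convolution restores the input depth
    return concatenated, projected


concatenated, projected = block_channels(in_channels=3)
stacked_rf = receptive_field([3, 3])           # two 3x3 layers cover a 5x5 window
single_5x5 = conv_weights(5, 32, 32)           # assumed 32 -> 32 channels
stacked_3x3 = 2 * conv_weights(3, 32, 32)      # assumed 32 -> 32 channels per layer

print(concatenated, projected, stacked_rf, single_5x5, stacked_3x3)
```

Under these assumptions the stacked pair covers the same 5 × 5 window with 18,432 weights instead of 25,600, and the 1 × 1 projection brings the 96 concatenated channels back to the input depth so that the residual sum is well defined.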

Fig. 4: The architecture design of a convolution block. The block input/output size is denoted by its height ($n_h$), width ($n_w$), and channel size ($n_c$), respectively.

All the convolution blocks in Fig. 3 have the same inception residual structure as shown in Fig. 4. In the case of the “*” convolution block of Fig. 3, the annotated intermediate results $B_1$ and $B_2$ of Fig. 4 are applied in Eq. 2. Specifically, this block extracts features not only from its input in the architecture, the watermark code $W_f$, but also from the marked-image $M$. The two annotated feature maps in Fig. 4 are the intermediate results $B_1$ and $B_2$, respectively.

III-C3 The Invariance Layer

The invariance layer $\tau$ is the key component to provide robustness in the proposed image watermarking scheme. Using a fully-connected layer, $\tau$ learns a transformation from the marked-image space $M$ to an over-complete space $T$, where the neurons are activated sparsely. The idea is to redundantly project the most important information from $M$ into $T$ and to deactivate the neural connections of the areas of $M$ that are irrelevant to the watermark, thus preserving the watermark even if noise or distortion modifies a part of $M$. As shown in Fig. 3, $\tau$ converts a 3-color-channel instance of $M$ into an $N$-channel ($N \geq 3$ for non-compression) instance of $T$, where $N$ is the redundant parameter. Increasing $N$ results in increased redundancy and decomposition in $T$, which provides higher tolerance of the errors in $M$ and enhances robustness.

Based on the contractive autoencoder [rifai2011contractive], the invariance layer $\tau$ employs a regularization term that is obtained from the Frobenius norm of the Jacobian matrix of a layer's outputs with regard to its inputs. Mathematically, the regularization term $P$ is given as:

$$P = \sum_{i,j} \left( \frac{\partial h_j(x)}{\partial x_i} \right)^2 \tag{4}$$

where $x_i$ denotes the $i$-th input and $h_j$ denotes the output of the $j$-th hidden unit of the fully connected layer. Similar to a common gradient computation, the Jacobian matrix can be written as:

$$\frac{\partial h_j(x)}{\partial x_i} = \frac{\partial A(z_j)}{\partial z_j} \, \omega_{ji}, \qquad z_j = \sum_i \omega_{ji} x_i \tag{5}$$

where $A$ is an activation function and $\omega_{ji}$ is the weight between $h_j$ and $x_i$. We set $A$ as the hyperbolic tangent ($\tanh$) for strong gradients and bias avoidance [lecun2012efficient], and hence $P$ can be computed as:

$$P = \sum_j \left( 1 - h_j^2 \right)^2 \sum_i \omega_{ji}^2 \tag{6}$$

If the value of $P$ is minimized to zero, all weights in $\tau$ will be zero, so that the output of $\tau$ will always be zero no matter how we change the inputs $x$. Thus, minimizing $P$ alone will cause the rejection of all the information from the inputs. Therefore, we place $P$ as a regularization term in the total loss function to preserve useful information related to the loss terms of image watermarking, while rejecting all other noise and irrelevant information. In this way, we achieve robustness without prior knowledge of possible distortion.
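For a tanh layer, the contractive penalty has the closed form $\sum_j (1 - h_j^2)^2 \sum_i \omega_{ji}^2$, because the derivative of $\tanh$ is $1 - \tanh^2$. The sketch below checks that closed form against a finite-difference estimate of the squared Frobenius norm of the Jacobian. It is a minimal illustration with hypothetical weights and inputs, it omits biases, and it is not the authors' implementation.

```python
import math


def tanh_layer(x, weights):
    """h_j = tanh(sum_i w[j][i] * x[i]); biases are omitted for brevity."""
    return [math.tanh(sum(w_ji * x_i for w_ji, x_i in zip(row, x)))
            for row in weights]


def contractive_penalty(x, weights):
    """Closed form for tanh: sum_j (1 - h_j^2)^2 * sum_i w[j][i]^2."""
    h = tanh_layer(x, weights)
    return sum((1 - h_j ** 2) ** 2 * sum(w ** 2 for w in row)
               for h_j, row in zip(h, weights))


def numeric_penalty(x, weights, eps=1e-6):
    """Squared Frobenius norm of the Jacobian via central finite differences."""
    total = 0.0
    for i in range(len(x)):
        upper = list(x)
        lower = list(x)
        upper[i] += eps
        lower[i] -= eps
        h_up = tanh_layer(upper, weights)
        h_lo = tanh_layer(lower, weights)
        total += sum(((a - b) / (2 * eps)) ** 2 for a, b in zip(h_up, h_lo))
    return total


# Hypothetical 3-input, 4-unit layer (an over-complete mapping, N = 4).
x = [0.2, -0.4, 0.7]
weights = [[0.5, -0.3, 0.8],
           [0.1, 0.9, -0.2],
           [-0.6, 0.4, 0.3],
           [0.7, 0.2, -0.5]]

analytic = contractive_penalty(x, weights)
numeric = numeric_penalty(x, weights)
print(abs(analytic - numeric) < 1e-6)
```

The closed form needs only the layer outputs and the weights, so the penalty can be added to the training objective without computing the Jacobian explicitly.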

Notably, each color channel in the marked-image $M$ is treated as a single input unit to significantly improve the computational efficiency. For example, if we treat each pixel value as an input, a 128 × 128 RGB marked-image will have 49,152 input units. Setting the redundant parameter $N$ to its smallest value 3 will imply 49,152 × 3 (= 147,456) units in the fully-connected invariance layer $\tau$, which requires at least 49,152 × 147,456 (= 7,247,757,312) parameters. This significantly lowers the efficiency. On the other hand, treating one color channel as an input unit considers only 3 input units for an RGB marked-image, which enables faster computation with far fewer parameters as well as a much larger $N$ to enable higher redundancy for higher robustness.
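The parameter counts quoted above follow from simple multiplication. The sketch below reproduces them, assuming a 128 × 128 RGB marked-image (which matches the 49,152 input units in the text) and counting weights only, without biases; the helper name is hypothetical.

```python
def fc_weights(n_inputs, n_outputs):
    """Weight count of a fully connected layer, ignoring biases."""
    return n_inputs * n_outputs


height, width, channels = 128, 128, 3   # assumed marked-image size
n_redundancy = 3                         # smallest redundant parameter N

# Pixel-level design: every pixel value is an input unit.
pixel_inputs = height * width * channels
pixel_units = pixel_inputs * n_redundancy
pixel_level = fc_weights(pixel_inputs, pixel_units)

# Channel-level design: the 3 color channels are the input units,
# and the same small weight matrix is shared across all pixel positions.
channel_level = fc_weights(channels, n_redundancy)

print(pixel_inputs, pixel_units, pixel_level, channel_level)
# 49152 147456 7247757312 9
```

Because the channel-level weight count grows only linearly with N, the redundancy can be raised substantially while the layer stays small.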

IV Experiments and Analysis

This section presents a quantitative and analytical evaluation of the proposed deep learning–based image watermarking scheme. Section IV-A introduces our data preparation, and Section IV-B presents the experimental design, training, and validation. To validate our proposed image watermarking approach, Section IV-C provides special testing experiments on synthetic images, and Section IV-D shows the robustness under different distortions. A feasibility case study on the scenario of watermark extraction from phone camera pictures is also presented in Section IV-E.

IV-A Preparation of Datasets

The proposed deep learning–based image watermarking architecture was trained as a single deep neural network. ImageNet [russakovsky2015imagenet] was rescaled to 128 × 128 with RGB channels and then used as the cover-images. The binary version of CIFAR [krizhevsky2009learning] at its original size of 32 × 32 was used as the watermarks because the proposed architecture uses binary watermark images. Both datasets are large, so the proposed scheme is exposed to a wide range of instances during training. For a validation set after each training epoch, images from each dataset that were not used during the training phase are set aside.

The testing is performed on 10,000 images (rescaled to 128 × 128) from the Microsoft COCO dataset [lin2014microsoft] as the cover-images, and 10,000 images of the testing set of the binary CIFAR as the watermarks. Neither the testing cover-images nor the testing watermarks were used in training, which demonstrates that the proposed scheme learns and generalizes the watermarking algorithms without over-fitting to the training samples.

IV-B Training, Validation and Testing of the Proposed Model

As described in Section III, the proposed image watermarking scheme is trained as a single deep neural network. The ADAM optimizer [kingma2014adam], which adopts a moving window in the gradient computation, is applied for its ability to keep learning after many epochs. The training and validation of the proposed scheme are shown in Fig. 5, where the values of the terms in the loss (Eq. 1) and objective (Eq. 3) over the epochs are presented. During both training and validation, the terms T1 and T2 (defined in Fig. 5) in the loss converge smoothly below 0.015, and the remaining term shown in Fig. 5 converges below 0.03, indicating a proper fit. Term T1 has slightly more errors because when carrying the watermark, a marked-image cannot be completely identical to a cover-image. The weight factors $\lambda_1$, $\lambda_2$, and $\lambda_3$ are all set to 1 to weigh the loss terms equally, and the regularization weight $\lambda$ is set to 0.01 as suggested in [rifai2011contractive]. All the layers apply the rectified linear unit (ReLU) as the activation function apart from the outputs (marked-image and watermark extraction), which use sigmoid to limit the range to (0, 1).

Fig. 5: The loss values during training and validation.

At the testing phase, the peak signal-to-noise ratio (PSNR) and bit-error-rate (BER) are respectively used to quantitatively evaluate the fidelity of the marked-images and the quality of the watermark extraction. The PSNR is defined as:

$$\mathrm{PSNR} = 10 \log_{10} \frac{\mathrm{MAX}^2}{\mathrm{MSE}} \tag{7}$$

where MSE is the mean squared error and MAX is the maximum possible pixel value. The BER is computed as the percentage of error bits on the binarization of the watermark extraction. In the testing, the BER is zero, indicating that the original and the extracted watermarks are identical. The testing PSNR is 39.72 dB, indicating a high fidelity of the marked-images, so that the hidden information cannot be noticed by human vision. A few testing examples with various image content and color are presented in Fig. 6, where we can observe high fidelity and full extraction. The watermark codes do not directly reveal information about the watermark, which shows the perceivable randomness and decomposition learned by the proposed model.
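Both metrics are straightforward to compute. The sketch below is a minimal illustration on flat lists of values in [0, 1]; the peak value of 1.0 and the binarization threshold of 0.5 are assumptions that suit sigmoid outputs, and the function names are hypothetical.

```python
import math


def psnr(original, distorted, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized flat images."""
    mse = sum((a - b) ** 2 for a, b in zip(original, distorted)) / len(original)
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(max_value ** 2 / mse)


def ber(original_bits, extracted, threshold=0.5):
    """Bit error rate in percent after binarizing the extracted watermark."""
    binarized = [1 if value >= threshold else 0 for value in extracted]
    errors = sum(b != o for b, o in zip(binarized, original_bits))
    return 100.0 * errors / len(original_bits)


cover = [0.2, 0.4, 0.6, 0.8]
marked = [0.21, 0.41, 0.61, 0.81]        # every pixel shifted by 0.01
fidelity_db = psnr(cover, marked)        # about 40 dB

watermark = [1, 0, 1, 1]
extraction = [0.9, 0.2, 0.4, 0.7]        # third value falls below the threshold
error_rate = ber(watermark, extraction)  # one wrong bit out of four

print(round(fidelity_db, 2), error_rate)
```

A uniform shift of 0.01 on a unit-range image gives an MSE of about 1e-4 and therefore a PSNR of about 40 dB, which is close to the 39.72 dB reported for the testing set.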

Fig. 6: A few testing examples of the proposed scheme with various image content and color.

IV-C The Proposed Scheme on Synthetic Images

To further validate that the image watermarking task is properly generalized, the proposed scheme is applied for synthetic RGB cover-images and watermarks. The results of the blank cover-images and the random bits are presented below.

Fig. 7 illustrates the scenario of embedding binary watermarks into synthetic blank cover-images of black, red, white, green, and blue colors, respectively. Although blank cover-images are not included in the training, the proposed scheme provides promising results. Applying blank cover-images is known to be extremely difficult in conventional watermarking methods due to the lack of psycho-visual information. However, unlike traditional methods that assign some unnoticeable portions of visual components as the watermark, the proposed deep learning model learned to apply the correlation between the features of the watermark code space $W_f$ and of the marked-image space $M$ to indicate the watermark.

Fig. 7: Embedding watermarks into blank cover-image examples. (a) and (b): the embedded and extracted watermarks; (c) and (d): the five blank cover- and marked-images.

Fig. 8 shows an example of embedding a randomly generated binary image into a natural cover-image. To test the application scenarios where the watermarks are encrypted to random bits (besides the displayed example), randomly generated bit sets are tested on cover-images from the testing dataset. The resulting average BER indicates that applying random binary bits as the watermark does not deteriorate the performance of our proposed solution.

Fig. 8: Embedding random bits as the watermark. (a) random bits; (b) a cover-image; (c) extraction; and (d) the marked-image.

IV-D The Robustness of the Proposed Scheme

The robustness of the proposed scheme against different distortions on the marked-image is evaluated by analyzing the distortion tolerance range. Fig. 9 illustrates a few visual examples of the marked-images and their distortions.

Fig. 9: A few visual examples of the marked-images (top row) and their distortions (bottom row). The methods are implemented from left to right: Histogram Equalization, Gaussian Blur, Salt & Pepper Noise, and Cropping, respectively.

Due to the over-complete design and the invariance layer $\tau$, the proposed scheme can tolerate distortions at a very high percentage (see Fig. 10 (b) and (d) for an example of large cropping).

Fig. 10: An example of watermark extraction under large percentage cropping. (a) an example marked-image, (b) cropped (a), (c) original watermark, (d) extraction with the invariance layer $\tau$, and (e) extraction without $\tau$.

To demonstrate the importance of our core robustness provider, two control experiments extracting watermarks from distorted marked-images with and without the invariance layer $\tau$ are conducted. Without $\tau$, the proposed scheme cannot extract the correct watermark (one example is shown in Fig. 10), and the extraction from 10,000 attacked testing marked-images yields an average BER as high as 42.46%, which illustrates the significance of $\tau$ when compared to the results presented in Fig. 11.

With the invariance layer, distortions with swept-over parameters that control the attack strength are applied to the marked-images produced from the testing dataset. The watermark extraction BER caused by each distortion under each parameter is averaged over the testing dataset. The distortions with swept-over parameters versus the average BER are plotted in Fig. 11. Since the proposed scheme focuses on image-processing attacks, its responses to some challenging image-processing attacks are discussed here. The proposed scheme shows a high tolerance range on these challenges, especially for cropping, salt-and-pepper noise, and JPEG compression. For example, the extracted watermarks have average BERs as low as 7.8%, 11.6%, and 12.3% under severe distortions including a cropping that discards 65% of the marked-image, a JPEG compression with a low quality factor of 10, and a 90% salt-and-pepper noise. The attacks that randomly fluctuate the pixel values across image channels show a higher BER; these include Gaussian additive noise and random noise that sets a random pixel to a random value. Such extreme attacks can easily destroy most of the content of the marked-image (see a few examples in Fig. 12). Still, the proposed system performs well when the marked-image content is reasonably preserved, for example a 14% BER under an 11% random noise.

Fig. 11: Distortions with swept-over parameters versus average BER.
Fig. 12: An example of extreme distortions. (left): the marked-image; (middle): after Gaussian additive noise with variance 0.2; and (right): after 20% random noise.
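The parameter sweep described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the evaluation code of the paper: the distortion implementations (`salt_and_pepper`, `crop_discard`), the `sweep` helper, and in particular the toy extractor are hypothetical stand-ins, and the trained embedding and extracting networks are not modelled.

```python
import numpy as np

def salt_and_pepper(image, ratio, rng):
    """Set roughly a `ratio` fraction of array elements to 0 or 1 at random."""
    out = image.copy()
    hit = rng.random(image.shape) < ratio   # elements to corrupt
    salt = rng.random(image.shape) < 0.5    # about half white, half black
    out[hit & salt] = 1.0
    out[hit & ~salt] = 0.0
    return out

def crop_discard(image, ratio):
    """Keep a top-left window of about (1 - ratio) of the area; zero the rest."""
    h, w = image.shape[:2]
    scale = np.sqrt(1.0 - ratio)
    kept_h, kept_w = int(round(h * scale)), int(round(w * scale))
    out = np.zeros_like(image)
    out[:kept_h, :kept_w] = image[:kept_h, :kept_w]
    return out

def sweep(marked_images, watermarks, extract, distort, strengths):
    """Average BER over the dataset for each attack strength."""
    curve = {}
    for s in strengths:
        bers = [np.mean(extract(distort(img, s)) != wm)
                for img, wm in zip(marked_images, watermarks)]
        curve[s] = float(np.mean(bers))
    return curve

# Toy stand-ins (NOT the trained networks): each "marked-image" is simply the
# watermark stored as floats, and "extraction" is a threshold at 0.5.
rng = np.random.default_rng(0)
watermarks = [rng.integers(0, 2, size=(32, 32)).astype(bool) for _ in range(20)]
marked = [wm.astype(float) for wm in watermarks]
toy_extract = lambda img: img > 0.5
curve = sweep(marked, watermarks, toy_extract,
              lambda img, r: salt_and_pepper(img, r, rng), [0.0, 0.3, 0.9])
cropped = crop_discard(np.ones((100, 100)), 0.65)
```

With a real extractor in place of `toy_extract`, plotting `curve` for each distortion type would produce one line of a figure in the style of Fig. 11.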

As discussed in Table I, Mun's scheme [mun2019finding] is the closest in purpose to ours because it also achieves blindness and robustness simultaneously. Hence, we further compare our scheme with Mun's scheme. The comparison is performed on the same cover-image sets and the same watermark images as reported for Mun's scheme. To analyze the robustness of the proposed scheme, the extraction BERs under common image-processing attacks are shown in Table II. The proposed scheme has two advantages: it covers more distortion categories among image-processing attacks, and it obtains a lower BER under the same distortion parameters. Although Mun's method can tolerate geometric distortions while the proposed scheme cannot, Mun's method requires the distortions to be present in the training phase to achieve robustness. In the real world, there is no way to predict and enumerate all kinds of attacks.

Method BER (%) under the distortions
G. F.
Mun’s 6.61 7.98 4.81 38.01 256
Ours 0.43 8.16 0 0.97 0 39.93 1,024
  • Note: denotes ”Not Applicable” that the robustness of an attack was not covered in [mun2019finding]; denotes histogram equalization; JPEG 10 denotes a JPEG compression with quality factor 10; denotes the salt-and-pepper noise; G. F. denotes Gaussian filtering.

TABLE II: A quantitative comparison between the proposed scheme and Mun’s scheme.

IV-E A Case Study: Feasibility Test on Watermark Extraction from Camera Resamples

Image watermarking applications today differ considerably from those for which the technique was first developed. Early scenarios focus on copyright detection, while more recent real-world communication requirements introduce a challenging use-case: watermark extraction from phone camera resamples of marked-images [digimark]. The challenges in this use-case arise because watermark extraction needs to handle multiple combinations of distortions, such as optical tilt, quality degradation, compression, lens distortions, and lighting variation. Most existing approaches [pramila2018increasing, kim2006image, yamada2013method, delgado2013digital] focus on resamples of printed marked-images, not on phone resamples of a computer monitor screen. The latter use-case typically involves additional distortions, such as the Moiré pattern (i.e., the RGB ripple), the refresh rate of the screen, and the spatial resolution of the monitor (some examples are mentioned in Fig. 13). We applied the proposed scheme as a major component in such scenarios since it is designed to reject all irrelevant noise instead of focusing on certain types of attacks. The outline of our scenario is shown in Fig. 13.

Fig. 13: The phone camera testing scenario.

An information provider prepares the information by encoding through an Error Correction Coding (ECC) technique. Although trained as an entire neural network, the proposed scheme is separated into the embedding components ( and ) and the extracting components (, and ). The marked-image can be obtained by embedding the encoded watermark into the cover-image using the trained embedding components. The marked-image that looks identical to the cover-image is distributed online and displayed on the user’s screen. A user scans the marked-image with a phone to extract the hidden watermark through the extracting components.

The distortions that occur in this test can be divided into two categories: perspective distortions and image-processing distortions. The major function of the proposed scheme in this scenario is to overcome the pixel-level modifications coming from image-processing distortions such as compression, interpolation errors, and the Moiré pattern. To concentrate the test on the proposed scheme, we simplified the handling of the perspective distortions, although this can be an entire challenging research track [zhao2018automatic, jaderberg2015spatial].

With this setup, we develop a prototype for a user study. Fig. 14 (a) illustrates the Graphical User Interface (GUI) and a 32×16 sample of information. The classic Reed-Solomon (RS) code [reed1960polynomial] is adopted as the ECC to protect the information. RS coding is applied to each row of the information so that the encoded information is a watermark satisfying the fixed watermarking capacity of the proposed scheme. In the watermark, each row is a codeword with data of length 16 and parity of length 16, and hence up to 8 errors per row can be corrected. Therefore, in this watermark of length 1,024 (32×32), up to 256 errors can be corrected if there are no more than 8 errors in each row [reed1960polynomial]. Since half of the bits are used as parity, the watermarking payload is 512 bits. For the perspective distortion, the four corners of the largest contour inside the Region Of Interest (ROI) are used to map the contoured content to a bird's-eye view (Fig. 14 (b)), and the watermark is extracted from the bird's-eye view rectification.
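The capacity figures in this paragraph follow from the general Reed-Solomon property that a codeword with n symbols, k of which carry data, can correct up to (n - k) / 2 symbol errors. The short sketch below only reproduces this arithmetic; the constant and function names are chosen here for illustration, and no actual RS encoding or decoding is performed.

```python
ROWS = 32          # rows in the watermark, one codeword per row
CODEWORD_LEN = 32  # length of each encoded row (n)
DATA_LEN = 16      # information carried by each row (k)

parity_len = CODEWORD_LEN - DATA_LEN          # 16 parity positions per row
per_row_correctable = parity_len // 2         # t = (n - k) / 2 = 8
watermark_len = ROWS * CODEWORD_LEN           # 32 x 32 = 1,024
max_correctable = ROWS * per_row_correctable  # 32 x 8 = 256
payload = ROWS * DATA_LEN                     # 32 x 16 = 512

def decodable(errors_per_row):
    """True when every row stays within the per-row correction limit."""
    return all(e <= per_row_correctable for e in errors_per_row)
```

Note that the 256-error bound is reached only when the errors are spread evenly: a single row with 9 errors already exceeds the per-row limit, even if every other row is error-free.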

Fig. 14: The prototype. (a) The GUI and sample information; and (b) the simple rectification.
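The four-corner rectification can be illustrated with a standard projective (homography) mapping. The sketch below is an assumption about how such a simple rectification might be implemented, not the prototype's actual code: it solves the homography directly from four point correspondences and warps with nearest-neighbour sampling. Corner detection (finding the largest contour in the ROI) is not shown, and the function names are hypothetical. Corners are assumed to be given as (x, y) in the order top-left, top-right, bottom-right, bottom-left.

```python
import numpy as np

def homography_from_corners(src, dst):
    """Solve the 3x3 projective transform H that maps four src points to dst."""
    rows, rhs = [], []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        rhs.append(u)
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        rhs.append(v)
    h = np.linalg.solve(np.array(rows, float), np.array(rhs, float))
    return np.append(h, 1.0).reshape(3, 3)  # bottom-right entry fixed to 1

def warp_point(H, point):
    """Apply H to a 2-D point using homogeneous coordinates."""
    x, y, w = H @ np.array([point[0], point[1], 1.0])
    return x / w, y / w

def rectify(image, corners, size):
    """Nearest-neighbour warp of the quadrilateral `corners` to a size x size view."""
    target = [(0, 0), (size - 1, 0), (size - 1, size - 1), (0, size - 1)]
    H_inv = homography_from_corners(target, corners)  # output pixel -> source pixel
    out = np.zeros((size, size) + image.shape[2:], dtype=image.dtype)
    for v in range(size):
        for u in range(size):
            x, y = warp_point(H_inv, (u, v))
            xi, yi = int(round(x)), int(round(y))
            if 0 <= yi < image.shape[0] and 0 <= xi < image.shape[1]:
                out[v, u] = image[yi, xi]
    return out

# Illustrative corner coordinates of a tilted quadrilateral in a photo.
corners = [(10, 20), (200, 30), (210, 220), (5, 200)]
square = [(0, 0), (127, 0), (127, 127), (0, 127)]
H = homography_from_corners(corners, square)
```

Mapping output pixels back to source pixels (the inverse direction) avoids holes in the rectified view; a production implementation would typically use bilinear interpolation instead of rounding.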

In this case study, five volunteers were invited to take a total of photos of some marked-images displayed on a screen using the camera on their mobile phones. The volunteers' phones were Google Pixel 3, Samsung Galaxy S9, iPhone XR, Xiaomi 8, and iPhone X. All the photos were taken under office lighting conditions. Volunteers were given two rules. First, the entire image should be placed as large as possible inside the ROI. As a prototype for demonstration, this rule supports our segmentation assumption that the largest contour inside the ROI is the marked-image, so that this application can focus on testing the proposed system instead of complicated segmentation algorithms. In addition, making the image fill as much of the ROI as possible helps capture the details and features needed for watermark extraction. Second, the camera should be kept as stable as possible. Although the proposed system tolerates some blurring effects, it is not designed to extract watermarks under high-speed motion.

Fig. 15 presents five watermark extractions, their BERs, and the corresponding ROIs. The closer the photo is taken, the lower the error. A lower error was also observed when the camera was held more nearly parallel to the screen. Using the flashlight introduces more errors because it over- or under-exposes some image areas. The flashlight is optional in this application because the screen is backlit. The average BER was 5.13% for the images.

Fig. 15: Five examples of the watermark extractions before ECC and their ROIs. The left-hand side of each picture is the ROI, and the right-hand side is the extracted watermark. The BERs of the extractions from left to right are 3.71%, 4.98%, 1.07%, 4.30%, and 8.45%, respectively.

For a visual comparison, the displayed watermark extractions are the raw results before error correction. After RS decoding, all the watermark extractions in the testing cases can be restored to the original information without errors, as shown in Fig. 14. The proposed scheme can successfully extract the watermark within one second because it only applies the trained weights to the marked-image rectification.

V Applications of the Proposed Watermarking Scheme

In this section, we discuss different application scenarios, where the proposed image watermarking scheme can be applied to (i) authorized IoT device onboarding in Section V-A; (ii) creation of private communication channels in Section V-B; and (iii) authorized access to privileged content and services in Section V-C. We assume that a watermark is created based on credentials or secrets provided by users (e.g., passwords, cryptographic keys, or fingerprint scans). Our scenarios can also utilize the watermark extraction technique from camera resamples that we presented in Section IV-E.

V-A Authorized IoT Device Onboarding

IoT devices need to be "onboarded" once they are bought by users; that is, they must be associated with a home controller or a cloud service so that users can control them [mastorakis2020icedge, mastorakis2019towards].

Currently, the widely used methods to onboard IoT devices are mostly out-of-band and include the use of QR codes physically printed on devices, PIN codes, and serial numbers [latvala2020evaluation]. For example, once a user buys a smart IoT camera, he/she scans a QR code printed on the camera (or the packaging) with his/her mobile phone and, through a mobile application, connects the camera with a cloud service usually offered by its manufacturer. In this way, the user is able to watch the video stream captured by the camera.

Existing onboarding methods do not protect against unauthorized access. For example, attackers who have physical access to an IoT device can tamper with it (e.g., install malware on the device) before the device is onboarded by its owner. The proposed image watermarking mechanism can be utilized to ensure that only the device owner can onboard an IoT device. For example, user credentials can be embedded into a QR code, which will be physically printed on the device. Once a user receives his/her IoT device (e.g., an IoT camera with a QR code that has the user credentials embedded), he/she takes a picture of the QR code with his/her mobile phone. To onboard the device, the user sends the QR picture along with his/her credentials to a server that runs the extraction via a deep neural network. The deep neural network verifies that the user indeed possesses the credentials (watermark) embedded into the QR code and authorizes the user to onboard the IoT device.
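The server-side check in this scenario could be sketched as follows. This is only one possible realization under assumptions that are not part of the scheme itself: the credential is turned into a 512-bit watermark with SHAKE-256 (the paper does not specify how credentials are encoded), the watermark bits are assumed to have already been extracted and error-corrected, and the names `credential_to_bits` and `is_authorized` are hypothetical.

```python
import hashlib
import numpy as np

PAYLOAD_BITS = 512  # payload size reported in the camera case study

def credential_to_bits(credential: str) -> np.ndarray:
    """Assumed encoding: hash the credential to a fixed-length bit string."""
    digest = hashlib.shake_256(credential.encode("utf-8")).digest(PAYLOAD_BITS // 8)
    return np.unpackbits(np.frombuffer(digest, dtype=np.uint8))

def is_authorized(extracted_bits, credential, max_ber=0.0):
    """Accept when the extracted watermark matches the bits derived from the credential."""
    expected = credential_to_bits(credential)
    ber = np.mean(np.asarray(extracted_bits, dtype=np.uint8) != expected)
    return bool(ber <= max_ber)

# Illustration: `embedded` stands in for the output of the extraction network.
embedded = credential_to_bits("owner-secret")
owner_ok = is_authorized(embedded, "owner-secret")
attacker_ok = is_authorized(embedded, "guessed-secret")
```

The `max_ber` threshold defaults to zero because the error correction step is expected to remove residual extraction errors; a non-zero threshold would trade security for tolerance and would need to be chosen carefully.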

V-B Creation of Private Communication Channels

The proposed image watermarking scheme can be used for the creation of private chatrooms and other communication channels. For instance, a chatroom organizer can collect the credentials of the individuals that he/she would like to communicate with and create a QR code with this set of credentials embedded into it. The created QR code can be uploaded to the Internet. Once an individual who has been included in the communication group scans the QR code with his/her mobile phone and provides his/her credentials to the deep neural network, the network verifies that this individual indeed possesses credentials embedded into the QR code and authorizes the user to join the chatroom. Unauthorized Internet users (i.e., users that do not have their credentials embedded into the QR code watermark) might try to join the chatroom, but they will not be able to do so. Even if such users take a photo of the QR code, they do not possess credentials that are embedded into this code; thus, the deep neural network will reject their requests to join the chatroom. The QR code with the embedded watermark is publicly available and can be accessed by all Internet users, but only the authorized users are able to join the chatroom and communicate with each other.

V-C Authorized Access to Privileged Content and Services

The proposed image watermarking scheme can also be utilized for a broader scope of applications in which access to privileged content and/or services must be controlled. The content producer or service provider creates marked-images that have the credentials of authorized users embedded. As a result, image watermarking can be used as an access control mechanism, where access to certain pieces of (privileged) content and/or services is restricted to authorized users, i.e., users that have their credentials (watermark) embedded into the marked-image. As in the previous scenarios, when a user sends the marked-image and his/her credentials to the deep neural network, the network extracts the watermark and verifies that the user possesses the proper credentials. Only users that possess the credentials (watermark) embedded into the marked-image are allowed to access the privileged content.

VI Conclusion

This paper introduces an automated and robust image watermarking scheme using deep convolutional neural networks. The proposed blind image watermarking scheme exploits the fitting ability of deep neural networks to generalize image watermarking algorithms, presents an architecture that is trained in an unsupervised manner for watermarking tasks, and achieves robustness without requiring prior knowledge of possible distortions on the marked-image. Experimentally, we have not only reported promising performance under individual common attacks but also demonstrated that the proposed scheme has the potential to support combinative, cutting-edge, and challenging camera applications, which confirms the superiority of the proposed scheme. Our future work includes tackling geometric and perspective distortions with the deep neural networks inside the scheme, and refining the scheme architecture, objective, and loss function through methods such as ablation studies.