Research on watermarking techniques in the multimedia sector is growing significantly: the proposed methods achieve ever-increasing accuracy while mitigating the drawbacks of embedding additional information into an image, and the rising rate of piracy creates a need for pervasive marking of intellectual property. As multimedia services simplify their content delivery systems and broaden accessibility, the growing number of online assets makes it easy to intercept content and redistribute it illegally for the violator's own benefit, using the sheer amount of media online to hide the infringement. Many techniques are used to prevent this, with watermarking being one of the most effective methods IBC's website; Thorwirth et al. (2018). The idea of digital watermarking is to embed an additional message into an image or a video so that, when the content is copied, its owner may prove their rights. Moreover, when digital media is made available to a wide range of users, embedding a watermark that is unique for each user allows, in the case of unauthorised redistribution, identification of the user who leaked the content. The aim is to allow any authorized party easy localization, extraction and identification of the watermark, even if the copy has been manipulated. Since a person who aims to violate the copyright wants to destroy the embedded watermark, it has to be robust against image processing attacks that remove the identification information encoded in it. Moreover, as the main interest of the copyright owner is to provide clients with the highest possible quality of the content, the embedded watermark should not be visible to the end user (transparency), nor should it diminish the quality of the image.
It is easy to note that the requirements of transparency and robustness are to some extent opposing (assuming a constant capacity), hence it is of paramount importance for a watermarking system to provide an appropriate balance between the two. Naturally, the higher the demand on the content, the better it should be protected, hence the watermark should have a capacity that allows identifying a large number of unique end users. Note that not only adversarial attacks hinder the effectiveness of watermarking. In order to maximize network transfer, images and videos are compressed so that almost all surplus information, i.e., the information that is not properly interpreted by human perception, is removed, leaving only the data that are important for the visual quality. An example of the information removed in compression (e.g., JPEG) is chrominance (Cr, Cb), to which human sight is less sensitive than to the luminance (Y) of an object. Since information in the chroma components of the colorspace is trimmed significantly, the watermark has to be embedded into the components that the end user is more aware of. Note that, in simplification, video watermarking may be viewed as an extension of image watermarking with an additional obstacle, MPEG compression, which encodes primary frames (I) using JPEG and other frames (P and B) by relying on references to the primary ones.
The authors used the Discrete Cosine Transform (DCT) in order to comply with JPEG compression, whereas a different frequency approach, combined with Singular Value Decomposition (SVD), was proposed in Najafi and Loukhaoukha (2019); Liu et al. (2019) (sharp frequency localized contourlet transform and discrete wavelet transform were used, respectively), as frequency-domain modifications allow the information to be easily 'spread' across the visible image. In Hsu and Tu (2020), the authors used redundancy in their dual watermarks to improve the robustness against cropping. A similar effect was achieved by the end-to-end encoder-noiser-decoder framework proposed in Zhu et al. (2018), which spreads the embedded message over all pixels. A follow-up paper, Wen and Aydore (2019), introduced adversarial training that further improved the robustness, albeit at the expense of the transparency. In fact, the resulting image significantly deviates from the well-established ranges of 'acceptable' PSNR, diminishing its commercial value. Spread spectrum watermarking with adaptive strength and differential quantization allowed the authors of Huang et al. (2019) to improve PSNR guarantees. The robustness of the algorithm was the focus of Luo et al. (2020), where the authors added another neural network for generating generic distortions during training. The presented model improved on previous accuracy; however, the authors did not test robustness against such common attacks as rotation and subsampling. The authors of Plata and Syga (2020) focused on the local capacity of an image, presenting a method that improves the robustness of the watermark against a wide range of attacks, as well as proposing a differentiable approximation of JPEG compression. The latter was also investigated in Ahmadi et al. (2020). Coping with the difficulties introduced by image compression was likewise the focus of Hamamoto and Kawamura (2020).
In this paper, we introduced (1) a novel end-to-end watermarking system utilizing an additional component to inspect whether images contain a watermark. (2) We enhanced the architecture of the system and extended the encoder by a watermark adapter, resulting in a significant improvement of the robustness against some attacks and compression algorithms, such as rotation, resizing or JPEG. (3) We proposed a novel evaluation method coping with false suspects of a copyright violation and indicated the efficiency of our discriminator–detector approach. (4) We also provided an alternative method to increase the transparency of encoded images and showed its efficiency.
The aim of watermarking techniques is to embed some binary data into a cover image of shape . To achieve that, we use the encoder , which returns the encoded image . The encoded image needs to pass the transparency requirement, i.e., be perceptually similar to the cover . Moreover, the encoded image should be robust against some processing operations called attacks (we utilize noisers in the training phase to simulate attacks). Then, we process the distorted image using the decoder to extract the message . We aim to receive , i.e., the extracted message should be similar to the embedded one. Note that, in a real-life scenario, we also need to determine whether the investigated image contains the additional data. To distinguish between distorted images and , we use the discriminator .
Our system contains eight components, three of which are trainable neural networks — the encoder , decoder and discriminator . The next two are differentiable layers, the noiser and prenoiser , used for adding artificial noise to the images. The last two elements, the propagator and translator , are required to propagate the message into a spatial form and to revert this operation. We also distinguish an auxiliary neural network called the adapter , which extends the propagator . The overall architecture is presented in Fig. 1.
Propagator, adapter and translator.
In our solution, we applied a method to reduce the local bits-per-pixel capacity. Instead of assigning a representation of the whole message to every pixel of the cover image , as in Zhu et al. (2018); Luo et al. (2020), among others, we used the spatial spreading algorithm proposed in Plata and Syga (2020). The algorithm converts the message to a sequence , where is a tuple containing a slice of the message and a reference index in binary form . We assume that is the size of the slice and . It is easy to notice that we need at most bits to store the binary index . For example, for and , we need to prepare a sequence of length and tuples containing 6 bits (4 bits for and 2 bits for ). We also define for simplicity of further formulations.
The propagator converts the message to the sequence and generates two spatially-spread representations of the message, i.e., and , where is an argument of the propagator referring to the size of a unitary block of size , which contains some tuple replicated times in two directions. The tuples and the unitary blocks are randomly sampled in order to fill the messages and , respectively. In the case of , we assign tuples to the cells of a grid of shape , while in the case of , we assign to every cell of the grid a unitary block of shape , thus the final shape of is equal to . The translator works with the message and reverts the operations applied by the propagator . In order to extract some slice of the message , the translator chooses the tuples (cells) whose stored binary indices are closest to the corresponding index of the slice; the elements of the chosen tuples referring to the slice of the message are then averaged.
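As a rough illustration, the spatial spreading performed by the propagator can be sketched as follows. This is a simplified sketch with assumed toy parameters (message length 8, slice size 4, an 8x8 grid, 4x4 unitary blocks); the function name and layout are illustrative, not the authors' implementation.

```python
import numpy as np

def spread_message(message, slice_size, img_h, img_w, block_size, rng):
    """Spatially spread a binary message: split it into slices, attach a
    binary reference index to each slice, and tile the resulting tuples
    over an (img_h x img_w) grid in randomly sampled unitary blocks."""
    n_slices = len(message) // slice_size
    idx_bits = max(1, int(np.ceil(np.log2(n_slices))))  # bits for the index
    tuples = []
    for i in range(n_slices):
        sl = list(message[i * slice_size:(i + 1) * slice_size])
        idx = [int(b) for b in format(i, f"0{idx_bits}b")]  # binary index
        tuples.append(np.array(sl + idx, dtype=np.float32))
    tuple_len = slice_size + idx_bits
    # Fill the grid: every (block_size x block_size) cell holds one randomly
    # sampled tuple, replicated over the whole block in both directions.
    grid_h, grid_w = img_h // block_size, img_w // block_size
    spread = np.zeros((tuple_len, img_h, img_w), dtype=np.float32)
    for gy in range(grid_h):
        for gx in range(grid_w):
            t = tuples[rng.integers(n_slices)]
            y0, x0 = gy * block_size, gx * block_size
            spread[:, y0:y0 + block_size, x0:x0 + block_size] = t[:, None, None]
    return spread, tuples
```

The translator would perform the inverse: read the index bits of each cell, group cells by index, and average the corresponding slice predictions.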
We extended the propagator by adding the adapter , a convolutional neural network used to adapt the propagator's output into a form convenient for the encoder . A similar approach was presented in Luo et al. (2020), where linear layers were used to adapt the message directly. The adapter was separated from the encoder, as this allows producing the result of independently of the image processing steps. This could be particularly relevant when working with a sequence of frames. The adapter is built from four convolutional blocks called conv-bn-relu.
The primary role of the discriminator is the application of the adversarial training approach Goodfellow et al. (2014); Hayes and Danezis (2017) in order to improve the perceptual similarity of the encoded and cover images. We also utilize it to indicate whether an image contains the message. The details of and motivation for using the discriminator in this way are presented in Sect. 4. The discriminator is built from 3 conv-bn-relu blocks, a global average pooling layer, and a linear layer with a single output unit and Sigmoid activation.
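For concreteness, a conv-bn-relu block and the discriminator described above could be sketched in PyTorch as follows. The channel count and kernel size are assumptions (the exact values are not recoverable from the text), and the reflection padding follows the defaults mentioned in the training details.

```python
import torch
import torch.nn as nn

def conv_bn_relu(in_ch, out_ch, kernel_size=3):
    """A single conv-bn-relu block: convolution with reflection padding,
    batch normalization, and ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size,
                  padding=kernel_size // 2, padding_mode="reflect"),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class Discriminator(nn.Module):
    """Sketch of the discriminator: 3 conv-bn-relu blocks, global average
    pooling, a linear layer with a single output unit, and Sigmoid."""
    def __init__(self, in_ch=3, hidden=64):
        super().__init__()
        self.features = nn.Sequential(
            conv_bn_relu(in_ch, hidden),
            conv_bn_relu(hidden, hidden),
            conv_bn_relu(hidden, hidden),
            nn.AdaptiveAvgPool2d(1),       # global average pooling
        )
        self.head = nn.Sequential(nn.Flatten(), nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, x):
        return self.head(self.features(x))
```

The same conv-bn-relu building block would be reused by the adapter, encoder and decoder, with different depths.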
Encoder and decoder.
The encoder and decoder are the two main components of our watermarking system. Both networks need to cooperate during training in order to find the balance between the transparency and robustness requirements of watermarking. Moreover, they determine a joint "scheme" of the encoding–decoding procedure which is known only to the trainable components of the pipeline. The watermarking system is designed to work with some real-life limitations. As an example, note that watermarking videos is currently in much higher demand than watermarking strictly still images, hence any proper image watermarking method should be easily extendable to marking movies (e.g., on a frame-by-frame basis). In over-the-top (OTT) media, video-streaming services are severely constrained by the content servers (origin or cache) which, due to storage limitations, are not able to provide different content to every user (with a key per user). As a consequence, to allow uniquely watermarked media for each user, the encoder needs to work on the client's side and handle high quality video Friend MTS (2018). This indicates that the proposed architecture of the encoder (and indirectly the other neural networks) has to be relatively small and shallow. Additionally, it explains the reason for considering the adapter as a separate component, which allows it to be used once for all frames. The encoder processes the cover image using three conv-bn-relu blocks. Next, the output is concatenated with the cover image and the adapted message taken from , followed by the application of two conv-bn-relu blocks. Finally, we use a convolutional layer with kernel size equal to on the concatenation of the cover image and the prior output. The decoder is based on eight conv-bn-relu blocks. We apply average pooling with kernel size and stride equal to to the output. Finally, we use a conv-bn-relu block and a convolution with kernel sizes equal to for both.
Attacks and noiser layers.
In our work, we considered eight attack types, all implemented in the noiser layer . The crop attack crops a square from the encoded image and is parameterized by , referring to the ratio of the cropped area. The cropout attack also crops a square, while the rest of the image is replaced by the cover image. The dropout attack chooses pixels with probability and replaces them with the corresponding pixels of the cover image. Both cropout and dropout imitate the binary symmetric channel, which in information theory describes a more challenging model of communication than the binary erasure channel (which may be simulated by salt-and-pepper noising) Zhu et al. (2018). We also included common computer vision operations (resizing, rotation and Gaussian smoothing), as well as the most common compression algorithm, JPEG (with quality parameter ), and the 4:2:0 chroma subsampling procedure. The JPEG algorithm includes non-differentiable operations, e.g., quantization, thus we could not apply it in the training pipeline without halting the neural network weight updates. To handle this problem, we used an approximation of JPEG proposed in Ahmadi et al. (2020); Plata and Syga (2020). In the experiments described in Sect. 5, the noiser layer is executed before and always applies the dropout.
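To make the attack definitions concrete, the dropout and cropout noisers might be sketched as follows. This is a NumPy sketch with hypothetical function names; the training pipeline uses differentiable counterparts of these operations.

```python
import numpy as np

def dropout_attack(encoded, cover, p, rng):
    """Dropout attack: each pixel of the encoded image is replaced by the
    corresponding cover pixel with probability p (a binary symmetric
    channel over the embedded information)."""
    mask = rng.random(encoded.shape[:2]) < p   # per-pixel replacement mask
    out = encoded.copy()
    out[mask] = cover[mask]
    return out

def cropout_attack(encoded, cover, ratio, rng):
    """Cropout attack: keep a square of the encoded image covering `ratio`
    of the area; everywhere else the cover image is restored."""
    h, w = encoded.shape[:2]
    side = int(round(np.sqrt(ratio) * min(h, w)))
    y = rng.integers(0, h - side + 1)
    x = rng.integers(0, w - side + 1)
    out = cover.copy()
    out[y:y + side, x:x + side] = encoded[y:y + side, x:x + side]
    return out
```

Note that both attacks require access to the cover image, which is why the noiser layer is fed with both the encoded and the cover images during training.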
Training details and hyper parameters.
To train and test our method, we used the COCO dataset Lin et al. (2014). We sampled images for the training subset and for the testing subset. The subsets were disjoint. We resized images to pixels and encoded messages of length . The messages were sampled at random, and the spatial spreading was random as well. We used Adam Kingma and Ba (2014) with the learning rate equal to (other parameters were set to their defaults). The pipeline was trained with batch size for epochs, and all attacks were applied (one type in each iteration of the training). At the end, we froze all the weights except the discriminator's and ran the discriminator's part of the pipeline for 20 epochs. To train the system, we used two GPUs (Nvidia RTX 2080Ti 11GB). One epoch of the pipeline training took about 370 seconds, while during inference we were able to process about 45 images per second with one component and one GPU. The default parameters for the convolutional layers were as follows: channel size , kernel size and reflection padding applied. For the linear layers we set units by default. All neural networks were fed with images in the YCrCb color space, and the same space was used for the images returned by the encoder.
The training procedure and objectives.
In one iteration of the pipeline, we first take a message and apply it to the propagator , which returns two variants of the spatially-spread message, i.e., . Next, we use the adapter to transform the message and encode the cover image . The output of the operation is the encoded image , which is afterwards distorted by the noiser layers. Exactly the same distortion needs to be applied to the cover image , thus we have: . Note that some types of attacks require the cover image in order to distort the encoded image . Moreover, for the types of noise which affect the encoded image spatially, we also calibrate the message ; e.g., for cropping attacks, we also crop the message .
The decoder is fed with the noised encoded image and predicts the encoded message ,
while the discriminator distinguishes between the noised cover and encoded images, and respectively, i.e.,
We need to determine a loss component in order to train the discriminator , which is used for two purposes — improving the perceptual similarity and serving as an auxiliary component of our watermarking system in the double discriminator–decoder approach. Naturally, the main focus is to keep the transparency and robustness at the highest possible level.
We seek to ensure the transparency using the mean square error between and , thus , where is the Frobenius norm. In order to handle the message decoding, we used the mean-variance approach proposed in Plata and Syga (2020), given by:
where , and returns element-wise absolute values. This formulation of the loss converges to a state in which some tuples carry predictions of high overall quality, instead of returning correct predictions only for a proper subset of the indices across all tuples. Note that, due to the high redundancy of tuples in , this way of convergence is advisable.
We provide adversarial training of the encoder using the discriminator . The aim of the encoder is to produce an image recognized as a cover image by the discriminator, while in our pipeline the image is further distorted by the noiser . Thus, we defined . The aim of the discriminator is to distinguish between the distorted cover and encoded images, which is formulated as .
To obtain the parameters for the encoder , adapter and decoder , we minimize the objective over the distribution of images and messages, namely , with weights. To train the discriminator , we minimize the objective over the distribution of images: .
3 Watermark robustness
In this section, we present the efficiency of the image encoding–decoding procedure of our watermarking solution. In order to compare our framework with well-established ones, we applied the common evaluation approach, which relies on calculating the bit accuracy of the detected messages.
We set the propagator's parameters to and and the translator parameter to . We split the message into tuples and further replicated them to fill unitary blocks of size . The expected number of block redundancies was equal to . We trained the pipeline with the following parameters: , , and . The comparison of bit accuracy between our solution and state-of-the-art works is presented in Tab. 1. The results were achieved for PSNR equal to dB, dB and dB for the Y, Cb and Cr channels, respectively. Examples of the encoded images are presented in Figure 2. Our solution exhibited significant improvement for the rotation, resizing and JPEG attacks. Moreover, it provided the highest overall accuracy among all methods, measured as the lowest bit accuracy over all attacks. For our solution, the lowest bit accuracy was equal to , while it was for Plata and Syga (2020) and below for other methods.
Table 1: Bit accuracy of our method compared with Spatial Plata and Syga (2020), HiDDeN Zhu et al. (2018), DADW Luo et al. (2020) and RedMark Ahmadi et al. (2020).
4 Double discriminator–decoder approach
In a real-life environment, watermarking systems are applied only to a small subset of the total multimedia content worldwide. Thus, in order to detect the message, we need to adopt one of the following approaches:
naïve — apply the detection procedure to every image and label as a suspect any image whose extracted key shares at least bits with any of the keys from the database,
double — first execute a procedure to distinguish whether given content comes from our watermarked sources or not.
The former relies on using a highly effective detection procedure. For example, let us assume that (estimated for accuracy of about 90%) and the watermark contains 32 bits. Having one million random keys in the database, the probability that at least one key from the database shares at least 29 bits is equal to . Thus, the chance of failure is high even for a relatively small database of keys and a high accuracy of the decoder.
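The probability above can be reproduced with a short calculation. For an independent, uniformly random key, each bit matches the extracted key with probability 1/2, so a single key shares at least 29 of 32 bits with probability given by a binomial tail, and the database-wide failure probability follows; the function name is illustrative.

```python
from math import comb, exp, log1p

def false_match_probability(n_keys, key_bits, threshold):
    """Probability that at least one of n_keys random keys shares at least
    `threshold` bits with a given extracted key (each bit of an independent
    random key matches with probability 1/2)."""
    p_single = sum(comb(key_bits, i)
                   for i in range(threshold, key_bits + 1)) / 2 ** key_bits
    # P(at least one match) = 1 - (1 - p_single)^n_keys, computed stably
    return 1.0 - exp(n_keys * log1p(-p_single))

# 32-bit watermark, one million keys, 29 shared bits required
p = false_match_probability(1_000_000, 32, 29)   # roughly 0.72
```

Even though a single random key matches with probability of only about 1.3e-6, one million keys drive the overall false-match probability above 70%, which is why the naïve approach alone is unreliable.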
The latter could significantly improve the overall efficacy of the watermarking system, but it requires an auxiliary subsystem to distinguish between content sources. Instead of using the critic only to rate whether the encoded image is similar to the original one, we moved its position in the system and placed it after the noiser in the training pipeline. By this, we fed the critic with noised images . Therefore, cover images also needed to be processed by the noiser in the same way as encoded images, in order to avoid learning inappropriate characteristics of the cover and encoded images, i.e., the critic would otherwise be able to learn features which are effects of processing by the noiser rather than by the encoder. This modification still gives us the possibility of using the critic for adversarial training aimed at improving the transparency of the encoded images.
During the tests, we assume that an image contains the watermark if , where is a threshold for the critic's outputs. We utilize the standard metrics: true positive (TP), when is an encoded image and ; false positive (FP), when does not contain an embedded message and ; true negative (TN), when does not contain an embedded message and ; false negative (FN), when is an encoded image and . While designing the watermarking system, we aim at maximizing the detection of copyright infringements as well as minimizing the probability of a false accusation of piracy. Following these conditions, we propose two rates to measure the efficacy of the discriminator–decoder approach that translate well to real-life validation of watermarking frameworks: (1) the true identification rate (), defined as the probability of extracting the appropriate key given a true encoded image, i.e., ; (2) the false identification rate (), defined as the probability of indicating a wrong key given a true encoded image, or of falsely classifying an image as encoded, i.e., . It is easy to note that covers the maximization problem discussed above, whereas complies with the minimization. Working with both rates gives us the ability to precisely validate the watermarking system. We also consider the false identification of encoded images and cover images separately. Hence, we define to test the indication of wrong keys from encoded images, and to validate falsely classified cover images.
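The rates above can be computed from raw test outcomes as in the following sketch; the dictionary field names are hypothetical, not taken from the authors' code.

```python
def identification_rates(results):
    """Compute the identification rates from a list of test cases, where
    each case is a dict with fields:
      `encoded` - whether the image truly carries a watermark,
      `flagged` - whether the discriminator classified it as watermarked,
      `correct` - whether the decoder extracted the right key."""
    encoded = [r for r in results if r["encoded"]]
    cover = [r for r in results if not r["encoded"]]
    # true identification rate: true encoded image, flagged, proper key
    tir = sum(r["flagged"] and r["correct"] for r in encoded) / len(encoded)
    # false identification split into two parts:
    # a wrong key indicated from a true encoded image ...
    fir_enc = sum(r["flagged"] and not r["correct"] for r in encoded) / len(encoded)
    # ... or a cover image falsely classified as encoded
    fir_cov = sum(r["flagged"] for r in cover) / len(cover)
    return tir, fir_enc, fir_cov
```

A naïve single-decoder system corresponds to treating every image as flagged; the double approach lets the discriminator suppress most of the cover-image term.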
Robustness of the discriminator–decoder approach.
To the best of our knowledge, recent work on neural network-based watermarking did not include experiments following conditions similar to those described in Sect. 4. The vast majority of the research focuses on improving only the detection procedures and the transparency of encoded images. We validate our system using the procedure common in recent works, e.g., Plata and Syga (2020); Zhu et al. (2018); Luo et al. (2020); Ahmadi et al. (2020) (see Sect. 3), as well as the and rates.
Taking advantage of this testing formula, we showed the efficiency of our discriminator–decoder approach compared to the naïve system with a single decoder. The experiments were done for images. We assumed that the size of the keypool is equal to . The results are shown in Table 2. The threshold was adjusted to obtain the highest for the naïve approach; however, due to the high sensitivity of , we chose a lower one if the difference between them was at most . Then, we adjusted the rates for the double approach, keeping a similar . Our results confirmed the high efficiency of the double approach. In most cases, the obtained results were higher than in the naïve one; notably, for some attacks the difference was significant. The double approach improves the overall performance significantly for those attacks in which the decoder presents relatively low bit accuracy, i.e., lower than . For example, having similar values of , we completely discarded from around and reduced around - times for the cropping and cropout attacks. In turn, we observed that the naïve approach is as efficient as the double approach if the decoder demonstrates high robustness against an attack. Moreover, we observed that, with an inefficient discriminator and a robust detector, the results could be better for the naïve approach; e.g., for Gaussian smoothing, the decoder's bit accuracy was close to and the discriminator's was (on balanced data), and the overall performance was finally higher for the naïve approach.
5 Transparency versus robustness
One of the most challenging problems of watermarking techniques is encoding a message into an image in a transparent manner that still enables accurate decoding of the message from a distorted image. A robust watermarking needs to store the message in a part of the image that is the most resistant to distortions resulting from attacks as well as compression algorithms.
Finding a suitable trade-off between those ratios is possible by proper tuning of the hyperparameters of the loss function during the training of the pipeline. Despite the fact that tuning our pipeline is a time-consuming and exhausting process, including changing parameters in the middle of the training procedure, the pipeline of three neural networks was usually not balanced in the expected way. Additionally, in the case of creating a system that produces images with different transparency levels, the basic approach requires training separate neural networks. Thus, we extended our framework by a component which allows improving the quality of the encoded image.
We designed a method to improve the transparency of the encoded image that can be applied after the training process. The method performs the following steps:
select a mask , where , i.e., sample from the Bernoulli distribution with parameter ,
update , where denotes element-wise multiplication.
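Assuming the elided update rule blends the encoded residual with the cover image through the sampled mask (a plausible reading of the steps above, not confirmed by the text), the method can be sketched as:

```python
import numpy as np

def increase_transparency(cover, encoded, p, rng):
    """Post-hoc transparency adjustment: sample a per-pixel Bernoulli(p)
    mask and keep the encoded residual only where the mask is 1, restoring
    the cover pixels elsewhere.  Assumed update rule: E' = C + M * (E - C)."""
    mask = (rng.random(cover.shape[:2]) < p).astype(cover.dtype)[..., None]
    return cover + mask * (encoded - cover)
```

Lowering the parameter restores more cover pixels and thus improves transparency, at the cost of removing part of the embedded signal, which matches the trade-off reported in Figure 3.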
We trained our pipeline with the hyperparameters set in a way firmly favouring robustness against attacks, i.e., we set , , and . Then, we compared the encoded image quality and the robustness against attacks for various values of the parameter of the Bernoulli distribution. The results are presented in Figure 3. They confirm that an increase in the images' transparency has a negative impact on the robustness against some attacks. In Ahmadi et al. (2020), the authors presented another method of increasing transparency, by calculating a linear combination of the cover and encoded images. Note that their system is not designed with spatial attacks (e.g., rotation) in mind, whereas our method handles this type of attack as well.
In our work, we introduced a novel end-to-end watermarking solution based on neural networks. We significantly improved the robustness against several attacks, e.g., JPEG compression and rotation. The method stands out with the highest overall accuracy, equal to . It is also characterized by one of the highest transparencies of the encoded images. We added a new component called the adapter to the pipeline and used a novel pipeline architecture in which we were able to utilize the discriminator to distinguish between cover and encoded images. We proposed an evaluation method which fits the real-life environment of a watermarking system, based on three metrics, namely , and . We evaluated a naïve watermarking system and our double decoder–discriminator architecture following our evaluation method, which confirmed the high efficiency of our double approach. Finally, we explored the problem of robustness versus transparency of watermarking systems and proposed a flexible solution to it. In future work, we would like to continue improving the robustness of the watermarking system, including multi-attack scenarios and new types of attacks. Moreover, we would like to enhance the capacity of the watermark, as well as apply other quality measures in the training pipeline that could improve the transparency of the encoded images.
This paper presents a complete framework for image watermarking that allows both embedding and identifying the message encoded in an image. The natural environment for implementing the solution is copyright management. Due to its high accuracy and low impact on image quality, the solution may be used to protect intellectual property in the form of professional or amateur pictures, computer-generated graphics or movies. Encoding a watermark may make it possible to identify the author of the intellectual property, as well as help identify the person who leaked the medium and thus enabled the breach of copyright. For the latter case, it is required that the watermark is unique for each client. An additional branch that may benefit from the watermarking technique is the data annotation business, emerging as a response to the needs of deep learning solutions. Both of the above cases present a scenario where content creators (or owners) are protected against copyright infringement and may prove their authorship or present proof in legal actions. Naturally, such actions may be taken only in the case of high certainty. Despite the high accuracy of the method, it is not foolproof; hence it may give a false positive identification and has to be sufficiently supervised so that false accusations are not made. Aside from the intended use, the popularity of watermarking may incur some negative aspects: as the legal system adjusts to proofs based on presented watermarks, we may witness 'copyright trolling' similar to recent patent rivalries between companies, as some individuals may try to watermark not-yet-encoded content with their signature in order to claim ownership. This aspect may be exacerbated by the false positive identification mentioned above. Watermarking systems are a proposition for the multimedia industry to protect their businesses against piracy. The presented solution considers many aspects required of solutions working in a real-life environment.
The proposed solution is flexible with respect to the size of the image. It is a 'lightweight' solution, as its time performance is relatively high, i.e., a 30 FPS movie can be encoded in real time using a graphics processing unit (GPU). The solution is also robust against video compression algorithms (the most informative units of a compressed movie, I-frames, are encoded using the JPEG algorithm).
- ReDMark: Framework for residual diffusion watermarking based on deep networks. Expert Systems with Applications 146, pp. 113157.
- Comparing subscriber watermarking technologies for premium pay TV content. White Paper.
- Generative adversarial nets. In Advances in Neural Information Processing Systems 27, pp. 2672–2680.
- Neural watermarking method including an attack simulator against rotation and compression attacks. IEICE Transactions on Information and Systems E103.D, pp. 33–41.
- Generating steganographic images via adversarial training. In Advances in Neural Information Processing Systems 30, pp. 1954–1963.
- Enhancing the robustness of image watermarking against cropping attacks with dual watermarks. Multimedia Tools and Applications 79 (17), pp. 11297–11323.
- Enhancing image watermarking with adaptive embedding parameter and PSNR guarantee. IEEE Transactions on Multimedia 21 (10), pp. 2447–2460.
- Using forensic watermarking to protect UHD content. Note: https://www.ibc.org/publish/using-forensic-watermarking-to-protect-uhd-content/946.article, last accessed on 2020-04-20.
- Adam: A method for stochastic optimization. International Conference on Learning Representations.
- Improved wavelet-based image watermarking through SPIHT. Multimedia Tools and Applications, pp. 1–14.
- Microsoft COCO: Common objects in context. In Computer Vision – ECCV 2014, D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars (Eds.), Cham, pp. 740–755.
- An optimized image watermarking method based on HD and SVD in DWT domain. IEEE Access 7, pp. 80849–80860.
- Distortion Agnostic Deep Watermarking. arXiv e-prints, arXiv:2001.04580.
- Hybrid secure and robust image watermarking scheme based on SVD and sharp frequency localized contourlet transform. Journal of Information Security and Applications 44, pp. 144–156.
- A robust embedding and blind extraction of image watermarking based on discrete wavelet transform. In Mathematical Sciences, Vol. 11, pp. 307–318.
- Robust Spatial-spread Deep Neural Image Watermarking. arXiv e-prints.
- Forensic Watermarking Implementation Considerations for Streaming Media. Streaming Video Alliance.
- ROMark: A Robust Watermarking System Using Adversarial Training. arXiv e-prints, arXiv:1910.01221.
- HiDDeN: Hiding Data with Deep Networks. In The European Conference on Computer Vision (ECCV).