1 Introduction
In recent years the multimedia market has been growing steadily. Services provide access to a vast range of multimedia and are becoming more convenient to use, e.g. Netflix offers offline access to movies and TV shows [NetflixOffline]. This convenience is accompanied by an increase in illegal redistribution of copyrighted content. One of the most efficient methods of preventing such behaviour is embedding a human-invisible watermark in the digital content.
Watermarking exploits the fact that the bandwidth of digital image signals is much higher than the amount of information that a human can properly receive and interpret. It is well known that human eyesight is more sensitive to the luminance component of a color space than to chrominance, i.e. one can recognize even a small difference in the brightness of an image, but small color perturbations are imperceptible to the human visual system. This is one of many properties of the surplus bandwidth that are used in applications such as compression, steganography or watermarking.
Watermarking describes a communication model in which a user embeds a message into a digital image and sends it to a recipient. Afterwards the image may be manipulated by an attacker; however, a legitimate user who shares the joint embedding and extraction strategies should be able to recover the embedded message from the (possibly manipulated) image. The goal of the attacker is to modify the image so as to destroy the embedded message without significant deterioration, i.e. while preserving the commercial value of the original data.
When working on watermarking techniques, we need to handle the following three requirements [survey]:

transparency concerns the quality of the image after the watermark encoding. In general, the original and watermarked images need to be perceptually similar. All distortions caused by the watermark embedding should be invisible to the human eye, so that the value of the data for the consumers does not deteriorate. In our work, we utilized the peak signal-to-noise ratio (PSNR), which measures the pixel-wise difference between two images;
robustness describes the user’s ability to decode the message from the encoded image after some signal processing operations have been applied to it. These operations may be applied intentionally, in order to destroy the watermark, or may result from technical requirements or limitations. In this work, we refer to these operations as attacks. Examples of attacks include cropping, resizing, Gaussian blur and JPEG compression;

capacity was defined in [2002] as "the number of bits a watermark encodes within a unit of time or work". In this paper, we consider the local (block) bits-per-pixel capacity. The size of the block can be determined by calculating the longest distance over which information about any pixel is spread across the image by an encoder built from a sequence of convolutional layers, e.g. for one convolutional layer with the kernel size equal to , the block size is and for two layers with kernel size equal to , it is .
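The block-size reasoning in the capacity item can be illustrated with a short calculation. The sketch below is not taken from the paper: it uses the standard receptive-field recurrence for stacked convolutions, and the kernel size 3 in the demo is a purely hypothetical value chosen for illustration.

```python
def receptive_field(kernel_sizes, strides=None):
    """Side length (in pixels) of the window linked to one pixel
    through a stack of convolutional layers.

    Standard recurrence: each layer widens the window by
    (kernel - 1) * (product of the strides of the earlier layers).
    """
    if strides is None:
        strides = [1] * len(kernel_sizes)  # assumption: stride 1 everywhere
    field, jump = 1, 1
    for kernel, stride in zip(kernel_sizes, strides):
        field += (kernel - 1) * jump
        jump *= stride
    return field


if __name__ == "__main__":
    # Hypothetical kernel size 3, used only for illustration.
    print(receptive_field([3]))     # one layer
    print(receptive_field([3, 3]))  # two stacked layers
```

With stride 1, every additional layer widens the block by the kernel size minus one, so the block grows linearly with the depth of the encoder.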
In this paper, we introduce a novel technique for embedding a secret message into a digital image and extracting it using convolutional neural networks. We propose a method of spatially spreading the secret message over the image, which significantly reduces the local (block) bits-per-pixel capacity and at the same time preserves robustness against spatial attacks, such as rotation or cropping. Additionally, the spatial spreading method significantly reduces the duration of the training phase in comparison to previous solutions. The proposed method has been validated against a wide group of attacks including lossy compression techniques, such as subsampling and JPEG compression, and spatial attacks, such as rotation and cropping. Although the multimedia community has considered such attacks throughout the history of ’classic’ watermarking, some of them were neglected by the authors of recent neural-network-based watermark encoding solutions, even though the attacks are easy to apply and some of them are common components of lossy compression techniques. We also divide the considered attacks into five groups based on the scope in which they affect the image. Next, we show that it is essential to apply attacks from various groups in order to build a robust deep learning system for watermarking. Finally, we evaluate the robustness of our method against the attacks in terms of the quality of the image measured by the peak signal-to-noise ratio (PSNR).
Our contribution is (1) a new architecture of the spatial-spread encoder and decoder as well as (2) the formulation of a loss function matching the architecture. (3) We improved the robustness against particular attacks in comparison to the current state-of-the-art methods, especially the JPEG lossy compression algorithm, resizing and Gaussian blurring. (4) We handled new types of attacks, such as subsampling, which is a part of the JPEG algorithm. (5) We proposed a precise differentiable approximation of the JPEG compression algorithm. (6) The resulting training framework required half the time in comparison to prior solutions. (7) We prepared an analysis of attack types: we grouped the typical attacks according to their scope, which is helpful in choosing an appropriate and balanced set of attacks applied in the noiser layers and in deriving dependencies between them.
2 Problem formulation
The main goal of watermarking methods is to embed additional information, called a watermark, into a digital image, called a cover image, in a way that allows a legitimate user to recover the watermark. The watermark needs to be robust against some signal processing operations, called attacks. Our method is a blind technique, thus the decoder is not given access to the original image. In this work, we considered the following attacks: cropping, cropout, dropout, rotation, Gaussian smoothing, subsampling 4:2:0, JPEG compression and resizing. All attacks as well as the watermark encoding need to preserve the transparency of the image.
We aim to encode a binary message , where is a positive integer, in the cover image of shape . The result of this operation is the encoded image containing the hidden watermark . Both images and need to be perceptually indistinguishable. Next, an attacker modifies by applying selected attacks in order to prevent the extraction of from the encoded image. The output of the modification is a noised image which has three channels and unspecified width and height. Finally, we try to extract a hidden message from that satisfies .
3 Related work
The problem of transparent and robust embedding of additional information in the digital domain has been studied extensively for many years. Watermarking solutions can be divided into two types: non-blind and blind. Non-blind solutions require an original copy of the image for the detection step, whereas blind methods are able to detect a message encoded into the covertext without any additional data. Because they are easier to apply in real-life environments, most recent works have focused on blind approaches. Many solutions use spatial-to-frequency domain transformations, such as the Discrete Fourier Transform (DFT) [4129062], Discrete Wavelet Transform (DWT) [Najafi, 7415839, Kumar2018ImprovedWI, makbol2014new, lai2010digital], Discrete Cosine Transform (DCT) [5966517, Shivani2017RobustIE, patra2010novel] and others [8324041, 8259288]. Extreme Learning Machine (ELM) is another technique used for embedding watermarks into digital images which has been gaining popularity in recent years [6252363, 7415839, 7966011]. Another method widely used for the watermarking problem is Singular Value Decomposition (SVD), which was utilized in [makbol2014new, gupta2012robust, makbol2013robust, lai2010digital, loukhaoukha2011optimal] among others. Many of the presented works handle watermarking with a combination of two or more techniques (e.g. [makbol2014new, 7415839]).
In recent years, we could also observe increased interest in applying deep learning methods to watermarking. The authors of [Zhu_2018_ECCV] proposed a framework for training encoder and decoder networks in an end-to-end manner by adding noiser layers between the encoder and the decoder, together with an adversary network that decides whether an image was encoded or not. A message was spread over all pixels of an image, which allowed impressive robustness against cropping attacks. The paper was followed by [wen2019romark], where the authors introduced a novel method of training the original architecture, called adversarial training. They reported high robustness against the attacks; however, it resulted in a relatively low quality of encoded images as measured by the PSNR. Another interesting approach to improving the robustness of message detection, introduced in [luo2020distortion], uses an additional attack neural network for generating generic distortions. The authors of [zhong2019robust] designed a fully automated deep-learning-based system for watermark extraction from camera-captured images. In [ZeroCNN], the authors used convolutional neural networks for zero-watermarking, which does not modify the image but extracts some characteristics from the image in order to link it with an owner. The paper [JpegRotate] described a deep learning solution robust against JPEG compression and rotation. In ReDMark [ReDMark], a special transform layer is applied to the image before it is fed to the encoding neural network, and the authors worked out a differentiable approximation of JPEG.
4 Method
4.1 Architecture
The architecture proposed in the paper consists of six main components. Three of them are trainable neural networks called encoder , decoder and adversarial discriminator/critic , where , , are trainable parameters. The next component is an additional attacker network used for performing the attacks listed in Section 2 on the encoded image. We also specify two deterministic algorithms called message propagator and message translator . The overall sketch of the architecture is presented in Figure 1.
We denote the th bit of the message as . Moreover, we represent the message using a sequence of tuples , where and . In particular, for , we are able to represent the message as a trivial sequence of tuples for . We also define a function which for a given value returns its binary representation of a length equal to .
The propagator is a function which executes the following steps:

converts the message into a sequence of tuples ,

for every , converts the first element of the tuple to the binary representation , flattens the tuple , and unsqueezes to ,

fills a spatially-spread message with unsqueezed tuples for . Note that we allow redundant data in , i.e. more than one tuple may be inserted.
If the message is an input to the encoder, we also need to extend , which requires one additional step:

repeat times in horizontal and vertical directions every tuple in ( is converted to ).
If the additional step needs to be executed, we denote the propagator by and achieve . The visualization of the propagator is presented in Figure 2.
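The propagator steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the channel-first layout, the random placement policy and the grid size are all hypothetical choices made for the example.

```python
import numpy as np


def to_bits(value, width):
    """Binary representation of `value` as a list of `width` bits (MSB first)."""
    return [(value >> shift) & 1 for shift in range(width - 1, -1, -1)]


def message_to_tuples(message, chunk_len):
    """Split the message into chunks and prefix each chunk with its binary index."""
    n_chunks = len(message) // chunk_len
    index_bits = max(1, (n_chunks - 1).bit_length())
    rows = [to_bits(i, index_bits) + list(message[i * chunk_len:(i + 1) * chunk_len])
            for i in range(n_chunks)]
    return np.array(rows, dtype=np.float32)


def propagate(message, chunk_len, grid_h, grid_w, repeat=1, rng=None):
    """Spread index-prefixed tuples over a (grid_h x grid_w) grid.

    Every tuple is placed at least once; leftover cells receive random
    repeats (the redundancy allowed by the propagator). `repeat` mimics the
    extra encoder-side step that tiles each tuple horizontally and vertically.
    """
    rng = np.random.default_rng() if rng is None else rng
    tuples = message_to_tuples(message, chunk_len)
    n, cells = len(tuples), grid_h * grid_w
    if cells < n:
        raise ValueError("grid is too small to hold every tuple once")
    order = np.concatenate([np.arange(n), rng.integers(0, n, size=cells - n)])
    rng.shuffle(order)
    spread = tuples[order].reshape(grid_h, grid_w, -1).transpose(2, 0, 1)
    return spread.repeat(repeat, axis=1).repeat(repeat, axis=2)


if __name__ == "__main__":
    msg = [1, 0, 1, 1] * 8                      # hypothetical 32-bit message
    out = propagate(msg, chunk_len=4, grid_h=4, grid_w=4, repeat=2)
    print(out.shape)                            # channels x height x width
```

Each spatial position of the result holds one complete tuple along the channel axis, so a local crop still contains self-describing fragments of the message.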
The output of the propagator together with the cover image is used by the encoder to produce the encoded image , i.e.:
(1) 
We follow by applying the attacks on the image by:
(2) 
Note that some attacks require the cover image , e.g. dropout. For the crop attack, we also crop the message during training. The decoder tries to predict the message having access only to :
(3) 
Additionally, we use to rate if is similar to :
(4) 
The last element of the architecture is the message translator . It is a deterministic function which calculates the final message based on the decoded message . The calculation is similar to the k-nearest neighbours algorithm. For every , we find tuples from whose first values (interpreted as a binary index) are closest to , i.e., we choose a tuple with coordinates if is one of the lowest values. Then, we calculate the mean value of every element encoding a bit of the message, i.e. the elements of the tuple on positions . Based on this, we are able to predict all bits of the message .
The whole architecture allows us to encode the message in the cover image while reducing the local (block) bits-per-pixel capacity. Recent state-of-the-art encoder architectures [Zhu_2018_ECCV, wen2019romark, luo2020distortion] are based on convolutional layers. This means that the encoder embeds the watermark message locally, without access to the whole image. Such an encoder architecture leads to two possible ways of encoding the message. (1) Encoding only a subset of the whole message depending on the color of the pixels, e.g. encoding some bits only if the tone of the pixel is close to blue. This way of encoding is risky and unreliable. (2) Attempting to encode the whole message locally (in a block of pixels). An analysis of the reported robustness results, in particular the high accuracy against the cropping attack, indicates that the second way of encoding the message is more likely. Thus, we proposed a solution that reduces the local bits-per-pixel capacity and improves the robustness against attacks, especially smoothing-like attacks.
The proposed architecture spreads fractions of the message over the image in the form of tuples , where and . Note that the spread is performed in a block fashion rather than by assigning the whole message to every single pixel. For example, we could encode a message of length 32 by splitting it into 8 patches of length equal to 4 ( and ). Thus, we are able to encode a patch with 7 bits, where we need 3 bits for the index of the patch and 4 bits for the corresponding fraction of the message. During our experiments, we achieved the best results for .
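The translator described above can be sketched as follows. This is only an illustration of the nearest-neighbour idea: the L1 distance on the index bits, the 0.5 decision threshold, the channel-first layout and all names are assumptions made for the example, not details confirmed by the paper.

```python
import numpy as np


def translate(decoded, n_chunks, chunk_len, k=3):
    """Recover message bits from a decoded spatial tensor.

    decoded: array of shape (index_bits + chunk_len, H, W) with soft values
             roughly in [0, 1]; each spatial cell holds one tuple.
    For every chunk index, the k cells whose index bits are closest to the
    binary code of that index are averaged and thresholded.
    """
    index_bits = max(1, (n_chunks - 1).bit_length())
    cells = decoded.reshape(decoded.shape[0], -1).T      # (H*W, tuple_len)
    soft_index, soft_bits = cells[:, :index_bits], cells[:, index_bits:]
    message = []
    for i in range(n_chunks):
        target = np.array([(i >> s) & 1 for s in range(index_bits - 1, -1, -1)],
                          dtype=float)
        # Assumption: L1 distance between soft index bits and the target code.
        dist = np.abs(soft_index - target).sum(axis=1)
        nearest = np.argsort(dist)[:k]
        mean_bits = soft_bits[nearest].mean(axis=0)
        message.extend((mean_bits > 0.5).astype(int).tolist())
    return message
```

Averaging over several matching cells is what lets redundant copies of a tuple compensate for cells that an attack has damaged.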
4.2 Loss functions
We formulated a novel loss function for training our models using the gradient descent algorithm. Our general objective contains three separate loss functions , and , for training the encoder , the decoder and the critic respectively. The noiser layers are inside the training pipeline and do not contain trainable parameters. Furthermore, the message propagator and translator are deterministic algorithms outside of the training pipeline.
The aim of the loss function is to keep the images and similar. It was formulated as follows:
(5) 
where is the standard mean squared error function. The loss function works on the similarity between the propagated messages and . However, as contains redundant data, i.e. repeated tuples, we do not need to recover the message perfectly. Our aim was to extract a subset of tuples carrying "high confidence" information. Thus, we formulated the loss function as a combination of mean and variance functions:
(6)  
(7) 
and
(8) 
where and and the operator returns the absolute value of every element of the vector. The final loss function is . Such a formulation of the loss function promotes learning all elements of some tuples over learning some elements of all tuples.
We also defined adversarial training for the encoder and the critic . In this way, we were able to achieve better visual similarity of the images and . The encoder is expected to produce images that meet the transparency requirement, thus we defined the loss function . The role of the critic, on the other hand, is to distinguish between the "real" images and the modified image , thus in this case we defined the loss function .
Finally, we ran the gradient descent algorithm on the and parameters in order to minimize the loss function over the distribution of images and messages :
(9) 
where s are the weights of the particular losses. Simultaneously, we trained the critic with parameters by minimizing the loss function over the distribution of images : .
4.3 The architecture of the networks
The main block applied to the neural networks, i.e. the encoder , the decoder and the critic , is a sequential structure of a convolutional layer with 64 channels, the kernel size equal to , stride equal to and padding equal to , then a batch normalization layer and a ReLU activation. All networks operate on images in YCbCr color space.
The encoder contains five sequential blocks, where the first block is fed with the concatenated tensor of the image and the spread message . Next, the tensor is also concatenated with the input before every second convolutional layer, i.e. the 1st, 3rd and 5th layers have access to the cover image and the spread message. The last encoder layer is a convolution with 3 channels and the other parameters left at their defaults. Note that the number of layers in the encoder does not exceed that of other state-of-the-art methods, e.g. [Zhu_2018_ECCV, wen2019romark, luo2020distortion, ReDMark]. This is important for time performance, as in many practical scenarios (e.g. streaming) the encoder needs to work in real time.
The decoder takes the encoded image and puts it through 6 sequential blocks. Then, we apply an adaptive average pooling layer which produces a tensor with a size equal to . Next, the tensor is fed to a sequential block with 64 channels, the kernel size and the padding equal to . The last layer is a separate convolutional layer with channels; the kernel size and the padding remain unchanged. Thus, the decoder returns a tensor with the same size as . The last two convolutional layers imitate fully connected layers applied to every spatial element of the output over channels. Note that during our experiments we did not change the size of the tensor produced by the adaptive pooling, i.e. the decoder returned an output tensor of the same size also after cropping or resizing attacks. Taking actions that depend on the attack type could improve the robustness of the method, but it would require a way of recognizing the attack type and would run counter to the end-to-end approach, thus we decided to return with the same size in every case.
The critic consists of three sequential blocks, an adaptive average pooling layer which produces a 64-dimensional vector, and then a fully connected layer. The critic returns a value describing the similarity of the input image to real images.
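The fixed output size of the decoder after cropping or resizing follows from how adaptive average pooling chooses its bins. The sketch below re-implements the usual bin rule in NumPy for illustration; the 8×8 output size and the input shapes are hypothetical, since the size used in the paper is not recoverable from this text.

```python
import numpy as np


def adaptive_avg_pool2d(x, out_h, out_w):
    """Average-pool a (C, H, W) array to (C, out_h, out_w) for any H, W.

    Bin i covers rows floor(i*H/out_h) .. ceil((i+1)*H/out_h) - 1, and
    likewise for columns, so the bins adapt to the input size.
    """
    c, h, w = x.shape
    out = np.empty((c, out_h, out_w), dtype=float)
    for i in range(out_h):
        r0, r1 = (i * h) // out_h, -((-(i + 1) * h) // out_h)   # floor, ceil
        for j in range(out_w):
            c0, c1 = (j * w) // out_w, -((-(j + 1) * w) // out_w)
            out[:, i, j] = x[:, r0:r1, c0:c1].mean(axis=(1, 2))
    return out


if __name__ == "__main__":
    full = np.random.rand(64, 40, 56)       # hypothetical feature map
    cropped = full[:, 5:28, 10:41]          # same map after a crop
    print(adaptive_avg_pool2d(full, 8, 8).shape)
    print(adaptive_avg_pool2d(cropped, 8, 8).shape)
```

Both calls return a tensor of the same shape, which is why the layers after the pooling never see the spatial size of the attacked image.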
4.4 Noiser layers and Attacks
We selected a set of noiser layers which we applied during the training process. We exposed the neural networks to various kinds of distortions which they needed to handle in order to increase the performance. In this way, we were able to steer the training of the neural networks. The selected distortions included cropping and cropout, dropout, Gaussian smoothing, rotation, subsampling 4:2:0, an approximation of JPEG and resizing.
The crop distortion returns a cropped square of the image with a specified area ratio . The cropout attack works similarly to the crop: it crops a square of the image and, instead of discarding the rest of the image, replaces the remaining area with the image . As in [Zhu_2018_ECCV], we decided to use the image as the background for the encoded image , because this simulates a binary symmetric channel (BSC), a standard model considered in information theory, where the receiver does not know whether an obtained bit is correct or wrong. Filling the remaining area with a uniform or random color imitates another simple communication model called a binary erasure channel (BEC). We parameterized the cropout attack with a value equal to the ratio of the cropped area to the entire image area. The dropout attack keeps a percentage of the pixels of the image and replaces the remaining pixels with the corresponding pixels of the image . As in the cropout, this procedure also simulates the BSC model. The Gaussian smoothing was done with a parameter (a kernel width).
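The three pixel-replacement attacks described above can be sketched with NumPy as follows. This is a hedged illustration rather than the authors' code: the function names, the random window placement and the channel-first layout are assumptions made for the example.

```python
import numpy as np


def _window(h, w, area_ratio, rng):
    """Random window covering roughly `area_ratio` of an h x w image."""
    win_h = max(1, int(round(h * np.sqrt(area_ratio))))
    win_w = max(1, int(round(w * np.sqrt(area_ratio))))
    top = int(rng.integers(0, h - win_h + 1))
    left = int(rng.integers(0, w - win_w + 1))
    return top, left, win_h, win_w


def crop(encoded, area_ratio, rng):
    """Keep only a window of the encoded image; the rest is discarded."""
    top, left, wh, ww = _window(*encoded.shape[1:], area_ratio, rng)
    return encoded[:, top:top + wh, left:left + ww]


def cropout(encoded, cover, area_ratio, rng):
    """Keep a window of the encoded image; fill the rest with the cover image."""
    top, left, wh, ww = _window(*encoded.shape[1:], area_ratio, rng)
    out = cover.copy()
    out[:, top:top + wh, left:left + ww] = encoded[:, top:top + wh, left:left + ww]
    return out


def dropout(encoded, cover, keep_ratio, rng):
    """Keep each encoded pixel with probability keep_ratio, else use the cover pixel."""
    mask = rng.random(encoded.shape[1:]) < keep_ratio   # one decision per pixel
    return np.where(mask, encoded, cover)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cover = rng.random((3, 32, 32))
    encoded = cover + 0.01 * rng.standard_normal((3, 32, 32))
    print(crop(encoded, 0.25, rng).shape)
    print(cropout(encoded, cover, 0.25, rng).shape)
    print(dropout(encoded, cover, 0.3, rng).shape)
```

Because cropout and dropout fall back to cover pixels, the decoder cannot tell which pixels still carry the watermark, which is the binary symmetric channel behaviour mentioned above.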
The next four attacks are our extension of those presented in [Zhu_2018_ECCV, wen2019romark]. The rotation attack rotates the image by degrees. The subsampling 4:2:0 is applied in many digital compression algorithms, such as JPEG or MPEG, and is the most popular of all its variants (e.g. 4:2:2, 4:1:1). It reduces the image channels Cb and Cr by calculating the average value of every 2×2 square of pixels. The procedure can be implemented as a 2D convolutional layer with one channel, kernel size equal to 2×2, stride equal to 2 and weights set to 1/4. We also used the resize attack with a scale factor . We handled two types of interpolation: nearest neighbours and Lanczos.
Approximation of JPEG.
Lossy compression algorithms can be considered the most efficient attacks against a wide range of watermarking protocols. This comes from the fact that algorithms such as JPEG are very efficient in removing barely visible details which are not essential for the viewer. On the other hand, all watermarking techniques aim at changing the image in a way that is hardly noticeable for the viewer.
Thus, it was necessary to include such algorithms in the training pipeline in order to train the encoder and the decoder appropriately. The main inconvenience of JPEG is the rounding operation applied to the quantized frequency-domain elements of the image. The derivative of the rounding function is undefined at the rounding thresholds and equal to zero on the rest of the domain. Thus, using the rounding function in the middle of the training pipeline is impossible, because it halts the update of the neural network weights by the gradient descent algorithm. We proposed an approximation of JPEG compression which executes the following steps for the image : (1) converting to the YCbCr color space, (2) subsampling 4:2:0, (3) splitting every channel separately into blocks of 8×8, (4) applying the Discrete Cosine Transform (DCT), (5) dividing by the quantization table and (6) applying the approximation of the rounding. We formulated the last two steps as follows:
(10) 
where , is the frequency-domain element of the image and is the related element of the quantization table. For our experiments, we set . We used the standard quantization table for the quality parameter and we modified the elements of the table for different in accordance with the JPEG standard [1993_jpeg_still, PARKER2017].
To the best of our knowledge it is the most precise differentiable approximation of the classic JPEG applied to the deep learning training pipeline for the watermarking. For the evaluation procedure, we used the standard JPEG.
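Steps (1) to (5) of the pipeline above can be sketched in NumPy as shown below. This sketch stops before the rounding step, because the authors' differentiable approximation of rounding is not recoverable from this text and is not reproduced here. The color conversion uses the common JFIF coefficients, the flat quantization table is a hypothetical placeholder rather than the standard JPEG table, a single table is shared by all channels for brevity, and the image sides are assumed to be multiples of 16.

```python
import numpy as np


def rgb_to_ycbcr(rgb):
    """(3, H, W) RGB in [0, 255] -> Y, Cb, Cr planes (JFIF full-range coefficients)."""
    r, g, b = rgb
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr


def subsample_420(channel):
    """Average every 2x2 square (same as a stride-2 conv with weights 1/4)."""
    h, w = channel.shape
    return channel.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))


def dct_matrix(n=8):
    """Orthonormal DCT-II basis as an (n, n) matrix."""
    k = np.arange(n)[:, None]
    x = np.arange(n)[None, :]
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * x + 1) * k / (2 * n))
    m[0, :] = np.sqrt(1.0 / n)
    return m


def blockwise_dct(channel, n=8):
    """2D DCT of every n x n block; returns shape (H/n, W/n, n, n)."""
    h, w = channel.shape
    d = dct_matrix(n)
    blocks = channel.reshape(h // n, n, w // n, n).transpose(0, 2, 1, 3)
    return d @ blocks @ d.T


def quantized_coefficients(rgb, q_table):
    """Steps (1)-(5): the values that JPEG would round in step (6)."""
    y, cb, cr = rgb_to_ycbcr(rgb.astype(float))
    cb, cr = subsample_420(cb), subsample_420(cr)
    return [blockwise_dct(c - 128.0) / q_table for c in (y, cb, cr)]


if __name__ == "__main__":
    image = np.random.default_rng(0).integers(0, 256, size=(3, 32, 32))
    q_table = np.full((8, 8), 16.0)          # hypothetical placeholder table
    for coeffs in quantized_coefficients(image, q_table):
        print(coeffs.shape)
```

Every operation up to this point is linear and therefore differentiable; only the final rounding blocks the gradient, which is why that single step needs a smooth replacement during training.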
4.5 Training details.
The method was trained on the COCO dataset [coco]. We used 10000 randomly sampled cover images for the training subset and 1000 for the validation subset. The two subsets were disjoint. The messages were sampled randomly, and the spatial spreading was random as well. The parameters , , and were set to , , and , respectively. We used Adam [adam] with the learning rate equal to and other default parameters for the stochastic gradient descent optimization. The models were trained with the batch size equal to . The final training with all noiser layers applied took 100 epochs.
5 Analysis of the attacks
We observed that most of the attacks we considered can be divided into more general groups based on their specific characteristics. Thus, we classified the attacks according to the way in which they affect the image. We also assumed that after any attack the content of the image needs to remain visible and its quality has to be acceptable to the customers. With these assumptions, we consider five types of attacks:

Pixel-specific, where we modify only single pixels (without considering any others) by changing color, adding noise, replacing pixels with other random ones, removing some pixels or changing their position in the image. In this group we can specify two subgroups: one that applies one modification to all pixels, and one that applies one modification to a subset of pixels. A characteristic of this group is that after the attack either we still have access to a smaller subset of non-modified pixels, or all pixels were transformed in the same specific way. To this group, we assigned attacks such as color space conversions, cropping, cropout, dropout and rotation.

Local, where we modify pixels with regard to their local neighbourhoods. In this group, all pixels are modified during the attack, but only the neighbours of a pixel affect the result (e.g., subsampling, Gaussian blur and resizing).

Domain, where modifications are domain-specific and even small changes in limited neighbourhoods can globally affect an image represented in a different domain. This group of attacks includes all transform methods, e.g. the Discrete Cosine Transform (DCT).

Mixed, where the final modification is a combination of methods from other groups. Here we can distinguish JPEG, which combines color space conversion, subsampling and locally applied DCT.
The analysis of attack types is important and helpful in the context of designing a training pipeline. Most of the recent deep learning solutions for watermarking use additional noiser layers in order to improve robustness against particular attacks (e.g. [Zhu_2018_ECCV, wen2019romark, ReDMark]). This requires choosing a finite set of attacks applied during the training process. Moreover, all attacks in the training pipeline need to be differentiable, as the noiser layers are usually placed before the neural network responsible for the detection of the message. As such, non-differentiable attacks, e.g. JPEG compression, require differentiable approximations. An appropriate choice of attacks for a training pipeline can yield high robustness against other attacks which were not applied in the training pipeline. In [luo2020distortion], where the authors proposed a distortion-agnostic method using adversarial neural networks, we can observe that even small perturbations generated by attacks that we classify into the local group noticeably decrease the accuracy of message detection. This could mean that the attack neural network generated distortions belonging to the pixel-specific or domain groups and ignored attacks similar to those from the local group. In our work, we focused on selecting a special set of attacks which covers all four groups.
5.0.1 Robustness on exclusionary attacks’ selection.
We conducted an experiment in which we trained the pipeline with only a subset of the attacks, taken from only one of the mentioned groups, and we observed its impact on the robustness against the attacks from the same group and from the others. The results are presented in Table 1. The experiments confirmed that there exists a correlation between the ways in which particular attacks modify the image, and that stronger correlations are noticeable between attacks belonging to the same group. It is easy to notice that the crop and cropout attacks leave a complete (whole) patch of the image unmodified, i.e. the decoder has access to a non-modified patch. The dropout attack changes random pixels, but the decoder can still detect the message based on the non-modified pixels. Thus, by applying only a subset of the attacks during training we can achieve more general robustness against a wider collection of attacks from the same group.
Table 1. Rows: evaluated attacks. Columns: noiser layers used during training.
Attacks             Identity  Crop(), Dropout()  Gaussian(), Subsampling(4:2:0)
Identity            0.999     0.991              0.985
Crop()              0.847     0.894              0.833
Cropout()           0.793     0.875              0.672
Dropout()           0.530     0.972              0.574
Rotate()            0.754     0.821              0.780
Gaussian()          0.823     0.564              0.981
Subsampling(4:2:0)  0.524     0.623              0.980
Resize(, )          0.511     0.532              0.735
JPEG()              0.502     0.512              0.783
6 Watermark robustness
In this section, we present the evaluation of our method and a comparison with the current state-of-the-art solutions. The experiments were done for images of size and messages of length . Our main goal was to reduce the local bits-per-pixel capacity, thus we set . As a result, the number of bits required to store a patch (tuple) was equal to and the number of patches was equal to . Each tuple stored two bits of the message and the related index, which took four bits. The block size was set to . In order to spread all patches over the image, we needed to place blocks of size equal to pixels. This implies that the smallest supported image size was equal to pixels. The final method was trained with all types of attacks applied in the noiser layers. We used the bit accuracy as the metric of robustness against attacks. The results are presented in Table 2.
Table 2. Rows: attacks. Columns: methods. A dash marks a cell with no value given.
Attacks             Our    HiDDeN [Zhu_2018_ECCV]  DADW [luo2020distortion]  ReDMark [ReDMark]
Identity            1.0    1.0                     1.0                       1.0
Crop()              0.832  1.0                     1.0                       -
Cropout()           0.902  0.94                    -                         0.925
Dropout()           0.962  1.0                     1.0                       0.99
Rotate()            0.842  -                       -                         -
Gaussian()          0.986  0.96                    0.6                       0.5
Gaussian()          0.982  0.82                    0.5                       0.5
Subsampling(4:2:0)  0.984  -                       -                         -
Resize(, )          0.849  -                       0.671                     0.819
Resize(, )          0.908  -                       -                         -
JPEG()              0.831  0.67                    0.817                     0.746
6.0.1 Lossy compression versus watermark encoding.
Lossy compression algorithms and watermark encoders work in the same subdomain of the image, i.e. they modify perceptually insignificant pixels in order to reduce the size of the image or to encode additional information in the image, respectively. Thus, we considered these algorithms as a special and sophisticated group of attacks. Assuming the watermark is invisible, the encoder is likely to change the very pixels which are removed or modified by lossy compression algorithms. Therefore, in our work we mainly focus on preserving robustness against lossy compression techniques, as contemporary multimedia applications and services use lossy compression algorithms by default and it is impossible to skip the compression step for technical reasons, such as the limitations of the broadcast bandwidth.
6.0.2 Robustness vs. quality of images.
The method was evaluated for the PSNR equal to dB, dB and dB for the Y, Cb and Cr channels, respectively. The quality of the encoded image is similar to the results achieved in [Zhu_2018_ECCV]. In [luo2020distortion, ReDMark], the authors reported slightly higher values of the PSNR. All methods achieved an image quality similar to that of lossy compression algorithms [barni2006document], where the average PSNR over all channels is typically above dB. We did not take into consideration the robustness results from [wen2019romark] because their method modifies the image significantly. In order to compare the distortion level, we calculated the PSNR for our validation dataset after applying the JPEG compression algorithm with the quality factor and the subsampling 4:2:0. We achieved the PSNR equal to dB, dB and dB for the Y, Cb and Cr channels, respectively. Without the subsampling technique, we achieved dB, dB and dB. The PSNR results suggest that the message was encoded chiefly in the Y channel.
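The per-channel PSNR values discussed above follow the standard definition, PSNR = 10 log10(peak² / MSE). The sketch below shows that computation for channel-first images; the peak value of 255 and the function names are assumptions made for the example, not details taken from the paper.

```python
import numpy as np


def psnr(reference, distorted, peak=255.0):
    """Peak signal-to-noise ratio in dB between two arrays of equal shape."""
    diff = np.asarray(reference, dtype=float) - np.asarray(distorted, dtype=float)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")                  # identical images
    return 10.0 * np.log10(peak ** 2 / mse)


def psnr_per_channel(reference, distorted, peak=255.0):
    """PSNR of each channel of (C, H, W) images, e.g. Y, Cb and Cr."""
    return [psnr(reference[c], distorted[c], peak) for c in range(reference.shape[0])]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cover = rng.integers(0, 256, size=(3, 64, 64)).astype(float)
    encoded = cover + rng.normal(0.0, 2.0, size=cover.shape)   # synthetic distortion
    print(psnr_per_channel(cover, encoded))
```

Reporting the three channels separately, as the section does, makes it visible which channel absorbs most of the embedding distortion.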
7 Conclusions
In this paper we proposed a watermarking method based on spatial spreading of the message. Our architecture is built from convolutional neural networks and scales to any image size. We developed a special architecture for the encoder network in which the cover image and the message are fed to every second layer. We also formulated a novel, custom loss function for training the neural networks. In comparison to previous methods, our watermarking system provides a significant improvement in robustness against Gaussian smoothing, resizing and JPEG, which are attacks from the local group. The work was extended with additional attack types, such as subsampling 4:2:0 and rotation. We also achieved bit accuracies above for all considered attacks. This indicates that the method achieves high general robustness exceeding previous solutions. As a way to obtain our results, we additionally provided a precise differentiable approximation of JPEG compression and grouped the attacks on the watermark based on their scope. In future work we would like to further improve the robustness against the attacks as well as apply and evaluate multi-attack scenarios. We would like to increase the message capacity and extend the solution to the video domain and video-specific compression algorithms. Moreover, other quality measures, such as the one presented in [Perceptual2016], may be considered in order to adjust the transparency.