I. Introduction
With the rapid development of digital information processing technologies, digital media content has been widely used in many areas. Because digital media is easy to propagate, copy and modify, how to protect the copyright of digital media has become a crucial and practical problem. Digital watermarking aims to solve this problem by embedding extra information into the digital media and extracting this extra data for authorized access. Nowadays, digital watermarking is widely used in many applications, including broadcast monitoring [11], copy control [21], and device control [9]. In this paper, we focus on digital image watermarking. In particular, an image watermarking algorithm is expected to embed the message (i.e., the watermark) into the cover image (i.e., the image requiring authorized access) to obtain the watermarked image, and to recover the original message as accurately as possible from the watermarked image.
Although digital image watermarking has been widely studied in the academic community, it remains a challenging problem. There are three key factors for measuring the performance of a digital image watermarking algorithm, namely robustness, imperceptibility, and capacity. Robustness requires the message embedded into the image to survive malicious and non-malicious attacks. Imperceptibility requires the watermarked image to be as close as possible to the original one, i.e., the changes made to the original image should be negligible to human observers. Capacity refers to the amount of information that can be embedded. Besides these, the security [42] and complexity [48] aspects are also considered under certain conditions, although in many cases with much lower priority. Moreover, these key factors conflict with each other, and it is impossible to satisfy them all simultaneously [57]. Existing watermarking applications usually focus on some specific features or intend to make a trade-off among the above-mentioned factors. For instance, watermarking for copyright protection emphasizes robustness, while watermarking for broadcast monitoring requires a larger capacity. For most existing deep learning-based robust image watermarking systems [72, 56, 44], robustness and imperceptibility are the most important, but how to make a trade-off among these conflicting requirements is still one of the main challenges in this research domain.
Traditional image watermarking techniques usually embed watermarks in the spatial domain or the frequency domain [49]. For spatial domain-based watermarking techniques, one of the advantages is the computational efficiency of directly changing pixel values of the image, while the robustness easily suffers. On the contrary, frequency domain-based watermarking solutions obtain higher robustness by manipulating the frequency coefficients of the image, usually at the cost of higher computational complexity. The main drawback of traditional watermarking methods is that they are designed for specific prior constraints or targets, making them difficult to generalize to novel types of attacks [10]. This significantly constrains them to some limited applications. In recent years, deep neural networks have been applied to digital image watermarking [32, 72, 60, 61, 56, 31, 66, 71]. Because of the strong representation abilities of deep neural networks, these approaches have achieved better robustness and imperceptibility than traditional methods. In addition, neural networks can be retrained to resist novel types of attacks, or to focus on particular features such as robustness and imperceptibility, without designing a new specialized and sophisticated algorithm, making it possible to develop an adaptable and generalized framework for various watermarking applications [72]. However, most of them use the Encoder-Noiser-Decoder framework [72, 61, 56, 31, 66, 71], as shown in Fig. 1 (a). In general, this framework employs a separate encoder and decoder to embed and extract the watermark, respectively. This requires careful construction of both the message embedding and extraction, and training two separate neural networks needs complicated parameter tuning.

In this paper, we propose a novel digital image watermarking scheme named invertible watermarking network (IWN) using an invertible neural network (INN).
Inspired by the observation that, from the perspective of reversible image conversion (RIC), INN alleviates the information loss problem better than classic neural network architectures [13], we consider watermark embedding and extraction as a pair of inverse problems and solve them effectively with an INN. Different from existing Encoder-Decoder based deep watermarking networks, our compact IWN applies watermark embedding and extraction in the forward and reverse processes of the INN, respectively, sharing all network parameters, as shown in Fig. 1 (b). As INN has already been demonstrated to be an effective tool for embedding and extracting a large amount of information [43], our IWN achieves high imperceptibility, benefiting from the strictly invertible property of INN [3]. To enhance the robustness, we introduce a well-designed bit message normalization module and a noise layer in our system. The former also ensures that bit messages of different lengths can be easily accommodated with high recovery accuracy in our IWN. With the noise layer, which is used to simulate various attacks, the strong fitting ability of our IWN enables us to effectively learn robustness against various practical distortions. Extensive experiments show that our method achieves better results than the most commonly used baseline. In addition, we are the first to introduce INN into the field of watermarking, and we hope to inspire follow-up research.
In summary, the main contributions of this paper are:

To our knowledge, we are the first to introduce invertible neural networks into digital watermarking, and we propose an invertible watermarking network (IWN) for robust and blind digital image watermarking.

We introduce a bit message normalization module for condensing the messages and a noise layer for simulating various attacks, respectively, with which the watermarking robustness is significantly improved.

We provide extensive experiments to demonstrate the superiority of our method under a variety of distortions.
II. Related Work
Since the term digital watermarking first appeared in [59], it has been an active research area [34, 14, 15] with many applications such as copyright protection and owner identification. Besides natural images, digital watermarking has also been applied to other fields, such as medical image watermarking [26], video watermarking [5], dynamic software watermarking [45], 3D watermarking [22, 28], audio watermarking [41], and neural network watermarking [62]. In this paper, we focus on robust digital image watermarking, and in this section we briefly review the two research areas most relevant to our work, i.e., digital image watermarking and invertible neural networks.
II-A. Digital Image Watermarking
Traditional digital image watermarking techniques usually embed messages in the spatial domain or the frequency domain [49]. In general, spatial-domain methods directly embed watermarks by manipulating bitstreams or pixel values [55, 37, 17]. Among them, Least-Significant-Bit (LSB) substitution [59] is a representative work of this subcategory; however, it suffers from low capacity and sensitivity to various image processing attacks [10]. On the other hand, frequency domain-based watermarking techniques modify the frequency coefficients when embedding watermarks. Compared with the spatial domain-based methods, these solutions further improve the robustness, imperceptibility, capacity, fidelity, and security at the cost of higher computational complexity [16, 36]. In this class of watermarking methods, the commonly used frequency domains include the Discrete Cosine Transform (DCT) domain [29], the Discrete Fourier Transform (DFT) domain [27], the Discrete Wavelet Transform (DWT) domain [25] and the contourlet domain [6, 8]. For instance, Kang et al. [33] propose to embed the spread-spectrum watermark in the coefficients of the LL subband in the DWT domain, and Sadreazami et al. [52] embed the watermark in the contourlet domain. They exploit the robustness against JPEG compression of the low-frequency components in the wavelet domain and of the contour components in the contourlet domain, respectively. This idea of finding invariants that are robust under attacks is also utilized to resist geometric distortions including translation, rotation, and cropping [68, 47, 63, 58]. The main drawback of these traditional watermarking methods is that they are designed for specific prior constraints or targets, making them difficult to generalize to novel types of attacks [10]. In other words, these techniques can only handle some limited tasks.

Recently, many researchers have applied neural networks to digital image watermarking, and some novel methods indeed achieve superior robustness and imperceptibility over traditional methods. For example, Kandi et al. [32]
first introduce convolutional neural networks (CNNs) to non-blind watermarking. Mun et al. [46] further propose a blind watermarking architecture based on CNNs to embed and extract watermarks. Zhu et al. [72] propose an end-to-end neural network with adversarial training for both steganography and robust blind watermarking. ROMark [60] simplifies adversarial training by using a min-max formulation for robust optimization. After that, RedMark [2] uses two Fully Convolutional Networks (FCNs) with residual connections to embed watermarks in the frequency domain without adversarial training. Different from the dependent deep hiding methods (DDH) [61, 56, 31, 66, 71], which adapt the watermark to the original cover image, UDH [67] proposes a universal deep hiding method that embeds the watermark independently of the cover image. These existing works have demonstrated a variety of neural network structures for effective message embedding and extraction, ensuring that the watermarked image and the cover image have little or even no perceptual difference.

For existing deep learning-based watermarking methods, a noise layer is usually introduced into the network to deal with various distortions. However, in order to train the entire network in an end-to-end manner, the noise layer must be differentiable. For non-differentiable distortions such as JPEG compression, some methods [72, 2, 56] simulate them with a differentiable approximation, allowing the network to be trained in an end-to-end style. In [44] and [61], some distortions are generated by a trained CNN instead of being explicitly modeled from a fixed pool during training, which is another way to deal with non-differentiable and hard-to-model distortions. In addition, Liu et al. [39] design a redundant two-stage separable deep learning framework to address the problems of one-stage end-to-end training, such as image quality degradation and the difficulty of simulating noise attacks with differentiable layers. Although many strategies have been proposed to deal with various distortions, how to ensure the robustness of a digital watermark in various situations remains an open problem.
Besides the noise layer, most existing deep learning-based robust image watermarking systems use the Encoder-Noiser-Decoder framework [72, 61, 56, 31, 66, 71]. In this framework, the encoder embeds the watermark into the cover image in an imperceptible manner, and the decoder recovers the watermark message from the distorted watermarked image. This kind of architecture usually requires sophisticated design of both the encoder and the decoder, resulting in complex training with careful parameter tuning. Different from those previous works, where the encoder and decoder are two independent networks, we adopt a bijective INN for both watermark embedding and extraction.
II-B. Invertible Neural Network (INN)
In recent years, INN has attracted much attention because of its efficient invertibility. INN was originally proposed for flow-based generative models, where a stable invertible mapping is learned between a complex data distribution and a simple latent distribution. NICE [18] and RealNVP [19] propose the additive and the affine coupling layers, respectively. These coupling layers are the basic components of INN, satisfying the requirements of efficient inversion and a tractable Jacobian determinant. The invertibility of such networks is specially explored in [23, 3]. In [54], flexible INNs are constructed with masked convolutions under some composition rules. An unbiased flow-based generative model is introduced in [12]. Besides, Glow [35], FFJORD [24], i-RevNet [30] and i-ResNet [7] achieve better generation results by continuously improving the network representation capacity.
In this context, INN has been used for a variety of challenging tasks due to its powerful fitting ability. For example, a conditional invertible neural network (cINN) is introduced for guided image generation [4], including MNIST digit generation and image colorization. cINN is also used for network-to-network translation [50] and image-to-video synthesis [20]. In addition, there are different solutions specified for image rescaling [64], image compression [65], image or video super-resolution [73], image denoising [40], underexposed image enhancement [69] and image color adjustment [70]. As the latest work, Cheng et al. [13] propose a generic framework for reversible image conversion, namely IICNet, which aims to encode a series of input images into a single image and decode them back. Particularly, Lu et al. [43] first introduce INN into large-capacity image steganography, where up to 5 images are successfully embedded into a host image with the same spatial and color resolution. These latest advances demonstrate that INN has great potential in data embedding and extraction. However, they all ignore the robustness issue that the embedded data may be corrupted by image compression and other distortions. On the contrary, our approach focuses on solving this robustness challenge.

III. Proposed Method
III-A. Overview
Instead of employing the cascaded Encoder-Noiser-Decoder architecture that is widely used in existing methods, here we propose an invertible watermarking network (IWN), where a bijective INN is used to embed and extract the message. As shown in Fig. 2, our compact IWN contains three components: 1) the invertible neural network, 2) the bit message normalization module, which includes the preprocessing and postprocessing submodules, and 3) the noise layer. To efficiently represent the bit message, the bit message normalization module first converts the original bit sequence $M$ into a normalized tensor $T$. After that, INN is used as our backbone for efficient message embedding and extraction. In this component, the forward process of INN takes the cover image $I_c$ and the preprocessed message $T$ as input, and it generates the watermarked image $I_w$, which is as similar as possible to the original image, together with the lost information $r$. The noise layer is then introduced to deal with various noises and distortions produced by practical image operations; it applies different noises to the watermarked image and obtains the simulated noised image $I_n$. To extract the watermark message, the noised image $I_n$ is fed into the reverse process of INN to generate the output message tensor $\hat{T}$ and the revealed cover image $\hat{I}_c$. With the message postprocessing submodule of bit message normalization, we finally extract the bit message sequence $\hat{M}$ from $\hat{T}$. More details about the introduced notations are summarized in Tab. I.
TABLE I: Summary of notations.

Notation     Description
$I_c$        Cover image
$I_w$        Watermarked image
$I_n$        Noised image produced by distortion simulation
$\hat{I}_c$  Revealed cover image
$M$          Original bit message
$\hat{M}$    Extracted bit message
$T$          Preprocessed message tensor for INN's forward input
$\hat{T}$    Revealed message tensor of INN's reverse output
$r$          Lost information
$z$          Constant matrix for recovering the message
III-B. Invertible Neural Network (INN)
INN is powerful and effective in dealing with reversible problems, especially for data hiding and recovery [43]. Intuitively, watermarking is a special application of image steganography when a bit sequence message is taken as the hidden data. In this sense, the original bit message $M$ is first converted into a tensor $T$ with the same spatial resolution as the cover image by the preprocessing submodule of our bit message normalization. After that, we use the forward process of INN for message embedding and its reverse process for message extraction. As shown in Fig. 2, in the forward process, the message tensor $T$ and the cover image $I_c$ serve as inputs, and the corresponding outputs are the watermarked image $I_w$ and a matrix $r$, which merely satisfies the structural consistency of INN and is not used for the reverse mapping. As for the reverse process of our INN, the noised image $I_n$ and a predefined constant matrix $z$ are fed in, and then the revealed cover image $\hat{I}_c$ and the message tensor $\hat{T}$ are extracted. Finally, the message $\hat{M}$ is obtained by the postprocessing submodule of our bit message normalization.
As shown in Fig. 2, our INN consists of several invertible blocks with the same structure, and each block includes three submodules $\phi(\cdot)$, $\rho(\cdot)$ and $\eta(\cdot)$. INN contains two branches, corresponding to the hidden message and the cover image, respectively. For the $i$-th invertible block, the input is $[T^{(i)}, I^{(i)}]$ and its output is $[T^{(i+1)}, I^{(i+1)}]$, where $[\cdot, \cdot]$ is the concatenation operator in the channel dimension. Formally, the forward process is calculated as follows:
$I^{(i+1)} = I^{(i)} + \phi(T^{(i)}), \quad T^{(i+1)} = T^{(i)} \odot \exp(\rho(I^{(i+1)})) + \eta(I^{(i+1)})$  (1)
where $\phi(\cdot)$, $\rho(\cdot)$ and $\eta(\cdot)$ are convolution operations, $\exp(\cdot)$ is the exponential function and $\odot$ is the Hadamard product. Accordingly, the reverse process in the $i$-th invertible block is calculated as follows:
$T^{(i)} = (T^{(i+1)} - \eta(I^{(i+1)})) \odot \exp(-\rho(I^{(i+1)})), \quad I^{(i)} = I^{(i+1)} - \phi(T^{(i)})$  (2)
In other words, given the output $[T^{(i+1)}, I^{(i+1)}]$, we can accurately calculate the input $[T^{(i)}, I^{(i)}]$ according to Eq. (2). By cascading the blocks, given the reverse input $[z, I_n]$, the output $[\hat{T}, \hat{I}_c]$ can be solved. It is worth noticing that the three submodules $\phi$, $\rho$ and $\eta$, which contain the learnable parameters, appear both in Eq. (1) of the forward process and in Eq. (2) of the reverse process. That is to say, INN shares all parameters between its forward and reverse mapping operations. Benefiting from this architecture, INN performs stable and efficient inversion, which is exactly what we need in the watermarking task.
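To make the shared-parameter inversion concrete, the following is a minimal numerical sketch of Eqs. (1) and (2), with the learned convolutional submodules $\phi$, $\rho$, $\eta$ replaced by fixed element-wise functions (an illustrative assumption, not the trained networks):

```python
import math

# Toy stand-ins for the convolutional submodules phi, rho, eta.
# In the real IWN these are learned CNNs; fixed element-wise
# functions are enough to demonstrate the exact invertibility.
phi = lambda t: [0.5 * v for v in t]
rho = lambda c: [0.1 * v for v in c]
eta = lambda c: [0.2 * v for v in c]

def forward_block(T, I):
    """Eq. (1): the image branch is additive, the message branch affine."""
    I_next = [i + p for i, p in zip(I, phi(T))]
    T_next = [t * math.exp(r) + e
              for t, r, e in zip(T, rho(I_next), eta(I_next))]
    return T_next, I_next

def reverse_block(T_next, I_next):
    """Eq. (2): the exact algebraic inverse of forward_block,
    reusing the very same phi, rho, eta (shared parameters)."""
    T = [(t - e) * math.exp(-r)
         for t, r, e in zip(T_next, rho(I_next), eta(I_next))]
    I = [i - p for i, p in zip(I_next, phi(T))]
    return T, I

T0, I0 = [0.3, -0.7, 1.2], [0.9, 0.1, -0.4]
T1, I1 = forward_block(T0, I0)
T0_rec, I0_rec = reverse_block(T1, I1)
```

Running the reverse block on the forward outputs recovers the inputs up to floating-point precision, which is exactly the property IWN exploits: the decoder is the encoder run backwards.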
To optimize the network, we calculate loss functions for the four outputs introduced above. One of our goals is that there is no visual difference between the watermarked image $I_w$ and the original cover image $I_c$, so we introduce the loss $\mathcal{L}_w$ to achieve that:

$\mathcal{L}_w = \ell(I_w, I_c)$  (3)
where $\ell(\cdot, \cdot)$ is the combination of the $L_1$ norm and the $L_2$ norm. Similarly, we introduce $\mathcal{L}_T$ to ensure that $\hat{T}$ and $T$ are as close as possible:

$\mathcal{L}_T = \ell(\hat{T}, T)$  (4)
In addition, we also add the constraints $\mathcal{L}_r$ and $\mathcal{L}_c$ for the matrix $r$ and the revealed cover image $\hat{I}_c$, respectively:

$\mathcal{L}_r = \ell(r, z), \quad \mathcal{L}_c = \ell(\hat{I}_c, I_c)$  (5)
Finally, the total loss of our system is formulated as:

$\mathcal{L} = \lambda_w \mathcal{L}_w + \lambda_T \mathcal{L}_T + \lambda_r \mathcal{L}_r + \lambda_c \mathcal{L}_c$  (6)
where $\lambda_w$, $\lambda_T$, $\lambda_r$ and $\lambda_c$ are the weights of the corresponding losses presented above. Note that under the Crop and Cropout attacks, we only calculate the loss in the cropped region, and multiply it by the corresponding ratio of the original shape to the cropped region shape. Please refer to Sec. III-D for more details on how we deal with the Crop and Cropout attacks.
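A pure-Python sketch of the objective in Eqs. (3)–(6); the 1:1 weighting of the two norms inside $\ell$ and the helper names are illustrative assumptions:

```python
def ell(a, b, w1=1.0, w2=1.0):
    """Combined L1 + L2 reconstruction loss between two flat tensors.
    The 1:1 weighting of the two norms is an assumption for illustration."""
    diffs = [x - y for x, y in zip(a, b)]
    l1 = sum(abs(d) for d in diffs)        # L1 term
    l2 = sum(d * d for d in diffs)         # squared-L2 term
    return w1 * l1 + w2 * l2

def total_loss(I_w, I_c, T_hat, T, r, z, I_c_hat,
               lam_w=1.0, lam_T=1.0, lam_r=1.0, lam_c=1.0):
    """Eq. (6): weighted sum of the four reconstruction losses."""
    return (lam_w * ell(I_w, I_c) + lam_T * ell(T_hat, T)
            + lam_r * ell(r, z) + lam_c * ell(I_c_hat, I_c))
```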
III-C. Bit Message Normalization
In general, there is a conflict between the robustness and the capacity of watermarking, i.e., when embedding more information, the watermarking scheme is more vulnerable to attacks. Therefore, we introduce a novel bit message normalization module, which contains preprocessing and postprocessing submodules, to normalize the bit message in a simple but effective way, with which the robustness is significantly improved. Specifically, different from HiDDeN [72], which uses one channel to represent each bit of the message, our bit message normalization module can represent many bits with just one channel. Moreover, this module allows us to flexibly adjust the watermarking capacity according to different practical applications, without changing the network architecture of our system. Here we introduce the details of our bit message normalization module.
Preprocessing. The main purposes of this submodule are to improve robustness and to convert a sequence of bits into a tensor suitable for convolutional networks, via two steps: bit transformation and broadcasting. For a bit message sequence $M$ of length $L$, we divide it into $N$ groups. Each group contains $L/N$ bits, which is treated as a binary number. For the convenience of training, we convert the grouped binary numbers into the corresponding 8-bit integers, aligned with the most common 8-bit color depth; in other words, we transform the grouped binary numbers into their corresponding decimal numbers. In order to encode the bit information into the highest bit positions for higher error tolerance, rather than into lower ones, we left shift each binary number by $(8 - L/N)$ bits. Then, treating the shifted binary numbers as 8-bit integers, we add an offset to them, ensuring that the mean value of all integers generated from a random bit sequence equals 128, the median of color pixel values between 0 and 255. In Fig. 3 (a) we show an example of encoding 3 bits into one channel, i.e., $L/N = 3$. In order to spread the watermark message over all image pixels, the message of shape $1 \times 1 \times N$ is then broadcast to the input message tensor $T$ of shape $H \times W \times N$, where $H$ and $W$ are the height and width of the cover image $I_c$. Interestingly, the bit message processing method in HiDDeN [72] can be regarded as a special case of ours with $L/N = 1$.
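The bit transformation above, together with the mode-based inverse mapping used at extraction time, can be sketched in pure Python as follows; the exact offset value $2^{7-L/N}$ (which centers the mean of random messages at 128) is our inference from the description:

```python
from collections import Counter

def preprocess(bits, n_groups):
    """Bit transformation: group -> left shift -> offset (mean becomes 128)."""
    k = len(bits) // n_groups            # bits per group (L/N)
    offset = 1 << (7 - k)                # inferred offset, centers mean at 128
    vals = []
    for g in range(n_groups):
        group = bits[g * k:(g + 1) * k]
        num = int("".join(map(str, group)), 2)
        vals.append((num << (8 - k)) + offset)
    return vals                          # one 8-bit integer per channel

def postprocess(channels, n_bits):
    """Per channel: snap noisy values to the nearest valid level, take the
    mode, then undo offset/shift and expand back to bits."""
    k = n_bits // len(channels)
    offset = 1 << (7 - k)
    levels = [(v << (8 - k)) + offset for v in range(1 << k)]
    bits = []
    for ch in channels:                  # each channel: list of noisy values
        snapped = [min(levels, key=lambda l: abs(l - v)) for v in ch]
        mode = Counter(snapped).most_common(1)[0][0]
        num = (mode - offset) >> (8 - k)
        bits += [int(b) for b in format(num, "0%db" % k)]
    return bits

msg = [1, 0, 1, 0, 1, 1]                 # L = 6 bits, N = 2 groups
enc = preprocess(msg, 2)
noisy = [[v + d for d in (-3, 2, 1)] for v in enc]   # simulated distortion
```

Even with small perturbations of the channel values, snapping to the nearest candidate level followed by the mode recovers the original bit sequence exactly.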
Postprocessing. Our goal is to obtain a bit message sequence $\hat{M}$ from the tensor $\hat{T}$, so we first need to recover $N$ integers. For the elements in each channel, we need to map them to a single decimal number. As shown in Fig. 4 (a), the original data distribution of one channel from $\hat{T}$ presents a single peak. To eliminate the interference of outliers in each channel, we convert all numbers to their nearest ground-truth values, which have $2^{L/N}$ candidates as shown in Fig. 4 (b), and then take the mode of the converted numbers as the final extracted number. After that, the $N$ integers are converted to a bit message sequence according to the inverse of the preprocessing, as shown in Fig. 3 (b); specifically, this includes offset subtraction, right shift, and binary conversion operations.

III-D. Noise Layer
TABLE II: PSNR (dB) between cover images and watermarked images under various distortions.

Model        Identity  Crop       Cropout  Dropout  Gaussian  JPEG    Combined
                       (p=0.035)  (p=0.3)  (p=0.3)  (σ=2)     (Q=50)
HiDDeN [72]  36.74     32.70      31.94    34.39    30.38     32.64   32.92
Ours         37.88     34.30      32.26    30.31    37.03     36.16   32.99
In practice, watermarked images suffer from various distortions during compression, transmission and interactive editing operations. The robustness against these different attacks (or noises), which may destroy the embedded watermarks in the real world, is one of the important issues for a digital image watermarking algorithm. In order to improve the robustness, here we introduce the noise layer to simulate various image distortions, including the Identity, Crop, Cropout, Dropout, Gaussian and JPEG operations. See Fig. 5 for more details and examples. Specifically, as a component of IWN, we expect the proposed noise layer to be differentiable so that the whole network can be trained in an end-to-end style. Next, we discuss these different distortions in detail, in terms of whether they are differentiable or not.
Differentiable Noises. Most watermarking noises, including Identity, Crop, Cropout, Dropout, and Gaussian, are inherently differentiable, so we add them to our framework directly. For the Identity noise, we do not change the watermarked image at all. Crop produces a rectangle by randomly cropping from the watermarked image, where a percentage $p$ controls the remaining ratio of the watermarked image. Cropout randomly replaces a rectangle of the cover image with the counterpart of the watermarked image; similarly, the replaced area is controlled by the ratio $p$. Dropout also replaces some cover image pixels with watermarked image pixels similar to Cropout, while the difference is that instead of replacing a whole area, it randomly selects individual pixels for replacement based on the remaining ratio $p$. Finally, Gaussian blurs the watermarked image with a Gaussian kernel of the given width $\sigma$.
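A pure-Python sketch of two of the pixel-replacement noises described above (Cropout with a fixed rectangle and Dropout with per-pixel replacement; positions and rectangle locations are randomized in the actual noise layer):

```python
import random

def dropout_noise(cover, watermarked, p):
    """Dropout: keep each watermarked pixel with probability p,
    otherwise fall back to the corresponding cover pixel."""
    return [[w if random.random() < p else c
             for c, w in zip(crow, wrow)]
            for crow, wrow in zip(cover, watermarked)]

def cropout_noise(cover, watermarked, top, left, h, w):
    """Cropout: the cover image with one rectangle replaced by the
    corresponding watermarked region (position fixed here for clarity)."""
    out = [row[:] for row in cover]
    for i in range(top, top + h):
        for j in range(left, left + w):
            out[i][j] = watermarked[i][j]
    return out

cover = [[0] * 4 for _ in range(4)]      # toy 4x4 "cover image"
marked = [[1] * 4 for _ in range(4)]     # toy 4x4 "watermarked image"
noised = cropout_noise(cover, marked, 1, 1, 2, 2)
```

Only the 2x2 rectangle keeps watermarked pixels; everything else reverts to the cover, which is what makes the decoder's job hard under these attacks.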
Quantization. Watermarked images are widely used in storage and transmission, so they must be converted into commonly used image formats, such as the 8-bit RGB format (i.e., 8 bits for each color channel). In the practical implementation, we need a differentiable quantization module to convert the floating-point values of the INN's outputs to 8-bit unsigned integers. To ensure gradient back-propagation during training, the rounding operation is used as the quantization module, and the Straight-Through Estimator [7] is adopted when calculating the gradients. In our solution, we include this quantization noise in all training and testing experiments, so that our watermarking system can effectively deal with quantization error.

JPEG compression. Similar to the quantization operator, the noise produced by JPEG compression is non-differentiable due to the quantization step in the compression framework. To solve the gradient back-propagation problem during training, we follow [53] to simulate the quantization step of standard JPEG compression with the following equation,
$\lfloor x \rceil_{\mathrm{approx}} = \lfloor x \rceil + (x - \lfloor x \rceil)^3$  (7)
where $\lfloor \cdot \rceil$ is the rounding function and $\lfloor \cdot \rceil_{\mathrm{approx}}$ is its differentiable approximation, which has non-zero derivatives nearly everywhere. Note that in our solution we use real JPEG compression instead of the JPEG simulator during testing. By transforming the non-differentiable part into a derivable approximation, we construct a completely differentiable noise layer, with which our IWN can be efficiently trained in an end-to-end way.
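The rounding surrogate of Eq. (7) and the straight-through idea used for quantization can be sketched as follows (the PyTorch form `x + (x.round() - x).detach()` mentioned in the comment is a common implementation, given here as an assumption):

```python
def round_approx(x):
    """Differentiable rounding surrogate from Eq. (7):
    round(x) + (x - round(x))**3. Its derivative 3*(x - round(x))**2
    is non-zero almost everywhere, unlike the true rounding step."""
    r = float(round(x))
    return r + (x - r) ** 3

def ste_quantize(x):
    """Straight-Through Estimator idea: the forward pass uses hard
    rounding; in an autograd framework the backward pass would treat
    the operation as identity (e.g. x + (x.round() - x).detach())."""
    return float(round(x))
```

Both functions agree with true rounding at integers, and `round_approx` stays close to it everywhere while providing usable gradients.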
IV. Experiments
IV-A. Experimental Setup
We implement our IWN with PyTorch and train our model with the Adam optimizer. The learning rate is set to 2e-4 and the batch size is 6. We use 16 invertible blocks for embedding a 30-bit message, and the message is divided into $N = 10$ groups. In addition, all elements of the constant matrix $z$ are set to 0.5. Our IWN is trained on the DIV2K [1] and Flickr2K [38] datasets, and it is tested on a subset of ImageNet [51] which contains 1000 images. All training images are cropped into 480×480 patches and resized to 128×128 during training, while the test images are all resized to 128×128. Flipping and rotation are randomly used for data augmentation. The quality factor Q of the JPEG simulator is uniformly sampled from {50, 60, 70, 80, 90}.

In our system, the loss weights $\lambda_T$, $\lambda_r$ and $\lambda_c$ are kept fixed, while the weight $\lambda_w$ varies according to the training stage. For instance, when the noise layer is the Identity layer, which means that no distortions are applied to the watermarked image, $\lambda_w$ is set to 32. In other cases, $\lambda_w$ is first set to 0.1 until the system converges, and then raised to 48.0 until convergence. In other words, we first train the robustness of watermark extraction against various distortions, and then train the imperceptibility of our watermarking solution. All experiments are conducted on two Nvidia RTX 2080Ti GPUs.
TABLE III: PSNR and bit accuracy of the combined model with different numbers of invertible blocks.

Block number  PSNR   Identity  Crop       Cropout  Dropout  Gaussian  JPEG
              (dB)             (p=0.035)  (p=0.3)  (p=0.3)  (σ=2)     (Q=50)
4             30.80  0.8621    0.7987     0.8181   0.6428   0.7523    0.5433
8             30.21  0.9762    0.8187     0.9143   0.7329   0.8449    0.6306
12            30.97  0.9949    0.8594     0.9655   0.7138   0.9364    0.6604
16            32.99  0.9994    0.8331     0.9471   0.7529   0.8611    0.7687
TABLE IV: PSNR and bit accuracy of the combined model with different message lengths L and group numbers N.

L & N    PSNR   Identity  Crop       Cropout  Dropout  Gaussian  JPEG
         (dB)             (p=0.035)  (p=0.3)  (p=0.3)  (σ=2)     (Q=50)
30 & 10  32.99  0.9994    0.8331     0.9471   0.7529   0.8611    0.7687
40 & 10  31.34  0.9187    0.7055     0.8211   0.6297   0.8087    0.7313
50 & 10  31.96  0.7494    0.6708     0.7342   0.6129   0.6879    0.6585
60 & 10  30.50  0.7212    0.6275     0.6749   0.5913   0.6256    0.6403
30 & 30  30.36  0.7470    0.7271     0.7415   0.6306   0.7455    0.5788
IV-B. Metrics
We evaluate our method mainly on robustness and imperceptibility, which are generally more important than capacity for a watermarking algorithm. Specifically, we measure the imperceptibility using the peak signal-to-noise ratio (PSNR) between the cover image and the watermarked image, and we measure robustness using the bit accuracy, i.e., the percentage of identical bits between the original message $M$ and the extracted message $\hat{M}$ relative to the total number of message bits.

IV-C. Comparison
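The two evaluation metrics described in Sec. IV-B can be sketched in pure Python as:

```python
import math

def bit_accuracy(msg, extracted):
    """Fraction of positions where the original and extracted bits agree."""
    return sum(a == b for a, b in zip(msg, extracted)) / len(msg)

def psnr(cover, watermarked, peak=255.0):
    """Peak signal-to-noise ratio between two flat 8-bit images."""
    mse = sum((a - b) ** 2 for a, b in zip(cover, watermarked)) / len(cover)
    return float("inf") if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)
```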
We take HiDDeN [72] as the baseline method for comparison since it is a well-studied model and a commonly used benchmark. We reproduce the experiments of HiDDeN [72] with its open-source code. Following HiDDeN [72], watermarked images are exposed to the following 6 distortions: Identity, Crop, Cropout, Dropout, Gaussian, and JPEG compression. We control the intensity of the distortions with the following scalars: the remaining ratio $p$ for Crop, Cropout and Dropout, the kernel width $\sigma$ for Gaussian, and the quality factor Q for JPEG compression. These scalars are identical to those adopted in HiDDeN [72] during testing. Specialized models are optimized to be resistant to the specific distortions mentioned above, and the final combined model is trained to be robust against all of them. In order to further improve the robustness against real-world JPEG compression, the noise layer is randomly sampled from the set {Identity, Crop, Cropout, Dropout, Gaussian, JPEG} with a probability distribution of {0.05, 0.05, 0.1, 0.15, 0.65} for each mini-batch during training of the combined model. We report both the bit accuracy and the PSNR when various distortions are applied to watermarked images. It is worth noticing that these two metrics may present conflicting evaluation results. For instance, a higher PSNR value usually means that the embedded message changes less information in the cover image, which makes it more difficult to accurately recover the bit sequence, so the corresponding bit accuracy decreases. A well-designed watermarking algorithm should avoid pairing a high PSNR with a low bit accuracy, or a high bit accuracy with a low PSNR. Therefore, we deliberately avoid this conflicting situation for a fair comparison.
IV-C1. Quantitative Results
Tab. II shows the PSNR between cover images and watermarked images, produced by the 6 specialized models against the corresponding noises and by the combined model. Moreover, Fig. 6 illustrates the bit accuracy of different models against the 6 distortions. In general, when compared with the specialized baselines, 5 of our specialized models, i.e., Identity, Crop, Cropout, Gaussian, and JPEG, have higher PSNR values. Meanwhile, our solution obtains higher bit accuracy than the baseline method. In particular, our method achieves a +3.52 dB gain over the baseline in imperceptibility under JPEG compression, together with 18.4% higher bit accuracy in terms of robustness. When comparing the combined models, we have almost the same PSNR, but the robustness of our algorithm is much better than the baseline against the Identity, Cropout, Gaussian, and JPEG compression distortions, among which 18.6% higher bit accuracy is achieved under the JPEG compression distortion. Besides, these two metrics, i.e., the PSNR and the bit accuracy, demonstrate that our method achieves a better balance between the imperceptibility and the robustness of watermarking than HiDDeN [72].
Fig. 7 provides a more comprehensive comparison of the bit accuracy against various intensities of distortions. In general, our combined model achieves better performance than HiDDeN [72] in resisting distortions. Although our method falls behind in Dropout (p=0.3) according to Fig. 6, the Dropout curve in Fig. 7 shows that our method surpasses HiDDeN [72] for larger remaining ratios $p$. When comparing the Identity model to the combined model, we can see that the noise layer has obvious benefits for the Gaussian and JPEG compression distortions, as the Identity model generally fails under these two attacks.
IVC2 Qualitative Results
Fig. 8 provides a visual comparison of watermarked images produced by the combined models. Both the watermarked images and the corresponding magnified differences from the original images demonstrate that the bit message is embedded in an imperceptible way. The watermarked images generated by our model and by HiDDeN [72] look very similar to the corresponding cover images, which is consistent with the PSNR values reported in Tab. II.
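Such magnified difference images are commonly produced by scaling the absolute residual between the cover and watermarked images so the otherwise invisible perturbation becomes visible. A minimal sketch, where the amplification factor is an arbitrary choice rather than a value taken from the paper:

```python
import numpy as np

def magnified_difference(cover: np.ndarray, watermarked: np.ndarray,
                         scale: float = 10.0) -> np.ndarray:
    """Amplify the (usually imperceptible) residual between cover and
    watermarked images, clipped back to the valid 8-bit pixel range."""
    residual = np.abs(watermarked.astype(np.float64) - cover.astype(np.float64))
    return np.clip(scale * residual, 0, 255).astype(np.uint8)
```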
IVD Ablation Study
Here we discuss how two hyper-parameters, the number of invertible blocks and the length of the bit message, affect the performance of our method. Since our goal is a robust watermarking model, the experiments in this part are conducted on the combined model rather than on any specialized one.
Firstly, we discuss the number of invertible blocks in our INN module, with results reported in Tab. III. In general, our solution performs better with more blocks, which is reasonable since more invertible blocks imply more trainable parameters. It is particularly worth noting that when the number of blocks increases from 12 to 16, the bit accuracy under JPEG compression increases significantly by 10% (see the last two rows of the last column of Tab. III), and the PSNR of the watermarked images also increases by 2.02 dB. The bottleneck of our method is robustness against JPEG compression: with 8 or 12 blocks, the bit accuracy stays above 0.7 for all distortions except JPEG compression. Although the model with 16 invertible blocks does not always perform best, it achieves the highest PSNR values and the best robustness against JPEG compression. Our final model therefore uses 16 invertible blocks.
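To illustrate why stacking more invertible blocks adds trainable parameters while keeping the overall mapping exactly invertible, here is a minimal NICE-style additive coupling sketch in NumPy. The paper's actual INN blocks are more elaborate (convolutional, with learned coupling networks), and all names here are hypothetical:

```python
import numpy as np

class AdditiveCouplingBlock:
    """Minimal additive coupling block: split the input into two halves and
    shift one half by a function of the other. Exactly invertible no matter
    how complex the shift function is."""

    def __init__(self, dim: int, rng: np.random.Generator):
        # A tiny linear "network" as the shift function; each extra block
        # contributes another set of such trainable parameters.
        self.w = rng.standard_normal((dim // 2, dim // 2)) * 0.1

    def forward(self, x: np.ndarray) -> np.ndarray:
        x1, x2 = np.split(x, 2, axis=-1)
        y2 = x2 + x1 @ self.w          # shift the second half
        return np.concatenate([x1, y2], axis=-1)

    def inverse(self, y: np.ndarray) -> np.ndarray:
        y1, y2 = np.split(y, 2, axis=-1)
        x2 = y2 - y1 @ self.w          # exact inverse: subtract the same shift
        return np.concatenate([y1, x2], axis=-1)

# Stacking blocks (real INNs alternate which half is shifted) gives a deeper
# invertible mapping; inversion simply runs the blocks in reverse order.
def forward_stack(blocks, x):
    for b in blocks:
        x = b.forward(x)
    return x

def inverse_stack(blocks, y):
    for b in reversed(blocks):
        y = b.inverse(y)
    return y
```

With 16 such blocks, extraction recovers the embedding input exactly by running the stack backwards, which is the property the IWN relies on.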
Secondly, we verify the effectiveness of our bit message normalization module. To this end, we carry out experiments varying two factors: the bit length of the embedded message and the number of groups into which the message is divided. The detailed experimental results are shown in Tab. IV, which covers two kinds of well-trained models with the number of groups set to 10 and 30, respectively. To study how the performance of our model fluctuates with the bit message length, we test the models with 10 message groups when the length is set to 30, 40, 50, and 60, i.e., each channel represents 3, 4, 5, and 6 bits, respectively. The other experimental settings are the same as in Sec. IVA. Obviously, the bit accuracy decreases as the message becomes longer, which reflects the conflict between the capacity and the robustness of a watermarking algorithm. We further carry out an experiment without the bit message normalization module: we directly treat the binary bit message as float numbers (0.0 or 1.0) like HiDDeN [72], and both the message length and the group number
are thus set to 30. Without our bit message normalization, all the evaluation metrics drop dramatically, as shown in the last row of Tab. IV.
IVE Limitations and Future Work
In order to resist practical distortions, especially those introduced by JPEG compression, our watermarked images may exhibit visual artifacts when the cover image contains many smooth regions. In Fig. 9, the background (case 1) and the sky (case 2) present this phenomenon. Embedding information in smooth areas remains inherently difficult. This issue might be alleviated by concentrating the watermark on the edges or richly textured areas of the image. Besides, introducing a smoothing loss for the watermarked image during training, such as a Fourier transform loss or a structural similarity index measure (SSIM) loss, may also help. We will further explore how to remove these visual artifacts in the future.
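As a rough illustration of the SSIM-based smoothing term mentioned above, the following sketch computes a simplified global SSIM (the standard index uses local sliding windows) and turns it into a loss. This is purely illustrative and not the paper's training objective:

```python
import numpy as np

def global_ssim(x: np.ndarray, y: np.ndarray, max_val: float = 255.0) -> float:
    """Simplified SSIM computed over the whole image (the standard SSIM
    averages over local sliding windows); 1.0 for identical images."""
    c1, c2 = (0.01 * max_val) ** 2, (0.03 * max_val) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

def ssim_loss(cover: np.ndarray, watermarked: np.ndarray) -> float:
    """Penalize structural dissimilarity; such a term could be added to the
    training objective to discourage artifacts in smooth regions."""
    return 1.0 - global_ssim(cover.astype(np.float64),
                             watermarked.astype(np.float64))
```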
V Conclusion
In this paper, we have presented an invertible watermarking network (IWN) for robust blind digital image watermarking. Our compact IWN utilizes an invertible neural network (INN) to embed and extract the watermark message in an end-to-end trainable manner. To promote watermarking robustness against various practical distortions, we specifically introduce a noise layer to simulate various attacks. Moreover, we propose a simple but effective bit message normalization module to further enhance the watermarking robustness. Extensive experiments demonstrate the superiority of our method over the commonly used baseline. In the future, we will also explore the application of our framework to cross-media channels, such as printing and photographing, screen photographing, and audio-visual watermarking in other multimedia domains.
References
 [1] (2017) NTIRE challenge on single image super-resolution: methods and results. In IEEE Conf. Comput. Vis. Pattern Recog. Worksh., pp. 114–125. Cited by: §IVA.
 [2] (2020) ReDMark: framework for residual diffusion watermarking based on deep networks. Expert Systems with Applications 146, pp. 113157. Cited by: §IIA, §IIA.
 [3] (2018) Analyzing inverse problems with invertible neural networks. In Int. Conf. Learn. Represent., Cited by: §I, §IIB.
 [4] (2019) Guided image generation with conditional invertible neural networks. arXiv preprint arXiv:1907.02392. Cited by: §IIB.
 [5] (2014) Imperceptible and robust blind video watermarking using chrominance embedding: a set of approaches in the DT-CWT domain. IEEE Trans. Inf. Forensics Security. 9 (9), pp. 1502–1517. Cited by: §II.

 [6] (2005) Image adaptive watermarking using wavelet domain singular value decomposition. IEEE Trans. Circuits Syst. Video Technol. 15 (1), pp. 96–102. Cited by: §IIA.
 [7] (2019) Invertible residual networks. In ICML, pp. 573–582. Cited by: §IIB, §IIID.
 [8] (2007) Robust image watermarking based on multiband wavelets and empirical mode decomposition. IEEE Trans. Image Process. 16 (8), pp. 1956–1966. Cited by: §IIA.
 [9] (1989, February 21) Interactive video method and apparatus. Google Patents. Note: US Patent 4,807,031. Cited by: §I.
 [10] (2021) Data hiding with deep learning: a survey unifying digital watermarking and steganography. arXiv preprint arXiv:2107.09287. Cited by: §I, §IIA.
 [11] (2001) Quantization index modulation: a class of provably good methods for digital watermarking and information embedding. IEEE Trans. Inf. Theory 47 (4), pp. 1423–1443. Cited by: §I.
 [12] (2019) Residual flows for invertible generative modeling. In Adv. Neural Inform. Process. Syst., pp. 9916–9926. Cited by: §IIB.
 [13] (2021) IICNet: a generic framework for reversible image conversion. In Int. Conf. Comput. Vis., pp. 1991–2000. Cited by: §I, §IIB.
 [14] (2002) Digital watermarking. Vol. 53, Springer. Cited by: §II.
 [15] (2007) Digital watermarking and steganography. Morgan Kaufmann. Cited by: §II.
 [16] (2013) A study on spatial and transform domain watermarking techniques. International Journal of Computer Applications 71, pp. 38–41. Cited by: §IIA.
 [17] (2010) Local histogram based geometric invariant image watermarking. Signal Processing 90 (12), pp. 3256–3264. Cited by: §IIA.
 [18] (2014) NICE: nonlinear independent components estimation. arXiv preprint arXiv:1410.8516. Cited by: §IIB.
 [19] (2016) Density estimation using real NVP. arXiv preprint arXiv:1605.08803. Cited by: §IIB.
 [20] (2021) Stochastic image-to-video synthesis using cINNs. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 3742–3753. Cited by: §IIB.
 [21] (2007) Speaker identification security improvement by means of speech watermarking. Pattern Recognition 40 (11), pp. 3027–3034. Cited by: §I.
 [22] (2021) ThermoTag: a hidden ID of 3D printers for fingerprinting and watermarking. IEEE Trans. Inf. Forensics Security. 16, pp. 2805–2820. Cited by: §II.
 [23] (2017) Towards understanding the invertibility of convolutional neural networks. In IJCAI, pp. 1703–1710. Cited by: §IIB.
 [24] (2018) FFJORD: free-form continuous dynamics for scalable reversible generative models. arXiv preprint arXiv:1810.01367. Cited by: §IIB.
 [25] (2002) Digital image watermarking for joint ownership. In ACM Int. Conf. Multimedia, pp. 362–371. Cited by: §IIA.
 [26] (2020) Joint watermarking-encryption-JPEG-LS for medical image reliability control in encrypted and compressed domains. IEEE Trans. Inf. Forensics Security. 15, pp. 2556–2569. Cited by: §II.
 [27] (2018) Hybrid blind robust image watermarking technique based on DFT-DCT and Arnold transform. Multimedia Tools Appl. 77 (20), pp. 27181–27214. Cited by: §IIA.
 [28] (2017) Blind 3D mesh watermarking for 3D printed model by analyzing layering artifact. IEEE Trans. Inf. Forensics Security. 12 (11), pp. 2712–2725. Cited by: §II.
 [29] (2019) Enhancing image watermarking with adaptive embedding parameter and PSNR guarantee. IEEE Trans. Multimedia 21 (10), pp. 2447–2460. Cited by: §IIA.
 [30] (2018) i-RevNet: deep invertible networks. In Int. Conf. Learn. Represent., Cited by: §IIB.
 [31] (2020) RIHOOP: robust invisible hyperlinks in offline and online photographs. IEEE Trans. Cybern.. Cited by: §I, §IIA, §IIA.
 [32] (2017) Exploring the learning capabilities of convolutional neural networks for robust image watermarking. Computers & Security 65, pp. 247–268. Cited by: §I, §IIA.
 [33] (2003) A DWT-DFT composite watermarking scheme robust to both affine transform and JPEG compression. IEEE Trans. Circuits Syst. Video Technol. 13 (8), pp. 776–786. Cited by: §IIA.
 [34] (2000) Digital watermarking. Artech House, London 2. Cited by: §II.
 [35] (2018) Glow: generative flow with invertible 1x1 convolutions. In Adv. Neural Inform. Process. Syst., pp. 10215–10224. Cited by: §IIB.
 [36] (2018) A recent survey on image watermarking techniques and its application in e-governance. Multimedia Tools Appl. 77 (3), pp. 3597–3622. Cited by: §IIA.
 [37] (2013) General framework to histogram-shifting-based reversible data hiding. IEEE Trans. Image Process. 22 (6), pp. 2181–2191. Cited by: §IIA.
 [38] (2017) Enhanced deep residual networks for single image super-resolution. In IEEE Conf. Comput. Vis. Pattern Recog. Worksh., pp. 136–144. Cited by: §IVA.
 [39] (2019) A novel two-stage separable deep learning framework for practical blind watermarking. In ACM Int. Conf. Multimedia, pp. 1509–1517. Cited by: §IIA.
 [40] (2021) Invertible denoising network: a light solution for real noise removal. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 13365–13374. Cited by: §IIB.
 [41] (2018) Patchwork-based audio watermarking robust against desynchronization and recapturing attacks. IEEE Trans. Inf. Forensics Security. 14 (5), pp. 1171–1180. Cited by: §II.
 [42] (2018) Secure and robust digital image watermarking using coefficient differencing and chaotic encryption. IEEE Access 6, pp. 19876–19897. Cited by: §I.
 [43] (2021) Large-capacity image steganography based on invertible neural networks. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 10816–10825. Cited by: §I, §IIB, §IIIB.
 [44] (2020) Distortion agnostic deep watermarking. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 13548–13557. Cited by: §I, §IIA.
 [45] (2019) Xmark: dynamic software watermarking using Collatz conjecture. IEEE Trans. Inf. Forensics Security. 14 (11), pp. 2859–2874. Cited by: §II.
 [46] (2017) A robust blind watermarking using convolutional neural network. arXiv preprint arXiv:1704.03248. Cited by: §IIA.
 [47] (2000) Robust template matching for affine resistant image watermarks. IEEE Trans. Image Process. 9 (6), pp. 1123–1129. Cited by: §IIA.
 [48] (2019) Optimization and hardware implementation of image and video watermarking for low-cost applications. IEEE Trans. Circuits Syst. I 66, pp. 2088–2101. Cited by: §I.
 [49] (2020) Recent trends in image watermarking techniques for copyright protection: a survey. International Journal of Multimedia Information Retrieval, pp. 1–22. Cited by: §I, §IIA.
 [50] (2020) Network-to-Network translation with conditional invertible neural networks. In Adv. Neural Inform. Process. Syst., Cited by: §IIB.
 [51] (2015) Imagenet large scale visual recognition challenge. Int. J. Comput. Vis. 115 (3), pp. 211–252. Cited by: §IVA.
 [52] (2018) A robust image watermarking scheme using local statistical distribution in the contourlet domain. IEEE Trans. Circuits Syst. II 66 (1), pp. 151–155. Cited by: §IIA.

 [53] (2017) JPEG-resistant adversarial images. In NIPS 2017 Workshop on Machine Learning and Computer Security, Vol. 1. Cited by: §IIID.
 [54] (2019) MintNet: building invertible neural networks with masked convolutions. In Adv. Neural Inform. Process. Syst., pp. 11004–11014. Cited by: §IIB.
 [55] (2018) Robust color image watermarking technique in the spatial domain. Soft Computing 22 (1), pp. 91–106. Cited by: §IIA.
 [56] (2020) StegaStamp: invisible hyperlinks in physical photographs. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 2117–2126. Cited by: §I, §I, §IIA, §IIA, §IIA.
 [57] (2014) Robust image watermarking theories and techniques: a review. Journal of applied research and technology 12 (1), pp. 122–138. Cited by: §I.
 [58] (2013) LDFT-based watermarking resilient to local desynchronization attacks. IEEE Trans. Cybern. 43 (6), pp. 2190–2201. Cited by: §IIA.
 [59] (1994) A digital watermark. In Proceedings of 1st international conference on image processing, Vol. 2, pp. 86–90. Cited by: §IIA, §II.
 [60] (2019) ROMark: a robust watermarking system using adversarial training. ArXiv abs/1910.01221. Cited by: §I, §IIA.
 [61] (2019) Light field messaging with deep photographic steganography. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 1515–1524. Cited by: §I, §IIA, §IIA, §IIA.
 [62] (2020) Watermarking neural networks with watermarked images. IEEE Trans. Circuits Syst. Video Technol.. Cited by: §II.
 [63] (2008) Invariant image watermarking based on statistical features in the low-frequency domain. IEEE Trans. Circuits Syst. Video Technol. 18 (6), pp. 777–790. Cited by: §IIA.
 [64] (2020) Invertible image rescaling. In Eur. Conf. Comput. Vis., Cited by: §IIB.
 [65] (2021) Enhanced invertible encoding for learned image compression. In ACM Int. Conf. Multimedia, pp. 162–170. Cited by: §IIB.

 [66] (2020) Attention based data hiding with generative adversarial networks. In AAAI, Vol. 34, pp. 1120–1128. Cited by: §I, §IIA, §IIA.
 [67] (2020) UDH: universal deep hiding for steganography, watermarking, and light field messaging. Adv. Neural Inform. Process. Syst. 33, pp. 10223–10234. Cited by: §IIA.

 [68] (2011) Affine Legendre moment invariants for image watermarking robust to geometric distortions. IEEE Trans. Image Process. 20 (8), pp. 2189–2199. Cited by: §IIA.
 [69] (2021) Deep symmetric network for underexposed image enhancement with recurrent attentional learning. In Int. Conf. Comput. Vis., pp. 12075–12084. Cited by: §IIB.
 [70] (2021) Invertible image decolorization. IEEE Trans. Image Process. 30, pp. 6081–6095. Cited by: §IIB.
 [71] (2020) An automated and robust image watermarking scheme based on deep neural networks. IEEE Trans. Multimedia. Cited by: §I, §IIA, §IIA.
 [72] (2018) HiDDeN: hiding data with deep networks. In Eur. Conf. Comput. Vis., pp. 657–672. Cited by: §I, §I, §IIA, §IIA, §IIA, Fig. 8, §IIIC, §IIIC, TABLE II, §IVC1, §IVC1, §IVC2, §IVC, §IVD.
 [73] (2019) Residual invertible spatio-temporal network for video super-resolution. In AAAI, Vol. 33, pp. 5981–5988. Cited by: §IIB.