Nowadays, people routinely post photos and videos on the Internet to share and preserve memories of events. To protect the copyright of such photos and videos, visible watermarks are commonly used. Typically, these watermarks are opaque or semi-transparent images containing names or logos that are overlaid on the original images. Although billions of online images have been embedded with visible watermarks to declare ownership, they suffer from a security flaw: the watermarks may be affected and damaged by various watermark processing methods. To evaluate and improve the robustness of watermarks, a number of researchers [14, 11, 5, 16, 3, 12, 2] have attempted to attack them by removing watermarks from images.
Due to the diverse categories and patterns of visible watermarks, developing an advanced visible watermark removal method remains a difficult task. More specifically, visible watermarks often contain complex and varied structures such as text, symbols, graphics, thin lines and shadows (Fig. 1(a)). This makes it challenging to remove unknown and diverse watermark patterns from images without user supervision or prior information in practical situations.
Most existing watermark removal methods are unable to tackle these challenges. Although these models are designed to estimate and wipe off the watermark regions, they either depend heavily on prior knowledge [14, 5, 11] or assume that the watermarked images share the same watermark pattern [3, 16]. Neither assumption suits real-world scenarios, where the watermarks may be unknown and the watermarks in different images are likely to be distinct. Recently, Cheng et al. [2] cast watermark removal as an image-to-image translation problem and used a fully convolutional architecture to map the watermarked pixels to the original unmarked pixels, which provided a reasonable solution for watermark removal. However, directly training a generator with a pixel-wise loss to estimate the pixel mapping is difficult. In addition, the watermark-free images recovered by this kind of approach mostly contain residual watermark traces and are not perceptually photo-realistic (Fig. 1(b)).
To make the results of watermark removal more photo-realistic and convincing (e.g., to recover the watermarked patch without any residual watermark), in this work we propose a new watermark removal framework based on conditional generative adversarial networks (cGANs). Specifically, we adopt an effective cGAN model to form a framework for photo-realistic watermark removal. To achieve this, we introduce a new loss function consisting of an adversarial loss and a pixel-wise content loss. In particular, the adversarial loss, working with a patch-based discriminator network, enables our method to reconstruct a photo-realistic watermark-free image. Here, the patch-based discriminator network is conditioned on the input watermarked images and is trained to distinguish the recovered images from the original watermark-free images. Additionally, we use a content loss motivated by perceptual similarity and pixel similarity [7, 2], which consists of the L1 loss and the perceptual loss. With both the adversarial loss and the content loss, our framework is able to generate more convincing recovered results from images marked by diverse unknown watermarks (Fig. 1(c)).
In summary, our contributions are twofold. First, to the best of our knowledge, this is the first work to exploit cGANs to design an effective framework for the visible watermark removal problem in a realistic setting. Our cGAN-based watermark removal framework is more principled than existing approaches. Second, we introduce an effective watermark removal cGAN model with a new loss function, which comprises an adversarial loss and a pixel-wise content loss. This drives the reconstruction of the watermark regions to be more photo-realistic. Moreover, extensive experiments are conducted on a large-scale visible watermark dataset for evaluation. The results demonstrate that our proposed model is capable of addressing the visible watermark removal problem encountered in real-world scenarios, achieving more convincing reconstruction than state-of-the-art methods.
In this section, we present our watermark removal framework, which is built on the concept of cGANs [10, 6]. Recently, cGANs have been commonly adopted to reconstruct hidden information that is obscured in the original image. In this work, as we aim to restore the original images from the watermarked images, we adopt the idea of cGANs and propose a cGAN-based framework for watermark removal. The architecture of our proposed watermark removal cGAN is illustrated in Fig. 2.
Our network mainly consists of a generator and a discriminator. In the generator, we leverage a U-net based architecture [13] to transform a watermarked image into a watermark-free one. In our discriminator, we use a patch-based classifier conditioned on the input watermarked images to distinguish the recovered images produced by the generator from the ground-truth watermark-free images at the patch level.
More specifically, our network takes a watermarked image as input and uses the generator to produce a photo-realistic watermark-free image. To make the image restored by the generator as similar as possible to the ground-truth watermark-free image, we introduce a new objective that combines the L1 loss, the perceptual loss and the patch-based adversarial loss to constrain the training of the generator. Meanwhile, an adversarially trained discriminator is employed to tell the “fake” images (i.e., the images produced by the generator) apart from the ground-truth images (i.e., the real watermark-free images). We detail the adversarial network architecture and the loss functions in Sections 2.1 and 2.2, respectively.
2.1 Adversarial Network Architectures
Generator (U-net). Typically, watermark removal can be cast as an image-to-image translation problem. Analyzing the watermark removal task, we find that the unmarked image areas share the same pixel values as the input, while the watermarked pattern needs to be removed to meet the visual requirements. Unlike the general encoder-decoder structure [1], which directly transforms an image in the source domain into a target image through a series of convolution modules, we adopt a U-net based architecture as our generator, following earlier fully convolutional designs. In our system, the U-net benefits from its skip connections, which combine low-level and high-level features and allow global information and edge details to be shared between the input and the output. Specifically, our generator comprises six standard modules, which are down-blocks or up-blocks. In the down-blocks, the number of feature-map channels is doubled and the side length is halved, while the up-blocks do the opposite. In addition, there are skip connections between every layer $i$ and layer $n-i$, where $n$ is the total number of layers. Each skip connection simply concatenates all channels of layer $i$ with those of layer $n-i$, as shown in Fig. 2.
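The channel and size bookkeeping of such a generator, and the pairing of layers by skip connections, can be sketched in a few lines of plain Python. This is only an illustration: the function name `unet_plan`, the even split into three down-blocks and three up-blocks, the 256-pixel input and the 64 starting channels are assumptions, not values given in the text.

```python
def unet_plan(in_side, base_channels, num_down):
    """Return the per-layer (kind, channels, side) list and the skip pairs.

    Layers are numbered 1..n. Assumption: the network has as many
    up-blocks as down-blocks, so n = 2 * num_down.
    """
    layers = []
    channels, side = base_channels, in_side
    # Down-blocks: double the channels, halve the side length.
    for _ in range(num_down):
        channels, side = channels * 2, side // 2
        layers.append(("down", channels, side))
    # Up-blocks: halve the channels, double the side length.
    for _ in range(num_down):
        channels, side = channels // 2, side * 2
        layers.append(("up", channels, side))
    n = len(layers)
    # Skip connection between layer i and layer n - i (i below the bottleneck).
    skips = [(i, n - i) for i in range(1, n // 2)]
    return layers, skips


# Hypothetical configuration: six modules, 256x256 input, 64 base channels.
layers, skips = unet_plan(in_side=256, base_channels=64, num_down=3)
for i, j in skips:
    _, ch, side = layers[i - 1]
    # Both ends of a skip have the same shape, so their channels can be
    # concatenated, giving 2 * ch channels at the receiving up-block.
    print(f"layer {i} -> layer {j}: {ch} + {ch} channels at {side}x{side}")
```

Under these assumptions the paired layers always have matching spatial size, which is what makes plain channel concatenation possible.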
Discriminator (patch-based). Common GAN discriminators map the input to a single scalar representing the probability that the input sample is “real”. In our work, as shown in Fig. 2, we instead employ a patch-based network as our discriminator. Our discriminator is a fully convolutional network that maps the combination of the watermarked image and the watermark-removed image to a feature map representing the class probabilities (i.e., “fake” or “real”) of the patches of the input. Since each point in the feature map can be traced back to its receptive field in the original image, each value in the output matrix gives the probability that the corresponding patch in the original image is “real”. We calculate the probability that the whole input image is “real” as the average of these patch probabilities.
Observing the watermarked images, we find that the difference between a watermarked image and the original watermark-free one exists only in some parts of the image. Since the watermarked area is relatively small compared with the whole image, it is critical for the discriminator to identify the patches where the two input images differ most and to focus on minimizing the loss on these patches. Thus, introducing a patch-based discriminator makes our cGAN-based network more powerful for removing visible watermarks.
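Two mechanics of a patch-based discriminator can be illustrated with a small sketch: how large an input patch each output value sees, and how the patch probabilities are averaged into one image-level score. The helper names and the layer configuration below (three 4x4 convolutions with stride 2 followed by two with stride 1, as commonly used for patch discriminators) are assumptions for illustration and are not specified in the text.

```python
def receptive_field(conv_layers):
    """Side length of the input patch seen by one output unit.

    conv_layers is a list of (kernel_size, stride) pairs, first layer first.
    """
    field, jump = 1, 1
    for kernel, stride in conv_layers:
        field += (kernel - 1) * jump  # each layer widens the field
        jump *= stride                # strides compound across layers
    return field


def image_real_probability(patch_probs):
    """Average a 2-D grid of per-patch 'real' probabilities."""
    flat = [p for row in patch_probs for p in row]
    return sum(flat) / len(flat)


# Assumed layer stack; each output value then covers a 70x70 input patch.
stack = [(4, 2), (4, 2), (4, 2), (4, 1), (4, 1)]
print(receptive_field(stack))

# A toy 2x2 output map: the low values mark patches judged "fake",
# e.g. patches that still show watermark residue.
print(image_real_probability([[0.9, 0.8], [0.1, 0.2]]))
```

Because the score is an average over patches, a few clearly "fake" patches lower the image-level probability even when most of the image is untouched.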
2.2 Objective Function
The objective function plays a very important role in training the network. In this work, we aim to learn a solution that minimizes the loss function defined as follows:
During each stage of training our cGAN-based watermark removal model, the generator and the discriminator are trained alternately: the generator $G$ is trained to minimize this objective against an adversarial discriminator $D$, which is trained to maximize it.
Specifically, the generator is trained by minimizing this loss. The task of the generator is not only to fool the discriminator but also to generate an image that is visually close to the ground-truth watermark-free image. Therefore, the objective of the generator consists of a content loss and an adversarial loss, where the L1 loss and the perceptual loss make up the content loss. The three terms in Eq. (1) are the L1 loss, the perceptual loss and the adversarial loss, and two weights balance them. The discriminator is trained alternately to avoid being fooled by the generator, by classifying its inputs as either real or fake. Thus, its adversarial loss is defined as the opposite of that of the generator. We detail the content loss and the adversarial loss in the following.
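The description above is consistent with a min-max objective of the following shape. The notation is an assumption made for illustration: $\mathcal{L}_{L_1}$, $\mathcal{L}_{\mathrm{perc}}$ and $\mathcal{L}_{\mathrm{adv}}$ stand for the L1, perceptual and adversarial terms, and $\lambda$ and $\beta$ for the two balancing weights. Which weight multiplies which term is also an assumption, since the text only states that two weights balance three terms.

```latex
G^{*} = \arg\min_{G}\,\max_{D}\;
  \mathcal{L}_{L_1}(G)
  + \lambda\,\mathcal{L}_{\mathrm{perc}}(G)
  + \beta\,\mathcal{L}_{\mathrm{adv}}(G, D)
```

Only the adversarial term depends on $D$, so the inner maximization trains the discriminator while the content terms act on the generator alone.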
Content loss. At present, the most commonly used content loss in image-to-image tasks is the MSE loss, which has obtained state-of-the-art PSNR results in many image-to-image tasks such as super-resolution and image style translation. However, it has been shown to cause blur in the generated images, and the results do not satisfy human visual perception. To solve this problem, we use the L1 distance rather than the L2 distance; the L1 loss (Eq. (2)) is computed between the output of the generator and the ground-truth watermark-free image. Apart from the L1 loss, it is beneficial to add a perceptual loss for watermark removal (Eq. (3)). The perceptual loss is calculated with a convolutional transformation, which in our work is a pretrained loss network, and is defined on the feature maps of a convolutional layer of that network. Specifically, the weights of the loss network are frozen and its outputs are extracted as features to calculate the semantic difference between the input and output of the generator.
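The content loss described above can be sketched as an L1 term plus a feature-space term. This is a minimal illustration under stated assumptions: `toy_features` is a hypothetical stand-in for the frozen pretrained loss network, the feature-space distance is taken to be a mean squared difference, the recovered image is compared with the ground-truth image in both terms, and `perceptual_weight` is a placeholder rather than a value from the text.

```python
def l1_loss(recovered, reference):
    """Mean absolute pixel difference between two equally sized 2-D images."""
    diffs = [abs(a - b)
             for row_a, row_b in zip(recovered, reference)
             for a, b in zip(row_a, row_b)]
    return sum(diffs) / len(diffs)


def toy_features(image):
    """Hypothetical frozen feature extractor: horizontal pixel gradients."""
    return [[row[j + 1] - row[j] for j in range(len(row) - 1)]
            for row in image]


def perceptual_loss(recovered, reference, features=toy_features):
    """Mean squared difference between the feature maps of two images."""
    feat_a, feat_b = features(recovered), features(reference)
    squared = [(a - b) ** 2
               for row_a, row_b in zip(feat_a, feat_b)
               for a, b in zip(row_a, row_b)]
    return sum(squared) / len(squared)


def content_loss(recovered, reference, perceptual_weight=1.0):
    """L1 term plus weighted perceptual term."""
    return (l1_loss(recovered, reference)
            + perceptual_weight * perceptual_loss(recovered, reference))


reference = [[0.0, 1.0], [1.0, 0.0]]
recovered = [[0.0, 1.0], [1.0, 1.0]]  # one wrong pixel
print(l1_loss(recovered, reference), perceptual_loss(recovered, reference))
```

The L1 term penalizes each pixel independently, while the feature term reacts to changes in local structure, which is why the two are combined.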
3.1 Datasets and Settings
Dataset. To evaluate the performance of our framework on a large visible watermark image dataset, we conduct extensive experiments on the Large-scale Visible Watermark Dataset (LVW), which contains 60k watermarked images made of 80 watermarks, with 750 images per watermark. In this dataset, the original images in the training and testing sets are randomly chosen, with replacement, from the train/val and test sets of the PASCAL VOC2012 dataset, respectively. The 80 categories of watermarks cover a vast range of patterns (e.g., the watermarks contain English and Chinese text) and are collected from renowned e-commerce brands, websites, organizations, individuals, etc. Moreover, the size, location and transparency of each watermark are set randomly and differ across images. Example images from the LVW dataset are shown in Fig. 3.
Training details. The proposed deep architecture was implemented in PyTorch. All experiments are conducted on a computer cluster equipped with NVIDIA Tesla K80 GPUs with 12GB memory. To optimize the generative adversarial networks, we follow the training strategy of prior work and alternate between one gradient descent step on the discriminator and one step on the generator. During training, we use mini-batch SGD (the batch size is set to 1) and apply the Adam solver [8], with an initial learning rate of 2e-4 and momentum parameters $\beta_1 = 0.5$ and $\beta_2 = 0.999$. We evaluated the proposed framework with the default loss weights, set to 10 and 1e-4.
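To make the optimizer settings concrete, the following sketch applies one Adam update to a single scalar parameter with the learning rate and momentum parameters quoted above. It is a plain-Python illustration of the standard Adam rule, not the training code; the function name `adam_step`, the state dictionary and the `eps` value of 1e-8 are assumptions.

```python
import math


def adam_step(param, grad, state, lr=2e-4, beta1=0.5, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; returns the new value."""
    state["t"] += 1
    # Exponential moving averages of the gradient and its square.
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
    # Bias correction for the zero-initialised averages.
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return param - lr * m_hat / (math.sqrt(v_hat) + eps)


state = {"t": 0, "m": 0.0, "v": 0.0}
weight = adam_step(1.0, grad=3.0, state=state)
# On the first step the update size is close to lr regardless of the
# gradient magnitude, so the weight moves from 1.0 to about 0.9998.
print(weight)
```

In an alternating schedule, the discriminator and the generator would each keep their own optimizer state and take one such step per iteration, discriminator first.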
Evaluation setting and metrics. In our experiments, the watermarks in the training set are different from those in the testing set. In the LVW dataset, part of the watermark categories are used for training and the remaining ones for testing (see Fig. 3), following the same split as prior work. This setting closely matches the requirements of removing unknown watermarks in real-world scenarios. Both the peak signal-to-noise ratio (PSNR) and the structural dissimilarity image index (DSSIM), which measure the similarity between the recovered image and the ground-truth one, have been adopted as evaluation metrics by previous work (e.g., [3, 2]). However, both metrics fail to capture and accurately assess image quality with respect to the human visual system. Therefore, in addition to these two metrics, we used mean opinion score (MOS) testing to further quantify the ability of different methods to reconstruct photo-realistic and convincing watermark-free images from watermarked images. Specifically, we asked 10 raters to assign an integer score from 1 (bad quality, i.e., watermarked images) to 5 (excellent quality, i.e., original images) to each recovered image.
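The three metrics can be sketched as follows. PSNR follows its standard definition. For DSSIM the sketch assumes the common convention DSSIM = (1 - SSIM) / 2 and takes an already computed SSIM value as input; this convention, the 8-bit peak value of 255 and the function names are assumptions rather than details given in the text.

```python
import math


def psnr(recovered, reference, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two 2-D images."""
    squared = [(a - b) ** 2
               for row_a, row_b in zip(recovered, reference)
               for a, b in zip(row_a, row_b)]
    mse = sum(squared) / len(squared)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def dssim(ssim_value):
    """Structural dissimilarity from an SSIM value (assumed convention)."""
    return (1.0 - ssim_value) / 2.0


def mean_opinion_score(ratings):
    """Average of integer ratings on the 1-5 scale."""
    return sum(ratings) / len(ratings)


print(psnr([[0, 0]], [[0, 255]]))
print(mean_opinion_score([4, 5, 4, 3]))
```

Higher PSNR and MOS are better, while lower DSSIM is better, which matters when reading the result tables.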
3.2 Results and Analysis
Analysis of the objective function. The objective function of our framework in Eq. (1) has three component terms: the L1 loss term, the perceptual loss term and the adversarial loss term. In this section, we conduct experiments to analyze the effect of these loss terms.
|Loss||PSNR||DSSIM||MOS|
|L1 + Perceptual||30.86||0.043||3.23|
|L1 + Perceptual + GAN||30.33||0.049||3.31|
|L1 + Perceptual + cGAN||30.69||0.045||4.08|
In Table 1, ‘L1’ and ‘Perceptual’ indicate that the generator network uses only the L1 loss (Eq. (2)) and only the perceptual loss (Eq. (3)), respectively. ‘L1 + Perceptual’ represents the generator network using the combination of the L1 loss and the perceptual loss. As shown in Table 1, ‘L1 + Perceptual’ performs clearly better than ‘L1’ and ‘Perceptual’ alone, demonstrating that combining the L1 loss and the perceptual loss brings together the strengths of both losses to reconstruct fine details.
To evaluate the effect of the adversarial loss term, we further report the performance of combining the adversarial loss with the L1 loss and the perceptual loss. Specifically, we compare the model using a discriminator conditioned on the input (adversarial loss of cGAN, Eq. (4)) with the model using an unconditional discriminator (adversarial loss of GAN). They are denoted as ‘L1 + Perceptual + cGAN’ and ‘L1 + Perceptual + GAN’ in Table 1, respectively. Although combining the L1 loss and the perceptual loss with the adversarial loss yields slightly worse PSNR and DSSIM values, it achieves higher MOS scores, indicating that the recovered results are more photo-realistic. The reason is that the GAN-based procedure encourages the reconstructions to move towards regions of the search space with a high probability of containing photo-realistic images, and thus closer to convincing results. Moreover, the results in Table 1 also show clearly that cGAN performs much better than GAN in terms of MOS, verifying the effectiveness of the conditional discriminator. This suggests that it is important for the loss to measure the quality of the match between the input (watermarked images) and the output (recovered images).
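The difference between the two adversarial losses compared here can be written out using the standard forms from the cGAN literature [10, 6]. The symbols are assumptions for illustration: $x$ is the watermarked input, $y$ the ground-truth watermark-free image, $G$ the generator and $D$ the discriminator. Whether Eq. (4) matches the first expression exactly is likewise an assumption.

```latex
\mathcal{L}_{\mathrm{cGAN}}(G, D) =
  \mathbb{E}_{x,y}\big[\log D(x, y)\big]
  + \mathbb{E}_{x}\big[\log\big(1 - D(x, G(x))\big)\big]

\mathcal{L}_{\mathrm{GAN}}(G, D) =
  \mathbb{E}_{y}\big[\log D(y)\big]
  + \mathbb{E}_{x}\big[\log\big(1 - D(G(x))\big)\big]
```

In the conditional form the discriminator sees the watermarked input alongside each candidate, so it can judge whether the output matches that particular input rather than only whether it looks like a natural image.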
Evaluation of the patch-based discriminator. The patch-based discriminator is essential in our proposed framework for modeling the discriminative information of local image patches. To investigate its effect, we compared our model with a model using a conventional image-based discriminator, which classifies whether the whole image is real or fake (i.e., at the image level). Note that all experiments in this section are conducted with the ‘L1 + Perceptual + cGAN’ loss (Eq. (1)).
The results are shown in Table 2. Compared with our patch-based discriminator, the image-based discriminator performs considerably worse. Specifically, the image-based discriminator identifies the difference between two images at the image level, which dilutes the effect of differences in local areas and hampers the results (i.e., the generated images are not photo-realistic). In addition, the image-based discriminator is deeper and has many more parameters than the patch-based discriminator. This slows down the watermark removal model and makes it harder to train, which makes this kind of discriminator hard to scale to real-world applications. In other words, as the patch-based discriminator is a lightweight model, it runs faster even on arbitrarily large images, and it is shown to perform better. This strongly suggests that our proposed patch-based design is better suited to training the cGAN-based model for visible watermark removal on realistic data.
Comparison with state-of-the-art. To verify the effectiveness of the proposed model, we performed experiments to compare our method with that of Cheng et al. [2]. As shown in Table 3, our model obtains comparable results in PSNR and DSSIM, and the MOS results indicate that our method outperforms Cheng et al. by a large margin in terms of human visual perception.
|Metrics||Input||Cheng et al. [2]||Ours||Ground truth|
We also visualize the watermark removal results on test examples from the LVW dataset in Fig. 4. The results in the figure further demonstrate that the output of the proposed method is noticeably more convincing than that of existing methods, suggesting that the adversarial model is better suited to solving the visible watermark removal problem.
3.3 Discussion and Future Work
Our experiments show that our proposed framework can effectively remove unknown and diverse visible watermarks, resulting in more satisfactory recovered images. The focus of this work was the photo-realistic quality of the reconstruction rather than better performance on standard quantitative evaluation metrics such as PSNR and SSIM, which cannot accurately capture and evaluate image quality as perceived by the human visual system. The experimental results further verify that a deep convolutional architecture using the concept of cGANs to form an adversarial loss is useful for photo-realistic watermark removal in real-world scenarios. Importantly, our intention is to raise awareness of the copyright of online images, as a reminder that visible watermarks should be designed to be more resistant to removal attacks. Developing a more robust watermarking technique for copyright protection is challenging and is part of future work.
In this work, we introduced a new watermark processing framework for more photo-realistic visible watermark removal, which augments the conventional L1 and perceptual loss functions with an adversarial loss by training a conditional generative adversarial network. The proposed model is able to drive the reconstruction of watermark regions towards photo-realistic results, producing perceptually more convincing solutions. Extensive experiments were conducted on a large-scale visible watermark dataset to verify the feasibility of our method. The experimental results clearly demonstrate the superior performance of the proposed framework compared to existing methods.
Xiang Li and Chan Lu contributed equally to this work. The authors would like to thank all volunteer raters.
[1] Badrinarayanan, V., Kendall, A., Cipolla, R.: Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE transactions on pattern analysis and machine intelligence 39(12), 2481–2495 (2017)
[2] Cheng, D., Li, X., Li, W.H., Lu, C., Li, F., Zhao, H., Zheng, W.S.: Large-scale visible watermark detection and removal with deep convolutional networks. In: Chinese Conference on Pattern Recognition and Computer Vision (PRCV). pp. 27–40. Springer (2018)
[3] Dekel, T., Rubinstein, M., Liu, C., Freeman, W.T.: On the effectiveness of visible watermarks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2146–2154 (2017)
[4] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Advances in neural information processing systems. pp. 2672–2680 (2014)
[5] Huang, C.H., Wu, J.L.: Attacking visible watermarking schemes. IEEE transactions on multimedia 6(1), 16–30 (2004)
[6] Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 1125–1134 (2017)
[7] Johnson, J., Alahi, A., Fei-Fei, L.: Perceptual losses for real-time style transfer and super-resolution. In: European conference on computer vision. pp. 694–711. Springer (2016)
[8] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
[9] Ledig, C., Theis, L., Huszár, F., Caballero, J., Cunningham, A., Acosta, A., Aitken, A., Tejani, A., Totz, J., Wang, Z., et al.: Photo-realistic single image super-resolution using a generative adversarial network. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 4681–4690 (2017)
[10] Mirza, M., Osindero, S.: Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784 (2014)
[11] Pei, S.C., Zeng, Y.C.: A novel image recovery algorithm for visible watermarked images. IEEE Transactions on Information Forensics and Security 1(4), 543–550 (2006)
[12] Qin, C., He, Z., Yao, H., Cao, F., Gao, L.: Visible watermark removal scheme based on reversible data hiding and image inpainting. Signal Processing: Image Communication 60, 160–172 (2018)
[13] Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical image computing and computer-assisted intervention. pp. 234–241. Springer (2015)
[14] Santoyo-Garcia, H., Fragoso-Navarro, E., Reyes-Reyes, R., Sanchez-Perez, G., Nakano-Miyatake, M., Perez-Meana, H.: An automatic visible watermark detection method using total variation. In: 2017 5th International Workshop on Biometrics and Forensics (IWBF). pp. 1–5. IEEE (2017)
[15] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
[16] Xu, C., Lu, Y., Zhou, Y.: An automatic visible watermark removal technique using image inpainting algorithms. In: 2017 4th International Conference on Systems and Informatics (ICSAI). pp. 1152–1157. IEEE (2017)