Blind Image Decomposition

08/25/2021 ∙ by Junlin Han, et al. ∙ CSIRO, Australian National University

We present and study a novel task named Blind Image Decomposition (BID), which requires separating a superimposed image into its constituent underlying images in a blind setting, that is, both the source components involved in the mixing and the mixing mechanism are unknown. For example, rain may consist of multiple components, such as rain streaks, raindrops, snow, and haze, and a rainy image can be treated as an arbitrary combination of these components: some of them, or all of them. Decomposing superimposed images, such as rainy images, into their distinct source components is a crucial step towards real-world vision systems. To facilitate research on this new task, we construct three benchmark datasets, including mixed image decomposition across multiple domains, real-scenario deraining, and joint shadow/reflection/watermark removal. Moreover, we propose a simple yet general Blind Image Decomposition Network (BIDeN) to serve as a strong baseline for future work. Experimental results demonstrate the tenability of our benchmarks and the effectiveness of BIDeN. Code and project page are available.


1 Introduction

Figure 1: Examples of rain cases: (1) rain streak (rs); (2) rs + snow; (3) rs + light haze; (4) rs + heavy haze; (5) rs + moderate haze (mh) + raindrop; (6) rs + snow + mh + raindrop. Rain appears in different forms, such as rain streaks and raindrops, and snow and haze often co-occur with rain. The BID setting treats a rainy image as an arbitrary combination of these components, which makes deraining more challenging but also more general.

Various computer vision and computer graphics tasks [21, 84, 38, 42, 25, 36, 27, 1, 13] can be viewed as image decomposition, which aims to separate a superimposed image into distinct components given only a single observation. For example, foreground/background segmentation [14, 62, 2, 48] decomposes a holistic image into foreground objects and background stuff. Image dehazing [30, 4, 45] can be treated as decomposing a hazy image into a haze-free image and a haze map (medium transmission map, atmospheric light). Shadow removal [42, 68, 11, 17, 16] decomposes a shadow image into a shadow-free image and a shadow mask. Other tasks, such as transparency separation [80, 15, 44, 52], watermark removal [49, 6], image deraining [79, 58, 73, 46, 76, 47, 74], texture separation [25], underwater image restoration [20, 28], image desnowing [51, 61], stereo mixture image decomposition [81], 3D intrinsic image decomposition [1], fence removal [12, 71, 50], and flare removal [70, 3], also fall under image decomposition.

Vanilla image decomposition tasks come with a fixed and known number of source components, and this number is most often set to two [21, 84, 38, 42, 25, 76, 41, 27, 4, 45, 11, 17, 16, 80]. Such a setting does capture some basic real-world cases, but real-world scenarios are more complex. Consider autonomous driving on rainy days, where visual perception quality is degraded by different forms of precipitation and the co-occurring components shown in Figure 1. Some natural questions emerge: Can a vision system assume precipitation takes a specific form? Should a vision system assume raindrops are always present? Should a vision system assume that haze or snow comes along with rain? These questions are particularly important given their relevance in real-world applications, and the answer to all of them is clearly no: a comprehensive vision system should be able to handle all possible circumstances [54, 65]. Yet the previous setting in deraining [79, 58, 73, 46, 47, 78, 60] leaves a gap toward such complex real-world scenarios.

This paper aims to address this gap as an important step toward real-world vision systems. We propose a task that: (1) does not fix the number of source components, (2) considers the presence of every source component as well as varying intensities of the source components, and (3) treats every combination of source components as a potential input. To distinguish it from previous tasks, we refer to our proposed task as Blind Image Decomposition (BID), a name inspired by the Blind Source Separation (BSS) task in the field of signal processing.

The task format is straightforward. We no longer fix the number of source components; instead, we set a maximum number of potential source components, any subset of which can take part in the mixing. Let 'A', 'B', 'C', 'D', 'E' denote five source components and 'a', 'b', 'c', 'd', 'e' denote images from the corresponding source components. The mixed image can then be 'a', 'd', 'ab', 'bc', 'abd', 'ade', 'acde', 'abcde', and so on, with 31 possible combinations in total. Given any of these 31 combinations as input, a BID method is required to predict and reconstruct the individual source components involved in the mixing.
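To make the combinatorics concrete, here is a minimal Python sketch (the component names and the random choice are illustrative only, not part of the paper's pipeline) that enumerates the 31 non-empty subsets of five source components and samples one as a training case:

```python
import itertools
import random

components = ["A", "B", "C", "D", "E"]  # five source components

# All non-empty subsets of the five components: 2^5 - 1 = 31 combinations.
all_cases = [
    subset
    for r in range(1, len(components) + 1)
    for subset in itertools.combinations(components, r)
]
print(len(all_cases))  # 31

# A BID input corresponds to one such combination; the method must
# recover exactly these components from the mixed image.
case = random.choice(all_cases)
print("components involved in the mixing:", case)
```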

As different components can be arbitrarily involved in the mixing, no existing datasets support such a training protocol. Thus, we construct three new benchmark datasets supporting the applications of BID: (I) mixed image decomposition across multiple domains, (II) real-scenario deraining, and (III) joint shadow/reflection/watermark removal.

To perform multiple BID tasks, we design a simple yet general, strong, and flexible model, dubbed BIDeN (Blind Image Decomposition Network). BIDeN is a generic model that supports multiple BID tasks with distinct objectives. It is based on the framework of Generative Adversarial Networks (GANs) [24], and we explore several critical design choices to make it a strong model. Although designed for the more challenging BID setting, BIDeN still outperforms the current state-of-the-art image decomposition model [84] and shows competitive results compared to models designed for specific tasks. Together, our proposed BID task, our method BIDeN, our constructed benchmark datasets, and our comprehensive results and ablations form a solid foundation for future study of this challenging topic.

Figure 2: The architecture of the Blind Image Decomposition Network (BIDeN). We show an example where the maximum number of source components is N = 4 and two of them, x_1 and x_3, are selected and passed to the mixing function g, which outputs the mixed input image y = g(x_1, x_3). The generator consists of an encoder E with three branches and multiple heads H_1 to H_4; the circled plus denotes the concatenation operation. The depth and receptive field of each branch differ so as to capture features at multiple scales. Each head corresponds to one source component, and the number of heads equals the maximum number of source components N. All reconstructed images (x̂_1, x̂_3) and their corresponding real images (x_1, x_3) are sent to an unconditional discriminator, which also predicts the source components involved in the mixing of the input image y. The outputs of the remaining heads (H_2, H_4) do not contribute to the optimization.

2 Related Work

Image Decomposition. This is a general task covering numerous computer vision and computer graphics tasks. Double-DIP [21] couples multiple DIPs [67] to decompose images into their basic components in an unsupervised manner. Deep Generative Priors [36] employs a likelihood-based generative model as a prior to perform image decomposition. Deep Adversarial Decomposition (DAD) [84] proposed a unified framework for image decomposition that employs three discriminators, and introduced a crossroad L1 loss to support pixel-wise supervision when the domain information is unknown.

Blind Source Separation. BSS [34, 7, 18, 19, 33], also known as the "cocktail party problem", is an important research topic in the field of signal processing. Blind refers to the constraint that both the sources and the mixing mechanism are unknown: the task requires separating a mixture signal into its constituent underlying source signals with limited prior information. A representative algorithm is Independent Component Analysis (ICA) [34, 43, 56]. The BID task shares some properties with BSS, and the blind settings are similar but not identical. BID is constrained by unknown source components involved in the mixing and an unknown mixing mechanism, and the number of source components involved in the mixing is also unknown; however, we do not treat the domain information of the source components as unknown. Since the goal of BID is to advance real-world vision systems, a setting with known domain information maps better onto computer vision tasks. For instance, the goal of shadow removal is to separate a shadow image into a shadow-free image and a shadow mask, where the domain information is clear.
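For readers unfamiliar with BSS, the short example below runs FastICA from scikit-learn on a toy two-source signal mixture; it is purely illustrative of the classical setting described above and is unrelated to BIDeN's implementation.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.RandomState(0)
t = np.linspace(0, 8, 2000)

# Two independent source signals: a sine wave and a square wave.
sources = np.c_[np.sin(2 * t), np.sign(np.cos(3 * t))]
sources += 0.05 * rng.standard_normal(sources.shape)  # small noise

# Mix them with a mixing matrix that the separator never sees.
mixing = np.array([[1.0, 0.5], [0.4, 1.2]])
mixed = sources @ mixing.T

# FastICA recovers the sources up to permutation, sign, and scale.
ica = FastICA(n_components=2, random_state=0)
recovered = ica.fit_transform(mixed)
print(recovered.shape)  # (2000, 2)
```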

Generative Adversarial Networks. GANs [24] include two key components, a generator and a discriminator: the generator tries to generate realistic samples while the discriminator tries to distinguish real samples from generated ones. The adversarial training mechanism helps the generator's outputs match the distribution of real data. GANs are especially successful in image generation [75, 39] and image-to-image translation [35, 29, 83]. They are also a common tool for image decomposition tasks, having been successfully employed in image deraining [76, 58], transparency separation [52, 80], and image dehazing [45].

3 Blind Image Decomposition Formulation

Given a set of N (N ≥ 2) source components, i.e., image domains, denoted by {X_1, X_2, ..., X_N}, each source component X_i contains images x_i, x_i ∈ X_i. n (1 ≤ n ≤ N) source components are randomly selected from {X_1, X_2, ..., X_N}. Let I denote the index set of the selected source components, where |I| = n. Hence, the selected source components are denoted by {X_i}, i ∈ I, and each selected source component contributes an image x_i, i ∈ I. With a predetermined mixing function g, the mixed image is given by y = g({x_i}, i ∈ I). The mixed image y can be identical to a single image x_i when n = 1. The BID task requires a BID method to find a function f that separates y, f(y) = {x̂_i}, i ∈ I, such that each reconstructed image x̂_i is close to its corresponding image x_i. That is, given a mixed image y as input, the task requires the BID method to: (1) predict the source components involved in the mixing, and (2) reconstruct/generate images that preserve the fidelity of the corresponding images involved in the mixing.

For the example shown in Figure 2, where N = 4 and n = 2, the index set is I = {1, 3}. Let x_1 ∈ X_1 and x_3 ∈ X_3. Given y = g(x_1, x_3) as the input, knowing only that there are four possible source components and nothing else, the task requires the method to find a function f so that f(y) = (x̂_1, x̂_3), where x̂_1 ≈ x_1 and x̂_3 ≈ x_3. The method should also correctly predict the source components involved in the mixing, that is, output the prediction [1, 0, 1, 0], which is equivalent to predicting I = {1, 3}.
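A minimal sketch of this interface, using random arrays as placeholder images and a plain average as a stand-in for the unknown mixing function g:

```python
import numpy as np

N = 4              # maximum number of source components
I = [1, 3]         # index set of the selected components (1-based, as in the text)

# Placeholder images x_1 and x_3 (H x W x 3 arrays in [0, 1]).
x = {i: np.random.rand(256, 256, 3) for i in I}

def g(images):
    """Stand-in mixing function: a plain average of the selected images."""
    return sum(images) / len(images)

y = g(list(x.values()))   # the mixed image presented to the BID method

# Ground-truth prediction target over the N possible components.
z = [1 if i + 1 in I else 0 for i in range(N)]
print(z)  # [1, 0, 1, 0]
```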

The BID task is challenging for the following reasons: (1) When N increases, the number of possible combinations increases rapidly; the BID setting forces the method to deal with 2^N − 1 possible combinations. For instance, when N increases to 8, there are 255 variants of y. (2) The task requires the method to predict the source components involved in the mixing, which is difficult when N is large and n varies a lot. (3) The mixing mechanism, i.e., the mixing function g, is unknown to the method. It varies with different source components and can be non-linear and complex in specific circumstances, such as rendering raindrop images or adding shadows or reflections to images. (4) As n increases, each source component contributes a decreasing amount of information to y, making the task highly ill-posed.

4 Blind Image Decomposition Network

To perform diverse BID tasks, a generic framework is required. Inspired by the success of image-to-image translation models [35, 29, 83], we design our Blind Image Decomposition Network (BIDeN) as follows; Figure 2 presents its overall architecture.

The generator consists of two parts: a multi-scale encoder E and multiple heads H_1, ..., H_N. The encoder contains three branches to capture features at multiple scales, which benefits source components that require different scales of features for reconstruction. We concatenate the features from the different scales and send them to the heads, where the number of heads equals the maximum number of source components; each head is specialized for reconstructing one particular kind of source component. The discriminator consists of two branches that share most of their weights. The reconstructed images x̂_i and their corresponding real images x_i are sent individually to the separation branch D_S, which functions like a typical discriminator. The prediction branch D_P predicts the source components involved in the mixed image y; a successful prediction recovers the index set I of the selected source components.

Taking the example of Figure 2, we name the four heads H_1, H_2, H_3, and H_4. For an input y = g(x_1, x_3), H_1 and H_3 aim to reconstruct x_1 and x_3, so that x̂_1 ≈ x_1 and x̂_3 ≈ x_3, while H_2 and H_4 are free to output anything or can simply be turned off. We employ an adversarial loss [24], a perceptual loss [37], an L1/L2 loss, and a binary cross-entropy loss. The details of the objective are given below.
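The sketch below mirrors this structure in PyTorch at a very coarse level; the branch depths, channel widths, and individual layers are placeholders rather than the actual BIDeN configuration (see Appendix A.2 for the reported layer specification):

```python
import torch
import torch.nn as nn

class GeneratorSketch(nn.Module):
    """Multi-scale encoder with three branches plus one head per component."""
    def __init__(self, num_components=4, ch=64):
        super().__init__()
        # Three branches of different depths, standing in for different receptive fields.
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(3, ch, 3, padding=1), nn.ReLU(inplace=True),
                *[nn.Conv2d(ch, ch, 3, padding=1) for _ in range(depth)])
            for depth in (1, 2, 3)
        ])
        # One head per possible source component.
        self.heads = nn.ModuleList([
            nn.Conv2d(3 * ch, 3, 3, padding=1) for _ in range(num_components)
        ])

    def forward(self, y):
        feats = torch.cat([branch(y) for branch in self.branches], dim=1)
        return [head(feats) for head in self.heads]   # one reconstruction per head

class DiscriminatorSketch(nn.Module):
    """Shared trunk with a separation branch and a source-prediction branch."""
    def __init__(self, num_components=4, ch=64):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Conv2d(3, ch, 4, stride=2, padding=1), nn.LeakyReLU(0.2))
        self.separation = nn.Conv2d(ch, 1, 4, padding=1)        # patch real/fake scores
        self.prediction = nn.Sequential(                        # which components are mixed in
            nn.AdaptiveMaxPool2d(1), nn.Flatten(), nn.Linear(ch, num_components))

    def forward(self, img):
        f = self.shared(img)
        return self.separation(f), self.prediction(f)
```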

4.1 Objective

We employ the adversarial loss [24] to encourage the generator to output separated, realistic images, regardless of the source components. For the generator G and the separation branch D_S, the GAN loss is expressed as:

L_GAN(G, D_S) = E_{x_i}[log D_S(x_i)] + E_y[log(1 − D_S(x̂_i))],    (1)

where G tries to separate the input mixed image y into the reconstructed outputs x̂_i = G(y)_i, i ∈ I, while D_S attempts to distinguish between x̂_i and the real samples x_i; this notation is used throughout Equations 1-5. In practice we employ the LSGAN [53] loss and the Markovian discriminator [35].

The reconstructed images x̂_i should not only be well separated but also remain close to the corresponding x_i in a distance sense. Hence, we employ a perceptual loss (VGG loss) [37] and an L1/L2 loss, formalized as:

L_VGG(G) = E[ Σ_j w_j || φ_j(x_i) − φ_j(x̂_i) ||_1 ],    (2)

L_L1(G) = E[ || x_i − x̂_i ||_1 ],    (3)

L_L2(G) = E[ || x_i − x̂_i ||_2^2 ],    (4)

where φ is a trained VGG19 [64] network, φ_j denotes a specific layer, and w_j denotes the weight for the j-th layer. The choice of layers and weights is identical to pix2pixHD [68]. We use the L2 loss for masks and the L1 loss for the other source components; for simplicity, we jointly denote the L1/L2 loss as L_L1/L2.

For the source prediction task, we find that the discriminator performs better than the generator. Because the discriminator's goal is to classify reconstructed samples against real samples, it naturally learns an embedding, and this embedding remains useful even when the input is a mixed image y. The discriminator is therefore capable of performing an additional source prediction task, for which we design the prediction branch D_P and employ a binary cross-entropy loss:

L_BCE(D_P) = −E_y[ Σ_{i=1}^{N} z_i log D_P(y)_i + (1 − z_i) log(1 − D_P(y)_i) ],    (5)

where N denotes the maximum number of source components, z ∈ {0, 1}^N denotes the binary label of the source components involved in the mixing of the input image y, and D_P is the source prediction branch of the discriminator.

Our final objective function is:

L = λ_GAN L_GAN + λ_VGG L_VGG + λ_L1/L2 L_L1/L2 + λ_BCE L_BCE,    (6)

where we set λ_GAN, λ_VGG, λ_L1/L2, and λ_BCE to 1, 10, 30, and 1, respectively. This generic setting is applied to all tasks.
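As a rough illustration of how these terms can be combined in code, the sketch below uses standard LSGAN, perceptual, L1, and BCE terms; the weight-to-loss mapping follows our reading of the text above and should not be taken as the exact implementation.

```python
import torch
import torch.nn.functional as F

def bid_objective_sketch(real, fake, d_fake, pred_logits, labels,
                         vgg_feats_real, vgg_feats_fake,
                         w_gan=1.0, w_vgg=10.0, w_rec=30.0, w_bce=1.0):
    # LSGAN-style generator term: push discriminator scores on fakes toward 1.
    loss_gan = F.mse_loss(d_fake, torch.ones_like(d_fake))
    # Perceptual term: distance between VGG feature maps, summed over layers.
    loss_vgg = sum(F.l1_loss(fr, ff)
                   for fr, ff in zip(vgg_feats_real, vgg_feats_fake))
    # Reconstruction term (L1 here; L2 would be used for masks).
    loss_rec = F.l1_loss(fake, real)
    # Source-prediction term; labels is a float tensor of 0/1 entries.
    loss_bce = F.binary_cross_entropy_with_logits(pred_logits, labels)
    return (w_gan * loss_gan + w_vgg * loss_vgg
            + w_rec * loss_rec + w_bce * loss_bce)
```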

4.2 Training details

Throughout all experiments, we use the Adam optimizer [40] with β1 = 0.5 and β2 = 0.999 for both the generator and the discriminator. BIDeN is trained for 200 epochs with a learning rate of 0.0003, which starts to decay linearly after half of the total epochs. We use a batch size of 1 and instance normalization [66]. Training images are resized on loading and then randomly cropped into patches, and horizontal flips are applied randomly; at test time, test images are loaded at a fixed resolution. More details on the training settings, the architecture, the number of parameters, and the training speed are provided in the Appendices, along with the training details for the baselines.
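A minimal sketch of this optimization schedule (Adam with betas (0.5, 0.999), learning rate 3e-4, linear decay over the second half of 200 epochs); the `generator` and `discriminator` names in the usage note are placeholders:

```python
from torch import optim

def make_optimizer_and_scheduler(model, epochs=200, lr=3e-4):
    opt = optim.Adam(model.parameters(), lr=lr, betas=(0.5, 0.999))

    def lr_lambda(epoch):
        # Constant learning rate for the first half, then linear decay to zero.
        half = epochs // 2
        return 1.0 if epoch < half else max(0.0, 1.0 - (epoch - half) / half)

    sched = optim.lr_scheduler.LambdaLR(opt, lr_lambda=lr_lambda)
    return opt, sched

# Usage (assuming generator/discriminator are nn.Module instances):
#   g_opt, g_sched = make_optimizer_and_scheduler(generator)
#   d_opt, d_sched = make_optimizer_and_scheduler(discriminator)
#   ... call g_sched.step() and d_sched.step() once per epoch.
```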

Figure 3: Qualitative results of Task I (Mixed image decomposition across multiple domains). We train BIDeN seven times, each time with a different maximum number of source components (2-8). Double-DIP fails to separate the mixed input, and DAD shows blurry, non-clean results, while the results of BIDeN are well separated and visually satisfying. When BIDeN is trained with a greater number of source components, the quality of its results drops progressively.
Method | Fruit (A) PSNR / SSIM / FID | Animal (B) PSNR / SSIM / FID | Acc (AB) | Acc (All) | Model size
Double-DIP [21] | 13.14 / 0.49 / 257.80 | 13.11 / 0.39 / 221.76 | - | - | -
DAD [84] | 17.59 / 0.72 / 137.66 | 17.52 / 0.62 / 126.32 | 0.996 | - | 669.0 MB
BIDeN (2) | 20.07 / 0.79 / 62.99 | 19.89 / 0.69 / 69.35 | 1.0 | 0.957 | 144.9 MB
BIDeN (3) | 19.04 / 0.75 / 74.68 | 18.75 / 0.61 / 88.23 | 0.836 | 0.807 | 147.1 MB
BIDeN (4) | 18.19 / 0.73 / 79.03 | 18.03 / 0.58 / 97.16 | 0.716 | 0.733 | 149.3 MB
BIDeN (5) | 17.66 / 0.71 / 81.17 | 17.27 / 0.54 / 114.40 | 0.676 | 0.603 | 151.5 MB
BIDeN (6) | 17.28 / 0.69 / 85.64 | 16.57 / 0.51 / 118.00 | 0.646 | 0.483 | 153.7 MB
BIDeN (7) | 16.70 / 0.68 / 97.26 | 16.54 / 0.49 / 126.66 | 0.413 | 0.310 | 155.9 MB
BIDeN (8) | 16.49 / 0.67 / 105.61 | 15.79 / 0.45 / 191.29 | 0.383 | 0.278 | 158.1 MB
Table 1: Quantitative results on Task I (Mixed image decomposition across multiple domains). The testing condition is identical for all methods, using the Fruit (A) + Animal (B) mixture as input. N in BIDeN (N) denotes the maximum number of source components. Double-DIP [21] performs poorly. Even under the more challenging BID setting, BIDeN (2, 3, 4) still outperforms DAD [84] overall, suggesting the superiority of BIDeN. Please refer to the Appendices for detailed case results.

5 Blind Image Decomposition Tasks

We construct three benchmark datasets from different views to support practical usages of BID. To explore the generality of BIDeN and the tenability of the constructed datasets, we test BIDeN on the three new challenging datasets. BIDeN is trained under the BID setting, which is more difficult than the conventional image decomposition setting: during training, mixed images are randomly synthesized, while at test time the input mixed images are fixed. As BID is a novel task, no existing baselines are available for direct comparison, so for different tasks we choose different evaluation strategies and baselines.

Throughout all tasks, BIDeN is trained under the BID setting, meaning it faces more challenging requirements than the other baselines. Moreover, BIDeN is a generic model designed to perform all kinds of BID tasks. These two constraints limit its performance, yet when compared to baselines designed for specific tasks, BIDeN still shows very competitive results on all of them. All qualitative results are randomly picked. More task settings and results, including the detailed case results of BIDeN, are provided in the Appendices.

5.1 Task I: Mixed image decomposition across multiple domains

Dataset. This dataset contains eight different image domains, i.e., source components. Each domain has approximately 2500 to 3000 images in its training set, and the test set contains 300 images for each domain. The image domains are designed to be broad and inclusive, covering whole categories such as animal, fruit, and vehicle, rather than narrow domains such as horse, cat, and car. The eight domains are Fruit (2653), Animal (2653), Flower (2950), Furniture (2582), Yosemite (2855), Vehicle (2670), Vegetable (2595), and CityScape (2975). CityScape and Flower are selected from the CityScape [9] dataset and the VGG flower [55] dataset, respectively. The remaining six image domains are mainly gathered from Flickr using the corresponding keywords, except for Yosemite, which also incorporates the Summer2Winter dataset from CycleGAN [83]. The order of the eight domains is randomly shuffled. The mixing mechanism for this dataset is linear mixup [77].

Figure 4: CityScape, masks (Rain Streak, Snow, Raindrop), and transmission map (Haze) generated by BIDeN for case (6), rain streak + snow + moderate haze + raindrop. All generated images are perceptually faithful and visually close to the ground-truth (GT).
Method | CityScape PSNR / SSIM | Rain Streak PSNR / SSIM | Snow PSNR / SSIM | Haze PSNR / SSIM | Raindrop PSNR / SSIM | Acc
BIDeN (1) | 30.89 / 0.932 | 32.13 / 0.924 | - | - | - | 0.998
BIDeN (2) | 29.34 / 0.899 | 29.24 / 0.846 | 25.77 / 0.692 | - | - | 0.996
BIDeN (3) | 28.62 / 0.919 | 31.48 / 0.914 | - | 30.77 / 0.960 | - | 0.994
BIDeN (4) | 26.77 / 0.898 | 30.57 / 0.897 | - | 33.73 / 0.957 | - | 0.998
BIDeN (5) | 27.11 / 0.898 | 30.54 / 0.898 | - | 30.52 / 0.952 | 20.20 / 0.908 | 0.994
BIDeN (6) | 26.44 / 0.870 | 28.31 / 0.823 | 24.79 / 0.658 | 29.83 / 0.948 | 21.47 / 0.893 | 0.998
Table 2: Results of BIDeN on Task II (Real-scenario deraining). We report PSNR and SSIM for CityScape images, masks, and transmission maps. The 6 test cases follow Figure 1: (1) rain streak; (2) rain streak + snow; (3) rain streak + light haze; (4) rain streak + heavy haze; (5) rain streak + moderate haze + raindrop; (6) rain streak + snow + moderate haze + raindrop. Note that only haze is divided into light/moderate/heavy intensities; the training and test sets for Rain Streak, Snow, and Raindrop already contain different intensities.
Figure 5: Results of Task II (Real-scenario deraining); columns show the Input, MPRNet, BIDeN, and GT. Rows 1-6 present the 6 cases listed in Table 2. MPRNet [74] performs well on case (1): it effectively removes rain streaks but is less effective at removing the other components, whereas BIDeN is better at removing them. BIDeN generates artifact-free, visually pleasing results, while MPRNet leaves some components incompletely removed.
Case | Input PSNR / SSIM | MPRNet PSNR / SSIM | BIDeN PSNR / SSIM
(1) | 25.69 / 0.786 | 33.39 / 0.945 | 30.89 / 0.932
(2) | 18.64 / 0.564 | 30.52 / 0.909 | 29.34 / 0.899
(3) | 17.45 / 0.712 | 23.98 / 0.900 | 28.62 / 0.919
(4) | 11.12 / 0.571 | 18.54 / 0.829 | 26.77 / 0.898
(5) | 14.05 / 0.616 | 21.18 / 0.846 | 27.11 / 0.898
(6) | 12.38 / 0.461 | 20.76 / 0.812 | 26.44 / 0.870
Table 3: Comparison on Task II (Real-scenario deraining) between BIDeN and MPRNet [74]. MPRNet shows superior results on cases (1) and (2); BIDeN is better on the other four cases. For the details of the 6 test cases, please refer to Table 2 and Figure 1.

Experiments and results. We compare BIDeN to Double-DIP [21] and DAD [84]. DAD is trained on the first two domains (Fruit, Animal) with mixed inputs only. We train BIDeN 7 times under the BID setting, varying the maximum number of domains from 2 to 8. At test time, we evaluate the separation results on the Fruit + Animal mixture using PSNR, SSIM [82], and FID [31]. For BIDeN, the accuracy of the source prediction is reported for all tasks. We report case results for all tasks and the overall averaged result for Task I.

Table 1 presents the results for Task I. In terms of PSNR/SSIM, BIDeN outperforms Double-DIP by a large margin and outperforms DAD when N is less than 5. In terms of FID, BIDeN shows better results than DAD even for larger N, suggesting the superiority of BIDeN. Figure 3 presents the qualitative results of Task I.

5.2 Task II: Real-scenario deraining

Dataset. We construct the real-scenario deraining dataset based on the CityScape [9] dataset. We use the test set from the original CityScape dataset as our training set (2975), and the validation set from the original CityScape dataset as our test set (500). The test set for all source components contains a fixed number of 500 images. We use three different masks, including rain streak (1620), raindrop (3500), and snow (3500). These masks cover different intensities. For haze, we use the corresponding transmission maps (2975 x 3) with three different intensities acquired from Foggy CityScape [63]. The masks for rain streak are acquired from Rain100L and Rain100H [73] while the masks for snow are selected from Snow100K [51]. For raindrop masks, we model the droplet shape and property using the metaball model [5]. The locations/numbers/sizes of raindrops are randomly sampled. Paired refraction maps are generated using [8, 57]. The mixing mechanism for this dataset is based on physical imaging models [73, 51, 30, 8, 57].

Figure 6: Images (both SRD and ISTD) and masks (Shadow, Watermark) produced by BIDeN for three cases, (a), (c), and (abc). All generated masks are faithful to the ground-truth (GT). However, the generated SRD/ISTD images suffer from a color shift. Note that we do not require the reconstruction of reflection layer images.
Method | Version one (V1), ISTD: Shadow / Non-Shadow / All / Acc | Version two (V2), SRD: Shadow / Non-Shadow / All / Acc
Bilateral [72] | 19.82 / 14.83 / 15.63 / - | 23.43 / 22.26 / 22.57 / -
Regions [26] | 18.95 / 7.46 / 9.30 / - | 29.89 / 6.47 / 12.60 / -
Interactive [23] | 14.98 / 7.29 / 8.53 / - | 19.58 / 4.92 / 8.73 / -
DSC [32] | 9.48 / 6.14 / 6.67 / - | 10.89 / 4.99 / 6.23 / -
DHAN [10] | 8.14 / 6.04 / 6.37 / - | 8.94 / 4.80 / 5.67 / -
Auto-Exposure [17] | 7.77 / 5.56 / 5.92 / - | 8.56 / 5.75 / 6.51 / -
BIDeN (a) | 11.55 / 10.24 / 10.45 / 0.359 | 12.06 / 7.47 / 8.73 / 0.919
BIDeN (ab) | 12.96 / 10.77 / 11.12 / 0.661 | 14.10 / 8.16 / 9.79 / 0.911
BIDeN (ac) | 11.89 / 10.23 / 10.50 / 0.694 | 13.29 / 8.08 / 9.51 / 0.943
BIDeN (abc) | 13.20 / 10.76 / 11.16 / 0.929 | 15.28 / 8.85 / 10.62 / 0.936
BIDeN (b) | - / - / 10.85 / 0.559 | - / - / 8.01 / 0.891
BIDeN (c) | - / - / 10.20 / 0.461 | - / - / 7.92 / 0.914
BIDeN (bc) | - / - / 10.77 / 0.727 | - / - / 8.71 / 0.879
Table 4: Results on Task III (Joint shadow/reflection/watermark removal), Version one (V1) and Version two (V2). We employ RMSE to measure the shadow region, the non-shadow region, and the whole image. For BIDeN, we report the performance on all cases; a, b, c denote shadow, reflection, and watermark, respectively, so BIDeN (ab) is the result of BIDeN tested on shadow + reflection inputs. Results for all baselines are reported by [10, 17]. The generality of BIDeN and the BID training setting limit its performance.
Figure 7: Results of Task III (Joint shadow/reflection/watermark removal); columns show the Input, DHAN, Auto-Exposure, BIDeN, and GT. Rows 1-3 show results on Version one (V1) and rows 4-6 on Version two (V2). We compare BIDeN to DHAN [10] and Auto-Exposure [17] on the shadow removal task. Although constrained by its generality and the BID training setting, BIDeN still effectively removes shadows and shows competitive results. However, BIDeN suffers from a color shift, whereas DHAN and Auto-Exposure better preserve the fidelity of the original colors.

Experiments and results. We train both BIDeN and MPRNet [74] under the BID setting. For MPRNet, we do not require the prediction of the source components or the generation of masks. We report results for the 6 cases illustrated in Figure 1. Note that only haze is divided into light/moderate/heavy intensities; the training and test sets for rain streak, snow, and raindrop already contain different intensities. We report PSNR and SSIM for CityScape images, masks, and transmission maps.

For all 6 cases, we report the detailed results of BIDeN in Table 2. BIDeN shows excellent quantitative results on the accuracy metric, and its PSNR/SSIM scores are strong on all components except the raindrop masks. An example of all components generated by BIDeN is shown in Figure 4. Table 3 and Figure 5 present the comparison between BIDeN and MPRNet [74]. For better visualization, we resize the visual examples to match the original CityScape resolution.

5.3 Task III: Joint shadow/reflection/watermark removal

Dataset. This task is designed to jointly perform multiple tasks, with uncertainty, in one go: once a model is trained on this dataset under the BID setting, it can handle all of them in a single pass. We construct two versions of this dataset: Version one (V1) is based on ISTD [68] and Version two (V2) is based on SRD [59, 10]. We use the paired shadow masks, shadow-free images, and shadow images from ISTD/SRD. ISTD consists of 1330 training images and 540 test images, while SRD contains 2680 training images and 408 test images. The algorithm for adding reflections to images is acquired from [80]; we select 3120 images from the reflection subset of [80] as the reflection layer. The watermark generation algorithm, as well as the paired watermark masks and RGB watermark images, is acquired from LVW [49]; we select 3000 paired watermark images and masks from the training set of LVW [49]. Following the data splits of ISTD and SRD, V1 contains 1330 shadow-free images/shadow masks, 2580 reflection layer images, and 2460 watermark images/masks as the training set; its test set contains 540 images for every source component. V2 includes 2680 shadow-free images/shadow masks, 2580 reflection layer images, and 2460 watermark images/masks as the training set; its test set contains 408 images for every source component. We do not require the reconstruction of reflection layer images.

Experiments and results. We mainly compare the shadow removal results to multiple shadow removal baselines, including [72, 26, 23, 59, 32, 10, 17]. We train BIDeN under the BID setting; the trained BIDeN is capable of handling all combinations of the shadow/reflection/watermark removal tasks. At test time, we report the results for all cases. We employ the root mean square error (RMSE) in LAB color space, following [10, 17].

Figure 6 presents visual examples of three cases. The quantitative results for Task III are reported in Table 4. Constrained by its generality and the BID training setting, BIDeN does not show superior quantitative results compared to baselines designed only for shadow removal. Its performance also varies across datasets, being better on Version two (V2), especially on the accuracy metric. Figure 7 presents the qualitative comparison between BIDeN and two state-of-the-art baselines [10, 17].

6 Ablation Study

We perform ablation experiments to analyze the effectiveness of each component inside BIDeN. Evaluation is performed on Task I (Mixed image decomposition across multiple domains). We set the maximum number of source components to be 4 throughout all ablation experiments. The results are shown in Table 5.

Multi-scale encoder (I). We present the results of using a single-scale encoder to replace the multi-scale encoder. BIDeN yields better performance when the multi-scale encoder is employed, which validates the effectiveness of our design.

Choice of losses (II, III, IV, V). BIDeN's objective consists of four different losses; we show that removing any one of them leads to a performance drop.

Ablation | Fruit (A) PSNR | Animal (B) PSNR | Acc (AB) | Acc (All)
I | 17.26 | 17.05 | 0.730 | 0.732
II | 17.95 | 17.41 | 0.566 | 0.616
III | 16.67 | 16.34 | 0.706 | 0.722
IV | 15.56 | 13.65 | 0.733 | 0.753
V | 18.04 | 17.98 | 0.0 | 0.06
VI | 18.19 | 17.97 | 0.634 | 0.698
VII | 18.13 | 17.98 | 0.520 | 0.609
VIII | 15.68 | 15.64 | 0.716 | 0.683
BIDeN | 18.19 | 18.03 | 0.716 | 0.733
Table 5: Ablation study on the design choices of BIDeN. (I) Single-scale encoder. (II) No adversarial loss. (III) No perceptual loss. (IV) No L1/L2 loss. (V) No binary cross-entropy loss. (VI) Source prediction branch inside the generator. (VII) No weights sharing between two branches of discriminator. (VIII) Zeroed loss.

Source prediction branch (VI, VII). We move the prediction branch to the generator. Such a change degrades performance, showing that the source prediction task is better performed inside the discriminator. We also report the results for a variant in which the prediction branch D_P does not share weights with the separation branch D_S. The performance of this variant is worse than vanilla BIDeN, which indicates that the embedding learned by D_S is beneficial to D_P.

Zeroed loss (VIII). Taking the example of Figure 2, the four heads are H_1, H_2, H_3, and H_4, and BIDeN ignores the outputs from H_2 and H_4. Here, we instead encourage the outputs of these two heads to be all-zero images. Such a zeroed loss forces the generator to perform the source prediction task implicitly; however, as the source prediction task is challenging, the results after applying the zeroed loss are not comparable to the default BIDeN.

7 Discussion

BID is a novel task designed to advance real-world vision systems. However, some limitations remain in the setting of the BID task: (1) The training process is constrained by a maximum number of source components, and each head is specific to a certain component/domain; when tested on out-of-distribution examples, the BID setting, as well as BID methods, may fail. (2) In Task I (mixed image decomposition across multiple domains), if the input is a mixture of multiple images from one domain, such as a mixture of bird, cat, and dog images, our constructed dataset treats them all as the animal domain, so BIDeN cannot perform a precise source component prediction or generate separated outputs with high fidelity. (3) The training data relies heavily on synthetic data; how to bridge the gap between synthetic and real data remains a promising research direction [69].

Our goal is to invite the community to explore the novel BID task, including discovering interesting areas of application, developing novel methods, extending the BID setting, and constructing benchmark datasets. We hope to see the BID task driving innovation in image decomposition and its applications. We expect more application areas related to image decomposition to consider our new BID setting, especially in image deraining. Novel deraining methods under the BID setting can be developed.

Finally, the BID task may also benefit learning with other types of data, such as video, speech, 3D visual data, or even natural language. We hope our proposed BID task brings insights to future research in the field of Artificial Intelligence.

References

  • [1] H. A. Alhaija, S. K. Mustikovela, J. Thies, V. Jampani, M. Nießner, A. Geiger, and C. Rother (2020) Intrinsic autoencoders for joint deferred neural rendering and intrinsic image decomposition. In 2020 International Conference on 3D Vision (3DV), pp. 1176–1185. Cited by: §1.
  • [2] S. Alpert, M. Galun, A. Brandt, and R. Basri (2011) Image segmentation by probabilistic bottom-up aggregation and cue integration. IEEE transactions on pattern analysis and machine intelligence 34 (2), pp. 315–327. Cited by: §1.
  • [3] C. Asha, S. K. Bhat, D. Nayak, and C. Bhat (2019) Auto removal of bright spot from images captured against flashing light source. In 2019 IEEE International Conference on Distributed Computing, VLSI, Electrical Circuits and Robotics (DISCOVER), pp. 1–6. Cited by: §1.
  • [4] D. Berman, S. Avidan, et al. (2016) Non-local image dehazing. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1674–1682. Cited by: §1, §1.
  • [5] J. F. Blinn (1982) A generalization of algebraic surface drawing. ACM transactions on graphics (TOG) 1 (3), pp. 235–256. Cited by: §A.3, §5.2.
  • [6] X. Chen, W. Wang, C. Bender, Y. Ding, R. Jia, B. Li, and D. Song (2021) Refit: a unified watermark removal framework for deep learning systems with limited data. In Proceedings of the 2021 ACM Asia Conference on Computer and Communications Security, pp. 321–335. Cited by: §1.
  • [7] A. Cichocki and S. Amari (2002) Adaptive blind signal and image processing: learning algorithms and applications. John Wiley & Sons. Cited by: §2.
  • [8] J. Cohen, M. Olano, and D. Manocha (1998) Appearance-preserving simplification. In Proceedings of the 25th annual conference on Computer graphics and interactive techniques, pp. 115–122. Cited by: §A.3, §A.3, §5.2.
  • [9] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele (2016) The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3213–3223. Cited by: §5.1, §5.2.
  • [10] X. Cun, C. Pun, and C. Shi (2020) Towards ghost-free shadow removal via dual hierarchical aggregation network and shadow matting gan. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 10680–10687. Cited by: §A.3, Figure 7, §5.3, §5.3, §5.3, Table 4.
  • [11] B. Ding, C. Long, L. Zhang, and C. Xiao (2019) Argan: attentive recurrent generative adversarial network for shadow detection and removal. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10213–10222. Cited by: §1, §1.
  • [12] C. Du, B. Kang, Z. Xu, J. Dai, and T. Nguyen (2018) Accurate and efficient video de-fencing using convolutional neural networks and temporal information. In 2018 IEEE International Conference on Multimedia and Expo (ICME), pp. 1–6. Cited by: §1.
  • [13] M. J. Fadili, J. Starck, J. Bobin, and Y. Moudden (2009) Image decomposition and separation using sparse representations: an overview. Proceedings of the IEEE 98 (6), pp. 983–994. Cited by: §1.
  • [14] A. Faktor and M. Irani (2013) Co-segmentation by composition. In Proceedings of the IEEE international conference on computer vision, pp. 1297–1304. Cited by: §1.
  • [15] Q. Fan, J. Yang, G. Hua, B. Chen, and D. Wipf (2017) A generic deep architecture for single image reflection removal and image smoothing. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3238–3247. Cited by: §1.
  • [16] G. D. Finlayson, M. S. Drew, and C. Lu (2009) Entropy minimization for shadow removal. International Journal of Computer Vision 85 (1), pp. 35–57. Cited by: §1, §1.
  • [17] L. Fu, C. Zhou, Q. Guo, F. Juefei-Xu, H. Yu, W. Feng, Y. Liu, and S. Wang (2021) Auto-exposure fusion for single-image shadow removal. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10571–10580. Cited by: §1, §1, Figure 7, §5.3, §5.3, Table 4.
  • [18] K. Gai, Z. Shi, and C. Zhang (2008) Blindly separating mixtures of multiple layers with spatial shifts. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8. Cited by: §2.
  • [19] K. Gai, Z. Shi, and C. Zhang (2009) Blind separation of superimposed images with unknown motions. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1881–1888. Cited by: §2.
  • [20] A. Galdran, D. Pardo, A. Picón, and A. Alvarez-Gila (2015) Automatic red-channel underwater image restoration. Journal of Visual Communication and Image Representation 26, pp. 132–145. Cited by: §1.
  • [21] Y. Gandelsman, A. Shocher, and M. Irani (2019) "Double-DIP": unsupervised image decomposition via coupled deep-image-priors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11026–11035. Cited by: §A.1, §1, §1, §2, Table 1, §5.1.
  • [22] X. Glorot and Y. Bengio (2010) Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp. 249–256. Cited by: §A.1.
  • [23] H. Gong and D. Cosker (2014) Interactive shadow removal and ground truth for variable scene categories.. In BMVC, pp. 1–11. Cited by: §5.3, Table 4.
  • [24] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in neural information processing systems, Cited by: §1, §2, §4.1, §4.
  • [25] S. Gu, D. Meng, W. Zuo, and L. Zhang (2017) Joint convolutional analysis and synthesis sparse representation for single image layer separation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1708–1716. Cited by: §1, §1.
  • [26] R. Guo, Q. Dai, and D. Hoiem (2012) Paired regions for shadow detection and removal. IEEE transactions on pattern analysis and machine intelligence 35 (12), pp. 2956–2967. Cited by: §5.3, Table 4.
  • [27] T. Halperin, A. Ephrat, and Y. Hoshen (2019) Neural separation of observed and unobserved distributions. In International Conference on Machine Learning, pp. 2566–2575. Cited by: §1, §1.
  • [28] J. Han, M. Shoeiby, T. Malthus, E. Botha, J. Anstee, S. Anwar, R. Wei, M. A. Armin, H. Li, and L. Petersson (2021) Underwater image restoration via contrastive learning and a real-world dataset. arXiv preprint arXiv:2106.10718. Cited by: §1.
  • [29] J. Han, M. Shoeiby, L. Petersson, and M. A. Armin (2021) Dual contrastive learning for unsupervised image-to-image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Cited by: §2, §4.
  • [30] K. He, J. Sun, and X. Tang (2010) Single image haze removal using dark channel prior. IEEE transactions on pattern analysis and machine intelligence 33 (12), pp. 2341–2353. Cited by: §A.3, §1, §5.2.
  • [31] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter (2017) Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in neural information processing systems, Cited by: §5.1.
  • [32] X. Hu, C. Fu, L. Zhu, J. Qin, and P. Heng (2019) Direction-aware spatial context features for shadow detection and removal. IEEE transactions on pattern analysis and machine intelligence 42 (11), pp. 2795–2808. Cited by: §5.3, Table 4.
  • [33] A. Hyvärinen and E. Oja (1997) A fast fixed-point algorithm for independent component analysis. Neural computation 9 (7), pp. 1483–1492. Cited by: §2.
  • [34] A. Hyvärinen and E. Oja (2000) Independent component analysis: algorithms and applications. Neural networks 13 (4-5), pp. 411–430. Cited by: §2.
  • [35] P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2017) Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1125–1134. Cited by: §2, §4.1, §4.
  • [36] V. Jayaram and J. Thickstun (2020) Source separation with deep generative priors. In International Conference on Machine Learning, pp. 4724–4735. Cited by: §1, §2.
  • [37] J. Johnson, A. Alahi, and L. Fei-Fei (2016) Perceptual losses for real-time style transfer and super-resolution. In European conference on computer vision, pp. 694–711. Cited by: §A.2, §4.1, §4.
  • [38] L. Kang, C. Lin, and Y. Fu (2011) Automatic single-image-based rain streaks removal via image decomposition. IEEE transactions on image processing 21 (4), pp. 1742–1755. Cited by: §1, §1.
  • [39] T. Karras, S. Laine, and T. Aila (2019) A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4401–4410. Cited by: §2.
  • [40] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. International Conference on Learning Representations (ICLR). Cited by: §4.2.
  • [41] Q. Kong, Y. Xu, W. Wang, P. J. Jackson, and M. D. Plumbley (2019) Single-channel signal separation and deconvolution with generative adversarial networks. arXiv preprint arXiv:1906.07552. Cited by: §1.
  • [42] H. Le and D. Samaras (2019) Shadow removal via shadow image decomposition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8578–8587. Cited by: §1, §1.
  • [43] T. Lee, M. S. Lewicki, and T. J. Sejnowski (2000) ICA mixture models for unsupervised classification of non-gaussian classes and automatic context switching in blind signal separation. IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (10), pp. 1078–1089. Cited by: §2.
  • [44] C. Li, Y. Yang, K. He, S. Lin, and J. E. Hopcroft (2020) Single image reflection removal through cascaded refinement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3565–3574. Cited by: §1.
  • [45] R. Li, J. Pan, Z. Li, and J. Tang (2018) Single image dehazing via conditional generative adversarial network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8202–8211. Cited by: §1, §1, §2.
  • [46] S. Li, I. B. Araujo, W. Ren, Z. Wang, E. K. Tokuda, R. H. Junior, R. Cesar-Junior, J. Zhang, X. Guo, and X. Cao (2019) Single image deraining: a comprehensive benchmark analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3838–3847. Cited by: §1, §1.
  • [47] X. Li, J. Wu, Z. Lin, H. Liu, and H. Zha (2018) Recurrent squeeze-and-excitation context aggregation net for single image deraining. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 254–269. Cited by: §1, §1.
  • [48] S. Lin, A. Ryabtsev, S. Sengupta, B. L. Curless, S. M. Seitz, and I. Kemelmacher-Shlizerman (2021) Real-time high-resolution background matting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8762–8771. Cited by: §1.
  • [49] Y. Liu, Z. Zhu, and X. Bai (2021) WDNet: watermark-decomposition network for visible watermark removal. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 3685–3693. Cited by: §A.3, §1, §5.3.
  • [50] Y. Liu, W. Lai, M. Yang, Y. Chuang, and J. Huang (2020) Learning to see through obstructions with layered decomposition. Cited by: §1.
  • [51] Y. Liu, D. Jaw, S. Huang, and J. Hwang (2018) DesnowNet: context-aware deep network for snow removal. IEEE Transactions on Image Processing 27 (6), pp. 3064–3073. Cited by: §A.3, §1, §5.2.
  • [52] D. Ma, R. Wan, B. Shi, A. C. Kot, and L. Duan (2019) Learning to jointly generate and separate reflections. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2444–2452. Cited by: §1, §2.
  • [53] X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. Paul Smolley (2017) Least squares generative adversarial networks. In IEEE international conference on computer vision (ICCV), pp. 2794–2802. Cited by: §4.1.
  • [54] S. K. Nayar and S. G. Narasimhan (1999) Vision in bad weather. In Proceedings of the Seventh IEEE International Conference on Computer Vision, Vol. 2, pp. 820–827. Cited by: §1.
  • [55] M. Nilsback and A. Zisserman (2006) A visual vocabulary for flower classification. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), Vol. 2, pp. 1447–1454. Cited by: §5.1.
  • [56] P. R. Oliveira and R. A. Romero (2008) Improvements on ica mixture models for image pre-processing and segmentation. Neurocomputing 71 (10-12), pp. 2180–2193. Cited by: §2.
  • [57] H. Porav, T. Bruls, and P. Newman (2019) I can see clearly now: image restoration via de-raining. In 2019 International Conference on Robotics and Automation (ICRA), pp. 7087–7093. Cited by: §A.3, §5.2.
  • [58] R. Qian, R. T. Tan, W. Yang, J. Su, and J. Liu (2018) Attentive generative adversarial network for raindrop removal from a single image. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2482–2491. Cited by: §1, §1, §2.
  • [59] L. Qu, J. Tian, S. He, Y. Tang, and R. W. Lau (2017) Deshadownet: a multi-context embedding deep network for shadow removal. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4067–4075. Cited by: §A.3, §5.3, §5.3.
  • [60] R. Quan, X. Yu, Y. Liang, and Y. Yang (2021) Removing raindrops and rain streaks in one go. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9147–9156. Cited by: §1.
  • [61] W. Ren, J. Tian, Z. Han, A. Chan, and Y. Tang (2017) Video desnowing and deraining based on matrix decomposition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4210–4219. Cited by: §1.
  • [62] C. Rother, V. Kolmogorov, and A. Blake (2004) "GrabCut": interactive foreground extraction using iterated graph cuts. ACM transactions on graphics (TOG) 23 (3), pp. 309–314. Cited by: §1.
  • [63] C. Sakaridis, D. Dai, and L. Van Gool (2018) Semantic foggy scene understanding with synthetic data. International Journal of Computer Vision 126 (9), pp. 973–992. Cited by: §5.2.
  • [64] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §4.1.
  • [65] R. T. Tan (2008) Visibility in bad weather from a single image. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8. Cited by: §1.
  • [66] D. Ulyanov, A. Vedaldi, and V. Lempitsky (2016) Instance normalization: the missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022. Cited by: §4.2.
  • [67] D. Ulyanov, A. Vedaldi, and V. Lempitsky (2018) Deep image prior. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 9446–9454. Cited by: §2.
  • [68] J. Wang, X. Li, and J. Yang (2018) Stacked conditional generative adversarial networks for jointly learning shadow detection and shadow removal. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1788–1797. Cited by: §A.3, §1, §4.1, §5.3.
  • [69] M. Wang and W. Deng (2018) Deep visual domain adaptation: a survey. Neurocomputing 312, pp. 135–153. Cited by: §7.
  • [70] Y. Wu, Q. He, T. Xue, R. Garg, J. Chen, A. Veeraraghavan, and J. T. Barron (2021) How to train neural networks for flare removal. Proceedings of the IEEE/CVF International Conference on Computer Vision. Cited by: §1.
  • [71] T. Xue, M. Rubinstein, C. Liu, and W. T. Freeman (2015) A computational approach for obstruction-free photography. ACM Transactions on Graphics (TOG) 34 (4), pp. 1–11. Cited by: §1.
  • [72] Q. Yang, K. Tan, and N. Ahuja (2012) Shadow removal using bilateral filtering. IEEE Transactions on Image processing 21 (10), pp. 4361–4368. Cited by: §5.3, Table 4.
  • [73] W. Yang, R. T. Tan, J. Feng, J. Liu, Z. Guo, and S. Yan (2017) Deep joint rain detection and removal from a single image. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. , pp. 1685–1694. External Links: Document Cited by: §A.3, §1, §1, §5.2.
  • [74] S. W. Zamir, A. Arora, S. Khan, M. Hayat, F. S. Khan, M. Yang, and L. Shao (2021) Multi-stage progressive image restoration. In CVPR, Cited by: §A.1, §B.2, Table 12, §1, Figure 5, §5.2, §5.2, Table 3.
  • [75] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena (2019) Self-attention generative adversarial networks. In International conference on machine learning, pp. 7354–7363. Cited by: §2.
  • [76] H. Zhang, V. Sindagi, and V. M. Patel (2019) Image de-raining using a conditional generative adversarial network. IEEE transactions on circuits and systems for video technology 30 (11), pp. 3943–3956. Cited by: §1, §1, §2.
  • [77] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz (2017) Mixup: beyond empirical risk minimization. arXiv preprint arXiv:1710.09412. Cited by: §A.3, §5.1.
  • [78] K. Zhang, D. Li, W. Luo, W. Ren, L. Ma, and H. Li (2021) Dual attention-in-attention model for joint rain streak and raindrop removal. arXiv preprint arXiv:2103.07051. Cited by: §1.
  • [79] K. Zhang, W. Luo, W. Ren, J. Wang, F. Zhao, L. Ma, and H. Li (2020) Beyond monocular deraining: stereo image deraining via semantic understanding. In European Conference on Computer Vision, pp. 71–89. Cited by: §1, §1.
  • [80] X. Zhang, R. Ng, and Q. Chen (2018) Single image reflection separation with perceptual losses. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 4786–4794. Cited by: §A.3, §1, §1, §2, §5.3.
  • [81] Y. Zhong, Y. Dai, and H. Li (2018) Stereo computation for a single mixture image. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 435–450. Cited by: §1.
  • [82] Zhou Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli (2004) Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13 (4), pp. 600–612. External Links: Document Cited by: §5.1.
  • [83] J. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pp. 2223–2232. Cited by: §A.2, §2, §4, §5.1.
  • [84] Z. Zou, S. Lei, T. Shi, Z. Shi, and J. Ye (2020) Deep adversarial decomposition: a unified framework for separating superimposed images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12806–12816. Cited by: §A.1, §1, §1, §1, §2, Table 1, §5.1.

Appendix A Implementation details

A.1 Training details, time, and model size

BIDeN. We train BIDeN using a Tesla P100-PCIE-16GB GPU. The GPU driver version is 440.64.00 and the CUDA version is 10.2. We initialize weights using Xavier initialization [22]. For Task I (Mixed image decomposition across multiple domains), BIDeN (2) to BIDeN (8) take approximately 37, 50, 61, 71, 82, 91, and 101 hours of training time, respectively. For Task II (Real-scenario deraining), training BIDeN takes approximately 96 hours.

Double-DIP. We follow the default training setting of Double-DIP [21] and use the official PyTorch implementation (link). We train on a single image for 8000 iterations on a Tesla P100-PCIE-16GB GPU; the GPU driver version is 415.27 and the CUDA version is 10.0. The runtime for a single input image is approximately 20 minutes.

DAD. We follow the default training setting (Epoch 200, batch size 2, image crop size 256) of DAD [84]. Experiments are based on the official PyTorch implementation (link). We train DAD on a Tesla P100-PCIE-16GB GPU. The GPU driver version is 440.64.00 and the CUDA version is 10.2. DAD takes 13 hours of runtime.

MPRNet. We follow the default training setting (Epoch 250, batch size 16, image crop size 256) of MPRNet [74]. For a fair comparison, we apply the same data augmentation strategy as BIDeN to MPRNet. We use the official PyTorch implementation (link) of MPRNet. We train MPRNet using 4 Tesla P100-PCIE-16GB GPUs; the GPU driver version is 415.27 and the CUDA version is 10.0. The runtime of MPRNet is 20 hours and its model size is 41.8 MB.

A.2 Architecture of BIDeN

Following the naming conventions used in CycleGAN [83] and the perceptual loss work [37], let cKsM-k denote a K×K Convolution-InstanceNorm-ReLU layer with stride M and k filters (e.g., c3s1-k is a 3×3 convolution with stride 1 and k filters). Rk denotes a residual block that contains two 3×3 convolutional layers with the same number of filters on both layers, and Rk9 denotes nine consecutive residual blocks. uk denotes a 3×3 fractional-strided Convolution-InstanceNorm-ReLU layer with k filters and stride 1/2. Ck denotes a 4×4 Convolution-InstanceNorm-LeakyReLU (slope 0.2) layer with k filters and stride 2. Both reflection padding and zero padding are employed.

Encoder. Our multi-scale encoder contains three branches, which we name E_1, E_2, and E_3. E_1 consists of c3s2-256, Rk9, c1s1-128. E_2 consists of c7s1-64, c3s2-128, c3s2-256, Rk9. E_3 contains c15s1-64, c3s2-128, c3s2-256, c3s1-256, c3s1-256, Rk9, c1s1-128. The encoder has 33.908 million parameters.

Heads. The architecture of each head is: c1s1-256, c1s1-256, u128, u64, c7s1-3. Each head has 0.575 million parameters.
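Translated literally, one head would look roughly like the PyTorch module below; the 512 input channels (the concatenated encoder features), the use of transposed convolutions for the fractional-strided u-layers, and the final Tanh are our assumptions, though the resulting parameter count matches the reported 0.575 million.

```python
import torch.nn as nn

def conv_in_relu(in_ch, out_ch, k):
    # cKs1-out_ch: K x K Convolution-InstanceNorm-ReLU, stride 1.
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, k, stride=1, padding=k // 2),
                         nn.InstanceNorm2d(out_ch), nn.ReLU(inplace=True))

def up_in_relu(in_ch, out_ch):
    # uk: 3 x 3 fractional-strided (stride 1/2) Conv-InstanceNorm-ReLU.
    return nn.Sequential(
        nn.ConvTranspose2d(in_ch, out_ch, 3, stride=2, padding=1, output_padding=1),
        nn.InstanceNorm2d(out_ch), nn.ReLU(inplace=True))

head = nn.Sequential(
    conv_in_relu(512, 256, 1),   # c1s1-256 (512 = assumed concatenated encoder channels)
    conv_in_relu(256, 256, 1),   # c1s1-256
    up_in_relu(256, 128),        # u128
    up_in_relu(128, 64),         # u64
    nn.Sequential(nn.ReflectionPad2d(3), nn.Conv2d(64, 3, 7), nn.Tanh()),  # c7s1-3
)
print(sum(p.numel() for p in head.parameters()))  # about 0.575 million
```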

Discriminator. The discriminator contains two branches, D_S (Separation) and D_P (Prediction). Most weights are shared; the shared part consists of C64, C128, C256. The last layer of D_S is C512. D_P consists of c1s1-512 (LeakyReLU with slope 0.2), global max pooling, and c1s1-N, where N is the maximum number of source components. The discriminator has approximately 3.028 million parameters in total.

A.3 Tasks

Task I: Mixed image decomposition across multiple domains. We use linear mixup [77] as the mixing mechanism; we do not introduce additional non-balanced mixing factors or non-linear mixing, as Task I is challenging enough. The mixed image is the average of the selected images, y = (1/n) Σ_{i ∈ I} x_i. The probability of each component being selected varies with the maximum number of source components N: we set it to 0.9, 0.8, 0.7, 0.6, 0.5, 0.5, and 0.5 for N = 2, 3, 4, 5, 6, 7, and 8, respectively.
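A minimal sketch of this Task I mixing procedure (balanced linear mixup over randomly selected domains, with the per-component selection probabilities listed above); image loading and augmentation are omitted:

```python
import random
import numpy as np

SELECT_PROB = {2: 0.9, 3: 0.8, 4: 0.7, 5: 0.6, 6: 0.5, 7: 0.5, 8: 0.5}

def mix_task1(domain_images, N):
    """domain_images: list of N float images (H x W x 3), one per domain."""
    p = SELECT_PROB[N]
    selected = [i for i in range(N) if random.random() < p]
    if not selected:                      # make sure at least one domain is chosen
        selected = [random.randrange(N)]
    mixed = np.mean([domain_images[i] for i in selected], axis=0)  # balanced mixup
    label = [1 if i in selected else 0 for i in range(N)]
    return mixed, selected, label
```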

Task II: Real-scenario deraining. The mixing mechanism is based on physical imaging models [73, 51, 30, 8, 57] and Koschmieder's law. The model for rain streak and snow is:

I(x) = J(x)(1 − M(x)) + M(x),

and the model for haze is:

I(x) = J(x)t(x) + A(1 − t(x)),

where x indexes pixels, I(x) is the observed intensity, J(x) is the scene radiance, A is the global atmospheric light, M is the mask of rain streak or snow, and t denotes the transmission map. During training, A is sampled within a fixed range; at test time it is fixed.
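The two degradation models above can be applied roughly as in the following sketch; the compositing form for rain/snow and the sampling range for A are assumptions on our part, and all images and masks are float arrays in [0, 1]:

```python
import numpy as np

def add_rain_or_snow(clean, mask):
    """I(x) = J(x)(1 - M(x)) + M(x): composite a rain-streak/snow mask over the image."""
    m = mask[..., None]                   # broadcast H x W mask over RGB channels
    return clean * (1.0 - m) + m

def add_haze(clean, transmission, atmosphere=None, rng=np.random):
    """Koschmieder's law: I(x) = J(x) t(x) + A (1 - t(x))."""
    A = rng.uniform(0.7, 1.0) if atmosphere is None else atmosphere  # assumed range
    t = transmission[..., None]
    return clean * t + A * (1.0 - t)
```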

To render the raindrop effect, we define a statistical model to estimate the location and motion of the raindrops, and we employ the metaball model [5] for the interaction effect between multiple raindrops.

For raindrop positions, we randomly sample it over the entire scene. The raindrop radius is also randomly sampled. A single raindrop is combined with another 1 to 3 smaller raindrops to form a realistic raindrop shape. Each composite raindrop could further merge with other raindrops on the scene. The velocity along the y-axis of the raindrop is proportional to the raindrop radius. Raindrop masks are randomly selected on the time dimension for diversity. A simple refractive model [8] is employed. We create a look-up table with three dimensions. The red and green channels together encode the texture of the raindrop, and the blue channel represents the thickness of the raindrops. Then, the texture table is masked by the alpha mask created by the meta-ball model. The masked table is dubbed . The location (x,y) of the world point that is rendered at image location (u, v) on the surface of a raindrop is modeled as:

$x = u + T_R(u, v)$, $\quad y = v + T_B(u, v)$,

where $T_R(u, v)$ and $T_B(u, v)$ denote the pixel at location $(u, v)$ in the red and blue channels of $T$, respectively.

We acquire the destination pixel coordinate for location $(u, v)$ from the above equations and generate the distorted image. We also apply random light reduction and blur to the distorted image. The reduction can be expressed as $D'(x) = r \cdot D(x)$, where $D$ is the distorted image and $r$ is the reduction rate; $r$ is sampled within a fixed range during training and fixed at test time. We use a kernel size of 3 for the Gaussian blur.
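A rough sketch of the distortion, light reduction, and blur steps using OpenCV. Treating the red and blue channels of $T$ as per-pixel displacements is our reading of the equations above, the 8-bit input assumption and all function names are ours.

```python
import cv2
import numpy as np

def distort_through_raindrops(image, T_red, T_blue, rate):
    """Warp the background through the raindrop look-up table, then darken and blur it.
    image: 8-bit background image; T_red / T_blue: red and blue channels of the masked
    table T, interpreted as pixel offsets; rate: light-reduction factor."""
    h, w = image.shape[:2]
    u, v = np.meshgrid(np.arange(w, dtype=np.float32), np.arange(h, dtype=np.float32))
    map_x = u + T_red.astype(np.float32)    # x = u + T_R(u, v)
    map_y = v + T_blue.astype(np.float32)   # y = v + T_B(u, v)
    distorted = cv2.remap(image, map_x, map_y, interpolation=cv2.INTER_LINEAR)
    distorted = distorted.astype(np.float32) * rate        # light reduction: D' = r * D
    distorted = cv2.GaussianBlur(distorted, (3, 3), 0)      # Gaussian blur with kernel size 3
    return np.clip(distorted, 0, 255).astype(image.dtype)   # assumes an 8-bit image in [0, 255]
```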

Finally, we merge the distorted image with the original image:

$I(x) = J(x)\,(1 - M(x)) + D(x)\,M(x)$,

where $x$ indexes image pixels, $I(x)$ is the observed intensity, $J(x)$ is the original image, $D(x)$ is the distorted image, and $M(x)$ is the value of the raindrop alpha mask generated by the metaball model.
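A one-line sketch of this merging step, assuming the alpha mask broadcasts over the image channels; the function name is ours.

```python
def composite_raindrops(original, distorted, alpha):
    """I(x) = J(x) * (1 - M(x)) + D(x) * M(x), with M the metaball alpha mask,
    J the original image, and D the distorted image."""
    return original * (1.0 - alpha) + distorted * alpha
```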

The probabilities of the components being selected are 1.0, 0.5, 0.5, and 0.5 for rain streak, snow, raindrop, and haze, respectively. The mixing order is rain streak, snow, raindrop, haze.

Task III: Joint shadow/reflection/watermark removal. We use the paired shadow masks, shadow images, and shadow-free images provided in ISTD [68] and SRD [59, 10]. The original SRD does not offer shadow masks, so we use the shadow masks generated by Cun et al. [10].

The algorithm for adding reflections to images [80] is expressed as

$I(x) = T(x) + V(x)\,R(x)$,

where $x$ indexes image pixels, $I(x)$ is the observed intensity, $T(x)$ is the transmission layer, $R(x)$ is the reflection layer, and $V(x)$ denotes the vignette mask. The reflection image is processed with a Gaussian smoothing kernel of random size, in the range of 3 to 17 pixels during training and fixed to 11 pixels during testing.
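A minimal sketch of this reflection synthesis, assuming float images in [0, 1], element-wise vignette attenuation, and odd kernel sizes; the function name and the clipping are ours.

```python
import random
import cv2
import numpy as np

def add_reflection(transmission, reflection, vignette, train=True):
    """Blur the reflection layer with a Gaussian kernel of random odd size in [3, 17]
    (fixed to 11 at test time), attenuate it with the vignette mask V, and add it to
    the transmission layer T: I(x) = T(x) + V(x) * R_blurred(x)."""
    ksize = random.choice(range(3, 19, 2)) if train else 11
    blurred = cv2.GaussianBlur(reflection, (ksize, ksize), 0)
    return np.clip(transmission + vignette * blurred, 0.0, 1.0)
```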

For watermarks, we follow the watermark composition model of [49]. We use RGB watermark images to add the watermark effect and require the BID method to reconstruct the watermark mask. The watermark composition model is

$I(x) = \alpha\,W(x) + (1 - \alpha)\,J(x)$,

where $x$ indexes image pixels, $I(x)$ is the observed intensity, $J(x)$ is the scene radiance, $\alpha$ is the blending (opacity) factor, and $W(x)$ is the watermark image. We sample $\alpha$ within a fixed range during training and fix it for testing.
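A sketch of the watermark compositing, assuming the standard visible-watermark blending and that $\alpha$ may be either a scalar or a per-pixel map that broadcasts over the image; the function name is ours.

```python
def add_watermark(J, W, alpha):
    """Watermark composition: I(x) = alpha * W(x) + (1 - alpha) * J(x),
    with J the scene radiance and W the RGB watermark image."""
    return alpha * W + (1.0 - alpha) * J
```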

The probabilities of every component to be selected are 0.6, 0.5, 0.5 for shadow, reflection, and watermark, respectively. The orders are shadow, reflection, and watermark.

Appendix B Additional results

b.1 Additional results of Task I

Detailed case results of BIDeN. When the maximum number of source components $N$ increases, the number of possible cases increases rapidly: there are $2^N - 1$ possible non-empty combinations of source components, that is, $2^N - 1$ cases. We present the detailed case results of BIDeN on Task I, showing the results of BIDeN (2), BIDeN (3), BIDeN (4), BIDeN (5), and BIDeN (6) in Table 6, Table 7, Table 8, Table 9, and Table 10, respectively. These results extend Table 1. Note that, due to a difference in numerical precision, the PSNR results reported here differ slightly from those reported in the main paper.
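The cases listed in Tables 6-10 can be enumerated directly; the small helper below (names are ours) reproduces the a / b / ab / ... ordering used in the tables.

```python
from itertools import combinations
from string import ascii_lowercase

def all_cases(n):
    """Enumerate the 2**n - 1 non-empty combinations of n source components,
    labeled a, b, ab, ... as in Tables 6-10."""
    labels = ascii_lowercase[:n]
    return [''.join(c) for r in range(1, n + 1) for c in combinations(labels, r)]

# Example: all_cases(3) -> ['a', 'b', 'c', 'ab', 'ac', 'bc', 'abc']  (7 = 2**3 - 1 cases)
```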

Qualitative results of BIDeN. Here we present more qualitative results of BIDeN. We show the results of BIDeN (2), BIDeN (3), BIDeN (4), BIDeN (5), and BIDeN (6) in Figure 8, Figure 9, Figure 10, Figure 11, and Figure 12, respectively. The number of selected source components and the index set are randomly chosen. The eight source components in Task I are Fruit (A), Animal (B), Flower (C), Furniture (D), Yosemite (E), Vehicle (F), Vegetable (G), and CityScape (H).

Input A B Acc
a 25.26 - 0.940
b - 25.11 0.933
ab 20.09 19.93 1.000
Table 6: Detailed case results of BIDeN (2) on Task I (Mixed image decomposition across multiple domains).
Input A B C Acc
a 24.04 - - 0.906
b - 23.48 - 0.953
c - - 22.74 0.756
ab 19.07 18.78 - 0.833
ac 18.96 - 18.27 0.573
bc - 18.31 18.25 0.763
Avg 19.01 18.54 18.26 0.723
abc 16.66 15.89 16.27 0.983
Table 7: Detailed case results of BIDeN (3) on Task I (Mixed image decomposition across multiple domains).
Input A B C D Acc
a 23.07 - - - 0.896
b - 22.61 - - 0.886
c - - 21.93 - 0.770
d - - - 21.88 0.933
ab 18.22 18.06 - - 0.710
ac 18.20 - 17.41 - 0.560
ad 18.85 - - 18.51 0.783
bc - 16.98 17.56 - 0.710
bd - 18.06 - 17.79 0.856
cd - - 18.44 18.77 0.656
Avg 18.42 17.7 17.80 18.35 0.712
abc 16.12 15.46 15.70 - 0.660
abd 16.47 15.95 - 16.40 0.676
acd 16.36 - 16.11 16.78 0.396
bcd - 15.63 16.52 16.52 0.563
Avg 16.32 15.68 16.11 16.57 0.574
abcd 15.11 14.37 14.98 15.43 0.943
Table 8: Detailed case results of BIDeN (4) on Task I (Mixed image decomposition across multiple domains).
Input A B C D E Acc
a 22.75 - - - - 0.840
b - 22.02 - - - 0.853
c - - 22.30 - - 0.553
d - - - 22.52 - 0.930
e - - - - 20.77 0.923
ab 17.68 17.30 - - - 0.676
ac 17.60 - 17.19 - - 0.373
ad 18.34 - - 18.26 - 0.720
ae 18.94 - - - 18.36 0.700
bc - 16.98 17.56 - - 0.633
bd - 17.21 - 17.60 - 0.716
be - 17.37 - - 17.54 0.710
cd - - 18.42 18.72 - 0.443
ce - - 18.79 - 18.34 0.556
de - - - 18.18 17.94 0.796
Avg 18.14 17.21 17.99 18.19 18.04 0.632
abc 15.72 14.92 15.62 - - 0.570
abd 16.11 15.29 - 16.15 - 0.573
abe 16.38 15.12 - - 16.32 0.526
acd 15.91 - 15.94 16.53 - 0.320
ace 16.27 - 16.07 - 16.80 0.266
ade 16.72 - - 16.25 16.62 0.486
bcd - 15.04 16.46 16.32 - 0.576
bce - 14.94 16.62 - 16.38 0.553
bde - 15.24 - 16.07 16.15 0.736
cde - - 17.08 16.48 16.52 0.366
Avg 16.18 15.09 16.29 16.30 16.46 0.497
abcd 14.82 13.91 14.94 15.11 - 0.633
abce 14.97 13.80 15.06 - 15.50 0.480
abde 15.20 14.14 - 14.99 15.32 0.586
acde 15.03 - 15.23 15.11 15.62 0.180
bcde - 13.92 15.74 15.06 15.34 0.570
Avg 14.99 13.94 15.24 15.06 15.44 0.489
abcde 14.23 13.28 14.52 14.20 14.74 0.860
Table 9: Detailed case results of BIDeN (5) on Task I (Mixed image decomposition across multiple domains).
Input A B C D E F Acc
a 22.82 - - - - - 0.850
b - 21.80 - - - - 0.826
c - - 22.61 - - - 0.823
d - - - 22.50 - - 0.890
e - - - - 21.09 - 0.910
f - - - - - 21.97 0.893
ab 17.31 16.79 - - - - 0.646
ac 17.10 - 16.84 - - - 0.526
ad 18.14 - - 17.66 - - 0.613
ae 18.59 - - - 18.24 - 0.676
af 18.24 - - - - 17.76 0.583
bc - 16.60 17.46 - - - 0.613
bd - 16.95 - 17.05 - - 0.593
be - 16.81 - - 17.30 - 0.660
bf - 16.90 - - - 17.04 0.510
cd - - 18.23 18.11 - - 0.480
ce - - 18.53 - 18.16 - 0.670
cf - - 18.13 - - 17.48 0.583
de - - - 17.43 17.86 - 0.653
df - - - 16.18 - 16.01 0.686
ef - - - - 17.42 16.74 0.673
Avg 17.87 16.81 17.83 17.28 17.79 17.00 0.654
abc 15.29 14.56 15.34 - - - 0.606
abd 15.83 14.95 - 15.51 - - 0.390
abe 16.10 14.59 - - 16.04 - 0.513
abf 15.89 14.78 - - - 15.61 0.333
acd 15.49 - 15.56 15.98 - - 0.313
ace 15.88 - 15.67 - 16.55 - 0.350
acf 15.64 - 15.46 - - 15.61 0.303
ade 16.46 - - 15.59 16.45 - 0.330
adf 16.15 - - 14.60 - 14.73 0.450
aef 16.44 - - - 15.99 15.18 0.373
bcd - 14.68 16.24 15.71 - - 0.370
bce - 14.43 16.43 - 16.08 - 0.500
bcf - 14.69 16.04 - - 15.42 0.346
bde - 14.90 - 15.45 15.89 - 0.516
bdf - 14.87 - 14.60 - 14.52 0.456
bef - 14.81 - - 15.60 15.04 0.400
cde - - 16.82 15.84 16.30 - 0.253
cdf - - 16.48 14.86 - 14.59 0.383
cef - - 16.64 - 15.91 15.15 0.443
def - - - 14.70 15.80 14.26 0.506
Avg 15.92 14.72 16.06 15.28 16.06 15.01 0.439
abcd 14.45 13.69 14.61 14.62 - - 0.420
abce 14.63 13.51 14.76 - 15.20 - 0.540
abcf 14.57 13.65 14.54 - - 14.34 0.313
abde 15.02 13.79 - 14.36 15.05 - 0.323
abdf 14.90 13.80 - 13.77 - 13.81 0.373
abef 15.13 13.70 - - 14.77 14.17 0.276
acde 14.71 - 14.88 14.60 15.34 - 0.126
acdf 14.61 - 14.67 13.97 - 13.75 0.220
acef 14.80 - 14.81 - 14.99 14.14 0.193
adef 15.13 - - 14.81 15.00 13.50 0.293
bcde - 13.64 15.55 14.56 14.98 - 0.296
bcdf - 13.76 15.26 13.96 - 13.66 0.336
bcef - 13.63 15.35 - 14.84 14.10 0.356
bdef - 13.86 - 13.78 14.74 13.49 0.543
cdef - - 15.62 13.99 14.99 13.44 0.206
Avg 14.79 13.70 15.00 14.24 14.99 13.84 0.321
abcde 13.97 13.12 14.24 13.73 14.41 - 0.350
abcdf 13.93 13.22 14.05 13.38 - 13.14 0.543
abcef 14.04 13.09 14.17 - 14.23 13.41 0.346
abdef 14.30 13.25 - 13.24 14.20 12.99 0.516
acdef 14.03 - 14.28 13.40 14.37 12.94 0.140
bcdef - 13.18 14.76 13.40 14.19 12.92 0.356
Avg 14.04 13.17 14.30 13.43 14.28 13.08 0.375
abcdef 13.54 12.86 13.82 12.95 13.80 12.57 0.846
Table 10: Detailed case results of BIDeN (6) on Task I (Mixed image decomposition across multiple domains).
Figure 8: Qualitative results of BIDeN (2). Fruit (A), Animal (B). Row 1-2: a. Row 3-4: b. Row 5: ab.
Figure 9: Qualitative results of BIDeN (3). Fruit (A), Animal (B), Flower (C). Row 1-2: ab. Row 3-4: bc. Row 5: abc.
Figure 10: Qualitative results of BIDeN (4). Fruit (A), Animal (B), Flower (C), Furniture (D). Row 1-2: d. Row 3-4: cd. Row 5: abd.
Figure 11: Qualitative results of BIDeN (5). Fruit (A), Animal (B), Flower (C), Furniture (D), Yosemite (E). Row 1-2: ce. Row 3-4: de. Row 5: bcde.
Figure 12: Qualitative results of BIDeN (6). Fruit (A), Animal (B), Flower (C), Furniture (D), Yosemite (E), Vehicle (F). Row 1-2: cf. Row 3-4: def. Row 5: abcdef.

b.2 Additional results of Task II

For Task II (Real-scenario deraining), more qualitative results are provided. Visual examples of CityScapes/masks/transmission maps generated by BIDeN are shown in Figure 13. We present more qualitative comparisons between BIDeN and MPRNet [74] in Figure 14 and Figure 15. Each comparison presents the results of 6 cases of the same scene.

For the default training setting on Task II, the probabilities of the components being selected are 1.0, 0.5, 0.5, and 0.5 for rain streak, snow, raindrop, and haze, respectively. In addition, we train both BIDeN and MPRNet again with the probability of the rain streak component being selected set to 0.8. The quantitative results of BIDeN and the comparison with MPRNet are provided in Table 11 and Table 12. Compared to BIDeN trained under the default training setting of Task II, BIDeN performs even better when the probability of the rain streak component being selected is 0.8.

b.3 Additional results of Task III

We provide additional results of BIDeN for all cases on Task III (Joint shadow/reflection/watermark removal). Results of Version one (V1) and Version two (V2) are shown in Figure 16 and Figure 17.

Figure 13: CityScape, masks (Rain Streak, Snow, Raindrop), and transmission map (Haze) generated by BIDeN for case (1), case (2), case (5), and case (6). Case (1): rain streak, case (2): rain streak + snow, case (5): rain streak + moderate haze + raindrop, case (6): rain streak + snow + moderate haze + raindrop.
Figure 14: Additional results of Task II (Real-scenario deraining). Columns: Input, MPRNet, BIDeN, GT. Rows 1-6 present 6 cases of the same scene: (1) rain streak, (2) rain streak + snow, (3) rain streak + light haze, (4) rain streak + heavy haze, (5) rain streak + moderate haze + raindrop, (6) rain streak + snow + moderate haze + raindrop. BIDeN removes all rain components effectively, while MPRNet leaves some components incompletely removed.
Figure 15: Additional results of Task II (Real-scenario deraining). Columns: Input, MPRNet, BIDeN, GT. Rows 1-6 present 6 cases of the same scene: (1) rain streak, (2) rain streak + snow, (3) rain streak + light haze, (4) rain streak + heavy haze, (5) rain streak + moderate haze + raindrop, (6) rain streak + snow + moderate haze + raindrop. BIDeN removes all rain components effectively, while MPRNet leaves some components incompletely removed.
Method CityScape Rain Streak Snow Haze Raindrop Acc
PSNR SSIM PSNR SSIM PSNR SSIM PSNR SSIM PSNR SSIM
BIDeN (1) 33.30 0.930 31.55 0.917 - - - - - - 1.0
BIDeN (2) 29.55 0.896 28.80 0.836 25.74 0.689 - - - - 0.992
BIDeN (3) 29.38 0.919 30.98 0.907 - - 31.11 0.956 - - 0.994
BIDeN (4) 27.56 0.899 30.39 0.895 - - 31.83 0.944 - - 0.994
BIDeN (5) 27.89 0.900 30.17 0.891 - - 30.73 0.945 22.32 0.904 0.996
BIDeN (6) 27.05 0.869 28.05 0.815 24.81 0.653 30.02 0.940 21.55 0.888 0.990
Table 11: Results of BIDeN on Task II (Real-scenario deraining). The probability of the rain streak component being selected is 0.8. We report PSNR and SSIM for CityScape images, masks, and transmission maps. Results are given for the 6 test cases presented in Figure 1: (1) rain streak, (2) rain streak + snow, (3) rain streak + light haze, (4) rain streak + heavy haze, (5) rain streak + moderate haze + raindrop, (6) rain streak + snow + moderate haze + raindrop. Note that only haze is divided into light/moderate/heavy intensities; the training and test sets of Rain Streak, Snow, and Raindrop already contain different intensities.
Case Input MPRNet BIDeN
PSNR SSIM PSNR SSIM PSNR SSIM
 (1) 25.69 0.786 33.03 0.941 30.89 0.932
 (2) 18.64 0.564 30.44 0.902 29.34 0.899
 (3) 17.45 0.712 23.95 0.897 28.62 0.919
 (4) 11.12 0.571 17.32 0.810 26.77 0.898
 (5) 14.05 0.616 20.75 0.839 27.11 0.898
 (6) 12.38 0.461 19.74 0.798 26.44 0.870
Table 12: Comparison on Task II (Real-scenario deraining) between BIDeN and MPRNet [74]. The probability of the rain streak component being selected is 0.8. MPRNet shows superior results for case (1) and case (2), whereas BIDeN is better in the other four cases. For details of the 6 test cases, please refer to Table 11 and Figure 1.
Figure 16: All case results of Task III (Joint shadow/reflection/watermark removal), Version one (V1). ISTD images, shadow masks, and watermark masks generated by BIDeN for all cases. The order of the cases is identical to Table 4. The generated ISTD images suffer from a color shift, but all shadows/reflections/watermarks are effectively removed in every case.
Figure 17: All case results of Task III (Joint shadow/reflection/watermark removal), Version two (V2). ISTD images, shadow masks, and watermark masks generated by BIDeN for all cases. The order of the cases is identical to Table 4. All shadows/reflections/watermarks are effectively removed in every case.