Code for Blind Image Decomposition (BID) and Blind Image Decomposition network (BIDeN).
We present and study a novel task named Blind Image Decomposition (BID), which requires separating a superimposed image into constituent underlying images in a blind setting, that is, both the source components involved in mixing and the mixing mechanism are unknown. For example, rain may consist of multiple components, such as rain streaks, raindrops, snow, and haze. Rainy images can be treated as an arbitrary combination of these components, involving some or all of them. How to decompose superimposed images, like rainy images, into distinct source components is a crucial step towards real-world vision systems. To facilitate research on this new task, we construct three benchmark datasets, including mixed image decomposition across multiple domains, real-scenario deraining, and joint shadow/reflection/watermark removal. Moreover, we propose a simple yet general Blind Image Decomposition Network (BIDeN) to serve as a strong baseline for future work. Experimental results demonstrate the tenability of our benchmarks and the effectiveness of BIDeN. Code and project page are available.
Various computer vision and computer graphics tasks [21, 84, 38, 42, 25, 36, 27, 1, 13] can be viewed as image decomposition, which aims to separate a superimposed image into distinct components with only a single observation. For example, foreground/background segmentation [14, 62, 2, 48] aims at decomposing a holistic image into foreground objects and background stuff. Image dehazing [30, 4, 45] can be treated as decomposing a hazy image into a haze-free image and a haze map (medium transmission map, atmosphere light). Shadow removal [42, 68, 11, 17, 16] decomposes a shadow image into a shadow-free image and a shadow mask. Other tasks, such as transparency separation [80, 15, 44, 52], watermark removal [49, 6], image deraining [79, 58, 73, 46, 76, 47, 74], texture separation, underwater image restoration [20, 28], image desnowing [51, 61], stereo mixture image decomposition, 3D intrinsic image decomposition, fence removal [12, 71, 50], and flare removal [70, 3], also fall under image decomposition.
Vanilla image decomposition tasks come with a fixed and known number of source components, and the number is most often set to two [21, 84, 38, 42, 25, 76, 41, 27, 4, 45, 11, 17, 16, 80]. Such a setting does capture some basic real-world cases. However, real-world scenarios are more complex. Consider autonomous driving on rainy days, where the visual perception quality is degraded by different forms of precipitation and their co-occurring components, as shown in Figure 1. Some natural questions emerge: Can a vision system assume precipitation to be of a specific form? Should a vision system assume raindrops always exist or not? Shall a vision system assume haze or snow comes along with rain or not? These questions are particularly important given their relevance to real-world applications, and the answer to all of them is, trivially, no. A comprehensive vision system should be able to handle all possible circumstances [54, 65]. Yet, the previous setting in deraining [79, 58, 73, 46, 47, 78, 60] leaves a gap between research settings and complex real-world scenarios.
This paper aims at addressing the aforementioned gap, as an important step toward real-world vision systems. We propose a task that: (1) does not fix the number of source components, (2) considers the presence of every source component and varying intensities of source components, and (3) amalgamates every source component into potential combinations. To disambiguate from previous tasks, we refer to our proposed task as Blind Image Decomposition (BID). This name is inspired by the Blind Source Separation (BSS) task in the field of signal processing.
The task format is straightforward. We no longer set the number of source components to a fixed value. Instead, we set a maximum number of potential source components, where each component can arbitrarily be part of the mixing. Let 'A', 'B', 'C', 'D', 'E' denote five source components and 'a', 'b', 'c', 'd', 'e' denote images from the corresponding source components. The mixed image can be 'a', 'd', 'ab', 'bc', 'abd', 'ade', 'acde', 'abcde', and so on, with 31 possible combinations in total. Given any of the 31 possible combinations as input, a BID method is required to predict and reconstruct the individual source components involved in the mixing.
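To make the counting concrete, a few lines of plain Python enumerate the non-empty subsets of five components (component names here are purely illustrative):

```python
from itertools import combinations

components = ["a", "b", "c", "d", "e"]
subsets = ["".join(c)
           for r in range(1, len(components) + 1)
           for c in combinations(components, r)]
print(len(subsets))   # 31, i.e. 2**5 - 1 non-empty subsets
print(subsets[:8])    # ['a', 'b', 'c', 'd', 'e', 'ab', 'ac', 'ad']
```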
As different components can be arbitrarily involved in the mixing, no existing datasets support such a training protocol. Thus, we construct three new benchmark datasets supporting the applications of BID: (I) mixed image decomposition across multiple domains, (II) real-scenario deraining, and (III) joint shadow/reflection/watermark removal.
To perform multiple BID tasks, we design a simple yet general, strong, and flexible model, dubbed BIDeN (Blind Image Decomposition Network). BIDeN is a generic model that supports multiple BID tasks with distinct objectives. BIDeN is based on the framework of GANs (Generative Adversarial Networks), and we explore several critical design choices of BIDeN to present a strong model. Although designed for the more challenging BID setting, BIDeN still outperforms the current state-of-the-art image decomposition model and shows competitive results compared to models designed for specific tasks. Together, our proposed BID task, our proposed method BIDeN, our constructed benchmark datasets, and our comprehensive results and ablations form a solid foundation for the future study of this challenging topic.
Image Decomposition. This is a general task covering numerous computer vision and computer graphics tasks. Double-DIP couples multiple DIPs to decompose images into their basic components in an unsupervised manner. Deep Generative Priors employs a likelihood-based generative model as a prior to perform the image decomposition task. Deep Adversarial Decomposition (DAD) proposed a unified framework for image decomposition by employing three discriminators. A crossroad L1 loss is introduced to support pixel-wise supervision when domain information is unknown.
Blind Source Separation (BSS), also known as the "cocktail party problem", is an important research topic in the field of signal processing. Blind refers to a constraint where the sources and the mixing mechanism are both unknown. The task requires separating a mixture signal into the constituent underlying source signals with limited prior information. A representative algorithm is Independent Component Analysis (ICA) [34, 43, 56]. The BID task shares some common properties with BSS, where the blind settings are similar but not identical. BID is constrained by unknown source components involved in mixing and an unknown mixing mechanism. The number of source components involved in mixing is also unknown. Unlike BSS, we do not set the domain information of the source components to be unknown. The goal of BID is to advance real-world vision systems, and the setting of including known domain information can be better applied to computer vision tasks. For instance, the goal of the shadow removal task is to separate a shadow image into a shadow-free image and a shadow mask, where the domain information is clear.
Generative Adversarial Networks. GANs include two key components, a generator and a discriminator, where the generator tries to generate realistic samples while the discriminator tries to distinguish real samples from generated samples. The adversarial training mechanism helps the outputs of the generator match the distribution of real data. GANs are especially successful in image generation tasks [75, 39] and image-to-image translation tasks [35, 29, 83]. GANs are also a common tool for image decomposition tasks, where they have been successfully employed in image deraining [76, 58], transparency separation [52, 80], and image dehazing.
Given a set of $N$ ($N \geq 2$) source components, i.e., image domains, denoted by $\{X_1, X_2, \dots, X_N\}$, each source component $X_i$ contains images $x_i$, $x_i \in X_i$. $n$ ($1 \leq n \leq N$) source components are randomly selected from $\{X_1, X_2, \dots, X_N\}$. Let $I$ indicate the index set for the selected source components, where $I \subseteq \{1, 2, \dots, N\}$ and $|I| = n$. Hence, the selected source components are denoted by $\{X_i\}_{i \in I}$, each contributing an image $x_i \in X_i$. With a predetermined mixing function $f$, the mixed image is given by $\hat{x} = f(\{x_i\}_{i \in I})$. The mixed image can be identical to a single image when $n = 1$. The BID task requires the BID method to find a function $g$ to separate $\hat{x}$, as $g(\hat{x}) = \{\hat{x}_i\}_{i \in I}$, where each reconstructed image $\hat{x}_i$ is close to its corresponding image $x_i$. That is, given a mixed image $\hat{x}$ as input, the task requires the BID method to: (1) predict the source components involved in the mixing, and (2) reconstruct/generate images preserving the fidelity of the corresponding images involved in the mixing.
For the example shown in Figure 2, where $N = 4$, the source components $X_1$, $X_2$, $X_3$, $X_4$ are four distinct image domains. Let $I = \{1, 3\}$ and $\hat{x} = f(x_1, x_3)$. Given $\hat{x}$ as the input, knowing there are four different source components but without any other information, the task requires the method to find a function $g$, so that $g(\hat{x}) = (\hat{x}_1, \hat{x}_3)$, where $\hat{x}_1 \approx x_1$ and $\hat{x}_3 \approx x_3$. Also, the method should correctly predict the source components involved in the mixing, that is, give a prediction $[1, 0, 1, 0]$, which is equivalent to predicting $I = \{1, 3\}$.
The BID task is challenging for the following reasons: (1) When $N$ increases, the number of possible index sets $I$ increases rapidly. The BID setting forces the method to deal with $2^N - 1$ possible combinations. For instance, when $N$ increases to 8, there are 255 variants of $I$. (2) The task requires the method to predict the source components involved in the mixing. Source components are difficult to predict when $N$ is large and $I$ varies a lot. (3) The mixing mechanism, i.e., the mixing function $f$, is unknown to the method. The mixing function varies with different source components and can be non-linear/complex in specific circumstances, such as rendering raindrop images or adding shadows or reflections to images. (4) As $n$ increases, each source component contributes a decreasing amount of information to $\hat{x}$, making the task highly ill-posed.
To perform diverse BID tasks, a generic framework is required. Inspired by the success of image-to-image translation models [35, 29, 83], we design our Blind Image Decomposition Network (BIDeN) as follows. Figure 2 presents the overall architecture of BIDeN.
The generator consists of two parts: a multi-scale encoder $E$ and multiple heads $\{H_i\}_{i=1}^{N}$. We design a multi-scale encoder containing three branches to capture multiple scales of features. Such a design is beneficial to the source components requiring different scales of features for reconstruction. We concatenate the different scales of features and send them to multiple heads, where the number of heads is identical to the maximum number of source components $N$. Each head is specific to reconstructing a particular kind of source component. The discriminator consists of two branches, and most weights are shared. The reconstructed images $\hat{x}_i$ and their corresponding real images $x_i$ are sent to the discriminator branch $D_S$ (Separation) individually. The function of $D_S$ is similar to a typical discriminator. The discriminator branch $D_P$ (Prediction) predicts the source components involved in the mixed image $\hat{x}$. A successful prediction unveils the correct index set of the selected source components $I$.
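To make the data flow concrete, here is a minimal PyTorch-style sketch of the described structure. The layer shapes, kernel sizes, and module names are our illustrative assumptions, not the released BIDeN code:

```python
import torch
import torch.nn as nn

class BIDeNSketch(nn.Module):
    """Illustrative skeleton only: a shared multi-scale encoder feeding one
    reconstruction head per potential source component."""
    def __init__(self, num_components: int, feat: int = 64):
        super().__init__()
        # Three branches with different kernel sizes stand in for the multi-scale encoder.
        self.branches = nn.ModuleList([
            nn.Conv2d(3, feat, k, stride=2, padding=k // 2) for k in (3, 7, 15)
        ])
        # One head per potential source component; a real implementation
        # would upsample back to the input resolution.
        self.heads = nn.ModuleList([
            nn.Sequential(nn.Conv2d(3 * feat, feat, 1), nn.ReLU(),
                          nn.Conv2d(feat, 3, 1))
            for _ in range(num_components)
        ])

    def forward(self, mixed: torch.Tensor):
        feats = torch.cat([b(mixed) for b in self.branches], dim=1)  # concatenate scales
        return [head(feats) for head in self.heads]  # one output per head
```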
Taking the example of Figure 2, we name the four heads $H_1$, $H_2$, $H_3$, and $H_4$. For an input $\hat{x} = f(x_1, x_3)$, $H_1$ and $H_3$ aim to reconstruct $x_1$ and $x_3$, so that $H_1(E(\hat{x})) \approx x_1$ and $H_3(E(\hat{x})) \approx x_3$, while $H_2$ and $H_4$ are free to output anything or are simply turned off. We employ an adversarial loss $\mathcal{L}_{\text{GAN}}$, a perceptual loss $\mathcal{L}_{\text{per}}$, an L1/L2 loss $\mathcal{L}_{\text{L1/L2}}$, and a binary cross-entropy loss $\mathcal{L}_{\text{BCE}}$. The details of the objective are expressed below.
We employ the adversarial loss to encourage the generator to output separated and realistic images, regardless of the source components. For the function $g$, the GAN loss is expressed as:

$$\mathcal{L}_{\text{GAN}}(G, D_S) = \mathbb{E}_{x}\big[\log D_S(x)\big] + \mathbb{E}_{\hat{x}}\big[\log\big(1 - D_S(G(\hat{x}))\big)\big] \tag{1}$$
where $G$ behaves as $g$. It tries to separate the input mixed image $\hat{x}$ and reconstruct separated outputs $\{\hat{x}_i\}_{i \in I}$, while $D_S$ attempts to distinguish between $\hat{x}_i$ and real samples $x_i$. Note that $x$ inside Equations 1-5 denotes the real samples $\{x_i\}_{i \in I}$. We employ the LSGAN loss and the Markovian discriminator.
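A minimal sketch of the LSGAN objective in PyTorch, assuming patch-level discriminator outputs (this is illustrative, not the released code):

```python
import torch
import torch.nn.functional as F

def lsgan_d_loss(d_real: torch.Tensor, d_fake: torch.Tensor) -> torch.Tensor:
    # LSGAN discriminator objective: push real patches toward 1, fake patches toward 0.
    return (F.mse_loss(d_real, torch.ones_like(d_real)) +
            F.mse_loss(d_fake, torch.zeros_like(d_fake)))

def lsgan_g_loss(d_fake: torch.Tensor) -> torch.Tensor:
    # Generator objective: make the discriminator score its outputs as real.
    return F.mse_loss(d_fake, torch.ones_like(d_fake))
```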
The reconstructed images $\hat{x}_i$ should be separated, as well as close to the corresponding $x_i$ in a distance sense. Hence, we employ a perceptual loss (VGG loss) and an L1/L2 loss. They are formalized as:

$$\mathcal{L}_{\text{per}}(G) = \mathbb{E}\Big[\sum_{l} w_l \big\| \phi_l(x) - \phi_l(G(\hat{x})) \big\|_1\Big] \tag{2}$$

$$\mathcal{L}_{\text{L1/L2}}(G) = \mathbb{E}\big[\| x - G(\hat{x}) \|_{1\,\text{or}\,2}\big] \tag{3}$$
where $\phi$ is a trained VGG19 network, $l$ denotes a specific layer, and $w_l$ denotes the weight for the $l$-th layer. The choice of layers and weights is identical to pix2pixHD. We use the L2 loss for masks and the L1 loss for other source components. For simplification, we denote the L1/L2 loss as $\mathcal{L}_{\text{L1/L2}}$.
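A hedged sketch of the perceptual term; the VGG19 layer indices and weights below are assumptions standing in for the pix2pixHD configuration mentioned above:

```python
import torch
import torch.nn as nn
from torchvision.models import vgg19

class PerceptualLoss(nn.Module):
    """Weighted L1 between VGG19 features; layer ids/weights are assumptions."""
    def __init__(self, layer_ids=(3, 8, 17, 26), weights=(0.125, 0.25, 0.5, 1.0)):
        super().__init__()
        self.vgg = vgg19(weights="IMAGENET1K_V1").features.eval()
        for p in self.vgg.parameters():
            p.requires_grad = False  # frozen feature extractor
        self.weights = dict(zip(layer_ids, weights))

    def forward(self, fake: torch.Tensor, real: torch.Tensor) -> torch.Tensor:
        loss, x, y = 0.0, fake, real
        for i, layer in enumerate(self.vgg):
            x, y = layer(x), layer(y)
            if i in self.weights:  # accumulate weighted feature distances
                loss = loss + self.weights[i] * nn.functional.l1_loss(x, y)
        return loss
```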
For the source prediction task, we find that the discriminator performs better than the generator. The goal of the discriminator is to classify between reconstructed samples and real samples, so it naturally learns an embedding. Such an embedding is beneficial even when the input is a mixed image, making the discriminator capable of performing an additional source prediction task. Thus, we design a source prediction branch $D_P$. The binary cross-entropy loss is employed for the source prediction task:

$$\mathcal{L}_{\text{BCE}}(D_P) = \mathbb{E}\Big[-\sum_{i=1}^{N} \big( z_i \log D_P(\hat{x})_i + (1 - z_i) \log(1 - D_P(\hat{x})_i) \big)\Big] \tag{4}$$
where $N$ denotes the maximum number of source components, and $z \in \{0, 1\}^N$ denotes the binary label of the source components involved in the mixing of the input image $\hat{x}$. $D_P$ is the source prediction branch of the discriminator.
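The prediction branch thus reduces to multi-label binary classification; a minimal sketch (names hypothetical):

```python
import torch
import torch.nn.functional as F

def source_prediction_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Multi-label BCE for the prediction branch D_P.
    logits: (B, N) raw scores; labels: (B, N) binary vectors, e.g. [1, 0, 1, 0]."""
    return F.binary_cross_entropy_with_logits(logits, labels.float())
```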
Our final objective function is:

$$\mathcal{L} = \lambda_{\text{GAN}}\,\mathcal{L}_{\text{GAN}} + \lambda_{\text{per}}\,\mathcal{L}_{\text{per}} + \lambda_{\text{L1/L2}}\,\mathcal{L}_{\text{L1/L2}} + \lambda_{\text{BCE}}\,\mathcal{L}_{\text{BCE}} \tag{5}$$
We set $\lambda_{\text{GAN}}$, $\lambda_{\text{per}}$, $\lambda_{\text{L1/L2}}$, $\lambda_{\text{BCE}}$ to 1, 10, 30, 1, respectively. This is a generic setting applied to all tasks.
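As a sketch, the full generator objective with the stated weights, assuming they map to the losses in the order listed above:

```python
# Hypothetical mapping of the stated weights (1, 10, 30, 1) to the four losses,
# assuming they follow the order listed in the text.
lambda_gan, lambda_per, lambda_rec, lambda_bce = 1.0, 10.0, 30.0, 1.0

def total_generator_loss(l_gan, l_per, l_rec, l_bce):
    return (lambda_gan * l_gan + lambda_per * l_per +
            lambda_rec * l_rec + lambda_bce * l_bce)
```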
Throughout all experiments, we use the Adam optimizer with $\beta_1 = 0.5$ and $\beta_2 = 0.999$ for both the generator and the discriminator. BIDeN is trained for 200 epochs with a learning rate of 0.0003. The learning rate starts to decay linearly after half of the total epochs. We use a batch size of 1 and instance normalization. All training images are loaded at a fixed resolution and then randomly cropped into patches; horizontal flips are randomly applied. At test time, we load test images at a fixed resolution. More details on the training settings, the architecture, the number of parameters, and the training speed are provided in the Appendices. The training details for baselines are also provided in the Appendices.
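A sketch of the optimizer and linear-decay schedule described above, with hypothetical stand-in modules for the generator and discriminator:

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the BIDeN generator and discriminator.
generator, discriminator = nn.Conv2d(3, 3, 3), nn.Conv2d(3, 1, 3)

opt_g = torch.optim.Adam(generator.parameters(), lr=3e-4, betas=(0.5, 0.999))
opt_d = torch.optim.Adam(discriminator.parameters(), lr=3e-4, betas=(0.5, 0.999))

total_epochs = 200
half = total_epochs // 2

def lr_lambda(epoch: int) -> float:
    # Constant LR for the first half of training, then linear decay to zero.
    return 1.0 - max(0, epoch - half) / float(total_epochs - half)

sched_g = torch.optim.lr_scheduler.LambdaLR(opt_g, lr_lambda)
sched_d = torch.optim.lr_scheduler.LambdaLR(opt_d, lr_lambda)
```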
Table 1: Results for Task I. The Fruit (A) and Animal (B) columns report PSNR / SSIM / FID; higher PSNR/SSIM and lower FID are better.

| Method | Fruit (A) PSNR / SSIM / FID | Animal (B) PSNR / SSIM / FID | Acc (AB) | Acc (All) | Model size |
|---|---|---|---|---|---|
| DAD | 17.59 / 0.72 / 137.66 | 17.52 / 0.62 / 126.32 | 0.996 | - | 669.0 MB |
| BIDeN (2) | 20.07 / 0.79 / 62.99 | 19.89 / 0.69 / 69.35 | 1.0 | 0.957 | 144.9 MB |
| BIDeN (3) | 19.04 / 0.75 / 74.68 | 18.75 / 0.61 / 88.23 | 0.836 | 0.807 | 147.1 MB |
| BIDeN (4) | 18.19 / 0.73 / 79.03 | 18.03 / 0.58 / 97.16 | 0.716 | 0.733 | 149.3 MB |
| BIDeN (5) | 17.66 / 0.71 / 81.17 | 17.27 / 0.54 / 114.40 | 0.676 | 0.603 | 151.5 MB |
| BIDeN (6) | 17.28 / 0.69 / 85.64 | 16.57 / 0.51 / 118.00 | 0.646 | 0.483 | 153.7 MB |
| BIDeN (7) | 16.70 / 0.68 / 97.26 | 16.54 / 0.49 / 126.66 | 0.413 | 0.310 | 155.9 MB |
| BIDeN (8) | 16.49 / 0.67 / 105.61 | 15.79 / 0.45 / 191.29 | 0.383 | 0.278 | 158.1 MB |
We construct three benchmark datasets from different views to support practical usages of BID. To explore the generality of BIDeN and the tenability of the constructed datasets, we test BIDeN on three new challenging datasets. BIDeN is trained under the BID setting, which is more difficult than the conventional image decomposition setting. During training, mixed images are randomly synthesized. At test time, the input mixed images are fixed. As BID is a novel task, no existing baselines are available for comparison. For different tasks, we choose different evaluation strategies and baselines.
Throughout all tasks, BIDeN is trained under the BID setting, that is, BIDeN faces more challenging requirements than the other baselines. Also, BIDeN is a generic model designed to perform all kinds of BID tasks. These two constraints limit the performance of BIDeN. We compare BIDeN to baselines designed for specific tasks, where BIDeN is still able to show very competitive results on all tasks. All qualitative results are randomly picked. More task settings and results, including the detailed case results of BIDeN, are provided in the Appendices.
Dataset. This dataset contains eight different image domains, i.e., source components. Each domain has approximately 2500 to 3000 images in the training set, and the test set contains 300 images for each domain. Image domains are designed to be big and inclusive, covering broad categories such as animal, fruit, and vehicle, instead of narrow domains such as horse, cat, and car. The eight domains are Fruit (2653), Animal (2653), Flower (2950), Furniture (2582), Yosemite (2855), Vehicle (2670), Vegetable (2595), and CityScape (2975). CityScape and Flower are selected from the CityScape dataset and the VGG flower dataset. The remaining six image domains are mainly gathered from Flickr using the corresponding keyword, except for Yosemite, which also incorporates the Summer2Winter dataset from CycleGAN. The order of the eight domains is randomly shuffled. The mixing mechanism for this dataset is linear mixup.
Experiments and results. We compare BIDeN to Double-DIP and DAD. DAD is trained on the first two domains (Fruit, Animal) with mixed input only. For BIDeN, we train it 7 times under the BID setting, varying $N$ from 2 domains to 8 domains. At test time, we evaluate the separation results on the Fruit + Animal mixture using PSNR, SSIM, and FID. For BIDeN, the accuracy of the source prediction is reported for all tasks. We report case results for all tasks and the overall averaged result for Task I.
Table 1 presents the results for Task I. In terms of PSNR/SSIM, BIDeN outperforms Double-DIP by a large margin and outperforms DAD when $N$ is less than 5. For FID, BIDeN shows better results than DAD even at $N = 6$, suggesting the superiority of BIDeN. Figure 3 presents the qualitative results of Task I.
Dataset. We construct the real-scenario deraining dataset based on the CityScape dataset. We use the test set from the original CityScape dataset as our training set (2975), and the validation set from the original CityScape dataset as our test set (500). The test set for all source components contains a fixed number of 500 images. We use three different kinds of masks: rain streak (1620), raindrop (3500), and snow (3500). These masks cover different intensities. For haze, we use the corresponding transmission maps (2975 × 3) with three different intensities acquired from Foggy CityScape. The masks for rain streak are acquired from Rain100L and Rain100H, while the masks for snow are selected from Snow100K. For raindrop masks, we model the droplet shape and property using the metaball model. The locations/numbers/sizes of raindrops are randomly sampled. Paired refraction maps are generated using [8, 57]. The mixing mechanism for this dataset is based on physical imaging models [73, 51, 30, 8, 57].
Experiments and results. We train both BIDeN and MPRNet under the BID setting. For MPRNet, we do not require the prediction of the source components or the generation of masks. We report the results for 6 cases, matching the examples presented in Figure 1. Note that only haze is divided into light/moderate/heavy intensities; both the training and test sets of rain streak/snow/raindrop already contain different intensities. We report the results in SSIM and PSNR for CityScape images, masks, and transmission maps.
For all 6 cases, we report the detailed results of BIDeN in Table 2. BIDeN shows excellent quantitative results on the accuracy metric. For the PSNR/SSIM metrics, BIDeN performs well on all components except the raindrop masks. An example of all components generated by BIDeN is shown in Figure 4. Table 3 and Figure 5 present the comparison between BIDeN and MPRNet. For better visualization, we resize the visual examples to match the original CityScape resolution.
Dataset. This task is designed to jointly perform multiple removal tasks, with uncertainty, in one go: once a model is trained on this dataset under the BID setting, it can handle any combination of the three degradations. We construct two versions of this dataset: Version one (V1) is based on ISTD, and Version two (V2) is based on SRD [59, 10]. We use the paired shadow masks, shadow-free images, and shadow images from ISTD/SRD. ISTD consists of 1330 training images and 540 test images, while SRD contains 2680 training images and 408 test images. The algorithm for adding reflection to images is acquired from prior work, and we select 3120 images from the reflection subset as the reflection layer. The watermark generation algorithm, as well as the paired watermark masks and RGB watermark images, are acquired from LVW; we select 3000 paired watermark images and masks from the training set of LVW. Following the data split of ISTD and SRD, V1 contains 1330 shadow-free images/shadow masks, 2580 reflection layer images, and 2460 watermark images/masks as the training set; its test set contains 540 images for every source component. V2 includes 2680 shadow-free images/shadow masks, 2580 reflection layer images, and 2460 watermark images/masks as the training set; its test set contains 408 images for every source component. We do not require the reconstruction of reflection layer images.
Experiments and results. We mainly compare the shadow removal results to multiple shadow removal baselines, including [72, 26, 23, 59, 32, 10, 17]. We train BIDeN under the BID setting. The trained BIDeN is capable of handling all combinations of the shadow/reflection/watermark removal tasks. At test time, we report the results for all cases. We employ the root mean square error (RMSE) in the LAB color space, following [10, 17].
Figure 6 presents visual examples of three cases. The quantitative results for Task III are reported in Table 4. Constrained by the generality of BIDeN and the BID training setting, BIDeN does not show superior quantitative results compared to baselines designed solely for shadow removal. Also, the performance of BIDeN varies across the two dataset versions, being better on Version two (V2), especially on the accuracy metric. Figure 7 presents the qualitative comparison between BIDeN and two state-of-the-art baselines [10, 17].
We perform ablation experiments to analyze the effectiveness of each component inside BIDeN. Evaluation is performed on Task I (Mixed image decomposition across multiple domains). We set the maximum number of source components to be 4 throughout all ablation experiments. The results are shown in Table 5.
Multi-scale encoder (I). We present the results of using a single-scale encoder to replace the multi-scale encoder. BIDeN yields better performance when the multi-scale encoder is employed, which validates the effectiveness of our design.
Choice of losses (II, III, IV, V). BIDeN is trained with four different losses; we show that removing any one of them leads to a performance drop.
Table 5: Ablation results, reported as Fruit (A), Animal (B), Acc (AB), and Acc (All).
Source prediction branch (VI, VII). We move the prediction branch to the generator. Such a change degrades the performance, showing that the source prediction task is better performed inside the discriminator. We also report the results for a variant where $D_P$ does not share weights with the separation branch $D_S$. The performance of this variant is worse than vanilla BIDeN, which indicates that the embedding learned by $D_S$ is beneficial to $D_P$.
Zeroed loss (VIII). Taking the example of Figure 2, the four heads are $H_1$, $H_2$, $H_3$, and $H_4$. BIDeN ignores the outputs from $H_2$ and $H_4$. Here, we instead encourage the outputs of these two heads to be all-zero images. Such a zeroed loss forces the generator to perform the source prediction task implicitly; however, as the source prediction task is challenging, the results after applying the zeroed loss are not comparable to default BIDeN.
BID is a novel task designed to advance real-world vision systems. However, some limitations remain in the setting of the BID task: (1) The training process is constrained by a maximum number of source components, and each head is specific to a certain component/domain. That is, on out-of-distribution examples, the BID setting, as well as BID methods, may fail. (2) In Task I (mixed image decomposition across multiple domains), if the input is a mixture of multiple images from one domain, such as a mixture of bird, cat, and dog images, our constructed dataset recognizes them all as the domain Animal; thus, BIDeN cannot perform a precise source component prediction or generate separated outputs with high fidelity. (3) The training data relies heavily on synthetic data. How to bridge the gap between synthetic data and real data remains a promising research direction.
Our goal is to invite the community to explore the novel BID task, including discovering interesting areas of application, developing novel methods, extending the BID setting, and constructing benchmark datasets. We hope to see the BID task driving innovation in image decomposition and its applications. We expect more application areas related to image decomposition to consider our new BID setting, especially in image deraining. Novel deraining methods under the BID setting can be developed.
Finally, the BID task may also be beneficial to learning with other types of data, such as video, speech, 3D visual data, or even natural language. We hope our proposed BID task brings insights to future research in the field of Artificial Intelligence.
Intrinsic autoencoders for joint deferred neural rendering and intrinsic image decomposition. In 2020 International Conference on 3D Vision (3DV), pp. 1176–1185.
Refit: a unified watermark removal framework for deep learning systems with limited data. In Proceedings of the 2021 ACM Asia Conference on Computer and Communications Security, pp. 321–335.
Accurate and efficient video de-fencing using convolutional neural networks and temporal information. In 2018 IEEE International Conference on Multimedia and Expo (ICME), pp. 1–6.
Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256.
Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pp. 694–711.
Dual attention-in-attention model for joint rain streak and raindrop removal. arXiv preprint arXiv:2103.07051.
BIDeN. We train BIDeN using a Tesla P100-PCIE-16GB GPU. The GPU driver version is 440.64.00 and the CUDA version is 10.2. We initialize weights using Xavier initialization. For Task I (mixed image decomposition across multiple domains), BIDeN (2) to BIDeN (8) take approximately 37, 50, 61, 71, 82, 91, and 101 hours of training time, respectively. For Task II (real-scenario deraining), the runtime of BIDeN is approximately 96 hours.
Double-DIP. We follow the default training setting of Double-DIP and use the official PyTorch implementation (link). We train a single image for 8000 iterations on a Tesla P100-PCIE-16GB GPU; the GPU driver version is 415.27 and the CUDA version is 10.0. The runtime for a single input image is approximately 20 minutes.
DAD. We follow the default training setting (200 epochs, batch size 2, image crop size 256) of DAD. Experiments are based on the official PyTorch implementation (link). We train DAD on a Tesla P100-PCIE-16GB GPU. The GPU driver version is 440.64.00 and the CUDA version is 10.2. DAD takes 13 hours of runtime.
MPRNet. We follow the default training setting (250 epochs, batch size 16, image crop size 256) of MPRNet. For a fair comparison, we apply the same data augmentation strategy of BIDeN to MPRNet. We use the official PyTorch implementation (link) of MPRNet. We train MPRNet using 4 Tesla P100-PCIE-16GB GPUs; the GPU driver version is 415.27 and the CUDA version is 10.0. The runtime of MPRNet is 20 hours and the model size of MPRNet is 41.8 MB.
Following the naming convention used in CycleGAN and perceptual loss, let cXsY-k denote an X×X Convolution-InstanceNorm-ReLU layer with k filters and stride Y (e.g., c7s1-64). Rk denotes a residual block that contains two 3×3 convolutional layers with the same number of filters on both layers, and Rk9 denotes nine consecutive residual blocks. uk denotes a 3×3 fractional-strided-Convolution-InstanceNorm-ReLU layer with k filters and stride 2. Ck denotes a 4×4 Convolution-InstanceNorm-LeakyReLU (slope 0.2) layer with k filters and stride 2. Both reflection padding and zero padding are employed.
Encoder. Our multi-scale encoder contains three branches; we name them $E_1$, $E_2$, and $E_3$. $E_1$ consists of c3s2-256, Rk9, c1s1-128. $E_2$ consists of c7s1-64, c3s2-128, c3s2-256, Rk9. $E_3$ contains c15s1-64, c3s2-128, c3s2-256, c3s1-256, c3s1-256, Rk9, c1s1-128. The number of parameters is 33.908 million for the encoder.
Heads. The architecture of each head is: c1s1-256, c1s1-256, u128, u64, c7s1-3. Each head has 0.575 million parameters.
Discriminator. The discriminator contains two branches, $D_S$ (Separation) and $D_P$ (Prediction). Most weights are shared; the shared part includes C64, C128, C256. The last layer of $D_S$ is C512. $D_P$ contains c1s1-512 (LeakyReLU with slope 0.2), global max pooling, and c1s1-N, where N is the maximum number of source components. The discriminator has approximately 3.028 million parameters in total.
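For readers implementing this notation, a hypothetical helper that parses a cXsY-k spec into a PyTorch block (an illustration of the naming convention, not the released code):

```python
import re
import torch.nn as nn

def conv_block(spec: str, in_ch: int) -> nn.Sequential:
    """Parse a 'cXsY-k' spec (e.g. 'c7s1-64') into a Conv-InstanceNorm-ReLU block."""
    k, s, f = (int(g) for g in re.fullmatch(r"c(\d+)s(\d+)-(\d+)", spec).groups())
    return nn.Sequential(
        nn.Conv2d(in_ch, f, kernel_size=k, stride=s, padding=k // 2),
        nn.InstanceNorm2d(f),
        nn.ReLU(inplace=True),
    )

# Example: the first two layers of the E2 branch described above.
e2_start = nn.Sequential(conv_block("c7s1-64", in_ch=3),
                         conv_block("c3s2-128", in_ch=64))
```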
Task I: Mixed image decomposition across multiple domains. We use linear mixup as the mixing mechanism. We do not introduce additional non-balanced mixing factors or non-linear mixing, as Task I is challenging enough. The mixed image is expressed as $\hat{x} = \frac{1}{n}\sum_{i \in I} x_i$. The probability of every component being selected varies with the maximum number of source components $N$. We set the probabilities to 0.9, 0.8, 0.7, 0.6, 0.5, 0.5, 0.5 for $N = 2, \dots, 8$, respectively.
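A sketch of this sampling-and-mixing procedure, assuming linear mixup is an average over the selected components (function and variable names hypothetical):

```python
import random
import numpy as np

def sample_and_mix(domains: dict, probs: dict, rng=random):
    """Pick each domain independently with its own probability, then average
    the picked images (linear mixup). Illustrative sketch, not the released code.
    `domains` maps name -> HxWx3 float image; `probs` maps name -> probability."""
    chosen = [name for name in domains if rng.random() < probs[name]]
    if not chosen:  # guarantee at least one component is selected
        chosen = [rng.choice(list(domains))]
    mixed = np.mean([domains[name] for name in chosen], axis=0)
    label = [1 if name in chosen else 0 for name in domains]  # e.g. [1, 0, 1, 0]
    return mixed, label
```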
Task II: Real-scenario deraining. The model for rain streak and snow is:

$$I(p) = J(p)\,(1 - M(p)) + M(p),$$

and the model for haze is:

$$I(p) = J(p)\,t(p) + A\,(1 - t(p)),$$

where $p$ is the pixel of images, $I$ is the observed intensity, $J$ is the scene radiance, $A$ is the global atmospheric light, and $M$ is the mask of rain streak and snow. $t$ denotes the transmission map. We set $A$ randomly within a fixed range during training, and fix $A$ at test time.
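A NumPy sketch of these two imaging models; the mask-blend form for rain/snow follows our reading of the equation above, with all images as floats in [0, 1]:

```python
import numpy as np

def apply_rain_or_snow(J: np.ndarray, M: np.ndarray) -> np.ndarray:
    # Mask-blend reading of the rain-streak/snow model: I = J * (1 - M) + M.
    return J * (1.0 - M) + M

def apply_haze(J: np.ndarray, t: np.ndarray, A: float) -> np.ndarray:
    # Atmospheric scattering model: I = J * t + A * (1 - t).
    return J * t + A * (1.0 - t)
```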
To render the raindrop effect, we define a statistical model to estimate the location and motion of the raindrops. We employ the metaball model for the interaction effect between multiple raindrops.
For raindrop positions, we randomly sample them over the entire scene. The raindrop radius is also randomly sampled. A single raindrop is combined with another 1 to 3 smaller raindrops to form a realistic raindrop shape. Each composite raindrop can further merge with other raindrops on the scene. The velocity along the y-axis of a raindrop is proportional to its radius. Raindrop masks are randomly selected along the time dimension for diversity. A simple refractive model is employed. We create a look-up table $T$ with three dimensions: the red and green channels together encode the texture of the raindrop, and the blue channel represents the thickness of the raindrops. Then, the texture table is masked by the alpha mask created by the metaball model; the masked table is dubbed $T_m$. The location $(x, y)$ of the world point that is rendered at image location $(u, v)$ on the surface of a raindrop is modeled as:

$$x = T_m^{r}(u, v), \qquad y = T_m^{b}(u, v),$$

where $T_m^{r}(u, v)$ and $T_m^{b}(u, v)$ denote the pixel at location $(u, v)$ in the red and blue channels of $T_m$.
We acquire the destination pixel coordinate for location $(u, v)$ based on the above equations and generate the distorted image. We also apply random light reduction and blur to the distorted image. For the reduction, we sample the rate within a fixed range during training and fix the rate at test time. The reduction can be expressed as $I_d' = r \cdot I_d$, where $I_d$ is the distorted image and $r$ is the rate. We use a kernel size of 3 for the Gaussian blur.
At last, we merge the distorted image with the original image:

$$I(p) = (1 - M(p))\,J(p) + M(p)\,I_d(p),$$

where $p$ denotes the pixel of images, $I$ is the observed intensity, $J$ is the original image, $I_d$ is the distorted image, and $M$ is the value of the raindrop mask generated by the metaball model.
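This merging step is a per-pixel alpha blend; a minimal NumPy sketch:

```python
import numpy as np

def merge_raindrop(original: np.ndarray, distorted: np.ndarray,
                   alpha: np.ndarray) -> np.ndarray:
    # Per-pixel alpha blend: the metaball raindrop mask `alpha` (values in [0, 1])
    # selects the refracted/darkened/blurred image inside drops, the scene elsewhere.
    return (1.0 - alpha) * original + alpha * distorted
```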
The probabilities of every component being selected are 1.0, 0.5, 0.5, and 0.5 for rain streak, snow, raindrop, and haze, respectively. The mixing order is rain streak, snow, raindrop, haze.
Task III: Joint shadow/reflection/watermark removal. We use the paired shadow masks, shadow images, and shadow-free images provided in ISTD and SRD [59, 10]. The original SRD does not offer shadow masks, so we use the shadow masks generated by Cun et al.
The algorithm for adding reflection to images is expressed as:

$$I(p) = T(p) + V(p)\,\tilde{R}(p),$$

where $p$ is the pixel of images, $I$ is the observed intensity, $T$ is the transmission layer, $\tilde{R}$ is the blurred reflection layer, and $V$ denotes the vignette mask. The reflection image is processed by a Gaussian smoothing kernel with a random kernel size, where the size is in the range of 3 to 17 pixels during training and fixed to 11 pixels during testing.
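A sketch of this reflection composition, assuming OpenCV's Gaussian blur for the smoothing step (kernel size 11 at test time, per the text); images are floats in [0, 1]:

```python
import cv2
import numpy as np

def add_reflection(T: np.ndarray, R: np.ndarray, V: np.ndarray,
                   ksize: int = 11) -> np.ndarray:
    """Blur the reflection layer, then composite it under a vignette mask.
    T, R: HxWx3 float images; V: vignette mask broadcastable to R."""
    R_blur = cv2.GaussianBlur(R, (ksize, ksize), 0)  # kernel size must be odd
    return np.clip(T + V * R_blur, 0.0, 1.0)
```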
For watermarks, we follow the watermark composition model. We use the RGB watermark images to add the watermark effect, and we require the BID method to reconstruct the watermark mask. The watermark composition model is:

$$I(p) = \alpha\,W(p) + (1 - \alpha)\,J(p),$$

where $p$ denotes the pixel of images, $I$ is the observed intensity, $J$ is the scene radiance, $\alpha$ is the blending factor, and $W$ is the watermark image. We set $\alpha$ randomly within a fixed range during training, and fix $\alpha$ for testing.
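A sketch of the watermark composition, assuming the watermark mask restricts blending to the watermark region (names and mask handling hypothetical):

```python
import numpy as np

def add_watermark(J: np.ndarray, W: np.ndarray, alpha: float,
                  mask: np.ndarray) -> np.ndarray:
    """J, W: HxWx3 float images in [0, 1]; mask: HxW binary watermark mask."""
    m = mask[..., None]  # broadcast the mask over the color channels
    # Blend the watermark into the scene only where the mask is on.
    return (alpha * W + (1.0 - alpha) * J) * m + J * (1.0 - m)
```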
The probabilities of every component being selected are 0.6, 0.5, and 0.5 for shadow, reflection, and watermark, respectively. The mixing order is shadow, reflection, watermark.
Detailed case results of BIDeN. When the maximum number of source components $N$ increases, the number of possible index sets $I$ increases rapidly: there are $2^N - 1$ possible combinations of source components, that is, $2^N - 1$ cases. We present the detailed case results of BIDeN on Task I. We show the results of BIDeN (2), BIDeN (3), BIDeN (4), BIDeN (5), and BIDeN (6) in Table 6, Table 7, Table 8, Table 9, and Table 10. These results are extensions of Table 1. Note that, due to a difference in precision, the PSNR results reported here differ slightly from the PSNR results reported in the main paper.
Qualitative results of BIDeN. Here we present more qualitative results of BIDeN. We show the results of BIDeN (2), BIDeN (3), BIDeN (4), BIDeN (5), and BIDeN (6) in Figure 8, Figure 9, Figure 10, Figure 11, and Figure 12. The number of selected source components $n$ and the index set $I$ are randomly chosen. The eight source components in Task I are Fruit (A), Animal (B), Flower (C), Furniture (D), Yosemite (E), Vehicle (F), Vegetable (G), and CityScape (H).
For Task II (Real-scenario deraining), more qualitative results are provided. Visual examples of CityScapes/masks/transmission maps generated by BIDeN are shown in Figure 13. We present more qualitative comparisons between BIDeN and MPRNet  in Figure 14 and Figure 15. The comparison presents the results of 6 cases of the same scene.
For the default training setting on Task II, the probabilities of every component being selected are 1.0, 0.5, 0.5, and 0.5 for rain streak, snow, raindrop, and haze. Moreover, we train both BIDeN and MPRNet again, setting the probability of the rain streak component being selected to 0.8. The quantitative results of BIDeN and the comparison with MPRNet are provided in Table 11 and Table 12. Compared to BIDeN trained under the default training setting of Task II, BIDeN performs even better when the probability of the rain streak component being selected is set to 0.8.