1 Introduction
Images of indoor and outdoor scenes usually mix several sources of meaningful information. Scenes contaminated by reflections, or objects under different shadings, are such phenomena; they arise because illumination is affected by different media or materials in the environment. Specifically, the irradiance the camera receives from a scene point is blended with different cues along the line of sight. These factors make an image look real, but they also make it harder for a computer to understand the image Bi et al. (2015); Yu and Koltun (2016); Long et al. (2015). Worse, some factors degrade images and decrease the visibility of the scene Li and Brown (2014); He et al. (2011).
Separating an image into multiple layers is desirable in both computational photography and various vision tasks such as surface retexturing, 3D object compositing Bi et al. (2015), and 3D point cloud processing JaeSeong Yun (2018). In this regard, single image separation aims to extract two independent layers from an image, such that the input image I can be constructed as the pixelwise addition of a layer B and another layer R, i.e.,
$I = B + R.$   (1)
Many traditional image separation problems can be formulated as Eqn. 1. For instance, reflection interference often arises when a photo of a scene is taken through a glass window. This is a typical image separation problem and can be expressed as a linear combination of a reflection layer R and the background scene B, as I = B + R. The intrinsic image model assumes an input image I is the pixelwise product of an albedo (or reflectance) image A and a shading image S. This can be reformulated into the form of Eqn. 1 by taking the logarithm, i.e., log I = log A + log S.
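As a quick numerical check of this reformulation (a minimal sketch; the toy arrays and the epsilon guard are our own illustration, not from the paper):

```python
import numpy as np

# Intrinsic image model: I = A * S (pixelwise). Taking the log turns the
# product into the additive two-layer form of Eqn. 1: log I = log A + log S.
# Toy 2x2 "images"; a small epsilon guards against log(0).
eps = 1e-8
A = np.array([[0.8, 0.5], [0.2, 0.9]])   # albedo layer
S = np.array([[0.3, 1.0], [0.7, 0.4]])   # shading layer
I = A * S                                 # observed image

log_I = np.log(I + eps)
additive = np.log(A + eps) + np.log(S + eps)
print(np.allclose(log_I, additive, atol=1e-6))  # True: additive form holds
```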
While obviously useful, estimating such layers is fundamentally ill-posed, as there exist infinitely many feasible solutions to Eqn. 1. To constrain the space of feasible solutions, much prior information has been instantiated through carefully tailored image filters or energy terms Li and Brown (2014); Shih et al. (2015). For example, Li et al. Li and Brown (2014) assume that one output layer is smoother than the other. Based on this assumption, they proposed a relative smoothness prior to separate an image into two layers. However, when the scene becomes complicated, such handcrafted priors are no longer sufficient to describe the difference between the two output layers. On the other hand, given access to aligned ground truth datasets, deep convolutional neural networks (CNNs) provide a data-driven candidate for solving the ill-posed inverse problem with fewer potentially heuristic or handcrafted assumptions. However, existing databases are limited for the single image separation problem in several respects: (1) It is hard to transfer information among different datasets for the same task, since existing synthetic datasets differ from one another due to their different application scenarios Fan et al. (2017); Wan (2019); Zhang et al. (2018); Fan et al. (2018). (2) Ground truth for real images is extremely hard to acquire for training a general CNN model Fan et al. (2018). Consequently, each existing dataset is limited in different ways, and thus far, supervised deep network models built on them likewise display a high degree of dataset-tailored architectural variance.
To this end, we propose USIS (Unpaired Single Image Separation), a method that takes three image domains without ground truth pairings for training. Based on the difference of distributions between the image sets and on cycle consistency, USIS learns the relationship among the different domains in a generative adversarial manner. After learning the feature distributions of the given output image sets, USIS separates a single input image into two meaningful components that are independent of each other.
Experimental results show that the proposed framework can separate the input image into two desired images that properly follow the distributions of the output image sets. The proposed architecture can be applied to single image reflection removal and intrinsic image decomposition without access to ground truth. We also extend the proposed method, with slight modification, to a more challenging single image three-layer separation task. Results demonstrate that the proposed USIS can handle this problem properly.
2 Previous Work
Unsupervised domain translation methods receive two sets of samples and learn a function that maps a sample of one domain to the analogous sample of the other domain Chintala (2016); Yi (2017); Zhu (2017); Liu (2017). However, because the relationships among our three sets of images are fundamentally based on the physical model in Sec. 1, previous unpaired image-to-image methods cannot be adopted for the image separation task directly.
There are various image separation problems in computer vision, and different priors and physical models are applied in different subproblems. For instance, relative smoothness Li and Brown (2014), ghosting cues Shih et al. (2015) and layer independence priors have been introduced for the reflection removal problem. Although such hypotheses work in many cases, they are all low-level priors built on image gradients or color changes, and they do not adapt well to complex scenes. Recently, many deep learning based methods have been carefully tailored to different datasets that contain ground truth. In single image reflection removal, fully convolutional networks have been designed with different guidance branches (such as image gradient information Fan et al. (2017) or face structure priors Wan (2019)) or losses (such as perceptual losses Zhang et al. (2018)). For intrinsic image decomposition, various U-net Ronneberger (2015) like encoder-decoders with skip connections have been proposed to tackle the decomposition. Because it is hard to collect real image databases with ground truth labels, unsupervised learning for single image separation is appealing. However, single image separation is even tougher when the training images come without ground truth. Janner et al. Janner (2017) proposed self-supervised intrinsic image decomposition by training on a few images with ground truth data, then transferring the model to other unpaired images. However, their method requires the training images in the same group to share the same reflectance layer. Ma et al. Torralba (2018) and Li et al. Zhengqi Li (2018) proposed unsupervised intrinsic image decomposition methods, but these methods need multiple inputs with the same reflectance layer to train the model.
3 Unsupervised Single Image Layer Separation
3.1 Problem Formulation
Let X_I, X_B and X_R be three image domains. In supervised image-to-image translation, we are given samples (I, B, R) drawn from a joint distribution P(I, B, R). In unsupervised single image separation, we are given samples drawn only from the marginal distributions P(I), P(B) and P(R). Besides, based on Eqn. 1, I = B + R. However, as explained in Sec. 1, solving the problem is highly ill-posed: we can infer nothing about the joint distribution from the marginal samples without additional assumptions.
Assumption 1. Shared information consistency. I is blended from B and R, and I and B (or R) share the same latent space, i.e., E_B(I), E_B(B) ∈ Z_B and E_R(I), E_R(R) ∈ Z_R, where E_B (E_R) is a function mapping an image from color space to latent space and Z_B (Z_R) is the corresponding latent space.
Computationally, to implement the shared information consistency assumption, the original image separation pipeline can be rewritten as:
$\hat B = G_B(z_B), \quad \hat R = G_R(z_R),$   (2)
where z_B (z_R) is the feature sample in latent space Z_B (Z_R) analogous to B (R), with z_B = E_B(I) and z_R = E_R(I). G_B and G_R are mapping functions used to project the feature samples back to color space, reconstructing the analogous separated images $\hat B$ and $\hat R$.
Assumption 2. Layer independence. The separated images B and R are independent of each other in latent space, i.e., samples from the same domain share more similar features in latent space than samples from different domains do.
The layer independence assumption can be implemented by minimizing the distance between any two samples from the same domain in latent space, while maximizing the distance between any two samples from different domains. Computationally, for any two samples z_B, z_B' from latent space Z_B and any two samples z_R, z_R' from latent space Z_R, we have:
$d(z_B, z_R) > d(z_B, z_B'), \quad d(z_B, z_R) > d(z_R, z_R'),$   (3)
where d(z_B, z_R) is the distance between samples from different latent spaces, and d(z_B, z_B') (d(z_R, z_R')) is the distance between two samples of the same latent space.
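A minimal sketch of this constraint on toy latent codes (the clustered codes, the L1 distance and all variable names here are our own illustration):

```python
import numpy as np

def l1_dist(z1, z2):
    """Mean absolute distance between two latent codes."""
    return np.mean(np.abs(z1 - z2))

rng = np.random.default_rng(0)
# Toy latent codes: domain-B codes cluster around +1, domain-R codes around -1.
zb1, zb2 = rng.normal(1.0, 0.1, 64), rng.normal(1.0, 0.1, 64)
zr1, zr2 = rng.normal(-1.0, 0.1, 64), rng.normal(-1.0, 0.1, 64)

d_intra = l1_dist(zb1, zb2)   # same-domain distance
d_inter = l1_dist(zb1, zr1)   # cross-domain distance
print(d_intra < d_inter)      # True: the inequality of Eqn. 3 holds here
```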
3.2 Unsupervised Single Image Separation Learning
Self Supervising (SS). Based on the shared information consistency assumption, I contains all the information constructing B and R. Furthermore, the features of I in latent space still contain all the information of the features of B and R. As shown in Fig. 1 (b), an ideal encoder E_B can encode I from RGB space to the analogous feature code z_B in latent space Z_B, which contains all the information of B. Adding the layer independence assumption, z_B contains all the information of B and no information of R. In this way, we can disentangle B and R from I through two encoders, E_B and E_R, into two independent analogous features of B and R in latent spaces Z_B and Z_R.
Following this idea, for E_B, we minimize the L1 distance between B and G_B(E_B(B)), where B and I are unpaired samples from X_B and X_I. For E_R, we minimize the L1 distance between R and G_R(E_R(R)), where R and I are unpaired samples from X_R and X_I. To make the features of B and R more distinguishable, we also maximize the distance between the features yielded by E_B and E_R.
GANs. Note that the self-supervising constraint and the shared information consistency assumption do not guarantee that the output images, which carry different latent codes for the two domains, look like real images in domains X_B and X_R. Hence, we adopt a generative adversarial framework to make the outputs $\hat B$ and $\hat R$ look as real, in a perceptual sense, as samples from the corresponding domains X_B and X_R.
There are two kinds of subnetworks in the generative adversarial framework. The generator aims to solve the single image separation problem, i.e., to obtain the mapping function F: I ↦ (B, R). Following the encoder-decoder based mapping technique Kingma and Welling (2014), two encoders and two decoders are learned. We denote by E_B and E_R the encoders mapping image color space to latent space, and by G_B and G_R the decoders of the two output layers. Specifically, we write F = (G_B ∘ E_B, G_R ∘ E_R) as the generator of our framework. We use two discriminators, D_B and D_R, to discriminate whether the separated images belong to domains X_B and X_R, respectively; e.g., for real images from domain X_B, D_B should output true, while for outputs of the generator F it should output false.
Cycle Consistency (CC). Since the shared information consistency assumption and Eqn. 1 imply a cycle-consistency constraint, we also enforce this constraint in the proposed framework to further regularize the ill-posed unsupervised single image separation problem. Specifically, we obtain $\hat B$ and $\hat R$ through the mapping function F; then $\hat I = \hat B + \hat R$ should be consistent with I. Furthermore, $\tilde B$ and $\tilde R$ are generated by applying the mapping function F again to $\hat I$; $\tilde B$ ($\tilde R$) should be consistent with $\hat B$ ($\hat R$).
Learning. We jointly solve the learning problems of self-supervising, GANs and cycle consistency for the image separation streams, the image reconstruction streams and the cycle-reconstruction streams:
$\min_{E_B, E_R, G_B, G_R} \max_{D_B, D_R} \; \mathcal{L}_{SS} + \mathcal{L}_{GAN} + \mathcal{L}_{CC}.$   (4)
Self-supervised training aims to split a sample from X_I in the latent space; specifically, to split the sample into a latent code for the analogous sample in X_B and another latent code for the analogous sample in X_R. Based on Eqn. 3, the self-supervising objective is:
$\mathcal{L}_{SS} = \lambda_1 \| G_B(E_B(B)) - B \|_1 + \lambda_2 \| G_R(E_R(R)) - R \|_1 - \lambda_3\, d_z(E_B(I), E_R(I)),$   (5)
where the hyperparameters λ1, λ2 and λ3 control the weights of the different objective terms. We adopt the L1 distance for the first two terms in practice, i.e., we compute the L1 distance of the two inputs. d_z is the distance between the latent codes from Z_B and Z_R, for which we use a modified sigmoid function as the distance function:
$d_z(z_1, z_2) = \frac{2}{1 + e^{-\alpha \| z_1 - z_2 \|_1}} - 1,$   (6)
where α controls the shape of the distance curve; experimental results are provided at the top of Fig. 2 (b).
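As a sketch, a bounded sigmoid-shaped distance of this kind can be implemented as follows (the exact functional form here is our reconstruction, not guaranteed to match the paper's):

```python
import numpy as np

def sigmoid_distance(z1, z2, alpha=1.0):
    """Map the L1 gap between two latent codes through a shifted sigmoid,
    so the distance is bounded in [0, 1): 0 for identical codes and
    approaching 1 as the codes diverge. alpha controls the curve's slope."""
    gap = np.mean(np.abs(z1 - z2))
    return 2.0 / (1.0 + np.exp(-alpha * gap)) - 1.0

z = np.ones(8)
print(sigmoid_distance(z, z))          # 0.0 for identical codes
print(sigmoid_distance(z, -z) > 0.7)   # True: large for well-separated codes
```

The bounded range keeps the "maximize distance" term from dominating the objective, which an unbounded L1 gap could do.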
In Eqn. 4, the GAN objective functions are given by
$\mathcal{L}_{GAN} = \lambda_0 \big( \mathbb{E}_{B}[\log D_B(B)] + \mathbb{E}_{I}[\log(1 - D_B(\hat B))] + \mathbb{E}_{R}[\log D_R(R)] + \mathbb{E}_{I}[\log(1 - D_R(\hat R))] \big).$   (7)
The objective functions in Eqn. 7 are conditional GAN objective functions. They are used to ensure that the separated images resemble images in the target domains, respectively. The hyperparameter λ0 controls the impact of the GAN objective functions.
We use the L1 difference to model the cycle-consistency constraint, which is given by
$\mathcal{L}_{CC} = \lambda_4 \| \hat I - I \|_1 + \lambda_5 \| \tilde B - \hat B \|_1 + \lambda_6 \| \tilde R - \hat R \|_1,$   (8)
where $\hat I = \hat B + \hat R$ and $(\tilde B, \tilde R) = F(\hat I)$. The hyperparameters λ4, λ5 and λ6 control the weights of these three objective terms.
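The cycle-consistency terms can be sketched as follows, with a toy oracle standing in for the learned mapping F (the function names, weights and the oracle itself are our own illustration):

```python
import numpy as np

def l1(a, b):
    """Mean absolute (L1) difference between two images."""
    return np.mean(np.abs(a - b))

def cycle_consistency_loss(I, separate, lam4=1.0, lam5=1.0, lam6=1.0):
    """Eqn. 8 as code: separate() plays the role of the mapping F,
    returning the two predicted layers for an input image."""
    B_hat, R_hat = separate(I)
    I_hat = B_hat + R_hat              # combine step, per Eqn. 1
    B_tld, R_tld = separate(I_hat)     # separate the reconstruction again
    return (lam4 * l1(I_hat, I)
            + lam5 * l1(B_tld, B_hat)
            + lam6 * l1(R_tld, R_hat))

# With an oracle separator whose layers sum back to the input, the loss is 0.
oracle = lambda I: (np.minimum(I, 0.5), I - np.minimum(I, 0.5))
I = np.random.default_rng(1).uniform(0, 1, (4, 4))
print(cycle_consistency_loss(I, oracle))  # 0.0
```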
Inheriting from GANs, training the proposed framework amounts to solving a min-max problem, where the optimization aims to find a saddle point. To make the training process stable, we apply a gradient update scheme similar to the one described in Bengio (2014), together with gradient penalization, to solve Eqn. 4. Specifically, we first apply a gradient descent step to update E_B, E_R, G_B, G_R with D_B and D_R fixed. We then apply a gradient ascent step to update D_B and D_R with E_B, E_R, G_B, G_R fixed.
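A toy analogue of this alternating scheme (our construction, not the paper's networks): find the saddle point of f(x, y) = x·y − y²/2 by alternating a descent step on x (the "generator") with an ascent step on y (the "discriminator"); the −y²/2 term plays a role loosely analogous to the penalty that keeps the ascent player from diverging.

```python
# Alternating min-max on f(x, y) = x*y - 0.5*y**2, whose saddle is (0, 0).
x, y, lr = 1.0, 1.0, 0.1
for _ in range(200):
    x -= lr * y           # descent step on x with y fixed: df/dx = y
    y += lr * (x - y)     # ascent step on y with x fixed: df/dy = x - y

print(abs(x) < 1e-3 and abs(y) < 1e-3)  # True: converged near the saddle
```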
4 Network Architecture
Our model USIS follows the generative adversarial network pipeline introduced by Bengio (2014) and consists of one generator and two discriminators. The generator aims to separate the input sample I into the analogous $\hat B$ and $\hat R$. The discriminators share the same architecture with different parameters and discriminate whether a sample $\hat B$ (or $\hat R$) is real or fake, i.e., whether the sample belongs to the distribution of real images.
Generator. As illustrated in Fig. 1 (b), there are two types of generators in the USIS.
The first type is the separate generator, which is constructed from two convolutional encoder-decoder networks {(E_B, G_B), (E_R, G_R)} with skip connections. These two networks share the same structure but not their parameters. Depending on the experimental task, we designed two variants:
Toy problem task. See Fig. 2 (a) for the task description. Both networks employ mirror-link connections introduced by Ronneberger (2015), which connect layers of the encoder and decoder of the same size. These connections yield sharper results than the blurred outputs characteristic of many deconvolutional models. The encoder has 5 convolutional layers with {16, 32, 64, 128, 256} filters of size 4×4 and stride 2. Batch normalization Ioffe and Szegedy (2015) and leaky ReLU activation are applied after every convolutional layer. The layers in the two decoders have the same numbers of features as the encoder but in reverse order, plus a final layer with 3 channels.
Reflection removal & intrinsic image decomposition task. We adopt a pretrained VGG19 Simonyan and Zisserman (2015), selecting 'conv1_2', 'conv2_2' and 'conv3_2' as skip-connected features, which has been shown to be successful for image synthesis and enhancement Zhang et al. (2018). The first 4 blocks in the decoder are cascaded convolution layers and upsampling operations that fuse features from the encoder. The subsequent contextual block is a fully convolutional network with 64 filters of size 3×3, stride 1 and dilation rates {2, 4, 8, 16, 32, 1}, followed by an output layer, which is a convolution layer with 3 filters of size 1×1. Instance normalization Dmitry Ulyanov and Lempitsky (2016) and leaky ReLU activation are applied after every convolutional layer.
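One motivation for such dilation rates is the large receptive field they give the contextual block: a 3×3 kernel with dilation d widens the receptive field by 2d per layer, so the stated rates can be tallied directly (a back-of-the-envelope check, assuming stride 1 throughout):

```python
# Receptive field of six 3x3 conv layers with stride 1 and the dilation
# rates listed for the contextual block. Each layer grows the receptive
# field by 2*dilation pixels, starting from a single pixel.
dilations = [2, 4, 8, 16, 32, 1]
rf = 1
for d in dilations:
    rf += 2 * d   # a 3x3 kernel with dilation d spans 2*d extra pixels
print(rf)         # 127
```

A 127-pixel extent lets the block aggregate context over roughly half of a 256×256 training crop in each direction.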
The second type is the combine generator, denoted G_c, which has no learnable parameters. The combine generator combines the predicted analogous $\hat B$ and $\hat R$ by pixelwise addition, based on Eqn. 1.
Discriminator. We adopt a multiscale discriminator, as in UNIT Liu (2017), to distinguish real from fake images. Each discriminator network is constructed from multiple branches of subnetworks that distinguish real and fake images at different scales. For the toy problem task the number of branches is 1; otherwise we set it to 3. Specifically, each branch has 4 convolutional layers with {32, 64, 128, 32} filters of size 4×4 and stride 2. Instance normalization Dmitry Ulyanov and Lempitsky (2016) and leaky ReLU activation are applied after every convolutional layer. For the i-th branch, the image is downsampled (i−1) times via average pooling before being used as input. Finally, the features yielded by the different branches are fused together and followed by a sigmoid activation.
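The per-branch input preparation can be sketched as follows (a sketch of the average-pooling pyramid only, not the discriminator network; the factor-of-2 downsampling per branch is our assumption):

```python
import numpy as np

def avg_pool2x(img):
    """2x2 average pooling with stride 2 on an (H, W) image, H and W even."""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def multiscale_inputs(img, branches=3):
    """Inputs for the i-th branch: the image average-pooled (i-1) times."""
    scales = [img]
    for _ in range(branches - 1):
        scales.append(avg_pool2x(scales[-1]))
    return scales

img = np.arange(64, dtype=float).reshape(8, 8)
shapes = [s.shape for s in multiscale_inputs(img)]
print(shapes)  # [(8, 8), (4, 4), (2, 2)]
```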
5 Experiments
We first analyze different components of the proposed framework based on a toy problem. We then present visual and numerical results on real image separation tasks. Finally, we extend our framework to a more challenging single image separation task.
5.1 Performance Analysis
Method | MSE (square layer) | MSE (circle layer)
w/o CC | 71.71 | 71.52
w/o SS | 60.39 | 54.84
Proposed | 28.93 | 27.35
We used ADAM Kingma and Ba (2015) for training, with the learning rate set to 0.0001 and the momentums set to 0.0 and 0.9. Each minibatch consisted of an image from domain X_I, an image from domain X_B and an image from domain X_R. Our framework has several hyperparameters, as shown in Eqns. 5-8; they were fixed to default values for all experiments.
We introduce a toy problem (visualized in Fig. 2 (a)) in which domain X_B contains gray images with resolution 128×128, each containing a square with different {lightness, position, size}. Shapes in domain X_R are circles, and each image in domain X_I is generated via Eqn. 1. Based on this toy problem, we generate a dataset containing 5K sets of images in the three domains (blended, square and circle), which is convenient for quantitative evaluation. The goal is to separate the blended image into an image containing only the square and a residual image containing only the circle. Here we use the unsupervised scheme: we randomly choose 4K images in each domain, without grouping the analogous images, for training. We train for 200 epochs and use the final model; specifically, we test the separate generator on the test set. We then compare the separated images against the corresponding ground truth images via mean squared error (MSE). Note that image pixel values are in [0, 255] in our experiments.
5.2 Qualitative and quantitative results
Figs. 3 and 4 show the results of the proposed framework on two image separation tasks, compared against state-of-the-art unsupervised methods that take a single input.
Reflection removal. We apply the proposed framework to the single image reflection removal task. Most previous reflection removal works use synthesized data to train their CNN models Fan et al. (2017); Zhang et al. (2018), and the published real data with ground truth are of very limited size (Zhang et al. Zhang et al. (2018) proposed a small dataset containing 110 {input, background} image pairs without ground truth reflection). Wan et al. Wan et al. (2017) proposed a real dataset containing 454 image sets with the corresponding background and reflection.
In this experiment, we train our model on the benchmark Wan et al. (2017) directly. Note that we use 400 image sets for training and the remaining images for evaluation. In each training iteration, we randomly choose non-corresponding samples I, B and R from the training set. We train the network to separate reflection-contaminated images of size 256×256 by randomly cropping patches from the images. We show the effectiveness of the encoders E_B and E_R for separating images in the latent space by clustering the features from the encoders in Fig. 3 (a), followed by several separation results for qualitative comparison. CycleGAN and UNIT only provide a prediction of B given I.
Intrinsic image decomposition. We use the 220 images in the MIT intrinsic dataset Freeman (2011), as extended by Narihira and Yu (2015). The data contain only 20 different objects, each of which has 11 images. We train the network to decompose intrinsic images of size 256×256. The clustering results and visual comparisons are illustrated in Fig. 4. The numerical comparison results are shown in Table 1.
Method | CycGAN Zhu (2017) SSIM / MSE | UNIT Liu (2017) SSIM / MSE | USIS (proposed) SSIM / MSE
Reflection removal | 0.622 / 97.04 | 0.738 / 68.15 | 0.842 / 51.14
Intrinsic decomposition | 0.572 / 48.95 | 0.821 / 37.41 | 0.893 / 30.10
5.3 Multi-layer separation
In this section we design another toy, but more challenging, problem to evaluate the proposed USIS framework: separating one single image into three analogous layers; see Fig. 5 (a) for a visualization of the task. As with separating an image into two layers, the task here aims to separate the image into three layers: square, circle and triangle. We generate 5K image sets and use the same settings as described in Sec. 5.1. We add one generator and one discriminator to USIS to address this problem of splitting an input image into three layers. The results in Fig. 5 (b) show that the proposed USIS can still handle such challenging problems.
6 Conclusion
In this paper, we propose an Unsupervised Single Image Separation (USIS) network for the single image separation task. We show that, by learning both image consistency and the independence of the distributions of different layers, USIS can exploit the information in the different layer distributions to separate a single image into analogous layers. USIS allows unlabeled data to be used in training. Experimental results show that the proposed framework performs unsupervised single image reflection removal and single-image intrinsic decomposition properly.
References
 [1] (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems, Cited by: §3.2, §4.
 [2] (2015) An L1 image transform for edge-preserving smoothing and scene-level intrinsic decomposition. ACM Transactions on Graphics (TOG) 34 (4), pp. 78. Cited by: §1, §1.
 [3] (2016) Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, Cited by: §2.
 [4] (2016) Instance normalization: the missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022. Cited by: §4, §4.
 [5] (2017) A generic deep architecture for single image reflection removal and image smoothing. In ICCV, Cited by: §1, §2, §5.2.
 [6] (2018) Revisiting deep intrinsic image decompositions. In CVPR, pp. 8944–8952. Cited by: §1.
 [7] (2011) Ground truth dataset and baseline evaluations for intrinsic image algorithms. In ICCV, Cited by: §5.2.
 [8] (2011) Single image haze removal using dark channel prior. IEEE Transactions on Pattern Analysis and Machine Intelligence 33 (12), pp. 2341–2353. Cited by: §1.
 [9] (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In ICML, Cited by: §4.
 [10] (2018) Reflection removal for large-scale 3D point clouds. In CVPR, Cited by: §1.
 [11] (2017) Selfsupervised intrinsic image decomposition. In Advances in Neural Information Processing Systems, Cited by: §2.
 [12] (2015) Adam: a method for stochastic optimization. In ICLR, Cited by: §5.1.
 [13] (2014) Autoencoding variational bayes. In ICLR, Cited by: §3.2.
 [14] (2014) Single image layer separation using relative smoothness. In CVPR, Cited by: §1, §1, §2.
 [15] (2017) Unsupervised imagetoimage translation networks. In Advances in neural information processing systems 30, Cited by: §2, §4, Figure 3, Figure 4, Table 1.
 [16] (2015) Fully convolutional networks for semantic segmentation. In CVPR, Cited by: §1.
 [17] (2015) Direct intrinsics: learning albedo-shading decomposition by convolutional regression. In CVPR, Cited by: §5.2.
 [18] (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, Cited by: §2, §4.
 [19] (2015) Reflection removal using ghosting cues. In CVPR, Cited by: §1, §2.
 [20] (2015) Very deep convolutional networks for largescale image recognition. In ICLR, Cited by: §4.
 [21] (2018) Single image intrinsic decomposition without a single intrinsic image. In ECCV, Cited by: §2.

 [22] (2008) Visualizing data using t-SNE. Journal of Machine Learning Research 9 (12), pp. 2579–2605. Cited by: Figure 3.
 [23] (2017) Benchmarking single-image reflection removal algorithms. In IEEE ICCV, Cited by: §5.2, §5.2.
 [24] (2019) Face image reflection removal. In CVPR, Cited by: §1, §2.
 [25] (2017) DualGAN: unsupervised dual learning for image-to-image translation. In ICCV, Cited by: §2.
 [26] (2016) Multiscale context aggregation by dilated convolutions. In ICLR, Cited by: §1.
 [27] (2018) Single image reflection separation with perceptual losses. In CVPR, Cited by: §1, §2, §4, §5.2.
 [28] (2018) Learning intrinsic image decomposition from watching the world. In CVPR, Cited by: §2.
 [29] (2004) Image quality assessment: from error visibility to structural similarity. IEEE Trans Image Process 13 (4), pp. 600–612. Cited by: Table 1.
 [30] (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In CVPR, Cited by: §2, Figure 3, Figure 4, Table 1.