Stereo computation (stereo matching) is a well-known and fundamental vision problem, in which a dense depth map is estimated from two images of a scene taken from slightly different viewpoints. Typically, one camera is on the left (its image is denoted by $I_L$) and the other is on the right (denoted by $I_R$), just like our left and right eyes. Given a single image, it is generally impossible to infer a disparity map without strong semantic-dependent image priors, such as those used by deep-learning-based single-image depth regression methods [1, 2, 3]. Even though these learning-based monocular depth estimation methods can predict a reasonable disparity map from a single image, they all assume the input image to be an original color image.
In this paper, we propose a novel problem: one is instead provided with a single mixture image (denoted by $M$) that is a composition of an original stereo image pair $I_L$ and $I_R$, i.e., $M = f(I_L, I_R)$, and the task is to simultaneously recover both the stereo image pair $I_L$ and $I_R$ and an accurate dense depth map $D$. Under our problem definition, $f$ denotes the image composition operator that generates the mixture image, which will be defined in detail later. This is a very challenging problem because of its obviously ill-posed (under-constrained) nature: from one input mixture image, one effectively wants to recover three images ($I_L$, $I_R$, and $D$).
In theory, this appears to be a blind signal separation (BSS) task, i.e., separating an image into two different component images. However, conventional methods such as BSS using independent component analysis (ICA) are unsuitable for this problem, as they make strong assumptions about the statistical independence of the two components. Under our problem definition, $I_L$ and $I_R$ are highly correlated. In computer vision, image layer separation methods, such as reflection and highlight removal [5, 6], also rely on differences in image statistics and are therefore unsuitable as well. Another related topic is image matting [7], which refers to the process of accurate foreground estimation from an image. However, it either needs human interaction or depends on the difference between the foreground object and the background, and so cannot be applied to our task.
In this paper, we advocate a novel deep-learning-based solution to the above task, using a simple network architecture. We can successfully solve for a stereo pair $I_L$, $I_R$ and a dense depth map $D$ from a single mixture image $M$. Our network consists of an image separation module and a stereo matching module, where the two modules are optimized jointly. Under our framework, the solution of one module benefits the solution of the other module. It is worth noting that the training of our network does not require ground truth depth maps.
At first glance, this problem, while intriguing, may seem to be of purely intellectual interest, with perhaps no practical use. We show that this is not the case: in this paper, we use it to solve three very different vision problems: double vision, de-anaglyph and even monocular depth estimation.
The demand for de-anaglyph remains significant. A search on YouTube returns hundreds, if not thousands, of anaglyph videos whose original stereo images are not necessarily available. Our method and previous work enable the recovery of the stereo images and the corresponding disparity map, which will significantly improve users’ real 3D experience. As evidenced in the experiments, our proposed method outperforms the existing work by a wide margin. Last but not least, our model can also handle the task of monocular depth estimation, and the result came as a surprise to us: even with one single mixture image, trained on the KITTI benchmark, our method produces state-of-the-art depth estimation, with results better than those of traditional two-image-based methods.
2 Setting the Stage
In this paper, we study two special cases of our novel problem of joint image separation and stereo computation, namely anaglyph (red-cyan stereo) and diplopia (double vision) (see Fig. 1), which have not been well studied in the past.
Double vision (aka diplopia): Double vision is the simultaneous perception of two images (a stereo pair) of a single object in the form of a single mixture image. Specifically, under the double vision (diplopia) model (cf. Fig. 1 (left column)), the perceived image is $M = f(I_L, I_R) = \frac{1}{2}(I_L + I_R)$, i.e., the image composition $f$ is a direct average of the left and the right images. Note that the above equation is similar to the linear additive model used in layer separation [5, 10, 11] for reflection removal and raindrop removal; we will discuss the differences in detail later.
Red-Cyan stereo (aka anaglyph): An anaglyph (cf. Fig. 1 (right column)) is a single image created by selecting chromatically opposite colors (typically red and cyan) from a stereo pair. Thus, given a stereo pair $I_L$, $I_R$, the image composition operator $f$ is defined as $M = f(I_L, I_R)$, where the red channel of $M$ is extracted from the red channel of $I_L$ while its green and blue channels are extracted from $I_R$. De-anaglyph [8, 9] aims at both estimating the stereo pair (color restoration) and computing its disparity maps.
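The two composition operators described above can be illustrated with a minimal NumPy sketch. The function names, the RGB channel order (channel 0 taken as red), and the random toy images are assumptions made for illustration only; they are not part of the original method description.

```python
import numpy as np


def compose_double_vision(left, right):
    """Double-vision mixture: per-pixel average of the left and right views."""
    return 0.5 * (left.astype(np.float64) + right.astype(np.float64))


def compose_anaglyph(left, right):
    """Red-cyan anaglyph: red from the left view, green and blue from the right."""
    mixture = right.copy()
    mixture[..., 0] = left[..., 0]  # channel 0 is assumed to be red (RGB order)
    return mixture


# Toy stereo pair with values in [0, 1); real inputs would be rectified images.
rng = np.random.default_rng(0)
left = rng.random((4, 6, 3))
right = rng.random((4, 6, 3))

double_vision = compose_double_vision(left, right)
anaglyph = compose_anaglyph(left, right)
```

In both cases three color channels are observed while six are sought, which is one way to see why the separation problem is under-constrained without the stereo constraint.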
At first glance, the problem seems impossible, as one has to generate two images plus a dense disparity map from a single input. However, the two constituent images are not arbitrary but are related by a valid disparity map; therefore, they must align well along horizontal scanlines. For anaglyph stereo, existing methods [8, 9] exploit both the image separation constraint and disparity map computation to achieve color restoration and stereo computation. Joulin and Kang [9] reconstructed the original stereo pairs from the input anaglyph using a modified SIFT-flow method [12]. Williem et al. [8] presented a method that solves the problem by iterating between color restoration and stereo computation. These works suggest that by properly exploiting the image separation and stereo constraints, it is possible to restore the stereo pair images and compute the disparity map from a single mixture image.
There is little work in computer vision dealing with double vision (diplopia), which is nonetheless an important topic in ophthalmology and visual cognition. The most related works seem to be layer separation [5, 10], where the task is to decompose an input image into two layers corresponding to the background image and the foreground image. However, there are significant differences between our problem and general layer separation. For layer separation, the two layers of the composited image are generally independent and statistically different. In contrast, the two component images are highly correlated for double vision.
Even though there has been remarkable progress in monocular depth estimation, current state-of-the-art network architectures [1, 2] and  cannot be directly applied to our problem. This is because they depend on a single left or right image as input and cannot handle the image mixture case investigated in this work. Under our problem definition, the two tasks of image separation and stereo computation are tightly coupled: stereo computation is not possible without correct image separation; on the other hand, image separation will benefit from disparity computation.
In this paper, we present a unified framework to handle the problem of stereo computation for a single mixture image, which naturally unifies various geometric vision problems such as anaglyph, diplopia and even monocular depth estimation. Our network can be trained with the supervision of stereo image pairs only, without the need for ground truth disparity maps, which significantly reduces the requirements for training data. Extensive experiments demonstrate that our method achieves superior performance.
3 Our Method
In this paper, we propose an end-to-end deep neural network to simultaneously learn image separation and stereo computation from a single mixture image. It can handle a variety of problems such as anaglyph, de-diplopia and even monocular depth estimation. Note that existing work designed for either layer separation or stereo computation cannot be applied to our problem directly. This is because these two problems are deeply coupled, i.e., the solution of one problem affects the solution of the other. By contrast, our formulation, presented below, jointly solves both problems.
3.1 Mathematical Formulation
Under our mixture model, the quality of depth map estimation and the quality of image separation are evaluated jointly, and therefore the solutions of the two tasks benefit from each other. Our network model (cf. Fig. 2) consists of two modules, i.e., an image separation module and a stereo computation module. During network training, only the ground-truth stereo pairs are needed to provide supervision for both image separation and stereo computation.
By considering both the image separation constraint and the stereo computation constraint in network learning, we define the overall loss function as:
$$\mathcal{L}(\theta_L, \theta_R, \theta_D) = \mathcal{L}_{sep}(\theta_L, \theta_R) + \mathcal{L}_{stereo}(\theta_D),$$
where $\theta_L$, $\theta_R$ and $\theta_D$ denote the network parameters corresponding to the image separation module (left image prediction and right image prediction) and the stereo computation module. A joint optimization of $(\theta_L, \theta_R, \theta_D)$ gives both the desired stereo image pair and the disparity map.
3.2 Image Separation
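The input single mixture image $M$ encodes the stereo image pair as $M = f(I_L, I_R)$, where $f$ is the image composition operator, known a priori. To learn the stereo image pair from the input single mixture image, we present a unified end-to-end network pipeline. Specifically, denote by $\mathcal{F}$ the learned mapping from the mixture image to the predicted left or right image, parameterized by $\theta_L$ or $\theta_R$. The objective function of our image separation module for the left image is defined as
$$\min_{\theta_L} \; \alpha_c \, \mathcal{L}_c\big(\mathcal{F}(M; \theta_L), I_L\big) + \alpha_p \, \mathcal{L}_p\big(\mathcal{F}(M; \theta_L)\big),$$
where $M$ is the input single mixture image and $I_L$, $I_R$ are the ground truth stereo image pair. The loss function $\mathcal{L}_c$ measures the discrepancy between the predicted stereo images and the ground truth stereo images, $\mathcal{L}_p$ is the image prior loss, and $\alpha_c$, $\alpha_p$ weight the two terms. The objective function for the right image is defined similarly.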
In evaluating the discrepancy between images, various loss functions such as loss , classification loss  and adversarial loss  can be applied. Here, we leverage the pixel-wise $\ell_1$ regression loss as the content loss of our image separation network,
$$\mathcal{L}_c\big(\mathcal{F}(M; \theta_L), I_L\big) = \big\lvert \mathcal{F}(M; \theta_L) - I_L \big\rvert .$$
This loss allows us to perform end-to-end learning, is compatible with the stereo matching loss, and does not require handling the class imbalance problem or adding an extra network structure as a discriminator.
Research on natural image statistics shows that a typical real image obeys a sparse spatial gradient distribution . According to Yang et al. , such a prior can be represented as a Total Variation (TV) term in energy minimization. Therefore, we have our image prior loss:
$$\mathcal{L}_p\big(\mathcal{F}(M; \theta_L)\big) = \big\lvert \nabla \mathcal{F}(M; \theta_L) \big\rvert,$$
where $\nabla$ is the gradient operator.
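To make the two image separation terms concrete, the following NumPy sketch evaluates a pixel-wise content loss and a total-variation image prior on toy arrays. It is a minimal illustration under stated assumptions: the absolute-difference form, the mean reduction, the forward-difference gradient, and the function names are choices made here, not details confirmed by the text.

```python
import numpy as np


def content_loss(predicted, target):
    """Pixel-wise regression loss: mean absolute difference (assumed l1 form)."""
    return float(np.mean(np.abs(predicted - target)))


def tv_prior_loss(predicted):
    """Total-variation prior: mean absolute forward difference along x and y."""
    grad_x = np.abs(predicted[:, 1:] - predicted[:, :-1])
    grad_y = np.abs(predicted[1:, :] - predicted[:-1, :])
    return float(grad_x.mean() + grad_y.mean())


# Toy example: a horizontal ramp has unit gradient in x and zero gradient in y.
ramp = np.tile(np.arange(5, dtype=np.float64), (4, 1))
flat = np.zeros((4, 5))

ramp_tv = tv_prior_loss(ramp)
flat_tv = tv_prior_loss(flat)
offset_loss = content_loss(ramp + 0.5, ramp)
```

A weighted sum of the two values, with hypothetical weights, would play the role of the per-image separation objective.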
We design a U-Net architecture [18] for image separation, which has been used in various conditional generation tasks. Our image separation module consists of 22 convolutional layers. Each convolutional layer contains one convolution-ReLU pair except for the last layer, and we use element-wise addition for each skip connection to accelerate convergence. For the output layer, we use a “tanh” activation function to map the intensity values to between $-1$ and $1$. A detailed description of our network structure is provided in the supplemental material.
The output of our image separation module is a 6-channel image, where the first 3 channels represent the estimated left image and the remaining 3 channels represent the estimated right image. When the network converges, we can directly use these images as the image separation results. However, for the de-anaglyph task, there is an extra constraint (the mixture happens at the channel level), so we can leverage the color prior of an anaglyph: the desired image separation (colorization) can be further improved by warping the corresponding channels based on the estimated disparity maps.
For the monocular depth estimation task, only the right image will be needed as the left image has been provided as input.
3.3 Stereo Computation
The input to the stereo computation module is the separated stereo image pair from the image separation module. The supervision of this module is the ground truth stereo pairs rather than the inputs. The benefit of using ground truth stereo pairs for supervision is that the network learns not only how to find matching points but also how to extract features that are robust to the noise in the generated stereo images.
Fig. 2 shows an overview of our stereo computation architecture. We adopt a stereo matching architecture similar to that of Zhong et al. [19], without its consistency check module. The benefit of choosing such a structure is that their model can converge within 2000 iterations, which makes it possible to train the entire network in an end-to-end fashion. Additionally, removing the need for ground truth disparity maps gives us access to much more readily available stereo images.
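Our loss function for stereo computation is defined as:
$$\mathcal{L}_{stereo} = \alpha_a \, \mathcal{L}_a + \alpha_s \, \mathcal{L}_s,$$
where $\mathcal{L}_a$ denotes the image warping appearance loss, $\mathcal{L}_s$ expresses the smoothness constraint on the disparity map, and $\alpha_a$, $\alpha_s$ weight the two terms.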
We form a loss for evaluating image similarity by computing the pixel-wise $\ell_1$ distance between images. We also add a structural similarity term, SSIM [20], to improve robustness against illumination changes across images. The appearance loss is derived as:
$$\mathcal{L}_a = \frac{1}{N} \sum \lambda_1 \frac{1 - \mathrm{SSIM}(I_L, \tilde{I}_L)}{2} + \lambda_2 \big\lvert I_L - \tilde{I}_L \big\rvert,$$
where $N$ is the total number of pixels and $\tilde{I}_L$ is the reconstructed left image. $\lambda_1$ and $\lambda_2$ balance structural similarity against image appearance difference. According to , $\tilde{I}_L$ can be reconstructed in a fully differentiable manner from the right image and the right disparity map through bilinear sampling [21].
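The reconstruction step can be sketched in NumPy as follows. This is an illustrative sketch rather than the network's sampler: it assumes a rectified grayscale pair, a disparity map defined on the left view's pixel grid, the convention that a left pixel at column x matches the right pixel at column x - d, and clamping at the image border. The SSIM term is omitted, so only the absolute-difference part of the appearance loss is shown.

```python
import numpy as np


def warp_right_to_left(right, disparity):
    """Reconstruct the left view by sampling the right view at column x - d.

    Linear interpolation along each scanline is used, which is the 1D case
    of bilinear sampling for a rectified stereo pair.
    """
    height, width = disparity.shape
    # Source column in the right image for every pixel of the left view.
    src = np.arange(width, dtype=np.float64)[None, :] - disparity
    src = np.clip(src, 0.0, width - 1.0)  # border clamping is an assumption
    x0 = np.floor(src).astype(int)
    x1 = np.minimum(x0 + 1, width - 1)
    frac = src - x0
    rows = np.arange(height)[:, None]
    return (1.0 - frac) * right[rows, x0] + frac * right[rows, x1]


def l1_appearance(left, reconstructed):
    """Mean absolute difference between the left view and its reconstruction."""
    return float(np.mean(np.abs(left - reconstructed)))


# Toy example: the right view is a horizontal ramp and the true disparity is 2.
right = np.tile(np.arange(8, dtype=np.float64), (3, 1))
left = np.maximum(right - 2.0, 0.0)
disparity = np.full((3, 8), 2.0)

reconstructed = warp_right_to_left(right, disparity)
loss = l1_appearance(left, reconstructed)
```

Because the interpolation weights vary continuously with the disparity, the reconstruction changes smoothly with sub-pixel disparities, which is what allows gradients to flow back to the disparity estimate in a learned setting.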
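For the smoothness term, similar to , we leverage the Total Variation (TV) and weight it with the image’s gradients. Our smoothness loss for the disparity field is:
$$\mathcal{L}_s = \frac{1}{N} \sum \lvert \nabla_u d \rvert \, e^{-\lvert \nabla_u I_L \rvert} + \lvert \nabla_v d \rvert \, e^{-\lvert \nabla_v I_L \rvert},$$
where $d$ is the estimated disparity map and $\nabla_u$, $\nabla_v$ are the horizontal and vertical gradient operators.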
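An image-gradient-weighted TV term of this kind can be sketched as below. The exponential weighting, the forward differences, and the mean reduction are assumptions chosen for illustration (they follow a common edge-aware formulation), and the function name is hypothetical.

```python
import numpy as np


def edge_aware_smoothness(disparity, image):
    """TV of the disparity, down-weighted where the image itself has edges."""
    disp_dx = np.abs(disparity[:, 1:] - disparity[:, :-1])
    disp_dy = np.abs(disparity[1:, :] - disparity[:-1, :])
    img_dx = np.abs(image[:, 1:] - image[:, :-1])
    img_dy = np.abs(image[1:, :] - image[:-1, :])
    # Assumed weighting: exp(-|image gradient|), so image edges relax the penalty.
    return float(np.mean(disp_dx * np.exp(-img_dx))
                 + np.mean(disp_dy * np.exp(-img_dy)))


# Toy example: a disparity step between columns 3 and 4 of a 4 x 8 map.
disparity = np.zeros((4, 8))
disparity[:, 4:] = 1.0

flat_image = np.zeros((4, 8))   # no image edge at the disparity step
edge_image = np.zeros((4, 8))
edge_image[:, 4:] = 2.0         # image edge aligned with the disparity step

loss_flat = edge_aware_smoothness(disparity, flat_image)
loss_edge = edge_aware_smoothness(disparity, edge_image)
```

The disparity discontinuity is penalized less when it coincides with an image edge, which is the intended effect of weighting the TV term by image gradients.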
3.4 Implementation Details
We implement our network in TensorFlow [22] with 17.1M trainable parameters. Our network can be trained from scratch in an end-to-end fashion with the supervision of stereo pairs, and is optimized using RMSProp with an initial learning rate of . Input images are normalized so that pixel intensities range from -1 to 1. For the KITTI dataset, the input images are randomly cropped to , while for the Middlebury dataset, we use . We set the disparity level to 96 for the stereo computation module. For weighting loss components, we use . We set throughout our experiments. Due to hardware limitations (Nvidia Titan Xp), we use a batch size of 1 during network training.
4 Experiments and Results
In this section, we validate our proposed method and present experimental evaluations for both de-anaglyph and de-diplopia (double vision). For experiments on anaglyph images, given a pair of stereo images, the corresponding anaglyph image can be generated by combining the red channel of the left image and the green and blue channels of the right image. Any stereo pair can be used to quantitatively evaluate the performance of de-anaglyph. However, since we also need to quantitatively evaluate the performance of anaglyph stereo matching, we use two stereo matching benchmark datasets for evaluation: the Middlebury dataset [24] and KITTI stereo 2015 [25]. Our network is initially trained on the KITTI Raw dataset with 29000 stereo pairs listed by  and further fine-tuned on the Middlebury dataset. To highlight the generalization ability of our network, we also perform qualitative experiments on random images from the Internet. For de-diplopia (double vision), we synthesize our inputs by averaging stereo pairs. Qualitative and quantitative results are reported on the KITTI stereo 2015 benchmark [25] as well. Similar to the de-anaglyph experiment, we train our initial model on the KITTI Raw dataset.
4.1 Advantages of joint optimization
Our framework consists of image separation and stereo computation, where the solution of one subtask benefits the solution of the other. Direct stereo computation is impossible for a single mixture image. To analyze the advantage of joint optimization, we perform an ablation study on image separation without stereo computation; the results are reported in Table 1. Through joint optimization, the average PSNR increases from to , which demonstrates the benefit of introducing the stereo matching loss in image separation.
|Metric||Image separation only||Joint optimization|
4.2 Evaluation of Anaglyph Stereo
We compare the performance of our method with two state-of-the-art de-anaglyph methods: Joulin et al. [9] and Williem et al. [8]. Evaluations are performed on two subtasks: stereo computation and image separation (color restoration).
Stereo Computation. We present a qualitative comparison of estimated disparity maps in Fig. 3 for Middlebury [24] and in Fig. 4 for KITTI 2015 [25]. Stereo pairs in Middlebury are indoor scenes with multiple handcrafted layouts, and the ground truth disparities are captured by highly accurate structured light sensors. The KITTI stereo 2015 dataset, on the other hand, has 200 outdoor frames in its training set and is more challenging than the Middlebury dataset. Its ground truth disparity maps are generated from sparse LIDAR points and CAD models.
On both datasets, visual inspection shows that our method generates more accurate disparity maps than previous methods. This is further evidenced by the quantitative bad pixel percentages shown in Table 2 and Fig. 5. For the Middlebury dataset, our method achieves a  performance leap over Williem et al. [8] and a  performance leap over Joulin et al. [9]. This is reasonable, as Joulin et al. [9] did not include disparity in their optimization. For the KITTI dataset, we achieve an average bad pixel ratio (denoted as D1_all) of  with a 3-pixel threshold across the 200 images in the training set, as opposed to  by Joulin et al. [9] and  by Williem et al. [8].
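For reference, a bad pixel ratio with a 3-pixel threshold can be computed along the following lines. This is a simplified sketch: it treats ground-truth values of zero as invalid (an assumption about how sparse ground truth is encoded) and applies only the absolute threshold, whereas the official KITTI D1 metric additionally requires the error to exceed 5% of the true disparity. The function name is hypothetical.

```python
import numpy as np


def bad_pixel_ratio(predicted, ground_truth, threshold=3.0):
    """Fraction of valid pixels whose disparity error exceeds the threshold."""
    valid = ground_truth > 0  # assumption: zero marks missing ground truth
    error = np.abs(predicted - ground_truth)[valid]
    return float(np.mean(error > threshold))


# Toy example: one of the three valid pixels is off by more than 3 pixels.
ground_truth = np.array([[10.0, 20.0, 0.0, 30.0]])
predicted = np.array([[11.0, 25.0, 7.0, 30.0]])

ratio = bad_pixel_ratio(predicted, ground_truth)
```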
Image Separation. As an anaglyph image is generated by concatenating the red channel from the left image and the green and blue channels from the right image, the original colors can be found by warping the corresponding channels based on the estimated disparity maps. We leverage such a prior for de-anaglyph and adopt the post-processing step from Joulin et al. [9] to handle occluded regions. Qualitative and quantitative comparisons of image separation performance are conducted on the Middlebury and KITTI datasets. We employ the Peak Signal-to-Noise Ratio (PSNR) to measure the image restoration quality.
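As a reminder of how PSNR is computed, a minimal sketch follows. The peak value of 1.0 assumes images scaled to [0, 1]; for 8-bit images the peak would be 255. The function name and the toy images are illustrative.

```python
import numpy as np


def psnr(predicted, target, peak=1.0):
    """Peak Signal-to-Noise Ratio in dB between two images of equal shape."""
    mse = np.mean((predicted - target) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(peak ** 2 / mse))


# Toy example: a constant error of 0.1 gives an MSE of 0.01, i.e. about 20 dB.
target = np.zeros((4, 4, 3))
predicted = np.full((4, 4, 3), 0.1)

value = psnr(predicted, target)
```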
Qualitative results for both datasets are provided in Fig. 6 and Fig. 7. Our method is able to recover colors in the regions where ambiguous colorization options exist as those areas rely more on the correspondence estimation, while other methods tend to fail in this case.
Table 3 and Table 4 report the performance comparison between our method and state-of-the-art de-anaglyph colorization methods, Joulin et al. [9] and Williem et al. [8], on the Middlebury dataset and the KITTI dataset, respectively. For the KITTI dataset, we calculate the mean PSNR over all 200 images of the training set. Our method outperforms the others by a notable margin. Joulin et al. [9] recover relatively good restoration results when the disparity level is small, as for Tsukuba, Venus, and KITTI. When the disparity level doubles, as for the Cone and Teddy images, their performance drops quickly. Unlike Williem et al. [8], which can only generate disparity maps at pixel level, our method is able to further optimize the disparity map to sub-pixel level, and therefore achieves superior performance in both stereo computation and image restoration (separation).
|Dataset||View||Joulin ||Williem ||Ours|
Anaglyph in the wild. One of the advantages of conventional methods is their generalization capability: they can be easily adapted to different scenarios with or without parameter changes. Deep-learning-based methods, on the other hand, are more likely to be biased toward a specific dataset. In this section, we provide a qualitative evaluation of our method on anaglyph images downloaded from the Internet to illustrate its generalization capability. Our method, even though trained on the KITTI dataset, which is quite different from all these images, achieves reliable image separation results, as demonstrated in Fig. 8. This further confirms the generalization ability of our network model.
4.3 Evaluation for double-vision unmixing
Here, we evaluate our proposed method for unmixing a double-vision image, where the input image is the average of a stereo pair. Similar to anaglyph, we evaluate our performance based on the estimated disparities and the reconstructed stereo pair on the KITTI stereo 2015 dataset. For disparity evaluation, we use the oracle disparity maps (computed from clean stereo pairs) as a reference in Fig. 9. The mean bad pixel ratio of our method is , which is comparable with the oracle’s performance of . For image separation, we take a layer separation method  as a reference. A quantitative comparison is shown in Table 5. Conventional layer separation methods tend to fail in this scenario, as the statistical difference between the two mixed images is minor, which violates the assumption of these methods. Qualitative results of our method are shown in Fig. 10.
5 Beyond Anaglyph and Double-Vision
Our problem definition also covers the problem of monocular depth estimation, which aims at estimating a depth map from a single image [27, 2, 28, 3]. Under this setup, the image composition operator is defined as $f(I_L, I_R) = I_L$ or $f(I_L, I_R) = I_R$, i.e., the mixture image is the left image or the right image. Thus, monocular depth estimation is a special case of our problem definition.
We evaluated our framework for monocular depth estimation on the KITTI 2015 dataset. Quantitative and qualitative results are provided in Table 6 and Fig. 11, where we compare our method with state-of-the-art methods ,  and . Our method, even though designed for a much more general problem, outperforms both  and  and achieves results quite comparable with .
|Methods||Abs Rel||Sq Rel||RMSE||RMSE log|
by 10 epochs. Depth metrics are from Eigen et al. [27]. Our performance is better than the state-of-the-art method .
and our results. Since the ground truth depth points are very sparse, we interpolated them with a color-guided depth painting method for better visualization.
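The four depth metrics named in the table header (Abs Rel, Sq Rel, RMSE, RMSE log) follow standard definitions, sketched below. Treating zero ground-truth depth as invalid is an assumption about how sparse ground truth is encoded, and the function name and toy values are illustrative only.

```python
import numpy as np


def depth_metrics(predicted, ground_truth):
    """Abs Rel, Sq Rel, RMSE and RMSE log over pixels with valid ground truth."""
    predicted = np.asarray(predicted, dtype=np.float64)
    ground_truth = np.asarray(ground_truth, dtype=np.float64)
    valid = ground_truth > 0  # assumption: zero marks missing ground truth
    predicted, ground_truth = predicted[valid], ground_truth[valid]
    diff = predicted - ground_truth
    log_diff = np.log(predicted) - np.log(ground_truth)
    return {
        "abs_rel": float(np.mean(np.abs(diff) / ground_truth)),
        "sq_rel": float(np.mean(diff ** 2 / ground_truth)),
        "rmse": float(np.sqrt(np.mean(diff ** 2))),
        "rmse_log": float(np.sqrt(np.mean(log_diff ** 2))),
    }


# Toy example: three valid pixels, one of which is predicted at twice its depth.
metrics = depth_metrics([2.0, 2.0, 4.0, 9.0], [1.0, 2.0, 4.0, 0.0])
```

All four are error measures, so lower values indicate better depth estimates.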
6 Conclusion
This paper has defined a novel problem of stereo computation from a single mixture image, where the goal is to separate a single mixture image into a pair of stereo images, from which a legitimate disparity map can be estimated. This problem definition naturally unifies a family of challenging and practical problems such as anaglyph, diplopia and monocular depth estimation. The problem goes beyond the scope of conventional image separation and stereo computation. We have presented a deep convolutional neural network based framework that jointly optimizes the image separation module and the stereo computation module. It is worth noting that we do not need ground truth disparity maps in network learning. In the future, we will explore additional problem setups such as “alpha-matting”. Other issues, such as occlusion handling and extension to video, should also be considered.
Acknowledgements Y. Zhong’s PhD scholarship is funded by Data61. Y. Dai is supported in part by National 1000 Young Talents Plan of China, Natural Science Foundation of China (61420106007, 61671387), and ARC grant (DE140100180). H. Li’s work is funded in part by ACRV (CE140100016). The authors are very grateful to NVIDIA’s generous gift of GPUs to ANU used in this research.
- [1] Garg, R., Kumar, B.V., Carneiro, G., Reid, I.: Unsupervised CNN for single view depth estimation: Geometry to the rescue. In: Proc. Eur. Conf. Comp. Vis. (2016) 740–756
- [2] Godard, C., Mac Aodha, O., Brostow, G.J.: Unsupervised monocular depth estimation with left-right consistency. In: Proc. IEEE Conf. Comp. Vis. Patt. Recogn. (2017)
- [3] Li, B., Shen, C., Dai, Y., van den Hengel, A., He, M.: Depth and surface normal estimation from monocular images using regression on deep features and hierarchical CRFs. In: Proc. IEEE Conf. Comp. Vis. Patt. Recogn. (June 2015)
- [4] Bronstein, A.M., Bronstein, M.M., Zibulevsky, M., Zeevi, Y.Y.: Sparse ICA for blind separation of transmitted and reflected images. Int’l J. Imaging Science and Technology 15(1) (2005) 84–91
- [5] Yang, J., Li, H., Dai, Y., Tan, R.T.: Robust optical flow estimation of double-layer images under transparency or reflection. In: Proc. IEEE Conf. Comp. Vis. Patt. Recogn. (June 2016)
- [6] Li, Z., Tan, P., Tan, R.T., Zou, D., Zhou, S.Z., Cheong, L.F.: Simultaneous video defogging and stereo reconstruction. In: Proc. IEEE Conf. Comp. Vis. Patt. Recogn. (June 2015) 4988–4997
- [7] Levin, A., Lischinski, D., Weiss, Y.: A closed-form solution to natural image matting. IEEE Trans. Pattern Anal. Mach. Intell. 30(2) (Feb 2008) 228–242
- [8] Williem, Raskar, R., Park, I.K.: Depth map estimation and colorization of anaglyph images using local color prior and reverse intensity distribution. In: Proc. IEEE Int. Conf. Comp. Vis. (Dec 2015) 3460–3468
- [9] Joulin, A., Kang, S.B.: Recovering stereo pairs from anaglyphs. In: Proc. IEEE Conf. Comp. Vis. Patt. Recogn. (2013) 289–296
- [10] Li, Y., Brown, M.S.: Single image layer separation using relative smoothness. In: Proc. IEEE Conf. Comp. Vis. Patt. Recogn. (June 2014)
- [11] Fan, Q., Yang, J., Hua, G., Chen, B., Wipf, D.: A generic deep architecture for single image reflection removal and image smoothing. In: Proc. IEEE Int. Conf. Comp. Vis. (2017)
- [12] Liu, C., Yuen, J., Torralba, A.: SIFT flow: Dense correspondence across scenes and its applications. IEEE Trans. Pattern Anal. Mach. Intell. 33(5) (May 2011) 978–994
- [13] Xie, J., Girshick, R., Farhadi, A.: Deep3D: Fully automatic 2D-to-3D video conversion with deep convolutional neural networks. In: Proc. Eur. Conf. Comp. Vis. (2016) 842–857
- [14] Iizuka, S., Simo-Serra, E., Ishikawa, H.: Let there be color!: Joint end-to-end learning of global and local image priors for automatic image colorization with simultaneous classification. ACM Trans. Graph. 35(4) (July 2016) 110:1–110:11
- [15] Zhang, R., Isola, P., Efros, A.A.: Colorful image colorization. In: Proc. Eur. Conf. Comp. Vis. (2016)
- [16] Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. In: Proc. IEEE Conf. Comp. Vis. Patt. Recogn. (2017)
- [17] Levin, A., Weiss, Y.: User assisted separation of reflections from a single image using a sparsity prior. IEEE Trans. Pattern Anal. Mach. Intell. 29(9) (September 2007) 1647–1654
- [18] Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional networks for biomedical image segmentation. International Conference on Medical Image Computing and Computer-Assisted Intervention (2015) 234–241
- [19] Zhong, Y., Dai, Y., Li, H.: Self-supervised learning for stereo matching with self-improving ability. arXiv:1709.00930 (2017)
- [20] Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Proc. 13(4) (April 2004) 600–612
- [21] Jaderberg, M., Simonyan, K., Zisserman, A., Kavukcuoglu, K.: Spatial transformer networks. In: Proc. Adv. Neural Inf. Process. Syst. (2015) 2017–2025
- [22] Abadi, M., Agarwal, A., Barham, P., et al.: TensorFlow: Large-scale machine learning on heterogeneous distributed systems. CoRR abs/1603.04467 (2016)
- [23] Tieleman, T., Hinton, G.: Lecture 6.5 - RmsProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning (2012)
- [24] Scharstein, D., Szeliski, R.: A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. Int. J. Comp. Vis. 47(1-3) (April 2002) 7–42
- [25] Menze, M., Geiger, A.: Object scene flow for autonomous vehicles. In: Proc. IEEE Conf. Comp. Vis. Patt. Recogn. (2015)
- [26] Li, Y., Brown, M.S.: Single image layer separation using relative smoothness. In: Proc. IEEE Conf. Comp. Vis. Patt. Recogn. (2014) 2752–2759
- [27] Eigen, D., Puhrsch, C., Fergus, R.: Depth map prediction from a single image using a multi-scale deep network. In: Proc. Adv. Neural Inf. Process. Syst. (2014)
- [28] Li, B., Dai, Y., He, M.: Monocular depth estimation with hierarchical fusion of dilated CNNs and soft-weighted-sum inference. Pattern Recognition 83 (2018) 328–339
- [29] Zhou, T., Brown, M., Snavely, N., Lowe, D.G.: Unsupervised learning of depth and ego-motion from video. In: Proc. IEEE Conf. Comp. Vis. Patt. Recogn. (2017)
- [30] Yang, J., Ye, X., Li, K., Hou, C., Wang, Y.: Color-guided depth recovery from RGB-D data using an adaptive autoregressive model. IEEE Trans. Image Proc. 23 (Aug 2014) 3443–3458