Learnable Exposure Fusion for Dynamic Scenes

04/04/2018 · Fahd Bouzaraa et al. · HUAWEI Technologies Co., Ltd. · Technische Universität München

In this paper, we focus on Exposure Fusion (EF) [16] for dynamic scenes. The task is to fuse multiple images obtained by exposure bracketing to create an image which comprises a high level of details. Typically, such an image cannot be obtained directly from a camera due to hardware limitations, e.g., a limited dynamic range of the sensor. A major problem of such tasks is that the images may not be spatially aligned due to scene motion or camera motion. It is known that the required alignment via image registration is an ill-posed problem. In this case, the images to be aligned vary in their intensity range, which makes the problem even more difficult. To address the mentioned problems, we propose an end-to-end Convolutional Neural Network (CNN) based approach to learn to estimate exposure fusion from 2 and 3 Low Dynamic Range (LDR) images depicting different scene contents. To the best of our knowledge, no efficient and robust CNN-based end-to-end approach can be found in the literature for this kind of problem. The idea is to create a dataset with perfectly aligned LDR images to obtain ground-truth exposure fusion images. At the same time, we obtain additional LDR images with some motion, having the same exposure fusion ground truth as the perfectly aligned LDR images. This way, we can train an end-to-end CNN on misaligned LDR input images, but with a proper ground-truth exposure fusion image. We propose a specific CNN architecture to solve this problem. In various experiments, we show that the proposed approach yields excellent results.


1 Introduction

High Dynamic Range imaging (HDRI) emerged in recent years as a major research topic in the context of computational photography, where the main purpose is to bridge the gap between the dynamic range native to the scene and the relatively limited dynamic range of the camera. As a result, the level of details in the captured LDR image is significantly enhanced, so that the final image presents a balanced contrast and saturation in all parts of the scene. The most common approach to render an HDR image with a camera presenting a limited dynamic range is called Exposure Bracketing [17, 2, 18]. It relies on merging several LDR images of the same scene captured with different exposures. By alternating the exposure settings between under-exposure and over-exposure, the input stack of LDR images contains various sets of details in different areas of the scene. These details are combined together into a single HDR image by estimating the inverse of the Camera Response Function (CRF). The resulting image finally undergoes a tone-mapping operation for display purposes.
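For concreteness, the merge step of exposure bracketing boils down to a weighted per-pixel average of the linearized exposures. The following sketch is purely illustrative and not part of the paper's method: the function name, the hat-shaped weighting and the assumption that the inverse CRF is available as a lookup table are all choices made for the example.

```python
import numpy as np

def merge_bracketed(images, exposure_times, inv_crf, eps=1e-6):
    """Merge differently exposed LDR images into a linear radiance map.

    images         : list of HxWxC uint8 arrays (the exposure stack)
    exposure_times : list of exposure times in seconds
    inv_crf        : 256-entry array mapping pixel values to linear
                     exposure (the estimated inverse CRF) -- an assumption
                     made for this sketch
    """
    acc = np.zeros(images[0].shape, dtype=np.float64)
    weight_sum = np.zeros_like(acc)
    for img, t in zip(images, exposure_times):
        # Hat weighting: trust mid-range pixels more than extreme ones.
        w = 1.0 - np.abs(img.astype(np.float64) / 255.0 - 0.5) * 2.0
        radiance = inv_crf[img] / t          # linearize, normalize by exposure
        acc += w * radiance
        weight_sum += w
    return acc / (weight_sum + eps)          # per-pixel weighted average
```

The resulting radiance map would then be tone-mapped for display, as described above.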

Another alternative is called Exposure Fusion (EF) [16]. The main difference between the two approaches is that exposure fusion directly merges the input LDR images to produce a final high-quality LDR image without using the CRF. The visual characteristics of the resulting fused image are similar to those of a tone-mapped HDR image (pseudo-HDR image). The direct merging of the input images represents a clear advantage over exposure bracketing, since prior information about the exposure settings is not needed and no estimation of the inverse CRF is required. This reduces the computational complexity while still yielding high-quality enhancement results.
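As a rough illustration of how exposure fusion weighs its inputs, the sketch below computes the per-pixel quality measures (contrast, saturation, well-exposedness) used by Mertens et al. [16]. It omits the multi-resolution blending of the weighted stack and uses OpenCV only for convenience; the function name and default parameters are illustrative, not taken from the paper.

```python
import numpy as np
import cv2

def fusion_weights(img, wc=1.0, ws=1.0, we=1.0, sigma=0.2):
    """Per-pixel quality weights in the spirit of Mertens-style exposure fusion.

    img : HxWx3 float32 array in [0, 1]
    Returns an HxW weight map combining contrast, saturation and
    well-exposedness (raised to the exponents wc, ws, we).
    """
    gray = cv2.cvtColor(img.astype(np.float32), cv2.COLOR_BGR2GRAY)
    contrast = np.abs(cv2.Laplacian(gray, cv2.CV_32F))            # local contrast
    saturation = img.std(axis=2)                                  # channel spread
    well_exposed = np.prod(np.exp(-((img - 0.5) ** 2) / (2 * sigma ** 2)), axis=2)
    return (contrast ** wc) * (saturation ** ws) * (well_exposed ** we) + 1e-12

# The final image blends the normalized weighted stack, typically with
# Laplacian-pyramid blending to avoid visible seams.
```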

Both exposure bracketing and exposure fusion are based on the assumption that the input LDR images are aligned. However, misalignment due to camera or scene motion will almost always occur, especially when the input images are captured sequentially. As a result, the output image contains strong artifacts where several instances of the same object can be seen. These artifacts are known as the Ghost Effect. Whether the HDRI system is based on exposure bracketing or exposure fusion, removing these artifacts from the final image is a very challenging task.

In this work, we aim to take advantage of the latest advances achieved by Convolutional Neural Networks (CNNs) in classification and image enhancement. In a nutshell, our main goal is to combine the tasks of detail enhancement and the removal of motion-induced ghost artifacts into a single framework. This is achieved by creating an end-to-end mapping which learns exposure fusion for dynamic scenes. In other words, our trained model yields a final artifact-free image with a wider range of details, based on input LDR images presenting motion-related scene differences and color/exposure differences. Similar to exposure fusion, the output of our trained model is an LDR image, as no true HDR transformation takes place. However, the visual attributes of the resulting image allow it to be labeled as a pseudo-HDR image.

We test our learnable exposure fusion approach for dynamic scenes on several indoor and outdoor scenes and show that the quality of our results improves upon state-of-the-art approaches. We also show that our approach is capable of handling extreme cases in terms of motion and exposure difference between the input images while maintaining a very low execution time. This makes it suitable for low-end capturing devices such as smartphones.

2 Related Work

Rendering an artifact-free HDR or pseudo-HDR image of a dynamic scene is a thoroughly investigated topic. Several approaches claim to successfully handle the misalignment and the associated inconsistencies, so that the final image is ghost- and blur-free. These methods can be split into two major categories.

The first category falls under the scope of the de-ghosting methods. The idea behind these approaches is to select an LDR image from the input stack and use it as a reference in order to detect inconsistencies caused by dynamic pixels in the non-reference images. The subsequent merging procedure aims at discarding the detected inconsistencies from the resulting image. De-ghosting approaches are the methods of choice in scenarios where the computational cost of the enabling algorithm needs to be low. Nonetheless, scenes with large exposure and scene differences can be very challenging for these methods. Motion-related artifacts can still be seen in the case of non-rigid motion or large perspective differences with respect to the reference LDR image. A detailed examination of these methods is provided in [23].

The second category is composed of approaches relying on sparse and/or dense pixel correspondences in order to align the input images. The alignment can be either spatial where the non-reference LDR images are warped to the view of the selected reference image, or color-based by aligning the reference LDR image to each non-reference LDR image separately using color mapping. In both cases, the goal is to form a stack of aligned but differently exposed LDR images corresponding to the view of the reference image.

In [11], Kang et al. introduced an approach which uses optical flow to align the differently exposed input images in the context of video HDRI. Likewise, Zimmer et al. propose in [26] a joint framework for super-resolution and HDRI by aligning all images to the reference view using optical flow. The described approach gets around the issue of color inconsistency for optical flow by including a gradient constancy assumption in the data term of the energy function. Alternatively, Sen et al. describe in [21] a solution for simultaneous HDR image reconstruction and alignment of the input images using a joint patch-based minimization framework. The alignment is based on a modified version of the PatchMatch (PM) [1] algorithm. The final HDR image is rendered from the well-exposed regions of the reference LDR image and from the remaining stack of LDR images for low-exposed regions in the reference. Likewise, Hu et al. propose in [9] to align every non-reference LDR image to the selected reference image, which typically has the highest number of well-exposed pixels. The patch-based alignment approach uses the generalized PM algorithm for well-exposed patches in the reference LDR image and suggests an additional modification to PM for over- or under-exposed patches in the reference image. The final HDR image is composed using the exposure fusion algorithm. More recently, Gallo et al. proposed in [5] an approach based on the matching of sparse feature points between the designated reference and non-reference images. The matcher developed for this purpose is robust towards saturation. Once a dense flow field is interpolated, the warped images and the reference LDR image are merged using a modified exposure fusion algorithm which minimizes the effects of faulty alignment.

These methods usually achieve accurate alignment results, which in turn helps create an artifact-free final HDR or pseudo-HDR image. However, their main limitation is the high computational cost, which hinders their deployment on devices with limited computational resources such as smartphones. In addition, smaller stacks of input LDR images with significant exposure and scene differences due to large motion hamper the generation of an artifact-free final image.

2.1 Convolutional Neural Networks

Recently, Convolutional Neural Networks [15] were successfully deployed to low-level image processing applications such as Image Denoising [25, 14] or Image Super Resolution [3]. The notable quality enhancement brought by CNNs to these applications explains our interest in developing a fast and robust learnable exposure fusion for challenging dynamic scenes.

Considering our target application, taking the global input properties into account is fundamental. The common CNN approach of applying a sliding kernel window for single-pixel prediction is expected to be computationally demanding and limited in terms of accounting for the global properties of the input images. Alternative CNN architectures have recently been proposed. Among these, the FlowNet architecture introduced by Dosovitskiy et al. in [4] for motion vector estimation seems to be well-suited to our learnable exposure fusion application.

The concept of FlowNet is based on a two-stage architecture with a contractive part and a following refinement part. Both parts are connected by means of long-range links. The contractive side of the network is composed of a succession of convolutional layers with a gradually decreasing spatial resolution of the feature maps. Accordingly, the refinement part contains a sequence of de-convolutional layers with a gradual increase of the corresponding spatial resolution.

The contractive part of the network is responsible for extracting high-level features and spatially down-sizing the feature maps. This in turn enables the effective aggregation of information over large areas of the input images. However, the output feature maps at the bottom of this part have a low spatial resolution. Consequently, the role of the refinement part is to simulate a coarse-to-fine reconstruction of the downsized representations by gradually up-sampling the feature maps and concatenating them with size-matching feature maps from the contractive part. This allows for a more reliable recovery of the details lost on the contractive side of the network. The final output is typically a dense per-pixel representation with the same resolution as the input images.
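A minimal sketch of such a contractive/refinement network with long-range links is given below. It is written in PyTorch purely for illustration (the paper's experiments use Caffe), and the depths, channel counts and kernel sizes are assumptions rather than the authors' configuration.

```python
import torch
import torch.nn as nn

class ContractRefineNet(nn.Module):
    """Toy FlowNet-style encoder-decoder with long-range skip links."""
    def __init__(self, in_ch=6, out_ch=3, base=32):
        super().__init__()
        # Contractive part: stride-2 convolutions halve the resolution.
        self.enc1 = nn.Sequential(nn.Conv2d(in_ch, base, 3, stride=2, padding=1), nn.ReLU(inplace=True))
        self.enc2 = nn.Sequential(nn.Conv2d(base, base * 2, 3, stride=2, padding=1), nn.ReLU(inplace=True))
        self.enc3 = nn.Sequential(nn.Conv2d(base * 2, base * 4, 3, stride=2, padding=1), nn.ReLU(inplace=True))
        # Refinement part: stride-2 deconvolutions restore the resolution,
        # concatenated with size-matching encoder feature maps.
        self.dec3 = nn.Sequential(nn.ConvTranspose2d(base * 4, base * 2, 4, stride=2, padding=1), nn.ReLU(inplace=True))
        self.dec2 = nn.Sequential(nn.ConvTranspose2d(base * 4, base, 4, stride=2, padding=1), nn.ReLU(inplace=True))
        self.dec1 = nn.ConvTranspose2d(base * 2, out_ch, 4, stride=2, padding=1)

    def forward(self, x):
        e1 = self.enc1(x)                         # 1/2 resolution
        e2 = self.enc2(e1)                        # 1/4 resolution
        e3 = self.enc3(e2)                        # 1/8 resolution
        d3 = self.dec3(e3)                        # back to 1/4
        d2 = self.dec2(torch.cat([d3, e2], 1))    # long-range link from encoder
        return self.dec1(torch.cat([d2, e1], 1))  # dense per-pixel output
```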

In the next sections, we first experiment with a basic FlowNet architecture for the purpose of learning exposure fusion for dynamic scenes. Next, based on the analysis of the FlowNet-based results, we propose a more elaborate architecture which fits the requirements of our application.

3 FlowNet-based Experiments

3.1 Dataset

The set of images used to train our learnable exposure fusion model typically consists of several scenes. Each scene comprises differently-exposed LDR images depicting various scene contents due to motion, together with the corresponding artifact-free exposure fusion image, which is rendered from aligned but differently-exposed instances of the selected reference LDR image. However, capturing differently exposed but aligned LDR images in a sequential manner is a challenging task, as no motion can be tolerated. We found that we can circumvent this issue by relying on stereo datasets. In fact, the stereo setup provides us with the required configuration for our training set, as one camera (left or right) can be set as the reference view and hence used to obtain the aligned but differently exposed input images as well as the reference LDR image. On the other hand, the second camera provides the needed motion as a result of the spatial shift between both capturing devices. Consequently, the images captured using the second camera are differently exposed and depict a different scene content in comparison to the selected reference image from the first camera.

Our training set is a combination of different datasets. It is based on the Middlebury stereo sets proposed by Scharstein and Pal in [20] and by Scharstein et al. in [19]. Some of these sets contain several scenes composed of differently exposed LDR images of the left view as well as additional exposures of the right view, while others offer different exposures of each view. The Middlebury datasets contain challenging scenes, especially in terms of exposure differences and saturated images, but lack the required scene diversity as they are captured in a controlled indoor environment. In order to compensate for the lack of outdoor scenes, we create a second stereo-based dataset using IDS uEye cameras with identical settings. This complementary dataset is composed of several outdoor scenes, each containing LDR images of the left view and additional LDR images of the right view. We are therefore able to train our model on indoor and outdoor scenes simultaneously. Additionally, we apply data augmentation by flipping each image of the training set along the vertical, horizontal and diagonal axes in order to increase the size of the training set.
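The flipping-based augmentation can be expressed in a few lines. The sketch below assumes that the "diagonal" flip means flipping along both axes, which is one possible reading; the function name is ours.

```python
import numpy as np

def augment_flips(image):
    """Return the image plus its vertical, horizontal and diagonal flips.

    The same transform must be applied to every LDR image of a scene and
    to its ground-truth fusion image so that the scene stays consistent.
    """
    return [
        image,
        image[:, ::-1],        # flip along the vertical axis (left-right)
        image[::-1, :],        # flip along the horizontal axis (up-down)
        image[::-1, ::-1],     # flip along both axes ("diagonal", assumed)
    ]
```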

For our initial tests, we limit the number of input images to a pair of under-exposed and over-exposed LDR images. For all input pairs in the training set, we set the under-exposed LDR image from the left camera as the reference input image and the over-exposed LDR image from the right camera as the non-reference input image. This means that the ground-truth exposure fusion image used for training in each sequence is obtained from the reference image and the over-exposed image of the left camera. Moreover, we only select LDR image pairs with a sufficiently large exposure ratio between the dark left and bright right images. This guarantees that the trained model is capable of handling input images with large exposure ratios. The resulting training set consists of pairs of input LDR images together with the ground-truth exposure fusion image of the reference view; additional pairs of dark and bright LDR images are reserved for validation purposes.

3.2 FlowNet-Based Tests

We experiment in this section with a basic FlowNet architecture composed of convolutional layers in the contractive part and deconvolutional layers in the refinement part. We use the Caffe framework [10] for the implementation of the layers of the FlowNet network. Concerning the network parameters, we use the same number of filters for all convolutional and deconvolutional layers together with a fixed convolutional (and deconvolutional) kernel size. The stride of all convolutions and deconvolutions is chosen so that it enables the down-sampling (or up-sampling) of the feature maps, acting therefore as a pooling layer. The input LDR images are spatially resized and presented to the network as a concatenated input tensor. Furthermore, the learning rate decreases according to a polynomial decay scheme, and training is performed with momentum. The training and testing of our models are conducted on an NVIDIA TITAN X.
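For reference, Caffe's polynomial ("poly") decay policy computes the learning rate as lr = base_lr · (1 − iter/max_iter)^power. A minimal sketch follows, with placeholder values, since the paper's exact base learning rate and power are not given here.

```python
def poly_lr(base_lr, iteration, max_iter, power):
    """Caffe-style polynomial learning-rate decay:
    lr = base_lr * (1 - iter / max_iter) ** power.
    """
    return base_lr * (1.0 - iteration / float(max_iter)) ** power

# Example with assumed values (not taken from the paper):
# lr_at_10k = poly_lr(base_lr=1e-4, iteration=10_000, max_iter=100_000, power=0.5)
```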

Figure 1: Learnable exposure fusion results (d) on input LDR images (a,b) from the validation set using the FlowNet architecture, alongside the corresponding ground-truth exposure fusion image (c) of the reference view (a). Although the rendered CNN-based image presents an improved level of details, artifacts are visible in many areas of the image, as shown in the highlighted regions. Input images courtesy of Daniel Scharstein.

An example of an output image obtained from a pair of LDR images belonging to the validation set is shown in Fig. 1. Although the trained model significantly expands the range of depicted details in comparison to the input reference LDR image, clear square-shaped artifacts can be seen. These artifacts are explained by the fact that the refinement stage of the network is unable to reconstruct the image details lost due to down-sampling in the contractive part. In addition, the trained model is conflicted between learning to improve the representation of details in all regions of the reference image and learning to suppress the inconsistencies between the input images due to the motion/scene difference. Finally, the ground-truth exposure fusion images used for training our model might constrain a wider expansion of the range of depicted details. This is particularly observed when the exposure ratio between the input LDR images used to create the ground-truth exposure fusion image is very high. In such cases, the exposure fusion algorithm used to create our ground-truth image produces visible artifacts in areas which are simultaneously under-exposed and over-exposed in the input images.

Considering all these observations, we propose several modifications to the FlowNet architecture used previously. The main goals of the modifications are:

  • Reduce the square-shaped artifacts in the output image by improving the connection between the contractive and refinement parts of the network.

  • Propose an alternative formulation for the task of learnable exposure fusion for dynamic scenes by breaking it down into several sub-problems that are easier to model.

  • Ensure a high image quality for the ground-truth exposure fusion images in the training set. The goal here is to increase the level of depicted details in the ground-truth images for all possible scenarios including challenging cases in terms of motion and color differences.

  • Integrate available priors such as the exposure fusion images created directly from the input LDR images. Although these images contain ghost-artifacts, they present valuable priors to our model. In the following, we will refer to these images as ghost-fused images.

4 Proposed Modifications

4.1 Reducing Reconstruction Artifacts

One of the limitations noticed in the results of the basic FlowNet architecture is the loss of details in the contractive part, which hinders their accurate reconstruction during the refinement stage of the network. To tackle this issue, we propose to modify the basic FlowNet architecture as suggested in [6].

Similar to the basic FlowNet, the network architecture proposed in [6] is composed of a contractive part and a refinement part. However, the difference to the original FlowNet lies in the additional long-range links (concatenations) which connect both parts. The added inputs enforce the redundancy of inherent information before each layer, which in turn enables a better recovery of the details at the output. At each convolutional or deconvolutional layer, the feature maps representing different high-level abstractions from previous layers are combined (concatenated) into a single input chunk after adjusting their resolutions accordingly. The output of each layer is either convolved and/or deconvolved using the corresponding stride in order to match the dimensions of the target layers. The resulting feature maps are then concatenated to the input of the corresponding later layer.
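A sketch of this concatenation of resolution-adjusted feature maps is shown below. It uses bilinear resampling in PyTorch for brevity, whereas the network described here adjusts resolutions with strided convolutions and deconvolutions, so the helper is illustrative only.

```python
import torch
import torch.nn.functional as F

def gather_inputs(feature_maps, target_hw):
    """Concatenate feature maps from earlier layers into one input chunk.

    Each map is resampled to the target layer's spatial resolution before
    concatenation along the channel axis.
    """
    resized = [F.interpolate(f, size=target_hw, mode='bilinear', align_corners=False)
               for f in feature_maps]
    return torch.cat(resized, dim=1)
```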

4.2 Simplifying the Task Formulation

Figure 2: Representation of the proposed architecture composed of the color mapping, exposures merging and guided de-ghosting sub-networks.

Numerous state-of-the-art approaches [9, 21, 8] perform several pre-processing operations on the input images prior to the actual HDR/pseudo-HDR merging step. These operations aim at aligning the input images to the selected reference view. By analogy, we propose to split the CNN-based rendering task into three main sub-problems: color mapping, exposures merging and guided de-ghosting. Each sub-problem is represented through a FlowNet-inspired sub-network with the modifications proposed in [6]. Accordingly, these sub-networks are connected together so that they form the desired end-to-end mapping between the input pair of LDR images and the output image of the reference view. An illustration of the proposed configuration is shown in Fig. 2.

The first convolutional sub-network learns the color mapping model between the differently exposed input LDR images. More specifically, this step aims at estimating the over-exposed instance of the reference under-exposed LDR image. Training such a model is possible since our dataset contains the differently exposed instances of each view, which we originally used to create the ground-truth exposure fusion images.

Next, the estimate of the over-exposed version of the reference LDR image is forwarded to the subsequent exposures merging sub-network. In theory, the reference LDR image and the output of the previous color mapping sub-network are the only images required for generating the output (pseudo-HDR) image of the reference view. However, our tests have shown that providing the input non-reference image significantly enhances the quality of the output image, as it contains scene details which might not be present in the reference LDR image or in the output of the color mapping stage. On the other hand, the perspective shift in comparison to the reference image causes no visible artifacts in the final image. Furthermore, we provide the so-called ghost-fused image as an additional input to the second sub-network. As mentioned earlier, the ghost-fused image is obtained from the input LDR images using the exposure fusion algorithm and contains ghost artifacts due to the difference in scene content between the input images.

The final guided de-ghosting sub-network enables the enhancement of the previously estimated pseudo-HDR image, using the ghost-fused image as an additional input for this purpose. Thus, the input to the third sub-network is composed of the ghost-fused image and the initial output (pseudo-HDR) image estimate from the previous sub-network. The final output image of the guided de-ghosting sub-network contains more details and hence a wider dynamic range than the estimate of the second sub-network.
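Schematically, the data flow through the three sub-networks can be summarized as in the sketch below. All names are placeholders and the exact ordering of the concatenated inputs is an assumption; only the overall structure follows Fig. 2.

```python
import torch

def pseudo_hdr_pipeline(ref_dark, nonref_bright,
                        color_map_net, merge_net, deghost_net, exposure_fusion):
    """Schematic forward pass through the three sub-networks (Fig. 2).

    exposure_fusion is any implementation of the classic algorithm; it is
    applied to the unaligned inputs to obtain the "ghost-fused" prior.
    """
    # 1) Color mapping: estimate an over-exposed instance of the reference view.
    ref_bright_est = color_map_net(ref_dark)

    # 2) Classic exposure fusion of the unaligned inputs -> ghost-fused prior.
    ghost_fused = exposure_fusion([ref_dark, nonref_bright])

    # 3) Exposures merging: fuse the reference, its color-mapped instance,
    #    the non-reference image and the ghost-fused prior.
    merged = merge_net(torch.cat([ref_dark, ref_bright_est, nonref_bright, ghost_fused], dim=1))

    # 4) Guided de-ghosting: refine the estimate using the ghost-fused prior.
    return deghost_net(torch.cat([merged, ghost_fused], dim=1))
```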

4.3 Improving the Quality of the Ground-Truth Images

As mentioned previously, the generation of the ground-truth exposure fusion images used for training is based on the exposure fusion algorithm. Apart from its relative straightforwardness and performance stability, exposure fusion does not require any priors on the input stack of LDR images, such as the corresponding exposure times or ratio. However, in the case of input LDR images with large exposure difference, the resulting output image contains visible artifacts as shown in Fig. 3.

Figure 3: Illustration of the difference between a ground-truth image fused from only a pair of LDR images (a) and a ground-truth image fused from the full stack of available LDR images (b). Clearly, the latter is more suitable for training the desired exposure fusion for dynamic scenes model.

Training on low-quality ground-truth images therefore negatively impacts the performance of the model in terms of detail expansion. To solve this problem, we propose to use ground-truth exposure fusion images composed from the full stack of available LDR images. For example, in the case of the dataset we created using the uEye cameras, the ground-truth exposure fusion image of each scene is obtained by merging the differently exposed instances of the dark (under-exposed) view. This way, our model does not only render a ghost-free, visually enhanced image, but also simulates the case where more input LDR images are available than are actually fed to the network. This allows us to deal with very challenging cases in terms of exposure differences and number of input images. Fig. 3 shows the quality difference between the pair-based ground-truth image and the full-stack-based ground-truth image.
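One practical way to generate such full-stack ground-truth images is OpenCV's implementation of exposure fusion (MergeMertens). The snippet below is a sketch of this step and not necessarily the tooling used by the authors.

```python
import cv2

def ground_truth_fusion(aligned_ldr_stack):
    """Ground-truth pseudo-HDR image from a fully aligned exposure stack.

    aligned_ldr_stack : list of aligned uint8 BGR images of the reference
    view (the full stack, not just the pair fed to the network).
    OpenCV's MergeMertens implements the exposure fusion algorithm; the
    result is roughly in [0, 1] and is scaled back to 8 bit here.
    """
    fusion = cv2.createMergeMertens().process(aligned_ldr_stack)
    return (fusion.clip(0, 1) * 255).round().astype('uint8')
```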

5 Experiments

Taking into account the proposed modifications of the basic FlowNet architecture, we train an exposure fusion for dynamic scenes model according to the architecture presented in Fig. 2. The color mapping sub-network is composed of convolutional and deconvolutional layers using the design modifications proposed in [6], and the remaining sub-networks are likewise composed of convolutional and deconvolutional layers following the design of [6]. We use the same filter number for all layers of the color mapping sub-network and reduce this number for the remaining sub-networks. Otherwise, the network configuration is similar to the initial tests made using the basic FlowNet architecture.

Figure 4: Visual comparison between the methods of Sen et al. (b), Hu et al. (c) and our results (d), together with the corresponding execution times (only for the alignment part in the case of Sen et al. and Hu et al.). Note that we used exposure fusion to generate the corresponding result images for Sen et al. and Hu et al. Clearly, our results are free from artifacts and yield the largest expansion of the range of details, despite the challenging nature of the scene in terms of the exposure time difference between the input LDR images (a) and the depicted motion. Input images courtesy of Okan Tarhan Tursun [24].

Figure 4 contains a comparison between the methods of Sen et al. [21], Hu et al. [9] and our proposed model. The methods of Sen et al. and Hu et al. propose a framework similar to ours, namely one that initially aligns the input non-reference LDR images to the view of the reference image. For these comparisons, we used the MATLAB implementations provided by the authors. Based on these implementations, the input LDR images are aligned colorwise to the designated reference LDR image and subsequently merged into the final pseudo-HDR image using exposure fusion. We also sought to compare our results to the method of Gallo et al. introduced in [5], but no source code or executable of this approach was available.

The scenes presented in Fig. 4 belong neither to the training nor to the validation set. Various types of motion, such as object and/or scene motion, are represented in these images. In addition, the ground-truth exposure fusion images for these scenes are not available, as the aligned and differently exposed instances of the reference LDR images are not provided [24]. Despite the large exposure and perspective differences between the input LDR images, our model successfully extends the dynamic range of the reference LDR image with no artifacts related to moving objects in the scenes. This is however not the case for the results of Sen et al. and Hu et al., where the image alignment based on PatchMatch fails to track dynamic objects, especially in the over- or under-exposed areas. This results in clear artifacts, as shown in Fig. 4. Furthermore, the improved detail representation of our results comes with low execution times in comparison to the other methods. All experiments were conducted on a computer with a standard configuration.

Moreover, our results exhibit an extended dynamic range, far beyond the dynamic range available in the input LDR images. This can also be observed on the challenging scene shown in Fig. 5. Our output image presents a well-balanced quality in terms of details and color representation. This is particularly notable in the indoor part of the scene (see the highlighted areas in Fig. 5), where the comparison shows that our model is the only method capable of retrieving and depicting the details in this area. Accordingly, the extended dynamic range is a result of the modifications introduced above, where we train our model on ground-truth exposure fusion images merged from the full stack of available LDR images.

Figure 5: Visual comparison between the methods of Sen et al. [21] (a), Hu et al. (2012) [8] (b), Hu et al. (2013) [9] (c) and our results (d), together with the corresponding execution times. Note that we used exposure fusion to generate the results for Sen et al. and Hu et al. (2013). Our rendering model, trained on ground-truth exposure fusion images obtained from an extended stack of input LDR images, significantly extends the dynamic range of the reference LDR image, so that details are visible in all areas, most importantly in regions which are under- and/or over-exposed in the input images (highlighted areas). Images courtesy of Jun Hu [8].

Additionally, we tested the performance of our proposed model on scenarios where the input stack consists of three LDR images. In such cases, the input stack is composed of an under-exposed (dark) image, a mid-exposed image and an over-exposed (bright) image. The mid-exposed LDR image is designated as the corresponding reference view, as it contains more scene details than the under- and over-exposed images. This implies that the network configuration discussed earlier needs to be modified, since two color mapping sub-networks are now required in order to obtain the estimates of the under-exposed as well as the over-exposed instances of the reference image.
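Extending the schematic pipeline of Section 4.2 to this configuration, the forward pass could look as follows. The function and the input ordering are again assumptions; only the use of two color-mapping sub-networks around a mid-exposed reference reflects the description above.

```python
import torch

def pseudo_hdr_pipeline_3ldr(dark, mid_ref, bright,
                             map_to_dark, map_to_bright,
                             merge_net, deghost_net, exposure_fusion):
    """Schematic forward pass for the three-image case (mid-exposed reference)."""
    dark_est = map_to_dark(mid_ref)        # under-exposed instance of the reference
    bright_est = map_to_bright(mid_ref)    # over-exposed instance of the reference
    ghost_fused = exposure_fusion([dark, mid_ref, bright])
    merged = merge_net(torch.cat([mid_ref, dark_est, bright_est,
                                  dark, bright, ghost_fused], dim=1))
    return deghost_net(torch.cat([merged, ghost_fused], dim=1))
```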

Accordingly, the three-LDR-based rendering scenario imposes different constraints on the training set as well as on the CNN architecture of each sub-network. Consequently, we reshuffle the training set used for the pair-based case by changing the configuration of each scene so that the reference LDR image is mid-exposed in comparison to the non-reference images. In addition, we make sure that the views of the under- and over-exposed images are different from the view of the mid-exposed reference image, in order to simulate the required difference in scene content between the reference and non-reference images. For example, if the reference LDR image is set to the left view, the under- and over-exposed images are set to the right view, and vice versa. In addition, we enlarge the training set by including additional scenes from the free-motion dataset provided by Karaduzovic-Hadziabdic et al. in [13]. This dataset contains several scenes with differently exposed instances of a specific view, which we select as the reference view, as well as additional differently exposed LDR images depicting various types of object motion. The combination of the stereo datasets and the free-motion dataset from [13] is crucial for the generalization performance of the desired exposure fusion for dynamic scenes model.

All sub-networks in the three-LDR case are composed of convolutional and deconvolutional layers, including the modifications suggested in [6], and each convolutional layer has the same number of filters. Furthermore, we modify the architecture of each sub-network by including additional convolutions after each level of the contractive and refinement parts, except for the third convolutional layer corresponding to the lowest resolution, where several such convolutions are added. This modification is suggested in the original FlowNet design [4] as well as in other works [7, 22]. The additional convolutions do not change the spatial resolution of the input data and guarantee a better feature abstraction. This modification is essential for the three-LDR scenarios, as the inputs to the exposures merging and the subsequent guided de-ghosting sub-networks are presented as a tensor involving multiple concatenated images.

Figure 6 shows the results of our exposure fusion for dynamic scenes model for the three-image case, together with the results of the methods of Kang et al. [11], Sen et al. [21] and Hu et al. [9]. Our rendering model is capable of handling several types of motion as well as large exposure differences between the input images. In fact, our approach performs particularly well when dealing with regions that are under- and/or over-exposed in the input images, whereas the state-of-the-art approaches fail to accurately reconstruct the details in such regions. In addition to the high-quality output image produced by our model, the required computation time is very low in comparison to the other methods.

Figure 6: Additional comparison of our results on scenes composed of three LDR images against the methods of Kang et al. [11] (first scene), Sen et al. (second scene) and Hu et al. In addition, we provide the corresponding execution times of our approach as well as of the methods of Sen et al. and Hu et al. (for the alignment part). Our exposure fusion for dynamic scenes model yields artifact-free images of the corresponding reference view, despite the challenging conditions of the scenes in terms of the exposure difference and the amount of motion. This is achieved in a short time span. Images of the first scene (first row) are courtesy of Kang et al. [11], while the images of the second scene (second row) are courtesy of Kanita Karaduzovic-Hadziabdic [12].

6 Conclusion

We propose an end-to-end multi-module CNN architecture which learns to perform exposure fusion on input LDR images presenting scene and color differences. A distinctive aspect of our approach is that the input images are used at multiple stages of the CNN architecture. We propose solutions for the two-image and three-image LDR input cases, each with its own architecture. In various comparisons with the state of the art on multiple datasets, we show that the proposed approach yields excellent results. It successfully removes ghost artifacts and maintains high contrast in the obtained output image.

References

  • [1] C. Barnes, E. Shechtman, D. B. Goldman, and A. Finkelstein. The Generalized PatchMatch Correspondence Algorithm. In 11th European Conference on Computer Vision (ECCV), pages 29–43, 2010.
  • [2] P. Debevec and J. Malik. Recovering High Dynamic Range Radiance Maps from Photographs. In Proceedings SIGGRAPH, pages 369–378, 1997.
  • [3] C. Dong, C. Loy, K. He, and X. Tang. Image Super-Resolution Using Deep Convolutional Networks. European Conference on Computer Vision, 2014.
  • [4] A. Dosovitskiy, P. Fischer, E. Ilg, P. Häusser, C. Hazirbas, V. Golkov, P. V. D. Smagt, D. Cremers, and T. Brox. FlowNet: Learning Optical Flow with Convolutional Networks. In IEEE International Conference on Computer Vision (ICCV), 2015.
  • [5] O. Gallo, A. Troccoli, J. Hu, K. Pulli, and J. Kautz. Locally Non-rigid Registration for Mobile HDR Photography. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 49–56, 2015.
  • [6] I. Halfaoui, F. Bouzaraa, and O. Urfalioglu. CNN-Based Initial Background Estimation. In IEEE 23rd International Conference on Pattern Recognition. In press, 2016.
  • [7] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
  • [8] J. Hu, O. Gallo, and K. Pulli. Exposure stacks of live scenes with hand-held cameras. In IEEE European Conference on Computer Vision, pages 499–512, 2012.
  • [9] J. Hu, O. Gallo, K. Pulli, and X. Sun. HDR Deghosting: How to Deal with Saturation? In IEEE Conference on Computer Vision and Pattern Recognition, pages 1163–1170, 2013.
  • [10] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM international conference on Multimedia, pages 675–678, 2014.
  • [11] S. B. Kang, M. Uyttendaele, S. Winder, and R. Szeliski. High dynamic range video. In ACM Transactions on Graphics (TOG), volume 22, pages 319–325, 2003.
  • [12] K. Karaduzovic-Hadziabdic, J. Hasic Telalovic, and R. Mantiuk. Expert Evaluation of Deghosting Algorithms for Multi-Exposure High Dynamic Range Imaging. In Second International Conference and SME Workshop on HDR imaging, pages 1–4, 2014.
  • [13] K. Karaduzovic-Hadziabdic, J. H. Telalovic, and R. Mantiuk. Subjective and objective evaluation of multi-exposure high dynamic range image deghosting methods. Eurographics 2016, 35(2), 2016.
  • [14] A. Karbasi, A. H. Salavati, and A. Shokrollahi. Iterative Learning and Denoising in Convolutional Neural Associative Memories. In Proceedings of the 30th International Conference on Machine Learning, pages 445–453, 2013.
  • [15] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
  • [16] T. Mertens, J. Kautz, and F. Van Reeth. Exposure Fusion. In Pacific Graphics, pages 369–378, 2007.
  • [17] T. Mitsunaga and S. K. Nayar. Radiometric Self Calibration. In IEEE Computer Society Conference On Computer Vision and Pattern Recognition, volume 1, pages 374–380, 1999.
  • [18] M. A. Robertson, S. Borman, and R. L. Stevenson. Dynamic Range Improvement Through Multiple Exposures. In International Conference on Image Processing, volume 3, pages 159–163, 1999.
  • [19] D. Scharstein, H. Hirschmüller, Y. Kitajima, G. Krathwohl, N. Nesic, X. Wang, and P. Westling. High-Resolution Stereo Datasets with Subpixel-Accurate Ground Truth. In German Conference on Pattern Recognition GCPR, pages 31–42, 2014.
  • [20] D. Scharstein and C. Pal. Learning Conditional Random Fields for Stereo. In IEEE International Conference on Computer Vision and Pattern Recognition., pages 1–8, 2007.
  • [21] P. Sen, N. Khademi Kalantari, M. Yaesoubi, S. Darabi, D. Goldman, and E. Shechtman. Robust Patch-Based HDR Reconstruction of Dynamic Scenes. ACM Transactions on Graphics (Proceedings of SIGGRAPH Asia 2012), 31(6):1–11, 2012.
  • [22] E. Simo-Serra, S. Iizuka, K. Sasaki, and H. Ishikawa. Learning to simplify: fully convolutional networks for rough sketch cleanup. ACM Transactions on Graphics (TOG), 35(4):121, 2016.
  • [23] O. T. Tursun, A. O. Akyüz, A. Erdem, and E. Erdem. The state of the art in hdr deghosting: A survey and evaluation. In Computer Graphics Forum, volume 34, pages 683–707, 2015.
  • [24] O. T. Tursun, A. O. Akyüz, A. Erdem, and E. Erdem. An Objective Deghosting Quality Metric for HDR Images. In Computer Graphics Forum, pages 139–152, 2016.
  • [25] J. Viren and S. Sebastian. Natural Image Denoising with Convolutional Networks. In Advances in Neural Information Processing Systems 21, pages 769–776, 2009.
  • [26] H. Zimmer, A. Bruhn, and J. Weickert. Freehand HDR Imaging of Moving Scenes with Simultaneous Resolution Enhancement. In Computer Graphics Forum, volume 30, pages 405–414, 2011.