Joint High Dynamic Range Imaging and Super-Resolution from a Single Image (IEEE Access, 2019)
This paper presents a new framework for jointly enhancing the resolution and the dynamic range of an image, i.e., simultaneous super-resolution (SR) and high dynamic range imaging (HDRI), based on a convolutional neural network (CNN). From the common trends of both tasks, we train a CNN for the joint HDRI and SR by focusing on the reconstruction of high-frequency details. Specifically, the high-frequency component in our work is the reflectance component according to the Retinex-based image decomposition, and only the reflectance component is manipulated by the CNN while the other component (illumination) is processed in a conventional way. In training the CNN, we devise an appropriate loss function that contributes to the naturalness quality of resulting images. Experiments show that our algorithm outperforms the cascade implementation of CNN-based SR and HDRI.
With the advent of ultra-high-definition television (UHDTV) with HDR rendering, techniques for capturing high-resolution (HR) and high-dynamic-range (HDR) images have become important. In addition to developing methods to capture new UHD content, it is also important to convert the vast amount of existing low-resolution (LR) and low-dynamic-range (LDR) images to HDR-HR content for rendering on the UHDTV. For this purpose, many methods have been proposed to convert LR images to HR, a problem called single-image SR (SISR). Also, there have been some single-image HDRI algorithms that produce an HDR image from a single LDR input.
Since SISR is an important problem with many useful applications, it has long been studied by many researchers [55, 48, 58, 16, 50, 5, 29, 30, 60]. Some earlier works exploited statistical image priors for SISR [48, 46]. Learning-based methods, specifically those based on neighbor embedding and sparse coding [58, 57, 50], were also introduced for better SR. More recently, the state-of-the-art methods have been based on CNNs [5, 22, 29, 28, 51, 30, 60, 17, 15], which show further improved results over the previous ones. Generative methods [29, 43, 54] based on generative adversarial networks (GANs) have also been introduced for better perceptual quality of SR images.
For HDRI, multiple images or sensors with different exposures are used, or a single image is reverse-tonemapped to generate an HDR output [38, 35, 2, 3, 26, 33, 6, 21, 7]. To enhance undesirably illuminated images, some methods generate virtual exposure images from a single input and apply the conventional HDRI process [49, 53, 52, 8, 14, 19, 24, 40]. Recently, deep CNN-based methods have also been proposed [6, 21, 7]. There are also some works that use both HDRI and SR for image enhancement: Schubert developed a multi-step method for HDRI-SR, and Park tried several cascade implementations of HDRI-SR in different color spaces.
Recently, deep CNNs have shown dramatic improvements in most low- to high-level vision tasks, including super-resolution [5, 22, 29, 47, 28, 30, 60, 15], Gaussian denoising, and JPEG artifact reduction [4, 12]. For these problems, the deep CNN is shown to work as a proper mapping function from the degraded image to the original. In this respect, we adopt the CNN for the joint HDRI-SR problem.
The most straightforward way of using a CNN is to design an end-to-end network, i.e., a deep network that takes an LDR-LR image as the input and generates the corresponding HDR-HR output. However, for the joint HDRI-SR task, there is no standardized labeled data. Precisely, the luminance range of the HDR images and the tonemapping function from HDR to LDR are not well specified in current HDR datasets. Moreover, the dynamic ranges of target images usually differ from each other, and the tonemapping functions are also different and nonlinear due to the locally adaptive nonlinear mapping used in most images. Thus, when directly training a discriminative CNN to map LDR images to HDR ones, the network usually fails to converge. Hence, we need to use a transformed image or find another domain that is less affected by the luminance range and tonemapping function.
Some previous works also showed that it is important to find an appropriate domain when applying a CNN for image enhancement. For example, it is shown that applying a dual-domain representation, i.e., using both image- and DCT-domain priors, increases the performance of JPEG artifact reduction compared to methods without the DCT prior. Also, SR with wavelet-domain priors improved the performance over conventional image-domain methods. Additionally, recent SR and denoising CNNs such as VDSR and DnCNN focus on the residual structure, because the SR task is to find the high-frequency details, and Gaussian denoising likewise estimates the noise, which is a residual signal.
The HDRI also focuses on the reconstruction of image details rather than the low-frequency components. Specifically, recent single-image HDRI methods process the illumination and reflectance components separately [8, 14, 19, 24, 40], where the illumination corresponds to the low-frequency component and the reflectance amounts to the image details. The illumination is simply scaled according to certain virtual exposure levels, while the image details are manipulated locally and elaborately to reveal the details in the saturated regions.
Considering that both HDRI and SR try to recover the lost image details in the high-frequency region rather than the low-frequency one, we design our joint HDRI-SR CNN to focus on the high frequencies. Furthermore, it is possible to address the above-stated problem of HDRI training datasets by excluding the low-frequency illumination component from training and inference, because the illuminations usually have very wide ranges that differ from image to image. Specifically, we propose a CNN architecture and its training schemes for enlarging the resolution and dynamic range of the reflectance component. We also adopt a generative scheme for producing better details and textures.
In summary, the main contributions of this paper are as follows.
The proposed method performs HDRI and SR using a single CNN with high generalization performance, especially without well-organized labeled datasets for HDRI.
The proposed method performs better than adopting separate CNNs for HDRI and SR in terms of perceptual quality and some no-reference metrics.
The proposed single CNN requires much fewer parameters than the cascade of state-of-the-art SR and HDRI networks.
Unlike the conventional single image HDRI that needs to generate virtual multi-exposure images, the proposed method directly produces the HDR image through the CNN which simultaneously performs the SR.
In recent years, CNN-based SISR algorithms have outperformed most of the conventional non-CNN based methods. Since Dong first proposed a CNN for SISR, many other deep networks have been proposed. For example, the residual structure is used for better performance, and the sub-pixel convolution layer is introduced for a speed-up. The SRGAN generates photorealistic SR images by exploiting generative adversarial losses. Guo proposed a CNN for SR in the wavelet domain and showed that the Haar wavelet domain is an efficient one for SISR. There are also many other structures and methods, such as the recursive architectures [23, 47] and the Laplacian pyramids. Very recently, enormously deep structures have been proposed and achieved state-of-the-art performance [30, 60].
For generating an HDR image from a single input, most of the conventional methods generate virtual multi-exposure images by applying brightness enhancement functions and then fuse them with appropriate weight maps obtained from each of the virtual images [49, 53, 8, 14, 19, 24, 40]. Among these techniques, the Retinex-based approaches [8, 14, 19, 24, 40] enhance the illumination and reflectance components separately: the undesirably illuminated regions are compensated through the estimated illumination, while the saturated details are enhanced by controlling the reflectance component. For generating a real HDR image (not an exposure fusion in the low dynamic range), reverse tone mapping operators (rTMOs) with well-designed expansion maps are presented in [3, 18, 26, 33, 34]. Recently, a few CNN-based approaches [6, 21, 7] have been proposed: a CNN-based multi-exposure fusion scheme; a CNN-based reverse tone mapping method where virtual multi-exposure images are predicted by a CNN and then fused to generate the HDR image; methods [21, 7] where the HDR image is generated as a weighted sum of (real or virtually generated) multi-exposure images; and a method whose interest is in the saturated highlights, where the CNN predicts these saturated regions to generate an HDR image.
The overview of our method is illustrated in Figure 2, which shows the processing of the luminance component only, because the chrominance components are just bicubic-interpolated (not shown in the figure). The figure shows that the luminance is decomposed into the illumination and the reflectance, which undergo different processes and are finally fused again to form an enhanced luminance.
For a formal description in the rest of the paper, we denote the HDR-HR, HDR-LR, LDR-HR, and LDR-LR RGB images with corresponding subscripts. A luminance, illumination, or reflectance symbol with one of these subscripts means the corresponding component of the image with the same subscript. Through the manipulation of these components, the final goal is to estimate a plausible HDR-HR image from the given LDR-LR input.
The image decomposition process plays an important role in our scheme because it enables us to exploit domain properties and to train the CNN without consistent HDR ground truth.
The reflectance is obtained as the difference between the luminance and the estimated illumination (in the log domain) as
The illumination is estimated by filtering the luminance, where the filter is usually a normalized Gaussian in the conventional works. However, since the Gaussian filter is known to often produce halo artifacts, we adopt the weighted least squares (WLS) filter instead. The WLS filter shows fewer halo artifacts because it preserves edges better than the Gaussian filter. Precisely, the output image is obtained by solving an optimization problem that seeks the minimum of
where the index runs over pixel locations, and the two arguments are the input and output images, respectively. Also, the smoothness weights are defined as
where the gradients are taken on the log of the input, a parameter controls the sensitivity to these gradients, and a small constant prevents division by zero. For all training and test images, we use a fixed parameter setting. We also use linearized luminance values for the decomposition. Figure 3 visualizes an example of the luminance, the estimated illumination, and the reflectance obtained by the WLS filter, where we can see that details barely visible in the luminance are revealed in the reflectance.
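The elided formulas can be reconstructed from the standard WLS formulation of Farbman et al., which the paper adopts. As a sketch in our own notation (Y luminance, L illumination, R reflectance, g input, u output, p pixel location, and λ, α, ε the parameters mentioned above):

```latex
% Illumination as WLS-smoothed luminance; reflectance as the log ratio:
L = \mathrm{WLS}(Y), \qquad R = \log Y - \log L .

% WLS objective: data term plus edge-aware smoothness term
\min_{u}\ \sum_{p}\Big( (u_p - g_p)^2
   + \lambda\Big( a_{x,p}\big(\tfrac{\partial u}{\partial x}\big)_p^{2}
   + a_{y,p}\big(\tfrac{\partial u}{\partial y}\big)_p^{2} \Big) \Big)

% Smoothness weights, computed on \ell = \log g (y-direction analogous):
a_{x,p} = \Big( \big|\tfrac{\partial \ell}{\partial x}(p)\big|^{\alpha}
   + \varepsilon \Big)^{-1}
```

The weights shrink where the log-input gradient is large, so smoothing is suppressed across edges, which is why the WLS estimate produces fewer halos than a Gaussian.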
For the illumination enhancement, we first bicubically interpolate the LR illumination and then simply compensate its non-linearity to generate the enhanced illumination. For directly generating the enhanced illumination, we could use the CNN trained for the HDRI-SR of the reflectance, or design another CNN. However, according to our experiments, using CNNs does not improve the overall performance compared to interpolation and stretching, because the illumination component is a smoothed image that contains little information.
To be specific about the compensation, we first bicubically interpolate the illumination values and then apply a simple gamma function to compensate for the non-linearity. In summary, the ILL-E block in Figure 2 is just bicubic interpolation followed by a gamma function.
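The ILL-E step above can be sketched in a few lines. The function name, the spline-based stand-in for bicubic interpolation, and the gamma value 2.2 are our assumptions for illustration, not values taken from the paper:

```python
import numpy as np
from scipy.ndimage import zoom

def enhance_illumination(ill_lr, scale=2, gamma=2.2):
    """ILL-E sketch: bicubic upscaling followed by a gamma curve.

    `gamma=2.2` is an assumed value; the paper only states that a
    simple gamma function is applied.
    """
    # Cubic-spline interpolation (order=3) approximates bicubic upscaling.
    ill_hr = zoom(ill_lr, scale, order=3)
    # Clip spline overshoot before applying the gamma curve.
    ill_hr = np.clip(ill_hr, 0.0, 1.0)
    return ill_hr ** gamma

ill_lr = np.linspace(0.1, 0.9, 16).reshape(4, 4)  # toy LR illumination map
ill_hr = enhance_illumination(ill_lr, scale=2)
assert ill_hr.shape == (8, 8)
```

Since the illumination is a heavily smoothed map, this fixed interpolate-and-stretch pipeline is cheap and, per the paper's experiments, matches a learned alternative.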
In this subsection, we present a CNN that maps the LDR-LR reflectance to the HDR-HR reflectance, named REF-Net. The overall architecture of the proposed REF-Net is shown in Figure 4, which has a stacked-hourglass structure. Although U-Net is shown to be effective in semantic segmentation, it may not be satisfactory for predicting reflectance components that have abundant textures. Hence, inspired by works on successive stacks of hourglasses [32, 39, 56], we stack two U-Nets for better prediction. The kernel sizes of all convolution and transposed-convolution layers in REF-Net are fixed.
In manipulating the reflectance component, we should note that the reflectance is the log ratio between the luminance and the illumination, which may range over a wide interval and is hence not a suitable direct input to the CNN. In our experiments, the CNN fails to predict high-quality reflectance when we feed the raw reflectance without any preprocessing. To stabilize the CNN and address this issue, we employ the hyperbolic tangent function as a preprocessing step, which bounds the input to the CNN to (-1, 1). In summary, our REF-Net predicts the reflectance of the HDR-HR image from the given reflectance of the LDR-LR image, which can be described as
where the mapping function is our REF-Net with its trainable parameters.
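The tanh preprocessing described above can be illustrated as follows; `preprocess_reflectance` is a hypothetical helper name, not from the paper:

```python
import numpy as np

def preprocess_reflectance(refl):
    """Bound the log-ratio reflectance to (-1, 1) before feeding the CNN."""
    return np.tanh(refl)

# Log-ratio values can be arbitrarily large in magnitude...
refl = np.array([-5.0, -0.5, 0.0, 0.5, 5.0])
bounded = preprocess_reflectance(refl)
# ...but after tanh, every value lies strictly inside (-1, 1).
assert np.all(np.abs(bounded) < 1.0)
```

Bounding the input range in this way stabilizes training without discarding the sign or ordering of the reflectance values, since tanh is monotonic.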
With the enhanced illumination and reflectance, we recombine both components by element-wise multiplication to build the enhanced luminance. Then the final HDR irradiance map is obtained by combining the enhanced luminance with bicubic interpolations of the remaining components.
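A minimal sketch of the luminance recombination, assuming the log-ratio definition of the reflectance given earlier (so fusing is the illumination times the exponential of the reflectance); all names are ours, and the per-channel color restoration step is omitted:

```python
import numpy as np

def recombine(ill_hr, refl_hr):
    """Fuse enhanced illumination and reflectance into a luminance map.

    Since the reflectance is a log ratio (R = log Y - log L), the fusion
    is an element-wise product of the illumination with exp(R).
    """
    return ill_hr * np.exp(refl_hr)

# Round trip: decompose a toy luminance map, then recombine it.
lum = np.array([[0.2, 0.4], [0.6, 0.8]])
ill = np.full_like(lum, 0.5)        # stand-in for the WLS illumination estimate
refl = np.log(lum) - np.log(ill)    # log-ratio reflectance
assert np.allclose(recombine(ill, refl), lum)
```

The round trip confirms that the decomposition is lossless when neither component is modified; enhancement comes from processing each component before fusing.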
Finally, to display the result on an HDR display, it can be linearly stretched to the luminance range of the target HDR display. On the other hand, to display it on an LDR display, the HDR irradiance map is tonemapped to an enhanced LDR image.
In this section, we introduce loss functions that we use for our REF-Net.
We adopt the mean absolute error (MAE), averaged over the images in a mini-batch, as the reconstruction loss term.
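The reconstruction term can be sketched as follows (a plain NumPy stand-in, not the paper's TensorFlow code):

```python
import numpy as np

def mae_loss(pred, target):
    """Mean absolute error over a mini-batch (the reconstruction term)."""
    return np.mean(np.abs(pred - target))

# Per-element errors are |1-0| = 1 and |2-4| = 2, so the mean is 1.5.
assert mae_loss(np.array([1.0, 2.0]), np.array([0.0, 4.0])) == 1.5
```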
For better sharpness and details, we adopt an adversarial loss inspired by recent successful generative super-resolution models [29, 43, 54]. The generative scheme not only yields better sharpness and details but also enables the prediction of saturated regions, such as washed-out areas and diminished dark pixels, based on the training dataset. We adopt the recently proposed adversarial loss with a relativistic discriminator, which shows great image quality with the relativistic average standard GAN, named RaGAN. Specifically, the RaGAN loss for our scheme is described as:
where the expectations are taken over the empirical distributions of the real and generated data, respectively, and the discriminator outputs a raw logit. We show the discriminator architecture for the GAN training in the supplementary material.
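As a sketch of the RaGAN losses operating on raw discriminator logits, following the standard relativistic-average formulation (function names and the toy logits below are ours):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ragan_d_loss(logits_real, logits_fake):
    """Relativistic average discriminator loss: each logit is compared
    against the mean logit of the opposite class."""
    d_real = sigmoid(logits_real - logits_fake.mean())
    d_fake = sigmoid(logits_fake - logits_real.mean())
    return -np.mean(np.log(d_real)) - np.mean(np.log(1.0 - d_fake))

def ragan_g_loss(logits_real, logits_fake):
    """Generator loss: the roles of real and generated data are swapped."""
    d_fake = sigmoid(logits_fake - logits_real.mean())
    d_real = sigmoid(logits_real - logits_fake.mean())
    return -np.mean(np.log(d_fake)) - np.mean(np.log(1.0 - d_real))

# When the discriminator cleanly separates real from generated logits,
# its loss is small while the generator's loss is large.
logits_real = np.array([2.0, 1.5])
logits_fake = np.array([-1.0, -0.5])
assert ragan_d_loss(logits_real, logits_fake) < ragan_g_loss(logits_real, logits_fake)
```

Unlike the standard GAN loss, the relativistic form asks "is the real sample more realistic than the average fake?", which provides a gradient signal to the generator from both real and generated batches.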
We present two models: the basic model (HDRI-SR-B) trained without the adversarial loss and the complex model (HDRI-SR-C) trained with it. Formally, the overall loss for the basic model is the reconstruction loss in Eq. (8), and the loss for the complex model is
where the weight balancing the reconstruction and adversarial terms is fixed for the training.
For training the overall HDRI-SR, we use the MMPSG dataset, which consists of multi-exposure images and the HDR images constructed from them. Note that, for training, the input to the network is the LDR-LR reflectance and the output is the corresponding HDR-HR reflectance. To construct such input-output pairs, we select 40 sets of multi-exposure images and down-sample the standard-exposure images (not the over- or under-exposed ones) as the LDR-LR inputs. The HDR images corresponding to the input LDR ones, after tonemapping, are selected as the outputs.
To generate the reflectance dataset pairs, we obtain each reflectance from the linearized luminance values. To account for the variations and non-linearities caused by nonlinear camera response functions (CRFs), the inverse of the CRF is required. However, since the CRFs are usually unknown and vary diversely, we simply assume the CRF to be a gamma function and use its inverse for the linearization.
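The assumed gamma-CRF linearization can be sketched as follows; gamma = 2.2 is our illustrative choice, since the paper's exact value is not shown here:

```python
import numpy as np

def linearize(srgb_like, gamma=2.2):
    """Approximate inverse CRF, assuming the CRF is a simple gamma curve.

    `gamma=2.2` is an assumption for this sketch; the paper only states
    that a gamma function is assumed for the unknown CRF.
    """
    return np.clip(srgb_like, 0.0, 1.0) ** gamma

lum = np.array([0.0, 0.5, 1.0])
lin = linearize(lum)
# Endpoints are preserved; mid-tones are darkened since gamma > 1.
assert lin[0] == 0.0 and lin[2] == 1.0
assert lin[1] < 0.5
```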
It is worth pointing out that such inconsistent characteristics of HDR datasets are alleviated by removing the illumination component from the LDR images and the corresponding tonemapped HDR images. We have also tried using the original HDR images for extracting the reflectance (i.e., for removing the illumination); however, we found that extraction from the tonemapped HDR images preserves much better detail information and eventually yields much better performance.
From the reflectance pairs so obtained, we extract patches in the LR domain, which are augmented by rotation and flipping. We consider only a single scaling factor for the super-resolution. The mini-batch size is fixed for training, and the learning rate is decayed by halving once from its initial value. We implement the model using the TensorFlow library with a Titan Xp GPU.
Table 1. Quantitative comparison. Columns: Metric, Dataset, LDR-LR, Kovaleski-SR, Eilertsen-SR, HDRI-SR-B (Ours), HDRI-SR-C (Ours).
To evaluate the proposed method, we perform experiments on several test sets. First, we select eight well-known images from the multi-exposure images shown in Figure 5, which we call the "MESet8" set. We also perform experiments on three datasets, "Part I," "Part II," and "Part III," from Wang's paper, which include 29, 18, and 30 images, respectively.
We compare our method with cascades of reverse tone mapping operators (rTMOs) and super-resolution (SR). As the reverse tone mapping operators, a conventional method (Kovaleski's) and a CNN-based method (Eilertsen's) are adopted. For the super-resolution algorithm, one of the recent CNN-based state-of-the-art methods, EDSR, is used. For these cascade implementations, we empirically selected the best combination, i.e., we first apply HDRI and then super-resolution, because this order gives slightly better quality than the reverse order, which is also evidenced in the literature. For our proposed algorithms, two variations are demonstrated: HDRI-SR-B and HDRI-SR-C.
Since there is no reference image for objective evaluation, we adopt three widely used no-reference image quality assessment (NR-IQA) metrics: the natural image quality evaluator (NIQE), the HDR image gradient-based evaluator (HIGRADE), and the no-reference quality metric for single-image super-resolution (NQSR).
Table 1 shows the overall quantitative measures for all the comparisons. As expected, the input LDR-LR shows the worst result in all metrics. In the case of NIQE, which is devised to reflect naturalness, our complex model HDRI-SR-C shows the best results, and our basic model HDRI-SR-B shows results comparable to the others. HIGRADE is designed to measure the quality of tonemapped images, with two variants, HIGRADE-1 and HIGRADE-2, which differ in the features they use. For these two measures, the proposed HDRI-SR-C achieves the best results, and HDRI-SR-B is competitive with Kovaleski-SR. In the case of NQSR, which is designed to measure the quality of super-resolved images, Kovaleski-SR demonstrates the best result on most sets while our HDRI-SR-C achieves the second best, and the best on MESet8. As shown, our HDRI-SR-C model shows considerable super-resolution ability, comparable to cascades of HDRI and the state-of-the-art EDSR.
Regarding the computational complexity of the compared methods, our CNN for the joint HDRI and SR needs 8M parameters, Eilertsen's HDRI requires about 32M, and EDSR needs about 43M parameters. Hence, the proposed method needs the fewest CNN parameters among the compared ones while achieving comparable or better results, as shown above.
For the qualitative evaluation, we visualize the tonemapped result images in Figures 6, 7, and 8. It can be seen that our results show abundant textures and details while preserving natural tones in all figures. Also, the color and the global contrast are well enhanced with our algorithm.
Specifically, in Figure 6, by comparing the building facades, we can see that the high-frequency details are enhanced with our HDRI-SR models. Also, the texture and the volume of the water flow are much enhanced compared to the other algorithms, and the volume of the cloud becomes much more realistic. In Figure 7, we visualize the "Peyrou" image of MESet8. As shown, the overall contrast is enhanced within the sky and the lake, and the texture and details of the building and the trees are enriched with ours. An additional zoomed result is shown in Figure 1, where the high-frequency details and edges are enhanced and the saturated details of the trees are generated. In Figure 8, the results for "House" of MESet8 are shown. In the red boxes, our algorithms show much thicker clouds than the cascade of Kovaleski's method and EDSR. As HDRI-SR-C adopts the generative scheme, it generates much better cloud details than HDRI-SR-B. By comparing the green-box regions, we can see that the details diminished by low illumination are enhanced with our algorithms.
In this paper, we have proposed a CNN-based method for the joint HDRI and SR from a single image, where the domain knowledge from existing HDRI and SR methods is exploited in designing the framework. We also considered the issue that there is no ground-truth image for the HDRI. Specifically, we found that the key to the joint task is the reconstruction of high-frequency details, and that we can remove the inconsistent characteristics of various HDR datasets by removing the illumination. Hence, we decompose the image into the illumination and reflectance, and process the reflectance by the CNN while we just bicubically interpolate and stretch the illumination. The final output is generated by synthesizing the processed components and then linearly stretching the synthesized image according to the target display luminance. We have also proposed a generative approach and training strategies for the joint HDRI-SR task. Experiments show that the proposed methods yield better performance than the cascade of conventional CNN-based HDRI and SR.
Tensorflow: A system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.
In Proceedings of the IEEE International Conference on Computer Vision, pages 576–584, 2015.
In Proceedings of the 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 2, pages II–602. IEEE, 2003.