HDR Video Reconstruction: A Coarse-to-fine Network and A Real-world Benchmark Dataset

03/27/2021 ∙ by Guanying Chen, et al.

High dynamic range (HDR) video reconstruction from sequences captured with alternating exposures is a very challenging problem. Existing methods often align the low dynamic range (LDR) input sequence in the image space using optical flow, and then merge the aligned images to produce the HDR output. However, accurate alignment and fusion in the image space are difficult due to the missing details in the over-exposed regions and noise in the under-exposed regions, resulting in unpleasing ghosting artifacts. To enable more accurate alignment and HDR fusion, we introduce a coarse-to-fine deep learning framework for HDR video reconstruction. Firstly, we perform coarse alignment and pixel blending in the image space to estimate the coarse HDR video. Secondly, we conduct more sophisticated alignment and temporal fusion in the feature space of the coarse HDR video to produce a better reconstruction. As there is no publicly available dataset for quantitative and comprehensive evaluation of HDR video reconstruction methods, we collect such a benchmark dataset, which contains 97 sequences of static scenes and 184 testing pairs of dynamic scenes. Extensive experiments show that our method outperforms previous state-of-the-art methods. Our dataset, code and model will be made publicly available.


1 Introduction

Compared with low dynamic range (LDR) images, high dynamic range (HDR) images can better reflect the visual details of a scene in both bright and dark regions. Although significant progress has been made in HDR image reconstruction using multi-exposure images [23, 58, 60], the more challenging problem of HDR video reconstruction is still less explored. Different from HDR image reconstruction, HDR video reconstruction has to recover the HDR for every input frame (see Fig. 1), rather than just for a single reference frame (e.g., the middle-exposure image). Existing successful HDR video reconstruction techniques often rely on costly and specialized hardware (e.g., scanline exposure/ISO, or internal/external beam splitters) [56, 31, 63], which hinders their wider adoption among ordinary consumers. A promising direction for low-cost HDR video reconstruction is to utilize video sequences captured with alternating exposures (e.g., videos with a periodic exposure of {EV-3, EV+3, EV-3, ...}). This is practical as many off-the-shelf cameras can alternate exposures during recording.

Figure 1: HDR video reconstruction from sequences captured with three alternating exposures. Row 1 shows four input LDR frames (EV+2, EV-2, EV+0, EV+2). Rows 2–3 are the (tonemapped) HDR frames reconstructed by Kalantari19 [24] and by our method, respectively.

Conventional reconstruction pipelines along this direction often consist of two steps [26]. In the first step, neighboring frames with different exposures are aligned to the current frame using optical flow. In the second step, the aligned images are fused to produce the HDR image. However, accurate alignment and fusion are difficult to achieve for LDR images with different exposures, as there are saturated pixel values in the over-exposed regions and noise in the under-exposed regions. Recently, Kalantari and Ramamoorthi [24] proposed to estimate the optical flow with a deep neural network, and used another network to predict the fusion weights for merging the aligned images. Although improved results over traditional methods [25, 39, 26, 33] have been achieved, their method still relies on the accuracy of optical flow alignment and pixel blending, and suffers from ghosting artifacts in regions with large motion (see the second row of Fig. 1). It remains a challenging problem to reconstruct ghost-free HDR videos from sequences with alternating exposures.

Recently, deformable convolution [8] has been successfully applied to feature alignment in video super-resolution [57, 55]. However, these methods are not tailored for LDR images with different exposures. Motivated by the observation that accurate image alignment between LDR images with different exposures is difficult, and by the success of deformable feature alignment for videos with constant exposure, we introduce a two-stage coarse-to-fine framework for this problem. The first stage, denoted as CoarseNet, aligns images using optical flow in the image space and blends the aligned images to reconstruct the coarse HDR video. This stage can recover/remove a large part of the missing details/noise from the input LDR images, but some artifacts remain in regions with large motion. The second stage, denoted as RefineNet, performs more sophisticated alignment and fusion in the feature space of the coarse HDR video using deformable convolution [8] and temporal attention. Such a two-stage approach avoids the need to estimate highly accurate optical flow between images with different exposures, and therefore reduces the learning difficulty and removes ghosting artifacts in the final results.

As there is no publicly available real-world video dataset with ground-truth HDR for evaluation, comprehensive comparisons among different methods are difficult to achieve. To alleviate this problem, we create a real-world dataset containing both static and dynamic scenes as a benchmark for quantitative and qualitative evaluation.

In summary, the key contributions of this paper are as follows:

  • We propose a two-stage framework for HDR video reconstruction from sequences with alternating exposures, which first performs image alignment and HDR fusion in the image space and then in the feature space.

  • We create a real-world video dataset captured with alternating exposures as a benchmark to enable quantitative evaluation for this problem.

  • Our method achieves state-of-the-art results on both synthetic and real-world datasets.

2 Related Work

HDR image reconstruction

Merging multi-exposure LDR images is the most common way to reconstruct HDR images [9, 40]. To handle dynamic scenes, image alignment is employed to reduce the ghosting artifacts [52, 20, 49, 37]. Recent methods apply deep neural networks to merge multi-exposure images [23, 6, 58, 60, 61, 48]. However, these methods rely on a fixed reference exposure (e.g., the middle exposure) and cannot be directly applied to reconstruct HDR videos from sequences with alternating exposures. Burst denoising techniques [36, 18, 34] can also be applied to produce HDR images by denoising the low-exposure images. However, these techniques cannot make use of the cleaner details that exist in high-exposure images and have difficulty in handling extremely dark scenes.

There are also methods for HDR reconstruction from a single LDR image. Traditional methods expand the dynamic range of the LDR image by applying image processing operations (e.g., function mapping and filtering) [1, 2, 3, 4, 21, 30]. These methods generally cannot recover the missing details in the clipped regions. Recent methods adopt CNNs for single-image HDR reconstruction [10, 11, 32, 62, 45, 42, 35, 51]. However, these methods focus on hallucinating the saturated regions and cannot deal with the noise in the dark regions of a low-exposure image.

Recently, Kim et al. [27, 28] proposed to tackle the problem of joint super-resolution and inverse tone-mapping. Instead of reconstructing the linear luminance image like previous HDR reconstruction methods, their goal was to convert a standard dynamic range (SDR) image to the HDR display format (i.e., from BT.709 to BT.2020).

HDR video reconstruction

Many existing HDR video reconstruction methods rely on specialized hardware. Examples include per-pixel exposure [47], scanline exposure/ISO [16, 19, 7], internal [56, 31] or external [43] beam splitters that split the light onto different sensors, modulo cameras [63], and neuromorphic cameras [17]. The requirement of specialized hardware limits the widespread application of these methods. Recent methods also explore the joint optimization of an optical encoder and a CNN-based decoder for HDR imaging [44, 54].

There are also works on HDR video reconstruction from sequences with alternating exposures. Kang et al. [26] introduced the first algorithm of this kind, which first aligns neighboring frames to the reference frame using optical flow, and then merges the aligned images into an HDR image. Mangiat and Gibson improved this method with a block-based motion estimation and refinement stage [38, 39]. Kalantari et al. [25] introduced a patch-based optimization method that synthesizes the missing exposures at each frame and then reconstructs the final HDR image. Gryaditskaya et al. [15] improved [25] by introducing an adaptive metering algorithm that adjusts the exposures to reduce artifacts caused by motion. Li et al. [33] formulated this problem as a maximum a posteriori estimation. Recently, Kalantari and Ramamoorthi [24] introduced an end-to-end deep learning framework that contains a flow network for alignment and a weight network for pixel blending in the image space. Different from [24], our coarse-to-fine network performs alignment and fusion sequentially in the image space and the feature space for better reconstruction.

Figure 2: Network architecture of the proposed coarse-to-fine framework for videos captured with two alternating exposures.

3 The Proposed Coarse-to-fine Framework

3.1 Overview

Given an input LDR video captured with alternating exposures (for example, the exposure can be alternated periodically in the order of {EV-3, EV+3, EV-3, ...} or {EV-2, EV+0, EV+2, EV-2, ...}), our goal is to reconstruct the corresponding HDR video, as shown in Fig. 1.

Preprocessing

Following previous methods [25, 33, 24], we assume the camera response function (CRF) [14] of the original input images is known. In practice, the CRF of a camera can be robustly estimated using a linear method [9]. As in [24], we replace the CRF of the input images with a fixed gamma curve, which unifies input videos captured with different cameras or configurations. Global alignment using a similarity transformation is then performed to compensate for camera motion among neighboring frames.
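To make the global-alignment step concrete, below is a minimal sketch (our own illustration, not the authors' code) that fits a similarity transform between two frames using OpenCV's ORB features and estimateAffinePartial2D; the feature type, match count, and RANSAC choice are assumptions.

```python
import cv2
import numpy as np

def globally_align(frame, reference):
    """Warp `frame` onto `reference` with a similarity transform.

    Both inputs are uint8 grayscale images. This only sketches the
    global-alignment step; the paper does not specify the implementation.
    """
    orb = cv2.ORB_create(2000)
    k1, d1 = orb.detectAndCompute(frame, None)
    k2, d2 = orb.detectAndCompute(reference, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(d1, d2), key=lambda m: m.distance)[:500]
    src = np.float32([k1[m.queryIdx].pt for m in matches])
    dst = np.float32([k2[m.trainIdx].pt for m in matches])
    # estimateAffinePartial2D fits a 4-DoF similarity (rotation, scale, shift)
    M, _ = cv2.estimateAffinePartial2D(src, dst, method=cv2.RANSAC)
    h, w = reference.shape[:2]
    return cv2.warpAffine(frame, M, (w, h))
```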

Pipeline

Due to the existence of noise and missing details, accurate image alignment between images with different exposures is difficult. To overcome these challenges, we introduce a two-stage framework for more accurate image alignment and fusion (see Fig. 2). For simplicity, we illustrate our method for handling videos captured with two alternating exposures in this paper, and describe how to extend our method for handling three exposures in the supplementary material.

The first stage, named CoarseNet, aligns images using optical flow and performs HDR fusion in the image space. It takes three frames as input and estimates an HDR image for the reference (i.e., center) frame. This stage can recover/remove a large part of the missing details/noise for the reference LDR image. Given five consecutive LDR frames with two alternating exposures, our CoarseNet sequentially reconstructs the coarse HDR images for the middle three frames. The second stage, named RefineNet, takes these three coarse HDR images as input to produce a better HDR reconstruction for the center frame. It performs more sophisticated alignment using deformable convolution and temporal fusion in the feature space.

3.2 Coarse Reconstruction in the Image Space

The CoarseNet follows the design of [24], containing an optical flow estimation network, named flow network, and a blending weight estimation network, named weight network (see Fig. 3). It first warps two neighboring frames to the center frame using optical flows, and then reconstructs the HDR image by blending the aligned images. The network details can be found in the supplementary materials.

Figure 3: Overview of the CoarseNet.

Optical flow for image alignment

Since the reference frame and its neighboring frames have different exposures, we adjust the exposure of the reference frame to match that of the neighboring frames. The reference LDR image L_i is first converted to the linear radiance domain using its exposure t_i:

I_i = L_i^γ / t_i.    (1)

It is then converted to the LDR domain using the neighboring exposure t_{i+1} as clip((I_i · t_{i+1})^{1/γ}), where the clip function clips values to the range [0, 1].

Traditional flow estimation methods take two images as input and estimate a flow map. In our problem, however, the center frame has a different exposure from the neighboring frames, so the exposure-adjusted reference frame often contains missing content or noise. We therefore take the three consecutive images (the two neighbors and the exposure-adjusted reference) as input and estimate two flow maps, as in [24]. The two neighboring frames can then be aligned to the reference frame using backward warping with bilinear sampling [22].
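For illustration, the exposure adjustment of Eq. (1) amounts to a few lines of array code; the gamma value of 2.2 and the function and variable names below are assumptions of this sketch rather than details taken from the paper.

```python
import numpy as np

GAMMA = 2.2  # assumed value of the fixed gamma curve used in preprocessing

def adjust_exposure(ldr_ref, t_ref, t_nbr, gamma=GAMMA):
    """Re-expose the reference LDR frame to the neighboring exposure.

    ldr_ref: reference LDR image in [0, 1] after the gamma curve.
    t_ref, t_nbr: exposure times of the reference and neighboring frames.
    """
    radiance = np.power(ldr_ref, gamma) / t_ref        # Eq. (1): to linear radiance
    ldr_adj = np.power(radiance * t_nbr, 1.0 / gamma)  # back to LDR at the neighbor's exposure
    return np.clip(ldr_adj, 0.0, 1.0)                  # clip to the valid LDR range
```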

Pixel-blending for HDR reconstruction

The HDR image can be computed as a weighted average of the pixels in the aligned images [9, 23]. Note that the two original neighboring frames are also taken into account for pixel blending, as it is reported to be helpful for reducing artifacts in the background regions [24].

Specifically, the weight network takes five images as input: the two aligned neighbors, the two original neighbors, and the reference frame. We provide these five images in both the LDR and the linear radiance domain, resulting in a stack of ten images as input. The network predicts five per-pixel weight maps, and the coarse HDR at frame i is reconstructed as the weighted average of the five input images in the linear radiance domain:

H_i^c = ( Σ_{j=1}^{5} W_j ⊙ I_j ) / ( Σ_{j=1}^{5} W_j ),    (2)

where I_j denotes the j-th input image in the linear radiance domain and W_j its predicted weight map.

Similar to [24], we adopt an encoder-decoder architecture to estimate the blending weights.
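A minimal sketch of the pixel-blending step of Eq. (2) is shown below, assuming the weight network outputs one non-negative weight map per input image; the tensor shapes and names are illustrative.

```python
import torch

def blend_hdr(linear_imgs, weights, eps=1e-8):
    """Weighted-average merge of Eq. (2).

    linear_imgs: tensor of shape (B, 5, 3, H, W), the five inputs in the
                 linear radiance domain (two aligned neighbors, two original
                 neighbors, and the reference frame).
    weights:     tensor of shape (B, 5, 1, H, W), per-pixel blending weights
                 predicted by the weight network (non-negative).
    """
    weighted = (weights * linear_imgs).sum(dim=1)
    return weighted / (weights.sum(dim=1) + eps)  # normalize to a weighted average
```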

Loss function

As HDR images are typically displayed after tonemapping, we compute the loss in the tonemapped HDR space. Following [23, 58, 60, 24], we adopt the differentiable μ-law function:

T(H) = log(1 + μH) / log(1 + μ),    (3)

where T(H) is the tonemapped HDR image and μ is a parameter controlling the compression level. We train CoarseNet with the L1 loss between the tonemapped prediction and the tonemapped ground-truth HDR image. Since both the flow network and the weight network are differentiable, the CoarseNet can be trained end-to-end.
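A small sketch of the μ-law tonemapping and the CoarseNet loss is given below; the value of μ is a placeholder taken from common practice in deep HDR work, not a value reproduced from the paper.

```python
import math
import torch

def mu_law(hdr, mu=5000.0):
    """Differentiable mu-law tonemapping of Eq. (3).

    mu controls the compression level; 5000 is an assumed placeholder here.
    """
    return torch.log(1.0 + mu * hdr) / math.log(1.0 + mu)

def coarse_loss(hdr_pred, hdr_gt):
    """L1 loss between the tonemapped prediction and ground truth."""
    return torch.mean(torch.abs(mu_law(hdr_pred) - mu_law(hdr_gt)))
```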

3.3 HDR Refinement in the Feature Space

Taking the three coarse HDR images estimated by the CoarseNet as input, the RefineNet performs alignment and fusion in the feature space to produce a better HDR reconstruction for the center frame, as the problem of missing content or noise has been largely solved in the first stage (see the right part of Fig. 2).

Our RefineNet first extracts a feature map for each input coarse HDR image using a shared-weight feature extractor. Features of the neighboring frames are then aligned to the center frame using a deformable alignment module [8, 57]. The aligned features are fused using a temporal attention fusion module for the final HDR reconstruction.

Figure 4: Structure of the (a) deformable alignment module and (b) temporal attention fusion module.

Deformable feature alignment

Deformable convolution [8] has recently been successfully applied to feature alignment for video super-resolution (e.g., EDVR [57] and TDAN [55]). The core idea of deformable feature alignment is as follows. Given two features as input, e.g., the reference feature F_i and a neighboring feature F_{i+1}, an offset prediction module (which can be general convolutional layers) predicts an offset:

Δp_{i+1} = f([F_i, F_{i+1}]).    (4)

With the learned offset, the neighboring feature can be sampled and aligned to the reference frame using deformable convolution [8]:

F^a_{i+1} = DConv(F_{i+1}, Δp_{i+1}),    (5)

where F^a_{i+1} denotes the aligned neighboring feature.

We adopt the pyramid, cascading and deformable (PCD) alignment module [57], which performs deformable alignment in three pyramid levels, as our feature alignment module (see Fig. 4 (a)). This alignment process is implicitly learned to optimize the final HDR reconstruction.
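The alignment of Eqs. (4)–(5) can be sketched with torchvision's DeformConv2d as below. This is a single-level illustration only; the actual model uses the three-level PCD module of [57], and the channel sizes and offset-predictor design here are assumptions.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class SimpleDeformAlign(nn.Module):
    """Single-level deformable alignment (Eqs. 4-5); channel sizes are illustrative."""

    def __init__(self, channels=64, kernel_size=3):
        super().__init__()
        # Offset prediction from the concatenated reference/neighbor features (Eq. 4)
        self.offset_conv = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, 2 * kernel_size * kernel_size, 3, padding=1),
        )
        # Deformable convolution samples the neighbor feature at the learned offsets (Eq. 5)
        self.deform_conv = DeformConv2d(channels, channels, kernel_size, padding=1)

    def forward(self, feat_ref, feat_nbr):
        offset = self.offset_conv(torch.cat([feat_ref, feat_nbr], dim=1))
        return self.deform_conv(feat_nbr, offset)
```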

Multi-feature fusion

Given the three aligned features, we propose a temporal attention fusion module to suppress misaligned features and merge complementary information for more accurate HDR reconstruction (see Fig. 4 (b)). Each feature is concatenated with the reference feature and fed to two convolutional layers to estimate an attention map of the same size as the feature. Each feature is then weighted by its corresponding attention map. Finally, the three attended features are concatenated and fused using a convolutional layer.
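Following the description above, a minimal sketch of the temporal attention fusion module could look as follows; the channel counts, kernel sizes, and sigmoid gating are assumptions of this illustration.

```python
import torch
import torch.nn as nn

class TemporalAttentionFusion(nn.Module):
    """Attention-weighted fusion of three aligned features; sizes are illustrative."""

    def __init__(self, channels=64):
        super().__init__()
        # Two conv layers estimate an attention map from [feature, reference]
        self.attn = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.Sigmoid(),
        )
        # A final conv fuses the three attended features
        self.fuse = nn.Conv2d(3 * channels, channels, 1)

    def forward(self, feats, ref_idx=1):
        ref = feats[ref_idx]
        attended = [f * self.attn(torch.cat([f, ref], dim=1)) for f in feats]
        return self.fuse(torch.cat(attended, dim=1))
```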

HDR reconstruction

The reconstruction branch takes the fused feature as input and regresses the refined HDR image. Two skip connections are added to concatenate encoder features of the reference frame with decoder features of the same dimensions.

Note that our RefineNet aims to refine the results of the CoarseNet in the not well-exposed regions. For a low-exposure image, we empirically define regions whose LDR pixel values are below a low threshold as not well-exposed, while for a high-exposure image, regions with pixel values above a high threshold are not well-exposed [25]. The final predicted HDR is then computed as

H_i = M_i ⊙ H_i^c + (1 − M_i) ⊙ H_i^r,    (6)

where H_i^c and H_i^r are the coarse and refined HDR predictions, M_i is a mask indicating the well-exposed regions of the reference frame, and ⊙ is the element-wise product. Figure 5 shows how M_i is computed for low- and high-exposure reference images; for example, the well-exposed mask of a low-exposure reference image is computed by applying the weighting curve in Fig. 5 (a) to the reference LDR pixel values (Eq. (7)).

Figure 5: Weight curves for computing the well-exposed regions for (a) a low-exposure and (b) a high-exposure reference image. The horizontal axis is the pixel value of the reference LDR image.
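The masking and blending of Eq. (6) can be sketched as below; the ramp thresholds are hypothetical placeholders, since the exact weighting curves of Fig. 5 and Eq. (7) are defined in the paper.

```python
import torch

def well_exposed_mask(ldr_ref, low_exposure, lo=0.2, hi=0.9, width=0.05):
    """Soft mask of well-exposed pixels (cf. Fig. 5 and Eq. (7)).

    The ramp positions (lo, hi, width) are hypothetical placeholders; the
    paper defines its own thresholds for low- and high-exposure references.
    """
    if low_exposure:
        m = (ldr_ref - lo) / width   # dark pixels of a low-exposure frame are noisy
    else:
        m = (hi - ldr_ref) / width   # bright pixels of a high-exposure frame are saturated
    return torch.clamp(m, 0.0, 1.0)

def final_hdr(mask, hdr_coarse, hdr_refined):
    """Eq. (6): keep the coarse HDR where well-exposed, the refined HDR elsewhere."""
    return mask * hdr_coarse + (1.0 - mask) * hdr_refined
```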

Loss function

We compute the loss for the RefineNet as the sum of an L1 loss and a perceptual loss. The L1 loss is computed between the tonemapped refined prediction and the tonemapped ground truth in the not well-exposed regions:

L_1 = || (1 − M_i) ⊙ (T(H_i^r) − T(H_i^gt)) ||_1,    (8)

where T(H_i^r) is the tonemapped image of H_i^r and H_i^gt is the ground-truth HDR. The loss is normalized by the number of not well-exposed pixels. The perceptual loss compares features of the tonemapped prediction and ground truth extracted from the {relu1_2, relu2_2, relu3_3} layers of the VGG16 network [53].
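A sketch of the perceptual term using torchvision's VGG16 is shown below, taking features at relu1_2, relu2_2, and relu3_3 (indices 3, 8, and 15 of vgg16().features); the L1 feature distance, equal layer weights, and the omission of VGG input normalization are assumptions of this sketch.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class PerceptualLoss(nn.Module):
    """VGG16 feature loss at relu1_2, relu2_2, relu3_3 (indices 3, 8, 15)."""

    def __init__(self):
        super().__init__()
        features = vgg16(weights="IMAGENET1K_V1").features[:16].eval()
        for p in features.parameters():
            p.requires_grad = False  # the VGG extractor is kept frozen
        self.features = features
        self.layer_ids = {3, 8, 15}  # relu1_2, relu2_2, relu3_3

    def forward(self, pred_tm, gt_tm):
        # pred_tm / gt_tm: tonemapped 3-channel images (ImageNet normalization omitted)
        loss, x, y = 0.0, pred_tm, gt_tm
        for i, layer in enumerate(self.features):
            x, y = layer(x), layer(y)
            if i in self.layer_ids:
                loss = loss + torch.mean(torch.abs(x - y))
        return loss
```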

4 Real-world Benchmark Dataset

In this section, we introduce a real-world benchmark dataset for qualitative and quantitative evaluation.

Existing real-world video dataset

Currently, there is no benchmark dataset with ground-truth HDR for this problem. The only public real-world dataset is the Kalantari13 dataset [25], which consists of 9 dynamic-scene videos (5 with two alternating exposures and 4 with three) in RGB image format. However, due to the lack of ground-truth HDR, previous works could only evaluate their methods qualitatively on this dataset. In addition, this dataset is too small to be used for possible semi-supervised or unsupervised learning in the future.

Data Size           Static Scenes w/ GT   Dynamic Scenes w/ GT   Dynamic Scenes w/o GT
                    2-Exp      3-Exp      2-Exp      3-Exp       2-Exp      3-Exp
Kalantari13 [25]    -          -          -          -           5          4
Ours                49         48         76         108         37         38

Table 1: Comparison between our dataset and the Kalantari13 dataset [25]. Numbers are the counts of captured sequences/pairs. 2-Exp and 3-Exp indicate videos with two and three alternating exposures, respectively.

Dataset overview

To facilitate a more comprehensive evaluation on real data, we captured a real-world dataset and generated reliable ground truth HDR for evaluation. We used an off-the-shelf Basler acA4096-30uc camera for capturing videos with alternating exposures (i.e., two and three exposures) in a variety of scenes, including indoor, outdoor, daytime, and nighttime scenes.

Three different types of video data are captured, namely, static scenes with GT, dynamic scenes with GT, and dynamic scenes without GT, where GT is short for ground-truth HDR. Table 1 compares the statistics of our dataset and the Kalantari13 dataset.

Static scenes with GT

For static scenes, we captured 49 two-exposure and 48 three-exposure sequences. The ground-truth HDR frames for static scenes were generated by merging multi-exposure images [9]. We first averaged images having the same exposure to reduce noise, and then merged the multi-exposure images using a weighting function similar to [23]. For each scene, we will release the captured frames and the generated HDR frame.
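For illustration, the GT generation for static scenes could be sketched as below: frames of the same exposure are averaged, linearized, and merged with a mid-tone-favoring weight. The hat-shaped weighting is a stand-in for the function of [23], and the gamma value is an assumption.

```python
import numpy as np

GAMMA = 2.2  # assumed gamma of the processed LDR frames

def merge_static_gt(frames_by_exposure, exposures, gamma=GAMMA, eps=1e-8):
    """Generate a GT HDR frame for a static scene.

    frames_by_exposure: list (one entry per exposure) of arrays shaped
                        (N, H, W, 3) holding the N captured LDR frames in [0, 1].
    exposures:          list of exposure times, one per entry.
    """
    num, den = 0.0, 0.0
    for frames, t in zip(frames_by_exposure, exposures):
        ldr = frames.mean(axis=0)             # average same-exposure frames to reduce noise
        radiance = np.power(ldr, gamma) / t   # linearize each exposure
        w = 1.0 - np.abs(2.0 * ldr - 1.0)     # favor well-exposed (mid-tone) pixels
        num += w * radiance
        den += w
    return num / (den + eps)
```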

Dynamic scenes with GT

Generating per-frame ground-truth HDR for dynamic videos is very challenging. Following the strategy used for capturing dynamic HDR images [23], we propose to create image pairs consisting of input LDR frames and the HDR of the center frame. We considered a static environment and used a human subject to simulate motion in the videos.

For each scene, we first asked the subject to stay still for a few seconds, so that we could find enough consecutive still frames without motion to generate the HDR image for this timestamp. We then asked the subject to move back-and-forth (e.g., waving hands or walking). We selected an image sequence whose center frame was the static frame, and arranged this sequence into the proper LDRs-HDR pairs (see Fig. 6 for an example). For each reference frame with GT HDR, we also created a pair with larger motion by sampling the neighboring frames at a larger frame interval, which doubles the number of pairs. In total, we created 76 and 108 pairs for the two-exposure and three-exposure cases, respectively.

Figure 6: Illustration of generating the LDRs-HDR pairs for a two-exposure scene (3 frames). Row 1 shows the selected image sequence. Rows 2 and 3 are two sample pairs with low-exposure and high-exposure reference frames, respectively.

Dynamic scenes without GT

We captured a larger-scale dataset containing uncontrolled dynamic scenes for qualitative evaluation. Specifically, we captured 37 two-exposure and 38 three-exposure sequences. This dataset can also be used for semi-supervised or unsupervised training in the future.

Data processing

We saved the raw data of the captured videos and performed demosaicing, white balancing, color correction, and gamma compression to convert the raw data to RGB images using the recorded metadata. In this paper, we rescaled the images to a lower resolution for evaluation. Both the captured raw data and the processed images will be released.

5 Experiments

In this section, we conduct experiments on synthetic and real-world datasets to verify the effectiveness of the proposed method. We compared our method with Kalantari13 [25], Kalantari19 [24], and Yan19 [60]. Kalantari13 [25] is an optimization-based method, and we used the publicly available code for testing. Note that Yan19 [60] is a state-of-the-art method for multi-exposure HDR image reconstruction, and we adapted it for video reconstruction by changing the network input. We re-implemented [24, 60] and trained them on the same dataset as our method.

We evaluated the estimated HDR in terms of PSNR (in the μ-law tonemapped domain), HDR-VDP-2 [41], and HDR-VQM [46]. HDR-VQM is designed for evaluating the quality of HDR videos. All visual results in the experiments are tonemapped using Reinhard et al.'s method [50], following [24, 25, 26]. In addition, a user study [5] (i.e., a paired comparison test) was conducted.

5.1 Training Datasets and Details

Synthetic training dataset

Since there is no publicly available real video dataset with alternating exposures and ground-truth HDR, we resort to synthetic data for training. Following [24], we selected HDR videos from [12, 31] to synthesize the training dataset. Since the size of this HDR video dataset is limited, we also adopted the high-quality Vimeo-90K dataset [59] as a source of videos. Please refer to our supplementary material for more details.

Data augmentation

As the training data was generated from clean HDR videos, the resulting input sequences lack noise in the low-exposure images. To close this gap, we randomly added zero-mean Gaussian noise in the linear domain of the inputs. We also perturbed the tone of the reference image using a random gamma function to simulate a possibly inaccurate CRF [24, 13]. Random horizontal/vertical flipping and rotation were applied, and patches were cropped out as the network input.
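A sketch of these two augmentations is given below; the noise level, gamma range, and the assumption that LDR values lie in [0, 1] are placeholders, not values taken from the paper.

```python
import numpy as np

GAMMA = 2.2  # assumed gamma of the synthesized LDR frames

def augment_low_exposure(ldr, noise_std=0.01, gamma=GAMMA):
    """Add zero-mean Gaussian noise in the linear domain (noise_std is a placeholder)."""
    linear = np.power(ldr, gamma)
    linear = linear + np.random.normal(0.0, noise_std, size=linear.shape)
    return np.clip(linear, 0.0, 1.0) ** (1.0 / gamma)

def perturb_tone(ldr_ref, lo=0.9, hi=1.1):
    """Perturb the reference tone with a random gamma (the range is a placeholder)."""
    g = np.random.uniform(lo, hi)
    return np.clip(ldr_ref, 0.0, 1.0) ** g
```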

Implementation details

We trained our method using the Adam optimizer [29] with default parameters. We first trained the CoarseNet and then the RefineNet, halving the initial learning rate at fixed epoch intervals for both networks. We then finetuned the whole network end-to-end with a reduced learning rate.

                     2-Exposure                         3-Exposure
Method               PSNR    HDR-VDP-2   HDR-VQM        PSNR    HDR-VDP-2   HDR-VQM
Kalantari13 [25]     37.53   59.07       84.51          30.36   56.56       65.90
Yan19 [60]           39.05   70.61       71.27          36.28   65.47       72.20
Kalantari19 [24]     37.48   70.67       84.57          36.27   65.51       72.58
Ours                 40.34   71.79       85.71          37.04   66.44       73.38

Table 2: Averaged results on the synthetic dataset.

(a) Results on static scenes with GT, augmented with random global motion.

2-Exposure
                   Low-Exposure        High-Exposure       All-Exposure
Method             PSNR   HDR-VDP-2    PSNR   HDR-VDP-2    PSNR   HDR-VDP-2   HDR-VQM
Kalantari13 [25]   40.00  73.70        40.04  70.08        40.02  71.89       76.22
Yan19 [60]         34.54  80.22        39.25  65.96        36.90  73.09       65.33
Kalantari19 [24]   39.79  81.02        39.96  67.25        39.88  74.13       73.84
Ours               41.95  81.03        40.41  71.27        41.18  76.15       78.84

3-Exposure
                   Low-Exposure        Middle-Exposure     High-Exposure       All-Exposure
Method             PSNR   HDR-VDP-2    PSNR   HDR-VDP-2    PSNR   HDR-VDP-2    PSNR   HDR-VDP-2   HDR-VQM
Kalantari13 [25]   39.61  73.24        39.67  73.24        40.01  67.90        39.77  70.37       79.55
Yan19 [60]         36.51  77.78        37.45  69.79        39.02  64.57        37.66  70.71       70.13
Kalantari19 [24]   39.48  78.13        38.43  70.08        39.60  67.94        39.17  72.05       80.70
Ours               40.00  78.66        39.27  73.10        39.99  69.99        39.75  73.92       82.87

(b) Results on dynamic scenes with GT.

2-Exposure
                   Low-Exposure        High-Exposure       All-Exposure
Method             PSNR   HDR-VDP-2    PSNR   HDR-VDP-2    PSNR   HDR-VDP-2   HDR-VQM
Kalantari13 [25]   37.73  74.05        45.71  66.67        41.72  70.36       85.33
Yan19 [60]         36.41  85.68        49.89  69.90        43.15  77.79       78.92
Kalantari19 [24]   39.94  86.77        49.49  69.04        44.72  77.91       87.16
Ours               40.83  86.84        50.10  71.33        45.46  79.09       87.40

3-Exposure
                   Low-Exposure        Middle-Exposure     High-Exposure       All-Exposure
Method             PSNR   HDR-VDP-2    PSNR   HDR-VDP-2    PSNR   HDR-VDP-2    PSNR   HDR-VDP-2   HDR-VQM
Kalantari13 [25]   37.53  72.03        36.38  65.37        34.73  62.24        36.21  66.55       84.43
Yan19 [60]         36.43  77.74        39.80  67.88        43.03  64.74        39.75  70.12       87.93
Kalantari19 [24]   38.34  78.04        41.21  66.07        42.66  64.01        40.74  69.37       89.36
Ours               38.77  78.11        41.47  68.49        43.24  65.08        41.16  70.56       89.56

Table 3: Quantitative results on the introduced real dataset. The averaged results for each exposure and for all exposures are shown.

Figure 7: Visual results on the synthetic dataset. Columns from left to right: overlapped input, Kalantari13 [25], Kalantari19 [24], ours, and the GT HDR.

5.2 Evaluation on Synthetic Dataset

We first evaluated our method on a synthetic dataset generated from two HDR videos (i.e., Poker Fullshot and Carousel Fireworks) [12], which were not used for training. Random Gaussian noise was added to the low-exposure images. Table 2 clearly shows that our method outperforms previous methods in all metrics on this dataset. Figure 7 shows that our method can effectively remove the noise (top row) and ghosting artifacts (bottom row) in the reconstructed HDR.

5.3 Evaluation on Real-world Dataset

To validate the generalization ability of our method on real data, we then evaluated the proposed method on the introduced real-world dataset and the Kalantari13 dataset [25].

Figure 8: Visual results on the captured dataset. Columns from left to right: overlapped input, Kalantari13 [25], Kalantari19 [24], ours, and the GT HDR. (a) Results on static scenes augmented with random global motion. (b) Results on dynamic scenes with GT. For each sub-figure, row 1 shows a two-exposure scene and row 2 a three-exposure scene.

Evaluation on static scenes

We evaluated our method on the static scenes with GT, augmented with random global motion (i.e., a random translation applied to each frame). We did not pre-align the input frames for any method, in order to investigate their robustness against input with inaccurate global alignment. Table 3 (a) shows that our method achieves the best results for two-exposure scenes and the most robust results for three-exposure scenes. Although Kalantari13 [25] shows a slightly better averaged PSNR for three-exposure scenes (39.77 vs. 39.75), it suffers from ghosting artifacts in over-exposed regions (see Fig. 8 (a)).

Evaluation on dynamic scenes

Table 3 (b) summarizes the results on the dynamic scenes with GT, where our method performs the best in all metrics. Compared with our method, the performance of Kalantari13 [25] drops quickly for dynamic scenes, as this dataset contains more challenging local motions. Figure 8 (b) shows that methods performing alignment and fusion in the image space [25, 24] produce unpleasing artifacts around motion boundaries. In contrast, our two-stage coarse-to-fine framework enables more accurate alignment and fusion, and is therefore robust to regions with large motion, producing ghost-free reconstructions for scenes with two and three exposures.

Figure 9: Visual comparison on the Throwing Towel 2Exp scene from the Kalantari13 dataset. Columns: (a) input reference frames, (b) Kalantari13 [25], (c) Kalantari19 [24], and (d) ours.

Evaluation on Kalantari13 dataset

We then evaluated our method on the Kalantari13 dataset. Note that the results of Kalantari19 [24] on this dataset were provided by the authors. Figure 9 compares the results for three consecutive frames from the Throwing Towel 2Exp scene, where our method achieves significantly better visual results. For a high-exposure reference frame, our method can recover fine details in the over-exposed regions without introducing artifacts (see rows 1 and 3). In comparison, methods based on optical flow alignment and image blending [25, 24] suffer from artifacts in the over-exposed regions. For a low-exposure reference frame, compared with Kalantari13 [25], our method can remove the noise and preserve the structure of the dark regions (see row 2).

Figure 10: User study results.

User study

We also conducted a user study on the dynamic scene dataset (3-Exp) to further demonstrate the visual quality of our results (see Fig. 10). Participants were invited to indicate their preference on pairs of images, with the GT HDR also shown for reference. Overall, most users preferred the results of our method over both Kalantari13 [25] and Kalantari19 [24], reiterating the effectiveness of our method.

5.4 Network Analysis

We first discuss the network parameters and runtime, and then conduct an ablation study of the proposed method.

                               2-Exposure             3-Exposure
Method             # Param.    Res. 1     Res. 2      Res. 1     Res. 2
Kalantari13 [25]   -           125s       185s        300s       520s
Kalantari19 [24]               0.35s      0.59s       0.42s      0.64s
Ours                           0.51s      0.97s       0.64s      1.09s

Table 4: Model parameters and runtime for producing an HDR frame at two different resolutions.

Parameters and runtime

Table 4 compares the parameters and runtime of the three methods. Note that Kalantari19 [24] and our method were run on an NVIDIA V100 GPU, while Kalantari13 [25] was run on CPUs. Our model's parameters are split between the CoarseNet and the RefineNet. It takes around one second for our method to produce an HDR frame at the higher resolution in Table 4, which is comparable to Kalantari19 [24] and significantly faster than Kalantari13 [25].

Coarse-to-fine architecture

To verify the design of our coarse-to-fine architecture, we compared our method with two baselines. The first was the CoarseNet alone, which performs optical flow alignment and fusion in the image space (similar to [24]). The second was a RefineNet variant that directly takes the LDR frames as input and performs alignment and fusion in the feature space. Experiments with IDs 0–2 in Table 5 show that our full method achieves the best results on all three datasets, demonstrating the effectiveness of the coarse-to-fine architecture.

Network design of the RefineNet

To investigate the effect of the deformable alignment (DA) module and the temporal attention fusion (TAF) module, we trained two variant models: one without the DA module and one replacing the TAF module with a convolution after feature concatenation. Experiments with IDs 2–4 in Table 5 show that removing either component results in decreased performance, verifying the network design of the RefineNet.

                            Synthetic Dataset    Static Scenes w/ GT   Dynamic Scenes w/ GT
ID  Method                  PSNR    HDR-VDP-2    PSNR    HDR-VDP-2     PSNR    HDR-VDP-2
0   CNet                    39.25   70.81        40.62   74.51         44.43   77.74
1   RNet                    39.69   70.95        37.61   75.30         43.70   78.97
2   CNet + RNet             40.34   71.79        41.18   76.15         45.46   79.09
3   CNet + RNet w/o DA      39.72   71.38        40.52   74.79         45.09   78.24
4   CNet + RNet w/o TAF     40.03   71.66        40.80   76.12         45.17   78.99

Table 5: Ablation study on three datasets (synthetic, real static scenes with GT, and real dynamic scenes with GT) with two alternating exposures. CNet and RNet are short for CoarseNet and RefineNet.

6 Conclusion

We have introduced a coarse-to-fine deep learning framework for HDR video reconstruction from sequences with alternating exposures. Our method first performs coarse HDR video reconstruction in the image space and then refines the coarse predictions in the feature space to remove the ghosting artifacts. To enable more comprehensive evaluation on real data, we created a real-world benchmark dataset for this problem. Extensive experiments on synthetic and real datasets show that our method significantly outperforms previous methods.

Currently, our method is trained on synthetic data. Since we have captured a large-scale dynamic scene dataset, we will investigate self-supervised training or finetuning on real-world videos in the future.

References

  • [1] A. O. Akyüz, R. Fleming, B. E. Riecke, E. Reinhard, and H. H. Bülthoff (2007) Do HDR displays support LDR content? a psychophysical evaluation. TOG. Cited by: §2.
  • [2] F. Banterle, K. Debattista, A. Artusi, S. Pattanaik, K. Myszkowski, P. Ledda, and A. Chalmers (2009) High dynamic range imaging and low dynamic range expansion for generating HDR content. In Computer Graphics Forum, Cited by: §2.
  • [3] F. Banterle, P. Ledda, K. Debattista, and A. Chalmers (2006) Inverse tone mapping. In Proceedings of the 4th international conference on Computer graphics and interactive techniques in Australasia and Southeast Asia, Cited by: §2.
  • [4] F. Banterle, P. Ledda, K. Debattista, and A. Chalmers (2008) Expanding low dynamic range videos for high dynamic range applications. In Proceedings of the 24th Spring Conference on Computer Graphics, Cited by: §2.
  • [5] M. Bertalmío (2019) Vision models for high dynamic range and wide colour gamut imaging: techniques and applications. Academic Press. Cited by: §5.
  • [6] J. Cai, S. Gu, and L. Zhang (2018) Learning a deep single image contrast enhancer from multi-exposure images. TIP. Cited by: §2.
  • [7] I. Choi, S. Baek, and M. H. Kim (2017) Reconstructing interlaced high-dynamic-range video using joint learning. TIP. Cited by: §2.
  • [8] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei (2017) Deformable convolutional networks. In ICCV, Cited by: §1, §3.3, §3.3.
  • [9] P. E. Debevec and J. Malik (1997) Recovering high dynamic range radiance maps from photographs. In SIGGRAPH, Cited by: §2, §3.1, §3.2, §4.
  • [10] G. Eilertsen, J. Kronander, G. Denes, R. K. Mantiuk, and J. Unger (2017) HDR image reconstruction from a single exposure using deep cnns. TOG. Cited by: §2.
  • [11] Y. Endo, Y. Kanamori, and J. Mitani (2017) Deep reverse tone mapping. TOG. Cited by: §2.
  • [12] J. Froehlich, S. Grandinetti, B. Eberhardt, S. Walter, A. Schilling, and H. Brendel (2014) Creating cinematic wide gamut HDR-video for the evaluation of tone mapping operators and HDR-displays. In Digital Photography X, Cited by: §5.1, §5.2.
  • [13] R. Gil Rodríguez, J. Vazquez-Corral, and M. Bertalmío (2019) Issues with common assumptions about the camera pipeline and their impact in hdr imaging from multiple exposures. SIAM Journal on Imaging Sciences. Cited by: §5.1.
  • [14] M. D. Grossberg and S. K. Nayar (2003) What is the space of camera response functions?. In CVPR, Cited by: §3.1.
  • [15] Y. Gryaditskaya, T. Pouli, E. Reinhard, K. Myszkowski, and H. Seidel (2015) Motion aware exposure bracketing for HDR video. In Computer Graphics Forum, Cited by: §2.
  • [16] S. Hajisharif, J. Kronander, and J. Unger (2015) Adaptive dualiso HDR reconstruction. EURASIP Journal on Image and Video Processing 2015. Cited by: §2.
  • [17] J. Han, C. Zhou, P. Duan, Y. Tang, C. Xu, C. Xu, T. Huang, and B. Shi (2020) Neuromorphic camera guided high dynamic range imaging. In CVPR, Cited by: §2.
  • [18] S. W. Hasinoff, D. Sharlet, R. Geiss, A. Adams, J. T. Barron, F. Kainz, J. Chen, and M. Levoy (2016) Burst photography for high dynamic range and low-light imaging on mobile cameras. TOG. Cited by: §2.
  • [19] F. Heide, M. Steinberger, Y. Tsai, M. Rouf, D. Pajak, D. Reddy, O. Gallo, J. Liu, W. Heidrich, K. Egiazarian, et al. (2014) FlexISP: a flexible camera image processing framework. TOG. Cited by: §2.
  • [20] J. Hu, O. Gallo, K. Pulli, and X. Sun (2013) HDR deghosting: how to deal with saturation?. In CVPR, Cited by: §2.
  • [21] Y. Huo, F. Yang, L. Dong, and V. Brost (2014) Physiological inverse tone mapping based on retina response. The Visual Computer. Cited by: §2.
  • [22] M. Jaderberg, K. Simonyan, A. Zisserman, et al. (2015) Spatial transformer networks. In NIPS, Cited by: §3.2.
  • [23] N. K. Kalantari and R. Ramamoorthi (2017) Deep high dynamic range imaging of dynamic scenes. TOG. Cited by: §1, §2, §3.2, §3.2, §4, §4.
  • [24] N. K. Kalantari and R. Ramamoorthi (2019) Deep HDR video from sequences with alternating exposures. In Computer Graphics Forum, Cited by: Figure 1, §1, §2, §3.1, §3.2, §3.2, §3.2, §3.2, §3.2, Figure 9, §5.1, §5.1, §5.3, §5.3, §5.3, §5.4, §5.4, Table 2, Table 3, Table 4, §5, §5.
  • [25] N. K. Kalantari, E. Shechtman, C. Barnes, S. Darabi, D. B. Goldman, and P. Sen (2013) Patch-based high dynamic range video. TOG. Cited by: §1, §2, §3.1, §3.3, §4, Table 1, Figure 9, §5.3, §5.3, §5.3, §5.3, §5.3, §5.4, Table 2, Table 3, Table 4, §5, §5.
  • [26] S. B. Kang, M. Uyttendaele, S. Winder, and R. Szeliski (2003) High dynamic range video. In TOG, Cited by: §1, §2, §5.
  • [27] S. Y. Kim, J. Oh, and M. Kim (2019) Deep SR-ITM: joint learning of super-resolution and inverse tone-mapping for 4K UHD HDR applications. Cited by: §2.
  • [28] S. Y. Kim, J. Oh, and M. Kim (2020) JSI-GAN: gan-based joint super-resolution and inverse tone-mapping with pixel-wise task-specific filters for UHD HDR video. In AAAI, Cited by: §2.
  • [29] D. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In ICLR, Cited by: §5.1.
  • [30] R. P. Kovaleski and M. M. Oliveira (2014) High-quality reverse tone mapping for a wide range of exposures. In 2014 27th SIBGRAPI Conference on Graphics, Patterns and Images, Cited by: §2.
  • [31] J. Kronander, S. Gustavson, G. Bonnet, A. Ynnerman, and J. Unger (2014) A unified framework for multi-sensor HDR video reconstruction. Signal Processing: Image Communication. Cited by: §1, §2, §5.1.
  • [32] S. Lee, G. Hwan An, and S. Kang (2018) Deep recursive HDRI: inverse tone mapping using generative adversarial networks. In ECCV, Cited by: §2.
  • [33] Y. Li, C. Lee, and V. Monga (2016) A maximum a posteriori estimation framework for robust high dynamic range video synthesis. TIP. Cited by: §1, §2, §3.1.
  • [34] O. Liba, K. Murthy, Y. Tsai, T. Brooks, T. Xue, N. Karnad, Q. He, J. T. Barron, D. Sharlet, R. Geiss, et al. (2019) Handheld mobile photography in very low light. TOG. Cited by: §2.
  • [35] Y. Liu, W. Lai, Y. Chen, Y. Kao, M. Yang, Y. Chuang, and J. Huang (2020) Single-image HDR reconstruction by learning to reverse the camera pipeline. In CVPR, Cited by: §2.
  • [36] Z. Liu, L. Yuan, X. Tang, M. Uyttendaele, and J. Sun (2014) Fast burst images denoising. TOG. Cited by: §2.
  • [37] K. Ma, H. Li, H. Yong, Z. Wang, D. Meng, and L. Zhang (2017) Robust multi-exposure image fusion: a structural patch decomposition approach. TIP. Cited by: §2.
  • [38] S. Mangiat and J. Gibson (2010) High dynamic range video with ghost removal. In Applications of Digital Image Processing, Cited by: §2.
  • [39] S. Mangiat and J. Gibson (2011) Spatially adaptive filtering for registration artifact removal in HDR video. In ICIP, Cited by: §1, §2.
  • [40] S. Mann and R. Picard (1995) On being ‘undigital’ with digital cameras: extending dynamic range by combining differently exposed pictures. In IS&T, Cited by: §2.
  • [41] R. Mantiuk, K. J. Kim, A. G. Rempel, and W. Heidrich (2011) HDR-VDP-2: a calibrated visual metric for visibility and quality predictions in all luminance conditions. TOG. Cited by: §5.
  • [42] D. Marnerides, T. Bashford-Rogers, J. Hatchett, and K. Debattista (2018) ExpandNet: a deep convolutional neural network for high dynamic range expansion from low dynamic range content. In Computer Graphics Forum, Cited by: §2.
  • [43] M. McGuire, W. Matusik, H. Pfister, B. Chen, J. F. Hughes, and S. K. Nayar (2007) Optical splitting trees for high-precision monocular imaging. IEEE Computer Graphics and Applications. Cited by: §2.
  • [44] C. A. Metzler, H. Ikoma, Y. Peng, and G. Wetzstein (2020) Deep optics for single-shot high-dynamic-range imaging. In CVPR, Cited by: §2.
  • [45] K. Moriwaki, R. Yoshihashi, R. Kawakami, S. You, and T. Naemura (2018) Hybrid loss for learning single-image-based HDR reconstruction. arXiv preprint arXiv:1812.07134. Cited by: §2.
  • [46] M. Narwaria, M. P. Da Silva, and P. Le Callet (2015) HDR-VQM: an objective quality measure for high dynamic range video. Signal Processing: Image Communication. Cited by: §5.
  • [47] S. K. Nayar and T. Mitsunaga (2000) High dynamic range imaging: spatially varying pixel exposures. In CVPR, Cited by: §2.
  • [48] Y. Niu, J. Wu, W. Liu, W. Guo, and R. W. Lau (2020) HDR-gan: hdr image reconstruction from multi-exposed ldr images with large motions. arXiv preprint arXiv:2007.01628. Cited by: §2.
  • [49] T. Oh, J. Lee, Y. Tai, and I. S. Kweon (2014) Robust high dynamic range imaging by rank minimization. TPAMI. Cited by: §2.
  • [50] E. Reinhard, M. Stark, P. Shirley, and J. Ferwerda (2002) Photographic tone reproduction for digital images. In TOG, Cited by: §5.
  • [51] M. S. Santos, T. I. Ren, and N. K. Kalantari (2020) Single image HDR reconstruction using a cnn with masked features and perceptual loss. In SIGGRAPH, Cited by: §2.
  • [52] P. Sen, N. K. Kalantari, M. Yaesoubi, S. Darabi, D. B. Goldman, and E. Shechtman (2012) Robust patch-based HDR reconstruction of dynamic scenes. TOG. Cited by: §2.
  • [53] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §3.3.
  • [54] Q. Sun, E. Tseng, Q. Fu, W. Heidrich, and F. Heide (2020) Learning rank-1 diffractive optics for single-shot high dynamic range imaging. In CVPR, Cited by: §2.
  • [55] Y. Tian, Y. Zhang, Y. Fu, and C. Xu (2020) TDAN: temporally-deformable alignment network for video super-resolution. In CVPR, Cited by: §1, §3.3.
  • [56] M. D. Tocci, C. Kiser, N. Tocci, and P. Sen (2011) A versatile HDR video production system. In TOG, Cited by: §1, §2.
  • [57] X. Wang, K. C. Chan, K. Yu, C. Dong, and C. Change Loy (2019) EDVR: video restoration with enhanced deformable convolutional networks. In CVPR Workshops, Cited by: §1, §3.3, §3.3.
  • [58] S. Wu, J. Xu, Y. Tai, and C. Tang (2018) Deep high dynamic range imaging with large foreground motions. In ECCV, Cited by: §1, §2, §3.2.
  • [59] T. Xue, B. Chen, J. Wu, D. Wei, and W. T. Freeman (2019) Video enhancement with task-oriented flow. International Journal of Computer Vision. Cited by: §5.1.
  • [60] Q. Yan, D. Gong, Q. Shi, A. v. d. Hengel, C. Shen, I. Reid, and Y. Zhang (2019) Attention-guided network for ghost-free high dynamic range imaging. In CVPR, Cited by: §1, §2, §3.2, Table 2, Table 3, §5.
  • [61] Q. Yan, L. Zhang, Y. Liu, Y. Zhu, J. Sun, Q. Shi, and Y. Zhang (2020) Deep HDR imaging via a non-local network. TIP. Cited by: §2.
  • [62] J. Zhang and J. Lalonde (2017) Learning high dynamic range from outdoor panoramas. In ICCV, Cited by: §2.
  • [63] H. Zhao, B. Shi, C. Fernandez-Cull, S. Yeung, and R. Raskar (2015) Unbounded high dynamic range photography using a modulo camera. In ICCP, Cited by: §1, §2.