High dynamic range (HDR) imaging is of fundamental importance in modern digital photography pipelines and is used to produce a high-quality photograph with well exposed regions despite varying illumination across the image. This is typically achieved by merging multiple low dynamic range (LDR) images taken at different exposures. However, over-exposed regions and misalignment errors due to poorly compensated motion result in artefacts such as ghosting. In this paper, we present a new HDR imaging technique that specifically models alignment and exposure uncertainties to produce high quality HDR results. We introduce a strategy that learns to jointly align and assess the alignment and exposure reliability using an HDR-aware, uncertainty-driven attention map that robustly merges the frames into a single high quality HDR image. Further, we introduce a progressive, multi-stage image fusion approach that can flexibly merge any number of LDR images in a permutation-invariant manner. Experimental results show our method can produce better quality HDR images with up to 0.8dB PSNR improvement over the state-of-the-art, and subjective improvements in terms of better detail, colours, and fewer artefacts.
Despite recent advances in imaging technology, capturing scenes with wide dynamic range still poses several challenges. Current camera sensors suffer from limited or Low Dynamic Range (LDR) due to inherent hardware limitations. The maximum dynamic range a camera can capture is closely related to (a) the sensor’s photosite full well electron capacity or saturation point, and (b) the black point, which is generally constrained by the uncertainty in the reading due to the dominant presence of noise.
Different solutions have been proposed to overcome these limitations. The principle behind most of them relies on capturing observations of the same scene with different exposure values. This enables a richer coverage of the scene's original dynamic range, but also requires a mechanism to align and unify the different captured observations. Some approaches make use of multi-sensor or multi-camera configurations, e.g. Tocci et al. , McGuire et al. , Froehlich et al. , where a beam splitter enables the light to be captured by multiple sensors. However, such setups are normally expensive and fragile, with bulky and cumbersome rigs, and they may suffer from double contours, light flares, or polarization artefacts.
More pragmatic solutions use only a single sensor and obtain multiple exposures by either spatial (i.e. per-pixel varying exposure) [5, 6] or temporal multiplexing (i.e. capturing differently exposed frames). This simpler hardware setup (and its related algorithms) has recently seen widespread adoption, and is now found in cameras ranging from professional DSLRs to low-cost smartphones.
Early multi-frame exposure fusion algorithms work remarkably well for almost-static scenes (e.g. tripod, reduced motion) but result in ghosting and other motion-related artefacts for dynamic scenes. In recent years, Convolutional Neural Networks (CNNs) have greatly advanced the state-of-the-art for HDR reconstruction, especially for complex dynamic scenes.
Most HDR CNNs rely on a rigid setup with a fixed, ordered set of LDR input images, which assumes the medium exposure to be the reference image. The most common mechanism for the merging step is image or feature concatenation, and thus for methods where the feature encoder is not shared among input frames, there is a dependency between reference frame choice, relative exposure and input image ordering. Optimal exposure parameters or fast object motion might constrain the number of relevant frames available, and in general, broader flexibility in terms of number of frames and choice of reference is necessary to extend applicability without the burden of model tweaking or retraining.
As for frame registration, previous models largely rely on pre-trained or classical off-the-shelf optical flow methods that are rarely designed or optimized for the characteristics of exposure-bracketed LDR images. Recent pixel rejection or attention strategies are disconnected from the alignment stage and mostly ignore uncertainty in exposure or motion.
In this paper, we propose a novel algorithm that addresses these limitations in a holistic and unified way. First, we design an HDR-specific optical flow network which can predict accurate optical flow estimates even when the input and target frames are under- or over-exposed. We do this by using symmetric pooling operations to share information between all input frames, so any missing information in one frame can be borrowed from other frames. Further, we propose models of exposure and alignment uncertainties which are used by our flow and attention networks to regulate contributions from unreliable and misaligned pixels. Finally, we propose a flexible architecture that can process any number of input frames provided in any order.
The contributions of this paper are threefold:
A lightweight, iterative and self-supervised HDR-specific optical flow network which can estimate accurate pixel correspondences between LDR frames, even when improperly exposed, by sharing information between all input frames with symmetric pooling operations.
Models of exposure and alignment uncertainty which we use to regulate contributions from unreliable and misaligned pixels and greatly reduce ghosting artefacts.
A flexible architecture with a multi-stage fusion mechanism which can estimate an HDR image from an arbitrary set of LDR input images.
In this section we review the HDR literature with a focus on relevant deep-learning multi-frame exposure fusion methods. For a broader overview we refer the reader to [10, 11].
The seminal work of  was the first to introduce a training and testing dataset with dynamic scene content. Their proposed method for learning-based HDR fusion is composed of two stages: first, input LDR images are aligned using a classical optical flow algorithm  and then a CNN is trained to both merge images and potentially correct any errors in the alignment. Shortly after,  proposed a similar approach that does not perform a dense optical flow estimation, but rather uses an image-wide homography to perform background alignment, leaving the more complex non-rigid foreground motions to be handled by the CNN. However, this method is highly dependent on the structure of the reference image, and the magnitude and complexity of the motion. Thus, if certain regions are saturated in the reference image, it fails to accurately reconstruct them in the final result. Both  and  rely on the optimisation of the HDR reconstruction loss to implicitly learn how to correct ghosting and handle the information coming from different frames. However, neither provides an explicit mechanism to prevent incorrect information (e.g. overexposed regions) from influencing the final HDR estimation. Despite the noteworthy performance improvement over existing methods at the time, these approaches still suffer from ghosting, especially for fast moving objects and saturated or near-saturated regions.
Yan et al.  address some limitations of their predecessors by introducing an attention mechanism to suppress undesired information (e.g. misalignments, overexposed regions) before the merging stage, and focus instead on desirable details of non-reference frames that might be missing in the reference frame. In the work of Prabhakar et al. , parts of the computation, including the optical flow estimation, are performed at a lower resolution and later upscaled back to full resolution using a guide image generated with a simple weight map, thus saving some computation.
More recently, the state of the art in HDR imaging has been pushed to new highs.  propose the first GAN-based approach to HDR reconstruction which is able to synthesize missing details in areas with disocclusions. Liu et al.  introduce a method which uses deformable convolutions as an alignment mechanism instead of optical flow and was the winning submission to the 2021 NTIRE HDR Challenge . Contemporary work has explored and pioneered new training paradigms, such as the weakly supervised training strategy proposed by .
Extending these methods to an arbitrary number of images requires changes to the model definition and re-training. Set-processing neural networks can naturally handle those requirements. In , a permutation invariant CNN is used to deblur a burst of frames which present only rigid, zero-mean translations with no explicit motion registration. For the HDR task,  proposed a method that uses symmetric pooling aggregation to fuse any number of images, but requires pre-alignment  and artefact correction by networks which only work on image pairs.
Given a set of N LDR images with different exposure values, our aim is to reconstruct a single HDR image which is aligned to a reference frame. To simplify notation, we take the first frame as the reference, but any input frame can be chosen as the reference frame. To generate the inputs to our model, we follow the work of [13, 8, 7] and form a linearized image H_i for each LDR frame L_i as follows:

H_i = L_i^γ / t_i,

where t_i is the exposure time of image L_i and γ is a power-law non-linearity. Setting γ = 2.2 approximates the inverse of gamma correction, while dividing by the exposure time adjusts all the images to have consistent brightness. We concatenate L_i and H_i in the channel dimension to form a 6-channel input image X_i. Given a set of inputs {X_1, …, X_N}, our proposed network estimates the HDR image by:
Ĥ = f({X_i}_{i=1..N}; θ),

where f denotes our network and θ are its learned weights. Our network accepts any number of frames N and is invariant to the order of the non-reference inputs. This is different from the work of [13, 8, 7], where the value of N is fixed to 3 and the order of inputs is fixed, and the work of , where only the fusion stage is performed on N inputs, but frame alignment and attention are performed on image pairs only. Our method performs alignment, regulates the contribution of each frame based on the related alignment and exposure uncertainties, and flexibly fuses any number of input frames in a permutation-invariant manner. Our network is trained end-to-end and is learned entirely during the HDR training.
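As a concrete sketch of this input formation, a minimal NumPy version (function and variable names are ours, not the paper's) might look as follows:

```python
import numpy as np

GAMMA = 2.2  # power-law non-linearity approximating inverse gamma correction

def linearize(ldr, exposure_time):
    """Map an LDR frame to the linear domain: H_i = L_i**gamma / t_i."""
    return np.power(ldr, GAMMA) / exposure_time

def make_input(ldr, exposure_time):
    """Concatenate L_i and its linearized version H_i along the channel
    axis to form the 6-channel network input X_i."""
    return np.concatenate([ldr, linearize(ldr, exposure_time)], axis=-1)

frame = np.full((2, 2, 3), 0.25)          # toy LDR frame with values in [0, 1]
x = make_input(frame, exposure_time=0.5)  # short exposure: brightness is scaled up
assert x.shape == (2, 2, 6)
```

Dividing by the exposure time brings all frames to a common brightness scale, which is what makes cross-frame comparison and merging meaningful.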
Our architecture is composed of: Learnable Exposure Uncertainty (Sec. III-C), HDR Iterative Optical Flow (Sec. III-D), Alignment Uncertainty and Attention (Sec. III-E), and Merging Network (Sec. III-F). An overview of the architecture can be seen in Figure 2. Our architecture makes use of max-pooling operations to share information between frames and to fuse frames together (Sec. III-B). This improves the accuracy of our flow and attention networks and gives us the advantage of an architecture that is flexible enough to accept an arbitrary number of images. The flow network and the attention network work together to align non-reference frames to the reference frame and suppress artefacts from misaligned and over-exposed regions. The merging network then combines the aligned features to predict a single HDR image. By explicitly modelling the two most common sources of error, motion and exposure, we create a network that is aware of uncertainty and is able to greatly reduce artefacts compared to state-of-the-art methods, as shown in Figure 1.
Many state-of-the-art CNN HDR reconstruction methods require a fixed number of inputs in fixed order of exposure [13, 8, 7]. To overcome this limitation, we design a set-processing network that can naturally deal with any number of input images. Related concepts have previously shown strong benefits for problems such as deblurring  and we here propose to leverage set-processing and permutation invariance tools for HDR fusion.
Given N input images, our network uses N identical copies of itself with shared weights to process each image separately in its own stream. We use a multi-stage fusion mechanism, where features of each individual stream at an arbitrary point within the network can share information with each other as follows:

X'_i = c([X_i, m(X_1, …, X_N)]),

where m denotes a max-pooling operation across the set of streams, [·, ·] denotes concatenation and c denotes a convolutional layer (see Fig. 5). This operation is repeated at multiple points in the network. Finally, the outputs of each stream are pooled together into a single stream with a global max-pooling operation. This result is processed further in the final layers of our network to obtain the HDR prediction. This allows the network to process any number of frames in a permutation-invariant manner while each stream is still informed by the other frames.
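The fusion step can be sketched in NumPy as follows, modelling the shared convolution as a plain matrix multiply over the channel dimension (all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
C = 8                                  # feature channels per stream
W = rng.standard_normal((2 * C, C))    # shared "1x1 convolution" weights

def fuse_streams(features):
    """One multi-stage fusion step: concatenate each stream with the
    elementwise max over all streams, then mix with shared weights.

    features: list of (H, W, C) arrays, one per input frame
    """
    pooled = np.maximum.reduce(features)  # symmetric max-pool over the set
    return [np.concatenate([f, pooled], axis=-1) @ W for f in features]

streams = [rng.standard_normal((4, 4, C)) for _ in range(3)]
out_a = fuse_streams(streams)
out_b = fuse_streams([streams[2], streams[0], streams[1]])
# The pooled context is identical under any permutation of the inputs,
# so each stream's output does not depend on input order.
assert np.allclose(out_a[0], out_b[1])
```

Because the max-pool is symmetric in its arguments, adding or removing a stream changes only the pooled context, not the per-stream computation, which is what makes a variable number of inputs possible.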
A key limitation of LDR images is that any pixel value above the sensor saturation point results in information loss. Values approaching the saturation level are also unreliable due to negative post-saturation noise. When reconstructing an HDR image from multiple LDR images, values at or close to the saturation point can produce artefacts in the final output. Furthermore, underexposed values close to zero are also unreliable due to dark current and low signal-to-noise ratio. We seek to regulate the contribution of such values by modelling our confidence in a given pixel value being correct. For a given input image, we propose the following piecewise linear function, whose breakpoints are predicted by the network for each image:
Here denotes the mean value across the three RGB channels and is the predicted exposure map which represents our estimated confidence in a given pixel. This function is plotted in Figure 4. We learn from data how to predict and by means of a shallow network, i.e. a convolution acting on the concatenation of followed by a spatial average pooling. We constrain and such that . As approaches 0 or 1, the pixel becomes increasingly unreliable and the value of the exposure mask approaches zero. The slope with which approaches zero is determined by and . As shown in Figure 1 this allows us to regulate the contribution that improperly exposed regions in an image can have on our result.
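A sketch of such a confidence function is given below; here the two breakpoints are fixed constants for illustration, whereas in the method the corresponding parameters are predicted per image by a shallow network:

```python
import numpy as np

def exposure_confidence(y, lo=0.05, hi=0.95):
    """Piecewise-linear exposure confidence for a pixel.

    y:      mean of the three RGB channels, in [0, 1]
    lo, hi: breakpoints (hypothetical fixed values; the paper learns
            the corresponding parameters per image)

    Returns 1 for well-exposed values in [lo, hi], falling linearly to
    0 as y approaches 0 (under-exposure) or 1 (saturation).
    """
    rise = np.clip(y / lo, 0.0, 1.0)                  # ramp up from y = 0
    fall = np.clip((1.0 - y) / (1.0 - hi), 0.0, 1.0)  # ramp down to y = 1
    return np.minimum(rise, fall)

y = np.array([0.0, 0.5, 1.0])
conf = exposure_confidence(y)
assert np.allclose(conf, [0.0, 1.0, 0.0])
```

The breakpoints control the slope of each ramp, so a learned version can decide per image how aggressively to discount near-saturated or near-black pixels.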
Recent learning based optical flow methods  typically do not work well for HDR. Successive frames can have large amounts of missing information due to overexposure, which makes aligning frames difficult. This is especially true if the reference and non-reference frames are both overexposed. We solve this issue by using max-pooling operations to share information between all input frames in our flow network’s encoder, as described in Eq. (3). This lets the network fill in missing information from any of the available input frames and predict more accurate flows.
The architecture of our proposed flow network is inspired by RAFT; however, we design the network to be lightweight and efficient. We do not use a context encoder, a correlation layer or a convolutional gated recurrent unit, instead using only simple convolutional layers to predict our optical flow field.
Given an input X_i and an exposure mask, we use a convolutional layer to extract features from X_i. The inputs to the flow network are then formed by concatenating these features with the features extracted from the reference image. The flow network is also informed by the exposure mask, so that our predictions are aware of the exposure uncertainty in the image. As recurrent convolutions can be computationally expensive at full resolution, the flow network first downsamples the input features by a factor of 8 using strided convolutions. It then iteratively refines the predicted flow over 16 iterations, with the flow initialized to zero, to obtain the optical flow field via:
where denotes our optical flow network. The optical flow field is resized to the original resolution with bilinear upsampling and used to warp our features:
where are the warped features and denotes the function of warping an image with an optical flow field. The architecture of our flow network can be seen in Figure 3.
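The warping operator can be sketched as a backward bilinear warp. Below is a minimal NumPy stand-in (a full implementation would typically use something like torch.nn.functional.grid_sample instead):

```python
import numpy as np

def warp(image, flow):
    """Backward-warp `image` with a dense optical flow field using
    bilinear interpolation.

    image: (H, W, C)  features of the non-reference frame
    flow:  (H, W, 2)  per-pixel (dx, dy) offsets toward the reference
    """
    H, W, _ = image.shape
    ys, xs = np.mgrid[0:H, 0:W].astype(np.float64)
    x = np.clip(xs + flow[..., 0], 0, W - 1)   # sample locations, clamped
    y = np.clip(ys + flow[..., 1], 0, H - 1)   # to the image borders
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    x1, y1 = np.minimum(x0 + 1, W - 1), np.minimum(y0 + 1, H - 1)
    wx, wy = (x - x0)[..., None], (y - y0)[..., None]
    top = image[y0, x0] * (1 - wx) + image[y0, x1] * wx
    bot = image[y1, x0] * (1 - wx) + image[y1, x1] * wx
    return top * (1 - wy) + bot * wy

img = np.arange(16, dtype=np.float64).reshape(4, 4, 1)
shift = np.zeros((4, 4, 2))
shift[..., 0] = 1.0                # sample one pixel to the right everywhere
out = warp(img, shift)
assert out[0, 0, 0] == img[0, 1, 0]
```

The same operator can be applied unchanged to the exposure mask, which is how the mask stays spatially aligned with the warped features.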
Unlike other methods which use fixed alignment [8, 21], our flow network is trained in a fully self-supervised manner. As ground truth optical flows for our datasets are unavailable, we use the self-supervised photometric loss between the reference features and the warped features as supervision to guide the learning of the flow network. We multiply the loss by so that the reference frame is only used as supervision in regions where it is well exposed. We also apply the optical flow field to the exposure mask, so it remains spatially aligned with the warped features:
where is the warped exposure mask.
Our attention network is informed by two measures of uncertainty: exposure and alignment. To model the alignment uncertainty we compute an uncertainty map as:
where denotes the elementwise absolute value and denotes element-wise multiplication. This map captures the difference in motion between the reference frame and the warped frame and helps inform our attention network of any inconsistencies in the alignment. We multiply by so that only the well exposed regions of the reference frame are used to calculate misalignments. By regulating the contributions from misaligned areas of the image, our network can significantly reduce ghosting in the final output. The exposure uncertainty is given by the warped exposure map. The inputs to the attention network are then concatenated as follows. The attention network predicts a 64 channel attention map as follows:
where denotes our attention network. As in our flow network, we use max-pooling to share information between all input frames. We then obtain our regulated features by multiplying the warped features by the attention map and the exposure map:
where denotes element-wise multiplication and denotes the regulated features. Multiplication by the exposure map enforces a strict constraint on our network and prevents unreliable information leaking into our output. Our HDR-aware attention effectively regulates the contribution of each frame, taking into account both alignment and exposure uncertainty.
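Putting the two uncertainty cues together, the regulation step can be sketched as follows (the attention map here is a simple hand-written stand-in for the CNN-predicted one; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
H = W = 4
C = 3

ref      = rng.random((H, W, C))   # reference-frame features
warped   = rng.random((H, W, C))   # aligned non-reference features
exp_ref  = rng.random((H, W, 1))   # exposure confidence of the reference
exp_warp = rng.random((H, W, 1))   # warped exposure confidence

# Alignment uncertainty: feature difference between reference and warped
# frame, gated so only well-exposed reference pixels flag misalignments.
align_unc = np.abs(ref - warped) * exp_ref

# Stand-in attention in (0, 1]; the method predicts this with a small
# CNN from the warped features and both uncertainty maps.
attention = 1.0 / (1.0 + align_unc.mean(axis=-1, keepdims=True))

# Regulated features: attention softly down-weights suspect regions and
# the exposure map hard-gates unreliable pixel values.
regulated = warped * attention * exp_warp
assert regulated.shape == (H, W, C)
```

The key property is that the regulated features can only shrink relative to the warped ones: misaligned or badly exposed pixels contribute less to the merge, never more.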
Our merging network takes the regulated features obtained from Equation 10 and merges them into a single HDR image. The merging network is based on a Grouped Residual Dense Block (GRDB) , which consists of three Residual Dense Blocks (RDBs) . We modify the GRDB so that each stream can share information with the other streams for a multi-stage fusion of features. An overview of the fusion mechanism can be seen in Figure 5. Specifically, we add a max-pooling operation after each RDB which follows the formulation described in Equation 3. This allows the network to progressively merge features from different streams, instead of merging them in a single concatenation step where information might be lost. This is followed by a final global max-pooling operation which collapses the streams into one. The merging network then processes this result further with a global residual connection and refinement convolutions.
As HDR images are not viewed in the linear domain, we follow previous work and use the μ-law to map from the linear HDR image H to the tonemapped image T(H):

T(H) = log(1 + μH) / log(1 + μ),
where H is the linear HDR image, T(H) is the tonemapped image and μ = 5000. We then compute the ℓ1-norm between the tonemapped prediction and the ground truth to construct a tonemapped loss as follows:
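Assuming the standard μ-law with μ = 5000 used in prior HDR work, the tonemapped loss can be sketched as:

```python
import numpy as np

MU = 5000.0

def mu_tonemap(h):
    """Mu-law range compression of a linear HDR image with values in [0, 1]."""
    return np.log(1.0 + MU * h) / np.log(1.0 + MU)

def tonemapped_l1(pred, gt):
    """L1 loss between the tonemapped prediction and ground truth."""
    return np.abs(mu_tonemap(pred) - mu_tonemap(gt)).mean()

h = np.linspace(0.0, 1.0, 5)
t = mu_tonemap(h)
assert t[0] == 0.0 and abs(t[-1] - 1.0) < 1e-12
assert tonemapped_l1(h, h) == 0.0
```

The log compression allocates more of the loss to dark regions, matching how tonemapped HDR images are actually viewed.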
To improve the quality of reconstructed textures we also use the perceptual loss as in . We pass the tonemapped images through a pre-trained VGG-19  and extract features from three intermediate layers. We minimize the ℓ1-norm between the features of the ground truth and our prediction:
where is a pre-trained VGG-19 network. Finally, to provide supervision for our optical flow network, we calculate a simple photometric loss between the warped features and the reference features and multiply by to limit supervision to well exposed regions in the reference frame:
Our total loss function can be expressed as:
During training, we take a random crop of size from the input image. We perform random horizontal and vertical flipping and random rotation by 0°, 90°, 180° or 270° to further augment the training data. We train using a batch size of and a learning rate of with the Adam optimizer. We implement the model in PyTorch.
We conduct several experiments both comparing against well-known state-of-the-art algorithms and also individually validating the contributions in an extensive ablation study. The experimental setup is described below.
Datasets: We use the dynamic training and testing datasets provided by Kalantari and Ramamoorthi , which include 89 scenes in total. Each of these scenes includes three differently exposed input LDR images (with EVs of -2.00, 0.00, +2.00 or -3.00, 0.00, +3.00) which contain dynamic elements (e.g. camera motion, non-rigid movements), and a ground-truth image aligned with the medium frame captured via static exposure fusion. Additionally, we use the dynamic testing dataset provided by Chen et al.  for further evaluation. As this dataset does not have a corresponding training set, all methods are trained on the Kalantari dataset and evaluated on the Chen dataset. We test on the 3-Exposure setting, which has the ground truth aligned to the middle exposure. To keep the setup consistent with training, we restrict the number of input frames to three, with EVs of -2.00, 0.00, +2.00.
Metrics: We include three different objective metrics in our quantitative evaluation, reproducing the same benchmark setup as recent publications [8, 13, 15]. First, we compute PSNR-L, a fidelity metric computed directly on the linear HDR estimates. HDR linear images are normally tonemapped for visualization, and thus we also include PSNR-μ, which evaluates PSNR on images tonemapped using the μ-law defined in Eq. (11), a simple canonical tonemapper. Lastly, we also compute HDR-VDP-2 , which estimates both visibility and quality differences between image pairs.
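For reference, the two PSNR variants can be computed as follows (a sketch assuming a peak value of 1):

```python
import numpy as np

def psnr(pred, gt, peak=1.0):
    """Peak signal-to-noise ratio in dB (PSNR-L when applied to linear HDR)."""
    mse = np.mean((pred - gt) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

def psnr_mu(pred, gt, mu=5000.0):
    """PSNR-mu: PSNR computed after mu-law tonemapping as in Eq. (11)."""
    tm = lambda x: np.log(1.0 + mu * x) / np.log(1.0 + mu)
    return psnr(tm(pred), tm(gt))

gt = np.zeros((4, 4))
pred = np.full((4, 4), 0.1)
assert abs(psnr(pred, gt) - 20.0) < 1e-6
# The mu-law strongly expands dark values, so the same absolute error
# yields a much lower tonemapped PSNR here.
assert psnr_mu(pred, gt) < psnr(pred, gt)
```

This makes explicit why the two metrics can disagree: PSNR-L weights all linear intensities equally, while PSNR-μ emphasizes errors in the dark regions that dominate after tonemapping.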
We evaluate the contribution of the different parts of our model architecture on the Kalantari dataset.
In Table I, we evaluate the quantitative impact of using our multi-stage fusion mechanism as well as the performance gain from our proposed flow network and our uncertainty modelling. Our baseline model uses the same architecture as our proposed method, but with the flow network, uncertainty modelling and multi-stage max-pooling removed, instead using concatenation as the fusion mechanism, and the attention mechanism from .
Fusion Mechanism. We show in Table I that using our multi-stage fusion mechanism outperforms concatenation by 0.39dB PSNR-L and 0.27dB PSNR-μ. The progressive sharing of information between streams allows the network to retain more information and produce sharper, more detailed images.
Motion Alignment and Modelling Uncertainty. We look at the performance of our proposed flow network and uncertainty modelling in Table I. Our flow network improves PSNR-L by a large 0.7dB, and PSNR-μ by 0.05dB. We validate the contribution of our learnable model of exposure uncertainty by comparing it to the fixed, non-learnable exposure model used by . Our learnable exposure modelling shows an improvement of 0.07dB in PSNR-μ and 0.08dB in PSNR-L compared to the fixed exposure model. We also validate the contribution of our alignment uncertainty, which gives an improvement of 0.07dB in PSNR-μ and 0.32dB in PSNR-L when compared to using only exposure uncertainty.
We evaluate the performance of our proposed method for the HDR estimation task and compare it to other state-of-the-art methods both quantitatively and qualitatively. The methods included in our benchmark cover a broad range of approaches, namely: the patch-based method of Sen et al. ; methods which use traditional alignment followed by CNNs to correct dense and global alignment [7, 18, 8]; the flexible aggregation approach of , which also uses dense alignment; methods which rely on attention or feature selection followed by a CNN to deghost and merge images [13, 32]; a GAN-based approach which can synthesize missing details in areas with disocclusions ; and a method which uses deformable convolutions as an alignment mechanism . We show in Table II the quantitative evaluation on the Kalantari test set. The improvement in terms of PSNR-μ when compared with the runner-up HDR-GAN is large (i.e. dB), and we observe an even higher improvement in terms of PSNR-L (i.e. dB) when compared against the runner-up in this metric, Prabhakar21 . Similarly, the HDR-VDP-2 score obtained by our method outperforms all others by a wide margin (i.e. 0.8).
We observe similar performance on the Chen et al. dynamic test set, demonstrating good generalization and out-of-domain performance. We show in Table III that the improvement in PSNR-μ when compared with the runner-up Prabhakar21  is large (i.e. dB), and we observe a good improvement in terms of PSNR-L (i.e. dB) when compared against the runner-up in this metric, ADNet . We obtain the second best HDR-VDP-2 score on this test set, with ADNet outperforming our method by 0.58.
In Figures 6, 7, and 8 we show visualizations of our algorithm compared with the benchmarked methods for qualitative, subjective evaluation. All other methods present traces of ghosting artefacts around the edges near a moving object, especially where disocclusions happen and one or more frames have overexposed values in those locations (e.g. moving head, moving arm). Our method tackles such challenges effectively thanks to its exposure confidence awareness, and strongly suppresses ghosting artefacts. Additionally, our method demonstrates better performance on edges and textures (e.g. building facade), as well as better out-of-domain low-light performance.
We show that our model is flexible enough to accept an arbitrary number of images without the need for re-training. In Table IV we evaluate the performance of our proposed model when trained and tested on different numbers of images with different exposures. We use the following input frame configurations for training and testing: Short + Medium + Long, Short + Medium, Medium + Long, Medium. As expected, performance is best when the testing configuration is seen during training. Our model trained on all permutations achieves very good cross-setting performance, obtaining the best results for the S + M and M + L settings. It is also competitive with our best model for the M and S + M + L settings, without needing any extra training time, and is capable of accepting a range of different input configurations without the need for re-training. In fact, we show that our model using only two frames (S + M) can obtain results outperforming current state-of-the-art methods using all three frames in PSNR-μ and PSNR-L. In Figure 9 we show that our model is capable of producing high quality HDR outputs for a range of different input configurations. Furthermore, we show in Figure 11 that our method can use any frame as the reference frame without re-training, providing superior flexibility and choice when compared to state-of-the-art methods. However, a limitation of our method is that it is not able to hallucinate details in large over-exposed regions, as seen in Figure 10.
In this paper we explored modelling exposure and alignment uncertainties to improve HDR imaging performance. We presented (1) an HDR-specific optical flow network capable of accurate flow estimation, even with improperly exposed input frames, by sharing information between input images with a symmetric pooling operation; (2) models of exposure and alignment uncertainty which we use to regulate contributions from unreliable and misaligned pixels and greatly reduce ghosting artefacts; and (3) a flexible architecture which uses multi-stage fusion to estimate an HDR image from an arbitrary set of LDR input images. We conducted extensive ablation studies validating each of our contributions individually. We compared our method to other state-of-the-art algorithms, obtaining significant improvements across all measured metrics and noticeably improved visual results.
Y. Zhang, Y. Tian, Y. Kong, B. Zhong, and Y. Fu, “Residual dense network for image super-resolution,” in Computer Vision and Pattern Recognition, 2018.