High dynamic range (HDR) videos can represent natural scenes more realistically, with higher bit depth per pixel and rich colors from a wider color gamut. They can be viewed on HDR TVs, which also tend to be UHD (Ultra High Definition) with very high resolution. However, even with the latest UHD HDR TVs, the vast majority of transmitted visual content is still low resolution (LR), standard dynamic range (SDR) video. There is also an abundance of legacy LR SDR videos, which creates the need for conversion technologies that can generate high resolution (HR) HDR videos from LR SDR videos. This can be achieved by joint super-resolution (SR) and inverse tone-mapping (ITM), where SR up-scales LR videos to HR, and ITM up-converts SDR videos to HDR.
In joint SR-ITM, it is important to restore details while up-scaling the image resolution, and to enhance local contrast while increasing the signal amplitudes. In this paper, we take a 'divide-and-conquer' approach by dividing this joint problem into three sub-tasks: image reconstruction (IR), detail restoration (DR), and local contrast enhancement (LCE). A single subnet is dedicated to each of the tasks, but all subnets are jointly trained for the same goal of joint SR-ITM. To overcome the limitations of conventional convolution filters, which are shared over all positions of the input in each layer, we design a pair of pixel-wise 1D separable filters in the DR subnet for detail restoration and a pixel-wise 2D local filter in the LCE subnet for contrast enhancement. Moreover, the 1D separable and 2D local filters are designed to be scalable to the up-scaling factor. This approach is inherently different from approaches that directly produce the output HR HDR images. Furthermore, each input frame is decomposed into a base layer and a detail layer by the guided filter. In our composite framework, the DR subnet, LCE subnet and IR subnet are combined to produce faithful HR HDR results.
Generative adversarial networks (GANs) have been widely applied in low level vision tasks, such as SR, where they tend to generate images with high subjective (perceptual) quality but low objective quality (e.g. PSNR, SSIM, etc.). For joint SR-ITM, directly generating output images with conventional GAN-based methods can lead to unsatisfying results that lack details and show unnatural local contrasts, since simultaneously enhancing the local contrast and details while increasing both the bit depth and the spatial resolution is a very challenging task when training GAN-based frameworks. Therefore, our GAN-based joint SR-ITM method, called JSI-GAN, incorporates a novel detail loss that drives its generator to mimic the perceptually realistic details in the ground truth, and a feature-matching loss that helps mitigate the drop in objective performance by stabilizing the training process.
Our contributions are summarized as follows:
We are the first to propose a GAN framework for joint SR-ITM, called JSI-GAN, with a novel detail loss and a feature-matching loss that enable the restoration of realistic details and stabilize training.
The generator of JSI-GAN is designed to have task-specific subnets (DR/IR/LCE subnets), with pixel-wise 1D separable filters for local detail improvement and a 2D local filter for local contrast enhancement, both of which incorporate the local up-sampling operation for the given up-scaling factor.
The DR subnet focuses on the high frequency components to elaborately restore the details of HR HDR output, while the LCE subnet effectively restores the local contrast by focusing on the base layer components of LR SDR input.
Pixel-Wise or Pixel-Aware Filter Learning
In convolution layers, the same convolution kernels are applied at all spatial positions of the input, and once trained, the same kernels are used for any input image. Therefore, to consider pixel-wise and sample-wise diversity, Brabandere et al. first introduced dynamic filter networks for video and stereo prediction, where position-specific filters are predicted through a CNN and applied as an inner product at each pixel position of the input image. Since a different filter is applied to each pixel, and different filters are predicted from different input images, they allow for sample-specific and position-specific filtering. This operation is called dynamic local filtering.
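Dynamic local filtering as described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `dynamic_local_filter` and the tensor layout are assumptions made for this example, and the per-pixel kernels are passed in directly rather than predicted by a CNN.

```python
import numpy as np

def dynamic_local_filter(image, filters):
    """Apply a different k x k filter at every pixel of a 2D image.

    image:   (H, W) array.
    filters: (H, W, k, k) array, one kernel per pixel. In a dynamic filter
             network these would be predicted by a CNN; here they are inputs.
    """
    h, w = image.shape
    k = filters.shape[-1]
    pad = k // 2
    padded = np.pad(image, pad, mode="constant")  # zero padding at the borders
    out = np.empty((h, w), dtype=np.float64)
    for y in range(h):
        for x in range(w):
            patch = padded[y:y + k, x:x + k]            # k x k neighbourhood of (y, x)
            out[y, x] = np.sum(patch * filters[y, x])   # inner product with that pixel's kernel
    return out

rng = np.random.default_rng(0)
img = rng.random((6, 8))

# If every pixel receives an identity kernel, the image is returned unchanged.
identity = np.zeros((6, 8, 3, 3))
identity[:, :, 1, 1] = 1.0
same = dynamic_local_filter(img, identity)

# Position-specific behaviour: the left half keeps each pixel,
# the right half replaces it with its 3 x 3 mean.
mixed = identity.copy()
mixed[:, 4:] = 1.0 / 9.0
out = dynamic_local_filter(img, mixed)
```

The explicit loops keep the per-pixel inner product visible; a practical implementation would vectorize this or run it on a GPU.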
This concept was successfully utilized in other video-related tasks, such as frame interpolation and video SR. Niklaus et al.'s first idea, named adaptive convolution, was similar to dynamic filter networks in that 2D local filters are predicted. Their extended work with two 1D separable filters (horizontal/vertical) made it possible to enlarge the final receptive field with the same number of parameters. For filter generation networks, the receptive field for the final output is solely defined by the size of the generated filter, implying that the depth or kernel sizes of the middle layers do not affect the final receptive field. Jo et al. used 3D-CNNs and added an up-sampling feature in generating the 2D filters (dynamic up-sampling filters), while incorporating a conventional residual network with direct reconstruction.
In our architecture, we design (i) a DR subnet with pixel-wise 1D separable horizontal and vertical filters to capture the distinct details; (ii) an LCE subnet with pixel-wise 2D local filters so that it can focus on the local region contrast. With the same number of filter parameters, 1D separable filters are coarse representations of a larger region, whereas 2D filters are finer representations of a local receptive area. In our framework, the DR subnet, LCE subnet and the IR subnet are combined to produce the final HR HDR images.
Generative Adversarial Networks
A GAN-based framework is typically composed of a generator and a discriminator, which are trained in an adversarial way to force the generator to synthesize realistic images that the discriminator cannot distinguish from real ones. Many advanced techniques and variants of GANs have achieved significant performance improvements in various computer vision tasks, especially in image restoration, such as image dehazing, SR [26, 17], denoising and image enhancement. These methods adopted various GAN-based frameworks to improve perceptual quality for their individual purposes, and they generally train the main network as the generator, which is first trained with a pixel-wise norm-based loss, and is then fine-tuned by introducing an adversarial loss with the discriminator.
Motivated by the enhanced perceptual quality of GAN-based methods, we also design a GAN-based framework for joint SR-ITM. However, simply adopting conventional GANs for joint SR-ITM makes the network difficult to train, because the task is more complex: the local variations of contrast and details must be improved jointly with up-conversion to both a higher bit depth per pixel and a larger spatial resolution. To guide the generator more effectively to fool the discriminator and generate perceptually realistic HR HDR results, we propose a novel detail GAN loss, which jointly optimizes a second discriminator for the detail components decomposed from the HR HDR result and the ground truth. We also employ a feature-matching loss, measured in the feature space of the discriminator, to reduce the drop in objective performance and stabilize training.
Deep-learning-based joint SR-ITM is a recent topic that has emerged with the advent of UHD HDR TVs. The first CNN-based joint SR-ITM method took a multi-task learning approach and focused on the advantages of performing the individual SR and ITM tasks simultaneously, along with the joint SR-ITM task. A more recent method is Deep SR-ITM, where the input is decomposed, and element-wise modulations are inserted to introduce spatially-variant operations. We also use the decomposed input for the DR subnet, but our architecture incorporates pixel-wise filters of two kinds, separable 1D kernels and 2D local kernels with up-sampling, which generate the final HR HDR output through filtering operations. Moreover, our network is a GAN-based framework, unlike the previous methods.
We propose a GAN-based framework for joint SR-ITM, called JSI-GAN, where the generator, JSInet, is composed of three different task-specific subnets.
Network Architecture (Generator)
In joint SR-ITM, it is important to restore the high frequency details and enhance local contrast as the image resolution and pixel amplitudes increase, in order to generate a high quality HR HDR image. Therefore, as a divide-and-conquer approach, we design three subnets, each dedicated to one of these subtasks: the image reconstruction (IR) subnet reconstructs a coarse HR HDR image; the detail restoration (DR) subnet restores the details to be added to the coarse image; and the local contrast enhancement (LCE) subnet generates a local contrast mask to boost the contrast in this image. The detailed structure of JSInet is depicted in Fig. 2.
Detail Restoration (DR) Subnet
For the DR subnet, the input is the detail layer $X_d$, containing the high frequency components of the LR SDR input image $X$. $X_d$ is given by

$$X_d = X \oslash X_b, \tag{1}$$

where $X_b$ is the guided filtered output of $X$ and $\oslash$ denotes element-wise division, as in Deep SR-ITM. $X_d$ is utilized to generate horizontal and vertical 1D separable filters. Formally, a residual block (ResBlock) $RB$ is defined as

$$RB(x) = (Conv \circ RL \circ Conv \circ RL)(x) + x, \tag{2}$$

where $x$ is the input to the ResBlock, $Conv$ is a convolution layer, and $RL$ is a ReLU activation. Then, the horizontal 1D filter $f_{1D}^{h}$ is obtained by

$$f_{1D}^{h} = (Conv \circ RL \circ Conv \circ RL \circ RB^{n} \circ Conv)(X_d), \tag{3}$$

where $RB^{n}$ denotes a serial cascade of $n$ ResBlocks. The vertical 1D filter $f_{1D}^{v}$ can be obtained in the same way as Eq. (3). As shown in Fig. 2, all layers except the last convolution layer are shared when producing $f_{1D}^{h}$ and $f_{1D}^{v}$.

Each of the two last convolution layers consists of $41 \times scale \times scale$ output channels, where 41 is the length of the 1D separable kernel, each applied onto its corresponding grid location, and $scale \times scale$ takes into account the pixel shuffling operation for the up-scaling factor $scale$. Hence, this dynamic separable up-sampling operation ($\dot{\circledast}_s$) simultaneously applies two 1D separable filters while producing the spatially up-scaled output. Then, the final filtered output of the DR subnet is given by

$$D = X_d \,\dot{\circledast}_s\, (f_{1D}^{v}, f_{1D}^{h}). \tag{4}$$

The generated 1D kernels are position-specific, and also detail-specific, as different kernels are generated for different detail layers, unlike convolution filters that are fixed after training. In our implementation, one of the two 1D filters was first applied to the detail layer via local filtering for each scale channel, followed by applying the other on its output. Finally, pixel shuffle was applied on the final filtered output with $scale \times scale$ channels for spatial up-scaling. With the same number of parameters, the 1D separable kernels cover a wider receptive field (two kernels of length $k$, i.e. $2k$ parameters, span a $k \times k$ region), but as a coarse approximation compared to a 2D kernel ($k \times k$ parameters for the same region). In our case, the separable 1D kernels of size $k = 41$ can be compared to a 2D kernel of size $9 \times 9$, with a similar number of parameters ($2 \times 41 = 82$ versus $9 \times 9 = 81$).
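The dynamic separable up-sampling step (per-pixel 1D filtering for each scale channel, followed by pixel shuffle) can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the function names, the `(H, W, k, s*s)` filter layout, zero padding, and the choice to apply the vertical filter first are all assumptions, and a short kernel length of 5 replaces 41 to keep the demo small.

```python
import numpy as np

def local_filter_1d(x, f, axis):
    """Per-pixel 1D filtering. x: (H, W); f: (H, W, k); axis 0 = vertical, 1 = horizontal."""
    k = f.shape[-1]
    pad = k // 2
    pad_width = [(0, 0), (0, 0)]
    pad_width[axis] = (pad, pad)
    xp = np.pad(x, pad_width)  # zero padding along the filtered axis only
    out = np.zeros(x.shape, dtype=np.float64)
    for t in range(k):
        # Shifted copy of x along `axis`, weighted by the t-th tap of each pixel's kernel.
        shifted = xp[t:t + x.shape[0], :] if axis == 0 else xp[:, t:t + x.shape[1]]
        out += shifted * f[..., t]
    return out

def pixel_shuffle(x, s):
    """(H, W, s*s) -> (s*H, s*W); channel c lands on sub-pixel (c // s, c % s)."""
    h, w, _ = x.shape
    return x.reshape(h, w, s, s).transpose(0, 2, 1, 3).reshape(h * s, w * s)

def dynamic_separable_upsample(x, f_v, f_h, s):
    """x: (H, W); f_v, f_h: (H, W, k, s*s) per-pixel vertical/horizontal kernels."""
    channels = []
    for c in range(s * s):
        tmp = local_filter_1d(x, f_v[..., c], axis=0)               # vertical pass (assumed first)
        channels.append(local_filter_1d(tmp, f_h[..., c], axis=1))  # horizontal pass on its output
    return pixel_shuffle(np.stack(channels, axis=-1), s)

# Demo: with delta kernels in every channel, the operation reduces to
# nearest-neighbour up-sampling by the factor s.
x = np.arange(12, dtype=np.float64).reshape(3, 4)
k, s = 5, 2
delta = np.zeros((3, 4, k, s * s))
delta[:, :, k // 2, :] = 1.0
up = dynamic_separable_upsample(x, delta, delta, s)

# Parameter budget per output position: two 1D kernels of length 41 vs. one 9 x 9 kernel.
params_1d = 2 * 41
params_2d = 9 * 9
```

The delta-kernel case is a convenient sanity check because the expected output is known in closed form.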
Local Contrast Enhancement (LCE) Subnet
The base layer $X_b$, obtained through guided filtering, is used as the input to the LCE subnet. The LCE subnet generates a $9 \times 9$ 2D local filter at each pixel grid position. As in the DR subnet, it is also an up-sampling filter, constituted by $9 \times 9 \times scale \times scale$ output channels in the last layer. The 2D filter $f_{2D}$ is then given by

$$f_{2D} = (Conv \circ RL \circ Conv \circ RL \circ RB^{n} \circ Conv)(X_b). \tag{5}$$

With the 2D local filter predicted, the dynamic 2D up-sampling operation ($\circledast_s$) is performed, and $2 \times sigmoid(\cdot)$ is applied so that the output lies in the range [0, 2] with the middle value at 0 being 1. Formally, the final output of the LCE subnet is given as

$$C_l = 2 \times sigmoid(X_b \circledast_s f_{2D}). \tag{6}$$

As $C_l$ is considered an LCE mask and is multiplied element-wise onto the sum of the two outputs from the IR and DR subnets, the JSInet converges better with the scaled sigmoid function. Without it, the pixel values of the initial predicted outputs (after multiplying with $C_l$) are too small, causing a much longer training time to reach an appropriate pixel range for the final HR HDR output of JSInet.
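The effect of the scaled sigmoid can be checked numerically with a small sketch. The function name `lce_mask` is hypothetical, and the input stands in for the locally filtered base layer.

```python
import numpy as np

def lce_mask(filtered):
    """Map a filtered base layer to a multiplicative mask in (0, 2): 2 * sigmoid(z)."""
    return 2.0 / (1.0 + np.exp(-filtered))

z = np.array([-6.0, 0.0, 6.0])        # toy filter responses
mask = lce_mask(z)                    # scaled sigmoid
plain = 1.0 / (1.0 + np.exp(-z))      # ordinary sigmoid, for comparison

signal = 0.8                          # toy value of (coarse image + details)
neutral = signal * mask[1]            # response 0 -> mask 1 -> signal unchanged
halved = signal * plain[1]            # ordinary sigmoid would halve it
```

A response of 0 therefore gives a neutral mask of 1, so early in training the product stays in the range of the un-masked signal. A mask centred at 0.5 or near 0 would shrink it instead.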
Image Reconstruction Subnet
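For the IR subnet, the LR SDR input $X$ itself is fed in, to first produce the intermediate features $i_{IR}$ as shown in Fig. 2, given by

$$i_{IR} = (RL \circ RB^{n} \circ Conv)(X). \tag{7}$$

Then, $i_{IR}$ is concatenated with the intermediate features $i_{DR}$ from the DR subnet, and the final output of the IR subnet is directly generated (without local filtering), as

$$I = (PS \circ Conv \circ RL \circ Conv \circ RL \circ RB^{n} \circ Conv)([i_{IR} \; i_{DR}]), \tag{8}$$

where $PS$ is a pixel-shuffle operator and $[x \; y]$ is the concatenation of $x$ and $y$ in the channel direction.

Then, the final HR HDR prediction $P$ is generated by adding the details ($D$) to $I$, followed by multiplying the result by the contrast mask ($C_l$), which is given by

$$P = (I + D) \times C_l. \tag{9}$$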
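The final composition step (add the details to the coarse image, then multiply by the contrast mask) is simple enough to verify by hand. The function and argument names below are hypothetical, and the arrays are toy values rather than network outputs.

```python
import numpy as np

def compose_prediction(coarse, details, contrast_mask):
    """Final prediction: (coarse image + details) * contrast mask, element-wise."""
    return (coarse + details) * contrast_mask

coarse = np.array([[0.50, 0.50]])   # toy output of the IR subnet
details = np.array([[0.10, -0.10]]) # toy output of the DR subnet
mask = np.array([[1.00, 1.50]])     # toy LCE mask, values in (0, 2)

pred = compose_prediction(coarse, details, mask)
# Left pixel:  (0.5 + 0.1) * 1.0 = 0.6  (mask of 1 is neutral)
# Right pixel: (0.5 - 0.1) * 1.5 = 0.6  (mask above 1 boosts the value)
```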
We provide an ablation study on the three subnets and demonstrate that they are acting according to their given tasks in the later sections of the paper.
We employ a GAN-based framework, where the discriminator is designed as shown in Fig. 3 with spectral normalization for stable learning. During training, the discriminator is alternately fed the predicted HR HDR image ($P$) generated by the generator and the ground truth image ($Y$), and learns to distinguish between them. In its structure, batch normalization layers and Leaky ReLU (LReLU) activations with a slope of 0.2 are adopted, as shown in Fig. 3. Further details on the discriminator architecture are described in the Supplementary Material.
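The RaHinge GAN loss is adopted as the basic adversarial loss for efficient training, which is given by

$$L_{adv}^{D} = \mathbb{E}_{Y}\left[\max(0, Q - \tilde{D}_{Y,P})\right] + \mathbb{E}_{P}\left[\max(0, Q + \tilde{D}_{P,Y})\right], \tag{10}$$

$$L_{adv}^{G} = \mathbb{E}_{P}\left[\max(0, Q - \tilde{D}_{P,Y})\right] + \mathbb{E}_{Y}\left[\max(0, Q + \tilde{D}_{Y,P})\right], \tag{11}$$

where $L_{adv}^{D}$ and $L_{adv}^{G}$ denote the RaHinge GAN losses for the discriminator $D_1$ and the generator $G$, respectively, $\tilde{D}_{Y,P} = D_1(Y) - \mathbb{E}_{P}[D_1(P)]$, and $\tilde{D}_{P,Y} = D_1(P) - \mathbb{E}_{Y}[D_1(Y)]$ with $Q = 1$. It should be noted in Eq. (11) that $L_{adv}^{G}$ contains both $P$ and $Y$, meaning that the generator is trained by gradients from both $P$ and $Y$ during the adversarial training. We also use a feature-matching loss $L_{fm}$, where the L2 loss is measured between the feature maps of $P$ and $Y$, taken from the first LReLU output of the $m$-th DisBlock as shown in Fig. 3. However, simply using the above two losses is insufficient to train the generator effectively for joint SR-ITM.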
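A relativistic average hinge loss and a feature-matching loss can be sketched in NumPy as follows. This follows the general relativistic average hinge formulation with a margin of 1 and is not taken from the authors' code. The function names are hypothetical, the inputs are raw discriminator scores rather than images, and the mean squared difference used for feature matching is an assumed normalization.

```python
import numpy as np

def rahinge_losses(d_real, d_fake, margin=1.0):
    """Relativistic average hinge losses from raw discriminator scores.

    d_real: scores for ground-truth images; d_fake: scores for generated images.
    Returns (discriminator_loss, generator_loss).
    """
    real_rel = d_real - d_fake.mean()   # real score relative to the average fake score
    fake_rel = d_fake - d_real.mean()   # fake score relative to the average real score
    loss_d = (np.maximum(0.0, margin - real_rel).mean()
              + np.maximum(0.0, margin + fake_rel).mean())
    loss_g = (np.maximum(0.0, margin - fake_rel).mean()
              + np.maximum(0.0, margin + real_rel).mean())
    return loss_d, loss_g

def feature_matching_loss(feats_real, feats_fake):
    """Sum over layers of the mean squared difference between paired feature maps."""
    return sum(np.mean((fr - ff) ** 2) for fr, ff in zip(feats_real, feats_fake))

# A discriminator that separates real from fake perfectly: it has zero loss,
# while the generator is penalised through both the real and the fake term.
d_real = np.array([2.0, 2.0])
d_fake = np.array([-2.0, -2.0])
loss_d, loss_g = rahinge_losses(d_real, d_fake)

feats = [np.ones((2, 4, 4)), np.zeros((2, 2, 2))]
fm_same = feature_matching_loss(feats, feats)
```

Because the generator loss includes the real scores as well as the fake ones, its gradient depends on both kinds of sample, which is the property noted for the generator loss above.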
Notes to Table 1: (-) denotes subtraction instead of division to obtain the detail layer. PSNR is measured in dB.
Detail GAN Loss
We propose a novel detail GAN loss, denoted $L_{adv}^{d}$, for the joint SR-ITM task, in order to enforce more stable training and generate visually pleasing HR HDR results. $L_{adv}^{d}$ is calculated according to Eq. (10) and Eq. (11) by replacing $P$ with $P_d$ and $Y$ with $Y_d$, where the subscript $d$ denotes the detail layer component of the original image. For $L_{adv}^{d}$, we adopt a second discriminator ($D_2$) distinct from the first discriminator ($D_1$). Both have the same architecture as shown in Fig. 3, but $D_2$ alternately takes the two inputs $P_d$ and $Y_d$, calculated by Eq. (1). $L_{adv}^{d}$ not only guides the generator toward more stable training but also helps improve both local contrasts and details in the predicted HR HDR images.
The total loss for our proposed GAN-based framework for joint SR-ITM is given by
where the superscript $d$ denotes the loss for the detail layer components ($P_d$, $Y_d$). We provide another ablation study on the feature-matching loss and the detail GAN loss, and especially show the effect of the newly proposed detail GAN loss, in the later sections of the paper.
For training, the generator was first pre-trained with only the L2 loss, using an initial learning rate that was decreased by a factor of 10 at epochs 200 and 225 out of 250 epochs in total, yielding the JSInet. It was then fine-tuned within a stable GAN-based framework with three losses, one for the generator and one for each of the two discriminators, using an initial learning rate that decays linearly to zero from the 5th epoch of 10 epochs in total, which finally yields the JSI-GAN. We used three Adam optimizers, one for minimizing each of these losses, and all convolution weights were initialized by the Xavier method. The generator and the two discriminators ($D_1$, $D_2$) are trained alternately by the three corresponding Adam optimizers. The loss weights were set empirically. In the generator, the convolution layers have 64 output channels, except for the last layers that predict the local filters for the DR subnet and the LCE subnet, and the layer before pixel shuffle in the IR subnet. $D_1$ and $D_2$ share the same channel configuration. The LR SDR patches are cropped from 8-bit YUV frames in BT.709, and the ground truth (HR HDR) patches are cropped from the corresponding 10-bit YUV frames in the BT.2020 color container after PQ-OETF, following the setting in previous work. For training and testing the JSI-GAN, we utilized the 4K HDR dataset used in Deep SR-ITM.
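The two learning-rate schedules described above (step decay for pre-training, linear decay to zero for fine-tuning) can be sketched as plain Python functions. The function names, the 0-based epoch indexing, and the placeholder base rate of 1.0 are assumptions for illustration, so the returned values are multipliers of whatever initial learning rate is actually used.

```python
def pretrain_lr(epoch, base_lr):
    """Step schedule: divide by 10 at epoch 200 and again at epoch 225 (250 epochs in total)."""
    if epoch >= 225:
        return base_lr / 100.0
    if epoch >= 200:
        return base_lr / 10.0
    return base_lr

def finetune_lr(epoch, base_lr, decay_start=5, total_epochs=10):
    """Constant until decay_start, then linear decay that reaches zero at total_epochs."""
    if epoch < decay_start:
        return base_lr
    return base_lr * (total_epochs - epoch) / (total_epochs - decay_start)

base = 1.0  # placeholder: results are multipliers of the real initial learning rate
pretrain_curve = [pretrain_lr(e, base) for e in (0, 199, 200, 224, 225, 249)]
finetune_curve = [finetune_lr(e, base) for e in range(11)]
```

With these assumptions the fine-tuning multiplier stays at 1.0 through epoch 5, then drops by 0.2 per epoch until it reaches 0.0 at epoch 10.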
| Method | Scale | PSNR (dB) | SSIM | mPSNR (dB) | MS-SSIM | HDR-VDP (Q) |
|---|---|---|---|---|---|---|
| EDSR+Huo et al. | ×2 | 29.76 | 0.8934 | 31.81 | 0.9764 | 58.95 |
| EDSR+Eilertsen et al. | ×2 | 25.80 | 0.7586 | 28.22 | 0.9635 | 53.51 |
| JSInet (w/o GAN) | ×2 | 35.99 | 0.9768 | 38.20 | 0.9843 | 60.58 |
| JSI-GAN (w/ GAN) | ×2 | 35.73 | 0.9763 | 37.96 | 0.9841 | 60.80 |
| EDSR+Huo et al. | ×4 | 28.90 | 0.8753 | 30.92 | 0.9693 | 55.59 |
| EDSR+Eilertsen et al. | ×4 | 26.54 | 0.7822 | 28.75 | 0.9631 | 53.88 |
| JSInet (w/o GAN) | ×4 | 33.74 | 0.9598 | 35.93 | 0.9759 | 56.45 |
| JSI-GAN (w/ GAN) | ×4 | 33.50 | 0.9572 | 34.82 | 0.9743 | 56.41 |
Performance of JSInet
We first investigate the performance of JSInet that is trained solely on the L2 loss without the GAN framework, by analyzing the efficacy of its three subnets.
Ablation Study of Subnets
We first performed an ablation study on the three subnets in JSInet. Table 1 shows the PSNR/SSIM performance for six combinations of the three subnets, where the IR subnet is used as the essential subnet in all cases. As shown in column (c) of Table 1, adding the DR subnet to the IR subnet brings a 0.44 dB gain in PSNR, and additionally using the LCE subnet in column (f) further increases the PSNR by 0.11 dB, resulting in a total gain of 0.55 dB over using only the IR subnet in column (a). It is also noted that the LCE subnet brings a higher gain of 0.23 dB when fused with only the IR subnet (column (d)), meaning that the LCE subnet is complementarily beneficial with the DR subnet. We have also provided the experiment results for a different decomposition strategy on the input images, subtraction instead of the division in Eq. (1), marked with the sign (-) in Table 1. For the structure with only the IR and DR subnets, there is minimal difference in PSNR, although division yields slightly better SSIM. However, when the LCE subnet is jointly utilized, division outperforms subtraction by a larger margin of 0.31 dB in PSNR. This is because, for column (e), the local contrast map (obtained from the base layer) is multiplied onto the coarse image with the added details in Eq. (9), and the subtraction operation does not match this multiplication.
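The difference between the two decomposition strategies can be made concrete with a toy example. This sketch only illustrates which recombination each decomposition pairs with and does not reproduce the ablation itself: the box blur is a stand-in for the guided filter, and `box_blur`, `eps` and the toy image are assumptions.

```python
import numpy as np

def box_blur(x, radius=2):
    """Simple mean filter, used here only as a stand-in for the guided filter."""
    k = 2 * radius + 1
    xp = np.pad(x, radius, mode="edge")
    out = np.zeros(x.shape, dtype=np.float64)
    for dy in range(k):
        for dx in range(k):
            out += xp[dy:dy + x.shape[0], dx:dx + x.shape[1]]
    return out / (k * k)

rng = np.random.default_rng(0)
x = rng.uniform(0.1, 1.0, size=(16, 16))  # strictly positive toy image
base = box_blur(x)                         # base layer (low frequencies)
eps = 1e-8                                 # hypothetical guard against division by zero

detail_div = x / (base + eps)   # division-based detail layer
detail_sub = x - base           # subtraction-based detail layer

recon_div = base * detail_div   # division pairs with multiplicative recombination
recon_sub = base + detail_sub   # subtraction pairs with additive recombination
```

Division yields a detail layer that recombines with the base layer by multiplication, which is the same kind of operation as applying a multiplicative contrast mask. The subtraction-based layer recombines additively.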
To verify that each of the subnets is focusing on its dedicated task, the intermediate predictions of the IR, DR and LCE subnets ($I$, $D$, $C_l$) and the final prediction $P$ are visualized in Fig. 4. For visualization, the intermediate predictions were first linearly scaled to a maximum value of 8 bits/pixel, and the detail output $D$ was further scaled by 64 for better visibility. In Fig. 4, the added details $D$ are invariant to the brightness context (1st and 2nd rows) and lighting conditions (3rd row), focusing only on the edges and texture as intended. The LCE mask $C_l$ effectively modulates the local contrast, producing the final $P$ with enhanced contrast.
Performance of JSI-GAN
We also conducted an ablation study on the efficacy of the various losses, in terms of their weighting parameters, to investigate their effect. Table 3 shows the average PSNR (dB) and SSIM performance of the JSI-GAN variants, each trained with or without the feature-matching loss and the detail GAN loss, for both up-scaling factors. If the JSI-GAN is trained with only the basic adversarial loss, severe performance degradation is observed, with a drop of up to 2 dB in PSNR (variants (a) and (d)). Comparing variants (b) to (a) and (e) to (d), additionally adopting the feature-matching loss between $P$ and $Y$ brings significant PSNR gains of 1.9 dB and 1.37 dB, respectively. Finally, our proposed detail GAN loss between $P_d$ and $Y_d$ further improves the quantitative performance, as shown in variants (c) and (f) of Table 3. The effect of the losses is also shown qualitatively in Fig. 5. Simply adopting the basic GAN loss not only degrades quantitative performance, but also severely deteriorates the visual quality, with checkerboard artifacts and unnatural details and contrasts, as shown in the leftmost column of Fig. 5. Although the feature-matching loss helps the generator improve the overall visual quality (variant (b) compared to (a)), the proposed detail-component-related losses additionally improve both visual quality and objective performance (variant (c) compared to variants (a) and (b)). As a result, the final JSI-GAN enables the restoration of realistic details after stable training while allowing for high objective quality of the HR HDR predictions.
Comparison with Other Methods
We compare our JSI-GAN with the two previous joint SR-ITM methods, the Multi-purpose CNN and Deep SR-ITM. Additionally, the JSI-GAN is compared with cascades of an SR method, EDSR, and two ITM methods [8, 4]. The previous methods were compared following the experiment protocol on the ITM prediction pipeline and visualization as described in Deep SR-ITM.
The quantitative comparison of the proposed JSInet (without GAN) and JSI-GAN (with GAN) against the previous methods is given in Table 2. The performance is measured using error-based metrics such as PSNR and mPSNR, and structural metrics such as SSIM and MS-SSIM, as well as HDR-VDP-2.2.1, which is able to measure the performance degradation in all luminance conditions. The CNN-based joint SR-ITM methods outperform the cascaded methods by a large margin, and our JSInet outperforms the other methods in all cases except for HDR-VDP at scale factor 2. The proposed JSI-GAN also shows good quantitative performance, thanks to the stable training that mitigates the drop in objective performance metrics.
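For reference, PSNR, the simplest of the error-based metrics above, can be computed as in the sketch below. This is the standard definition, not the authors' evaluation code. The 10-bit peak value of 1023 and the toy frames are assumptions, and the exact normalization or color channels used for Table 2 are not stated here. mPSNR and HDR-VDP are more involved and are not covered.

```python
import numpy as np

def psnr(reference, prediction, peak):
    """Peak signal-to-noise ratio in dB for signals with maximum value `peak`."""
    diff = reference.astype(np.float64) - prediction.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical signals
    return 10.0 * np.log10(peak ** 2 / mse)

peak_10bit = 1023                       # assumed peak for 10-bit content
ref = np.full((4, 4), 512)              # toy ground-truth frame
off_by_one = ref + 1                    # every pixel wrong by one code value
value = psnr(ref, off_by_one, peak_10bit)
identical = psnr(ref, ref, peak_10bit)
```

With an error of one code value at every pixel the mean squared error is 1, so the result equals 20·log10(1023), the upper bound for any non-identical pair of integer-valued 10-bit frames.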
The qualitative comparison of the proposed JSI-GAN is given in Fig. 1 and Fig. 6. In Fig. 1, our method is able to reconstruct the fine lines on the window, produce more tree-like textures, and generate correct horizontal patterns on the wall. Likewise in Fig. 6, the proposed JSI-GAN generates fine details with enhanced contrast without artifacts in the homogeneous regions (e.g. sky), thanks to both our task-specific subnets and detail component-related losses.
In this paper, we proposed the first GAN-based framework for joint SR-ITM, called JSI-GAN, where the generator (JSInet) jointly optimizes three task-specific subnets designed with pixel-wise 1D separable filters for fine detail restoration and a pixel-wise 2D local filter for contrast enhancement. These subnets were carefully designed for their intended purposes to boost the performance significantly. Moreover, we proposed a novel detail GAN loss alongside the conventional GAN loss, which helps enhance both local details and contrasts for generating high quality HR HDR reconstructions. We analyzed the efficacy of the subnet components and the weighting parameters for the losses with intensive ablation studies. Our proposed JSI-GAN, which can directly convert LR SDR frames to HR HDR ones in real-world applications, achieves state-of-the-art performance over the previous methods.
- (2018) Image blind denoising with generative adversarial network based noise modeling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3155–3164.
- (2018) Deep photo enhancer: unpaired learning for image enhancement from photographs with GANs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6306–6314.
- (2016) Dynamic filter networks. In Advances in Neural Information Processing Systems, pp. 667–675.
- (2017) HDR image reconstruction from a single exposure using deep CNNs. ACM Transactions on Graphics 36 (6), pp. 178.
- (2010) Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, pp. 249–256.
- (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680.
- (2012) Guided image filtering. IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (6), pp. 1397–1409.
- (2014) Physiological inverse tone mapping based on retina response. The Visual Computer 30 (5), pp. 507–517.
- (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proceedings of the International Conference on Machine Learning, pp. 448–456.
- (2002) Parameter values for the HDTV standards for production and international programme exchange. Note: ITU-R Rec. BT.709-5 [Online]. Available: http://www.itu.int/rec/R-REC-BT.709
- (2014) Parameter values for ultra-high definition television systems for production and international programme exchange. Note: document ITU-R Rec. BT.2020-1 [Online]. Available: http://www.itu.int/rec/R-REC-BT.2020
- (2018) Deep video super-resolution network using dynamic upsampling filters without explicit motion compensation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3224–3232.
- (2019) The relativistic discriminator: a key element missing from standard GAN. In International Conference on Learning Representations.
- (2018) A multi-purpose convolutional neural network for simultaneous super-resolution and high dynamic range image reconstruction. In Proceedings of the 14th Asian Conference on Computer Vision, pp. 379–394.
- (2019) Deep SR-ITM: joint learning of super-resolution and inverse tone-mapping for 4K UHD HDR applications. In Proceedings of the IEEE International Conference on Computer Vision.
- (2015) Adam: a method for stochastic optimization. In International Conference on Learning Representations.
- (2017) Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4681–4690.
- (2017) Enhanced deep residual networks for single image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 136–144.
- (2013) Rectifier nonlinearities improve neural network acoustic models. In Proceedings of the International Conference on Machine Learning.
- (2011) A calibrated visual metric for visibility and quality predictions in all luminance conditions. ACM Transactions on Graphics 30 (4), pp. 40.
- (2018) Spectral normalization for generative adversarial networks. In International Conference on Learning Representations.
- (2017) Video frame interpolation via adaptive convolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 670–679.
- (2017) Video frame interpolation via adaptive separable convolution. In Proceedings of the IEEE International Conference on Computer Vision, pp. 261–270.
- (2019) Enhanced pix2pix dehazing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8160–8168.
- (2014) High dynamic range electro-optical transfer function of mastering reference displays. Note: ST2084:2014
- (2018) ESRGAN: enhanced super-resolution generative adversarial networks. In Proceedings of the European Conference on Computer Vision Workshops.