ImagePairs: Realistic Super Resolution Dataset via Beam Splitter Camera Rig

04/18/2020
by Hamid Reza Vaezi Joze, et al.

Super resolution is the problem of recovering a high-resolution image from a single or multiple low-resolution images of the same scene. It is an ill-posed problem, since high-frequency visual details of the scene are completely lost in low-resolution images. To overcome this, many machine learning approaches have been proposed that train a model to recover the lost details in new scenes, including recent successful efforts that apply deep learning techniques to the super resolution problem. Data plays a significant role in machine learning, especially for deep learning approaches, which are data hungry. Therefore, the process of gathering data and its formation can be as vital to solving the problem as the machine learning technique used. Herein, we propose a new data acquisition technique for gathering a real image dataset that can be used as input for super resolution, noise cancellation and quality enhancement techniques. We use a beam splitter to capture the same scene with a low-resolution camera and a high-resolution camera. Since we also release the raw images, this large-scale dataset could be used for other tasks such as ISP generation. Unlike the current small-scale datasets used for these tasks, our proposed dataset includes 11,421 pairs of low-resolution and high-resolution images of diverse scenes. To our knowledge, this is the most complete dataset for super resolution, ISP and image quality enhancement. The benchmarking results show how the new dataset can be successfully used to significantly improve the quality of real-world image super resolution.


1 Introduction

Super resolution (SR) is the problem of recovering a high-resolution (HR) image from a single or multiple low-resolution (LR) images of the same scene. In this paper we focus on single-image SR, which uses a single LR image as input. It is an ill-posed problem, as the high-frequency visual details of the scene are lost in the LR image and must be recovered in the HR image. As a result, SR techniques have proven to be restrictive in practical applications [3]. SR can be used in many different applications such as satellite and aerial imaging [57], medical image processing [74], infrared imaging [77], enhancement of text, signs and license plates [4], and fingerprints [13].

Figure 1 shows an example of the single-image SR process, where the recovered HR image is 4 times larger than its LR input image. We show the results of different super resolution techniques in this figure. If a technique fails to recover adequate detail from the LR input, the output will be blurry and lack sharp edges.

Figure 1: Super Resolution Process on an image from ImagePairs dataset using bicubic, SRGAN [38] and EDSR [41] methods.

The SR problem has been studied comprehensively in the past [58, 47] and many machine learning techniques have been proposed to solve it. Examples include Bayesian [60], steering kernel regression [73], adaptive Wiener filter [26], neighbor embedding [22, 7], scene matching [56] and example-based [21] methods.

Deep learning techniques have proven successful in many areas of computer vision, including their application by leading image restoration researchers to solve SR [27, 37, 34, 66, 44, 43, 25, 16]. Because deep learning networks are, by nature, multi-layered feature extraction cascades [15], more data is required in order to train these complex models [50].

The input data itself plays a significant role in machine learning [5, 78], especially in deep learning approaches, which are data hungry. Hence, the process of gathering data and its formation may be equally as vital to solving the machine learning problem as the technique used. The purpose of SR is not merely to upscale or increase the number of pixels in an image, but to bring its quality as close as possible to an image captured at the target resolution. An example would be a cellphone with a front-facing camera and a rear-facing camera, where a 2X SR technique applied to the front-facing camera image quadruples its pixel count while the quality of the output image is expected to approach that of the high-quality rear-facing camera. An example is presented in Fig. 1, where the same scene was photographed with the low-resolution and high-resolution cameras under the same lighting conditions. The same part of the image was cropped to show the nature of the difference in quality between the images (ground truth vs. bicubic). This shows that maintaining the quality of the SR output requires SR, noise cancellation, image sharpening and even color correction to some extent, while state-of-the-art methods such as SRGAN [38] and EDSR [41] fail to do so, as seen in Fig. 1. We believe that the main reason for the failure of these methods is the lack of realistic training data, which is the focus of this paper.

A more complex version of this task is the Image Signal Processing (ISP) pipeline, with various stages including denoising [9, 72], demosaicing [39], gamma correction, white balancing [63, 61] and so on. The ISP pipeline has to be tuned by camera experts for a relatively long time before it can be used in commercial cameras. Domain knowledge such as optics, camera mechanics, electronics and the human perception of colors and contrast is necessary in this tuning process. Replacing this highly skilled and tedious tuning process with a deep neural network is a recent research direction in computational photography [51, 49, 40]. Current datasets [68, 2] widely used for training SR models increase the number of pixels without taking image quality into consideration. The new data acquisition technique proposed herein may be used for SR, noise cancellation and quality enhancement techniques. A dataset of 11,421 pairs of LR-HR images is presented, which was used to solve the SR problem. We use a beam splitter to capture the same scene with two cameras: LR and HR. Although the proposed device captures the same scene with two cameras, the two views still have different perspectives due to the different lenses, which we resolve with a local alignment technique. Since we also release the raw images, this large-scale dataset can be used for other tasks such as ISP generation. To our knowledge, this is the most complete dataset for SR, ISP and image quality enhancement, with far more real LR-HR images than existing datasets for SR and ISP tasks. This dataset is substantially larger than current SR datasets while including real LR-HR pairs, and much larger than current ISP datasets while covering diverse scenes. The benchmark results show how the new dataset can be successfully used to significantly improve the quality of real-world image super resolution.

Dataset | Size | Main purpose | HR resolution | LR generation | Raw
Set5 [7], Set14 [76], Urban100 [28] | 5/14/100 | SR | - | down-sample HR | No
The Berkeley segmentation [45] | 200 | Segmentation | - | down-sample HR | No
DIV2K [2] | 1000 | SR | - | down-sample HR | No
See-In-the-Dark (SID) [12] | 5094 | Low-Light | - | - | Yes
Samsung S7 [51] | 110 | ISP | - | - | Yes
RealSR [11] | 595 | SR | - | Real | Yes
ImagePairs (Proposed) | 11421 | SR | - | Real | Yes
Table 1: Comparison of the proposed dataset with current datasets used for these tasks.

2 Related Work

In recent years, the core of image SR methods has shifted towards machine learning, both in the learning techniques and in the datasets. Herein, a brief description is given of single-image SR methods and learning-based ISP methods as well as their common datasets. There are also multiple-image SR methods [60, 8, 19, 16], which are not the main focus of this paper. More comprehensive descriptions of SR methods may be found in [58, 47].

2.1 SR methods

Interpolation based:

The early SR methods are known as interpolation-based methods, where new pixels are estimated by interpolating the given pixels. This is the easiest way to increase the image resolution. Examples include nearest-neighbor, bilinear and bicubic interpolation, which use 1, 4 and 16 neighboring pixels respectively to compute the value of each new pixel. These methods are widely used for image resizing, as in the sketch below.
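A minimal sketch of these three interpolation-based baselines using OpenCV; the input path and scale factor are placeholders and not part of any dataset discussed here.

```python
# Nearest-neighbor, bilinear and bicubic upscaling with OpenCV.
# "lr.png" is a placeholder path for a low-resolution input image.
import cv2

lr = cv2.imread("lr.png")                                # low-resolution input
scale = 2
size = (lr.shape[1] * scale, lr.shape[0] * scale)        # (width, height)

nearest = cv2.resize(lr, size, interpolation=cv2.INTER_NEAREST)   # 1 neighbor
bilinear = cv2.resize(lr, size, interpolation=cv2.INTER_LINEAR)   # 4 neighbors
bicubic = cv2.resize(lr, size, interpolation=cv2.INTER_CUBIC)     # 16 neighbors
```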

Patch based: More recent SR methods rely on machine learning techniques to learn the relation between patches of the HR image and patches of the LR image. These methods are referred to as patch-based methods in some literature [68, 65] and exemplar-based in others [21, 23]. Unlike the first class of methods, these methods need training data in order to train their models. The training data are usually pairs of corresponding LR and HR images; training datasets are further discussed in subsection 2.3. Depending on the source of a training patch, patch-based SR methods may be categorized into two main categories: external or internal.

External methods: the external methods use a variety of learning algorithms to learn the LR-HR mapping from a large database of LR-HR image pairs. These include nearest neighbor [21], kernel ridge regression [35], sparse coding [69] and convolutional neural networks [16].

Internal methods: the internal methods, on the other hand, assume that patches of a natural image recur within and across scales of the same image [6]. Therefore, they attempt to search for an HR patch within the LR image at different scales. Glasner et al. [23] united classical and example-based SR by exploiting the patch recurrence within and across image scales. Freedman and Fattal [20] gained computational speed-up by showing that self-similar patches can often be found in limited spatial neighborhoods. Yang et al. [68] refined this notion further to seek self-similar patches in extremely localized neighborhoods, and performed first-order regression. Michaeli and Irani [46] used self-similarity to jointly recover the blur kernel and the HR image. Singh et al. [28] used the self-similarity principle for super-resolving noisy images.

With the success of convolutional neural networks, many CNN-based SR methods have been proposed which outperform the prior methods. As an example, SRGAN [38] used a generative adversarial network (GAN) [24] trained with a perceptual loss function consisting of an adversarial loss and a content loss. The residual dense network (RDN) [75] exploited the hierarchical features from all the convolutional layers. EDSR [41] improved performance by removing unnecessary modules from conventional residual networks. WDSR [70] introduced a linear low-rank convolution in order to further widen activation without computational overhead.
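To make the flavor of these CNN-based methods concrete, the following PyTorch sketch builds a toy 2x SR network out of EDSR-style residual blocks (two convolutions, no batch normalization) with pixel-shuffle upsampling. The channel and block counts are illustrative assumptions, not the published EDSR configuration.

```python
import torch.nn as nn

class ResBlock(nn.Module):
    """EDSR-style residual block: two 3x3 convolutions, no batch norm."""
    def __init__(self, channels=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)

class TinySR(nn.Module):
    """Toy 2x SR network: head conv, residual blocks, pixel-shuffle upsampling."""
    def __init__(self, channels=64, num_blocks=4, scale=2):
        super().__init__()
        self.head = nn.Conv2d(3, channels, 3, padding=1)
        self.body = nn.Sequential(*[ResBlock(channels) for _ in range(num_blocks)])
        self.tail = nn.Sequential(
            nn.Conv2d(channels, 3 * scale * scale, 3, padding=1),
            nn.PixelShuffle(scale),        # rearranges channels into a 2x larger image
        )

    def forward(self, lr):
        feat = self.head(lr)
        return self.tail(feat + self.body(feat))
```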

2.2 ISP Methods

The Image Signal Processing (ISP) pipeline converts a raw sensor image into an enhanced digital image. It consists of various stages including denoising [9, 72], demosaicing [39], gamma correction, white balancing [62, 17] and so on. Currently, this pipeline has to be tuned by camera experts over a long period of time for each new camera. A few recent methods replace the expert-tuned ISP with a fully automatic approach by training an end-to-end deep neural network [51, 49, 40]. Schwartz et al. [51] released a dataset, named the Samsung S7 dataset, containing RAW and RGB image pairs with both short and medium exposures; they designed a network that first processes the image locally and then globally. Ratnasingam [49] replicates the steps of a full ISP with a group of sub-networks and achieves state-of-the-art results by training and testing on a set of synthetic images. Liang et al. [40] used 4 sequential U-Nets to solve this problem, and claimed that the same network can be used for enhancing extreme low-light images.
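As a rough illustration of what such a pipeline computes, the sketch below implements a toy ISP for a 16-bit RGGB Bayer frame covering only demosaicing, white balance and gamma correction (denoising omitted). The gains, gamma value and Bayer layout are illustrative assumptions, not the settings of any camera discussed in this paper.

```python
import numpy as np
import cv2

def simple_isp(bayer, r_gain=2.0, b_gain=1.5, gamma=2.2):
    """Toy ISP: demosaic a 16-bit RGGB Bayer frame, apply white balance and gamma.
    The Bayer code below assumes an RGGB layout; it depends on the sensor."""
    rgb = cv2.cvtColor(bayer, cv2.COLOR_BayerRG2RGB).astype(np.float32) / 65535.0
    rgb[..., 0] *= r_gain                               # white balance: scale red
    rgb[..., 2] *= b_gain                               # and blue channels
    rgb = np.clip(rgb, 0.0, 1.0) ** (1.0 / gamma)       # gamma correction
    return (rgb * 255.0).astype(np.uint8)
```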

2.3 SR and ISP Datasets

An SR dataset includes pairs of HR and LR images. Most existing datasets generate the LR image from the corresponding HR image by sub-sampling it with various settings. Here the HR images are also called ground truths, as the final goal of SR methods is to recover them from the LR images. Therefore, an SR dataset consists of a set of HR images (ground truths) and the settings used to generate LR images from them. Here is a list of common SR datasets:

  1. The Berkeley segmentation dataset [45] is one of the first datasets used for single-image SR [55, 20, 23]. It includes 200 professional photographic-style images with diverse content.

  2. Yang et al. [68] proposed a benchmark for single-image SR which includes the Berkeley segmentation dataset as well as a second set containing 29 undistorted high-quality images of various resolutions from the LIVE1 dataset [53]. Huang et al. [27] added 100 urban high-resolution images collected from Flickr, with a variety of real-world structures, to this benchmark in order to focus more on man-made objects.

  3. The DIV2K dataset [2] introduced a new challenge for single-image SR. This dataset includes 1000 images of diverse content with a train/validation/test split of 800/100/100.

  4. The RealSR dataset [11] captured images of the same scene using fixed DSLR cameras with different focal lengths. Changing the focal length captures finer details of the scene, so HR and LR image pairs at different scales can be collected with a registration algorithm. This dataset includes 595 LR/HR pairs of indoor and outdoor scenes.

There are also a few standard benchmark datasets, Set5 [7], Set14 [76], and Urban100 [28], commonly used for performance comparison; these include 5, 14 and 100 images, respectively. Apart from RealSR [11], none of these datasets include LR images, so the LR image must be generated synthetically from the corresponding HR image. There are several ways to generate LR test images from the HR images (the ground truth) [52, 59, 54], such that the generated LR test images may be numerically different. One common way is to blur the HR image with a Gaussian kernel, down-sample it by the scale factor and add a noise term [32, 35, 68]; the parameters are the scale factor, the Gaussian kernel width and the noise level. There are other datasets dedicated to image enhancement such as MIT5K [10] and DPED [29]. MIT5K [10] includes 5,000 photographs taken with SLR cameras, each retouched by professionals to achieve visually pleasing renditions. DPED [29, 31] consists of photos taken synchronously in the wild by three smartphones and one DSLR camera. The smartphone images were aligned with the DSLR images to extract patches for CNN training, including 139K, 160K and 162K pairs for the three settings. This dataset was used in a challenge on image enhancement [31] as well as a challenge on RAW to RGB mapping [30].
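A minimal sketch of this common degradation model follows; the kernel width, scale factor and noise level are illustrative, and real benchmarks use a range of such settings.

```python
import numpy as np
import cv2

def synth_lr(hr, scale=2, sigma=1.2, noise_std=2.0):
    """Generate a synthetic LR image: Gaussian blur, down-sample, additive noise."""
    blurred = cv2.GaussianBlur(hr, (0, 0), sigma)                 # blur with Gaussian kernel
    lr = cv2.resize(blurred, (hr.shape[1] // scale, hr.shape[0] // scale),
                    interpolation=cv2.INTER_AREA)                 # down-sample by scale factor
    noisy = lr.astype(np.float32) + np.random.normal(0.0, noise_std, lr.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)                # add noise and re-quantize
```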

There are not many publicly available ISP datasets, which require the raw image as well as the image generated from it. Here we describe two datasets that have been used for ISP.

  1. See-In-the-Dark (SID): proposed by Chen et al. [12], is a RAW-RGB dataset captured in extreme low light where each short-exposure raw image is paired with its long-exposure RGB counterpart for training and testing [71]. Images in this dataset were captured using two cameras, a Sony camera and a Fujifilm X-T2; each subset contains about 2500 images, a portion of which is used as the test set. The raw format of the Sony subset is the traditional 4-channel Bayer pattern (see the packing sketch after this list), while that of the Fuji subset is the X-Trans format with 9 channels. Besides the raw and RGB data, the exposure times are provided alongside.

  2. Samsung S7: captured by Schwartz et al. [51], contains 110 different RAW-RGB pairs, with a train/test/validation split of 90/10/10. Unlike the SID dataset, this one does not provide related camera properties such as the exposure time associated with the image pairs. The raw format here is also the traditional 4-channel Bayer pattern.
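For reference, raw-to-RGB networks such as those trained on these datasets typically pack the Bayer mosaic into a 4-channel, half-resolution tensor before feeding it to the network. The sketch below shows one way to do this, assuming an RGGB layout; the actual channel order depends on the sensor.

```python
import numpy as np

def pack_bayer(raw):
    """Pack a 2x2 Bayer mosaic (H, W) into a 4-channel half-resolution array,
    one channel per Bayer position (order assumed RGGB here)."""
    h, w = raw.shape
    return np.stack([raw[0:h:2, 0:w:2],    # R
                     raw[0:h:2, 1:w:2],    # G1
                     raw[1:h:2, 0:w:2],    # G2
                     raw[1:h:2, 1:w:2]],   # B
                    axis=0)
```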

Current SR methods as well as learning-based ISP methods are mainly focused on their learning process, as mentioned before. Different machine learning techniques have been applied to these problems, and recent efforts have involved training different deep neural network models. Compared to datasets for popular computer vision tasks such as image classification [14], detection [1, 42], segmentation [42], video classification [33] and sign language recognition [64], there is an obvious lack of large realistic datasets for SR and ISP tasks, despite the potential of neural network techniques to produce significant results. Table 1 shows the datasets currently used for SR and ISP tasks and their specifications compared to our proposed dataset, ImagePairs. Our proposed dataset is not only at least 10 times larger than other SR datasets and 2 times larger than other ISP datasets, but also has real LR-HR images and includes raw images which could be used for other tasks.

3 Data Acquisition

3.1 Hardware Design

The high-resolution camera used a 20.1 MP, 1/2.4” format CMOS image sensor and an F/1.94 lens. The camera also featured bidirectional auto-focus (open-loop VCM) and 2-axis optical image stabilization (closed-loop VCM) capability.

The lower-resolution fixed-focus camera had a similar FOV with approximately half the angular pixel resolution. It featured a 5 MP, 1/4” format CMOS image sensor and an F/2.4 lens. Table 2 shows the specifications of these cameras.

Camera | Low-resolution | High-resolution
Image sensor format | 1/4” | 1/2.4”
Pixel size | - | -
Resolution | 5 MP | 20.1 MP
FOV (H, V) | - | -
Lens focal length | - | -
Focus | fixed-focus | auto-focus
Table 2: Camera specifications.
Figure 2: Opto-mechanical layout of dual camera combiner, showing high resolution camera (transmission path) and low resolution camera (reflective path) optically aligned at nodal points and with overlapping FOV pointing angle.
Figure 3: The data acquisition device installed on a tripod, while the trolley is used for outdoor maneuvering.
Figure 4: Two-camera setup.

In order to simultaneously capture frames on both cameras with a common perspective, the FOVs of both cameras are combined using a Thorlabs BS013 50/50 non-polarizing beam-splitter cube. The cameras are then aligned such that the pointing angles of the optical axes coincide at far distance and the entrance pupils (nodes) of the two cameras coincide at near distance. The high-resolution camera, placed behind the combiner cube in the transmission optical path, is mounted on a Thorlabs K6XS 6-axis stage so that the lateral position of its entrance pupil is centered with the cube and its axial position is in close proximity to it. The tip and tilt of the camera's image-center pointing angle is aligned with a distant target, while rotation about the camera's optical axis is aligned by matching pixel row(s) with a horizontal line target. Fig. 2 illustrates the opto-mechanical layout of the dual-camera combiner. The low-resolution camera is placed behind the combiner cube in the lateral folded optical path and is also mounted on a 6-axis stage. It is then aligned such that its entrance pupil optically overlaps that of the high-resolution camera. The tip/tilt pointing angle as well as camera rotation about the optical axis may be adjusted so as to achieve similar scene capture. In order to refine the overlap toward pixel accuracy, a live capture tool displays the absolute difference of frame image content between the cameras, so that center pointing and rotation leveling may be adjusted with high sensitivity. Any spatial and angular offsets may be substantially nulled before mechanically locking the cameras in position. The unused combiner optical path is painted with carbon black to limit image contrast loss due to scatter. The assembled two-camera setup is shown in Figure 4.

The proposed device can capture the same scene with two different cameras. The two cameras still have a slight difference in perspective due to the different lenses, which is resolved by the local alignment technique described in Section 4. Furthermore, each camera sensor receives only half of the light because of the 50/50 split, which degrades image quality, mainly for the low-resolution camera.

3.2 Software Design

Data-capture software was developed to connect to both cameras and keep them synchronized with each other. The software can capture photos from both cameras at the same time and adjust camera parameters such as gain, exposure and, for the HR camera, lens position. The raw data was stored for each camera, allowing an arbitrary ISP to be applied later. For each camera, all the metadata was stored in a file, including the image category selected by the photographer. Figure 3 shows the data acquisition device installed on a tripod, while the trolley is used for outdoor maneuvering.

4 ImagePairs Dataset

The dataset is called ImagePairs as it includes pairs of images of the exact same scene taken with two different cameras. Images are either LR or HR, where the HR image is twice as large in each dimension as the corresponding LR image; all LR images share one fixed resolution and all HR images share the corresponding doubled resolution. Unlike other real-world datasets, we do not use zoom levels or scaling factors to increase the number of pairs, so each pair corresponds to a separate scene. This means that we captured 11,421 distinct scenes with the device, generating 11,421 image pairs.

For each image pair, metadata such as gain, exposure, lens position and scene category was stored. Each image pair was assigned to a category which may later be used for training purposes. These categories include Document, Board, Office, Face, Car, Tree, Sky, Object, Night and Outdoor. The pairs are divided into train and test sets of 8,591 and 2,830 image pairs, respectively. The two cameras have a difference in perspective due to the different lenses. Therefore, in order to generate pairs that correspond to each other at the pixel level, the following steps were applied: (1) ISP, (2) image undistortion, (3) pair alignment, (4) margin cropping. Figure 6 illustrates diverse samples from the proposed dataset after the final alignment. In order to show the accuracy of the pixel-by-pixel alignment, each sample image in Fig. 6 is split in half, showing LR on the left and HR on the right.

ISP: The original images were stored in raw format. The first step was to convert the raw data to color images using a full, powerful ISP stack for both LR and HR. Since we have access to the raw data, this ISP can be replaced with a different one, or with a simple linear one to ignore the non-linearity in the pipeline.

Image Undistortion: The cameras introduce a significant amount of distortion into the images. The two major distortions are radial distortion and tangential distortion. With radial distortion, straight lines appear curved, while tangential distortion arises because the lens is not aligned perfectly parallel to the imaging plane. To overcome these distortions in both LR and HR images, we calibrated both cameras by capturing several checkerboard images. These images were later used to solve a simple model for radial and tangential distortions [48], as in the sketch below. Figure 5 shows the undistorted image for both LR and HR.

Figure 5: Undistorted image of HR at right and LR at left.
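A minimal sketch of this calibration and undistortion step with OpenCV; the checkerboard size, folder and file names are hypothetical placeholders.

```python
import glob
import cv2
import numpy as np

# Checkerboard inner-corner count; a 9x6 board is assumed here for illustration.
PATTERN = (9, 6)
objp = np.zeros((PATTERN[0] * PATTERN[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:PATTERN[0], 0:PATTERN[1]].T.reshape(-1, 2)

obj_pts, img_pts = [], []
for path in glob.glob("checkerboards/*.png"):      # placeholder folder of calibration shots
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, PATTERN)
    if found:
        obj_pts.append(objp)
        img_pts.append(corners)

# Solve for intrinsics and radial/tangential distortion coefficients.
_, K, dist, _, _ = cv2.calibrateCamera(obj_pts, img_pts, gray.shape[::-1], None, None)

img = cv2.imread("lr_0001.png")                    # placeholder image to undistort
undistorted = cv2.undistort(img, K, dist)
```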

Alignment: We use two steps to align the LR and HR images. First, we globally match the two images using an image registration technique, specifically a homography transformation. Although the HR and LR images are now globally aligned, they may not be aligned pixel by pixel due to geometric constraints. As the second step, we divide the LR image into a 10 by 10 grid and perform a local search to find the best match on the HR image for each grid cell. Lastly, we use the matched grid positions on the HR image to warp the LR image so that the LR and HR images are both globally and locally matched, as sketched below.
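The sketch below outlines the two alignment stages in OpenCV on grayscale uint8 inputs: a global homography estimated from feature matches, followed by a per-cell local search over a 10 by 10 grid. The feature detector, search radius and matching score are illustrative choices, not necessarily those used to build the dataset.

```python
import cv2
import numpy as np

def global_align(lr_up, hr):
    """Globally register an upsampled LR image to the HR image with a homography
    estimated from ORB feature matches (grayscale uint8 inputs assumed)."""
    orb = cv2.ORB_create(4000)
    k1, d1 = orb.detectAndCompute(lr_up, None)
    k2, d2 = orb.detectAndCompute(hr, None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(d1, d2)
    src = np.float32([k1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([k2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
    return cv2.warpPerspective(lr_up, H, (hr.shape[1], hr.shape[0]))

def local_offsets(lr_warp, hr, grid=10, search=8):
    """For each cell of a grid x grid partition, find the translation that best
    matches the HR image by normalized cross-correlation; these per-cell offsets
    would then drive a local warp of the LR image."""
    h, w = hr.shape[:2]
    ch, cw = h // grid, w // grid
    offsets = np.zeros((grid, grid, 2), np.int32)
    for i in range(grid):
        for j in range(grid):
            cell = lr_warp[i * ch:(i + 1) * ch, j * cw:(j + 1) * cw]
            y0, x0 = max(i * ch - search, 0), max(j * cw - search, 0)
            region = hr[y0:(i + 1) * ch + search, x0:(j + 1) * cw + search]
            res = cv2.matchTemplate(region, cell, cv2.TM_CCOEFF_NORMED)
            _, _, _, loc = cv2.minMaxLoc(res)
            offsets[i, j] = (x0 + loc[0] - j * cw, y0 + loc[1] - i * ch)
    return offsets
```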

Margin Crop: Although the images were aligned globally and locally, the borders were not as well aligned as expected, possibly due to differences in the camera specifications. Therefore, a border was removed from each image, slightly reducing the resolution of both the LR and HR images.

Figure 6: Sample images from the ImagePairs dataset. Each image is split in half to show LR on the left and HR on the right.

For each image (LR or HR) we also stored metadata, namely analogue gain, digital gain, exposure time, lens position and scene category. The scene category, which is selected by the photographer, includes Office, Document, Tree, Outside, Toy, Sky, Sign, Art, Building, Night, etc. Figure 7 illustrates the frequency of each category for the ImagePairs train/test sets.

Figure 7: Frequency of ImagePairs train/test categories.

At this point, ImagePairs constitutes a large dataset of HR-LR images, allowing the easy application of patch-based algorithms. Random patches can be picked from the LR image along with the corresponding HR patches, as in the sketch below. Since the correspondence is pixel by pixel, there is no need to search for similar patches at different scales. Additionally, the ground truth (HR) has 4 times more pixels and is sharper and less noisy than the LR images, hence the increased image quality.
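Because the pairs are pixel-aligned and the HR image is exactly twice the LR size, sampling a training patch pair reduces to cropping the same region at two scales; the patch size below is an illustrative choice.

```python
import random
import cv2

def sample_patch_pair(lr_path, hr_path, lr_patch=64, scale=2):
    """Sample one aligned LR/HR patch pair; the HR crop covers the same scene
    region at twice the resolution."""
    lr, hr = cv2.imread(lr_path), cv2.imread(hr_path)
    y = random.randint(0, lr.shape[0] - lr_patch)
    x = random.randint(0, lr.shape[1] - lr_patch)
    lr_p = lr[y:y + lr_patch, x:x + lr_patch]
    hr_p = hr[y * scale:(y + lr_patch) * scale, x * scale:(x + lr_patch) * scale]
    return lr_p, hr_p
```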

Bicubic RDN [75] SRGAN [38] EDSR [41] WDSR [70] Proposed
Figure 8: Qualitative comparison of state-of-the-art super resolution methods trained on the DIV2K [2] dataset against a simple network trained on the ImagePairs dataset. The first two images are from the ImagePairs test set and the next two are external images.

5 Experimental Results

5.1 Realistic Super Resolution

Before running a benchmark of state-of-the-art SR methods, we need to see how they perform when trained on current SR datasets. As mentioned before, a real LR image usually has many other artifacts, as it is captured with a weaker camera. We trained a basic generative adversarial network (GAN) model, with a generator of 10 convolution layers and a U-Net discriminator with 10 convolution/deconvolution layers, on the proposed dataset. The sole purpose of this experiment is to see whether current state-of-the-art methods trained on synthetic images can outperform our simple method trained on real images. Figure 8 shows the performance of this method compared to the super resolution methods SRGAN [38], EDSR [41], WDSR [70] and RDN [75] trained on the DIV2K dataset [2]. The first two images are from the ImagePairs test set and the next two are real-world LR images from the internet. As expected, these methods only increase the pixel count and do not address image artifacts such as noise and color temperature. Our method trained on the ImagePairs dataset does well both on test images from the dataset and on real-world LR images.
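For orientation, the following PyTorch sketch shows a generator of roughly the size described (ten convolution layers with a final pixel-shuffle for 2x upsampling). It is an assumption-laden stand-in, not the authors' exact architecture, and the U-Net discriminator and adversarial training loop are omitted.

```python
import torch.nn as nn

def conv(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(inplace=True))

class SimpleGenerator(nn.Module):
    """Ten-convolution 2x SR generator, roughly the size described in the text."""
    def __init__(self, ch=64):
        super().__init__()
        self.features = nn.Sequential(
            conv(3, ch), *[conv(ch, ch) for _ in range(8)]    # 9 convolution layers
        )
        self.upsample = nn.Sequential(
            nn.Conv2d(ch, 3 * 4, 3, padding=1),               # 10th convolution layer
            nn.PixelShuffle(2),                               # 2x upsampling
        )

    def forward(self, lr):
        return self.upsample(self.features(lr))
```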

5.2 Super Resolution Benchmark

We trained three 2X super resolution methods, SRGAN [38], EDSR [41] and WDSR [70], on the ImagePairs train set using the model implementations from [36, 18]. All SR methods were trained on LR-HR RGB images; raw images were not used as input. We used the same HR patch size and batch size for all training, and all methods were trained for the same number of iterations. For evaluation, we ran the trained models on the centered quarter crop of the ImagePairs test set images. Table 3 reports the peak signal-to-noise ratio (PSNR) and the structural similarity (SSIM) [67] for the models trained on ImagePairs as well as for models trained on the DIV2K dataset with similar parameters (a minimal evaluation sketch follows Table 3). As discussed before, the PSNR and SSIM of the methods trained on DIV2K are comparable to the bicubic method. In some cases they perform worse than bicubic, since some SR methods can amplify noise. On the other hand, when we trained the same models on the proposed ImagePairs dataset, all methods improved their PSNR. EDSR [41] and WDSR [70] do a good job of noise cancellation and outperform bicubic by at least 2 dB in PSNR and 0.05 in SSIM, while SRGAN [38], which is not optimized for PSNR, mainly focuses on color correction and not so much on noise cancellation. Figure 9 illustrates a qualitative comparison of these methods trained on the ImagePairs dataset. Needless to say, these models perform much better on noise cancellation, color correction and super resolution compared to similar models trained on DIV2K.

Model | Train data | PSNR (dB) | SSIM
Bicubic | - | 21.451 | 0.712
SRGAN [38] | DIV2K | 21.906 | 0.699
WDSR [70] | DIV2K | 21.299 | 0.697
EDSR [41] | DIV2K | 21.298 | 0.697
SRGAN [38] | ImagePairs | 22.161 | 0.673
WDSR [70] | ImagePairs | 23.805 | 0.767
EDSR [41] | ImagePairs | 23.845 | 0.764
Table 3: Comparison of state-of-the-art single image super resolution algorithms on the ImagePairs dataset.
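A minimal sketch of how PSNR and SSIM, as reported in Table 3, can be computed for one prediction/ground-truth pair, assuming scikit-image is available; the file names are placeholders.

```python
import cv2
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

pred = cv2.imread("pred.png")       # model output (placeholder path)
gt = cv2.imread("gt.png")           # aligned HR ground truth (placeholder path)

psnr = peak_signal_noise_ratio(gt, pred, data_range=255)
ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=255)
print(f"PSNR: {psnr:.3f} dB  SSIM: {ssim:.3f}")
```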
HR Bicubic WDSR [70] EDSR [41] SRGAN [38]
Figure 9: Qualitative comparison of state-of-the-art super resolution methods trained on the proposed dataset.
Raw DeepISP [51] SIDnet [12] GuidanceNet [40] Ground Truth
Figure 10: Qualitative comparisons of state-of-the-art ISP methods trained on ImagePairs dataset.

5.3 ISP Benchmark

For the ISP task, we consider the LR images and their corresponding raw images from the ImagePairs train/test sets, as the raw HR images are too large. We trained DeepISP [51], SID net [12] and GuidanceNet [40] on the ImagePairs training set, which contains raw and LR images. All networks read raw images together with 4 associated camera properties: analogue gain, digital gain, exposure time and lens position. Here, the exposure time is in microseconds, and the lens position is the distance between the camera and the scene in centimeters. GuidanceNet [40] is designed to use camera properties in its bottleneck layers, but we had to modify DeepISP [51] and SID net [12] to do so. For DeepISP [51], we tile and concatenate these features with the output of its local sub-network and then feed the result to the global sub-network for estimating the quadratic transformation coefficients. For SID net [12], we tile and concatenate these features with the input image; a sketch of this tiling follows Table 4. Table 4 reports the evaluation of these three models on the ImagePairs test set in terms of PSNR and SSIM. This shows that GuidanceNet [40], which properly uses the camera properties, outperforms the others. Figure 10 illustrates examples from each of these models.

Model | PSNR (dB) | SSIM
DeepISP [51] | 20.30 | 0.89
SIDnet [12] | 23.08 | 0.90
GuidanceNet [40] | 29.22 | 0.96
Table 4: Comparison of ISP algorithms on the ImagePairs dataset.
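The tiling described above for the modified DeepISP and SID networks can be sketched as follows in PyTorch; the helper name and tensor shapes are hypothetical, chosen only to illustrate broadcasting the four scalar camera properties over the spatial dimensions.

```python
import torch

def tile_and_concat(features, props):
    """Tile per-image camera properties (analogue gain, digital gain, exposure
    time, lens position) over the spatial size of a feature map and concatenate
    them as extra channels. `props` has shape (B, 4)."""
    b, _, h, w = features.shape
    tiled = props.view(b, -1, 1, 1).expand(-1, -1, h, w)
    return torch.cat([features, tiled], dim=1)

# Hypothetical example: a (2, 64, 32, 32) feature map plus 4 scalar properties
# per image yields a (2, 68, 32, 32) tensor.
feats = torch.randn(2, 64, 32, 32)
props = torch.tensor([[1.0, 2.0, 33.3, 50.0],
                      [2.0, 1.0, 16.6, 80.0]])
out = tile_and_concat(feats, props)
print(out.shape)   # torch.Size([2, 68, 32, 32])
```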

6 Conclusion

In this paper we proposed a new data acquisition technique which can be used to gather input for SR, noise cancellation and quality enhancement techniques. We used a beam splitter to capture the same scene with a low-resolution camera and a high-resolution camera. Unlike the current small-scale datasets used for these tasks, our proposed dataset includes 11,421 pairs of low-resolution and high-resolution images of diverse scenes. Since we also release the raw images, this large-scale dataset can be used for other tasks such as ISP generation. We trained state-of-the-art methods for SR and ISP tasks on this dataset and showed how the new dataset can be successfully used to significantly improve the quality of real-world image super resolution.

References

  • [1] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman (2015) The pascal visual object classes challenge: a retrospective. International Journal of Computer Vision 111 (1), pp. 98–136. Cited by: §2.3.
  • [2] E. Agustsson and R. Timofte (2017-07) NTIRE 2017 challenge on single image super-resolution: dataset and study. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Cited by: Table 1, §1, item 3, Figure 8, §5.1.
  • [3] S. Baker and T. Kanade (2002) Limits on super-resolution and how to break them. IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (9), pp. 1167–1183. Cited by: §1.
  • [4] J. Banerjee and C. Jawahar (2008) Super-resolution of text images using edge-directed tangent field. In 2008 The Eighth IAPR International Workshop on Document Analysis Systems, pp. 76–83. Cited by: §1.
  • [5] M. Banko and E. Brill (2001) Scaling to very very large corpora for natural language disambiguation. In Proceedings of the 39th annual meeting on association for computational linguistics, pp. 26–33. Cited by: §1.
  • [6] M. F. Barnsley (2014) Fractals everywhere. Academic press. Cited by: §2.1.
  • [7] M. Bevilacqua, A. Roumy, C. Guillemot, and M. L. Alberi-Morel (2012) Low-complexity single-image super-resolution based on nonnegative neighbor embedding. Cited by: Table 1, §1, §2.3.
  • [8] S. Borman and R. L. Stevenson (1998) Super-resolution from image sequences-a review. In Circuits and Systems, 1998. Proceedings. 1998 Midwest Symposium on, pp. 374–378. Cited by: §2.
  • [9] A. Buades, B. Coll, and J. Morel (2005) A non-local algorithm for image denoising. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), Vol. 2, pp. 60–65. Cited by: §1, §2.2.
  • [10] V. Bychkovsky, S. Paris, E. Chan, and F. Durand (2011) Learning photographic global tonal adjustment with a database of input / output image pairs. In The Twenty-Fourth IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.3.
  • [11] J. Cai, H. Zeng, H. Yong, Z. Cao, and L. Zhang (2019) Toward real-world single image super-resolution: a new benchmark and a new model. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3086–3095. Cited by: Table 1, item 4, §2.3.
  • [12] C. Chen, Q. Chen, J. Xu, and V. Koltun (2018-06) Learning to see in the dark. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Table 1, item 1, Figure 10, §5.3, Table 4.
  • [13] J. Cui, Y. Wang, J. Huang, T. Tan, and Z. Sun (2004) An iris image synthesis method based on pca and super-resolution. In Pattern Recognition, 2004. ICPR 2004. Proceedings of the 17th International Conference on, Vol. 4, pp. 471–474. Cited by: §1.
  • [14] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 248–255. Cited by: §2.3.
  • [15] L. Deng, D. Yu, et al. (2014) Deep learning: methods and applications. Foundations and Trends® in Signal Processing 7 (3–4), pp. 197–387. Cited by: §1.
  • [16] C. Dong, C. C. Loy, K. He, and X. Tang (2014) Learning a deep convolutional network for image super-resolution. In European Conference on Computer Vision, pp. 184–199. Cited by: §1, §2.1, §2.
  • [17] M. S. Drew, H. R. V. Joze, and G. D. Finlayson (2014) The zeta-image, illuminant estimation, and specularity manipulation. Computer Vision and Image Understanding 127, pp. 1–13. Cited by: §2.2.
  • [18] F. C. et al. (2018) ISR. Note: https://github.com/idealo/image-super-resolution Cited by: §5.2.
  • [19] S. Farsiu, M. D. Robinson, M. Elad, and P. Milanfar (2004) Fast and robust multiframe super resolution. IEEE transactions on image processing 13 (10), pp. 1327–1344. Cited by: §2.
  • [20] G. Freedman and R. Fattal (2011) Image and video upscaling from local self-examples. ACM Transactions on Graphics (TOG) 30 (2), pp. 12. Cited by: item 1, §2.1.
  • [21] W. T. Freeman, T. R. Jones, and E. C. Pasztor (2002) Example-based super-resolution. IEEE Computer graphics and Applications 22 (2), pp. 56–65. Cited by: §1, §2.1, §2.1.
  • [22] X. Gao, K. Zhang, D. Tao, and X. Li (2012) Image super-resolution with sparse neighbor embedding. IEEE Transactions on Image Processing 21 (7), pp. 3194–3205. Cited by: §1.
  • [23] D. Glasner, S. Bagon, and M. Irani (2009) Super-resolution from a single image. In Computer Vision, 2009 IEEE 12th International Conference on, pp. 349–356. Cited by: item 1, §2.1, §2.1.
  • [24] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §2.1.
  • [25] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (Eds.), pp. 2672–2680. External Links: Link Cited by: §1.
  • [26] R. Hardie (2007) A fast image super-resolution algorithm using an adaptive wiener filter. IEEE Transactions on Image Processing 16 (12), pp. 2953–2964. Cited by: §1.
  • [27] J. Huang, A. Singh, and N. Ahuja (2015) Single image super-resolution from transformed self-exemplars. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5197–5206. Cited by: §1, item 2.
  • [28] J. Huang, A. Singh, and N. Ahuja (2015) Single image super-resolution from transformed self-exemplars. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5197–5206. Cited by: Table 1, §2.1, §2.3.
  • [29] A. Ignatov, N. Kobyshev, R. Timofte, K. Vanhoey, and L. Van Gool (2017) DSLR-quality photos on mobile devices with deep convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3277–3285. Cited by: §2.3.
  • [30] A. Ignatov, R. Timofte, S. Ko, S. Kim, K. Uhm, S. Ji, S. Cho, J. Hong, K. Mei, J. Li, et al. (2019) Aim 2019 challenge on raw to rgb mapping: methods and results. In IEEE International Conference on Computer Vision (ICCV) Workshops, Vol. 5, pp. 7. Cited by: §2.3.
  • [31] A. Ignatov and R. Timofte (2019) Ntire 2019 challenge on image enhancement: methods and results. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 0–0. Cited by: §2.3.
  • [32] M. Irani and S. Peleg (1991) Improving resolution by image registration. CVGIP: Graphical models and image processing 53 (3), pp. 231–239. Cited by: §2.3.
  • [33] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, et al. (2017) The kinetics human action video dataset. arXiv preprint arXiv:1705.06950. Cited by: §2.3.
  • [34] J. Kim, J. K. Lee, and K. M. Lee (2016-06) Accurate image super-resolution using very deep convolutional networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR Oral), Cited by: §1.
  • [35] K. I. Kim and Y. Kwon (2010) Single-image super-resolution using sparse regression and natural image prior. IEEE transactions on pattern analysis and machine intelligence 32 (6), pp. 1127–1133. Cited by: §2.1, §2.3.
  • [36] M. Krasser (2018) Single image super-resolution with edsr, wdsr and srgan. Note: https://github.com/krasserm/super-resolution Cited by: §5.2.
  • [37] C. Ledig, L. Theis, F. Huszar, J. Caballero, A. P. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi (2016) Photo-realistic single image super-resolution using a generative adversarial network. CoRR abs/1609.04802. External Links: Link Cited by: §1.
  • [38] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al. (2017) Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4681–4690. Cited by: Figure 1, §1, §2.1, Figure 8, Figure 9, §5.1, §5.2, Table 3.
  • [39] X. Li, B. Gunturk, and L. Zhang (2008) Image demosaicing: a systematic survey. In Visual Communications and Image Processing 2008, Vol. 6822, pp. 68221J. Cited by: §1, §2.2.
  • [40] L. Liang, I. Zharkov, F. Amjadi, H. R. Vaezi Joze, and V. Pradeep (2020) Guidance network with staged learning for computational photography. In arXiv, Cited by: §1, §2.2, Figure 10, §5.3, Table 4.
  • [41] B. Lim, S. Son, H. Kim, S. Nah, and K. M. Lee (2017-07) Enhanced deep residual networks for single image super-resolution. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Cited by: Figure 1, §1, §2.1, Figure 8, Figure 9, §5.1, §5.2, Table 3.
  • [42] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: §2.3.
  • [43] D. Liu, Z. Wang, N. Nasrabadi, and T. Huang (2016) Learning a mixture of deep networks for single image super-resolution. In Asian Conference on Computer Vision, Cited by: §1.
  • [44] D. Liu, Z. Wang, B. Wen, J. Yang, W. Han, and T. S. Huang (2016) Robust single image super-resolution via deep networks with sparse prior. IEEE Transactions on Image Processing 25 (7), pp. 3194–3207. Cited by: §1.
  • [45] D. Martin, C. Fowlkes, D. Tal, and J. Malik (2001) A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Computer Vision, 2001. ICCV 2001. Proceedings. Eighth IEEE International Conference on, Vol. 2, pp. 416–423. Cited by: Table 1, item 1.
  • [46] T. Michaeli and M. Irani (2013) Nonparametric blind super-resolution. In Proceedings of the IEEE International Conference on Computer Vision, pp. 945–952. Cited by: §2.1.
  • [47] K. Nasrollahi and T. B. Moeslund (2014) Super-resolution: a comprehensive survey. Machine vision and applications 25 (6), pp. 1423–1468. Cited by: §1, §2.
  • [48] OpenCV Camera calibration with opencv. Note: http://docs.opencv.org/2.4/doc/tutorials/calib3d/camera_calibration/camera_calibration.html Cited by: §4.
  • [49] S. Ratnasingam (2019-08) Deep camera: a fully convolutional neural network for image signal processing. In International Conference on Computer Vision Workshops (ICCVW), Cited by: §1, §2.2.
  • [50] J. Schmidhuber (2015) Deep learning in neural networks: an overview. Neural networks 61, pp. 85–117. Cited by: §1.
  • [51] E. Schwartz, R. Giryes, and A. M. Bronstein (2019) DeepISP: toward learning an end-to-end image processing pipeline. IEEE Transactions on Image Processing 28, pp. 912–923. Cited by: Table 1, §1, item 2, §2.2, Figure 10, §5.3, Table 4.
  • [52] Q. Shan, Z. Li, J. Jia, and C. Tang (2008) Fast image/video upsampling. ACM Transactions on Graphics (TOG) 27 (5), pp. 153. Cited by: §2.3.
  • [53] H. R. Sheikh, M. F. Sabir, and A. C. Bovik (2006) A statistical evaluation of recent full reference image quality assessment algorithms. IEEE Transactions on image processing 15 (11), pp. 3440–3451. Cited by: item 2.
  • [54] J. Sun, Z. Xu, and H. Shum (2008) Image super-resolution using gradient profile prior. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pp. 1–8. Cited by: §2.3.
  • [55] J. Sun, J. Zhu, and M. F. Tappen (2010) Context-constrained hallucination for image super-resolution. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pp. 231–238. Cited by: item 1.
  • [56] L. Sun and J. Hays (2012) Super-resolution from internet-scale scene matching. In Proceedings of the IEEE Conf. on International Conference on Computational Photography (ICCP), Cited by: §1.
  • [57] M. W. Thornton, P. M. Atkinson, and D. Holland (2006) Sub-pixel mapping of rural land cover objects from fine spatial resolution satellite sensor imagery using super-resolution pixel-swapping. International Journal of Remote Sensing 27 (3), pp. 473–491. Cited by: §1.
  • [58] J. Tian and K. Ma (2011) A survey on super-resolution imaging. Signal, Image and Video Processing 5 (3), pp. 329–342. Cited by: §1, §2.
  • [59] R. Timofte, V. De Smet, and L. Van Gool (2013) Anchored neighborhood regression for fast example-based super-resolution. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1920–1927. Cited by: §2.3.
  • [60] M. E. Tipping, C. M. Bishop, et al. (2002) Bayesian image super-resolution. In NIPS, Vol. 15, pp. 1279–1286. Cited by: §1, §2.
  • [61] H. R. Vaezi Joze, M. S. Drew, G. D. Finlayson, and P. A. T. Rey (2012) The role of bright pixels in illumination estimation. In Color and Imaging Conference, Vol. 2012, pp. 41–46. Cited by: §1.
  • [62] H. R. Vaezi Joze and M. S. Drew (2012) Exemplar-based colour constancy.. In BMVC, pp. 1–12. Cited by: §2.2.
  • [63] H. R. Vaezi Joze and M. S. Drew (2013) Exemplar-based color constancy and multiple illumination. IEEE transactions on pattern analysis and machine intelligence 36 (5), pp. 860–873. Cited by: §1.
  • [64] H. R. Vaezi Joze and O. Koller (2019-09) MS-asl: a large-scale data set and benchmark for understanding american sign language. In The British Machine Vision Conference (BMVC), Cited by: §2.3.
  • [65] Q. Wang, X. Tang, and H. Shum (2005) Patch based blind image super resolution. In Computer Vision, 2005. ICCV 2005. Tenth IEEE International Conference on, Vol. 1, pp. 709–716. Cited by: §2.1.
  • [66] Z. Wang, D. Liu, J. Yang, W. Han, and T. Huang (2015) Deep networks for image super-resolution with sparse prior. In Proceedings of the IEEE International Conference on Computer Vision, pp. 370–378. Cited by: §1.
  • [67] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli (2004) Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing 13 (4), pp. 600–612. Cited by: §5.2.
  • [68] C. Yang, C. Ma, and M. Yang (2014) Single-image super-resolution: a benchmark. In European Conference on Computer Vision, pp. 372–386. Cited by: §1, item 2, §2.1, §2.1, §2.3.
  • [69] J. Yang, Z. Wang, Z. Lin, S. Cohen, and T. Huang (2012) Coupled dictionary training for image super-resolution. IEEE transactions on image processing 21 (8), pp. 3467–3478. Cited by: §2.1.
  • [70] J. Yu, Y. Fan, J. Yang, N. Xu, Z. Wang, X. Wang, and T. Huang (2018) Wide activation for efficient and accurate image super-resolution. arXiv preprint arXiv:1808.08718. Cited by: §2.1, Figure 8, Figure 9, §5.1, §5.2, Table 3.
  • [71] S. W. Zamir, A. Arora, S. Khan, F. S. Khan, and L. Shao (2019) Learning digital camera pipeline for extreme low-light imaging. Technical report ArXiV. External Links: arXiv:1904.05939, Link Cited by: item 1.
  • [72] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang (2017) Beyond a gaussian denoiser: residual learning of deep cnn for image denoising. IEEE Transactions on Image Processing 26 (7), pp. 3142–3155. Cited by: §1, §2.2.
  • [73] K. Zhang, X. Gao, D. Tao, and X. Li (2012) Single image super-resolution with non-local means and steering kernel regression. IEEE Transactions on Image Processing 21 (11), pp. 4544–4556. Cited by: §1.
  • [74] S. Zhang, G. Liang, S. Pan, and L. Zheng (2019) A fast medical image super resolution method based on deep learning network. IEEE Access 7 (), pp. 12319–12327. External Links: Document, ISSN 2169-3536 Cited by: §1.
  • [75] Y. Zhang, Y. Tian, Y. Kong, B. Zhong, and Y. Fu (2018) Residual dense network for image super-resolution. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2472–2481. Cited by: §2.1, Figure 8, §5.1.
  • [76] H. Zhao, O. Gallo, I. Frosio, and J. Kautz (2015) Loss functions for neural networks for image processing. arXiv preprint arXiv:1511.08861. Cited by: Table 1, §2.3.
  • [77] Y. Zhao, Q. Chen, X. Sui, and G. Gu (2015) A novel infrared image super-resolution method based on sparse representation. Infrared Physics & Technology 71, pp. 506–513. Cited by: §1.
  • [78] X. Zhu, C. Vondrick, D. Ramanan, and C. C. Fowlkes (2012) Do we need more training data or better models for object detection?.. In BMVC, Vol. 3, pp. 5. Cited by: §1.