RGBW is a new type of CFA pattern (Fig. 1 (a)) designed to enhance image quality under low-light conditions. Thanks to the higher optical transmittance of white pixels compared with conventional red, green, and blue pixels, the signal-to-noise ratio (SNR) of the sensor output is significantly improved, boosting image quality especially in low light. Recently, several phone OEMs, including Transsion, Vivo, and Oppo, have adopted RGBW sensors in their flagship smartphones to improve camera image quality [2, 3, 1].
The binning mode of RGBW is mainly used in camera preview and video modes: pixels of the same color within a window are averaged along the diagonal direction to further improve image quality and reduce noise. A fusion algorithm is therefore needed that takes a diagonal-binning Bayer (DBinB) and a diagonal-binning white (DBinC) as input and produces a Bayer output with a better signal-to-noise ratio (SNR), as shown in Fig. 1 (b). A good fusion algorithm should (1) produce a Bayer output from RGBW with the fewest artifacts and (2) fully exploit the SNR and resolution benefits of the white pixels.
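To make the binning step concrete, the following NumPy sketch averages the two same-color pixels on the main diagonal of each 2×2 window (a hypothetical layout for illustration only; the actual RGBW pattern and window arrangement are those shown in Fig. 1):

```python
import numpy as np

def diagonal_bin(raw):
    """Average the two same-color samples on the main diagonal of each
    2x2 window, halving the resolution in each dimension.

    Hypothetical layout: assumes the same-color pair sits at the
    top-left and bottom-right of each window.
    """
    return 0.5 * (raw[0::2, 0::2] + raw[1::2, 1::2])
```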
The RGBW fusion problem becomes more challenging when the input DBinB and DBinC become noisy especially under low light conditions. A joint fusion and denoise task is thus in demand for real-world applications.
In this challenge, we intend to fuse the RGBW inputs (DBinB and DBinC in Fig. 1 (b)) to denoise and improve the Bayer output. The solution is not necessarily deep-learning-based. However, to facilitate deep learning training, we provide a dataset of high-quality binning-mode RGBW (DBinB and DBinC) and output Bayer pairs, covering 100 scenes (70 for training, 15 for validation, and 15 for testing). We provide a Data Loader to read these files and a simple ISP (Fig. 2) to visualize the RGB output from the Bayer and to calculate loss functions. Participants are also allowed to use other public-domain datasets. Algorithm performance is evaluated and ranked using objective metrics: Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index (SSIM), Learned Perceptual Image Patch Similarity (LPIPS), and KL divergence (KLD). The objective metrics of a baseline method are provided as a benchmark.
This challenge is part of the Mobile Intelligent Photography and Imaging (MIPI) 2022 workshop and challenges, which emphasize the integration of novel image sensors and imaging algorithms and are held in conjunction with ECCV 2022. MIPI 2022 consists of five competition tracks:
RGB+ToF Depth Completion uses sparse and noisy ToF depth measurements with RGB images to obtain a complete depth map.
Quad-Bayer Re-mosaic converts Quad-Bayer RAW data into Bayer format so that it can be processed by standard ISPs.
RGBW Sensor Re-mosaic converts RGBW RAW data into Bayer format so that it can be processed by standard ISPs.
RGBW Sensor Fusion fuses Bayer data and monochrome channel data into Bayer format to increase SNR and spatial resolution.
Under-display Camera Image Restoration improves the visual quality of the image captured by a new imaging system equipped with an under-display camera.
To develop high-quality RGBW fusion solutions, we provide the following resources for participants:
A high-quality RGBW (DBinB and DBinC in Fig. 1 (b)) and Bayer dataset. To the best of our knowledge, this is the first and only dataset of aligned RGBW and Bayer pairs, relieving the pain of data collection for developing learning-based fusion algorithms;
A data processing code with Data Loader to help participants get familiar with the provided dataset;
A simple ISP including basic ISP blocks to visualize the algorithm output and to calculate the loss function on RGB results;
A set of objective image quality metrics to measure the performance of a developed solution.
2.1 Problem Definition
The RGBW fusion task aims to fuse the DBinB and DBinC of RGBW (Fig. 1 (b)) to improve the image quality of the Bayer output. By incorporating the white pixels (DBinC), which have higher spatial resolution and higher SNR, the output Bayer can potentially achieve better image quality. In addition, because the binning mode of RGBW is mainly used for the preview and video modes in smartphones, fusion algorithms are required to be lightweight and power-efficient. While we do not rank solutions by running time or memory footprint, computational cost is one of the most important criteria in real applications.
2.2 Dataset: Tetras-RGBW-Fusion
The training data contains 70 scenes of aligned RGBW (DBinB and DBinC inputs) and Bayer (ground-truth) pairs. For each scene, the DBinB at 0dB is used as the ground truth. Noise is synthesized on the 0dB DBinB and DBinC data to provide noisy inputs at 24dB and 42dB. The synthesized noise consists of read noise and shot noise, and the noise models were measured on an RGBW sensor. The data generation steps are shown in Fig. 3. The testing data contains DBinB and DBinC inputs of 15 scenes at 24dB and 42dB; the ground-truth Bayer results are not available to participants.
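The read-plus-shot noise synthesis described above can be sketched as follows (a minimal NumPy sketch; `k_shot` and `sigma_read` are hypothetical placeholder parameters, whereas the challenge uses noise models measured on a real RGBW sensor):

```python
import numpy as np

def add_read_shot_noise(clean, gain_db, k_shot=0.001, sigma_read=0.002, rng=None):
    """Synthesize signal-dependent shot noise plus signal-independent
    read noise on a normalized (0..1) 0dB raw image, scaled by the
    analog gain. Parameter values here are illustrative placeholders."""
    rng = np.random.default_rng() if rng is None else rng
    gain = 10.0 ** (gain_db / 20.0)
    shot_var = k_shot * clean * gain      # variance proportional to the signal
    read_var = (sigma_read * gain) ** 2   # constant variance floor
    noise = rng.normal(0.0, np.sqrt(shot_var + read_var))
    return np.clip(clean + noise, 0.0, 1.0)
```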
2.3 Challenge Phases
The challenge consists of the following phases:
Development: The registered participants get access to the data and baseline code, and are able to train the models and evaluate their running time locally.
Validation: The participants can upload their models to the remote server to check the fidelity scores on the validation dataset, and to compare their results on the validation leaderboard.
Testing: The participants submit their final results, code, models, and factsheets.
2.4 Scoring System
2.4.1 Objective Evaluation
The evaluation consists of (1) the comparison of the fused output (Bayer) with the reference ground truth Bayer, and (2) the comparison of RGB from the predicted and ground truth Bayer using a simple ISP (the code of the simple ISP is provided). We use
Peak Signal-to-Noise Ratio (PSNR)
Structural Similarity Index Measure (SSIM) 
Learned Perceptual Image Patch Similarity (LPIPS) 
to evaluate the fusion performance. PSNR, SSIM, and LPIPS are computed on the RGB images rendered from the Bayer by the provided simple ISP code, while KLD is evaluated on the predicted Bayer directly.
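For reference, PSNR and a histogram-based KLD can be implemented in a few lines (a minimal NumPy sketch assuming images normalized to [0, 1]; the official metric code provided with the challenge is authoritative):

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak Signal-to-Noise Ratio in dB."""
    mse = np.mean((pred.astype(np.float64) - target.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

def kld(pred, target, bins=256, eps=1e-12):
    """KL divergence between the intensity histograms of two images."""
    p, _ = np.histogram(pred, bins=bins, range=(0.0, 1.0))
    q, _ = np.histogram(target, bins=bins, range=(0.0, 1.0))
    p = p / p.sum() + eps
    q = q / q.sum() + eps
    return float(np.sum(p * np.log(p / q)))
```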
A metric weighting PSNR, SSIM, LPIPS, and KLD is used to give the final ranking of each method, and we report each metric separately as well. The code to calculate the metrics is provided. The weighted metric, M4, is defined in Eq. (1); the M4 score is between 0 and 100, and a higher score indicates better overall image quality.
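If the weighted metric follows the multiplicative form used in the companion MIPI 2022 challenge reports (an assumption here; Eq. (1) gives the authoritative definition), it can be written as:

```latex
\mathrm{M4} = \mathrm{PSNR} \cdot \mathrm{SSIM} \cdot 2^{\,1 - \mathrm{LPIPS} - \mathrm{KLD}}
```

Under this form, a typical result with PSNR around 40 dB, SSIM near 1, and small LPIPS and KLD lands in the upper half of the 0-100 range.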
For each dataset we report the average results over all the processed images belonging to it.
3 Challenge Results
Six teams submitted their results in the final phase, and their results were verified using the submitted code. Table 1 summarizes the results of the final test phase. LLCKP, MegNR, and jzsherlock are the top three teams ranked by the M4 metric defined in Eq. (1), with LLCKP showing the best overall performance. The proposed methods are described in Section 4, and the team members and affiliations are listed in Appendix 0.A.
To learn more about algorithm performance, we also evaluated qualitative image quality in addition to the objective IQ metrics, as shown in Fig. 4 and Fig. 5. While all teams in Table 1 achieved high PSNR and SSIM, detail and texture loss can be observed in the yellow box in Fig. 4 and on the test chart in Fig. 5. When the input is heavily noisy and the scene is under low-light conditions, oversmoothing tends to yield higher PSNR at the cost of perceptual detail loss.
In addition to benchmarking the image quality of fusion algorithms, we evaluated computational efficiency because of the wide adoption of RGBW sensors in smartphones. We measured the running time of the RGBW fusion solutions of the top three teams in Table 2. While running time is not used to rank fusion algorithms in this challenge, computational cost is critical when developing algorithms for smartphones. jzsherlock achieved the shortest running time among the top three solutions on a workstation GPU (NVIDIA Tesla V100-SXM2-32GB). With the sensor resolution of mainstream smartphones reaching 64M pixels or higher, power-efficient fusion algorithms are highly desirable.
Table 2 lists, for each team, the running time measured on 1200×1800 inputs and the estimated running time on 16M inputs.
4 Challenge Methods
In this section, we describe the solutions submitted by all teams participating in the final stage of the MIPI 2022 RGBW Joint Fusion and Denoise Challenge.
BITSpectral developed a transformer-based network, the Fusion Cross-Patch Attention Network (FCPAN), for this joint fusion and denoising task. FCPAN is presented in Fig. 6 (a) and consists of a Deep Feature Fusion Module (DFFM) and several Cross-Patch Attention Modules (CPAM). The input of the DFFM contains an RGGB Bayer pattern and a W channel. The output of the DFFM is the fused RGBW features, which are fed to the CPAMs for deep feature extraction. CPAM is a U-shaped network with spatial downsampling to reduce computational complexity. They proposed to use 4 CPAMs in the network.
Fig. 6 also includes the details of the Swin Transformer Layer (STL), the Cross-Patch Attention Block (CPAB), and the Cross-Patch Attention Multi-Head Self-Attention (CPA-MSA). They used STL to extract attention within feature patches at each stage and CPAB to directly obtain global attention among patches at the innermost stage. Compared with STL, CPAB has an extended perceptual range due to its cross-patch attention.
BIVLab proposed a Self-Guided Spatial-Frequency Complement Network (SG-SFCN) for the RGBW joint fusion and denoise task. As shown in Fig. 7, Swin Transformer layers (STL) are adopted to extract rich features from DBinB and DBinC separately. SpaFre blocks (SFB) then fuse DBinB and DBinC in complementary spatial and frequency domains. To handle different noise levels, the features extracted by the STL, which contain noise-level information, are applied to each SFB as guidance. Finally, the denoised Bayer is obtained by adding the predicted Bayer residual to the original DBinB Bayer. During training, all images are cropped into fixed-size patches to guarantee essential global information.
HIT-IIL proposed a NAFNet-based model for the RGBW joint fusion and denoise task. As shown in Fig. 8, the framework consists of a 4-level encoder-decoder and a bottleneck module. For the encoder, the numbers of NAFNet blocks at each level are 2, 2, 4, and 8. For the decoder, the number of NAFNet blocks is set to 2 for all 4 levels. In addition, the bottleneck module contains 24 NAFNet blocks. Unlike the original NAFNet design, the skip connection between the input and the output is removed in their method.
During training, they also used two data augmentation strategies. The first is mixup, which generates a synthesized image as
I_syn = λ · I_24dB + (1 − λ) · I_42dB,
where I_24dB and I_42dB denote images of the same scene with noise levels of 24dB and 42dB, and λ is a random variable sampled between 0 and 1. Their second augmentation strategy is the image flip.
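The mixup step can be sketched in NumPy as follows (a minimal sketch; names are illustrative):

```python
import numpy as np

def mixup(img_24db, img_42db, rng=None):
    """Blend the 24dB and 42dB noisy images of the same scene with a
    random weight in (0, 1) to synthesize intermediate noise levels."""
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.uniform(0.0, 1.0)
    return lam * img_24db + (1.0 - lam) * img_42db
```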
Jzsherlock proposed a dual-branch network for the RGBW joint fusion and denoise task. The architecture, consisting of a Bayer branch and a white branch, is shown in Fig. 9. The Bayer branch takes a normalized noisy Bayer image as input and outputs the denoised result. After a pixel-unshuffle operation with scale=2, the Bayer image is converted into 4 GBRG channels. Stacked ResBlocks without BatchNorm (BN) layers extract feature maps from the noisy Bayer image. The white branch likewise extracts features from the corresponding white image using stacked ResBlocks. An average pooling layer rescales the white-image features to the same size as the Bayer-branch features for fusion. Several Residual-in-Residual Dense Blocks (RRDB) are then applied to the fused feature maps for restoration. After the RRDB blocks, a Conv+LeakyReLU+Conv structure enlarges the number of feature-map channels by a factor of 4, and a pixel shuffle with scale=2 upscales the feature maps back to the input size. A Conv layer converts the output to the 4 GBRG channels, and finally a skip connection adds the input Bayer to form the denoised result.
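The pixel-unshuffle and pixel-shuffle operations used in this branch can be sketched in NumPy for a single-channel mosaic (a minimal sketch; the actual network operates on batched feature tensors):

```python
import numpy as np

def pixel_unshuffle(bayer, scale=2):
    """Rearrange an HxW mosaic into scale*scale half-resolution planes,
    e.g. the 4 GBRG channels of a Bayer image for scale=2."""
    return np.stack([bayer[i::scale, j::scale]
                     for i in range(scale) for j in range(scale)], axis=0)

def pixel_shuffle(planes, scale=2):
    """Inverse operation: interleave the planes back into a full-resolution mosaic."""
    c, h, w = planes.shape
    out = np.empty((h * scale, w * scale), dtype=planes.dtype)
    for k in range(c):
        out[k // scale::scale, k % scale::scale] = planes[k]
    return out
```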
The network is trained with an L1 loss in the normalized domain. Normalization uses min=64 and max=1023, with values outside this range clipped.
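This normalization can be sketched as follows (assuming the 10-bit raw range with black level 64 and white level 1023 stated above):

```python
import numpy as np

BLACK_LEVEL, WHITE_LEVEL = 64, 1023  # 10-bit raw range used for normalization

def normalize(raw):
    """Map raw values to [0, 1], clipping values outside [64, 1023]."""
    x = (raw.astype(np.float32) - BLACK_LEVEL) / (WHITE_LEVEL - BLACK_LEVEL)
    return np.clip(x, 0.0, 1.0)

def denormalize(x):
    """Map a [0, 1] prediction back to the 10-bit raw range."""
    return np.clip(x, 0.0, 1.0) * (WHITE_LEVEL - BLACK_LEVEL) + BLACK_LEVEL
```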
LLCKP proposed a denoising method based on an existing image restoration model, Restormer. As shown in Fig. 10, they synthesized RGBW images from ground-truth GBRG images with additional synthetic noise and real-noise pairs (the noisy images provided by the challenge). They also used 20,000 pairs of RAW images from SIDD with normal exposure, together with synthesized RGBW images, as extra data. The Restormer model's weights are pre-trained on SIDD RGB images. Data augmentation and CutMix are applied during the training phase.
MegNR proposed a pipeline for the RGBW joint fusion and denoise task; the overall diagram is shown in Fig. 11. Pixel-unshuffle (PU) is first applied to the RGBW images to split them into independent channels. Inspired by Uformer, they developed an RGBW fusion and reconstruction network, HAUformer. They replaced the LeWin blocks in Uformer's original design with two modules, the Hybrid Attention Local-Enhanced Block (HALEB) and the Overlapping Cross-Attention Block (OCAB), to capture more long-range dependencies and useful local context. Finally, a pixel-shuffle (PS) module restores the output to the standard Bayer format.
In this paper, we summarized the Joint RGBW Fusion and Denoise challenge of the first Mobile Intelligent Photography and Imaging workshop (MIPI 2022), held in conjunction with ECCV 2022. The participants were provided with a high-quality training/testing dataset for RGBW fusion and denoising, which is now available for researchers to download for future research. We are excited to see so many submissions within such a short period, and we look forward to more research in this area.
We thank Shanghai Artificial Intelligence Laboratory, Sony, and Nanyang Technological University for sponsoring this MIPI 2022 challenge, and we thank all the organizers and participants for their great work.
Appendix 0.A Teams and Affiliations
Title: Fusion Cross-Patch Attention Network for RGBW Joint Fusion and Denoise
Members: Zhen Wang (email@example.com), Daoyu Li, Yuzhe Zhang, Lintao Peng, Xuyang Chang, Yinuo Zhang, Liheng Bian
Affiliations: Beijing Institute of Technology
Title: Self-Guided Spatial-Frequency Complement Network for RGBW Joint Fusion and Denoise
Members: Bing Li (firstname.lastname@example.org), Jie Huang, Mingde Yao, Ruikang Xu, Feng Zhao
Affiliations: University of Science and Technology of China
Title: NAFNet for RGBW Image Fusion
Members: Xiaohui Liu (email@example.com), Xiaohui Liu, Rongjian Xu, Zhilu Zhang, Xiaohe Wu, Ruohao Wang, Junyi Li, Wangmeng Zuo
Affiliations: Harbin Institute of Technology
Title: Dual Branch Network for Bayer Image Denoising Using White Pixel Guidance
Members: Zhuang Jia (firstname.lastname@example.org)
Title: Synthetic RGBW image and noise
Members: DongJae Lee (email@example.com)
Title: HAUformer: Hybrid Attention-guided U-shaped Transformer for RGBW Fusion Image Restoration
Members: Ting Jiang (firstname.lastname@example.org), Qi Wu, Chengzhi Jiang, Mingyan Han, Xinpeng Li, Wenjie Lin, Youwei Li, Haoqiang Fan, Shuaicheng Liu
Affiliations: Megvii Technology
-  Camon 19 pro, https://www.tecno-mobile.com/phones/product-detail/product/camon-19-pro-5g
-  Oppo unveils multiple innovative imaging technologies, https://www.oppo.com/en/newsroom/press/oppo-future-imaging-technology-launch/
-  Vivo X80 is the only Vivo smartphone with a Sony IMX866 sensor: the world's first RGBW bottom sensor. https://www.vivoglobal.ph/vivo-X80-is-the-only-vivo-smartphone-with-a-Sony-IMX866-Sensor-The-Worlds-First-RGBW-Bottom-Sensors/
-  Chen, L., Chu, X., Zhang, X., Sun, J.: Simple baselines for image restoration. arXiv preprint arXiv:2204.04676 (2022)
-  Liu, J., Wu, C.H., Wang, Y., Xu, Q., Zhou, Y., Huang, H., Wang, C., Cai, S., Ding, Y., Fan, H., et al.: Learning raw image denoising with Bayer pattern unification and Bayer preserving augmentation. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) (2019)
-  Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) pp. 10012–10022 (2021)
-  Shi, W., Caballero, J., Huszár, F., Totz, J., Aitken, A.P., Bishop, R., Rueckert, D., Wang, Z.: Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) pp. 1874–1883 (2016)
-  Sun, B., Zhang, Y., Jiang, S., Fu, Y.: Hybrid pixel-unshuffled network for lightweight image super-resolution. arXiv preprint arXiv:2203.08921 (2022)
-  Wang, X., Yu, K., Wu, S., Gu, J., Liu, Y., Dong, C., Qiao, Y., Change Loy, C.: ESRGAN: Enhanced super-resolution generative adversarial networks. Proceedings of the European Conference on Computer Vision Workshops (ECCVW) (2018)
-  Wang, Z., Cun, X., Bao, J., Zhou, W., Liu, J., Li, H.: Uformer: A general u-shaped transformer for image restoration. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) pp. 17683–17693 (2022)
-  Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13(4), 600–612 (2004)
-  Xu, S., Zhang, J., Zhao, Z., Sun, K., Liu, J., Zhang, C.: Deep gradient projection networks for pan-sharpening. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) pp. 1366–1375 (2021)
-  Yun, S., Han, D., Oh, S.J., Chun, S., Choe, J., Yoo, Y.: CutMix: Regularization strategy to train strong classifiers with localizable features. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) pp. 6023–6032 (2019)
-  Zamir, S.W., Arora, A., Khan, S., Hayat, M., Khan, F.S., Yang, M.H.: Restormer: Efficient transformer for high-resolution image restoration. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) pp. 5728–5739 (2022)
-  Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018)