SurroundNet: Towards Effective Low-Light Image Enhancement

10/11/2021 ∙ by Fei Zhou, et al. ∙ 3

Although Convolutional Neural Networks (CNNs) have made substantial progress in the low-light image enhancement task, one critical problem of CNNs is the paradox of model complexity and performance. This paper presents a novel SurroundNet which involves fewer than 150K parameters (about 80-98 percent size reduction compared to SOTAs) and achieves very competitive performance. The proposed network comprises several Adaptive Retinex Blocks (ARBlock), which can be viewed as a novel extension of Single Scale Retinex in feature space. The core of our ARBlock is an efficient illumination estimation function called Adaptive Surround Function (ASF). It can be regarded as a general form of surround functions and can be implemented by convolution layers. In addition, we also introduce a Low-Exposure Denoiser (LED) to smooth the low-light image before the enhancement. We evaluate the proposed method on a real-world low-light dataset. Experimental results demonstrate the superiority of the proposed SurroundNet in both performance and network parameters against state-of-the-art low-light image enhancement methods. Code is available at https:




I Introduction

Shadow or a pedestrian? It makes no difference when driving at night until your headlights sweep over it. Poor illumination can appear anywhere and brings complex image degradation, such as signal-dependent noise and low contrast. It hurts the performance of any vision system, human or computer. To handle this problem, various theories have been proposed, which can generally be divided into two categories: histogram-based and Retinex-based methods [23]. In recent years, Convolutional Neural Networks (CNNs) have produced compelling results in low-level image processing tasks such as denoising [24] and super-resolution [4]. This makes CNN-based methods a good alternative and a popular route for image enhancement. Owing to data-driven learning and deep structure, CNN-based methods can achieve better performance than conventional ones on some specific datasets. Some works have also successfully combined Retinex and CNNs [46, 34, 43]. Despite this success, all of those Retinex-CNN methods drop the surround function of conventional Retinex methods such as Single Scale Retinex (SSR) [14] and instead stack convolution layers to obtain the illumination map. This is reasonable to some extent, since data-driven illumination estimation can restrain halo artifacts and is more robust. However, unlike a surround function, a CNN carries hardly any prior knowledge about illumination. To achieve the same ends, networks have to spend numerous redundant parameters on learning the illumination map, which inevitably increases model complexity and slows inference.

Fig. 1: Visual comparisons with two typical low-light image enhancement methods. The right side shows the zoomed-in images of the selected small square area (image from the LOL dataset). The proposed SurroundNet achieves remarkable enhancement in brightness, color, and sharpness, while the other two methods produce image artifacts or low color saturation.

To address the above problem, we design a novel surround function, ASF (Adaptive Surround Function), to estimate the illumination map. It has the form of a conventional surround function and can be trained end to end. Based on the ASF, a reasonable and efficient network, SurroundNet, is constructed to brighten the low-light image. Fig. 1 shows visual examples of our method compared with two SOTAs on one typical low-light image to briefly illustrate its performance. Our method achieves visually promising results in terms of brightness, color, and sharpness. The contributions of our work are summarized as follows.

  • Light-weight low-light image enhancement: The proposed SurroundNet achieves competitive performance against several state-of-the-art methods in various image quality measurements, with fewer than 150K parameters (about 80-98 percent size reduction compared to SOTAs).

  • Adaptive Surround Function (ASF): We propose a new surround function, i.e., ASF, to estimate the illumination map, which can be implemented by only two 1D convolution layers. Compared with the stacked 2D convolution layers and extra training in Retinex-CNNs [46][50], ASF has far fewer parameters and is easier to optimize.

  • Adaptive Retinex Block (ARBlock): We propose the ARBlock for both illumination adjustment and reflectance enhancement. The structure of ARBlock can be regarded as an extension of Single Scale Retinex (SSR) in feature space. We present more discussion in Section III-C.

  • Low-Exposure Denoiser (LED): As brightness enhancement usually amplifies the noise, we further design a low-light denoising module to smooth the image before enhancement and use synthetic low-light images to supervise its training.

The rest of this paper is organized as follows: Section II summarizes the most related works. Section III describes the proposed SurroundNet. Section IV reports the comparison experiments conducted with SOTAs and analyzes the results. Section V gives a series of comparative ablation experiments. Finally, we conclude our study in Section VI.

II Related works

II-A Histogram based contrast enhancement

As contrast degradation is the main characteristic of low-light images, some simple and effective histogram-based methods like Histogram Equalization (HE) [29, 1] were introduced to improve their visual perception. However, those methods do not take the real brightness information into consideration, which may result in over- or under-enhancement [6]. Some advanced versions, like brightness preservation [39] or contrast limitation [31], can avoid these issues to some extent. Nevertheless, under seriously non-uniform illumination the histograms of different image parts are very inconsistent [42], which limits the performance of those histogram-based methods.
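To make the histogram-based idea concrete, the following is a minimal sketch of plain global histogram equalization for an 8-bit grayscale image. It is not taken from the cited works; the function name `equalize_histogram` is a hypothetical choice, and the mapping through the cumulative distribution is the textbook form rather than any specific variant discussed above.

```python
import numpy as np

def equalize_histogram(gray):
    """Global histogram equalization for a uint8 grayscale image (textbook form)."""
    hist = np.bincount(gray.ravel(), minlength=256)   # intensity histogram
    cdf = np.cumsum(hist) / gray.size                 # cumulative distribution in [0, 1]
    lut = np.round(255 * cdf).astype(np.uint8)        # look-up table: old -> new intensity
    return lut[gray]

# A dark image whose intensities only span 0..49.
dark = np.arange(64, dtype=np.uint8).reshape(8, 8) % 50
bright = equalize_histogram(dark)
```

Because the mapping depends only on the global histogram, every pixel with the same intensity is treated identically, which is why such methods struggle when illumination varies strongly across the image.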

II-B Retinex based brightness enhancement

Retinex theory indicates that a natural image can be decomposed into two parts, i.e., the reflectance and the illumination. In the Retinex framework, a low-light image can be regarded as an image with a low-intensity illumination part. The reflectance part hinges only on the scene itself, regardless of the lighting conditions, so the reflectance image can give good visual quality for a low-light image. Unfortunately, the decomposition of an image is a highly ill-posed problem [43]. To settle the matter, a surround function/operation is designed to estimate the illumination part, from which the reflectance is then obtained. Specifically, the illumination map is obtained by weighting the surrounding pixels, and the weighting scheme is called the surround function. Surround functions take different forms, such as the inverse-square function $1/r^2$ [16] and the exponential form $e^{-r/c}$ [26], where $r = \sqrt{x^2 + y^2}$, $x$ and $y$ are the coordinates, and $c$ is a constant value. SSR is another early attempt; it uses the Gaussian filter as the surround function. As mentioned above, SSR assumes the image can be decomposed into two parts:

$$S(x, y) = I(x, y) \cdot R(x, y) \tag{1}$$

where $S$ represents the image, and $I$ and $R$ represent the illumination and reflectance. Accordingly, the light-free image can be estimated by the following form.

$$R(x, y) = \log S(x, y) - \log\left[F(x, y) * S(x, y)\right] \tag{2}$$

where $F$ is the surround function defined as a Gaussian kernel:

$$F(x, y) = K e^{-(x^2 + y^2)/c^2} \tag{3}$$

where $K$ is a weight to make $F$ comply with the following constraint.

$$\iint F(x, y)\, dx\, dy = 1 \tag{4}$$

The logarithmic function converts division to subtraction and $*$ denotes the convolution operation. The original SSR research work [14] investigates the placement of the logarithmic function and suggests a candidate formula as follows.

$$R(x, y) = \log S(x, y) - F(x, y) * \log S(x, y) \tag{5}$$

Both forms are valid in practice. In this paper, we choose the latter formula.
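As an illustration of the latter form (logarithm first, then the surround), here is a minimal NumPy sketch of Single Scale Retinex on a single-channel image. The function names, the default values of `c`, `radius` and `eps`, and the edge padding are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def gaussian_kernel_1d(c, radius):
    """1D Gaussian surround exp(-r^2 / c^2), normalized so the weights sum to 1."""
    r = np.arange(-radius, radius + 1, dtype=np.float64)
    k = np.exp(-(r ** 2) / (c ** 2))
    return k / k.sum()

def blur_separable(img, k):
    """Filter a 2D array with the outer product of k, using edge padding."""
    radius = len(k) // 2
    padded = np.pad(img, radius, mode="edge")
    # Filter along rows, then along columns; 'valid' removes the padding again.
    rows = np.apply_along_axis(np.convolve, 1, padded, k, mode="valid")
    return np.apply_along_axis(np.convolve, 0, rows, k, mode="valid")

def single_scale_retinex(img, c=15.0, radius=30, eps=1e-6):
    """R = log S - F * log S, with the logarithm applied before the surround."""
    log_s = np.log(img.astype(np.float64) + eps)      # eps avoids log(0)
    return log_s - blur_separable(log_s, gaussian_kernel_1d(c, radius))

# A perfectly flat image has no reflectance detail, so the output is zero.
flat = np.full((8, 8), 0.3)
flat_reflectance = single_scale_retinex(flat)
```

A smaller `c` concentrates the surround near the center pixel, while a larger one averages over a wider neighborhood.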

Fig. 2: (a) Surround functions. The top-left sub-figure compares three kinds of surround functions: inverse square $1/r^2$, exponential $e^{-r/c}$, and Gaussian $e^{-r^2/c^2}$. The remaining three sub-figures are the Gaussian Retinex results with different $c$.

The surround function is of great significance in Retinex-based methods. Several surround functions are shown in Fig. 2. Although SSR reports that the Gaussian filter surpasses the others, the enhancement result is sensitive to the variance parameter $c$. A small value means better dynamic range compression, and a large one means more natural tonal rendition [13], as shown in Fig. 2. Meanwhile, the halo effect is another critical problem that has not been well solved. To provide both dynamic range compression and tonal rendition and to suppress halos, Multi-Scale Retinex (MSR) uses Gaussian filters of different sizes to obtain a better illumination map and improve the visual quality. The multi-scale tactic works, but halo artifacts still remain on edges where the illumination changes dramatically. Moreover, it is not easy to choose the sizes of the different Gaussian filters. In this paper, we propose a novel Adaptive Surround Function (ASF), which can be regarded as a convolution kernel, to learn a general surround form. Details can be found in Section III-B.

Fig. 3: Architecture of SurroundNet. Each cube represents the feature maps of the corresponding operation, which is denoted by a colored rectangle. The details of the ARBlocks and ECA can be found in Section III.

II-C CNN based image enhancement

LLNet [21] attempts to enhance the low-light image with a CNN (a denoising auto-encoder). EnlightenGAN [12] employs Generative Adversarial Networks (GANs) to translate the low-light image into a normal-light image. Owing to the elegant mathematical foundation of Retinex theory, some works try to combine Retinex and CNNs. Retinex-Net [46] trains one sub-network (Decompose-Net) for decomposition and another one (Enhance-Net) for enhancing the illumination. MSR-net [34] investigates the relationship between Multi-Scale Retinex and CNNs and uses a stack of convolutions to play a role analogous to the Gaussian surround function. Progressive Retinex [43] is more concerned with the signal-dependent noise problem. Two sub-networks, i.e., IM-Net and NM-Net, are designed to estimate the illumination and the noise level, respectively, and the two networks mutually benefit from each other.

II-D Feature fusion

Feature fusion has demonstrated its ability to boost the performance of CNNs, for example through multi-scale feature fusion [19][36] and the channel attention mechanism. For the low-light image enhancement task, MBLLEN [23] uses multi-branch features to improve the image quality, and DLN [40] has a similar multi-branch structure. Although multi-branch processing broadens the feature channels significantly, it inevitably increases the number of parameters. Therefore, Efficient Channel Attention (ECA) [41] is a good choice for reducing the parameters. It first applies Global Average Pooling (GAP) [18] to obtain aggregated features:

$$y_c = \frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} x_c(i, j) \tag{6}$$

where $H$, $W$, and $C$ are the height, width and channel size, $x$ is the output of the prior convolution stage, and $c = 1, \dots, C$ indexes the channels. GAP removes the spatial dimensions of the feature maps, which makes channel-wise processing possible. Channel attention methods, such as Squeeze-and-Excitation (SE) [9], apply a few dense layers on the output $y$ to get the channel weights $\omega$:

$$\omega = \sigma\left(W_2\, \mathrm{ReLU}(W_1 y)\right) \tag{7}$$

where ReLU is the Rectified Linear Unit, $\sigma$ represents the Sigmoid function, and $W_1$ and $W_2$ are the weights of the two dense layers. The size of the latent feature $W_1 y$ is usually less than the size of $y$. This reduces model complexity but weakens the direct relationship between $y$ and $\omega$. To avoid the issue, ECA uses a 1D convolution to obtain the channel weights:

$$\omega = \sigma\left(\mathrm{C1D}_k(y)\right) \tag{8}$$

where $\mathrm{C1D}_k$ denotes 1D convolution and $k$ is its kernel size. For a 1D convolution, the kernel size is exactly equal to the number of parameters, so ECA clearly reduces the model complexity. In this paper, we modify the original ECA by using several 1D convolutions to improve performance. Specifically, the channel weight used in this paper turns into the following equation.
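For orientation, the single-convolution ECA weighting described above (global average pooling, one 1D convolution across channels, then a sigmoid) can be sketched in NumPy as follows. This illustrates the original ECA idea only, not the modified multi-convolution form used in this paper; the function names and the zero padding at the channel borders are assumptions for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def eca_weights(features, kernel):
    """features: (H, W, C) array; kernel: odd-length 1D array holding the k weights."""
    y = features.mean(axis=(0, 1))                  # GAP: one value per channel
    pad = len(kernel) // 2
    y_padded = np.pad(y, pad, mode="constant")      # zero padding keeps C outputs
    # Deep-learning "convolution" is cross-correlation, hence the flipped kernel.
    z = np.convolve(y_padded, kernel[::-1], mode="valid")
    return sigmoid(z)                               # channel weights in (0, 1)

def apply_eca(features, kernel):
    """Rescale every channel by its attention weight."""
    return features * eca_weights(features, kernel)

feat = np.ones((4, 4, 8))
weights = eca_weights(feat, np.array([0.0, 1.0, 0.0]))   # identity kernel
```

The whole attention step has only `len(kernel)` learnable values, which is the parameter saving the text refers to.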


III The proposed SurroundNet

III-A The architecture of SurroundNet

Fig. 3 illustrates the structure of our proposed SurroundNet. The LED module first removes noise under the low-exposure condition. Then the shallow feature extraction (one convolution) projects the denoised low-light image to feature space. After that, we design multi-branch ARBlocks to enhance the image at various scales. The shallow feature and the enhanced results are combined in the ECA module. Finally, another convolution layer fuses the features from ECA and produces the lightened result. The ARBlock and LED modules reduce the network size. Even though the ECA module slightly increases the number of parameters by using a larger kernel and several 1D convolution layers (as shown in Equation 9), it is still lightweight.

III-B Adaptive Surround Function

Previous research on surround functions [14] shows that the Gaussian surround performs better than the inverse-square surround and the exponential surround. However, the selection of the Gaussian parameters is empirical and cannot handle complex illumination environments. In this work, we design a new Adaptive Surround Function (ASF) whose form is learned in a data-driven way and is not tied to a particular surround form. Specifically, it is implemented by one convolution layer, which means our ASF is a special convolution kernel.

Fig. 4 visualizes the procedure of constructing the ASF (convolution kernel) with a toy example. At first, we build a 1D convolution kernel $x$ with $n$ learnable parameters as the input (e.g., Fig. 4(a)). Then we take the cumulative sum of $x$ to guarantee that the result $x_c$ is monotonically increasing, as shown in Fig. 4(b). Fig. 4(c) uses a mirror flip of $x_c$, excluding the last element, to get $x_m$. After that, we concatenate $x_c$ and $x_m$ to get a symmetrical kernel $s$ with a larger receptive field, as shown in Fig. 4(d). Finally, we normalize $s$ with Equ. 4. For intuitive comparison, Fig. 4(e) shows a Gaussian surround which has the same weights as our ASF, so the Gaussian surround can be considered a special case of our ASF. In fact, we can get various kinds of surround functions (e.g., inverse square and exponential) by setting $x$ to appropriate values. In our experiments, $x$ is learnable and we initialize it to an all-one vector.

Fig. 4: The visualization of building the ASF convolution kernel. In this toy example, $n$ is set to 5 and the normalization step is omitted.

Moreover, the details of the construction of the 1D-ASF can be found in Algorithm 1. The construction shows that a size-$n$ ASF possesses a receptive field of $2n-1$. It is spatially symmetrical and responds most strongly at the center pixel. All the operations introduced above are differentiable and can be optimized end to end.

Input: 1D-vector $x$ with size $n$
Output: Surround Function $f$

// Processing the vector with cumulative sum operation.
$x_c$ = Cumsum(Abs($x$))
// Mirror and concatenate.
$x_m$ = Fliplr($x_c$[:-1])
$s$ = Concat($x_c$, $x_m$)
// Normalization
$f$ = $s$ / Sum($s$)
Algorithm 1 1D-ASF kernel construction
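The construction in Algorithm 1 can be sketched directly in NumPy. This is a plain forward computation for illustration; in the actual network the vector would be a trainable parameter inside a deep-learning framework, and the function name `asf_kernel_1d` is a hypothetical choice.

```python
import numpy as np

def asf_kernel_1d(x):
    """Build a symmetric, normalized 1D surround kernel from n free parameters."""
    x = np.asarray(x, dtype=np.float64)
    rising = np.cumsum(np.abs(x))        # non-decreasing left half, peak at the end
    mirrored = rising[:-1][::-1]         # flipped copy without the peak element
    s = np.concatenate([rising, mirrored])
    return s / s.sum()                   # weights sum to 1

# With the all-one initialization and n = 5, the kernel is a triangle of length 9.
kernel = asf_kernel_1d(np.ones(5))
# kernel * 25 is [1, 2, 3, 4, 5, 4, 3, 2, 1]
```

The absolute value makes every increment non-negative, so the kernel always rises toward the center regardless of the signs of the learned parameters.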

The 1D-ASF can be extended to a 2D form by multiplying it with its own transpose (i.e., Eq. 10).

$$F = f^{T} f \tag{10}$$

where $F$ is the 2D-ASF kernel and $f$ defines the 1D-ASF kernel (a row vector). Moreover, for efficiency, a 2D-Gaussian kernel can be divided into two 1D-Gaussian kernels, which can be described mathematically as follows:

$$S * G = (S * g_x) * g_y \tag{11}$$

where $S$ is the image, $G$ is a 2D-Gaussian kernel, $g_x$ and $g_y$ are two 1D-Gaussian kernels, and $*$ means the convolution operation.
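The separability property can be checked numerically. The sketch below compares filtering with a full 2D kernel, built as the outer product of a symmetric 1D kernel, against two successive 1D passes. The helper names are hypothetical, and only the "valid" region is compared so that border handling does not matter.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def filter2d_valid(img, kernel):
    """Direct 2D filtering over the valid region (the kernel is symmetric here)."""
    windows = sliding_window_view(img, kernel.shape)    # (H-k+1, W-k+1, k, k)
    return np.einsum("ijkl,kl->ij", windows, kernel)

def filter_separable_valid(img, k1d):
    """Same result with one 1D pass along rows and one along columns."""
    rows = np.apply_along_axis(np.convolve, 1, img, k1d, mode="valid")
    return np.apply_along_axis(np.convolve, 0, rows, k1d, mode="valid")

rng = np.random.default_rng(0)
img = rng.uniform(size=(12, 12))
k1d = np.array([1.0, 2.0, 3.0, 2.0, 1.0])
k1d /= k1d.sum()

full = filter2d_valid(img, np.outer(k1d, k1d))    # 5 x 5 = 25 weights
separable = filter_separable_valid(img, k1d)      # 5 + 5 = 10 weights
```

For a kernel of length k the 2D form stores k squared weights while the separable form stores 2k, and the gap widens as the receptive field grows.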

Obviously, the combination of two separable 1D-kernels has far fewer parameters than a large 2D-kernel. Therefore, we formalize our 2D ASF into a lightweight module as shown in Eq. 12.


III-C Adaptive Retinex Block

In previous Retinex-based works like SSR, the surround function works in image space (gray or RGB). In this work, however, we extend the scope of the surround function from image space to feature space, which is obtained by pre-processing (a stack of several convolution layers). Such an extension is reasonable because the low-pass trait of ASF (just like other surround functions) can extract the low-frequency features, and the frequency separation strategy has demonstrated its effectiveness in improving the performance of CNNs [53]. More discussion and experiments can be found in Section IV.

Fig. 5: The architecture of ARBlock. The five images as shown below are the visualized results by averaging the channel of the feature maps.

The Adaptive Retinex Block (ARBlock) is designed for both illumination adjustment and reflectance enhancement and consists of three steps. Fig. 5 shows the architecture of our ARBlock. At first, we apply a logarithmic transformation to the input low-light image (LL). The illumination (I) is obtained by convolving LL with the ASF, and the reflectance (R) is the result of subtracting I from LL. The logarithmic transformation brightens the dark part. I represents the smooth, low-frequency part of the image, and R captures the edges. In the enhancement step, we enhance I and R separately. We use an illumination adjustment operation (i.e., one convolution layer) on I to raise the brightness and reduce the halo effect of the surround function, and a reflectance enhancement operation (i.e., two convolution layers with dilation 2) on R to correct the color and smooth the noise. Different from conventional Retinex, our work aims to obtain the normal-light image (NL) rather than only the reflectance image. Therefore, the last step is to fuse the enhanced I (EI) and R (ER). We use one convolution to fuse EI and ER rather than adding them. We adopt 4 ARBlocks in our experiments, and the sizes of the ASF in the different ARBlocks are set to 3, 7, 11, and 15, respectively.
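The decomposition step of the block (logarithm, ASF smoothing, subtraction) can be sketched as follows. This is only a rough illustration: it assumes strictly positive feature values so that the logarithm is defined, applies the same ASF kernel to every channel, and leaves out the learned adjustment, enhancement and fusion convolutions. All function names and the `eps` value are hypothetical.

```python
import numpy as np

def asf_kernel_1d(x):
    """Symmetric, normalized surround kernel built from n parameters."""
    rising = np.cumsum(np.abs(np.asarray(x, dtype=np.float64)))
    s = np.concatenate([rising, rising[:-1][::-1]])
    return s / s.sum()

def smooth_depthwise(feat, k):
    """Apply the 1D kernel along height and width of every channel of (H, W, C)."""
    r = len(k) // 2
    padded = np.pad(feat, ((r, r), (r, r), (0, 0)), mode="edge")
    rows = np.apply_along_axis(np.convolve, 1, padded, k, mode="valid")
    return np.apply_along_axis(np.convolve, 0, rows, k, mode="valid")

def retinex_split(feat, x, eps=1e-6):
    """Return (log input, illumination I, reflectance R) for positive features."""
    log_ll = np.log(feat + eps)
    illum = smooth_depthwise(log_ll, asf_kernel_1d(x))   # low-frequency part
    refl = log_ll - illum                                # edges and details
    return log_ll, illum, refl

rng = np.random.default_rng(0)
feat = rng.uniform(0.1, 1.0, (6, 6, 4))
log_ll, illum, refl = retinex_split(feat, np.ones(3))
```

By construction the two parts add back up to the log input, so nothing is lost before the learned enhancement layers act on each part separately.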

III-D Low-Exposure Denoiser

Many factors can cause noise in an image, such as dark current and electronic shot noise in camera imaging [20, 38]. During enhancement, the noise in the image is amplified [43]. Therefore, we design the Low-Exposure Denoiser (LED) module to remove the noise before enhancement and use synthetic noise-free dark images to supervise its learning. The structure of LED is shown in Fig. 6. Dense connections [10] and residual connections [8] have proved effective in image restoration tasks [28][37]. Residual Dense Network [51] proposes the residual dense block (RDB) to combine both of them. The RDB can reuse low-level features through densely connected convolutional layers, and its local and global feature fusion modules stabilize training. Inspired by [51], our LED employs the RDB module as its component. As shown in Fig. 6, the first convolution layer extracts the shallow features. Then we apply two RDBs on the shallow feature map. The last layer is also a convolution, which fuses the features and produces a clean dark image. In this paper, we use a single large-kernel convolution rather than two smaller convolutions to extract shallow features, and we do not adopt batch normalization [11]. The reason for the former is that the input image has only a few channels, so a large kernel in the first and last layers can obtain a large receptive field with few parameters. The latter is because we found that batch normalization leads to unstable convergence.

Fig. 6: The architecture of LED. Here, we omit the ReLU layer.

III-E Loss function

Our SurroundNet optimizes its parameters in a fully supervised way: for every dark image, there is a light image as its training target. Zhao et al. [52] investigate various loss functions in the image restoration task and suggest the combination of L1 loss and MS-SSIM loss [45]. MIRNet [47] employs the Charbonnier loss [2] for several image restoration tasks such as denoising, super-resolution, and enhancement, and achieves state-of-the-art results on all of them. DLN studies the performance of different loss functions in the low-light enhancement task and finds that the combination of SSIM loss [44] and TV loss performs best. Perceptual loss is also widely used in low-light image enhancement [22][23]. In this paper, inspired by the above investigations, we use a combination of three loss functions (i.e., SSIM, Charbonnier and DISTS perceptual loss [3]) for the whole network:


The loss terms involve four images: the output image, the normal-light image, the clean noise-free low-light image, and the output of LED.

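We adopt SSIM loss for denoising and structure preservation. It is built on the SSIM index, which is defined as follows.

$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}$$

where $\mu_x$ and $\mu_y$ are the mean values of the images, $\sigma_x^2$ and $\sigma_y^2$ are the variances, and $\sigma_{xy}$ is the covariance. There are also two hyper-parameters $C_1$ and $C_2$ that avoid division by zero; both are set as suggested in [40].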
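As a simplified illustration, the SSIM index can be computed from global image statistics as below. Practical implementations evaluate it over local windows and average the results, so this single-window version is a sketch only. The function name is hypothetical, and the constants are the commonly used $(0.01)^2$ and $(0.03)^2$ for images in [0, 1], which is an assumption here rather than the setting taken from [40].

```python
import numpy as np

def ssim_global(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """SSIM index from whole-image statistics (single window, illustrative)."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator

rng = np.random.default_rng(0)
image = rng.uniform(0.2, 1.0, (8, 8))
same = ssim_global(image, image)            # identical images give 1
darker = ssim_global(image, 0.2 * image)    # a darkened copy scores below 1
```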

SSIM loss has been shown to cause color shifts [52]. Fortunately, the $L_1$ loss is able to restore colors and luminance, so it is better to combine the two. However, for low-light images, color degradation is inevitable and hard to restore, and outliers in the result may harm network optimization. Therefore, we use the robust Charbonnier loss to replace $L_1$, which is defined as:

$$\mathcal{L}_{Char}(x, y) = \sqrt{\lVert x - y \rVert^2 + \epsilon^2}$$

where $\epsilon$ is a constant.
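To show how the Charbonnier loss relates to the plain L1 loss, here is a small NumPy sketch of a per-pixel, averaged variant that is common in practice. The value `eps=1e-3` is an assumed placeholder, not the setting used in the paper, and the function names are hypothetical.

```python
import numpy as np

def l1_loss(pred, target):
    """Mean absolute error."""
    return np.mean(np.abs(pred - target))

def charbonnier_loss(pred, target, eps=1e-3):
    """Smooth L1-like loss: sqrt(diff^2 + eps^2), averaged over pixels."""
    diff = pred - target
    return np.mean(np.sqrt(diff * diff + eps * eps))

pred = np.array([0.0, 0.5, 1.0])
target = np.zeros(3)
gap = charbonnier_loss(pred, target) - l1_loss(pred, target)
```

For errors much larger than `eps` the two losses nearly coincide, while near zero the Charbonnier loss stays smooth instead of having the kink of the absolute value.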

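We also use the DISTS perceptual loss to further improve the visual quality. It first transforms the image to a “perceptual” representation by using a pre-trained VGG19 [35] (defined as $f$). Then the “perceptual” representations $\tilde{x} = f(x)$ and $\tilde{y} = f(y)$ are handled by a texture measurement ($l$) and a structure measurement ($s$) as follows.

$$l(\tilde{x}, \tilde{y}) = \frac{2\mu_{\tilde{x}} \mu_{\tilde{y}} + c_1}{\mu_{\tilde{x}}^2 + \mu_{\tilde{y}}^2 + c_1}, \qquad s(\tilde{x}, \tilde{y}) = \frac{2\sigma_{\tilde{x}\tilde{y}} + c_2}{\sigma_{\tilde{x}}^2 + \sigma_{\tilde{y}}^2 + c_2}$$

where $c_1$ and $c_2$ are constant values to avoid numerical instability, $\mu_{\tilde{x}}$ and $\mu_{\tilde{y}}$ are the mean values of the perceptual representations, $\sigma_{\tilde{x}}^2$ and $\sigma_{\tilde{y}}^2$ are the variances and $\sigma_{\tilde{x}\tilde{y}}$ is the covariance.

The integrated DISTS measurement is a weighted sum of texture and structure, so the DISTS loss can be defined as:

$$\mathcal{L}_{DISTS}(x, y) = 1 - \sum_{i} \sum_{j} \left( \alpha_{ij}\, l\big(\tilde{x}_j^{(i)}, \tilde{y}_j^{(i)}\big) + \beta_{ij}\, s\big(\tilde{x}_j^{(i)}, \tilde{y}_j^{(i)}\big) \right)$$

where $i$ indexes the VGG stages and $j$ the feature channels, and $\alpha_{ij}$ and $\beta_{ij}$ are the positive learnable parameters, which are set as suggested in [15].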

IV Experiments

IV-A Real world dataset

The Low-Light (LOL) dataset [46] is a publicly available dataset of paired dark-light images captured in real scenes. The low-light images are collected by changing exposure time and ISO. It contains 500 image pairs in total; we use 485 of them for training and the rest for evaluation, as suggested by [46].

Fig. 7: Some synthetic low-light images and their normal-light counterparts in the PASCAL VOC 2007 dataset.
Fig. 8: Enhancement results of different algorithms on the LOL dataset

IV-A1 Synthetic dataset

For the low-light image enhancement task, it is hard to acquire dark-light pairs of a real scene at the same time. Even though the LOL dataset [46] provides real-world images, we still need a huge amount of images to train deep networks. Lv et al. [22] gave an effective method for synthesizing low-light images from normally exposed images. The idea is that a low-light image can be simulated through the combination of a linear and a gamma transformation, formulated as follows.

$$I_{out} = \beta \times (\alpha \times I_{in})^{\gamma}$$

where $\alpha$, $\beta$, and $\gamma$ are randomly sampled from uniform distributions. When $\gamma$ is greater than 1, the dynamic range of the dark regions in the image is compressed, while $\alpha$ and $\beta$ control the maximum value of the bright part. This method darkens the normal-light image to achieve a visual perception close to that of a real low-light image. We faithfully follow the parameter settings in [22]. The normal-light image dataset should have both good visual quality and diverse scenes. We choose the PASCAL VOC 2007 dataset, which has been widely used in various vision tasks like object detection [30] and semantic segmentation [5]. We use all the split sets (training, validation, testing) of the VOC2007 dataset, which contain 9,963 images in total. In addition, we present some examples of the synthetic images in Fig. 7.
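The synthesis step, a linear scaling followed by a gamma transformation, can be sketched as below for an image with values in [0, 1]. The sampling ranges are illustrative assumptions chosen so that the result is always darker than the input; they are not claimed to be the settings of [22], and the function name is hypothetical.

```python
import numpy as np

def synthesize_low_light(img, rng,
                         alpha_range=(0.9, 1.0),    # assumed range, for illustration
                         beta_range=(0.5, 1.0),     # assumed range, for illustration
                         gamma_range=(1.5, 5.0)):   # assumed range, for illustration
    """Darken a normal-light image in [0, 1] as beta * (alpha * img) ** gamma."""
    alpha = rng.uniform(*alpha_range)
    beta = rng.uniform(*beta_range)
    gamma = rng.uniform(*gamma_range)
    dark = beta * np.power(alpha * img, gamma)
    return dark, (alpha, beta, gamma)

rng = np.random.default_rng(0)
normal = rng.uniform(0.0, 1.0, (4, 4, 3))
dark, (alpha, beta, gamma) = synthesize_low_light(normal, rng)
```

With a gamma above 1 the dark tones are pushed toward zero much more strongly than the highlights, and the brightest possible output value is bounded by beta.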

IV-A2 Noise-free low-exposure data

As mentioned in Section III-D, we need noise-free low-light images to supervise the training of the LED module. However, noise in low-light images is signal-dependent [43] and difficult to simulate. In this work, we synthesize the noise-free low-light images on the real dataset (LOL). LOL contains pairs of clean normal-light images and noisy low-light images. Inspired by the above synthesis method, we can generate the noise-free low-light image from the normal-light image in LOL. It is worth mentioning that the three parameters of the transformation, i.e., $\alpha$, $\beta$ and $\gamma$, should be set adaptively to reduce the difference between the synthetic dark image and the real dark image as far as possible. In this work, we treat it as a least-squares optimization problem as shown below.

$$\min_{\alpha, \beta, \gamma} \sum_{i=1}^{N} \left( \beta \left( \alpha\, I_{NL}^{(i)} \right)^{\gamma} - I_{LL}^{(i)} \right)^2$$

where $I_{NL}$ and $I_{LL}$ are the normal-light image and the real low-light image, $i$ is the pixel index and $N$ is the number of pixels. For each dark-light image pair in the LOL dataset, we use the Levenberg-Marquardt algorithm [27] to optimize the three parameters, and we then use the optimized parameters to synthesize the required noise-free low-light images. As for the synthetic dataset, our synthesis process (Section IV-A1) does not involve any synthetic noise; therefore, the supervision target of LED is the dark image itself.

IV-B Training settings

Except for the ASF module (Section III-B), all weights of our SurroundNet are initialized randomly. The Adam optimizer is adopted to optimize the whole network. The learning rate is set to 0.001, and the momentum is 0.9. In every mini-batch, we randomly crop 32 dark-light patch pairs, each consisting of two paired 128×128 images. We train for 100 epochs on the synthetic dataset and fine-tune for 3500 epochs on the LOL dataset. For ablation experiments, all models are trained for 100 epochs unless otherwise stated. All models are trained on a platform with an Nvidia RTX 2080Ti GPU.
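The paired cropping described above can be sketched as follows. For simplicity the sketch draws all 32 patches from a single dark-light image pair, which is an assumption; a real data loader would typically sample across many pairs. The function name is hypothetical.

```python
import numpy as np

def random_paired_crops(dark, light, rng, patch=128, batch=32):
    """Crop the same random windows from a dark image and its light counterpart."""
    h, w = dark.shape[:2]
    tops = rng.integers(0, h - patch + 1, size=batch)
    lefts = rng.integers(0, w - patch + 1, size=batch)
    dark_batch = np.stack([dark[t:t + patch, l:l + patch] for t, l in zip(tops, lefts)])
    light_batch = np.stack([light[t:t + patch, l:l + patch] for t, l in zip(tops, lefts)])
    return dark_batch, light_batch

rng = np.random.default_rng(0)
dark_img = rng.uniform(0.0, 0.2, (400, 600, 3))
light_img = dark_img * 4.0                      # stand-in for the paired target
dark_batch, light_batch = random_paired_crops(dark_img, light_img, rng)
```

Using identical crop coordinates for both images is what keeps every input patch pixel-aligned with its supervision target.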

Our code is open source and can be downloaded from the following GitHub repository:

IV-C Comparison with state-of-the-art methods

We compare our SurroundNet with five state-of-the-art low-light enhancement methods, i.e., LIME (TIP, 2017) [6], RetinexNet (BMVC, 2018) [46], EnlightenGAN (TIP, 2021) [12], KinD++ (IJCV, 2021) [49] and DLN (TIP, 2020) [40]. LIME is a representative conventional method based on illumination map estimation, and the other four methods employ deep CNNs for image enhancement. Specifically, LIME uses the maximum value of the RGB channels to roughly estimate the initial illumination and applies structure-aware smoothing to get the final result [6]. RetinexNet introduces a reflectance consistency loss to learn illumination in a data-driven way [46]. EnlightenGAN employs adversarial learning to enhance low-light images without paired training data [12]. KinD++ decomposes the image into illumination and reflectance parts and adopts a multi-scale illumination attention module to restrain visual defects [49]. DLN brings the back-projection concept from super-resolution into low-light enhancement [40].

We adopt four quality metrics (PSNR, SSIM, NIQE [25], LPIPS [48]) to evaluate those methods. NIQE is a no-reference metric based on natural scene statistics. LPIPS measures the visual quality by computing a weighted distance in feature space.
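Of the four metrics, PSNR is the simplest to state in code. The following sketch assumes images scaled to [0, `max_val`]; the function name and the handling of the zero-error case are choices made for this example.

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images in [0, max_val]."""
    mse = np.mean((np.asarray(pred, dtype=np.float64) - target) ** 2)
    if mse == 0:
        return float("inf")              # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

target = np.zeros((4, 4))
pred = np.full((4, 4), 0.1)              # uniform error of 0.1 gives MSE 0.01
value = psnr(pred, target)               # approximately 20 dB
```

Since PSNR is logarithmic in the mean squared error, a gain of about 1 dB corresponds to roughly a 20 percent reduction in MSE.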

Table I shows the comparison results with the five methods. Our SurroundNet is superior to the others while using far fewer parameters. For example, we exceed the second best method, i.e., DLN, by 0.87 dB on PSNR and 0.005 on SSIM while using only about 20 percent of DLN's parameters. This confirms that our SurroundNet is very effective in low-light image enhancement. The best performance on PSNR and SSIM shows that our method handles both brightness enhancement and noise suppression. Meanwhile, the LPIPS index confirms that our method achieves the best visual perception. KinD++ is the improved version of KinD [50] and achieves competitive results. It is worth mentioning that the number of parameters of KinD++ and EnlightenGAN is more than ten times that of our method. This is because both of them employ a U-Net-like [32] structure with several down-sampling modules, which forces them to spend a large number of parameters on restoring the lost spatial information. In contrast, DLN and our SurroundNet operate on high-resolution images and feature maps, so they do not suffer from losing spatial information.

Method        PSNR   SSIM   NIQE   LPIPS  Params
LIME          14.02  0.513  8.089  0.391  /
RetinexNet    16.77  1.699  8.878  0.467  440k
EnlightenGAN  17.48  0.651  4.686  0.390  8643k
KinD++        21.80  0.834  5.118  0.289  8274k
DLN           21.94  0.848  4.882  0.259  700k
SurroundNet   22.81  0.853  4.384  0.190  137k
TABLE I: Comparison with state-of-the-art methods (red: best; blue: second best; green: third best)

Fig. 8 gives some visualized results. The comparison includes three conventional methods (SSR [14], CLAHE [29] and LIME) and the four CNN-based methods mentioned above. SSR [14] explores different surround functions and recommends the Gaussian surround to estimate illumination. CLAHE is a histogram-based method which suppresses over-enhancement by limiting the overstretching of the histogram [29]. From Fig. 8, we can see that CNN-based methods perform better than the conventional ones. Among all five CNN-based methods, our SurroundNet and DLN achieve the most natural illumination adjustment. Moreover, our method exceeds DLN in image structure preservation. Sub-figure (b) also displays the zoomed-in results of the three best CNN-based methods. SurroundNet clearly restores the texture of the red string ball, while the other two methods produce blurry results.

Fig. 9: Some comparison examples on the LIME dataset.
Fig. 10: Some comparison examples on the DICM dataset.

To verify the generalization ability of our SurroundNet, we further evaluate it on two widely adopted datasets: LIME (10 images) [6] and DICM (64 images) [17]. Both datasets contain unpaired low/normal-light data. Some samples are shown in Fig. 9 and Fig. 10. We observe that SurroundNet offers denoising ability similar to DLN but more proper light adjustment, especially in bright regions like the sky. We also find that our model performs best in indoor scenes, like the second image in Fig. 9 and the last image in Fig. 10. The reason is that most of the images in the LOL dataset are taken indoors. Beyond that, KinD++ shows striking illumination correction in outdoor scenes. However, in some cases like the last image in Fig. 9, KinD++ gives unnatural enhancement results, and its denoising capacity is weaker than that of DLN and our SurroundNet. Overall, the enhanced results support the validity of our model.

V Ablation experiments

In order to demonstrate the effectiveness of each component of our model, we perform a series of comparative ablation studies.

V-A Transfer learning

As mentioned above, our model is pre-trained on the synthetic dataset. Several works [33][7] have demonstrated that pre-training a deep network on a synthetic dataset and fine-tuning it on a real dataset can achieve promising results. Such a technique is called transfer learning. We test our model with both direct and transfer training. The results are shown in Table II. The transfer training version gains 0.32 dB on PSNR and 0.005 on SSIM.

Strategy  PSNR   SSIM   NIQE   LPIPS  Params
Direct    21.77  0.835  4.329  0.224  137k
Transfer  22.09  0.840  4.311  0.221  137k
TABLE II: Comparison of the transfer and direct training strategies

V-B Effect of ASF

The ASF convolution kernel is the core component of our SurroundNet and helps to enhance low-light images efficiently. Here, we design experiments to verify the performance of our new convolution kernel on image enhancement. We first replace the ASF module with a traditional convolution layer. This comparison may seem unfair because the ASF module uses depth-wise convolution, which means it has fewer parameters than a traditional convolution. Therefore, we also conduct an experiment with a depth-wise convolution, which has a similar number of parameters to the ASF convolution. Table III shows the experimental results. ASF exceeds the traditional convolution by 0.11 dB on PSNR and 0.007 on SSIM while reducing the parameters by about 21 percent. Compared with the depth-wise convolution, ASF gains 0.23 dB on PSNR and 0.014 on SSIM. The results show that our ASF kernel is both high-performing and lightweight, which makes it a free lunch in low-light enhancement.

Kernel           PSNR    SSIM    NIQE    LPIPS   Params
Convolution      21.66   0.828   4.491   0.240   172k
DW convolution   21.54   0.821   4.571   0.233   138k
ASF              21.77   0.835   4.329   0.224   137k
TABLE III: Comparison of traditional, depth-wise and our ASF convolution kernels
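
The parameter argument behind this comparison can be checked with a per-layer count. A standard k x k convolution stores k*k*C_in*C_out weights, while a depth-wise convolution stores only k*k weights per channel. The sketch below compares the two; the function names, the channel width and the kernel size are assumptions chosen for illustration and are not the SurroundNet configuration.

```python
def conv_params(c_in: int, c_out: int, k: int, bias: bool = True) -> int:
    """Weights (plus optional biases) of a standard k x k convolution."""
    return k * k * c_in * c_out + (c_out if bias else 0)


def depthwise_params(channels: int, k: int, bias: bool = True) -> int:
    """Weights (plus optional biases) of a depth-wise k x k convolution,
    which learns one k x k filter per channel."""
    return k * k * channels + (channels if bias else 0)


channels, k = 32, 3  # assumed values, for illustration only
standard = conv_params(channels, channels, k)
depthwise = depthwise_params(channels, k)
print(f"standard: {standard}, depth-wise: {depthwise}")
```

The gap grows linearly with the channel width, which is why the text also compares ASF against plain depth-wise convolution at a matched parameter budget.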

As introduced in Section III-B, the ASF module can adaptively change its weights during training to obtain an accurate illumination estimate. To verify this, we visualize the weights of the learned ASF kernels. As mentioned in Section III-B, we build the two-dimensional ASF by stacking two one-dimensional kernels, so we only illustrate the one-dimensional results, shown in Fig. 11. The learned ASF weights are more diverse than the initial ones and produce a more flexible illumination estimate.

Fig. 11: Visualization of the ASF kernel. (a) is the initial kernel weight; (b), (c) and (d) are three random samples from different ARBlocks.
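
Stacking two one-dimensional kernels works because filtering rows with one kernel and then columns with another is equivalent to a single two-dimensional filter whose weights are the outer product of the two. The sketch below checks this numerically in plain Python. The kernel values and the test image are arbitrary placeholders, not learned ASF weights, and the ASF parameterization from Section III-B is not reproduced here.

```python
def filter_rows(img, k):
    """Correlate every row of img with the 1D kernel k ('valid' region only)."""
    n = len(k)
    return [[sum(row[j + t] * k[t] for t in range(n))
             for j in range(len(row) - n + 1)] for row in img]


def transpose(img):
    return [list(col) for col in zip(*img)]


def separable_filter(img, k_row, k_col):
    """Horizontal pass with k_row followed by a vertical pass with k_col."""
    horizontal = filter_rows(img, k_row)
    vertical = filter_rows(transpose(horizontal), k_col)
    return transpose(vertical)


def filter2d(img, kernel):
    """Direct 2D correlation ('valid' region only)."""
    kh, kw = len(kernel), len(kernel[0])
    return [[sum(img[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(len(img[0]) - kw + 1)]
            for i in range(len(img) - kh + 1)]


k1d = [0.25, 0.5, 0.25]                     # placeholder 1D kernel
k2d = [[a * b for b in k1d] for a in k1d]   # outer product: equivalent 2D kernel
img = [[float((3 * i + 5 * j) % 7) for j in range(6)] for i in range(5)]

sep = separable_filter(img, k1d, k1d)
full = filter2d(img, k2d)
max_diff = max(abs(x - y) for r1, r2 in zip(sep, full) for x, y in zip(r1, r2))
print(f"largest difference between the two results: {max_diff}")
```

For a kernel of length k, the separable form needs 2k weights instead of k*k, which is one reason one-dimensional kernels are attractive in a lightweight network.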

V-C Effect of ARBlocks

In Section III-C, we proposed the Adaptive Retinex Block (ARBlock), which extends Single Scale Retinex to feature space. To verify the effectiveness of the ARBlocks, we first compare SurroundNet with a “Plain” net in which each ARBlock is replaced with a stack of three convolution blocks (i.e., conv.+ReLU). The “Plain” net structure, which is inspired by VGG [35], is popular in low-light enhancement tasks [46, 49]. As presented in Table IV, SurroundNet outperforms PlainNet by 0.8 dB in PSNR and 0.012 in SSIM. The result indicates that the ARBlock improves performance in the low-light enhancement task. To further analyze the ARBlock, we visualize the feature maps of the first ARBlock in the trained SurroundNet; the results can be found in Fig. 5. Because of the log function, the dynamic range of dark pixels is stretched while the bright part barely changes or is compressed. This means the input is already brightened to some degree, as shown in (Log-LL). Such an explicit lighting operation simplifies the learning process. The ASF module in the ARBlock can be regarded as an adaptive low-pass filter bank: the (I) part is the low-frequency component and the (R) part is the high-frequency component. We believe that such a frequency decomposition benefits noise removal and image enhancement.

Network       PSNR    SSIM    NIQE    LPIPS   Params
PlainNet      20.97   0.823   4.789   0.244   159k
SurroundNet   21.77   0.835   4.329   0.224   137k
TABLE IV: Effect of ARBlocks
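
The two effects described in this subsection, the log transform stretching dark values and the low-pass/high-pass split, can be illustrated with classic single-scale Retinex on a one-dimensional signal. This is a minimal sketch under stated assumptions: a fixed moving average stands in for the learned ASF, the intensities are made up, and the real ARBlock operates on feature maps rather than on raw pixels.

```python
import math


def log_gap(a: float, b: float) -> float:
    """Gap between two intensities after the log transform."""
    return abs(math.log(b) - math.log(a))


def box_lowpass(signal, radius=1):
    """Moving average with edge replication; a stand-in for a surround function."""
    n = len(signal)
    out = []
    for i in range(n):
        window = [signal[min(max(i + d, 0), n - 1)]
                  for d in range(-radius, radius + 1)]
        out.append(sum(window) / len(window))
    return out


# The same linear gap of 0.01 is stretched in the dark and compressed in the bright range.
dark_gap = log_gap(0.01, 0.02)
bright_gap = log_gap(0.90, 0.91)
print(f"dark gap: {dark_gap:.3f}, bright gap: {bright_gap:.3f}")

# Made-up intensities in (0, 1]: a dark region followed by a bright one.
signal = [0.02, 0.02, 0.03, 0.02, 0.50, 0.52, 0.50, 0.51]
log_signal = [math.log(v) for v in signal]
illumination = box_lowpass(log_signal)                            # low-frequency part, "(I)"
reflectance = [s - i for s, i in zip(log_signal, illumination)]   # high-frequency part, "(R)"
```

By construction the two parts add back up to the log signal, so the decomposition loses no information; it only separates smooth illumination from fine detail and noise.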

V-D The number of ARBlocks

This work focuses on effective low-light image enhancement. As mentioned in Section III-C, most of the parameters of our model belong to the ARBlocks. Commonly, more blocks lead to better performance but are not always cost-effective. To explore the best trade-off between model size and enhancement quality, we train our model with different numbers of ARBlocks. The results are given in Table V. Both PSNR and SSIM generally climb as the number of ARBlocks grows, but the magnitude of improvement keeps shrinking. Four ARBlocks exceed three ARBlocks by only 0.03 dB in PSNR and 0.004 in SSIM. Moreover, the NIQE index is even worse than with three ARBlocks. Stacking too many ARBlocks therefore does not seem worthwhile. In our paper, we use four ARBlocks for all experiments apart from this ablation comparison.

Blocks    PSNR    SSIM    NIQE    LPIPS   Params
Block-1   21.55   0.827   4.400   0.242   61k
Block-2   21.53   0.829   4.358   0.233   86k
Block-3   21.74   0.831   4.301   0.229   111k
Block-4   21.77   0.835   4.329   0.224   137k
TABLE V: Comparison of different numbers of ARBlocks
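
The diminishing returns can be read directly from Table V by differencing consecutive rows. The sketch below does this for PSNR, SSIM and the parameter count; the function name and the dictionary keys are hypothetical, and the numbers are copied from the table.

```python
# (number of ARBlocks, PSNR, SSIM, parameters in thousands), copied from Table V.
rows = [
    (1, 21.55, 0.827, 61),
    (2, 21.53, 0.829, 86),
    (3, 21.74, 0.831, 111),
    (4, 21.77, 0.835, 137),
]


def marginal_gains(table):
    """Change in PSNR, SSIM and parameter count for each added block."""
    gains = []
    for prev, cur in zip(table, table[1:]):
        gains.append({
            "blocks": cur[0],
            "d_psnr": round(cur[1] - prev[1], 3),
            "d_ssim": round(cur[2] - prev[2], 3),
            "d_params_k": cur[3] - prev[3],
        })
    return gains


for g in marginal_gains(rows):
    print(g)
```

Each added block costs about 25k parameters, so the table differences give a rough gain per unit of model size for each step.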

V-E Effect of Low-Exposure Denoiser

To assess the LED module, we design two experiments: LES (Low-Exposure Supervision) and No-LES. LES means that the training loss contains the low-exposure supervision, and No-LES means that it does not. Both share the same network structure as the original SurroundNet. We also test the performance with a long training schedule (3500 epochs). The results can be found in Table VI. With 100 epochs, the difference between LES and No-LES is not obvious. The reason might be that a network tends to learn illumination and color rather than noise in the early epochs. However, with 3500 epochs, LES surpasses No-LES by 0.008 in SSIM, 0.003 in NIQE and 0.013 in LPIPS. The results indicate that the LED module can smooth the image and help retain the image structure. We also visualize the output of the LED module. Fig. 12 exhibits some denoised samples. Fig. 12(b) shows that the LED module leaves the color and illumination unchanged, and the zoomed-in result in Fig. 12(d) shows that LED removes the noise while keeping the edges sharp.

Setting         PSNR    SSIM    NIQE    LPIPS   Params
No-LES (100)    21.83   0.834   4.257   0.227   137k
LES (100)       21.77   0.835   4.329   0.224   137k
No-LES (3500)   23.12   0.842   4.394   0.209   137k
LES (3500)      22.71   0.850   4.391   0.196   137k
TABLE VI: Comparison on the Low-Exposure Denoiser (training epochs in parentheses)
Fig. 12: Visualization of some output samples of the Low-Exposure Denoiser. Columns (a) and (c) are the dark images; columns (b) and (d) are the LED module results, brightened with a luminance gain of 15 (zoom in for a better view).

V-F Effect of ECA

To verify the function of the ECA module, we train a second SurroundNet without any channel feature aggregation module. Table VII shows that the network performs better with ECA: PSNR improves by 2.58 dB, SSIM by 0.022 and LPIPS by 0.022, although NIQE rises by 0.112 (from 4.217 to 4.329). ECA also increases model complexity only slightly.

Setting    PSNR    SSIM    NIQE    LPIPS   Params
w/o ECA    19.19   0.813   4.217   0.246   134k
w/ ECA     21.77   0.835   4.329   0.224   137k
TABLE VII: Effect of ECA
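
For readers unfamiliar with ECA, the mechanism described in ECA-Net [41] can be sketched in three steps: global average pooling per channel, a small one-dimensional convolution across neighbouring channels, and a sigmoid gate that rescales each channel. The plain-Python sketch below follows that outline. It is not the SurroundNet implementation; the function name, the kernel weights and the toy feature maps are placeholders chosen for illustration.

```python
import math


def eca(feature_maps, kernel):
    """Channel attention over a list of C feature maps (each an H x W list of lists).

    kernel: odd-length list of 1D weights shared by all channels
    (learned in practice, placeholders here).
    """
    c = len(feature_maps)
    # 1) Global average pooling: one descriptor per channel.
    pooled = [sum(sum(row) for row in fm) / (len(fm) * len(fm[0]))
              for fm in feature_maps]
    # 2) 1D convolution across neighbouring channels, zero-padded to keep C outputs.
    half = len(kernel) // 2
    mixed = []
    for i in range(c):
        acc = 0.0
        for t, w in enumerate(kernel):
            j = i + t - half
            if 0 <= j < c:
                acc += w * pooled[j]
        mixed.append(acc)
    # 3) Sigmoid gate, then rescale every value of each channel by its gate.
    gates = [1.0 / (1.0 + math.exp(-m)) for m in mixed]
    scaled = [[[v * g for v in row] for row in fm]
              for fm, g in zip(feature_maps, gates)]
    return scaled, gates


features = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 0.0], [0.0, 0.0]],
    [[2.0, 2.0], [2.0, 2.0]],
]
scaled, gates = eca(features, kernel=[0.1, 0.5, 0.1])  # placeholder weights
print([round(g, 3) for g in gates])
```

In this formulation the attention step itself only learns the few weights of the one-dimensional kernel, independent of the spatial size of the feature maps.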

VI Conclusion and discussion

In this work, we present a lightweight SurroundNet for low-light image enhancement. It uses fewer than 150k trainable parameters but achieves results that are competitive with, or even better than, SOTA methods. Unlike previous Retinex-CNN based methods, which employ stacks of convolution layers to estimate illumination, we propose the ASF (Adaptive Surround Function) module to reach the same goal. ASF extends the concept of the surround function in Retinex theory and learns the shape of the surround function in a data-driven way. Furthermore, we propose the Adaptive Retinex Block (ARBlock), which applies ASF in feature space rather than in RGB space. We also propose the LED method to smooth the image before enhancement. We evaluate the proposed method on a real-world low-light dataset. Experimental results demonstrate the superiority of SurroundNet in both performance and parameter count over state-of-the-art low-light image enhancement methods.

In future work, we will focus on global illumination information. We believe it can bring more generalization ability than local information, particularly in outdoor environments, which often have changeable lighting conditions. The interaction between noise and illumination is another interesting research topic, and it may lead to a better trade-off between performance and parameters.


  • [1] Mohammad Abdullah-Al-Wadud, Md Hasanul Kabir, M Ali Akber Dewan, and Oksam Chae. A dynamic histogram equalization for image contrast enhancement. IEEE Transactions on Consumer Electronics, 53(2):593–600, 2007.
  • [2] Pierre Charbonnier, Laure Blanc-Feraud, Gilles Aubert, and Michel Barlaud. Two deterministic half-quadratic regularization algorithms for computed imaging. In Proceedings of 1st International Conference on Image Processing, volume 2, pages 168–172. IEEE, 1994.
  • [3] K. Ding, K. Ma, S. Wang, and E. P. Simoncelli. Image quality assessment: Unifying structure and texture similarity. IEEE Transactions on Pattern Analysis and Machine Intelligence, 0(01):1–1, dec 5555.
  • [4] Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. Learning a deep convolutional network for image super-resolution. In European conference on computer vision, pages 184–199. Springer, 2014.
  • [5] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 580–587, 2014.
  • [6] Xiaojie Guo, Yu Li, and Haibin Ling. Lime: Low-light image enhancement via illumination map estimation. IEEE Transactions on Image Processing, 26(2):982–993, 2017.
  • [7] Martin Hahner, Dengxin Dai, Christos Sakaridis, Jan-Nico Zaech, and Luc Van Gool. Semantic understanding of foggy scenes with purely synthetic data. In 2019 IEEE Intelligent Transportation Systems Conference (ITSC), pages 3675–3681. IEEE, 2019.
  • [8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [9] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7132–7141, 2018.
  • [10] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4700–4708, 2017.
  • [11] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning, pages 448–456. PMLR, 2015.
  • [12] Yifan Jiang, Xinyu Gong, Ding Liu, Yu Cheng, Chen Fang, Xiaohui Shen, Jianchao Yang, Pan Zhou, and Zhangyang Wang. Enlightengan: Deep light enhancement without paired supervision. IEEE Transactions on Image Processing, 30:2340–2349, 2021.
  • [13] Daniel J Jobson, Zia-ur Rahman, and Glenn A Woodell. A multiscale retinex for bridging the gap between color images and the human observation of scenes. IEEE Transactions on Image processing, 6(7):965–976, 1997.
  • [14] Daniel J Jobson, Zia-ur Rahman, and Glenn A Woodell. Properties and performance of a center/surround retinex. IEEE transactions on image processing, 6(3):451–462, 1997.
  • [15] Sergey Kastryulin, Dzhamil Zakirov, and Denis Prokopenko. PyTorch Image Quality: Metrics and measure for image quality assessment, 2019. Open-source software available at
  • [16] Edwin H Land. An alternative technique for the computation of the designator in the retinex theory of color vision. Proceedings of the national academy of sciences, 83(10):3078–3080, 1986.
  • [17] Chulwoo Lee, Chul Lee, and Chang-Su Kim. Contrast enhancement based on layered difference representation of 2d histograms. IEEE Transactions on Image Processing, 22(12):5372–5384, 2013.
  • [18] Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. In Proceedings of International Conference on Learning Representations 2014 (ICLR 2014), 2014.
  • [19] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2117–2125, 2017.
  • [20] Ce Liu, Richard Szeliski, Sing Bing Kang, C Lawrence Zitnick, and William T Freeman. Automatic estimation and removal of noise from a single image. IEEE transactions on pattern analysis and machine intelligence, 30(2):299–314, 2007.
  • [21] Kin Gwn Lore, Adedotun Akintayo, and Soumik Sarkar. Llnet: A deep autoencoder approach to natural low-light image enhancement. Pattern Recognition, 61:650–662, 2017.
  • [22] Feifan Lv, Yu Li, and Feng Lu. Attention guided low-light image enhancement with a large scale low-light simulation dataset. International Journal of Computer Vision, 129(7):2175–2193, 2021.
  • [23] Feifan Lv, Feng Lu, Jianhua Wu, and Chongsoon Lim. Mbllen: Low-light image/video enhancement using cnns. In BMVC, page 220, 2018.
  • [24] Xiaojiao Mao, Chunhua Shen, and Yu-Bin Yang. Image restoration using very deep convolutional encoder-decoder networks with symmetric skip connections. Advances in neural information processing systems, 29:2802–2810, 2016.
  • [25] Anish Mittal, Anush Krishna Moorthy, and Alan Conrad Bovik. No-reference image quality assessment in the spatial domain. IEEE Transactions on image processing, 21(12):4695–4708, 2012.
  • [26] Andrew Moore, John Allman, and Rodney M Goodman. A real-time neural system for color constancy. IEEE Transactions on Neural networks, 2(2):237–247, 1991.
  • [27] Jorge J Moré. The levenberg-marquardt algorithm: implementation and theory. In Numerical analysis, pages 105–116. Springer, 1978.
  • [28] Bumjun Park, Songhyun Yu, and Jechang Jeong. Densely connected hierarchical network for image denoising. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 0–0, 2019.
  • [29] Etta D Pisano, Shuquan Zong, Bradley M Hemminger, Marla DeLuca, R Eugene Johnston, Keith Muller, M Patricia Braeuning, and Stephen M Pizer. Contrast limited adaptive histogram equalization image processing to improve the detection of simulated spiculations in dense mammograms. Journal of Digital imaging, 11(4):193, 1998.
  • [30] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: towards real-time object detection with region proposal networks. IEEE transactions on pattern analysis and machine intelligence, 39(6):1137–1149, 2016.
  • [31] Ali M Reza. Realization of the contrast limited adaptive histogram equalization (clahe) for real-time image enhancement. Journal of VLSI signal processing systems for signal, image and video technology, 38(1):35–44, 2004.
  • [32] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015.
  • [33] Christos Sakaridis, Dengxin Dai, and Luc Van Gool. Semantic foggy scene understanding with synthetic data. International Journal of Computer Vision, 126(9):973–992, 2018.
  • [34] Liang Shen, Zihan Yue, Fan Feng, Quan Chen, Shihao Liu, and Jie Ma. Msr-net: Low-light image enhancement using deep convolutional network. arXiv preprint arXiv:1711.02488, 2017.
  • [35] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In Yoshua Bengio and Yann LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.
  • [36] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31, 2017.
  • [37] Tong Tong, Gen Li, Xiejie Liu, and Qinquan Gao. Image super-resolution using dense skip connections. In Proceedings of the IEEE international conference on computer vision, pages 4799–4807, 2017.
  • [38] Yanghai Tsin, Visvanathan Ramesh, and Takeo Kanade. Statistical calibration of ccd imaging process. In Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001, volume 1, pages 480–487. IEEE, 2001.
  • [39] Chao Wang and Zhongfu Ye. Brightness preserving histogram equalization with maximum entropy: a variational perspective. IEEE Transactions on Consumer Electronics, 51(4):1326–1334, 2005.
  • [40] Li-Wen Wang, Zhi-Song Liu, Wan-Chi Siu, and Daniel PK Lun. Lightening network for low-light image enhancement. IEEE Transactions on Image Processing, 29:7984–7996, 2020.
  • [41] Qilong Wang, Banggu Wu, Pengfei Zhu, Peihua Li, Wangmeng Zuo, and Qinghua Hu. Eca-net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
  • [42] Shuhang Wang, Jin Zheng, Hai-Miao Hu, and Bo Li. Naturalness preserved enhancement algorithm for non-uniform illumination images. IEEE Transactions on Image Processing, 22(9):3538–3548, 2013.
  • [43] Yang Wang, Yang Cao, Zheng-Jun Zha, Jing Zhang, Zhiwei Xiong, Wei Zhang, and Feng Wu. Progressive retinex: Mutually reinforced illumination-noise perception network for low-light image enhancement. In Proceedings of the 27th ACM International Conference on Multimedia, pages 2015–2023, 2019.
  • [44] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004.
  • [45] Zhou Wang, Eero P Simoncelli, and Alan C Bovik. Multiscale structural similarity for image quality assessment. In The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, 2003, volume 2, pages 1398–1402. IEEE, 2003.
  • [46] Chen Wei, Wenjing Wang, Wenhan Yang, and Jiaying Liu. Deep retinex decomposition for low-light enhancement. In British Machine Vision Conference, 2018.
  • [47] Syed Waqas Zamir, Aditya Arora, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, Ming-Hsuan Yang, and Ling Shao. Learning enriched features for real image restoration and enhancement. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pages 492–511. Springer, 2020.
  • [48] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018.
  • [49] Yonghua Zhang, Xiaojie Guo, Jiayi Ma, Wei Liu, and Jiawan Zhang. Beyond brightening low-light images. International Journal of Computer Vision, 129(4):1013–1037, 2021.
  • [50] Yonghua Zhang, Jiawan Zhang, and Xiaojie Guo. Kindling the darkness: A practical low-light image enhancer. In Proceedings of the 27th ACM international conference on multimedia, pages 1632–1640, 2019.
  • [51] Yulun Zhang, Yapeng Tian, Yu Kong, Bineng Zhong, and Yun Fu. Residual dense network for image restoration. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(7):2480–2495, 2020.
  • [52] Hang Zhao, Orazio Gallo, Iuri Frosio, and Jan Kautz. Loss functions for image restoration with neural networks. IEEE Transactions on computational imaging, 3(1):47–57, 2016.
  • [53] Yuanbo Zhou, Wei Deng, Tong Tong, and Qinquan Gao. Guided frequency separation network for real-world super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 428–429, 2020.