High-quality images play a critical role in computer vision tasks such as object detection and scene understanding. Unfortunately, images captured in practice are often degraded. For example, images captured in low-light conditions suffer from very low contrast and brightness, which greatly increases the difficulty of subsequent high-level tasks. Figure 1(a) provides one case, in which many details are buried in the dark background. Because in many cases only low-light images can be captured, several low-light image enhancement methods have been proposed to overcome this problem. In general, these methods can be categorized into two groups: histogram-based methods and Retinex-based methods.
In this paper, we propose a novel low-light image enhancement model based on a convolutional neural network and Retinex theory. To the best of our knowledge, this is the first work that uses a convolutional neural network together with Retinex theory for low-light image enhancement. First, we explain that multi-scale Retinex is equivalent to a feedforward convolutional neural network with different Gaussian convolution kernels. The main drawback of multi-scale Retinex is that the kernel parameters are set by hand rather than learned from data, which limits the accuracy and flexibility of the model. Motivated by this fact, we put forward a convolutional neural network (MSR-net) that directly learns an end-to-end mapping between dark and bright images. Our method differs fundamentally from existing approaches in that we regard low-light image enhancement as a supervised learning problem. Furthermore, the surround functions in Retinex theory are formulated as convolutional layers, which are optimized by back-propagation.
Overall, the contribution of our work is threefold. First, we establish a relationship between multi-scale Retinex and a feedforward convolutional neural network. Second, we consider low-light image enhancement as a supervised learning problem in which dark and bright images are treated as input and output, respectively. Third, experiments on a number of challenging images reveal the advantages of our method in comparison with other state-of-the-art methods. Figure 1 gives an example: our method achieves a brighter and more natural result with clearer texture and richer details.
2 Related Work
2.1 Low-light Image Enhancement
In general, low-light image enhancement can be categorized into two groups: histogram-based methods and Retinex-based methods.
Directly amplifying the low-light image by histogram transformation is probably the most intuitive way to lighten a dark image. One of the simplest and most widely used techniques is histogram equalization (HE), which makes the histogram of the whole image as balanced as possible. Gamma correction also enhances contrast and brightness by expanding the dark regions and compressing the bright ones at the same time. However, the main drawback of these methods is that each pixel is treated individually, without regard to its neighborhood, which makes the result look inconsistent with real scenes. To resolve these problems, variational methods that apply different regularization terms to the histogram have been proposed. For example, contextual and variational contrast enhancement [5] tries to find a histogram mapping that yields large gray-level differences.
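To make the pixel-wise nature of these two baselines concrete, the following is a minimal NumPy sketch of gamma correction and global histogram equalization. The function names, the random test image and the choice `gamma=0.5` are assumptions made for illustration and do not come from any of the cited implementations.

```python
import numpy as np

def gamma_correct(img, gamma):
    """Apply out = img ** gamma to an image scaled to [0, 1]; gamma < 1 brightens."""
    return np.clip(img, 0.0, 1.0) ** gamma

def equalize_histogram(gray, levels=256):
    """Global histogram equalization of an 8-bit grayscale image via a lookup table."""
    hist = np.bincount(gray.ravel(), minlength=levels)
    cdf = np.cumsum(hist).astype(np.float64)
    cdf_min = cdf[np.nonzero(hist)[0][0]]        # CDF at the darkest occupied level
    denom = max(cdf[-1] - cdf_min, 1.0)          # guard against a constant image
    lut = np.round((cdf - cdf_min) / denom * (levels - 1))
    lut = np.clip(lut, 0, levels - 1).astype(np.uint8)
    return lut[gray]                             # every pixel is mapped independently

# Hypothetical dark image: all gray levels lie in the lowest quarter of the range.
rng = np.random.default_rng(0)
dark = rng.integers(0, 64, size=(32, 32), dtype=np.uint8)

equalized = equalize_histogram(dark)
brightened = gamma_correct(dark / 255.0, gamma=0.5)
```

Both functions reduce to a per-pixel lookup, so two pixels with the same gray level always receive the same output regardless of their neighborhoods, which is the limitation noted above.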
In this work, we focus on Retinex-based methods. Retinex theory was introduced by Land [19] to explain the color perception property of the human vision system. The dominant assumption of Retinex theory is that an image can be decomposed into reflectance and illumination. Single-scale Retinex (SSR) [17], based on the center/surround Retinex, is similar to the difference-of-Gaussian (DoG) function widely used in natural vision science, and it treats the reflectance as the final enhanced result. Multi-scale Retinex (MSR) [16] can be considered a weighted sum of several SSR outputs. However, the results of these methods often look unnatural. Further, the modified MSR [16] applies a color restoration function (CRF) in the chromaticity space to eliminate the color distortions and gray zones evident in the MSR output. Recently, the method proposed in [11] estimates the illumination of each pixel by finding the maximum value among the R, G and B channels, then refines the initial illumination map by imposing a structure prior on it. Park et al. [23] use a variational-optimization-based Retinex algorithm to enhance the low-light image. Fu et al. [10] propose a weighted variational model to estimate both the reflectance and the illumination. Unlike conventional variational models, their model preserves more details in the estimated reflectance. Inspired by the dark channel method for dehazing, [8] observes that an inverted low-light image looks like a hazy image. The authors remove the haze from the inverted low-light image using the method proposed in [12] and then invert the result again to obtain the final output.
2.2 Convolutional Neural Network for Low-level Vision Tasks
Recently, deep neural networks [18] have led to dramatic improvements in object recognition [13], object detection [24], object tracking [29], semantic segmentation [20] and so on. Besides these high-level vision tasks, deep learning has also shown great ability at low-level vision tasks. For instance, Dong et al. [7] train a deep convolutional neural network (SRCNN) for image super-resolution. Fu et al. [9] remove rain from single images via a deep detail network. Cai et al. [4] propose a trainable end-to-end system named DehazeNet, which takes a hazy image as input and outputs its medium transmission map, which is subsequently used to recover a haze-free image via the atmospheric scattering model.
3 CNN Network for Low-light Image Enhancement
We first show, from a novel perspective, that multi-scale Retinex as a low-light image enhancement method is equivalent to a feedforward convolutional neural network with different Gaussian convolution kernels. Subsequently, we propose a convolutional neural network (MSR-net) that directly learns an end-to-end mapping between dark and bright images.
3.1 Multi-scale Retinex is a CNN Network
The dominant assumption of Retinex theory is that the image can be decomposed into reflectance and illumination:
$$I(x, y) = r(x, y) \cdot S(x, y) \tag{1}$$
where $I$ and $r$ represent the captured image and the desired recovery, respectively, and $S$ is the illumination. Single-scale Retinex (SSR) [17], based on the center/surround Retinex, is similar to the difference-of-Gaussian (DoG) function widely used in natural vision science. Mathematically, it takes the form
$$R_i(x, y) = \log I_i(x, y) - \log\left[F(x, y) * I_i(x, y)\right] \tag{2}$$
where $R_i(x, y)$ is the associated Retinex output, $I_i(x, y)$ is the image distribution in the $i$th color spectral band, $*$ denotes the convolution operation, and $F(x, y)$ is the Gaussian surround function
$$F(x, y) = K e^{-(x^2 + y^2)/(2c^2)} \tag{3}$$
Here $c$ is the standard deviation of the Gaussian function, and $K$ is selected such that
$$\iint F(x, y)\,dx\,dy = 1 \tag{4}$$
By changing the position of the logarithm in Eq. (2), we obtain
$$R_i(x, y) = \log I_i(x, y) - \left[\log I_i(x, y)\right] * F(x, y) \tag{5}$$
Eq. (2) and Eq. (5) are of course not equivalent in mathematical form. The former is the logarithm of the ratio between the image and a weighted average of it, while the latter is the logarithm of the ratio between the image and a weighted product. In effect, this amounts to choosing between an arithmetic mean and a geometric mean. Experiments show that the two forms give very similar results. In this work we choose the latter for simplicity.
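The difference between the two single-scale forms can be checked numerically. The sketch below is an illustration under stated assumptions, not the authors' code: it implements a separable Gaussian surround with reflect padding, truncates the kernel at three standard deviations, and adds a small `eps` inside the logarithms. All names and the test image are hypothetical.

```python
import numpy as np

def gaussian_kernel(sigma):
    """1D Gaussian truncated at 3 sigma; normalising to sum 1 plays the role of K."""
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1, dtype=np.float64)
    k = np.exp(-x**2 / (2.0 * sigma**2))
    return k / k.sum()

def gaussian_blur(channel, sigma):
    """Separable Gaussian surround with reflect padding (output has the input size)."""
    k = gaussian_kernel(sigma)
    r = len(k) // 2
    padded = np.pad(channel, r, mode="reflect")
    rows = np.apply_along_axis(np.convolve, 1, padded, k, mode="valid")
    return np.apply_along_axis(np.convolve, 0, rows, k, mode="valid")

def ssr_log_of_blur(channel, sigma, eps=1e-6):
    """log I - log(F * I): ratio of the image to a weighted arithmetic mean."""
    return np.log(channel + eps) - np.log(gaussian_blur(channel, sigma) + eps)

def ssr_blur_of_log(channel, sigma, eps=1e-6):
    """log I - F * log I: ratio of the image to a weighted geometric mean."""
    log_i = np.log(channel + eps)
    return log_i - gaussian_blur(log_i, sigma)

rng = np.random.default_rng(1)
img = rng.uniform(0.05, 1.0, size=(32, 32))   # hypothetical single color band
out_arith = ssr_log_of_blur(img, sigma=2.0)
out_geom = ssr_blur_of_log(img, sigma=2.0)
```

Because a geometric mean never exceeds the corresponding arithmetic mean, `out_geom` is pointwise at least as large as `out_arith`; on a constant image both outputs vanish.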
Further, multi-scale Retinex (MSR) [16] is a weighted sum of the outputs of several different SSR scales. Mathematically,
$$R_{MSR_i}(x, y) = \sum_{n=1}^{N} w_n R_{n_i}(x, y) \tag{6}$$
where $N$ is the number of scales, $R_{n_i}$ denotes the $i$th component of the $n$th scale, $R_{MSR_i}$ represents the $i$th spectral component of the MSR output and $w_n$ is the weight associated with the $n$th scale.
After experimenting with one small scale and one large scale, the need for a third, intermediate scale is immediately apparent in order to eliminate the visible "halo" artifacts near strong edges [16]. Thus, with Gaussian surround functions $F_n$ of standard deviations $c_1 < c_2 < c_3$, the formula is as follows:
$$R_{MSR_i}(x, y) = \sum_{n=1}^{3} w_n \left\{ \log I_i(x, y) - \left[\log I_i(x, y)\right] * F_n(x, y) \right\} \tag{7}$$
More concretely, taking equal weights $w_n = 1/3$, we have
$$R_{MSR_i}(x, y) = \log I_i(x, y) - \frac{1}{3} \left[\log I_i(x, y)\right] * \left[ \sum_{n=1}^{3} K_n e^{-(x^2 + y^2)/(2c_n^2)} \right] \tag{8}$$
Note that the convolution of two Gaussian functions is still a Gaussian function, whose variance is equal to the sum of the two original variances. Therefore, we can represent Eq. (8) by a cascading structure, as Figure 2(a) shows.
The three cascading convolution layers are considered as three different Gaussian kernels. More concretely, the parameters of the first convolution layer follow a Gaussian distribution whose variance is $c_1^2$. Similarly, the variances of the second and the third convolution layers are $c_2^2 - c_1^2$ and $c_3^2 - c_2^2$, respectively. Finally, the concatenation and convolution layers represent the weighted average. In short, multi-scale Retinex is practically equivalent to a feedforward convolutional neural network with a residual structure.
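The variance-additivity argument behind the cascade can be verified with sampled kernels. The following sketch assumes three illustrative surround scales (written `c_1 < c_2 < c_3` and set to 2, 4 and 8 pixels, which are not values from the paper) and truncates each layer kernel at four standard deviations. It checks that composing the three layers reproduces the three direct Gaussian kernels up to truncation error.

```python
import numpy as np

def gaussian_kernel(sigma, radius):
    """Sampled 1D Gaussian on [-radius, radius], normalised to sum to 1."""
    x = np.arange(-radius, radius + 1, dtype=np.float64)
    k = np.exp(-x**2 / (2.0 * sigma**2))
    return k / k.sum()

scales = [2.0, 4.0, 8.0]   # illustrative c_1 < c_2 < c_3 (assumed values)

# Standard deviations of the cascaded layers: c_1, sqrt(c_2^2 - c_1^2), sqrt(c_3^2 - c_2^2)
layer_sigmas = [scales[0]] + [
    float(np.sqrt(b**2 - a**2)) for a, b in zip(scales[:-1], scales[1:])
]

cascade = np.array([1.0])  # identity kernel before any layer is applied
max_errors = []
for target_sigma, layer_sigma in zip(scales, layer_sigmas):
    layer = gaussian_kernel(layer_sigma, int(np.ceil(4 * layer_sigma)))
    cascade = np.convolve(cascade, layer)        # effective kernel after this layer
    radius = len(cascade) // 2
    direct = gaussian_kernel(target_sigma, radius)
    max_errors.append(float(np.max(np.abs(cascade - direct))))
```

Each entry of `max_errors` is the largest absolute difference between the effective kernel after a layer and the Gaussian of the corresponding target scale, so small values indicate that the output of each cascaded layer behaves like one MSR surround.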
3.2 Proposed Method
In the previous section, we showed that multi-scale Retinex is equivalent to a feedforward convolutional neural network. In this section, inspired by this observation, we design a convolutional neural network to solve the low-light image enhancement problem. Our method, outlined in Figure 2(b), differs fundamentally from existing approaches in that it treats low-light image enhancement as a supervised learning problem. The input and output data correspond to the low-light and bright images, respectively. More details about our training dataset are given in Section 4.
Our model consists of three components: multi-scale logarithmic transformation, difference-of-convolution and color restoration function. In contrast to the single-scale logarithmic transformation in MSR, our model uses a multi-scale logarithmic transformation, which achieves better performance in practice; Figure 7 gives an example. Difference-of-convolution plays a role analogous to difference-of-Gaussian in MSR, and the color restoration function plays the same role as its MSR counterpart. The main difference between our model and the original MSR is that most of the parameters in our model are learned from the training data, while the parameters in MSR, such as the variances and other constants, are set by hand.
Formally, we denote the low-light image as input $X$ and the corresponding bright image as $Y$. Let $f_1$, $f_2$, $f_3$ denote three sub-functions: multi-scale logarithmic transformation, difference-of-convolution, and color restoration function. Our model can be written as the composition of these three functions:
$$f(X) = f_3(f_2(f_1(X))) \tag{9}$$
Multi-scale Logarithmic Transformation: The multi-scale logarithmic transformation $f_1$ takes the original low-light image $X$ as input and computes an output $X_1$ of the same size. First, the dark image is enhanced by several different logarithmic transformations:
$$M_j = \log_{v_j + 1}(1 + v_j \cdot X), \quad j = 1, 2, \ldots, n \tag{10}$$
where $M_j$ denotes the output of the $j$th scale with logarithmic base $v_j + 1$, and $n$ denotes the number of logarithmic transformation functions. Next, we concatenate these 3D tensors $M_j$ (3 channels × width × height) into a larger 3D tensor $M$ ($3n$ channels × width × height) and pass it through convolutional and ReLU layers:
$$M = [M_1, M_2, \ldots, M_n] \tag{11}$$
$$X_1 = \max(0, M * W_{-1} + b_{-1}) * W_0 + b_0 \tag{12}$$
where $*$ denotes the convolution operator, $W_{-1}$ is a convolution kernel that shrinks the $3n$ channels to 3 channels, $\max(0, \cdot)$ corresponds to a ReLU and $W_0$ is a convolution kernel with three output channels for better nonlinear representation. As the above operations show, this part is mainly designed to obtain a better image via weighted sums of multiple logarithmic transformations, which accelerates the convergence of the network.
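As a rough illustration of this component, the sketch below assumes that each transformation has the form log base (v + 1) of (1 + v·X), that both convolutions use 1×1 kernels, and that the scales are 1, 10, 100 and 300. These choices, the random weights and all names are assumptions for the example only; in MSR-net the kernels are learned.

```python
import numpy as np

def multi_scale_log(x, scales):
    """Stack log_{v+1}(1 + v * x) for every scale v along the channel axis.

    x has shape (3, H, W) with values in [0, 1]; the result has shape
    (3 * len(scales), H, W) and also lies in [0, 1].
    """
    maps = [np.log1p(v * x) / np.log1p(v) for v in scales]
    return np.concatenate(maps, axis=0)

def conv1x1(x, weight, bias):
    """1x1 convolution; weight has shape (C_out, C_in), bias has shape (C_out,)."""
    return np.einsum("oc,chw->ohw", weight, x) + bias[:, None, None]

rng = np.random.default_rng(0)
scales = [1.0, 10.0, 100.0, 300.0]           # hypothetical scale values
x = rng.uniform(0.0, 0.2, size=(3, 8, 8))    # hypothetical dark RGB patch

m = multi_scale_log(x, scales)               # (12, 8, 8)

# Randomly initialised stand-ins for the learned kernels.
w_shrink = rng.normal(scale=0.1, size=(3, 3 * len(scales)))
w_out = rng.normal(scale=0.1, size=(3, 3))
hidden = np.maximum(0.0, conv1x1(m, w_shrink, np.zeros(3)))   # shrink to 3 channels + ReLU
x1 = conv1x1(hidden, w_out, np.zeros(3))                      # three output channels
```

Every logarithmic map fixes 0 and 1 and lifts the values in between, and larger scales lift dark pixels more strongly, so the channel stack offers the following convolution several differently brightened versions of the same image to combine.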
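Difference-of-convolution: The difference-of-convolution function $f_2$ takes $X_1$ as input and computes an output $X_2$ of the same size. First, the input $X_1$ passes through a stack of convolutional layers:
$$H_0 = X_1, \qquad H_m = \max(0, H_{m-1} * W_m + b_m), \quad m = 1, 2, \ldots, K \tag{13}$$
where $m$ indexes the convolutional layers, $K$ is the number of convolutional layers, and $W_m$ represents the $m$th kernel. As mentioned in Section 3.1, $H_1, H_2, \ldots, H_K$ are considered smooth images at different scales. We concatenate these 3D tensors into a larger 3D tensor $H$ and pass it through a convolutional layer:
$$H = [H_1, H_2, \ldots, H_K] \tag{14}$$
$$H_{K+1} = H * W_{K+1} + b_{K+1} \tag{15}$$
where $H_{K+1}$ is the output of a convolutional layer with three output channels and a $1 \times 1$ receptive field, which is equivalent to averaging these $K$ images. Similar to MSR, the output of $f_2$ is the difference between $X_1$ and $H_{K+1}$:
$$X_2 = f_2(X_1) = X_1 - H_{K+1} \tag{16}$$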
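The difference-of-convolution step described above can be sketched in NumPy as follows. To make the analogy with MSR visible, the example fixes the kernels by hand (3×3 box filters and an equal-weight 1×1 merge) and uses only two layers; these settings and all names are assumptions for illustration, whereas MSR-net learns the kernels from data.

```python
import numpy as np

def conv2d_same(x, weight, bias):
    """Cross-correlation with zero padding.

    x: (C_in, H, W); weight: (C_out, C_in, k, k) with odd k; bias: (C_out,).
    """
    c_out, _, k, _ = weight.shape
    r = k // 2
    padded = np.pad(x, ((0, 0), (r, r), (r, r)))
    h, w = x.shape[1:]
    out = np.zeros((c_out, h, w))
    for dy in range(k):
        for dx in range(k):
            patch = padded[:, dy:dy + h, dx:dx + w]
            out += np.einsum("oc,chw->ohw", weight[:, :, dy, dx], patch)
    return out + bias[:, None, None]

def difference_of_convolution(x1, conv_params, merge_w, merge_b):
    """Stack conv + ReLU layers, merge all intermediate maps, subtract from the input."""
    h = x1
    smoothed = []
    for w, b in conv_params:
        h = np.maximum(0.0, conv2d_same(h, w, b))
        smoothed.append(h)                           # progressively smoother images
    stacked = np.concatenate(smoothed, axis=0)       # (3 * K, H, W)
    averaged = conv2d_same(stacked, merge_w, merge_b)  # 1x1 merge to 3 channels
    return x1 - averaged

num_layers = 2                                       # illustrative depth
box = np.zeros((3, 3, 3, 3))
for c in range(3):
    box[c, c] = 1.0 / 9.0                            # per-channel 3x3 box blur
conv_params = [(box, np.zeros(3)) for _ in range(num_layers)]

merge_w = np.zeros((3, 3 * num_layers, 1, 1))
for j in range(num_layers):
    for c in range(3):
        merge_w[c, 3 * j + c, 0, 0] = 1.0 / num_layers   # equal-weight average

x1 = np.full((3, 9, 9), 0.5)                         # hypothetical flat input
x2 = difference_of_convolution(x1, conv_params, merge_w, np.zeros(3))
```

On a flat input the smoothed average equals the input away from the zero-padded border, so the interior of `x2` is zero, just as a center/surround difference removes constant illumination.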
Color Restoration Function: Considering that the MSR result often looks unnatural, the modified MSR [16] applies a color restoration function (CRF) in the chromaticity space to eliminate the color distortions and gray zones evident in the MSR output. In our model, the CRF is imitated by a convolutional layer with three output channels:
$$\hat{Y} = f_3(X_2) = X_2 * W_{K+2} + b_{K+2} \tag{17}$$
where $\hat{Y}$ is the final enhanced image. For visualization, a low-light image and the results of $f_1(X)$, $f_2(X_1)$ and $f_3(X_2)$ are shown in Figure 3.
3.3 Objective function
The goal of our model is to train a deep convolutional neural network so that the output $f(X)$ and the label $Y$ are as close as possible under the Frobenius norm:
$$L = \frac{1}{N} \sum_{i=1}^{N} \left\| f(X_i) - Y_i \right\|_F^2 + \lambda \sum_{j} \left\| W_j \right\|_F^2 \tag{18}$$
where $N$ is the number of training samples, $\lambda$ represents the regularization parameter, and the second sum runs over all convolution kernels $W_j$ of the model.
The weights $W$ and biases $b$ are the parameters of our model. The regularization parameter $\lambda$, the number of logarithmic transformation functions $n$, the scales of the logarithmic transformations $v$ and the number of convolutional layers $K$ are the hyper-parameters of the model. The parameters are optimized by back-propagation, while the hyper-parameters are chosen by grid search. The sensitivity analysis of the hyper-parameters is presented in Section 4.
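For reference, the objective can be evaluated as in the minimal sketch below, which assumes the loss is the mean squared Frobenius distance between outputs and labels plus a weight-decay term summed over all kernels. The function name and the toy inputs are hypothetical.

```python
import numpy as np

def msr_net_loss(predictions, targets, weights, lam):
    """Mean squared Frobenius error plus lam times the squared norms of all kernels."""
    n = len(predictions)
    data_term = sum(np.sum((p - t) ** 2) for p, t in zip(predictions, targets)) / n
    reg_term = lam * sum(np.sum(w ** 2) for w in weights)
    return float(data_term + reg_term)

# Toy check: one 2x2 sample with error 1 everywhere and one kernel with two entries of 3.
predictions = [np.ones((2, 2))]
targets = [np.zeros((2, 2))]
weights = [np.full((2,), 3.0)]
loss = msr_net_loss(predictions, targets, weights, lam=0.5)   # 4 + 0.5 * 18 = 13
```

The biases are left out of the regularization term in this sketch, which follows the description of the penalty as acting on the kernels only.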
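Table 1: Average SSIM/NIQE on the synthetic test images.
| Dataset | Ground truth | Synthetic image | MSRCR | Dong | LIME | SRIE | Ours |
|---|---|---|---|---|---|---|---|
| 2,000 test images | 1/3.67 | 0.74/3.53 | 0.90/3.50 | 0.69/4.16 | 0.84/3.89 | 0.63/3.66 | 0.92/3.46 |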
4 Experiments
In this section, we construct an image dataset and spend about 10 hours training the end-to-end network using the Caffe software package [15]. To evaluate the performance of our method, we use both synthetic test data and public real-world datasets, and compare with four recent state-of-the-art low-light image enhancement methods. We also analyze the running time and evaluate the effect of the hyper-parameters on the final results.
4.1 Image Dataset Generation
In order to learn the parameters of MSR-net, we construct a new image dataset that contains a large number of high-quality (HQ) and low-light (LL) natural images. An important consideration is that all the images should come from real-world scenes. We collect more than 20,000 images from the UCID dataset [26], the BSD dataset [2] and Google image search. Unfortunately, many of these images suffer from significant distortions or contain inappropriate content. Images with obvious distortions, such as heavy compression, strong motion blur, out-of-focus blur, low contrast, underexposure or overexposure and substantial sensor noise, are removed first. After this, we exclude inappropriate images, such as those that are too small or too large and those with cartoon or computer-generated content, to obtain 1,000 better source images. Then, for each image, we use the Photoshop method [1] to find the ideal brightness and contrast settings, and process the images one by one to obtain high-quality (HQ) images with the best visual effect. Finally, each HQ image is used to generate 10 low-light (LL) images by randomly reducing brightness and contrast and applying gamma correction with stochastic parameters. We thus obtain a dataset containing 10,000 pairs of HQ/LL images. Further, 8,000 image pairs in the dataset are randomly selected to generate one million HQ/LL patch pairs for training, and the remaining 2,000 image pairs are used to test the trained network during training (please see the supplemental materials for more details about the dataset generation).
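The degradation step can be sketched as follows. The source does not state the sampling ranges or the exact order of operations, so the ranges for brightness, contrast and gamma below, the order in which they are applied, and the function name are all assumptions made for illustration.

```python
import numpy as np

def synthesize_low_light(hq, rng, n_variants=10):
    """Create darker variants of an HQ image with values in [0, 1].

    Each variant gets a random contrast reduction around the image mean,
    a random brightness scaling and a random gamma > 1 (all ranges assumed).
    """
    variants = []
    for _ in range(n_variants):
        contrast = rng.uniform(0.5, 0.9)
        brightness = rng.uniform(0.3, 0.8)
        gamma = rng.uniform(1.5, 3.0)
        mean = hq.mean()
        ll = (hq - mean) * contrast + mean           # compress contrast
        ll = np.clip(ll * brightness, 0.0, 1.0)      # reduce brightness
        ll = ll ** gamma                             # gamma > 1 darkens mid-tones
        variants.append(ll)
    return variants

rng = np.random.default_rng(0)
hq = rng.uniform(0.2, 0.9, size=(3, 16, 16))         # hypothetical HQ image
variants = synthesize_low_light(hq, rng)             # 10 HQ/LL pairs share this HQ image
```

Applying the function to each of the 1,000 HQ images with `n_variants=10` would give the 10,000 HQ/LL pairs described above.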
4.2 Training Setup
We fix the depth of MSR-net and train it with Adam, using weight decay and a mini-batch size of 64. Starting from the initial learning rate, we divide it by 10 at 100K and 200K iterations, and terminate training at 300K iterations. During our experiments, we found that the network with multi-scale logarithmic transformation performs better than the one with single-scale logarithmic transformation, so we use several logarithmic scales. The sizes of the convolution kernels have been described partially in Section 3.2, and the specific values are shown in Table 3.
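The step schedule described above can be written as a small helper. The initial learning rate is passed in as a parameter (`base_lr`, a hypothetical name), and the function itself is an illustrative sketch rather than the training configuration used by the authors.

```python
def learning_rate(iteration, base_lr, milestones=(100_000, 200_000), factor=0.1):
    """Step schedule: multiply base_lr by `factor` at every milestone already reached."""
    drops = sum(iteration >= m for m in milestones)
    return base_lr * factor ** drops

# Relative learning rate at a few points of the 300K-iteration run.
schedule = [learning_rate(it, base_lr=1.0) for it in (0, 150_000, 250_000)]
```

With `base_lr=1.0` the three entries of `schedule` give the multiplier applied to the initial learning rate in each of the three training phases.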
4.3 Results on synthetic test data
Figure 4 shows a visual comparison for three synthesized low-light images. As we can see, the result of MSRCR [16] looks unnatural, the method proposed by Dong [8] generates unexpected black edges, and the result of SRIE [10] tends to be dark to some extent. LIME [11] produces a result similar to ours, while our method achieves better performance in dark regions.
Since the ground truth is known for the synthetic test data, we use SSIM [31] for quantitative evaluation and NIQE [22] to assess naturalness preservation. A higher SSIM indicates that the enhanced image is closer to the ground truth, while a lower NIQE value represents higher image quality. All the best results are boldfaced. As shown in Table 1, our method achieves a higher average SSIM and a lower average NIQE than the other methods on the 2,000 test images.
4.4 Results on real-world data
Figure 5 shows a visual comparison for three real-world low-light images. As shown in the red rectangles, MSR-net achieves better performance in dark regions. More specifically, in the first and second images we obtain brighter results. In the third image we achieve a more natural result; for instance, the tree is enhanced to a bright green. In addition, for the Garden image in Figure 1, our result has clearer texture and richer details than those of the other methods.
Besides using NIQE to evaluate the image quality, we assess the detail enhancement through the discrete entropy [32]. A higher discrete entropy indicates richer color and clearer outlines. We remove the high-light images and keep only the low-light images of the MEF dataset [21, 33], the NPE dataset [30] and the VV dataset [27, 28] to evaluate our method. As shown in Table 2, MSR-net also obtains a lower NIQE and a higher discrete entropy across these datasets.
Because enhancing real-world low-light images sometimes amplifies noise, we use the denoising algorithm BM3D [6] as a post-processing step. An example is shown in Figure 6, where removing the noise after our deep network further improves the visual quality of a real-world low-light image.
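A common way to compute discrete entropy is the Shannon entropy of the gray-level histogram, which is what the sketch below assumes. Whether the evaluation in the paper uses exactly this variant (8-bit grayscale, base-2 logarithm) is not stated in the text, so treat the details as assumptions.

```python
import numpy as np

def discrete_entropy(gray, levels=256):
    """Shannon entropy (in bits) of the gray-level histogram of an 8-bit image."""
    hist = np.bincount(gray.ravel(), minlength=levels).astype(np.float64)
    p = hist / hist.sum()
    p = p[p > 0]                       # empty bins contribute nothing
    return float(-np.sum(p * np.log2(p)))

flat = np.full((16, 16), 40, dtype=np.uint8)           # a single gray level
ramp = np.arange(256, dtype=np.uint8).reshape(16, 16)  # every gray level used once
entropy_flat = discrete_entropy(flat)
entropy_ramp = discrete_entropy(ramp)
```

The flat image has the lowest possible entropy and the ramp the highest for 256 levels, which matches the reading that a higher value reflects a wider spread of intensities.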
4.5 Color Constancy
In addition to enhancing dark images, our model also does a good job of correcting color. Figure 4 provides some examples. As we can see, our enhanced image is much more similar to the ground truth. To evaluate the performance of the different algorithms, the angular error [14] between the ground truth color $y$ and the model result $\hat{y}$ is used:
$$\varepsilon = \arccos\left( \frac{\langle y, \hat{y} \rangle}{\|y\| \, \|\hat{y}\|} \right)$$
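The angular error is the angle between two color vectors, so it is insensitive to overall brightness and only measures a change in chromaticity. A minimal sketch using only the standard library is given below; the function name and the choice to report degrees are assumptions for illustration.

```python
import math

def angular_error_degrees(a, b):
    """Angle in degrees between two RGB vectors a and b (both must be non-zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    cosine = max(-1.0, min(1.0, dot / (norm_a * norm_b)))   # guard against rounding
    return math.degrees(math.acos(cosine))

same_hue = angular_error_degrees((0.2, 0.4, 0.6), (0.1, 0.2, 0.3))   # brightness change only
orthogonal = angular_error_degrees((1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
```

Scaling a color leaves the error near zero, while two colors with no shared channel energy are 90 degrees apart.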
4.6 Running time on test data
Compared with non-deep methods, our approach processes low-light images efficiently. Table 5 shows the average running time for processing a test image at three different sizes, each averaged over 100 test images. These experiments are run on a PC with Windows 10, 64 GB RAM, a 3.6 GHz CPU and an Nvidia GeForce GTX 1080 GPU. All methods are run in Matlab, which ensures a fair comparison of running time. The four compared methods (MSRCR, Dong, LIME and SRIE) are run on the CPU, while our method is tested on both the CPU and the GPU. Because our method is a purely feedforward process after network training, our approach on the GPU is significantly faster than three of the four compared methods.
4.7 Study of MSR-net Parameters
The number of logarithmic transformation functions and the number of convolutional layers are the two main hyper-parameters of MSR-net. In this subsection, we study the effect of these hyper-parameters on the final results. It has been observed that the benefit of deeper structures for low-level image tasks is not as apparent as in high-level tasks [3, 25]. Specifically, we test several settings of the number and scales of the logarithmic transformation functions, and several settings of the number of convolutional layers. For fairness, all these networks are trained for 100K iterations, and 100 synthetic images are used to measure the result by averaging their SSIM.
As shown in Table 6, adding more hidden layers yields a higher SSIM and better results. We believe that, with an appropriate design to avoid over-fitting, a deeper structure can improve the network's nonlinear capacity and learning ability. In Figure 7, judging from the color of the boy's skin and clothes, the network using multi-scale logarithmic transformation performs better. Multi-scale logarithmic transformation is therefore also important for improving the nonlinear capacity of MSR-net. To obtain good performance within the running time and hardware limits, we chose the number of logarithmic transformation functions and the depth of the convolutional layers used in the experiments above accordingly.
5 Conclusion
In this paper, we propose a novel deep learning approach for low-light image enhancement. We show that multi-scale Retinex is equivalent to a feedforward convolutional neural network with different Gaussian convolution kernels. Based on this, we construct a convolutional neural network (MSR-net) that directly learns an end-to-end mapping between dark and bright images with little extra pre- or post-processing beyond the optimization. Experiments on synthetic and real-world data reveal the advantages of our method over other state-of-the-art methods, both qualitatively and quantitatively. Nevertheless, there are still some problems with this approach. Because of the limited receptive field of our model, very smooth regions such as clear sky are sometimes affected by halo artifacts. Enlarging the receptive field or adding hidden layers may solve this problem.
References
- [1] http://www.photoshopessentials.com/photo-editing/adding-a-brightness-contrast-adjustment-layer-in-photoshop.html.
- [2] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik. Contour detection and hierarchical image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(5):898–916, 2011.
- [3] J. Bruna, P. Sprechmann, and Y. Lecun. Image super-resolution using deep convolutional networks. Computer Science, 2015.
- [4] B. Cai, X. Xu, K. Jia, C. Qing, and D. Tao. Dehazenet: An end-to-end system for single image haze removal. IEEE Transactions on Image Processing, 25(11):5187–5198, 2016.
- [5] T. Celik and T. Tjahjadi. Contextual and variational contrast enhancement. IEEE Transactions on Image Processing, 20(12):3431–3441, 2011.
- [6] K. Dabov, A. Foi, V. Katkovnik, and K. O. Egiazarian. Image restoration by sparse 3d transform-domain collaborative filtering. In Image Processing: Algorithms and Systems, page 681207, 2008.
- [7] C. Dong, C. C. Loy, K. He, and X. Tang. Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(2):295–307, 2016.
- [8] X. Dong, Y. A. Pang, and J. G. Wen. Fast efficient algorithm for enhancement of low lighting video. In ACM SIGGRAPH 2010 Posters, page 69. ACM, 2010.
- [9] X. Fu, J. Huang, D. Z. Y. Huang, X. Ding, and J. Paisley. Removing rain from single images via a deep detail network.
- [10] X. Fu, D. Zeng, Y. Huang, X.-P. Zhang, and X. Ding. A weighted variational model for simultaneous reflectance and illumination estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2782–2790, 2016.
- [11] X. Guo, Y. Li, and H. Ling. Lime: Low-light image enhancement via illumination map estimation. IEEE Transactions on Image Processing, 26(2):982–993, 2017.
- [12] K. He, J. Sun, and X. Tang. Single image haze removal using dark channel prior. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(12):2341–2353, 2011.
- [13] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
- [14] S. D. Hordley and G. D. Finlayson. Re-evaluating colour constancy algorithms. 1(1):76–79, 2004.
- [15] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM International Conference on Multimedia, pages 675–678. ACM, 2014.
- [16] D. J. Jobson, Z.-u. Rahman, and G. A. Woodell. A multiscale retinex for bridging the gap between color images and the human observation of scenes. IEEE Transactions on Image Processing, 6(7):965–976, 1997.
- [17] D. J. Jobson, Z.-u. Rahman, and G. A. Woodell. Properties and performance of a center/surround retinex. IEEE Transactions on Image Processing, 6(3):451–462, 1997.
- [18] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
- [19] E. H. Land. The retinex theory of color vision. Scientific American, 237(6):108–129, 1977.
- [20] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
- [21] K. Ma, K. Zeng, and Z. Wang. Perceptual quality assessment for multi-exposure image fusion. IEEE Transactions on Image Processing, 24(11):3345–3356, 2015.
- [22] A. Mittal, R. Soundararajan, and A. C. Bovik. Making a completely blind image quality analyzer. IEEE Signal Processing Letters, 20(3):209–212, 2013.
- [23] S. Park, S. Yu, B. Moon, S. Ko, and J. Paik. Low-light image enhancement using variational optimization-based retinex model. IEEE Transactions on Consumer Electronics, 63(2):178–184, 2017.
- [24] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.
- [25] W. Ren, S. Liu, H. Zhang, J. Pan, X. Cao, and M. H. Yang. Single image dehazing via multi-scale convolutional neural networks. pages 154–169, 2016.
- [26] G. Schaefer and M. Stich. Ucid: An uncompressed color image database. In Storage and Retrieval Methods and Applications for Multimedia 2004, volume 5307, pages 472–481. International Society for Optics and Photonics, 2003.
- [27] V. Vonikakis, D. Chrysostomou, R. Kouskouridas, and A. Gasteratos. Improving the robustness in feature detection by local contrast enhancement. In Imaging Systems and Techniques (IST), 2012 IEEE International Conference on, pages 158–163. IEEE, 2012.
- [28] V. Vonikakis, D. Chrysostomou, R. Kouskouridas, and A. Gasteratos. A biologically inspired scale-space for illumination invariant feature detection. Measurement Science and Technology, 24(7):074024, 2013.
- [29] L. Wang, W. Ouyang, X. Wang, and H. Lu. Visual tracking with fully convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 3119–3127, 2015.
- [30] S. Wang, J. Zheng, H.-M. Hu, and B. Li. Naturalness preserved enhancement algorithm for non-uniform illumination images. IEEE Transactions on Image Processing, 22(9):3538–3548, 2013.
- [31] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.
- [32] Z. Ye, H. Mohamadian, and Y. Ye. Discrete entropy and relative entropy study on nonlinear clustering of underwater and arial images. In Control Applications, 2007. CCA 2007. IEEE International Conference on, pages 313–318. IEEE, 2007.
- [33] K. Zeng, K. Ma, R. Hassen, and Z. Wang. Perceptual evaluation of multi-exposure image fusion algorithms. In Quality of Multimedia Experience (QoMEX), 2014 Sixth International Workshop on, pages 7–12. IEEE, 2014.