Diving Deeper into Underwater Image Enhancement: A Survey

07/17/2019 ∙ by Saeed Anwar, et al. ∙ CSIRO ∙ Tianjin University

The powerful representation capacity of deep learning has made it inevitable for the underwater image enhancement community to employ its potential. The exploration of deep underwater image enhancement networks is increasing over time, and hence a comprehensive survey is the need of the hour. In this paper, our main aim is two-fold: 1) to provide a comprehensive and in-depth survey of deep learning-based underwater image enhancement, covering various perspectives ranging from algorithms to open issues, and 2) to conduct a qualitative and quantitative comparison of the deep algorithms on diverse datasets to serve as a benchmark, which has barely been explored before. To be specific, we first introduce the underwater image formation models, which are the basis of training data synthesis and deep network design, and are also helpful for understanding the process of underwater image degradation. Then, we review deep underwater image enhancement algorithms and present a glimpse of some aspects of the current networks, including network architecture, network parameters, training data, loss functions, and training configurations. We also summarize the evaluation metrics and underwater image datasets. Following that, a systematic experimental comparison is carried out to analyze the robustness and effectiveness of deep algorithms. Meanwhile, we point out the shortcomings of current benchmark datasets and evaluation metrics. Finally, we discuss several unsolved open issues and suggest possible research directions. We hope that the efforts made in this paper serve as a comprehensive reference for future research and a call for the development of deep learning-based underwater image enhancement.




1 Introduction

‘Sit, be still, and listen.’


Nowadays, developing, exploring, and protecting the ocean's resources have become a strategic focus of the international community. Clear underwater images and videos can provide valuable information about the underwater world, which is essential for numerous engineering and research tasks such as underwater archaeology and underwater surveillance. However, raw underwater images and videos usually suffer from quality degradation, especially the impact of backscatter at far distances. The quality degradation is mainly introduced by the selective absorption and scattering of light in water, as well as the use of artificial light in deep water. Degraded underwater images have low contrast and brightness, color deviations, blurry details, and uneven bright specks, which limit their applications in practical scenarios. As an indispensable processing step, underwater image enhancement methods ranging from conventional techniques (e.g., physical model-based methods and histogram equalization-based methods) to data-driven techniques (e.g., convolutional neural networks and generative adversarial networks) have been attracting increasing attention.

The past few decades have seen the rapid development of deep learning techniques, which have been extensively applied to various computer vision and image processing tasks DL . Deep learning has significantly improved the performance of high-level vision tasks such as object detection Ren2017 and object recognition RedidualNet . Moreover, low-level vision tasks, such as image super-resolution Li2018 and image denoising Zhang2017 , also benefit from the advantages of deep networks and deliver state-of-the-art performance. Unfortunately, such appealing performance has not yet been observed in deep learning-based underwater image enhancement, although many researchers have attempted to apply deep learning techniques to underwater image enhancement.

In this paper, we mainly focus on deep learning methods that enhance and restore underwater images. Through this exposition, we present the latest developments and a comparison of current deep underwater image restoration and enhancement algorithms. Furthermore, we summarize the existing issues, analyze the potential reasons, and suggest future research directions. The main contributions of this paper are two-fold:

  • We summarize the deep learning-based underwater image enhancement algorithms, including network architectures, network parameters, training data, loss function, and training configurations. It provides, to the best of our knowledge, the first comprehensive and in-depth survey for deep learning-based underwater image enhancement, which is helpful for developing more robust and effective deep algorithms.

  • We conduct systematic experiments on diverse datasets to qualitatively and quantitatively compare the deep learning-based underwater image enhancement algorithms. Our evaluation and analysis demonstrate the performance of current deep algorithms, point out their limitations, and indicate the bias of existing benchmark datasets and evaluation metrics. As a consequence, we give potential insights into future research directions in this field of study.

The rest of the paper is organized as follows. Section 2 introduces the background of underwater image enhancement and restoration, mainly focusing on the imaging models. Section 3 presents the existing deep learning-based underwater image enhancement algorithms and insights into the network. Section 4 gives the experimental quantitative and qualitative results and analysis, evaluation metrics, and datasets. Section 5 suggests future research directions, and Section 6 concludes this paper.

2 Background

In this section, we mainly introduce the commonly-used physical models for underwater image enhancement, including the atmospheric scattering model, the simplified underwater image formation model, and the revised underwater image formation model. These models are the basis of training data synthesis and deep network design, and are also helpful for understanding the process of underwater image degradation.

2.1 Atmospheric Scattering Model

For an image captured in a scattering medium, only a part of the reflected light from the scene reaches the imaging sensor due to absorption and scattering effects, which is typical of hazy image formation. Since underwater images usually have a hazy appearance (similar to a hazy image), the atmospheric scattering model Koschmieder1924 is traditionally used to describe the degradation of the underwater image. The atmospheric scattering model Koschmieder1924 can be characterized as:

I(x) = J(x) t(x) + A (1 − t(x)),

where x denotes the pixel coordinates, I(x) is the observed image, J(x) is the haze-free latent image, A is the global atmospheric light which indicates the intensity of the ambient light, and t(x) is the transmission which represents the percentage of the scene radiance reaching the camera. When the haze is homogeneous, t(x) can be further expressed as an exponential decay term:

t(x) = e^{−β d(x)},

where β is the atmospheric attenuation coefficient and d(x) is the distance from the scene to the camera. In this atmospheric scattering model, the scattering is non-selective, and the attenuation is independent of wavelength.
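The synthesis step implied by this model can be sketched in a few lines of numpy. The airlight, attenuation coefficient, and depth map below are illustrative values, not taken from any dataset:

```python
import numpy as np

# Sketch of the atmospheric scattering model I(x) = J(x) t(x) + A (1 - t(x)),
# with homogeneous haze t(x) = exp(-beta * d(x)).
# A, beta, and the depth ramp are made-up values for illustration only.

def hazy_from_clear(J, depth, A=0.8, beta=1.0):
    """Synthesize a hazy observation from a clear image J and a depth map."""
    t = np.exp(-beta * depth)          # transmission, one value per pixel
    t = t[..., None]                   # broadcast over color channels
    return J * t + A * (1.0 - t)

J = np.full((4, 4, 3), 0.5)            # uniform gray "scene radiance"
depth = np.linspace(0.0, 5.0, 16).reshape(4, 4)
I = hazy_from_clear(J, depth)

# At zero distance t = 1, so the observation equals the scene radiance;
# at large distance t -> 0, so the observation approaches the airlight A.
print(I[0, 0, 0], I[-1, -1, 0])
```

Note how the degraded pixel is a convex combination of scene radiance and airlight, which is exactly why distant regions wash out toward a uniform haze color.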

2.2 Simplified Model

In fact, there is a significant difference between the atmospheric scattering model and the real-world underwater image formation model. Real-world underwater imaging is far more complicated due to the optical properties of selective attenuation in water. Thus, in the early stage, most physical model-based methods followed a simplified underwater image formation model provided by Chiang122012 . Denoting the captured underwater image by I_λ(x), the clear latent image (also known as the scene radiance) by J_λ(x), and the homogeneous global background light by B_λ, the degradation model is given as:

I_λ(x) = J_λ(x) t_λ(x) + B_λ (1 − t_λ(x)), λ ∈ {R, G, B},

where λ represents the wavelength of the RGB channels, and x is a point in the underwater scene. Similarly, t_λ(x) is the medium energy ratio, which is the percentage of the scene radiance captured by the camera (the amount of radiance reflected from the point x). This phenomenon causes contrast degradation and color casts. To be precise, t_λ(x) is a function of λ and the distance d(x) to the camera from the scene point x, expressed as:

t_λ(x) = e^{−β(λ) d(x)} = E_λ(x, d(x)) / E_λ(x, 0) = Nrer(λ)^{d(x)},

where β(λ) is the medium attenuation coefficient, which is dependent on the wavelength. Furthermore, E_λ(x, 0) is the energy of light from the submerged scene before it passes through the transmission medium from a distance d(x), while E_λ(x, d(x)) is the strength of light after absorption by the transmission medium. Moreover, Nrer(λ) is the normalized residual energy, which is the ratio of residual energy to the initial energy per unit of distance, and is dependent on the wavelength of light. For example, the bluish tone of most underwater images is due to the fast attenuation of the red wavelength in open water, as it possesses a longer wavelength than the blue and green ones.
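The wavelength-selective behavior above can be demonstrated directly. The per-channel Nrer values below are illustrative (chosen only so that red attenuates fastest and blue slowest, as the text describes), not measured water properties:

```python
import numpy as np

# Sketch of the simplified model: I_lam = J_lam * t_lam + B_lam * (1 - t_lam),
# with per-channel transmission t_lam(x) = Nrer(lam)^{d(x)}.
# NRER and B are made-up values: red decays fastest, blue slowest.

NRER = np.array([0.80, 0.93, 0.97])    # residual energy per meter (R, G, B)
B = np.array([0.1, 0.3, 0.5])          # bluish global background light

def underwater_from_clear(J, depth):
    t = NRER[None, None, :] ** depth[..., None]   # per-channel transmission
    return J * t + B * (1.0 - t)

J = np.full((2, 2, 3), 0.6)            # uniform gray scene radiance
depth = np.array([[0.0, 2.0], [5.0, 10.0]])
I = underwater_from_clear(J, depth)

# The red channel decays fastest, so distant pixels drift toward a
# blue-green cast even though the scene itself is neutral gray.
print(I[1, 1])
```

At depth 0 the observation equals the radiance; at depth 10 the red channel has nearly collapsed onto its background light, reproducing the bluish tone the model predicts.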

2.3 Revised Model

Recent research found that the commonly-used atmospheric scattering model and simplified underwater image formation model ignored some key components in the process of real-world underwater imaging RevisedModel . Specifically, the attenuation coefficient for backscatter strongly depends on the veiling light. Moreover, unlike the absorption in the atmosphere, the absorption in water should not be neglected. Most importantly, the attenuation coefficients for the direct signal and the scattering signal are different.

Based on the findings mentioned above, Akkaynak & Treibitz RevisedModel proposed a revised underwater image formation model, which can be expressed as:

I_c = J_c e^{−β_c^D(v_D) z} + B_c^∞ (1 − e^{−β_c^B(v_B) z}),

where B_c^∞ is the veiling light, β_c^D is the beam attenuation coefficient of the direct transmitted light, β_c^B is that of the backscattered light, and the vectors v_D and v_B represent the coefficient dependencies. To be more specific, v_D = {z, ρ, E, S_c, β} and v_B = {E, S_c, b, β}, where z is the range along the line of sight (LOS), ρ is the reflectance, E is the irradiance, S_c is the sensor spectral response, and b is the beam scattering coefficient. Similar to the simplified model, I_c is the observed underwater image and J_c is the latent clear underwater image. More details can be found in RevisedModel . Moreover, the coefficient associated with the backscatter varies with the sensor, ambient illumination, and water type. Generally, the coefficient of backscatter differs from the coefficient associated with the direct signal.
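The key point of the revised model, that the direct signal and the backscatter decay with different coefficients, can be sketched numerically. All coefficients below are illustrative stand-ins, not measurements, and the dependency vectors v_D and v_B are folded into fixed per-channel constants for simplicity:

```python
import numpy as np

# Sketch of the revised model of Akkaynak & Treibitz:
#   I_c = J_c * exp(-beta_D_c * z) + Binf_c * (1 - exp(-beta_B_c * z)),
# where the direct-signal coefficient beta_D and the backscatter
# coefficient beta_B differ. All values below are made up for illustration.

beta_D = np.array([0.40, 0.12, 0.08])  # attenuation of the direct signal
beta_B = np.array([0.20, 0.17, 0.15])  # attenuation governing backscatter
Binf   = np.array([0.05, 0.25, 0.45])  # veiling light

def revised_model(J, z):
    direct = J * np.exp(-beta_D * z)
    backscatter = Binf * (1.0 - np.exp(-beta_B * z))
    return direct + backscatter

J = np.array([0.7, 0.7, 0.7])          # neutral scene radiance
shallow = revised_model(J, 0.5)
deep = revised_model(J, 20.0)
# With growing range z, the direct term vanishes and the observation
# converges to the veiling light Binf, at a rate set by beta_B, not beta_D.
print(shallow, deep)
```

Collapsing β_D and β_B into a single coefficient, as the simplified model does, would force both terms to decay at the same rate, which is precisely the inaccuracy the revised model removes.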

In summary, the atmospheric scattering model is suitable for underwater scenarios only in some cases, such as shallow water with low backscatter. Compared to the atmospheric scattering model, the simplified underwater image formation model takes the selective attenuation of different wavelengths into consideration, which extends the generalization of this model. However, the simplified underwater image formation model assumes the attenuation coefficients are only properties of the water, which is inaccurate because the attenuation coefficients vary with the sensor, ambient illumination, etc. Besides, the simplified model ignores the fact that the backscattered light has a different attenuation coefficient from the direct light. Thus, a physically accurate model (i.e., the revised underwater image formation model) was proposed, which further completes the modeling of underwater image formation. Nevertheless, such an accurate model has barely received attention due to its complexity. Most deep learning-based underwater image enhancement algorithms still follow the atmospheric scattering model or the simplified underwater image formation model to synthesize their training data and design their network architectures. Inaccurate models tend to result in unreliable, unstable, and inauthentic outputs from deep algorithms.

3 Deep Underwater Image Enhancement Algorithms

Deep underwater image enhancement algorithms can broadly be divided into two main categories, i.e., CNN-based and GAN-based algorithms. The goal of the CNN algorithms is to be faithful to the original underwater image, while the GAN-based algorithms aim to improve the perceptual quality of the images. However, this classification is very coarse; therefore, we categorize the networks based on their architectural differences. In Figure 1, the categorization of deep underwater networks is presented, and in the following sections, we group the methods into different categories based on their essential aspects and provide details for each.

Figure 1: Categorization of Deep Underwater Networks: The organization of deep networks based on their essential aspects.

3.1 Encoder-Decoder models

The following models benefit from the well-known encoder-decoder architecture to advance underwater image enhancement research.

3.1.1 P2P network

Recently, Sun et al. sun2018p2p suggested the use of a pixel-to-pixel (P2P) network to enhance underwater images. The proposed model is a "symmetric" encoder-decoder network similar to REDNet mao2016REDNet . The encoder part is composed of three convolutional layers, while the decoder is made of three deconvolutional layers. A ReLU follows each network element except the last one.

This model is trained on 3359 images collected from a real-world environment. To simulate underwater images, the authors pour 30, 50, and 70 ml of milk into 1m of water to produce low, medium, and high-level degradation, respectively. Finally, out of these, 10,000 images are selected for training and another 2,000 images for testing. Moreover, the input to the network is a cropped patch of 66×66. The loss function is minimized via SGD lecun1998SGD with an initial learning rate of 10.

3.1.2 Uie-Dal

Underwater Image Enhancement using Domain Adversarial Learning (UIE-DAL) uplavikar2019UIE-DAL aims to learn an agnostic model that can enhance any type of underwater image. The backbone architecture of UIE-DAL uplavikar2019UIE-DAL is the well-known encoder-decoder UNET ronneberger2015unet . The novelty of this work is the incorporation of a neural network classifier, named the nuisance classifier, which classifies the latent vector extracted from the encoder.

The authors claim the model to be agnostic because the nuisance classifier is not aware of the underwater type: it receives the latent vector from the encoder, which is agnostic to the features of the underwater types. UIE-DAL uplavikar2019UIE-DAL combines three losses, i.e., a reconstruction loss, the nuisance loss, and the adversarial loss. The training is achieved in two steps: first, only the encoder-decoder structure is trained; then, the nuisance classifier is incorporated into the network.

3.1.3 Ugan

Recently, the Underwater Generative Adversarial Network (UGAN) fabbri2018UGAN was proposed to improve underwater image quality. For the discriminator, UGAN chose WGAN-GP (Wasserstein GAN with gradient penalty) gulrajani2017WGANGP , which enforces a soft constraint on the output with respect to its input via a Lipschitz penalty on the gradient norms, instead of clipping the gradients to some range. The discriminator is fully convolutional and is similar to radford2015unsupervised , except that batch normalization ioffe2015BN is not applied to the weights of the convolutional layers. Furthermore, the discriminator outputs a 32×32 feature matrix similar to PatchGAN PatchGAN2016 . The generator is motivated by CycleGAN CycleGAN2017 , comparable to the encoder-decoder network of UNET ronneberger2015unet . The encoder of UGAN fabbri2018UGAN is composed of convolutional layers having filter sizes of 4×4 with a stride of two, followed by batch normalization ioffe2015BN and leaky ReLU (slope of 0.2). Similarly, the decoder portion consists of deconvolutional layers followed by ReLU nair2010ReLU , except for the last layer, where TanH is used to restrict the distribution between -1 and 1.

The evaluation and training are achieved on subsets of ImageNet deng2009imagenet . Moreover, two types of underwater images are collected, i.e., one set of 6,143 images without distortion and another set of 1,817 images with distortion. Adam kingma2014adam is used as the optimizer with a fixed learning rate of 10 for 100 epochs. The input to the network is 256×256×3, while the loss is a linear combination of a pixel-wise loss and the Earth-Mover or Wasserstein-1 distance.

3.2 Modular designs

Modular or block designs employ the repetition of the same structure, commonly known as a "block" or a "module", to learn the features. These designs have been very successful in computer vision and machine learning tasks. We provide examples of modular or block-based designs for underwater networks below.

3.2.1 Uwcnn

To deal with the low contrast and distorted colors of degraded underwater images, Anwar et al. UWCNN2018 proposed a CNN underwater image enhancement model, called UWCNN. The UWCNN is an end-to-end model trained on synthetic underwater image datasets, which includes three densely connected building blocks. Furthermore, each basic building block consists of three densely connected convolutional layers. After the three chained building blocks, a convolutional layer is used to learn the difference (residual) between the degraded underwater image and its clean counterpart.

To train the UWCNN UWCNN2018 model, the authors use the attenuation coefficients of different water types to synthesize various underwater image datasets according to the underwater image formation model, resulting in ten types of underwater image datasets synthesized from the RGB-D NYU-v2 dataset silberman2012NYU . These datasets simulate open ocean and coastal water types ranging from the clearest to the most turbid. Finally, the authors train ten UWCNN models, one for each of the ten types of underwater images. The parameters of the UWCNN model are learned by jointly optimizing a pixel-wise loss and the SSIM loss. In the entire UWCNN, the kernel sizes and filter numbers are fixed, i.e., 3×3 and 16, respectively. The learning rate is set to 210, and ADAM kingma2014adam is used for optimization in the TensorFlow framework.
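The two structural ideas in UWCNN, dense connections inside each block and a global residual at the end, can be sketched at the shape level. The `conv` helper below is a stand-in (a fixed random channel projection) for a real 3×3 convolution with 16 filters; the layer counts follow the description above but the implementation details are assumptions:

```python
import numpy as np

# Shape-level sketch of a UWCNN-like design: densely connected blocks whose
# features are concatenated along channels, plus a global residual so the
# network learns the difference between degraded input and clean target.
# "conv" is a placeholder for a real learned 3x3 convolution.

def conv(x, out_channels=16):
    # placeholder conv: deterministic random projection over the channel axis
    rng = np.random.default_rng(x.shape[-1])
    kernel = rng.normal(scale=0.01, size=(x.shape[-1], out_channels))
    return x @ kernel

def dense_block(x):
    f1 = conv(x)
    f2 = conv(np.concatenate([x, f1], axis=-1))      # dense connection
    f3 = conv(np.concatenate([x, f1, f2], axis=-1))  # sees all earlier feats
    return np.concatenate([x, f1, f2, f3], axis=-1)

def uwcnn_like(image):
    feats = image
    for _ in range(3):                               # three chained blocks
        feats = dense_block(feats)
    residual = conv(feats, out_channels=3)           # reconstruction layer
    return image + residual                          # global residual

x = np.zeros((8, 8, 3))
y = uwcnn_like(x)
print(y.shape)                                       # output matches input
```

Because the final layer predicts a residual, a zero input maps to a zero output here, illustrating that the network only has to model the degradation difference, not the full image.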

3.2.2 DenseGAN

To enhance underwater images, Guo et al. DenseGAN introduced a multiscale dense block (MSDB) algorithm, namely DenseGAN (the authors term the model UWGAN; however, Li et al. UWGAN2018 proposed a model with the same name earlier, so to avoid confusion, we call it DenseGAN due to its dense connections), which employs dense connections, residual learning, and a multi-scale network for underwater image enhancement.

The generator starts with two convolutional, batch normalization (BN), leaky ReLU (LReLU) sequences, then two MSDB blocks followed by a Deconvolutional-BN-LReLU sequence, while at the end there is a deconvolutional layer and a TanH layer. The network architecture of the DenseGAN generator and MSDB are shown in Figure 2. In each MSDB block, the input features are passed through two different branches, where each branch has kernels with different dilations. The features from each branch are concatenated half-way through the MSDB block and fed again into the respective branches. At the end of the MSDB block, the features are concatenated again and passed through a 1×1 convolutional layer. The discriminator network is similar to PatchGAN PatchGAN2016 ; however, it is composed of five layers with spectral normalization miyato2018spectral . Except for the first and last layers, the discriminator is composed of Convolutional-BN-LReLU sequences.

The first two layers of the generator have 7×7 and 3×3 filter sizes with 64 and 128 feature maps, respectively. The last deconvolutional layer outputs the same number of channels as the input, and the TanH layer keeps the distribution between -1 and 1. Moreover, the slope of the leaky ReLU is fixed at 0.2, and the network is trained in the TensorFlow framework using a learning rate of 10 with a patch size of 256×256×3. ADAM kingma2014adam is used for optimization, and the batch size is set to 32. The losses employed are the GAN loss, a pixel-wise loss, and a gradient loss.

Figure 2: Network architectures: A glimpse of network architectures used for underwater image enhancement using CNNs and GANs. Best viewed with zoom-in on a digital display.

3.3 Multi-branch designs

Multi-branch designs aim either to learn different features of the same input at different levels or to exploit distinct inputs in separate branches. The following are examples of such networks.

3.3.1 UIE-Net

Wang et al. wang2017UIENet presented a deep CNN method for the enhancement of underwater images, namely UIE-Net, which is composed of three subnetworks. The first subnet, called the sharing network (S-Net), is composed of convolutional layers only. S-Net extracts features from the input image, which are then forwarded to the other two subnets, i.e., the branches of the network: the color correction network (CC-Net) and the haze removal network (HR-Net). CC-Net and HR-Net output the color-corrected image and the transmission map, respectively. Both CC-Net and HR-Net have the same network structure, consisting of four convolutional layers followed by a sigmoid activation. The only difference between CC-Net and HR-Net is the number of output channels, i.e., three channels and one channel, respectively.

The S-Net has two convolutional layers and a consistent filter size of 5×5, while the CC-Net and HR-Net have four convolutional layers with filter sizes of 1×1, 3×3, 5×5, and 7×7 to capture contextual information. Figure 2 shows the underlying network architecture of the UIE-Net. The inputs to the network are 32×32 image patches during training, and the network is trained on 210 image patches synthesized from 200 clear images collected from the internet. The initial learning rate is fixed at 510 and is decreased by half at intervals until it reaches 2.510.

The loss employed for learning is a pixel-wise reconstruction loss. Moreover, the authors perform smoothing on the input patches to obtain desirable results. As a last step, guided image filtering He2010Guided is applied to the transmission map to remove artifacts, if any. It is also to be noted that UIE-Net is one of the pioneering works in the deep learning direction.

3.3.2 DUIENet

More recently, Li et al. libenchmark2019 constructed a real-world underwater image enhancement dataset, including 950 underwater images, 890 of which have the corresponding reference images. These potential reference images are produced by 12 image enhancement methods, and the final references are selected by 50 volunteers via majority voting.

Inspired by the fusion-based underwater image enhancement method Aucti2012CVPR , Li et al. libenchmark2019 proposed a gated fusion CNN trained on the constructed dataset for underwater image enhancement, called DUIENet. First, three input versions are generated by applying White Balance, Histogram Equalization, and Gamma Correction algorithms to the raw input image. Then, the DUIENet learns three confidence maps, which determine the most important features remaining in the final result. The DUIENet is a multi-scale FCNN, which consists of 14 convolutional layers, each followed by a ReLU except for the last layer (followed by a Sigmoid). To reduce the color casts and artifacts introduced by the three pre-processing algorithms, three feature transformation units (FTUs) are used in the DUIENet libenchmark2019 . Each FTU includes three stacked multi-scale convolutional layers. The input of each FTU is the corresponding preprocessed underwater image, and its output is the transformed image. At last, the three transformed inputs are multiplied by the three learned confidence maps, and the summation of the three products is the enhanced underwater image.

With the constructed dataset, the authors randomly selected 800 pairs of images to generate the training set. These images are resized to 112×112, and data augmentation is used to obtain seven additional versions of the original 800 pairs of training data. The remaining 90 pairs of images are treated as the testing set. To reduce the artifacts induced by pixel-wise loss functions, the authors minimize the perceptual loss (layer relu5_4 of the pre-trained VGG19 network simonyan2014vgg ).
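The gated fusion step, confidence-weighted summation of the three preprocessed inputs, can be sketched directly. The confidence maps below are made up and simply softmax-normalized per pixel; in DUIENet they are predicted by the network, and the three inputs are stand-ins rather than real white-balance/equalization/gamma outputs:

```python
import numpy as np

# Sketch of DUIENet-style gated fusion: three preprocessed inputs are
# weighted by per-pixel confidence maps and summed into one output.
# All images and confidences here are made-up constants for illustration.

def softmax(x, axis=0):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_fusion(inputs, confidences):
    # inputs: (3, H, W, C); confidences: (3, H, W, 1)
    gates = softmax(confidences, axis=0)   # per-pixel weights summing to 1
    return (inputs * gates).sum(axis=0)

H, W = 4, 4
wb = np.full((H, W, 3), 0.2)   # stand-in for the white-balanced input
he = np.full((H, W, 3), 0.5)   # stand-in for histogram equalization
gc = np.full((H, W, 3), 0.8)   # stand-in for gamma correction
inputs = np.stack([wb, he, gc])
conf = np.zeros((3, H, W, 1))  # equal confidence everywhere
out = gated_fusion(inputs, conf)
print(out[0, 0])               # equal gates -> plain average of the inputs
```

With equal confidences the fusion degenerates to a simple average; the learned maps let the network favor, say, the white-balanced input in color-cast regions and the equalized input in low-contrast regions.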

3.3.3 Fgan

The Fusion generative adversarial network, abbreviated as FGAN LI2019FGAN , takes multiple inputs and passes them through different branches of the same network. In the end, the features are summed before the generator loss is computed. The architecture of FGAN LI2019FGAN is similar to DenseGAN, with slight modifications in the block's architecture. The generator with its fundamental block structure is shown in Figure 2. The discriminator is composed of five convolutional layers employing spectral normalization miyato2018spectral and is similar to PatchGAN PatchGAN2016 .

A batch-mode learning method with a fixed batch size is applied. RGB images of size 256×256 are used as inputs. Further, the learning rate is set to 10. The loss function is a combination of the relativistic GAN loss jolicoeur2018relativistic , an adversarial loss, and a pixel-wise loss.

3.4 Depth-guided networks

A depth map or transmission map plays a vital role in restoring an underwater image, as it relates to the degradation induced by scattering. Therefore, it is a natural choice to predict the depth map or transmission map of the underwater image to improve the performance of enhancement and restoration. We list the depth-guided networks next.

3.4.1 Urcnn

The underwater residual convolutional neural network (URCNN) hou2018URCNN , proposed by Hou et al., aims to learn the transmission map. The URCNN first uses a convolutional layer followed by a ReLU to extract features. Batch normalization and ReLU follow the second convolutional layer, and this pattern is repeated until the reconstruction layer, where only a convolutional layer is employed to output the transmission map. A global skip connection is used to enforce residual learning. The output transmission map is then used to refine the input image.

The network architecture of the URCNN is a modified version of VGG simonyan2014vgg , and the input to the network is a 180×180 transmission map instead of the original image. The underwater images are generated from 1000 randomly selected images of the NYU dataset silberman2012NYU . Furthermore, using random medium attenuation coefficients and background lights, a total of 1800 images are generated for training and 200 images for testing. The initial learning rate is selected to be 10 and is reduced to 10 over 60 epochs. The depth of the network is 25 layers, with each layer having 64 feature maps and a filter size of 3×3. Similar to wang2017UIENet , the loss used for learning is a pixel-wise reconstruction loss.
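Once a transmission map is predicted, as URCNN does, the input image can be refined by inverting the formation model. The sketch below uses the simplified model with an assumed background light and a transmission floor t0 to guard the division; both values are illustrative, not part of any cited method:

```python
import numpy as np

# Restoring the latent image from a predicted transmission map by inverting
#   I(x) = J(x) t(x) + B (1 - t(x))   =>   J(x) = (I(x) - B (1 - t)) / t,
# with a floor t0 to avoid dividing by near-zero transmission.
# B, t0, and the synthetic scene below are made-up illustrative values.

def restore(I, t, B=0.7, t0=0.1):
    t = np.clip(t, t0, 1.0)[..., None]
    return (I - B * (1.0 - t)) / t

J_true = np.full((4, 4, 3), 0.4)                 # ground-truth radiance
t = np.full((4, 4), 0.5)                         # known transmission
I = J_true * t[..., None] + 0.7 * (1.0 - t[..., None])  # degrade
J_hat = restore(I, t)
print(np.max(np.abs(J_hat - J_true)))            # exact inversion here
```

With a perfect transmission map the inversion is exact; in practice the quality of the restored image is bounded by the accuracy of the predicted map, which is why these methods invest a full network in estimating it.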

3.4.2 UIR-Net

Cao et al. cao2018underwater lately developed a deep network for underwater image restoration inspired by classical methods, where the transmission map and the background light are estimated independently. Consequently, two different network architectures are proposed, i.e., the background light network (BL-Net) and the transmission map network (TM-Net); collectively, the network is called UIR-Net cao2018underwater . The BL-Net is simple and consists of five layers: the initial three layers are convolutional with BN and pooling, and the last two layers are fully connected. The output of BL-Net is thresholded to constrain it to the range [0, 1]. The TM-Net is more complicated and is based on eigen2014depth , consisting of two subnets, i.e., a coarse-global subnet and a refine subnet. The coarse-global subnet is made of five convolutional layers, with the first two convolutional layers having pooling and batch normalization, and its last layers are fully connected. The refine subnet has three convolutional layers and an upsampling layer which lies before the final convolutional layer. The output of this network is the depth map, from which the transmission maps are computed. As a last post-processing step, the guided filter He2010Guided is applied to refine the maps further.

The loss for the BL-Net is the Euclidean loss, while that of the TM-Net is a scale-invariant mean squared error (MSE) adopted from Eigen et al. eigen2014depth . Similar to wang2017UIENet , UIR-Net cao2018underwater uses the NYU-v2 dataset silberman2012NYU to generate 12,000 synthetic underwater images using a total of 29 different underwater ambient lights. The BL-Net is initialized randomly, while the TM-Net utilizes the weights of VGG simonyan2014vgg .

3.4.3 WaterGAN

WaterGAN li2018watergan , as the name indicates, is a generative adversarial network which manipulates RGB-D images to simulate underwater images for color correction. The authors present a two-part solution, where the first part in the pipeline is the WaterGAN li2018watergan itself, and the second part is an image restoration network composed of a depth estimation network and a color correction network. The WaterGAN has two components: a generator G and a discriminator D. The input to the generator is a noise vector, which is projected, reshaped, and passed through several convolutional and deconvolutional layers that output a synthetic image. The discriminator distinguishes between real images (from another dataset) and synthetic images (produced by the generator), while the generator aims to create images which the discriminator classifies as real.

The underwater images generated by li2018watergan are passed through an image restoration network. The network is inspired by encoder-decoder architectures, particularly pixel-wise dense learning and SegNet badrinarayanan2015segnet . SegNet uses a non-parametric upsampling layer that benefits from the max-pooling index information in the encoder. Furthermore, the authors incorporate skip layers in the encoder-decoder architecture to compensate for the loss of high frequencies due to the pooling operation.

The authors collect 7,000 images from Michigan's Marine Hydrodynamics Laboratory. Another 6,500 images are collected from Port Royal, Jamaica. Similarly, 6,083 images are gathered from the coral reef system, Australia pizarro2017survey . Besides, four Kinect datasets, i.e., the B3DO janoch2013B3DO , the UW RGB-D lai2014UWRGBD , the NYU silberman2012NYU , and the Microsoft 7-scenes shotton2013Microsoft7scenes , are utilized to form 15,000 underwater images via WaterGAN, out of which 12,000 are used for training and 3,000 for testing. The depth estimation network is trained separately at a fixed learning rate of 10, while the color correction network is initially trained with an input resolution of 128×128 and a learning rate of 10. After that, the authors refine the color correction network with input images of 512×512 resolution, reducing the base learning rate to 10. A pixel-wise loss is utilized for both the depth estimation and color correction networks, and, as a post-processing step, the images are normalized to [0, 1].

3.5 Dual Generator GANs

Dual generator GAN algorithms for underwater image enhancement employ multiple generators to predict the improved image. Currently, the trend is to use two generators with one discriminator or two generators with two discriminators; the aim is either to share features between the generators or to use the prediction of one generator as an input to the other. Examples of dual generator GANs are the following.

3.5.1 Uwgan

Based on GANs GAN2014 , Li et al. UWGAN2018 proposed a weakly supervised color transfer method for underwater image color correction, called UWGAN. The UWGAN model relaxes the need for paired underwater images for training and allows the underwater images to be taken in unknown locations, which benefits from adversarial learning. Following CycleGAN CycleGAN2017 , the UWGAN model adopts a cycle structure which includes a forward network and a backward network to learn the mapping functions between a source domain (i.e., underwater) and a target domain (i.e., air). The purpose of such a cycle structure is to capture the unique characteristics of one image collection and figure out how these characteristics could be translated to the other image collection.

The generators used in the UWGAN UWGAN2018 have the same architecture as Johnson2016 . For the discriminators, the UWGAN uses 70×70 PatchGANs PatchGAN2016 . To train the network, 3,800 underwater images and 3,800 high-quality air images are collected and resized to 256×256. The final loss function is a linear combination of three loss functions: the adversarial loss, the cycle consistency loss, and the SSIM loss. The adversarial loss matches the distribution of generated images with that of the target domain, the cycle consistency loss prevents the learned mappings from contradicting each other, and the SSIM loss preserves the content and structure of the source images.
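The cycle structure described above can be illustrated with a minimal sketch. Note that `G` and `F` below are hypothetical placeholder mappings standing in for the two UWGAN generators, not the actual networks; only the form of the cycle consistency loss follows CycleGAN CycleGAN2017 .

```python
import numpy as np

# Placeholder generators: G maps the underwater domain to the in-air
# domain, F maps back. Real generators are CNNs; simple invertible-ish
# functions suffice to show how the cycle loss is formed.
def G(x):  # underwater -> air (placeholder)
    return np.clip(x * 1.1, 0.0, 1.0)

def F(y):  # air -> underwater (placeholder)
    return np.clip(y / 1.1, 0.0, 1.0)

def cycle_consistency_loss(x, y):
    """L1 penalty that F(G(x)) ~ x and G(F(y)) ~ y, as in CycleGAN."""
    forward = np.mean(np.abs(F(G(x)) - x))   # underwater -> air -> underwater
    backward = np.mean(np.abs(G(F(y)) - y))  # air -> underwater -> air
    return forward + backward

x = np.random.rand(8, 8, 3)  # toy "underwater" image in [0, 1]
y = np.random.rand(8, 8, 3)  # toy "in-air" image
loss = cycle_consistency_loss(x, y)
```

In training, this term is added to the adversarial and SSIM losses; on its own it only enforces that the two mappings are approximately inverse to each other.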

3.5.2 MCycleGAN

To restore underwater images, Lu et al. lu2019MCycleGAN proposed a Multi-Scale Cycle Generative Adversarial Network (MCycleGAN), a variant of the CycleGAN network CycleGAN2017 . The authors incorporate a multiscale SSIM loss into CycleGAN CycleGAN2017 to improve the image restoration task, the aim being to translate the underwater style into the recovered clear style.

As a first step, the dark channel prior (DCP) he2011DCP is used to obtain the transmission map of a turbid underwater image. Additionally, the transmission maps provide depth information in the form of three binary filters. The turbid underwater images are forwarded through the generator network. The turbid and generated clear underwater images are split into R, G, and B channels. The channels are then subjected to sliding windows of different sizes to compute the SSIM loss between the turbid and generated images. Furthermore, the SSIM maps are multiplied with the corresponding filters and added together, which yields the multiscale SSIM map for the final loss computation. As a final step, both the real-world underwater image and the generated one are passed through the discriminator.
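As a rough illustration of the first step, the standard dark channel prior computation he2011DCP can be sketched as follows. This is a generic DCP implementation under assumed parameter values (patch size, background light), not the authors' exact pipeline, and the derivation of the three binary filters from the transmission map is omitted.

```python
import numpy as np

def dark_channel(img, patch=15):
    """Per-pixel minimum over RGB, then a minimum over a local patch."""
    h, w, _ = img.shape
    min_rgb = img.min(axis=2)
    pad = patch // 2
    padded = np.pad(min_rgb, pad, mode='edge')
    dark = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            dark[i, j] = padded[i:i + patch, j:j + patch].min()
    return dark

def transmission(img, atmosphere, omega=0.95, patch=15):
    """t(x) = 1 - omega * dark_channel(I / A); omega keeps a haze trace."""
    normalized = img / atmosphere.reshape(1, 1, 3)
    return 1.0 - omega * dark_channel(normalized, patch)

img = np.random.rand(32, 32, 3)          # toy turbid image in [0, 1]
A = np.array([0.9, 0.95, 1.0])           # assumed background-light estimate
t = transmission(img, A)                 # per-pixel transmission map
```

The transmission map `t` is inversely related to scene distance, which is why thresholding it into binary masks can serve as coarse depth information.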

The generator and discriminator of MCycleGAN lu2019MCycleGAN are inspired by CycleGAN CycleGAN2017 . More specifically, the generator is adapted from the image super-resolution network of Johnson et al. johnson2016perceptual , which consists of nine ResNet blocks with training images of size 256×256, while the discriminator is based on 70×70 PatchGANs isola2017image ; ledig2017photo to differentiate between real and fake image patches. The loss function is a union of the adversarial loss, the cycle-consistency loss, and the multiscale SSIM loss. The dataset is composed of 1,037 turbid underwater images collected from ImageNet deng2009imagenet and Jiao Zhou Bay, of which 837 are retained for training and the remaining 200 are reserved for testing. ADAM kingma2014adam is used as the optimizer with a fixed learning rate of 0.0002 until convergence.

3.5.3 UIE-sGAN

Yu et al. ye2018underwater proposed an underwater image enhancement system using stacked conditional generative adversarial networks, abbreviated as UIE-sGAN. The architecture consists of two subnetworks, i.e., a haze detection subnetwork and a color correction subnetwork. Each subnetwork has a generator and a discriminator, and the color correction subnetwork is stacked on the haze detection subnetwork. For the haze detection subnet, the generator is similar to U-Net ronneberger2015unet , consisting of seven convolutional layers and seven deconvolutional layers, each followed by batch normalization (BN) and leaky ReLU, except the first convolutional layer, where only leaky ReLU is employed, and the last deconvolutional layer, where a TanH nonlinearity is used. The discriminator is made of four convolutional layers, where the initial layer has only leaky ReLU and the subsequent ones have batch normalization and leaky ReLU, followed by a sigmoid layer. The output of the haze detection network is a haze mask. The structure of the color correction subnet is identical to that of the haze detection subnet, except that it takes the haze mask and the RGB image as input and outputs a color-corrected underwater image.

The UIE-sGAN ye2018underwater has three losses, i.e., an adversarial loss for each subnetwork and a consistency loss. Training is accomplished by using WaterGAN li2018watergan to generate underwater images from the NYU-v2 dataset silberman2012NYU . Out of 1,449 images, 1,200 are held for training while the network is evaluated on the remaining ones. The images are resized to 286×286 and then cropped to 256×256, with further data augmentation. The network is optimized using ADAM with the learning rate fixed at 510.
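The stacking described above amounts to concatenating the first generator's haze mask with the RGB input before feeding the second generator. A shape-level sketch, with trivial placeholder functions standing in for the actual subnetworks:

```python
import numpy as np

def haze_detection_g(rgb):
    """Placeholder for the haze detection generator: a 1-channel mask."""
    return rgb.mean(axis=2, keepdims=True)          # shape (H, W, 1)

def color_correction_g(rgb_and_mask):
    """Placeholder for the color correction generator: an RGB image."""
    rgb, mask = rgb_and_mask[..., :3], rgb_and_mask[..., 3:]
    return np.clip(rgb * (1.0 - 0.5 * mask), 0.0, 1.0)

rgb = np.random.rand(256, 256, 3)
mask = haze_detection_g(rgb)                        # stage 1: haze mask
stacked_in = np.concatenate([rgb, mask], axis=2)    # (256, 256, 4) input
out = color_correction_g(stacked_in)                # stage 2: corrected image
```

The point is only the data flow: the second stage is conditioned on both the degraded image and the first stage's prediction, which is what "stacked" means here.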

Methods | Patch Size | Network Depth | Feature Maps | Framework
UIE-Net wang2017UIENet | 32×32 | 7 | 16-20 | -
UIR-Net cao2018underwater | 224×224 | 8 | 96-384 | -
P2P Net sun2018p2p | 66×66 | 6 | 96-384 | Caffe
UIE-sGAN ye2018underwater | 256×256 | 16 | 64-512 | TensorFlow
WaterGAN li2018watergan | 512×512 | 42 | 128-512 | Caffe
UGAN fabbri2018UGAN | 256×256 | 9 | 64-512 | TensorFlow
UWCNN UWCNN2018 | 310×230 | 10 | 32 | TensorFlow
URCNN hou2018URCNN | 180×180 | 25 | 64 | MatConvNet
UWGAN UWGAN2018 | 256×256 | 18 | 64-256 | TensorFlow
DUIENet libenchmark2019 | 112×112 | 8 | 32-128 | TensorFlow
MCycleGAN lu2019MCycleGAN | 256×256 | 24 | 64-128 | TensorFlow
DenseGAN DenseGAN | 256×256 | 10 | 64-512 | TensorFlow
FGAN LI2019FGAN | 256×256 | 8 | 64-256 | TensorFlow
UIE-DAL uplavikar2019UIE-DAL | 256×256 | 27 | 64-512 | -

Table 1: Network Specifics: Essential parameters of underwater image enhancement and restoration networks. The loss functions employed across these networks include adversarial, consistency, Wasserstein, nuisance, relativistic, and gradient losses. The “-” means the information is not available.

3.6 Network Specifics

After reviewing current deep learning-based underwater image enhancement algorithms, we emphasize the different aspects of the above-mentioned deep models. First, we summarize the network specifics of different models in Table 1 and then further analyze network loss, depth, parameters, and input patch size.

Network Loss Network loss plays an integral part in learning the task at hand. Here, we discuss the losses employed in deep underwater image enhancement. The most popular loss functions minimize the per-pixel error between the ground-truth image and the predicted image, commonly known as the ℓ1 and ℓ2 losses. For example, the UIE-Net wang2017UIENet , UIR-Net cao2018underwater , P2P Net sun2018p2p , and URCNN hou2018URCNN rely on a per-pixel loss alone to optimize their networks. Usually, other losses, such as SSIM or gradient losses, are combined with the ones mentioned earlier to improve the performance of the networks, e.g., UWCNN UWCNN2018 . On the other hand, GANs rely on adversarial and perceptual losses to enhance the perceptual quality of the enhanced images, such as DenseGAN DenseGAN , UWGAN UWGAN2018 , etc.
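The per-pixel losses and their typical combination with a structural term can be sketched as below; the weights are illustrative only and are not taken from any of the cited networks.

```python
import numpy as np

def l1_loss(pred, gt):
    """Mean absolute (per-pixel) error."""
    return np.mean(np.abs(pred - gt))

def l2_loss(pred, gt):
    """Mean squared (per-pixel) error."""
    return np.mean((pred - gt) ** 2)

def combined_loss(pred, gt, ssim_value, w_pix=1.0, w_ssim=0.2):
    # Typical pattern in the surveyed networks: a per-pixel fidelity term
    # plus a structural term; (1 - SSIM) shrinks as structure is preserved.
    return w_pix * l1_loss(pred, gt) + w_ssim * (1.0 - ssim_value)

pred = np.full((4, 4), 0.5)
gt = np.full((4, 4), 0.6)
loss = combined_loss(pred, gt, ssim_value=0.9)
```

GAN-based methods add an adversarial term to such a sum, trading some pixel fidelity for perceptual quality, which is consistent with the PSNR gap between CNN and GAN methods reported later.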

Network Depth and Parameters The network depth and the number of parameters are related: the deeper the network, the larger the number of parameters. Unlike image classification he2016deep and other enhancement tasks anwar2019densely , where network depth has increased exponentially and even reaches hundreds of convolutional layers, the underwater image enhancement networks are still very shallow, composed of fewer than 45 layers (the deepest network is WaterGAN li2018watergan with 42 layers); hence they comprise far fewer parameters (as most of the network models are not publicly available, a fair comparison to determine the exact number of parameters is not possible).

Input Patch Size Contrary to other low-level vision tasks, most of the underwater image enhancement algorithms operate on full-size images. The reason may be to incorporate the wavelength-dependent dissipation of the red, green, and blue channels. Furthermore, some algorithms reduce the image to a predefined size, which requires upsampling as a post-processing step, such as MCycleGAN lu2019MCycleGAN , DenseGAN DenseGAN , and UWGAN UWGAN2018 .

Figure 3: Representative images: Three sample images from Haze-line haze-line , ULFID ULFID , and UIEBD libenchmark2019 datasets to show the diversity of the underwater images.

4 Experimental Settings

4.1 Real-world Underwater Image Datasets

Due to the limitations of synthetic underwater image datasets (e.g., inaccurate formation models, hard assumptions, insufficient images, specific scenes, etc.), we mainly introduce the real-world underwater image datasets in this section.

  • Fish4Knowledge Fish4Knowledge is funded by the European Union Seventh Framework Programme for the study of marine ecosystems, and provides a video and fish analysis dataset (about 200 TB in size) (http://groups.inf.ed.ac.uk/f4k/index.html).

  • ULFID: The Underwater Light Field Image Dataset ULFID contains several underwater light field images in pure water and hazy conditions, as well as images taken in the air for reference (https://github.com/kskin/data).

  • MARIS: Marine Autonomous Robotics for InterventionS MARIS aims to advance the development of cooperating AUVs for undersea intervention in the offshore industry, in search-and-rescue tasks, and in various flavors of scientific exploration. This project provides several underwater images and videos captured by an underwater stereo vision system (http://rimlab.ce.unipr.it/Maris.html).

  • Haze-line Dataset: The Haze-line dataset haze-line contains images taken at different locations with varying water properties, showing color charts in the scenes (about 33 GB in size). Moreover, the 3D structure of each scene was calculated based on stereo imaging (http://csms.haifa.ac.il/profiles/tTreibitz/datasets/ambient_forwardlooking/index.html).

  • UIEBD: The Underwater Image Enhancement Benchmark Dataset libenchmark2019 includes 950 real-world underwater images, 890 of which have corresponding reference images, where each reference image is selected from 12 enhanced results. The remaining 60 underwater images, for which satisfactory references could not be obtained, are treated as challenging data. The UIEBD libenchmark2019 covers a large range of image resolutions and spans diverse scene/main object categories (https://li-chongyi.github.io/proj_benchmark.html).

The existing real-world underwater image datasets usually have monotonous content and limited quality degradation types. Moreover, these datasets do not provide corresponding ground-truth images, because it is impractical to simultaneously obtain a degraded underwater image and the ground truth of the same scene. The UIEBD libenchmark2019 provides corresponding reference images, which can be used for full-reference image quality assessment. We conduct quantitative and visual comparisons on this dataset. Besides, to validate the generalization of current deep algorithms, we also present the visual results of different methods on two other datasets, i.e., the Haze-line dataset haze-line and ULFID ULFID . Some representative samples of these three datasets are given in Figure 3.

4.2 Evaluation Metrics

Evaluations performed for underwater image enhancement can be broadly categorized into automatic evaluation metrics and human visual system (HVS) based evaluation. The automatic evaluations are performed using six metrics; four of them, i.e., PSNR, MSE, SSIM wang2003SSIM , and PCQI wang2015PCQI , are also widely used in general image enhancement and restoration, while the other two, i.e., UCIQE yang2015UICQE and UIQM panetta2015UIQM , are specific to underwater image enhancement. Next, to make the article inclusive, we describe all the evaluation metrics and then detail their limitations and reliability. Moreover, we also discuss human visual evaluation and its importance.

4.2.1 Automatic evaluation metrics

  • MSE and PSNR: We begin our discussion with the Mean Square Error (MSE) as the signal measure. The MSE aims to provide a quantitative score that represents the similarity or distortion between two signals. Usually, one of the signals is the original signal, and the other one is recovered from some distortion or contamination. Mathematically, the MSE between the two signals can be expressed as:

    $\mathrm{MSE}(x, y) = \frac{1}{N}\sum_{i=1}^{N}(x_i - y_i)^2,$

    where $x$ and $y$ are two signals, in this case images, $x_i$ and $y_i$ are the pixels at location $i$, and $N$ is the number of pixels. Furthermore, in the image processing literature, the peak signal-to-noise ratio (PSNR) is computed from the MSE as:

    $\mathrm{PSNR} = 10\log_{10}\!\left(\frac{L^2}{\mathrm{MSE}}\right),$

    where $L$ is the dynamic range of image pixel intensities (i.e., 255 for an 8-bit image). The usage of MSE and PSNR has many attractive features, e.g., 1) they are simple, 2) the $\ell_p$ norms are valid distance metrics, 3) they have a clear physical meaning, and 4) they are excellent metrics in the context of optimization. However, these measures implicitly assume that signal fidelity is independent of 1) the spatial relationships among the samples of the original signal, 2) the relationship between the original and the error signal, and 3) the signs of the error signal. Unfortunately, none of these assumptions even roughly holds in the context of measuring the visual perception of image fidelity wang2009MSELoveOrLive . In the next section, we discuss alternatives to these measures.

    UWE Dataset

    Method | PSNR | MSE | SSIM | PCQI | UCIQE | UIQM
    Original | 17.36 | 1768.90 | 0.6168 | 1.1118 | 0.5196 | 1.1571
    MCycleGAN lu2019MCycleGAN | 18.33 | 1132.21 | 0.6138 | 0.4521 | 0.5196 | 1.1471
    URCNN hou2018URCNN | 15.94 | 2195.89 | 0.5972 | 1.0936 | 0.5196 | 1.5332
    UWGAN UWGAN2018 | 16.06 | 1853.70 | 0.2945 | 0.6000 | 0.5921 | 1.1099
    DUIENet libenchmark2019 | 19.29 | 1012.20 | 0.8093 | 0.9844 | 0.5720 | 1.2963
    DenseGAN DenseGAN | 17.56 | 1363.60 | 0.4239 | 0.6697 | 0.6291 | 1.0952
    UWCNN_type-1 UWCNN2018 | 13.03 | 3930.80 | 0.4795 | 1.0310 | 0.4876 | 1.1319
    UWCNN_type-3 UWCNN2018 | 13.58 | 3297.40 | 0.5482 | 1.0146 | 0.4771 | 1.1035
    UWCNN_type-5 UWCNN2018 | 13.29 | 3427.20 | 0.5102 | 0.9223 | 0.4303 | 1.0122
    UWCNN_type-7 UWCNN2018 | 13.30 | 3372.60 | 0.4287 | 0.8693 | 0.4533 | 1.0385
    UWCNN_type-9 UWCNN2018 | 10.58 | 6164.80 | 0.2598 | 0.4958 | 0.3636 | 0.7775
    UWCNN_type-I UWCNN2018 | 15.00 | 2345.00 | 0.5306 | 1.0890 | 0.4954 | 1.1294
    UWCNN_type-II UWCNN2018 | 13.46 | 3654.10 | 0.4509 | 1.0631 | 0.4766 | 1.1048
    UWCNN_type-III UWCNN2018 | 14.24 | 2920.20 | 0.4945 | 1.0486 | 0.4739 | 1.0333
    Table 2: Quantitative results: The best results are highlighted with red color while the blue color represents the second best.
  • SSIM: Another commonly used measure is the Structural SIMilarity (SSIM) index. The main ideas of SSIM were presented by Wang and Bovik wang2002SSIMIdea and formulated in wang2006SSIMimple1 ; wang2004SSIMimpli2 . Let us consider that $x$ and $y$ are patches taken from two different images but at the same location, to be compared against each other. SSIM then takes three measures into account: the similarity of the patch 1) luminance $l(x,y)$, 2) contrast $c(x,y)$, and 3) local structure $s(x,y)$. As pointed out in wang2004SSIMimpli2 , these similarities are expressed and computed using simple statistics and are combined to produce the local SSIM as:

    $\mathrm{SSIM}(x,y) = l(x,y)\, c(x,y)\, s(x,y) = \frac{2\mu_x \mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1} \cdot \frac{2\sigma_x \sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2} \cdot \frac{\sigma_{xy} + C_3}{\sigma_x \sigma_y + C_3},$

    where $\mu_x$ and $\mu_y$ are the means and $\sigma_x$ and $\sigma_y$ the standard deviations of the patches $x$ and $y$, respectively, and $\sigma_{xy}$ is the cross-correlation of the patches after removing their means. The constants $C_1$, $C_2$, and $C_3$ stabilize the terms to avoid near-zero divisions.

  • PCQI: The patch-based contrast quality index (PCQI) wang2015PCQI relies on a patch-based approach rather than on global statistics. The PCQI depends on three independent quantities of an image patch, i.e., mean intensity, signal strength, and structure. Mathematically, PCQI is given by:

    $\mathrm{PCQI}(x,y) = \frac{1}{M}\sum_{j=1}^{M} q_i(x_j, y_j)\, q_c(x_j, y_j)\, q_s(x_j, y_j),$

    where $q_i$ compares mean intensity, $q_s$ determines the structural distortion, $q_c$ measures the contrast change, and $M$ is the number of patches. PCQI is computationally more expensive than the other metrics. Next, we discuss quantitative measures which are more specific to underwater image enhancement.

  • UCIQE: Underwater color image quality evaluation, abbreviated as UCIQE yang2015UICQE , is based on the chroma, contrast, and saturation in the CIELab color space and is defined as:

    $\mathrm{UCIQE} = c_1\,\sigma_c + c_2\,\mathrm{con}_l + c_3\,\mu_s,$

    where $\sigma_c$, $\mathrm{con}_l$, and $\mu_s$ are the standard deviation of chroma, the contrast of luminance, and the mean of saturation, respectively, and $c_1$, $c_2$, and $c_3$ are weighting coefficients. It is to be noted that, for underwater images, human perception correlates well with the variance of chroma.

  • UIQM: UIQM panetta2015UIQM stands for underwater image quality measure and differs from the earlier evaluation metrics. UIQM employs the HVS model only and does not require a reference image; hence, it is a better candidate for the evaluation of underwater images. UIQM depends on three attribute measures of the underwater image: 1) the image colorfulness measure (UICM), 2) the sharpness measure (UISM), and 3) the contrast measure (UIConM). The formulation of UIQM is:

    $\mathrm{UIQM} = c_1\,\mathrm{UICM} + c_2\,\mathrm{UISM} + c_3\,\mathrm{UIConM},$

    where $c_1$, $c_2$, and $c_3$ are application-dependent weights, e.g., more weight should be given to $c_1$ for underwater color correction and to $c_2$ and $c_3$ for increasing visibility in the underwater scene.

4.2.2 Human Visual System

Due to the lack of real ground-truth data, human subjects are used to evaluate the quality of the predicted images in an attempt to incorporate perceptual measures. These human inputs may either be crowd-sourced or come from specialists in different competitions. However, none of these methods has shown a significant advantage over mathematical measures. In other words, mathematically defined measures remain attractive for the following reasons.

  • They are simple to calculate and normally computationally inexpensive.

  • They are independent of distinct individuals and observing conditions.

Furthermore, viewing conditions are thought to play an influential role in human perception of image quality. However, if there are multiple viewing conditions, a method dependent on them may produce different estimations that are inconvenient to use. Moreover, such a method may be specific to each user's observation, and it then becomes the responsibility of each user to determine the viewing conditions and provide them to the measurement system. On the other hand, a method independent of viewing conditions computes a single quantity that gives a general idea of image quality. Besides, the experience of volunteers significantly affects human visual evaluation: volunteers who understand the degrading effects of attenuation and backscatter, and what it looks like when either is improperly corrected, can provide more reliable subjective scores of image quality.

Figure 4: Visual comparison of greenish images: Comparisons of different methods on the greenish underwater samples from UIEBD libenchmark2019 . Here, UWCNN-type-I represents the model trained by synthetic type-I training data.
Figure 5: Qualitative comparisons on bluish images: The results of various CNN-based and GAN-based methods on the sample underwater images from UIEBD libenchmark2019 .
Figure 6: Low and high backscatter images: Challenging images for backscatter removal, selected from the UIEBD libenchmark2019 dataset. The top image shows low backscatter, while the bottom image illustrates high backscatter.
Figure 7: Visual comparisons on Haze-line haze-line : The Haze-line dataset provides an accurate distance based on the stereo. To be fair to the authors of Haze-line haze-line , we have also included the results of the best performer (i.e., Haze-line haze-line , a conventional method) on this dataset.
Figure 8: Images from ULFID ULFID : A challenging dataset where all the methods fail to provide clean results.

4.3 Benchmark Results

The benchmark results for each technique (reported for the methods with source code or executables available, or whose authors agreed to provide results on the dataset) on the UIEBD libenchmark2019 dataset are given in Table 2. The quantitative experiments are conducted on UIEBD libenchmark2019 because, to the best of our knowledge, it is the only dataset which provides corresponding reference images for image quality assessment. Results computed against reference images provide realistic feedback on the quality of enhanced results to some extent. Moreover, in the case of multiple variants of the same algorithm, all the results are reported. We encourage readers to consult the original papers for a detailed analysis of each variant of the same model.

The results are presented via the metrics mentioned earlier. It is to be noted that for PSNR, SSIM, PCQI, UCIQE, and UIQM, higher is better, while for MSE, lower is better. Also, to be fair to all the methods under consideration, we resize the output of networks whose predicted image is a scaled-down version of the underwater input. From Table 2, DUIENet libenchmark2019 performs best among the competitors, while UWCNN UWCNN2018 performs worst due to training on synthesized underwater images which differ from the images in the UIEBD libenchmark2019 . However, it is challenging to state the superiority of one method over the others due to the many factors involved, for example, the number of parameters, the network depth, training images, patch size, number of channels, and loss function. For a fair comparison, most of these determinants should be kept consistent. To further validate the performance of the deep algorithms, we conduct qualitative comparisons on diverse underwater images from different datasets in the next section.

4.4 Qualitative Comparisons

We present the visual results on UIEBD libenchmark2019 , Haze-line haze-line , and ULFID ULFID in Figures 4-8. The ground-truth images for Haze-line haze-line and ULFID ULFID are not available; hence, we furnish only visual results for these two datasets.

  • Greenish tone images: In Figure 4, we present visual comparisons of greenish underwater images from UIEBD libenchmark2019 for the state-of-the-art CNN-based and GAN-based methods. The GAN-based models aim to improve perceptual quality, while the CNN models focus more on the PSNR values of the enhanced images. One can notice that the outputs of GAN methods generally differ in tone from those of CNN methods, as the latter are more faithful to the original underwater image colors. This also contributes to the higher PSNR of the CNN methods compared to the GAN methods, as shown in Table 2. It is to be noted that in Figure 4, we show only one variant of each algorithm due to limited space.

  • Bluish tone images: Figure 5 shows visual comparisons on two bluish images from UIEBD libenchmark2019 , consisting of a ray and statues. The bluish tone is ubiquitous in underwater images and difficult to remove completely with current algorithms. DUIENet libenchmark2019 and UWCNN UWCNN2018 render the best outcomes; however, the results still have a bluish tone, especially at far distances (more severe backscatter). By contrast, UWGAN UWGAN2018 and DenseGAN DenseGAN introduce obvious artificial colors, mainly induced by the shortcomings of their unpaired training data.

  • Low and high backscatter images: Backscatter is a challenging problem in underwater imaging. The leading causes of backscatter are strobes or the internal flash, which light up the particles in the water between the subject and the camera lens. This phenomenon can also be observed behind the subject, lighting up the open water. With a dark background, backscatter is easier to recognize. Here, we present two images with low and high backscatter from libenchmark2019 in Figure 6. The first image in Figure 6 is an example of low backscatter, while the bottom one shows high backscatter. We can visually observe that URCNN hou2018URCNN over-exposes the images, while UWGAN UWGAN2018 creates some artificial colors. In addition, low backscatter is relatively easier to remove than high backscatter. For the high backscatter image, none of the methods produces visually pleasing results, and current methods even introduce annoying artifacts and color casts. It should also be noted that UWCNN UWCNN2018 can produce good results if the model matches the type of water.

  • Haze-line haze-line images: Visual comparisons for underwater images from the Haze-line dataset haze-line are provided in Figure 7. This dataset only provides the depth maps reconstructed from stereo images; no ground-truth images are available for computing the evaluation metrics. The images in this dataset are challenging since most of them have a bluish tone and high backscatter. UWGAN UWGAN2018 and DenseGAN DenseGAN provide visually promising results, but both create false colors, as do the DUIENet libenchmark2019 and MCycleGAN lu2019MCycleGAN networks. It is obvious that all deep algorithms fall behind a conventional method haze-line on this dataset, which mismatches the progress of deep learning in other low-level vision tasks.

  • ULFID ULFID images: As the last example, we show images with severe degradations from ULFID ULFID in Figure 8. The ground-truth images for this dataset are not available; hence, we only present visual results. Although the deep algorithms can remove the greenish tone from the images, all of them fail to furnish clear images and even amplify the noise. This dataset is an excellent example that underwater image enhancement still requires concerted effort, and the noise in underwater images should receive more attention in future studies.

5 Future and Emerging Directions

Underwater image enhancement is a classical research area and has improved considerably in recent years, mainly due to the rapid development of deep learning techniques. The performance still lags in many aspects when compared to other image enhancement tasks such as image super-resolution, deblurring, and dehazing. There is ample room for advancement in underwater image enhancement. In the following paragraphs, we list some potential future directions.

  • Datasets: Underwater image enhancement methods usually employ synthetic images for training due to the lack of representative real-world underwater images and their corresponding ground-truth images. Although a few datasets with underwater images and their references are available, these consist of a limited number of images and are typically used for testing rather than training. A concerted effort in this direction may improve the performance of underwater image enhancement models and also provide realistic feedback on the image quality of results enhanced by different methods.

  • Objective functions and evaluation metrics: Current algorithms predominantly employ objective functions common to general image enhancement. Although these functions produce some favorable results, none of them incorporates the properties of the underwater physical model. Likewise, the evaluation metrics available for underwater images are limited and have failure cases, which keeps the field of underwater image enhancement at a standstill. For example, the visual results shown in Figures 4-8 do not match the quantitative results in Table 2. Therefore, more specialized objective functions and evaluation metrics are required to advance underwater image enhancement research.

  • Prior knowledge: Human perception of a scene depends on extensive domain or prior knowledge. When experts describe image quality, they do not rely solely on the visual content; they also use their domain knowledge. An exciting avenue to explore is augmenting current techniques with prior or domain knowledge wu2016ask . This has increased performance in areas such as visual question answering and would likely help improve underwater image enhancement as well.

  • Unsupervised learning:

    Due to the lack of datasets containing underwater images and their ground truths, many methods generate synthetic data to train their models. Although these models exhibit promising results on synthetic underwater scenes, they fail on real-world underwater images. To deal with the lack of data, a possible research direction is unsupervised learning, together with zero-shot or few-shot learning. This capability may lead to promising results, but the zero-shot problem itself is not trivial. A more realistic scenario would be few-shot learning on the present limited datasets, where the network learns from a few available images. The development of unsupervised learning for this task is an open research problem.

  • Real vs. Synthetic: Existing algorithms use diverse physical (mathematical) models to generate underwater images. The distribution of the generated underwater scenes may not conform to real-world scenes; therefore, models trained on artificially produced datasets lack generalization capability. A more thorough and exhaustive effort is required to generate artificial datasets; one solution may be to use GAN-based networks to transfer the style of real underwater images to simulated scenes. Even though minimal work li2018watergan has been done in this direction, there is still much room for improvement.

6 Conclusion

We presented the first comprehensive literature survey on CNNs and GANs for underwater image enhancement. To the best of our knowledge, we have included all the deep learning-based methods which deal with underwater image enhancement, including those available on arXiv (at the time of submission). Moreover, we reviewed the datasets which can be used for training and testing the algorithms. We also discussed the details of the evaluation metrics along with their limitations. Using all the metrics, we compared the performance on the benchmark dataset. We also presented visual comparisons to illustrate the varying difficulty and the robustness of the algorithms. Finally, we reviewed the limitations and provided future research areas to advance underwater image enhancement.

The deep learning-based underwater image enhancement methods still follow the general development of deep learning, ranging from CNNs to GANs. Most of the current models are modifications of existing network architectures such as the encoder-decoder network and CycleGAN; the significant difference is the training data (i.e., underwater images). Besides, no network architecture or loss function has been designed specifically for underwater image enhancement tasks, which leads to unstable and visually unpleasing results. In most cases, the deep learning-based methods fall behind state-of-the-art conventional methods. More importantly, almost all models use synthetic data for network training, and such synthetic training data limit the generalization of the models. Thus, the development of deep learning-based underwater image enhancement still has a long way to go.

According to our survey, progress in underwater research is hindered by the lack of purpose-built evaluation metrics and large-scale training datasets. The current metrics are borrowed from general image enhancement, while the training datasets are synthetically generated. One approach to developing better evaluation metrics is to incorporate underwater image properties; similarly, more realistic datasets could be created using GANs.
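To make the metric discussion concrete, the sketch below computes a UCIQE-inspired no-reference score in the spirit of Yang and Sowmya yang2015underwater: a weighted sum of chroma variation, luminance contrast, and mean saturation. It uses a cheap opponent-colour chroma instead of CIELab chroma, so it is an illustrative approximation, not the published implementation:

```python
import numpy as np

def uciqe_like(img):
    """UCIQE-inspired no-reference score for an RGB image in [0, 1].

    Combines chroma standard deviation, luminance contrast, and mean
    saturation (as UCIQE does), but with an opponent-colour chroma as a
    stand-in for CIELab chroma -- a sketch, not the published metric.
    """
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    lum = 0.299 * r + 0.587 * g + 0.114 * b           # Rec.601 luminance
    chroma = np.sqrt((r - g) ** 2 + (0.5 * (r + g) - b) ** 2)
    mx, mn = img.max(axis=-1), img.min(axis=-1)
    sat = (mx - mn) / np.maximum(mx, 1e-6)            # HSV-style saturation
    con_l = np.percentile(lum, 99) - np.percentile(lum, 1)
    c1, c2, c3 = 0.4680, 0.2745, 0.2576               # UCIQE's published weights
    return c1 * chroma.std() + c2 * con_l + c3 * sat.mean()

rng = np.random.default_rng(0)
colourful = rng.uniform(size=(64, 64, 3))             # vivid, high-contrast image
bluish = np.full((64, 64, 3), (0.2, 0.4, 0.6))        # flat scene with a colour cast
```

A colourful, high-contrast image scores higher than a flat bluish one, matching the intuition that underwater degradation suppresses chroma and contrast; a purpose-built metric would additionally model wavelength-dependent attenuation and scattering.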


  • (1) Akkaynak, D., Treibitz, T.: A revised underwater image formation model. In: CVPR (2018)
  • (2) Anwar, S., Barnes, N.: Densely residual laplacian super-resolution. arXiv preprint arXiv:1906.12021 (2019)
  • (3) Anwar, S., Li, C., Porikli, F.: Deep underwater image enhancement. arXiv preprint arXiv:1807.03528 (2018)
  • (4) Ancuti, C., Ancuti, C.O., Bekaert, P.: Enhancing underwater images and videos by fusion. In: CVPR (2012)
  • (5) Badrinarayanan, V., Kendall, A., Cipolla, R.: Segnet: A deep convolutional encoder-decoder architecture for image segmentation. TPAMI (2017)
  • (6) Berman, D., Levy, D., Avidan, S., Treibitz, T.: Underwater single image color restoration using haze-lines and a new quantitative dataset. arXiv preprint arXiv:1811.01343 (2018)
  • (7) Boom, B.J., He, J., Palazzo, S., Huang, P.X., Chou, H.M., Lin, F.P., Spampinato, C., Fisher, R.B.: A research tool for long-term and continuous analysis of fish assemblage in coral-reefs using underwater camera footage. In: Ecological Informatics (2014)
  • (8) Cao, K., Peng, Y.T., Cosman, P.C.: Underwater image restoration using deep networks to estimate background light and scene depth. In: SSIAI (2018)
  • (9) Chiang, J., Chen, Y.: Underwater image enhancement by wavelength compensation and dehazing. IEEE Transactions on Image Processing 21(4), 1756–1769 (2012)
  • (10) Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: CVPR (2009)
  • (11) Eigen, D., Puhrsch, C., Fergus, R.: Depth map prediction from a single image using a multi-scale deep network. In: NIPS (2014)
  • (12) Fabbri, C., Islam, M.J., Sattar, J.: Enhancing underwater imagery using generative adversarial networks. arXiv preprint arXiv:1801.04011 (2018)
  • (13) Goodfellow, I., et al.: Generative adversarial nets. In: NIPS (2014)
  • (14) Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., Courville, A.C.: Improved training of wasserstein gans. In: NIPS (2017)
  • (15) Guo, C., Li, C., Guo, J., et al.: Hierarchical features driven residual learning for depth map super-resolution. TIP (2018)
  • (16) Guo, Y., Li, H., Zhuang, P.: Underwater image enhancement using a multiscale dense generative adversarial network. IEEE J. Oceanic. Eng. (2019)
  • (17) He, K., Sun, J., Tang, X.: Single image haze removal using dark channel prior. TPAMI (2011)
  • (18) He, K., Sun, J., Tang, X.: Guided image filtering. TPAMI (2013)
  • (19) He, K., Zhang, X., Ren, S., et al.: Deep residual learning for image recognition. In: CVPR (2016)
  • (20) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)
  • (21) Hou, M., Liu, R., Fan, X., Luo, Z.: Joint residual learning for underwater image enhancement. In: ICIP (2018)
  • (22) Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: ICML (2015)
  • (23) Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. In: CVPR (2017)
  • (24) Janoch, A., Karayev, S., Jia, Y., Barron, J.T., Fritz, M., Saenko, K., Darrell, T.: A category-level 3d object dataset: Putting the kinect to work. In: Consumer depth cameras for computer vision (2013)
  • (25) Johnson, J., Alahi, A., Fei-Fei, L.: Perceptual losses for real-time style transfer and super-resolution. In: ECCV (2016)
  • (26) Johnson, J., Alahi, A., Fei-Fei, L.: Perceptual losses for real-time style transfer and super-resolution. In: ECCV (2016)
  • (27) Jolicoeur-Martineau, A.: The relativistic discriminator: a key element missing from standard gan. arXiv preprint arXiv:1807.00734 (2018)
  • (28) Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. ICLR (2014)
  • (29) Koschmieder, H.: Theorie der horizontalen sichtweite. Beitrage zur Physik der freien Atmosphare (1924)
  • (30) Lai, K., Bo, L., Fox, D.: Unsupervised feature learning for 3d scene labeling. In: ICRA (2014)
  • (31) LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature (2015)
  • (32) LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proceedings of the IEEE (1998)
  • (33) Ledig, C., Wang, Z., Shi, W., Theis, L., Huszar, F., Caballero, J., Cunningham, A., Acosta, A., Aitken, A., Tejani, A., et al.: Photo-realistic single image super-resolution using a generative adversarial network. In: CVPR (2017)
  • (34) Li, C., Guo, C., Guo, J.: Emerging from water: Underwater image color correction based on weakly supervised color transfer. IEEE Signal Processing Letters (2018)
  • (35) Li, C., Guo, C., Ren, W., Cong, R., Hou, J., Kwong, S.: An underwater image enhancement dataset and beyond. arXiv preprint arXiv:1901.05495 (2019)
  • (36) Li, C., Wand, M.: Precomputed real-time texture synthesis with markovian generative adversarial networks. In: ECCV (2016)
  • (37) Li, H., Li, J., Wang, W.: A Fusion Adversarial Underwater Image Enhancement Network with a Public Test Dataset. arXiv e-prints arXiv:1906.06819 (2019)
  • (38) Li, J., Skinner, K.A., Eustice, R.M., Johnson-Roberson, M.: Watergan: Unsupervised generative network to enable real-time color correction of monocular underwater images. IEEE Robotics and Automation Letters (2018)
  • (39) Lu, J., Li, N., Zhang, S., Yu, Z., Zheng, H., Zheng, B.: Multi-scale adversarial network for underwater image restoration. Optics & Laser Technology (2019)
  • (40) Mao, X., Shen, C., Yang, Y.B.: Image restoration using very deep convolutional encoder-decoder networks with symmetric skip connections. In: NIPS (2016)
  • (41) Miyato, T., Kataoka, T., Koyama, M., Yoshida, Y.: Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957 (2018)
  • (42) Nair, V., Hinton, G.E.: Rectified linear units improve restricted boltzmann machines. In: ICML (2010)
  • (43) Oleari, F., Kallasi, F., Rizzini, D.L., Aleotti, J., Caselli, S.: An underwater stereo vision system: from design to deployment and dataset acquisition. In: OCEANS (2015)
  • (44) Panetta, K., Gao, C., Agaian, S.: Human-visual-system-inspired underwater image quality measures. IEEE Journal of Oceanic Engineering (2015)
  • (45) Pizarro, O., Friedman, A., Bryson, M., Williams, S.B., Madin, J.: A simple, fast, and repeatable survey method for underwater visual 3d benthic mapping and monitoring. Ecology and evolution (2017)
  • (46) Radford, A., Metz, L., Chintala, S.: Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434 (2015)
  • (47) Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: Towards real-time object detection with region proposal networks. TPAMI (2017)
  • (48) Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical image computing and computer-assisted intervention (2015)
  • (49) Shotton, J., Glocker, B., Zach, C., Izadi, S., Criminisi, A., Fitzgibbon, A.: Scene coordinate regression forests for camera relocalization in rgb-d images. In: CVPR (2013)
  • (50) Silberman, N., Hoiem, D., Kohli, P., Fergus, R.: Indoor segmentation and support inference from rgbd images. In: ECCV (2012)
  • (51) Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. ICLR (2014)
  • (52) Skinner, K.A., Johnson-Roberson, M.: Underwater image dehazing with a light field camera. In: CVPRW (2017)
  • (53) Sun, X., Liu, L., Li, Q., Dong, J., Lima, E., Yin, R.: Deep pixel to pixel network for underwater image enhancement and restoration. IET Image Processing (2018)
  • (54) Uplavikar, P., Wu, Z., Wang, Z.: All-in-one underwater image enhancement using domain-adversarial learning. arXiv preprint arXiv:1905.13342 (2019)
  • (55) Wang, S., Ma, K., Yeganeh, H., Wang, Z., Lin, W.: A patch-structure representation method for quality assessment of contrast changed images. IEEE Signal Processing Letters (2015)
  • (56) Wang, Y., Zhang, J., Cao, Y., Wang, Z.: A deep cnn method for underwater image enhancement. In: ICIP (2017)
  • (57) Wang, Z., Bovik, A.C.: A universal image quality index. IEEE signal processing letters (2002)
  • (58) Wang, Z., Bovik, A.C.: Modern image quality assessment. Synthesis Lectures on Image, Video, and Multimedia Processing (2006)
  • (59) Wang, Z., Bovik, A.C.: Mean squared error: Love it or leave it? a new look at signal fidelity measures. IEEE signal processing magazine (2009)
  • (60) Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P., et al.: Image quality assessment: from error visibility to structural similarity. TIP (2004)
  • (61) Wang, Z., Simoncelli, E.P., Bovik, A.C.: Multiscale structural similarity for image quality assessment. In: The Thrity-Seventh Asilomar Conference on Signals, Systems & Computers, 2003 (2003)
  • (62) Wu, Q., Wang, P., Shen, C., Dick, A., van den Hengel, A.: Ask me anything: Free-form visual question answering based on knowledge from external sources. In: CVPR (2016)
  • (63) Yang, M., Sowmya, A.: An underwater color image quality evaluation metric. TIP (2015)
  • (64) Ye, X., Xu, H., Ji, X., Xu, R.: Underwater image enhancement using stacked generative adversarial networks. In: Pacific Rim Conference on Multimedia (2018)
  • (65) Zhang, K., Zuo, W., Gu, S.: Learning deep cnn denoiser prior for image restoration. In: CVPR (2017)
  • (66) Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: ICCV (2017)